[Po4a-devel][patch] Making Html.pm (slightly) better

Yves Rutschle debian.anti-spam@rutschle.net
Wed, 24 Nov 2004 17:37:31 +0000


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi all,

It looks like my spare time has shrunk further, hence my
long silence. Martin's last comments, and my spending some
time reading Sgml.pm and Xml.pm, along with running into
harder files from my site, have me almost convinced that
Martin is right and Html.pm is going the wrong way.

Here is the patch to it that I currently use. This brings it
to a state in which it is useful for "simple" files (i.e.
files with simple paragraphs and little in-line formatting),
which I think is still worthwhile for sites that contain a
lot of simple text (how-tos, for example, would be good
candidates if they used HTML as their primary format).

It doesn't change the fundamentals of how the module works,
so all of Martin's objections still hold true. Rather, it
fixes two of the module's shortcomings:
* Text is now split along paragraph boundaries, instead of
  random 512-byte-aligned boundaries,
* title and alt attribute contents of img tags now create
  msgids.
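
To make the second point concrete, here is a rough stand-alone
sketch of the attribute handling, not the module code itself.
translate_msgid() and %catalog are hypothetical stand-ins for
$self->translate() and a PO file, and translate_img_tag() is a
name made up for this illustration; trim() mirrors the patched
helper.

```perl
use strict;
use warnings;

# Collapse every run of whitespace to one space, then strip both ends.
sub trim {
    my $s = shift;
    $s =~ s/\s+/ /g;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Hypothetical catalog standing in for the PO file.
my %catalog = (
    'Site logo'    => 'Logo du site',
    'Back to home' => 'Accueil',
);

# Hypothetical stand-in for $self->translate(): look the msgid up and
# fall back to the original string when no translation exists.
sub translate_msgid {
    my $msgid = shift;
    return exists $catalog{$msgid} ? $catalog{$msgid} : $msgid;
}

# Replace the title and alt values inside the raw tag text with their
# translations. $attrs is the attribute hash of the start tag.
sub translate_img_tag {
    my ($tag_text, $attrs) = @_;
    for my $name (qw(title alt)) {
        next unless defined $attrs->{$name};
        my $msgid = trim($attrs->{$name});
        next if $msgid eq '';            # nothing to translate
        my $translated = translate_msgid($msgid);
        # \Q...\E quotes regex metacharacters in the msgid.
        $tag_text =~ s/\Q$msgid\E/$translated/;
    }
    return $tag_text;
}

my $tag = '<img src="logo.png" alt="Site logo" title="Back to home">';
print translate_img_tag(
    $tag,
    { src => 'logo.png', alt => 'Site logo', title => 'Back to home' },
), "\n";
# prints: <img src="logo.png" alt="Logo du site" title="Accueil">
```

As in the patch, the substitution only matches when the trimmed
value appears verbatim in the tag text, so an attribute value
that spans several lines would be left untranslated.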

I hope to get some time to play with Sgml.pm and Xml.pm soon
(because, let's face it, it's much more fun than actually
doing translations).

Y.


--vtzGhvizbBRQ85DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="html.patch"

Index: lib/Locale/Po4a/Html.pm
===================================================================
RCS file: /cvsroot/po4a/po4a/lib/Locale/Po4a/Html.pm,v
retrieving revision 1.8
diff -u -r1.8 Html.pm
--- lib/Locale/Po4a/Html.pm	27 Aug 2004 10:31:53 -0000	1.8
+++ lib/Locale/Po4a/Html.pm	24 Nov 2004 17:26:36 -0000
@@ -80,6 +80,8 @@
     my ($self,$filename)=@_;
     my $stream = HTML::TokeParser->new($filename)
         || die "Couldn't read HTML file $filename : $!";
+
+    $stream->unbroken_text( [1] );
     
     my @type=();
     NEXT : while (my $token = $stream->get_token) {
@@ -97,14 +99,36 @@
 #  $encoded = HTML::Entities::encode($a);
 #  $decoded = HTML::Entities::decode($a);
 	    #print STDERR $token->[0];
-            $self->pushline( " ".$self->translate($text,
+            $self->pushline( $self->translate($text,
 		                                  "FIXME:0",
 		                                  (scalar @type ? $type[scalar @type-1]: "NOTYPE")
-	                                         )." " );
+	                                         ),
+                             'wrap' => 1
+                             );
             next NEXT;
 	} elsif ($token->[0] eq 'S') {
 	    push @type,$token->[1];
-            $self->pushline( get_tag( $token ) );
+            my $text =  get_tag( $token );
+            if ( $token->[1] eq 'img' ) {
+                my %foo = %{$token->[2]};
+                my $title = (exists $foo{"title"}?$foo{"title"}:"")."\n";
+                my $alt   = (exists $foo{"alt"}?$foo{"alt"}:"")."\n";
+                for my $attr ($title, $alt) {
+                    if (defined $attr) {
+                        $attr = trim($attr);
+                        my $translated = $self->translate( 
+                                              $attr,
+                                              "FIXME:0",
+                                              (scalar @type?
+                                                   $type[scalar @type-1]:
+                                                   "NOTYPE")
+                                              );
+                        $attr = quotemeta $attr;
+                        $text =~ s/$attr/$translated/;
+                    }
+                }
+            }
+            $self->pushline( $text );
         } elsif ($token->[0] eq 'E') {
 	    pop @type;
             $self->pushline( get_tag( $token ) );
@@ -136,9 +160,10 @@
 
 sub trim { 
     my $s=shift;
-    $s =~ s/\n//g;  # remove \n in text
-    $s =~ s/\r//g;  # remove \r in text
-    $s =~ s/\t//g;  # remove tabulations
+    $s =~ s/\n/ /g;  # replace \n with a space
+    $s =~ s/\r/ /g;  # replace \r with a space
+    $s =~ s/\t/ /g;  # replace tabulations with a space
+    $s =~ s/\s+/ /g; # collapse runs of whitespace to one space
     $s =~ s/^\s+//; # remove leading spaces
     $s =~ s/\s+$//; # remove trailing spaces
     return $s;

--vtzGhvizbBRQ85DL--