[Po4a-devel]HTML translating [Was: Administrivia]

Yves Rutschle debian.anti-spam@rutschle.net
Sun, 7 Nov 2004 00:41:32 +0000


On Sun, Nov 07, 2004 at 12:37:57AM +0100, Martin Quinson wrote:
> But later, when I reviewed the code I discovered that the way it did split
> the sentences makes it very hard to use. "I <b>like</b> it" is splitted in 3
> msgid which have to be translated separately ("I" ; "like" and "it"). 

That's not actually the reason I found it useless as is: the
current CVS version slices paragraphs arbitrarily (well, on
512-byte boundaries or something like that), which means that
irrelevant changes in formatting anywhere in the file
fuzzify the entire file.
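
To make the effect concrete, here is a minimal sketch in Python. It is
not the Html.pm code: the 512-character chunk size, the made-up
document and the helper names (fixed_chunks, paragraphs, unchanged) are
all assumptions for illustration. It compares cutting the text at fixed
offsets with cutting it at paragraph boundaries, after one irrelevant
whitespace change near the top of the file.

```python
def fixed_chunks(text, size=512):
    """Cut text every `size` characters, ignoring its structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraphs(text):
    """Cut text on blank lines, so each chunk is one paragraph."""
    return [p for p in text.split("\n\n") if p.strip()]


def unchanged(old_chunks, new_chunks):
    """Count chunks of the new version that already existed in the old one."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)


# Hypothetical document: 20 distinct paragraphs of a bit over 200 characters.
original = "\n\n".join(f"Paragraph {n}: " + "word " * 40 for n in range(20))
# An irrelevant formatting change: one extra space in the first paragraph.
edited = original.replace("Paragraph 0:", "Paragraph 0: ", 1)

fixed_kept = unchanged(fixed_chunks(original), fixed_chunks(edited))
para_kept = unchanged(paragraphs(original), paragraphs(edited))
print(f"fixed-size chunks kept: {fixed_kept} of {len(fixed_chunks(edited))}")
print(f"paragraph chunks kept:  {para_kept} of {len(paragraphs(edited))}")
```

With this input, every fixed-size chunk after the edit is shifted by
one character, so none of them matches an old msgid, while the
paragraph splitter only loses the one paragraph that was touched.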

> See http://po4a.alioth.debian.org/en/po4a.7.html#Why_not_to_split_on_ and
> http://graal.ens-lyon.fr/~mquinson/l10n.html#l2.2 for my point on this. I
> should have explained my point to Laurent before...

Yes, I have actually run into a couple of those problems
myself. While the splitting in 3 as in your above example
is, indeed, a bit confusing, I don't find that makes it
useless, and more to the point, I just don't think there is
any other good solution: the bottom line is, you want to
specify that something in that sentence is important, which
will need to be in a different msgid.

The solution, I find, is to have the translator understand
the structure of the original text so she'll know to
translate:

"It's a <b>blue</b> car" => ("It's a", "blue", "car" )

into:

"C'est une voiture <b>bleue</b>" => ( "C'est une voiture", "bleue", " ") 

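A rough sketch of that tag-level splitting, in Python rather than Perl,
with the standard html.parser module standing in for the real parser.
The class and function names (Tokens, msgids, rebuild) are hypothetical,
and unlike the tuples above this version keeps the spaces around the
pieces so the sentence can be glued back together.

```python
from html.parser import HTMLParser


class Tokens(HTMLParser):
    """Record the input as a flat list of ("tag", raw) and ("text", raw)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source.
        self.tokens.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    parser = Tokens()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end
    return parser.tokens


def msgids(html):
    """Every run of text between tags becomes its own msgid."""
    return [value for kind, value in tokenize(html) if kind == "text"]


def rebuild(html, msgstrs):
    """Put the translated pieces back into the original tag structure."""
    pieces = iter(msgstrs)
    return "".join(next(pieces) if kind == "text" else value
                   for kind, value in tokenize(html))


source = "It's a <b>blue</b> car"
print(msgids(source))
print(rebuild(source, ["C'est une voiture ", "bleue", ""]))
```

The translator has to know that the second piece is the bold one and
that the third may end up empty, which is exactly the structural
knowledge described above.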

I guess an alternative would be to have a list of "small
formatting tags" (bold, italics, etc.) that do not actually
split at all, and appear in the msgid with the onus on the
translator to know enough HTML to know what to do with them.
You'd have something like:

msgid "It's a <b>blue</b> car"
msgstr "C'est une voiture <b>bleue</b>"

That would have the advantage of providing the translator
with context information. In fact that goes a long way
towards your point of splitting at paragraph level :-)
That's actually fairly easily achievable: the list of
paragraph-marking tags is fairly small (<p>, <div>, <h1>
through <h6>, ...) and XHTML Strict makes it mandatory for
text to be included in a block-level element of some sort.
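
A minimal sketch of paragraph-level splitting, again in Python with
html.parser as a stand-in. The BLOCK_TAGS set is an assumption covering
only the tags named above, and ParagraphMsgids and extract are
hypothetical names. Block-level tags delimit msgids; every other tag is
copied into the msgid verbatim.

```python
from html.parser import HTMLParser

# Assumed list of paragraph-marking tags; a real module would need more.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphMsgids(HTMLParser):
    """Emit one msgid per block-level element, inline tags included."""

    def __init__(self):
        super().__init__()
        self.msgids = []
        self.buffer = []

    def flush(self):
        # Collapse runs of whitespace so re-wrapping does not change the msgid.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.msgids.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        else:
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        else:
            self.buffer.append(f"</{tag}>")

    def handle_data(self, data):
        self.buffer.append(data)


def extract(html):
    parser = ParagraphMsgids()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last block tag, if any
    return parser.msgids


page = "<div><h1>My car</h1>\n<p>It's a <b>blue</b>\ncar</p></div>"
print(extract(page))
```

Because whitespace is normalised before the msgid is stored, re-wrapping
a paragraph in the source does not produce a new msgid.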

> Nowadays, this module should probably be reimplemented using Jordi's great
> work on XML-like formats.

I know next to nothing about XML; last time I saw some, I
thought it looked quite different from HTML. A quick read of
Jordi's module makes me think it's mostly an XML parser.
Html.pm relies on Gisle Aas' HTML parser, and it doesn't
seem very beneficial to change parsers just for fun.
Besides, Gisle's parser is supposed to be quite good at
handling broken HTML, which I doubt an XML parser is (then
again, helping bad HTML spread probably isn't good :)
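
As an analogy only (this uses Python's standard library, not Gisle Aas'
Perl module or Jordi's code, and the helper names are made up), the
difference between a tolerant HTML parser and a strict XML parser looks
like this on markup with unclosed tags:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextCollector(HTMLParser):
    """Tolerant parser: just collect whatever text it can find."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def lenient_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text


def strict_parses(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical sloppy HTML: neither <b> nor either <p> is ever closed.
broken = "<p>It's a <b>blue car<p>Next paragraph"

print(lenient_text(broken))
print(strict_parses(broken))
print(strict_parses("<p>It's a <b>blue</b> car</p>"))
```

The HTML parser still hands back all three text runs from the broken
input, whereas the XML parser rejects it outright and only accepts the
well-formed version.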


Y.