[Po4a-devel]HTML translating [Was: Administrivia]

Martin Quinson martin.quinson@imag.fr
Mon, 8 Nov 2004 14:22:42 +0100

On Sun, Nov 07, 2004 at 12:41:32AM +0000, Yves Rutschle wrote:
> On Sun, Nov 07, 2004 at 12:37:57AM +0100, Martin Quinson wrote:
> > But later, when I reviewed the code I discovered that the way it split
> > the sentences makes it very hard to use. "I <b>like</b> it" is split in 3
> > msgids which have to be translated separately ("I", "like" and "it").
> That's not actually the reason I found it useless as is: the
> current CVS version slices paragraphs randomly (well, on 512
> bytes boundaries or something like that), which means that
> irrelevant changes in formatting anywhere in the file
> fuzzify the entire file.

Ouch. Even worse than expected.
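
To illustrate why slicing on byte boundaries fuzzifies everything, here is a minimal Python sketch. It is not po4a code: the function names are made up for the example, and a chunk size of 32 stands in for the "512 bytes or something like that" mentioned above. It compares how many chunks survive a small edit at the top of a file under fixed-size slicing versus paragraph slicing.

```python
def fixed_size_chunks(text, size):
    """Slice the text every `size` characters, ignoring its structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text):
    """Slice the text on blank lines, one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def kept(old_chunks, new_chunks):
    """Count chunks of the new version that already existed in the old one,
    i.e. chunks whose translation would not become fuzzy."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)


# Eight short paragraphs, then a tiny edit at the very beginning.
original = "\n\n".join(
    f"Paragraph number {n} talks about topic {n}." for n in range(1, 9)
)
edited = "Note: " + original

CHUNK_SIZE = 32  # assumption: stands in for the real slice size

fixed_kept = kept(fixed_size_chunks(original, CHUNK_SIZE),
                  fixed_size_chunks(edited, CHUNK_SIZE))
para_kept = kept(paragraph_chunks(original), paragraph_chunks(edited))

print("fixed-size chunks kept:", fixed_kept)
print("paragraph chunks kept:", para_kept)
```

The edit shifts every fixed-size boundary after it, so those chunks no longer match, while paragraph slicing only invalidates the paragraph that was actually touched.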

> > See http://po4a.alioth.debian.org/en/po4a.7.html#Why_not_to_split_on_ and
> > http://graal.ens-lyon.fr/~mquinson/l10n.html#l2.2 for my point on this. I
> > should have explained my point to Laurent before...
> Yes, I have actually run into a couple of those problems
> myself. While the splitting in 3 as in your above example
> is, indeed, a bit confusing, I don't find that makes it
> useless, and more to the point, I just don't think there is
> any other good solution: the bottom line is, you want to
> specify that something in that sentence is important, which
> will need to be in a different msgid.

Ok. I wanted to reply to this message the way it deserves (with a long
argument to back up my point), but I lack the time to do so. I'll be
short. Check the URLs given above for more details.

> The solution, I find, is to have the translator understand
> the structure of the original text so she'll know to
> translate:
> "It's a <b>blue</b> car" => ("It's a", "blue", "car" )
> into:
> "C'est une voiture <b>bleue</b>" => ( "C'est une voiture", "bleue", " " )

And now, add this English sentence to your system: "it's a <b>blue</b> horse".

You then have the following translations (one per line):
it's a -> "c'est un" or "c'est une" depending on the context, since horse is
          masculine in French
blue -> "bleu" or "bleue" (same issue)
car -> voiture
horse -> cheval

And now, add "it's a <b>small</b> car". This time, the issue is that in
French this adjective is placed before the noun, whereas the translation of
"blue" is placed after it.

How will you implement these different translations depending on the context,
and the reordering of sentence elements? My point is that splitting
sentences is *never* a viable solution.
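
The three example sentences can be run through a small Python sketch to make the failure concrete. Both catalogs below are hypothetical and hand-written for this illustration (they are not po4a data structures): one holds a single translation per fragment, as a fragment-per-msgid scheme would, and the other holds one translation per whole sentence.

```python
# Hypothetical fragment catalog: every English chunk is its own msgid,
# so it can carry only ONE French translation.
FRAGMENTS = {
    "it's a": "c'est une",
    "blue": "bleue",
    "small": "petite",
    "car": "voiture",
    "horse": "cheval",
}

# Hypothetical sentence catalog: the translator sees the whole sentence,
# inline tags included, and is free to reorder and to agree genders.
SENTENCES = {
    "it's a <b>blue</b> car": "c'est une voiture <b>bleue</b>",
    "it's a <b>blue</b> horse": "c'est un cheval <b>bleu</b>",
    "it's a <b>small</b> car": "c'est une <b>petite</b> voiture",
}


def translate_by_fragments(before, bold, after):
    """Translate each chunk on its own, then glue them back in source order."""
    return f"{FRAGMENTS[before]} <b>{FRAGMENTS[bold]}</b> {FRAGMENTS[after]}"


def translate_by_sentence(before, bold, after):
    """Look up the complete sentence as a single msgid."""
    return SENTENCES[f"{before} <b>{bold}</b> {after}"]


examples = [
    ("it's a", "blue", "car"),
    ("it's a", "blue", "horse"),
    ("it's a", "small", "car"),
]

for parts in examples:
    print("fragments:", translate_by_fragments(*parts))
    print("sentence: ", translate_by_sentence(*parts))
```

With this fragment catalog only the "small car" sentence comes out right, and only because its word order and gender happen to match the single stored variants. Fixing "blue horse" by changing the stored fragments would break one of the other sentences.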

If you think that such issues are rare and easy to deal with, type
man Locale::Maketext::TPJ13 in a terminal ;)

> I guess an alternative would be to have a list of "small
> formatting tags" ( bold, italics etc) that do not actually
> split at all, and appear in the msgid with the onus on the
> translator to know enough HTML to know what to do with them
> (so you'd have something like:
> msgid "It's a <b>blue</b> car"
> msgstr "C'est une voiture <b>bleue</b>"
> That would have the advantage of providing the translator
> with context information. In fact that goes a long way
> towards your point of splitting at paragraph level :-)
> That's actually fairly easily achievable: the list of
> paragraph-marking tags is fairly small (<p>, <div>,
> <h1,2,3,4,...>) and XHTML makes it mandatory for text to be
> included in a block-level element of some sort.

That's exactly my point, indeed. You should split the translation on a
paragraph boundary because if you take bigger chunks, gettext and po editors
get clumsy. If you take smaller chunks, you run into endless issues about
context changing the meaning of the chunk.

You thus have to show some formatting tags to the translators. We do so in
all other modules. I don't see any better idea.
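
As a rough sketch of the paragraph-level approach discussed above, the following Python code cuts HTML into one msgid per block-level element and leaves inline tags such as <b> inside the msgid. It uses Python's standard html.parser module purely for illustration; it is not the Html.pm or XML module code, the class and function names are invented for the example, and the block tag list is just the one quoted above (<p>, <div>, <h1> and so on).

```python
from html.parser import HTMLParser

# Assumption: the paragraph-marking tags quoted above; a real module would
# need a longer list (li, td, blockquote, ...).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphExtractor(HTMLParser):
    """Collect one msgid per block-level element, inline tags included."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they reach the translator.
        super().__init__(convert_charrefs=False)
        self.msgids = []
        self._buffer = []

    def flush(self):
        """Close the current msgid, collapsing runs of whitespace."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.msgids.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a block tag is a msgid boundary
        else:
            self._buffer.append(self.get_starttag_text())  # keep inline tag

    def handle_startendtag(self, tag, attrs):
        self._buffer.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        else:
            self._buffer.append(f"</{tag}>")

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_entityref(self, name):
        self._buffer.append(f"&{name};")

    def handle_charref(self, name):
        self._buffer.append(f"&#{name};")


def extract_msgids(html):
    """Return the list of paragraph-level msgids found in `html`."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last block tag, if any
    return parser.msgids


sample = (
    "<h1>Cars</h1>\n"
    "<p>It's a <b>blue</b> car.</p>\n"
    "<p>It's a <i>small</i>\n   car &amp; it is fast.</p>"
)
for msgid in extract_msgids(sample):
    print(f'msgid "{msgid}"')
```

The translator then gets "It's a <b>blue</b> car." as a single unit and can move the <b> element wherever the target language needs it.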

> > Nowadays, this module should probably be reimplemented using Jordi's great
> > work on XML-like formats.
> I know next to nothing about XML; last time I saw some, I
> thought it looked quite different from HTML. A quick read of
> Jordi's module makes me think it's mostly an XML parser:
> Html.pm relies on Gisle Aas' HTML parser, and it doesn't
> seem to be very beneficial to change parsers just for fun;

It's not for fun, it's because the XML module does work and is designed to
allow the rapid development of other modules (no new code needed), whereas the
existing HTML module does not work.

Moreover, I'd be pleased to cut a dependency. I hate unjustified
dependencies, but it may be personal.

> besides, Gisle's parser is supposed to be quite good at
> handling broken HTML, which I doubt XML is very good at
> (then again, helping bad HTML spread probably isn't good :)

That's a good argument to stick to this parser, then.

Thanks for your interest in po4a,
