Bug#366992: [debiandoc-sgml-pkgs] Bug#366992: debiandoc-sgml: [INTL:uk] Ukrainian language support

Wed May 17 21:23:58 UTC 2006

On 17 May 2006 at 22:00 +0200, Jens Seidel wrote:
> Hi Eugeniy,
> 
> On Fri, May 12, 2006 at 08:19:50PM +0300, Eugeniy Meshcheryakov wrote:
> > Please, apply attached patch for Ukrainian language support. This patch
> > also contains some fixes needed for UTF-8 support (not ideal, but at
> > least useable).
> > 
> > The patch was made by Borys Yanovych and improved by me.
>  
> first of all I want to thank you and Borys (isn't Boris the latin
> representation?)
Well, his name as written in Ukrainian is Борис, which can be represented
in IPA as [bo'rıs], and the letter 'и' is usually transliterated as 'y' (in
contrast with the Cyrillic letter 'і', which usually becomes Latin 'i').

> for this patch.
>  
> UTF-8 support is really a nice add on, even if it should not be
> necessary to support Ukrainian language as Russian demonstrates and
> considering the fact that Ukrainian shares the same alphabet as Russian
> (except of course the additional i/I character).
...and є/Є, and ї/Ї, and ґ/Ґ. The different characters are not the biggest
problem. Unicode makes it possible to use more characters (like the em-dash
or quotation marks) than an 8-bit encoding (like KOI8-U), and the next
stable Debian release is going to use UTF-8 by default. So I think UTF-8
is a good choice.
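
To illustrate the point about 8-bit encodings, here is a minimal Python
sketch (not part of the patch). It assumes Python's standard `koi8_u`
codec as a stand-in for KOI8-U; the helper name `encodable` is made up
for this example.

```python
# Sketch: check which characters an 8-bit encoding can represent.
# "encodable" is a hypothetical helper, not part of debiandoc-sgml.

UKRAINIAN_LETTERS = "іІєЄїЇґҐ"   # letters Ukrainian has beyond the Russian alphabet
EM_DASH = "\u2014"               # typographic character mentioned above


def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print("Ukrainian letters in KOI8-U:", encodable(UKRAINIAN_LETTERS, "koi8_u"))
    print("Em-dash in KOI8-U:", encodable(EM_DASH, "koi8_u"))
    print("Both in UTF-8:", encodable(UKRAINIAN_LETTERS + EM_DASH, "utf-8"))
```

The extra letters fit into KOI8-U, but the em-dash does not, which is
the practical argument for UTF-8 here.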

> 
> The only problem I could imagine is that SGML will not or wrongly complain about
> invalid characters. I have to check this.
> 
> > diff -urN debiandoc-sgml-1.1.95/sgml/dtd/debiandoc.dcl /home/eugen/borman/devel1/debiandoc-sgml+uk-1.1.95/sgml/dtd/debiandoc.dcl
> > --- debiandoc-sgml-1.1.95/sgml/dtd/debiandoc.dcl	2001-04-18 03:07:10.000000000 +0300
> > +++ /home/eugen/borman/devel1/debiandoc-sgml+uk-1.1.95/sgml/dtd/debiandoc.dcl	2006-05-07 17:43:31.000000000 +0300
> > @@ -15,7 +15,7 @@
> >            32 95 32
> >           127  1 UNUSED
> >  BASESET  "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
> > -DESCSET  128 32 UNUSED
> > +DESCSET  128 32 32
> >           160 96 32
> >  CAPACITY PUBLIC    "ISO 8879:1986//CAPACITY Reference//EN"
> >  SCOPE    DOCUMENT
> > @@ -23,10 +23,7 @@
> >  SHUNCHAR CONTROLS   0   1   2   3   4   5   6   7   8   9
> >                     10  11  12  13  14  15  16  17  18  19
> >                     20  21  22  23  24  25  26  27  28  29
> > -                   30  31                     127 128 129
> > -                  130 131 132 133 134 135 136 137 138 139
> > -                  140 141 142 143 144 145 146 147 148 149
> > -                  150 151 152 153 154 155 156 157 158 159
> > +                   30  31                     127 
> 
> A stupid question from my side, but could you please explain this?
> That's Ardo's code and I'm not familiar with it.
This part of the patch fixes the problem that the SGML processor
complains about bad characters in UTF-8 text (at least text written in
Ukrainian). The line
	DESCSET  128 32 32
says that the 32 characters (octets) with codes starting from 128 are not
UNUSED and correspond to characters in the encoding "ISO Registration Number
100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
starting from number 32. This is incorrect, but the next line is already
incorrect and I do not see another way to tell the SGML processor not to
complain about bad characters (and AFAICS that 'internal' encoding is
only used for converting things like &#XX; to octets).

The second part tells the SGML processor not to ignore characters in the
range 128-159.

So the effect of those two parts is that the SGML processor handles
characters with codes 128-159 as usual (allowed) characters.
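
To show why the range 128-159 matters for UTF-8 input, here is a small
Python sketch (an illustration only, not code from debiandoc-sgml). It
lists the octets of a UTF-8 encoded word that fall into the range the
old declaration marked as UNUSED and listed under SHUNCHAR. The names
`C1_RANGE` and `octets_in_c1_range` are made up for this example.

```python
# Sketch: find UTF-8 octets that land in the 128-159 range, which the
# unpatched debiandoc.dcl declared UNUSED and listed under SHUNCHAR.
# Names here are hypothetical, chosen for illustration.

C1_RANGE = range(128, 160)


def octets_in_c1_range(text):
    """Return the UTF-8 octets of text whose values are in 128-159."""
    return [octet for octet in text.encode("utf-8") if octet in C1_RANGE]


if __name__ == "__main__":
    # "Борис" encodes as D0 91 D0 BE D1 80 D0 B8 D1 81
    print(octets_in_c1_range("Борис"))   # [145, 128, 129]
    print(octets_in_c1_range("Boris"))   # []
```

Three of the ten octets of this single word are in the shunned range,
so almost any Ukrainian UTF-8 text triggers the complaint without the
patch.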

> > diff -urN debiandoc-sgml-1.1.95/tools/lib/Locale/Alias.pm /home/eugen/borman/devel1/debiandoc-sgml+uk-1.1.95/tools/lib/Locale/Alias.pm
> > --- debiandoc-sgml-1.1.95/tools/lib/Locale/Alias.pm	2005-05-26 23:50:38.000000000 +0300
> > +++ /home/eugen/borman/devel1/debiandoc-sgml+uk-1.1.95/tools/lib/Locale/Alias.pm	2006-05-07 14:55:27.000000000 +0300
> > @@ -161,6 +161,10 @@
> >  		   'tr_TR'			=> 'tr_TR.ISO8859-9',
> >  		   'tr_TR.ISO8859-9'		=> 'tr_TR.ISO8859-9',
> >  
> > +		   'uk'				=> 'uk_UA.UTF-8',
> > +		   'uk_UA'			=> 'uk_UA.UTF-8',
> > +		   'uk_UA.UTF-8'		=> 'uk_UA.UTF-8',
> > +		   
> 
> I wonder why you do not add a 8 bit encoding as well, but maybe it should
It can be done, but I do not see a good reason to do this. If someone
needs to have the SGML *source* in another encoding, support for this can
be added later.
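
For reference, the effect of the three new alias entries can be sketched
in Python as below. This is only an illustration of the lookup, not the
actual Perl code in Alias.pm; `LOCALE_ALIASES` and `resolve_locale` are
names made up for this example.

```python
# Sketch: the alias entries from the patch as a plain lookup table.
# This mirrors the three added hash entries; it is not the Alias.pm code.

LOCALE_ALIASES = {
    "uk": "uk_UA.UTF-8",
    "uk_UA": "uk_UA.UTF-8",
    "uk_UA.UTF-8": "uk_UA.UTF-8",
}


def resolve_locale(name):
    """Return the canonical locale for name, or None if it has no alias."""
    return LOCALE_ALIASES.get(name)


if __name__ == "__main__":
    for requested in ("uk", "uk_UA", "uk_UA.UTF-8", "uk_UA.KOI8-U"):
        print(requested, "->", resolve_locale(requested))
```

With only these entries an 8-bit name such as uk_UA.KOI8-U stays
unresolved, which is the case that could be added later if needed.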

> be autogenerated from UTF-8 mode/strings in debiandoc-sgml itself.
> 
> > +	   'after begin document' => '\\renewcommand{\\vpageref}[1]{на стор. \\pageref{#1}}',
> 
> Probably you want to add a non breakable space "~" in front of the number?
> 
Yes, that would be good, and as far as I can see doing so does not break
existing translations.

> > +	   'pdfhyperref' => 'unicode'
> 
> If I remember correctly this is only supported in Acrobat to properly
> show bookmarks. xpdf and other PDF viewer just display garbage
> (independent of the unicode option).
As far as I can see, bookmarks are supported in evince too. There is
also a patch for xpdf, but I did not try it. And you are right, without
this option all viewers will display garbage.

Thanks,
Eugeniy Meshcheryakov