Bug#366992: [debiandoc-sgml-pkgs] Bug#366992: debiandoc-sgml: [INTL:uk] Ukrainian language support

Wed May 17 21:54:32 UTC 2006

On Thu, May 18, 2006 at 12:23:58AM +0300, Eugeniy Meshcheryakov wrote:
> On 17 May 2006 at 22:00 +0200, Jens Seidel wrote:
> > On Fri, May 12, 2006 at 08:19:50PM +0300, Eugeniy Meshcheryakov wrote:
> > UTF-8 support is really a nice add on, even if it should not be
> > necessary to support Ukrainian language as Russian demonstrates and
> > considering the fact that Ukrainian shares the same alphabet as Russian
> > (except of course the additional i/I character).
> ..and є/Є, and ї/Ї, and ґ/Ґ.

Thanks, I didn't know this. Even my Russian colleagues didn't know this IIRC.

> Different characters are not the biggest
> problem. Unicode makes it possible to use more characters (like em-dash
> or quotation marks) than an 8-bit encoding (like KOI8-U), and the next stable
> Debian release is going to be UTF-8 by default. So I think UTF-8 is a
> good choice.

Agreed.

> > The only problem I could imagine is that SGML will either fail to complain or
> > wrongly complain about invalid characters. I have to check this.
> > 
> > > -DESCSET  128 32 UNUSED
> > > +DESCSET  128 32 32
> > > @@ -23,10 +23,7 @@
> > >  SHUNCHAR CONTROLS   0   1   2   3   4   5   6   7   8   9
> > >                     10  11  12  13  14  15  16  17  18  19
> > >                     20  21  22  23  24  25  26  27  28  29
> > > -                   30  31                     127 128 129
> > > -                  130 131 132 133 134 135 136 137 138 139
> > > -                  140 141 142 143 144 145 146 147 148 149
> > > -                  150 151 152 153 154 155 156 157 158 159
> > > +                   30  31                     127 
> > 
> > A stupid question from my side, but could you please explain this?
> > That's Ardo's code and I'm not familiar with it.
> This part of the patch fixes the problem that the SGML processor complains
> about bad characters in UTF-8 text (at least text written in Ukrainian).

Yep.

> The second part tells the SGML processor not to ignore characters in the
> range 128-159.
> 
> So the effect of those two parts is that the SGML processor handles
> characters with codes 128-159 as usual (allowed) characters.

OK. But 0-31 and 127 are still rejected, right?
I assume these numbers do not refer to UTF-8 characters but to single
bytes. This makes UTF-8 characters consisting of two bytes with a second
byte of this range invalid!? Can you confirm this?
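
To make the byte question concrete, here is a minimal Python sketch. It
assumes the SHUNCHAR numbers are plain byte values and models only the lines
visible in the quoted diff; the names SHUNNED_BEFORE, SHUNNED_AFTER and
shunned_bytes are made up for illustration. In UTF-8 every byte of a
multi-byte sequence is 128 or higher, so 0-31 and 127 can only occur as
stand-alone ASCII control characters, while 128-159 do occur as second bytes
of the Ukrainian letters mentioned above.

```python
# Byte values listed under SHUNCHAR, as far as the quoted diff shows them.
# Assumption: these numbers are single byte values, not Unicode code points.
SHUNNED_BEFORE = set(range(0, 32)) | {127} | set(range(128, 160))
SHUNNED_AFTER = set(range(0, 32)) | {127}


def shunned_bytes(text, shunned):
    """Return the sorted byte values in text's UTF-8 encoding that are shunned."""
    return sorted(set(text.encode("utf-8")) & shunned)


# The Ukrainian-specific letters discussed in this thread.
letters = "іІєЄїЇґҐ"

for ch in letters:
    # Each letter encodes to two bytes: a lead byte and a continuation byte.
    print(ch, [hex(b) for b in ch.encode("utf-8")])

print("rejected before the patch:", shunned_bytes(letters, SHUNNED_BEFORE))
print("rejected after the patch: ", shunned_bytes(letters, SHUNNED_AFTER))
```

If the assumption holds, the old declaration rejects the second byte of every
one of these letters, and the patched one rejects none of them.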

On the other hand, these characters are currently not supported at all.

Any reason not to remove 0-31 and 127 as well (except that it would be
accepted in the first byte as well which is bad)?

> > I wonder why you do not add a 8 bit encoding as well, but maybe it should
> It can be done, but I do not see a good reason to do this. If someone needs
> to have the SGML *source* in another encoding, support for this can be added
> later.

Right.

> > > +	   'pdfhyperref' => 'unicode'
> > 
> > If I remember correctly this is only supported in Acrobat to properly
> > show bookmarks. xpdf and other PDF viewers just display garbage
> > (independent of the unicode option).
> As far as I can see, bookmarks are supported in evince too. There is

Thanks, I didn't know this.

> also a patch for xpdf but I did not try it. And you are right, without
> this option all viewers will display garbage.

Jens