Bug#366992: [debiandoc-sgml-pkgs] Bug#366992: debiandoc-sgml: [INTL:uk] Ukrainian language support

Jens Seidel jensseidel at users.sf.net
Thu May 18 07:48:24 UTC 2006


On Thu, May 18, 2006 at 01:37:06AM +0300, Eugeniy Meshcheryakov wrote:
> On 17 May 2006 at 23:54 +0200, Jens Seidel wrote:
> > > > The only problem I can imagine is that the SGML parser will either fail to
> > > > complain or wrongly complain about invalid characters. I have to check this.
> > > > 
> > > > > -DESCSET  128 32 UNUSED
> > > > > +DESCSET  128 32 32
> > > > > @@ -23,10 +23,7 @@
> > > > >  SHUNCHAR CONTROLS   0   1   2   3   4   5   6   7   8   9
> > > > >                     10  11  12  13  14  15  16  17  18  19
> > > > >                     20  21  22  23  24  25  26  27  28  29
> > > > > -                   30  31                     127 128 129
> > > > > -                  130 131 132 133 134 135 136 137 138 139
> > > > > -                  140 141 142 143 144 145 146 147 148 149
> > > > > -                  150 151 152 153 154 155 156 157 158 159
> > > > > +                   30  31                     127 
> > 
> > > The second part tells the SGML processor not to ignore characters in
> > > the range 128-159.
> > > 
> > > So the effect of those two parts is that the SGML processor handles
> > > characters with codes 128-159 as usual (allowed) characters.
> > 
> > OK. But 0-31 and 127 are still rejected, right?
> > I assume these numbers refer not to UTF-8 characters but to single
> > bytes. Would that make two-byte UTF-8 characters whose second byte
> > falls in this range invalid? Can you confirm this?
> > 
> Characters with codes 0x0..0x7f (0..127) are the same as in ASCII, they

I know.

> cannot be found in sequences that correspond to other characters. So if
> they are not currently needed, they are not needed for UTF-8 support either.

You are right (but I referred to bytes in a UTF-8 multibyte character *not*
to characters).
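
To make the byte-level view concrete, here is a minimal Python 3 sketch that
lists the bytes of a UTF-8 encoded string which fall into the range 128-159,
i.e. the range the patch removes from SHUNCHAR. The helper name shunned_bytes
is hypothetical and not part of debiandoc-sgml; the sketch only assumes that
the input document is UTF-8 encoded.

```python
def shunned_bytes(text, low=128, high=159):
    """Return the UTF-8 bytes of `text` whose value lies in [low, high]."""
    return [b for b in text.encode("utf-8") if low <= b <= high]


word = "розділ"
# Show every byte of the encoded word in hex, then the ones in 128-159.
print(" ".join(f"{b:02x}" for b in word.encode("utf-8")))
print(shunned_bytes(word))
```

Any byte reported here would be rejected if 128-159 stayed in SHUNCHAR, so
ordinary Cyrillic text is enough to trigger the problem.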

I assumed in the past that a UTF-8 character is represented by
 * 0xxxxxxx (ASCII, 1 byte only) or
 * 1xxxxxxx xxxxxxxx (not ASCII, two bytes)
and worried about ASCII characters (such as <, > which have a special meaning
in SGML) in the second byte.

But according to the table in http://de.wikipedia.org/wiki/UTF-8 that's wrong:
non-ASCII characters in UTF-8 never contain ASCII bytes, just as you
explained ...

Great!
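
A quick way to convince oneself of this property is the following Python 3
sketch. It is only an illustration (the function name is made up): it encodes
each character separately and collects any byte below 0x80 that occurs inside
a multibyte sequence.

```python
# UTF-8 byte patterns:
#   0xxxxxxx                             (ASCII, 1 byte)
#   110xxxxx 10xxxxxx                    (2 bytes)
#   1110xxxx 10xxxxxx 10xxxxxx           (3 bytes)
#   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  (4 bytes)
# Every byte of a multibyte sequence has its high bit set.

def ascii_bytes_in_multibyte(text):
    """Collect bytes < 0x80 that occur inside multibyte UTF-8 sequences."""
    found = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            found.extend(b for b in encoded if b < 0x80)
    return found


sample = "параграф <p> € 𝄞"
print(ascii_bytes_in_multibyte(sample))  # empty list: no ASCII bytes inside
```

So characters such as < and > can only appear in the byte stream when they are
really meant as < and >.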

I need to test the patch in more detail, but will probably commit it soon.

PS: I wonder why you do not capitalise subsection, paragraph, ...
(розділ, параграф) as you do for chapter, appendix, ... But I'm sure you have
good reasons.

Jens



