Bug#366992: [debiandoc-sgml-pkgs] Bug#366992: debiandoc-sgml: [INTL:uk] Ukrainian language support

Thu May 18 07:48:24 UTC 2006

On Thu, May 18, 2006 at 01:37:06AM +0300, Eugeniy Meshcheryakov wrote:
> On 17 May 2006 at 23:54 +0200, Jens Seidel wrote:
> > > > The only problem I can imagine is that the SGML parser will either fail to
> > > > complain or complain wrongly about invalid characters. I have to check this.
> > > > 
> > > > > -DESCSET  128 32 UNUSED
> > > > > +DESCSET  128 32 32
> > > > > @@ -23,10 +23,7 @@
> > > > >  SHUNCHAR CONTROLS   0   1   2   3   4   5   6   7   8   9
> > > > >                     10  11  12  13  14  15  16  17  18  19
> > > > >                     20  21  22  23  24  25  26  27  28  29
> > > > > -                   30  31                     127 128 129
> > > > > -                  130 131 132 133 134 135 136 137 138 139
> > > > > -                  140 141 142 143 144 145 146 147 148 149
> > > > > -                  150 151 152 153 154 155 156 157 158 159
> > > > > +                   30  31                     127 
> > 
> > > The second part tells the SGML processor not to ignore characters in the
> > > range 128-159.
> > > 
> > > So the effect of those two parts is that the SGML processor handles
> > > characters with codes 128-159 as usual (allowed) characters.
> > 
> > OK. But 0-31 and 127 are still rejected, right?
> > I assume these numbers refer not to UTF-8 characters but to single
> > bytes. That would make any two-byte UTF-8 character whose second byte
> > falls in this range invalid!? Can you confirm this?
> > 
> Characters with codes 0x0..0x7f (0..127) are the same as in ASCII, they

I know.

> cannot be found in sequences that correspond to other characters. So if
> they are not currently needed, they are not needed for UTF-8 support either.

You are right (but I referred to bytes in a UTF-8 multibyte character *not*
to characters).
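
To illustrate the byte-versus-character distinction, here is a small Python
sketch (the helper name `shunned_bytes` is hypothetical and not part of the
patch). Assuming the SHUNCHAR numbers are applied to single bytes, it lists
which bytes of a UTF-8 encoded string fall into the 128-159 range that the old
declaration listed:

```python
def shunned_bytes(text, low=128, high=159):
    """Return the bytes of text's UTF-8 encoding that lie in low..high."""
    return [b for b in text.encode("utf-8") if low <= b <= high]


# One of the Ukrainian strings mentioned in the PS of this mail.
word = "розділ"

# Six characters, but twelve bytes once encoded as UTF-8.
print(len(word), len(word.encode("utf-8")))

# Bytes that would have been hit by the old 128-159 entries.
print([hex(b) for b in shunned_bytes(word)])
```

Plain ASCII text yields an empty list here, so only non-ASCII input is
affected by those entries.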

I assumed in the past that a UTF-8 character is represented by
 * 0xxxxxxx (ASCII, 1 byte only) or
 * 1xxxxxxx xxxxxxxx (not ASCII, two bytes)
and worried about ASCII characters (such as <, > which have a special meaning
in SGML) in the second byte.

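But according to the table in http://de.wikipedia.org/wiki/UTF-8 that's wrong:
non-ASCII characters in UTF-8 are never represented with ASCII bytes, because
every byte of a multibyte sequence has its high bit set (a two-byte character
is 110xxxxx 10xxxxxx). That is just what you explained ...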

Great!
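
A quick way to convince oneself of this is the following Python sketch (the
helper name `has_ascii_byte` is made up for illustration). It checks every
non-ASCII code point in the Basic Multilingual Plane and collects those whose
UTF-8 encoding contains a byte below 0x80:

```python
def has_ascii_byte(ch):
    """True if any byte of ch's UTF-8 encoding is below 0x80."""
    return any(b < 0x80 for b in ch.encode("utf-8"))


# Skip the surrogate range U+D800..U+DFFF, which cannot be encoded as UTF-8.
offenders = [
    cp
    for cp in range(0x80, 0x10000)
    if not 0xD800 <= cp <= 0xDFFF and has_ascii_byte(chr(cp))
]

print(offenders)  # []
```

So bytes such as `<` (0x3C) or `>` (0x3E) cannot appear inside a multibyte
sequence.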

I need to test the patch in more detail, but will probably commit it soon.

PS: I wonder why you do not capitalise subsection, paragraph, ...
(розділ, параграф) as you do for chapter, appendix, ... But I'm sure you have
good reasons.

Jens