[debiandoc-sgml-pkgs] Re: Debiandoc/zh-cn fix + UTF-8 modifications

Danai SAE-HAN ( 韓達耐 ) danai.sae-han at edpnet.be
Thu Apr 12 20:54:02 UTC 2007


Hi!

From: Osamu Aoki <osamu at debian.org>

> I think you are on the right path but you need to be careful not to
> break old behavior too.
> 
> 'charset' in tools/lib/Locale/{SG,XML} uses traditional non-UTF-8
> encodings.  If Japan, EUC-JP, If Western Europe, Latin-1, If Russia,
> KOI-8.  

I see.  So I could just make zh_CN.UTF-8/SGML and change "iso-8859-1"
into "utf-8", right?
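
To make that concrete, here is a rough sketch of how a charset could be
picked from a locale name.  It is only an illustration, not the actual
debiandoc-sgml code: the table, the function name and the "koi8-r"
spelling are my assumptions, based on the encodings listed above.

```python
# Hypothetical table of traditional (non-UTF-8) charsets per locale.
# The entries follow the examples quoted above; "koi8-r" is an assumed
# spelling of the KOI-8 encoding used for Russian.
TRADITIONAL_CHARSETS = {
    "ja": "euc-jp",
    "ru": "koi8-r",
    "zh_CN": "gb2312",
}
DEFAULT_CHARSET = "iso-8859-1"  # Latin-1 fallback for Western Europe


def charset_for_locale(locale_name):
    """Return the charset for a locale such as 'zh_CN' or 'zh_CN.UTF-8'."""
    base, _, codeset = locale_name.partition(".")
    if codeset:
        # An explicit codeset (e.g. ".UTF-8") always wins.
        return codeset.lower()
    # Otherwise try the full name first, then the bare language code.
    language = base.split("_")[0]
    return TRADITIONAL_CHARSETS.get(
        base, TRADITIONAL_CHARSETS.get(language, DEFAULT_CHARSET)
    )


if __name__ == "__main__":
    for name in ("zh_CN.UTF-8", "zh_CN", "ja_JP", "fr"):
        print(name, "->", charset_for_locale(name))
```

With something like this, zh_CN.UTF-8 and zh_CN.GB2312 can coexist,
and a bare zh_CN still falls back to the traditional encoding.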

> This is what we should do.
> 
> We convert all locale specific data to UTF-8 and use them as the base
> data.  
> 
> We also make traditional non-UTF-8 encoding data at the package build
> time to make traditional behavior available.  

Shouldn't be too much of a problem with iconv, but why keep the
traditional encodings for the languages that support UTF-8?
If one needs the GB2312 version, just reencode the files in zh-cn and
use zh_CN.GB2312.  In the modifications that I have (locally), I can
use both zh_CN.UTF-8 and zh_CN.GB2312; I only need to reencode the
contents of zh-cn with iconv.
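
The re-encoding step is roughly what the sketch below does.  It uses
Python's built-in codecs instead of iconv, and the function name and
directory layout are made up for the example; treat it as an
illustration, not as part of my patch.

```python
from pathlib import Path


def reencode_tree(src_dir, dst_dir, src_enc="utf-8", dst_enc="gb2312"):
    """Copy every file under src_dir into dst_dir, converting its encoding.

    Raises UnicodeError if a file is not valid src_enc or contains a
    character that dst_enc cannot represent.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    converted = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Decode with the source encoding, then write the target bytes.
        text = src.read_bytes().decode(src_enc)
        dst.write_bytes(text.encode(dst_enc))
        converted.append(dst)
    return converted
```

For example, reencode_tree("zh-cn", "zh_CN.GB2312") would produce a
GB2312 copy of the UTF-8 files (both directory names are placeholders).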

Three files in qref need changing as well: default.ent and
bin/getdocdate (re-encode the zh-cn part back to GB2312), and
bin/getlocale (s/zh-cn/zh_CN.GB2312/ instead of zh_CN.UTF-8).

Perhaps we could just add this info in README.Debian for those who
still want the traditional encoding?

> By using new script option (e.g. -u) or specifying full locale name with
> ".utf-8", this script should accept utf-8 encoded data.  Oh, html
> generation code needs to be switchable too.

Hmmm, I don't really like this idea.  Why should we keep non-UTF-8
data for the languages that are supported?  Take zh_CN: I have UTF-8
support working, so I see no reason to keep GB2312-encoded files.

> Another easier and safer approach is to create new UTF-8 version of
> debiandoc-sgml (say, debiandoc-sgml-utf8 package conflicting with
> debiandoc-sgml).  Simply use encoding change.  Fix html header and latex
> code generation. This is more like what you are thinking.

Indeed, but not by creating a new package.  Let's just switch to UTF-8
for the languages that are already supported.

I could add support for many more languages, but I need DFSG-free TTF
fonts first.  Once I have those, latex-cjk will support just about
every language (with possible exceptions for scripts with difficult
ligatures, such as Arabic or Indic scripts).

> Once you are successful, start filing all debiandoc-sgml depending
> packages to start using new utf-8 version while converting source text
> to UTF-8
> 
> I was thinking first option but that may be too complicated.  Your
> thought may be good for migration since we still have old package for a
> while.

Perhaps it's best if I uploaded my patch so you could have a look, and
see if it breaks things.

I'll make a patch file, so you can patch it locally.  And if it's
okay, then I'll upload it to CVS.


Best regards



Danai SAE-HAN
韓達耐

-- 
題目:《偶題》
作者:張耒(1052-1112)

相逢記得畫橋頭,花似精神柳似柔。
莫謂無情即無語,春風傳意水傳愁。


