Bug#496266: UTF-8 string characters not properly recognized

Tue Sep 2 22:12:21 UTC 2008

2008-09-02 (Tue), 13:19 -0500, Adam Majer:
> Christian Perrier wrote:
> >> On Saturday, 23 August 2008 at 19:59 -0500, Adam Majer wrote:
> >>> Package: gedit
> >>> Version: 2.22.3-1
> >>> Severity: normal
> >>>
> >>> The following UTF-8 string is not correctly handled in gedit,
> >>>
> >>> const char *unicode_insert = "ྡЭ";
> >>>
> >>> The " and the ྡ characters are viewed as one character, making the
> >>> entire thing next to impossible to copy/paste/edit.
> >> Looks like an issue in pango, since it is not specific to gedit.
> >>
> >> Such things seem to happen a lot when using Tibetan characters, so this
> >> may or may not be intentional. I’d prefer to have the input of someone
> >> who uses them. Is there anyone on debian-i18n who’s more knowledgeable
> >> about Tibetan glyphs?
> > 
> > 
> > Adding Pema Geyleg and Tenzin Dendup, our fellow Dzongkha translation
> > coordinators, who certainly have skills about Tibetan-family scripts
> > (Dzongkha is one of these) and could maybe point you to people with
> > needed knowledge.
> 
> 
> I'm sorry, but aren't we missing the entire point here? This is not
> about bad handling of some Tibetan characters. It is about bad handling
> of 3-byte UTF-8 characters.
> 
> http://en.wikipedia.org/wiki/UTF-8
> 
> So, the following characters should have the same problems,
> 
> "ऄक
> 
> "ঈউঊ
> 
> "ਜਗਏ
> 
> "ଜଁଂ
> 
> "ஔ
> 
> "ంఁః
> 
> "ಂಖ
> 
> "ഈഃ
> 
> etc..
> 
> 
> I've put an ASCII " in front of each group of characters. In emacs,
> I'm able to select the " in front of these characters and copy it. vim
> under a UTF-8 gnome terminal also allows the " to be selected. On the
> second-to-last line above (using icedove), I can't select the " on its
> own, but I can select the " and ಂ together and then remove the 2nd
> character.
> 
> Maybe it is just my misunderstanding of UTF-8, I'm not sure. But at 
> least my expected behaviour was being able to select 1 UTF-8 character 
> at a time, even if linguistically it does not make any sense.

The Tibetan code point in this case, U+0FA1 (TIBETAN SUBJOINED LETTER DA),
is NOT a character by itself. It is a combining code point that joins
with other Tibetan code points to form a Tibetan character. Unicode code
points do not necessarily represent characters. Selecting a combined
character as a whole is more expected than selecting its sub-parts (even
when the latter is possible).
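
As a rough illustration of the point about code points versus characters,
the following Python sketch prints, for a few code points from this
thread, the UTF-8 byte length and the Unicode general category. The
helper name describe() is made up for this example. It only reads the
Unicode character database shipped with Python and says nothing about
what Pango does internally.

```python
import unicodedata

def describe(ch):
    """Return (code point, UTF-8 byte length, general category, name)."""
    return (
        f"U+{ord(ch):04X}",
        len(ch.encode("utf-8")),           # bytes needed in UTF-8
        unicodedata.category(ch),          # e.g. "Lu" letter, "Mn" nonspacing mark
        unicodedata.name(ch, "<unnamed>"),
    )

# The quote mark, the Cyrillic letter and the Tibetan code point
# from the bug report.
for ch in ['"', "\u042d", "\u0fa1"]:
    print(describe(ch))
```

U+0FA1 is reported with category "Mn" (nonspacing mark), while the quote
mark and the Cyrillic letter are not marks. The byte length is a property
of the encoding only and does not decide whether a code point combines.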

So this issue is about how Unicode combining sequences are handled, not
about 3-byte UTF-8 encoding. In this case, Pango interprets a quote mark
(") followed by U+0FA1 (an invalid combination in Tibetan) as one combined
character. I'm not sure whether that is defined behavior.
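
To make the grouping concrete, here is a minimal Python sketch that
attaches every combining mark (general category starting with "M") to the
code point before it. This is a deliberate simplification in the spirit
of the grapheme cluster rules in Unicode UAX #29, not Pango's actual
algorithm, and the function name simple_clusters() is hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified rule: a code point whose general category starts with "M"
    is appended to the previous cluster; anything else starts a new one.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Quote mark, U+0FA1, Cyrillic E, closing quote mark, as in the bug report.
sample = '"\u0fa1\u042d"'
for cluster in simple_clusters(sample):
    print([f"U+{ord(c):04X}" for c in cluster])
```

Under this simplified rule the opening quote mark and U+0FA1 end up in
the same cluster, which matches the selection behavior described in the
report. Whether a real text layout engine should treat such a sequence
this way is exactly the open question.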

-- 
Changwoo Ryu <cwryu at debian.org>