UTF-8 and ispell

Fri Sep 21 07:20:17 UTC 2007

[Cc'ing to Paul.  Paul: I know you are subscribed to this ML, but I wanted
to be sure you will see my question at the end of this post.]

* G. Milde <milde at users.sourceforge.net> [2007-09-20 11:31]:

> It should look for utf8 in the aff files and add a line like::
> 
>   deutsch (Old German UTF-8)
> 
> to ispell-dicts-list.txt for every dictionary providing 'altstringtype "utf8"'
> 
> jed-ispell-dicts.sl should then contain something like ::
> 
>   ispell_add_dictionary (
>     "german-old-tex",
>     "ogerman",
>     "\"",
>     "[']",
>     "~tex",
>     "-C -d ogerman");
>   
>   if (_slang_utf8_ok) {
>     ispell_add_dictionary (
>       "german-old-utf8",
>       "ogerman",
>       "Ã„Ã–ÃœÃ¤Ã¶ÃŸÃ¼",
>       "[']",
>       "~utf8",
>       "-C -d ogerman");
>   } else {
>     ispell_add_dictionary (
>       "german-old8",
>       "ogerman",
>       "ÄÖÜäößü",
>       "[']",
>       "~latin1",
>       "-C -d ogerman");
>   }
>   
> so that the correct argument is passed to ispell.
> 
> This works now in both, UTF8 and latin1 enabled jed.
> 
> (I did not check how this could be done and how it fits in the
> dictionaries-common policy.)  

Actually, my mental model of how the whole thing works was wrong.  The
jed-ispell-dicts.sl file is generated automatically by dictionaries-common
when package i<language> is installed, from the information provided in the
file debian/i<language>.info-ispell (also found in
/var/lib/dictionaries-common/ispell/i<language>).  In the ingerman package,
this file contains:

    Language: deutsch (New German -tex mode-)
    Hash-Name: ngerman
    Emacsen-Name: german-new
    Casechars: [A-Za-z\"]
    Not-Casechars: [^A-Za-z\"]
    Otherchars: [']
    Many-Otherchars: no
    Additionalchars: \"
    Ispell-Args: -C -d ngerman
    Extended-Character-Mode: ~tex
    Coding-System: iso-8859-1
    Locale: de_DE

    Language: deutsch (New German 8 bit)
    Hash-Name: ngerman
    Emacsen-Name: german-new8
    Casechars: [A-Za-zÄÖÜäößü]
    Not-Casechars: [^A-Za-zÄÖÜäößü]
    Otherchars: [']
    Many-Otherchars: no
    Additionalchars: ÄÖÜäößü
    Ispell-Args: -C -d ngerman
    Extended-Character-Mode: ~latin1
    Coding-System: iso-8859-1
    Locale: de_DE

If a new record like the following is added to this file, as you suggested:

    Language: deutsch (New German 8 bit UTF-8)
    Hash-Name: ngerman
    Emacsen-Name: german-new8-utf8
    Casechars: [A-Za-zÄÖÜäößü]
    Not-Casechars: [^A-Za-zÄÖÜäößü]
    Otherchars: [']
    Many-Otherchars: no
    Additionalchars: ÄÖÜäößü
    Ispell-Args: -C -d ngerman
    Extended-Character-Mode: ~utf8
    Coding-System: utf-8
    Locale: de_DE

then the following would appear in jed-ispell-dicts.sl:

    ispell_add_dictionary (
      "german-new8-utf8",
      "ngerman",
      "ÄÖÜäößü",
      "[']",
      "~utf8",
      "-C -d ngerman");

So, my conclusion is that it is neither jed-extra's nor
dictionaries-common's responsibility to provide UTF-8 support for
ispell.sl; rather, it is up to each individual i<language> package to
provide it through its debian/i<language>.info-ispell file.  (I will
consider filing bug reports against the ispell dictionary packages.)

The only downside of this approach is that users will be offered both
"<language>" and "<language>-utf8" when calling
ispell_change_dictionary, although only one of them will make ispell.sl work
correctly with the character encoding in use.

It would be good if the non-UTF-8 entries could be filtered out when
_slang_utf8_ok is set, probably by looking at the extchr argument passed to
ispell_add_dictionary().  [Paul: what do you think?]
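
To make the filtering idea concrete, here is a rough, untested sketch.  The
wrapper name ispell_add_dictionary_filtered is made up for illustration and
is not part of ispell.sl.  It only looks at the two 8-bit/UTF-8 extchr values
that appear in this thread ("~latin1" and "~utf8"); letting "~tex" and any
other mode through unchanged is my assumption, since those entries use plain
ASCII here.

```slang
% Hypothetical wrapper, NOT part of ispell.sl: register a dictionary only
% if its extended character mode matches the encoding jed is running in.
define ispell_add_dictionary_filtered (name, hash, letters, otherchars,
                                       extchr, opts)
{
   if (_slang_utf8_ok)
     {
        % UTF-8 enabled jed: a latin1 entry would not work, so skip it.
        if (extchr == "~latin1")
          return;
     }
   else
     {
        % Non-UTF-8 jed: skip the UTF-8 entry instead.
        if (extchr == "~utf8")
          return;
     }

   % Assumption: everything else (e.g. "~tex") is usable in both cases.
   ispell_add_dictionary (name, hash, letters, otherchars, extchr, opts);
}
```

Whether such a check belongs in ispell_add_dictionary() itself or in the
generated jed-ispell-dicts.sl is an open question.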

-- 
Rafael