[Dict-common-dev] (Not) available aspell dictionaries

Agustin Martin agustin.martin@hispalinux.es
Mon, 24 Jan 2005 19:55:01 +0100


On Sun, Jan 23, 2005 at 10:14:45PM -0800, Brian Nelson wrote:
> On Mon, Jan 24, 2005 at 02:44:07AM +0100, Christoph Berg wrote:
> > I had a look at the list of available aspell dictionaries at [1] and
> > compared it with the Debian packages that are available. The
> > comparison is an OO.o table (attached as .sxc and .pdf, .html version
> > available at [2]).
> > 
> > 28 dictionaries are packaged [3]
> > 39 dictionaries are not packaged
> > 
> > I'm interested in getting these 39 aspell dictionaries into Debian.
> > The problem is that I don't speak any of them [4], but I should be
> > able to ask some friends for linguistic support for at least the
> > slavian and romanian languages missing.
> 
> There are a few things keeping all of those languages from being
> supported:
> 
> * Aspell 0.60 in Debian.  0.60 adds support for a lot more languages
>   than were supported by 0.50 and earlier versions.
> 
> * The arch-dependent nature of the dictionaries.  Many compiled
>   dictionaries are huge (> 20 MB) and currently all dictionary packages
>   are arch-dependent.  If the average dictionary package is 10MB, 10 MB
>   * 12 arches * 39 dictionaries = 4680MB.  That's a very big hit on the
>   mirrors for something that is avoidable.  We need to make dictionary
>   packages Arch: all.
 
For some languages it was even worse: the current Catalan aspell dict is
really a severely stripped-down version of the ispell dict, since using
the complete one made the aspell hash size exceed 100 MB.

aspell 0.60 should improve things with affix compression, but for this it
is better to look first at the available myspell dicts (aspell uses myspell
aff tables), which should result in smaller sizes, and mix them with the
aspell data files. I have done some experiments with the Galician dict and
it seems to work well, apart from one quirk: aspell uses the myspell affix
table, but the ispell munched wordlist, that is, one without the first
line holding the word count that a myspell dict has. Nothing difficult to
deal with (just strip the first line).
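
As a rough illustration of that stripping step, here is a minimal Python
sketch. The function names, the file paths and the latin-1 default encoding
are assumptions made for the example, not anything taken from the Galician
package; it only drops the first line when that line is a bare number.

```python
def strip_word_count(lines):
    """Return the lines of a myspell .dic without its leading word count.

    The first line is dropped only if it consists solely of digits, so a
    wordlist that has already been stripped is returned unchanged.
    """
    lines = list(lines)
    if lines and lines[0].strip().isdigit():
        return lines[1:]
    return lines


def convert_dic(src_path, dst_path, encoding="latin-1"):
    """Write an ispell-style munched wordlist from a myspell .dic file.

    The encoding default is an assumption; use whatever the dict declares.
    """
    with open(src_path, encoding=encoding) as src:
        words = strip_word_count(src.readlines())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(words)
```

Something like convert_dic("gl_ES.dic", "gl.wl") would then give a wordlist
usable together with the myspell affix table (both file names are
hypothetical).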

I will try to write down my experience with this for the benefit of other
developers.

Regarding 'Arch: all', Brian and I were thinking about that and even
did some preliminary experiments, but without affix compression. I should
try again with affix compression. The drawbacks are a somewhat larger
installed size and that aspell-bin always needs to be present, since it is
needed to build the hashes from the postinst. I do not expect the computing
power needed for that build to be a problem. The advantages are that it is
much nicer to the mirrors and that all binary format changes in aspell
dicts can be handled automatically.
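
To make the postinst idea a bit more concrete, here is a small Python
sketch of the hash-building step. It assumes the "create master" usage
described in the aspell manual (wordlist on stdin, hash path as argument);
the helper names and file names are hypothetical, and an actual postinst
could well do this differently.

```python
import subprocess


def hash_build_command(lang, hash_path):
    """Build the aspell command line that compiles a master hash.

    Assumes the documented form:
        aspell --lang=<lang> create master <hash> < <wordlist>
    """
    return ["aspell", f"--lang={lang}", "create", "master", hash_path]


def build_hash(lang, wordlist_path, hash_path):
    """Compile wordlist_path into hash_path on the installing machine.

    Needs the aspell binary to be installed, which is why aspell-bin
    would always have to be present.
    """
    with open(wordlist_path, "rb") as wordlist:
        subprocess.run(hash_build_command(lang, hash_path),
                       stdin=wordlist, check=True)
```

Since the hash is built at install time, the package itself would only ship
the arch-independent wordlist and affix table.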

Cheers,

-- 
Agustin