[libvoikko] Moving support for HFST spellers from experimental to stable
Flammie Pirinen
flammie at iki.fi
Mon Jan 7 08:43:54 EET 2013
2013-01-06, Harri Pitkänen wrote:
> If I recall correctly most of the
> problems with spellers built from these have been with spelling
> correction (either it has been slow or missed corrections that
> Hunspell would have provided). But last time I seriously looked at
> this was over a year ago, maybe this is no longer an issue?
Maybe. At the moment I cannot find any bug reports, emails, IRC
messages or other specific complaints, so I cannot be certain. I
haven't worked with hunspell dictionaries in a while and I'm sure there
are some differences left. Speed should not be an issue compared to
hunspell: hunspell is quite slow in most of my test runs (the official
version; I know there is google's custom one that is fast but far from
feature complete). We won't be beating aspell for English any time soon
though. In general the speed and memory usage should be usable for
interactive use with even the biggest automata.
All in all, I think it would be useful to set some specific targets
for speed, precision and coverage, maybe with bug reports to track
that they get met.
> In any case since many of the languages being worked on don't have
> Hunspell dictionaries I believe it is finally time to promote HFST
> spellers into officially supported status within libvoikko.
I think that is the strongest argument. I don't believe people already
using hunspell will switch just for some improvements; after all,
hunspell is good enough for all its current users.
> I want to
> do this for the next release. I can see three possible options:
>
> 1) HFST spellers are installed by placing suitable zhfst speller
> archives under ~/.voikko/3/
> 2) HFST spellers are installed by placing a metadata file
> (voikko.pro), HFST acceptor (spl.hfstol), and HFST error model
> (err.hfstol) under a subdirectory of ~/.voikko/3/. Each language
> would have its own subdirectory.
> 3) Combination of 1 and 2, that is we would use both zhfst speller
> and a separate metadata file (this is essentially how it works right
> now).
>
> I don't want to go with option 3 as it is the most complicated one and
> requires duplicating the speller metadata. Option 1 would be the most
> convenient for the users. Unfortunately I don't know if reading the
> XML metadata from zhfst spellers is possible with current version of
> hfst-ospell? I see a method called metadata_dump() but that's not
> good for this.
I'm fine with any of the approaches; for most users it's a minor and
invisible detail.
I think the hfst-ospell command-line tool in the repository should
read and print the XML metadata when called with --verbose, so the
needed code is there; at least I remember implementing it once.
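As a rough illustration of what option 1 needs on the reading side,
here is a minimal Python sketch that pulls the locale out of a zhfst
archive without going through hfst-ospell. It assumes the archive is an
ordinary ZIP file whose metadata lives in a member called index.xml,
with a locale element under info; those names are assumptions made for
the example and are not confirmed anywhere in this thread. The function
name is hypothetical too.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the metadata member is called "index.xml" (not confirmed here).
METADATA_MEMBER = "index.xml"


def read_speller_locale(archive_path):
    """Return the locale declared in the archive's XML metadata, or None.

    Hypothetical helper: assumes <info><locale>..</locale></info> sits
    directly below the XML root element.
    """
    with zipfile.ZipFile(archive_path) as archive:
        if METADATA_MEMBER not in archive.namelist():
            return None  # no metadata member in this archive
        root = ET.fromstring(archive.read(METADATA_MEMBER))
    locale = root.findtext("info/locale")
    return locale.strip() if locale else None
```

This only shows that ZIP plus XML parsing is enough to get at the
metadata; the real code would live in C++ and would need the XML and
ZIP libraries mentioned above.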
> If anyone has time to implement option 1 during the next few months,
> that would be great. Otherwise I will most likely proceed with option
> 2 as everything needed for it is mostly done and we would not need to
> depend on XML and ZIP libraries.
I think it should be doable. Assuming the files in voikko/3/ are
language coded as speller-...zhfst, the only missing piece is parsing
the language code from the file name, or just enumerating all zhfst
files. Currently I think the code just uses a hard-coded speller.zhfst
in the 2/ dirs instead (plus the pro file parsing).
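The enumeration step could look roughly like the Python sketch below.
It assumes the part elided in the file name above is a plain language
code, so that archives are called speller-<code>.zhfst; that reading,
the regular expression and the function name are all assumptions for
illustration, not what libvoikko actually does.

```python
import re
from pathlib import Path

# Assumption: archives are named speller-<code>.zhfst, <code> being a language code.
SPELLER_NAME = re.compile(r"^speller-([A-Za-z0-9_-]+)\.zhfst$")


def find_spellers(directory):
    """Map language code -> archive path for each matching file in directory."""
    spellers = {}
    base = Path(directory).expanduser()
    if not base.is_dir():
        return spellers  # missing directory: nothing installed
    for entry in sorted(base.iterdir()):
        match = SPELLER_NAME.match(entry.name)
        if match and entry.is_file():
            spellers[match.group(1)] = entry
    return spellers
```

Calling find_spellers("~/.voikko/3") would then give one entry per
installed language. Files that do not match the pattern, such as a bare
speller.zhfst, are skipped.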
--
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>