[libvoikko] Sámi/HFST

Kevin Brubeck Unhammer p.ixiemotion at gmail.com
Mon Jun 7 11:51:22 EEST 2010


2010/6/7 Harri Pitkänen <hatapitk at iki.fi>:
> On Monday 07 June 2010, Harri Pitkänen wrote:
>> That solved the problem and I was able to do
>> some basic spell checking:
>
> I did a bit more testing by comparing Voikko (binary sme transducer from
> hfst.sf.net) and Hunspell (hunspell-se 1.0~beta6.20081222-1.1 from Debian).
> From a performance point of view, HFST/Voikko seems to be much better than
> Hunspell:
>
> - Checking all unique words from Sámi Wikipedia took 9.5 seconds with Voikko
> and 20.5 seconds with Hunspell. These numbers contain the time needed to
> initialize the speller and perform the actual checking.
>
> - Use of non-shareable memory after loading the speller was 26 MB with Voikko
> and 150 MB with Hunspell. Both programs used about 25 MB of shareable memory
> on top of those numbers.
>
> It should be noted that the configuration used with Voikko does not support
> spelling suggestions at all. Depending on how those would be implemented,
> the memory footprint for Voikko could end up being much larger than in this test.
>
> Starting Hunspell with Sámi language caused lots of errors like this:
>
>  error: line 3512: flag id 65529 is too large (max: 65509)
>
> I cannot say which one is linguistically better. Both can be used in OOo so I
> made a screenshot that contains some Northern Sámi text checked with both
> spellers:
>
>  http://www.puimula.org/htp/testing/hfst/openoffice-sme-spelling.png
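
For anyone wanting to repeat the timing comparison quoted above, here is a
minimal sketch of that kind of measurement: speller initialisation plus
checking of every unique word, timed together. The names `time_speller` and
`make_checker` are hypothetical; `make_checker` stands for whatever loads a
real speller and returns a function that says whether a word is accepted.

```python
import time


def time_speller(make_checker, words):
    """Time speller initialisation plus checking of every unique word.

    make_checker is a hypothetical zero-argument factory that loads a
    speller and returns a callable word -> bool (True means accepted).
    """
    unique_words = sorted(set(words))
    start = time.perf_counter()
    check = make_checker()  # initialisation is part of the measurement
    rejected = [word for word in unique_words if not check(word)]
    elapsed = time.perf_counter() - start
    return elapsed, rejected


if __name__ == "__main__":
    # Toy stand-in for a real speller: a small set of accepted words.
    toy_lexicon = {"beaivve", "ii"}
    elapsed, rejected = time_speller(
        lambda: toy_lexicon.__contains__, ["ii", "ii", "xx", "beaivve"]
    )
    print(f"{elapsed:.6f} s, rejected: {rejected}")
```

This measures wall-clock time only; the memory figures quoted above would
need a separate tool.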

Erroneously marked by hunspell but OK with voikko:
  beaivve, ii, oahppiide, álgoálbmotraporterejeaddji,
  olmmošvuoigatvuohtaraporttat

Erroneously marked by voikko but OK with hunspell:
  James, Matti Morottaja, Klemetti Näkkäläjärvi, Juvvá Lemet, Oslos,
  Irja Seurujärvi-Kari

I can't tell if there were false positives anywhere, but voikko
definitely wins here ;-)
(All those names were recognised when I ran them through hfst-lookup, which
is strange.)
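
A comparison like the one above can be done mechanically rather than by
reading a screenshot. The sketch below assumes each speller is wrapped in a
function that returns True for an accepted word; `compare_spellers` and the
toy lexicons are hypothetical stand-ins, not the real Hunspell or Voikko
APIs. Disagreement only shows where the spellers differ, so deciding which
one is right still needs someone who knows the language.

```python
def compare_spellers(words, check_a, check_b):
    """Return (rejected only by A, rejected only by B), in input order.

    check_a and check_b are hypothetical callables word -> bool,
    where True means the speller accepts the word.
    """
    only_a, only_b = [], []
    for word in dict.fromkeys(words):  # drop duplicates, keep order
        ok_a, ok_b = check_a(word), check_b(word)
        if ok_b and not ok_a:
            only_a.append(word)
        elif ok_a and not ok_b:
            only_b.append(word)
    return only_a, only_b


if __name__ == "__main__":
    # Toy lexicons standing in for the two spellers.
    speller_a = {"James", "Oslos"}
    speller_b = {"beaivve", "ii"}
    words = ["beaivve", "ii", "James", "Oslos", "ii"]
    only_a, only_b = compare_spellers(
        words, speller_a.__contains__, speller_b.__contains__
    )
    print("rejected only by A:", only_a)
    print("rejected only by B:", only_b)
```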


best regards,
Kevin B. Unhammer


