[libvoikko] Status of North Sámi (SME) hfst speller

Harri Pitkänen hatapitk at iki.fi
Thu Sep 29 12:17:36 EEST 2011


> It would be interesting to compare Voikko+malaga
> with Voikko+omorfi, to see how much of the speed difference is due to the
> hfst backend compared to Malaga. For that I would need gold-standard
> spelling error test documents for Finnish - I can give instructions if
> anyone would be interested in creating it.

It would indeed be nice to have such a test document. I did some quick
testing with a word list that I often use for regression testing. It
is not suitable for precision/recall testing but should be OK for
performance tests. The Finnish zhfst speller was downloaded from
http://www.helsinki.fi/~tapirine/tmp/zhfst/fi/ and the Malaga morphology was
Suomi-malaga 1.10 (or at least something very close to it).

First, since HFST and Malaga (here malstd) have different
initialization times, I measured the time it takes to initialize each
speller and check just one correct word:

$ time echo kissa | voikkospell -d fi-x-malstd
C: kissa

real    0m0.011s
user    0m0.004s
sys     0m0.008s

$ time echo kissa | voikkospell -d fi-x-hfst
C: kissa

real    0m0.879s
user    0m0.648s
sys     0m0.224s


Then I ran the test file of 82037 words through both spellers. Note that
with the following commands no spelling suggestions are generated, so
we only measure the acceptor, not the error model:

$ time cat test.dic | voikkospell -d fi-x-malstd | grep W: > malaga.txt

real    0m23.565s
user    0m23.061s
sys     0m0.792s

$ time cat test.dic | voikkospell -d fi-x-hfst | grep W: > hfst.txt

real    0m31.269s
user    0m30.654s
sys     0m0.976s


Finally I took 20 words from the intersection of true positives for both
spellers, that is, misspelled words that both spellers flagged as wrong.
Then I generated corrections for them using both spellers:

$ time cat errortest.txt | voikkospell -s -d fi-x-malstd > malaga-err.txt

real    0m1.164s
user    0m1.144s
sys     0m0.016s

$ time cat errortest.txt | voikkospell -s -d fi-x-hfst > hfst-err.txt

real    0m8.746s
user    0m8.481s
sys     0m0.240s
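
The pool of words flagged by both spellers could be computed roughly as
follows. This is only a sketch: it assumes that malaga.txt and hfst.txt
contain lines of the form "W: word" (which is what the grep above keeps),
and the helper names and sample lines are made up for illustration.
Picking the true positives from that pool still has to be done by hand.

```python
def flagged_words(lines):
    """Collect the words from lines of the assumed form 'W: word'."""
    words = set()
    for line in lines:
        line = line.strip()
        if line.startswith("W: "):
            words.add(line[3:])
    return words


def flagged_by_both(malaga_lines, hfst_lines):
    """Return the words rejected by both spellers, sorted for stable output."""
    return sorted(flagged_words(malaga_lines) & flagged_words(hfst_lines))


if __name__ == "__main__":
    # Illustrative sample lines only; the real input would be read from
    # malaga.txt and hfst.txt.
    malaga_sample = ["W: huoneusto", "W: karkoittaa", "W: onlymalaga"]
    hfst_sample = ["W: karkoittaa", "W: onlyhfst", "W: huoneusto"]
    for word in flagged_by_both(malaga_sample, hfst_sample):
        print(word)
```

With the real files, the two arguments would simply be the open file
objects, since both iterate line by line.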


From these measurements I can calculate some rough performance figures
(with the initialization time subtracted from each total):

Average time needed to check a word - HFST:   0.370 ms
Average time needed to check a word - Malaga: 0.287 ms

Average time needed to generate suggestions - HFST:   393 ms
Average time needed to generate suggestions - Malaga:  58 ms
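
For anyone who wants to redo the arithmetic: the averages above are
consistent with taking the wall-clock ("real") time of each run,
subtracting the startup time measured with the single-word test, and
dividing by the number of words. A small sketch of that calculation
follows; the function and variable names are mine, the numbers are the
ones from the timings in this mail.

```python
WORDS_CHECKED = 82037   # size of test.dic
WORDS_CORRECTED = 20    # size of errortest.txt

# Wall-clock ("real") times in seconds, copied from the runs above.
TIMINGS = {
    "HFST":   {"startup": 0.879, "check": 31.269, "suggest": 8.746},
    "Malaga": {"startup": 0.011, "check": 23.565, "suggest": 1.164},
}


def per_item_ms(total_s, startup_s, items):
    """Average time per item in milliseconds, excluding startup cost."""
    return (total_s - startup_s) / items * 1000.0


def summarize(timings):
    """Map each speller to (ms per checked word, ms per suggestion run)."""
    summary = {}
    for name, t in timings.items():
        check_ms = per_item_ms(t["check"], t["startup"], WORDS_CHECKED)
        suggest_ms = per_item_ms(t["suggest"], t["startup"], WORDS_CORRECTED)
        summary[name] = (check_ms, suggest_ms)
    return summary


if __name__ == "__main__":
    for name, (check_ms, suggest_ms) in summarize(TIMINGS).items():
        print(f"{name}: {check_ms:.3f} ms per checked word, "
              f"{suggest_ms:.0f} ms per word with suggestions")
```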

                       HFST   Malaga
Correct in top 3:         3       15
Correct not in top 3:     0        0
Wrong suggestions:        1        4
No suggestions:          16        1

These numbers look bad for HFST, but it should be said that these errors,
while taken from real-world material, are probably in longer-than-average
words (which is due to the nature of the original test material).
HFST might work better with shorter words. The actual list of words used
in this spelling correction test is pasted below in case someone is
interested.

aatesuunataus
hailakansisinen
hämäränäkökykyisuus
huoneusto
jatkuvayllpitoisuus
jääpallomaajoukkoe
julkistamskielto
kaapelpipalvelu
kaavapareriluonnos
karkoittaa
kongressiatpahtuma
kontrastisääto
käsipallojoukkoe
lääke-esitely
lämmittelija
maantetyö
mehiläisvahavestos
metsastyslupa
moskiitohyttynen
nopealiikeisyys
