[libvoikko] Status of North Sámi (SME) hfst speller [was: HFST speller lexicon spec - RC1]
sjurnm at mac.com
Thu Sep 29 08:41:17 EEST 2011
[this e-mail is also sent cc to the Divvun and Giellatekno teams as well as the hfst bug list, as the findings have general interest for all groups; replies should go to the libvoikko list only (membership required, see http://lists.puimula.org/listinfo/libvoikko)]
On 10 Aug 2011 at 18:55, Harri Pitkänen wrote:
> The list of words I used in these tests does not represent real life Swedish
> at all which will affect the results. I'm very interested in seeing results
> from similar tests for other HFST transducers and Hunspell dictionaries.
I am finally in a position to provide such data for SME. The details can be found here:
http://divvun.no/doc/proof/spelling/testing/sme/vkhfst/goldstandard/latest-GoldstandardTexts.txt.summary.html - Voikko + HFST
http://divvun.no/doc/proof/spelling/testing/sme/pl/goldstandard/latest-GoldstandardTexts.txt.summary.html - our MS Word speller
http://divvun.no/doc/proof/spelling/testing/sme/hu/goldstandard/latest-GoldstandardTexts.txt.summary.html - our Hunspell variant
The lexicon basis is the same for all three variants, but due to conversion between the formats and other technical issues there are differences in lexical coverage.
The main observations are:
Precision & recall
- - - - - - - - -
The voikko+hfst speller is slightly *better* than our MS Word speller, and far better than our Hunspell speller! This is most likely because the hfst speller needs no format conversion: we can use our morphological transducers directly, eliminating conversion steps that introduce bugs.
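For readers unfamiliar with the two measures, here is a minimal sketch of how precision and recall can be computed for a speller from gold-standard counts. The class name, field names and the numbers in the example are made up for illustration; they are not taken from the Divvun test reports.

```python
from dataclasses import dataclass


@dataclass
class SpellerCounts:
    """Counts from comparing speller output against a gold standard."""
    true_positives: int   # real misspellings that the speller flagged
    false_positives: int  # correct words that the speller flagged
    false_negatives: int  # real misspellings that the speller accepted


def precision(c: SpellerCounts) -> float:
    """Share of flagged words that really are misspellings."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def recall(c: SpellerCounts) -> float:
    """Share of real misspellings that the speller flagged."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


# Illustrative numbers only, not measured data.
example = SpellerCounts(true_positives=80, false_positives=20, false_negatives=10)
print(f"precision={precision(example):.3f} recall={recall(example):.3f}")
```

A lexicon with gaps mainly hurts precision (correct words get flagged), while a lexicon that overgenerates mainly hurts recall (misspellings get accepted).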
On suggestion quality, the MS Word speller beats all, with roughly 91% of all misspellings getting relevant suggestions. But the voikko+hfst speller is not bad at all for a very first attempt at creating an hfst-based speller: almost 3/4 (~ 74%) of the misspellings get relevant suggestions. Hunspell is quite good, with roughly 86% of misspellings getting relevant suggestions.
I would expect the number of relevant suggestions to increase as we get more time to work on the error model. The goal is to reach at least the same quality as our MS Word speller.
I have not measured the real memory consumption with any tools, but the acceptor transducer is 11 MB, and the error model transducer is 1.4 MB. The zipped zhfst file is 3.0 MB, which is not bad at all. Judging by file size alone, this looks perfectly acceptable. Others will have to measure real memory consumption during use (or give me instructions on how to do it).
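One possible way to get a rough figure for real memory use is to run the speller as a child process and read its peak resident set size afterwards. The sketch below assumes a Unix-like system (Python's `resource` module is not available on Windows) and that the speller can be started from the command line; the function name is hypothetical, and the command in the example is only a stand-in, not a real speller invocation.

```python
import resource
import subprocess
import sys


def peak_child_rss_bytes(argv, stdin_data=b""):
    """Run argv to completion and return the peak RSS of child processes.

    Caveat: RUSAGE_CHILDREN reports the maximum over *all* children that
    this process has waited for so far, so measure one command per run.
    """
    subprocess.run(argv, input=stdin_data, stdout=subprocess.DEVNULL, check=True)
    maxrss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return maxrss if sys.platform == "darwin" else maxrss * 1024


if __name__ == "__main__":
    # Stand-in command; replace with the actual speller command line
    # and feed the test words on stdin.
    peak = peak_child_rss_bytes([sys.executable, "-c", "pass"])
    print(f"peak RSS: {peak / (1024 * 1024):.1f} MiB")
```

This only gives the peak for a command-line run; memory use inside LibreOffice would have to be measured with other tools.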
Running the gold-standard test
- - - - - - - - - - - - - - - -
MS Word speller (command-line tool from technology provider, not available outside Divvun):
These figures are of course relative, depending on hardware, RAM etc. But all tests were run on the same hardware under similar conditions, and the results thus give a good picture of the speed differences between the three spellers tested. It would be interesting to compare Voikko+malaga with Voikko+omorfi, to see how much of the speed difference is due to the hfst backend compared to Malaga. For that I would need gold-standard spelling error test documents for Finnish - I can give instructions if anyone would be interested in creating them.
I guess it would also be possible to do more fine-grained measuring of where time is spent, but that is beyond me. It would even be interesting to compare different hfst transducers (i.e. different languages) to see to what extent the transducer itself contributes to the speed differences.
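As a starting point for such comparisons, a minimal timing harness could look like the sketch below. It assumes the speller is reachable as a Python callable that takes a word and returns True or False; `time_checker` and the toy lexicon are hypothetical names for illustration, not part of libvoikko or hfst.

```python
import time
from typing import Callable, Iterable, Tuple


def time_checker(check: Callable[[str], bool], words: Iterable[str]) -> Tuple[int, float]:
    """Run check() on every word; return (words checked, elapsed seconds)."""
    count = 0
    start = time.perf_counter()
    for word in words:
        check(word)
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Toy stand-in for a real speller: a set lookup.
    toy_lexicon = {"giella", "sátni"}
    count, elapsed = time_checker(lambda w: w in toy_lexicon,
                                  ["giella", "sátni", "xyz"] * 1000)
    rate = count / elapsed if elapsed > 0 else float("inf")
    print(f"{count} words in {elapsed:.4f} s ({rate:.0f} words/s)")
```

Running the same word list through each backend with the same harness would at least separate lookup time from the time spent generating suggestions, if the two are timed in separate passes.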
Subjective speed impression in actual use
- - - - - - - - - - - - - - - - - - - - -
When used in LibreOffice, the Sámi hfst spellers give a "hesitant" impression. It usually takes about 2 seconds for the popup menu to appear when right-clicking a misspelled word, independent of whether there are any suggestions. This is definitely slower than both our MS Word speller and the Hunspell speller. This is also different from Voikko+Malaga, which returns "immediately".
Whether the slowdown when using hfst is due to the hfst-ospell library, the transducer used, or both requires more work to find out.
Except for the speed issue, the Voikko+hfst speller for SME is already a very good beta, if not a release candidate. No major work needs to be done on the acceptor (as expected), but some more work should be put into the suggestion transducer.
Regarding the speed, more detailed investigations are needed to identify the real bottlenecks. Based on a very limited test with Voikko+Malaga (in LibreOffice only), it seems that the problem is not within the Voikko code.