[libvoikko] Status of North Sámi (SME) hfst speller

Sjur Moshagen sjurnm at mac.com
Thu Sep 29 12:47:03 EEST 2011


On 29 Sep 2011, at 12:17, Harri Pitkänen wrote:

>> It would be interesting to compare Voikko+malaga
>> with Voikko+omorfi, to see how much of the speed difference is due to the
>> hfst backend compared to Malaga. For that I would need gold-standard
>> spelling error test documents for Finnish - I can give instructions if
>> anyone would be interested in creating it.
> 
> It would indeed be nice to have such test document.

For the precision & recall testing we use real documents (from different sources found on the net), marked up according to this format:

http://www.divvun.no/doc/proof/spelling/testdoc/error-markup.html

Some of the details are biased towards Sámi needs, but those details do not affect the basic markup needed to calculate precision & recall.

For SME we have collected just under 8000 words containing 630 spelling errors. That is not much, but it is enough to get a decent evaluation of the speller.
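
To make the calculation concrete, here is a minimal sketch of how precision and recall could be derived once the speller's flags have been compared against the marked-up errors. It is not our actual test tooling; the function name and the counts are made up for illustration. The number of marked-up errors in the corpus (630 for SME) corresponds to true positives plus false negatives.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a speller run against a gold standard.

    true_pos:  marked-up errors that the speller flagged
    false_pos: correct words that the speller flagged anyway
    false_neg: marked-up errors that the speller accepted
    """
    flagged = true_pos + false_pos      # everything the speller flagged
    actual = true_pos + false_neg       # everything marked up as an error
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, not results from the SME test corpus.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision tells you how many of the flagged words were real errors; recall tells you how many of the real errors were caught.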

The texts collected should be publicly available. A major problem is finding texts with enough real spelling errors in them to be useful for testing, that is, texts not written using a spell checker. Net forum discussions tend to be quite free-form, but they also tend to be quite colloquial, which makes their spelling errors less representative of the errors people make in ordinary written text.

If anyone is interested in marking up a Finnish text according to our guidelines, I will add it to our corpus repository and run the same tests on it that I ran for SME.

Sjur



