[libvoikko] hfst-ospell API

Sam Hardwick sam.hardwick at gmail.com
Wed Aug 11 19:50:14 EEST 2010


Sjur Moshagen wrote:
> Here's some basic feedback from testing on the Mac:
> 
> Command used:
> 
> $ cut -f1 ../../../langtech/main/gt/sme/src/typos.txt | grep -v '^#' | grep -v '^$' | ./hfst-ospell /Users/sjur/hfst-langs/omorfi/src/suggestion/edit-distance-1.hfst  ../../../langtech/main/gt/sme/bin/sme.hfst.ol
> 
> Ie, take a list of (North Sámi) typos, use the correction transducer from omorfi, and run using the regular North Sámi optimised hfst analyser.
> 
> Result:
> 
> The process takes forever, and consumes well over a gigabyte of RAM. In the end (after more than 30 min.) I just killed it.

The omorfi error transducers are rather large - I think Tommi and I
looked at a ~300M one. How large was the one you used? I'm not sure, but
I think something is off with them - the ones I've generated and tested
with have been some hundreds of kilobytes. Even so, it may be a genuine
performance problem if we can't deal with big error sources (or perhaps
with big alphabets in general).

One way to run tests is to use the python script in /test (I just
updated it) together with the hfst tools. The script takes a string of
characters (and optionally an epsilon symbol) and produces an edit
distance 1 transducer, but without character swaps.
For example:

python test/editdist.py abcdefghijklmnopqrstuvwxyzåäö @0@ | \
    hfst-txt2fst -e "@0@" -w | hfst-lookup-optimize > findist1.hwfst.ol

For me that generates a 12K transducer. To allow larger edit distances,
an easy way is to use hfst-repeat, e.g.

python test/editdist.py abcdefghijklmnopqrstuvwxyzåäö @0@ | \
    hfst-txt2fst -e "@0@" -w - | hfst-repeat -f 1 -t 2 | \
    hfst-lookup-optimize > findist2.hwfst.ol

This covers edit distances 1 and 2. I get a 55K transducer that way.
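
To make the shape of such a transducer concrete, here is a minimal
sketch of a generator for an edit distance 1 transducer in AT&T text
format. This is not the actual test/editdist.py: the function name
edit_distance_1, the two-state layout, the choice to make both states
final (so zero edits are accepted too) and the weight of 1.0 per edit
are all assumptions made for illustration.

```python
def edit_distance_1(alphabet, epsilon="@0@", weight=1.0):
    """Return AT&T-format lines for an edit distance 1 transducer, no swaps.

    Hypothetical layout: state 0 = no edit made yet, state 1 = one edit made.
    """
    lines = []
    for a in alphabet:
        # Identity transitions, allowed both before and after the edit.
        lines.append(f"0\t0\t{a}\t{a}\t0.0")
        lines.append(f"1\t1\t{a}\t{a}\t0.0")
        # Deletion (a -> epsilon) and insertion (epsilon -> a).
        lines.append(f"0\t1\t{a}\t{epsilon}\t{weight}")
        lines.append(f"0\t1\t{epsilon}\t{a}\t{weight}")
        # Substitution of a by every other symbol in the alphabet.
        for b in alphabet:
            if a != b:
                lines.append(f"0\t1\t{a}\t{b}\t{weight}")
    # Assumption: both states are final, so "no edit" is accepted as well.
    lines.append("0\t0.0")
    lines.append("1\t0.0")
    return lines


# Print the transducer for a tiny two-letter alphabet.
print("\n".join(edit_distance_1("ab")))
```

Output of this kind is what would be piped into hfst-txt2fst in the
commands above.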

The transducers grow pretty rapidly with the alphabet, especially with
larger edit distances. Adding all the capital letters and numbers, I get
a 59K distance 1 transducer and a 283K distance 2 transducer. Adding
letter swaps makes it worse yet, but I don't think it should be
completely pathological...
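
The growth with alphabet size can be estimated with a bit of
arithmetic. The sketch below counts transitions in a two-state edit
distance 1 transducer without swaps; that layout (identity loops on
both states, plus one deletion and one insertion per symbol and one
substitution per ordered pair of distinct symbols) is an assumption,
as is the hypothetical function name transition_count. It says nothing
exact about file sizes, only about how the transition count scales.

```python
def transition_count(n):
    """Transitions in an assumed two-state edit distance 1 transducer.

    n identity loops on each of the two states, n deletions, n insertions
    and n * (n - 1) substitutions; no swaps.
    """
    return 2 * n + n + n + n * (n - 1)


small = 26 + 3               # a-z plus å, ä, ö
large = small + small + 10   # plus capital letters and digits

for n in (small, large):
    print(n, transition_count(n))
```

Under these assumptions the substitution term dominates, so the count
grows roughly with the square of the alphabet size.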

Sam Hardwick


