[libvoikko] HFST backend is no longer experimental
sjurnm at mac.com
Mon Mar 18 17:47:54 EET 2013
On 18 Mar 2013 at 17:28, Harri Pitkänen wrote:
> So there seems to be two issues:
> - Suggestions are not sorted as they should. It looks like libvoikko uses the
> ospell library in a way that ignores the weights. I'll fix that.
> - Libvoikko will (if it results in a valid word) try to convert the
> suggestions to match the case of the original word. Ospell however returns
> "хакас" and "Хакас" as separate suggestions which will then result in "Хакас"
> being suggested twice. Here I'm not sure what to do. If you think we should
> just trust hfst-ospell I can fix libvoikko to not touch the character case.
> But then I also think that suggesting the same word with multiple different
> capitalizations may not generally be a good idea.
Agreed. hfst-ospell returns whatever the transducer contains, and for text processing we have traditionally handled casing as part of the fst, in some cases even including all-uppercase words. This uppercasing is expensive, though, both in disk space and in processing time (initial capitals much less so than all-uppercase, but still expensive).
My assumption is that libvoikko's case handling and mapping is fast and reliable (also across script systems), so I would suggest we assume that the fsts contain only canonical case and nothing else. This should result in smaller and faster fsts.
Given this assumption, the bug is actually in the fst, and not in the code of either libvoikko or hfst-ospell. It should be easy to fix, though.
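
To illustrate the division of labour suggested above, here is a minimal sketch in Python. It is not libvoikko or hfst-ospell code: the function names and the (word, weight) input format are hypothetical, and it assumes that a lower weight means a better suggestion. The caller recases each suggestion to follow the original word, merges duplicates created by the recasing, and sorts by weight.

```python
def match_case(suggestion: str, original: str) -> str:
    """Recase a suggestion to follow the casing of the original word.

    Hypothetical helper; only handles all-uppercase and initial-capital input.
    """
    if len(original) > 1 and original.isupper():
        return suggestion.upper()
    if original[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion


def rank_suggestions(original: str, weighted: list[tuple[str, float]]) -> list[str]:
    """Recase, deduplicate and sort suggestions.

    `weighted` is a list of (word, weight) pairs; lower weight is assumed
    to be better. When recasing makes two suggestions identical, the
    lowest weight is kept.
    """
    best: dict[str, float] = {}
    for word, weight in weighted:
        recased = match_case(word, original)
        if recased not in best or weight < best[recased]:
            best[recased] = weight
    # Sort by weight; ties keep their first-seen order (sorted() is stable).
    return [word for word, _ in sorted(best.items(), key=lambda item: item[1])]
```

With this approach, "хакас" and "Хакас" coming back as separate suggestions for a capitalized input collapse into a single "Хакас". If the fst contains only canonical case, the merge step rarely triggers and mainly serves as a safety net.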