[libvoikko] Sámi/HFST

Harri Pitkänen hatapitk at iki.fi
Mon Jun 7 13:33:06 EEST 2010


On Monday 07 June 2010, Kevin Brubeck Unhammer wrote:
> (All those names were recognised when I run them through hfst-lookup,
> strange.)

This is because HfstSpeller in libvoikko does not support the optimization 
that allows a single backend call, spell("matti"), to determine which of the
following holds:
 SPELL_OK:        "matti" and "Matti" are both correct
 SPELL_CAP_FIRST: "Matti" is correct but "matti" is not
 SPELL_FAILED:    neither "matti" nor "Matti" is correct.

As a result, HfstSpeller currently returns SPELL_FAILED for such names when 
it should return SPELL_CAP_FIRST.
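
To illustrate the three result codes, here is a minimal sketch of how they 
could be derived from a case-sensitive backend using up to two lookups 
instead of one. This is not libvoikko code: the names SpellResult, 
spell_lowercase and accepts are made up for the example, and accepts stands 
in for whatever exact-match lookup the backend offers. It also assumes that a 
word accepted in lower case is acceptable when capitalized.

```python
from enum import Enum
from typing import Callable


class SpellResult(Enum):
    SPELL_OK = "ok"                # "matti" and "Matti" are both correct
    SPELL_CAP_FIRST = "cap_first"  # only "Matti" is correct
    SPELL_FAILED = "failed"        # neither form is correct


def spell_lowercase(word: str, accepts: Callable[[str], bool]) -> SpellResult:
    """Classify a lower-case word using a case-sensitive lookup.

    `accepts` is a hypothetical exact-match predicate, not a real API.
    """
    # Assumption: a word valid in lower case is also valid capitalized.
    if accepts(word):
        return SpellResult.SPELL_OK
    # Second lookup: the same word with only its first letter upper-cased.
    if accepts(word[:1].upper() + word[1:]):
        return SpellResult.SPELL_CAP_FIRST
    return SpellResult.SPELL_FAILED


# Toy lexicon standing in for a transducer.
lexicon = {"Matti", "talo"}
result = spell_lowercase("matti", lexicon.__contains__)
```

With this toy lexicon, "matti" is classified as SPELL_CAP_FIRST, which is the 
answer HfstSpeller should give for the names in question.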

We could fix HfstSpeller to handle this particular case, but that would not 
help with more complex capitalization scenarios. A spell checker should also 
be able to check words written COMPLETELY IN UPPER CASE. In that case "MATTI" 
is correct if any of "matti", "Matti", "MATTI", "mAtti", ... is correct. 
Checking all those combinations separately is not possible within reasonable 
time, but right now neither HFST nor the Sámi transducer appears to support 
any other way of doing case-insensitive checking.
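
To show why checking every combination does not scale, the following sketch 
enumerates the case variants of a word. The helper case_variants is 
hypothetical and exists only for this illustration. A word with n cased 
letters has 2^n variants, so each would need its own backend lookup.

```python
from itertools import product


def case_variants(word: str) -> list[str]:
    """Return every upper/lower-case combination of `word`."""
    choices = []
    for ch in word:
        lower, upper = ch.lower(), ch.upper()
        # Characters without a case distinction contribute a single choice.
        choices.append((lower, upper) if lower != upper else (ch,))
    return ["".join(combo) for combo in product(*choices)]


variants = case_variants("MATTI")
count = len(variants)  # 2**5 == 32 lookups for a five-letter word
```

Five letters already give 32 candidates, and a 20-letter word would give 
2**20, which is over a million lookups for a single word.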

I suppose we should at least add some support for backends (or languages) that 
do not support case-insensitive checking. This would mean disabling the 
optimization for a capitalized first letter with such backends, and probably 
just stating that case-insensitive (all-caps) checking may not work correctly 
when they are used.
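
One possible shape for such a fallback is sketched below. It is only an 
illustration under assumptions of my own: check_word and accepts are 
hypothetical names, and trying just three candidates for all-caps input is a 
best-effort choice, not something libvoikko does today.

```python
from typing import Callable


def check_word(word: str, accepts: Callable[[str], bool]) -> bool:
    """Best-effort check against a backend that is strictly case-sensitive.

    `accepts` is a hypothetical exact-match predicate for the backend.
    """
    if word.isupper():
        # All-caps input: try a small fixed set of candidates instead of
        # all 2**n case combinations. Mixed-case entries are missed.
        candidates = (word, word.lower(), word.capitalize())
        return any(accepts(c) for c in candidates)
    # Otherwise check the word exactly as written, with no
    # capitalized-first-letter optimization.
    return accepts(word)


# Toy lexicon standing in for a transducer.
lexicon = {"Matti", "talo", "iPhone"}
ok_name = check_word("MATTI", lexicon.__contains__)     # found via "Matti"
missed = check_word("IPHONE", lexicon.__contains__)     # "iPhone" is not tried
```

The second lookup shows the limitation: an entry with unusual internal 
capitalization is rejected in all caps, which is the kind of case where 
checking "may not work correctly".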

Harri


