[libvoikko] HFST backend is no longer experimental

Sjur Moshagen sjurnm at mac.com
Mon Mar 18 21:51:11 EET 2013


On 18 Mar 2013, at 19:49, Harri Pitkänen wrote:

> On Monday 18 March 2013, Sjur Moshagen wrote:
>> On 18 Mar 2013, at 17:28, Harri Pitkänen wrote:
>>> So there seem to be two issues:
>>> 
>>> - Suggestions are not sorted as they should be. It looks like libvoikko uses
>>> the ospell library in a way that ignores the weights. I'll fix that.
>> 
>> Ok.
> 
> I have fixed this now.

Nice. One thing crossed my mind: at present only the error model is weighted; the acceptor is in practice unweighted. I imagine that in the future we will start adding weights to the acceptor as well, to fine-tune the suggestions further (e.g. to suggest lexicalised compounds over dynamic compounds). Is this taken into account, or could it cause issues later?
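
To make the question concrete, here is a minimal sketch in Python of how suggestions could be ranked once both transducers carry weights. It assumes weights are added together and that a lower total is better, as is usual for weighted transducers in the tropical semiring. The function name and the example weights are hypothetical; this is not the actual code of libvoikko or hfst-ospell.

```python
def rank_suggestions(candidates):
    """Rank suggestions by combined weight, lowest (best) first.

    candidates: iterable of (word, error_model_weight, acceptor_weight).
    Assumption: the two weights are simply added, and lower is better.
    """
    best = {}
    for word, error_weight, acceptor_weight in candidates:
        total = error_weight + acceptor_weight
        # Keep only the cheapest analysis for each suggested word.
        if word not in best or total < best[word]:
            best[word] = total
    # Sort by weight, then alphabetically so that ties are deterministic.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


# Hypothetical example: both candidates are one edit away (weight 1.0),
# but the acceptor penalises the dynamic compound (2.0) relative to
# the lexicalised one (0.0).
ranked = rank_suggestions([
    ("dynamic-compound", 1.0, 2.0),
    ("lexicalised", 1.0, 0.0),
])
```

With an unweighted acceptor, every acceptor weight is 0.0 and the ranking reduces to the error-model weights alone, which matches the present situation.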

>> My assumption is that the case handling & mapping of libvoikko is fast and
>> reliable (also across script systems), so I would suggest that we assume
>> that the fst's only contain canonical case, and nothing else. This should
>> result in smaller and faster fst's.
>> 
>> Given this assumption, the bug is actually in the fst, and not in the code
>> of either libvoikko or hfst-ospell. It should be easy to fix, though.
>> 
>> WDYT?
> 
> Sounds good to me. It should work if
> 
> - libvoikko knows about case mappings for the language (I think it does for
>   the languages that are being worked on)
> - the language allows all (or at least those that matter) of the words to be
>   written in these three (but no other) forms:
>   * in canonical case
>   * initial letter capitalized, other letters in canonical case
>   * all letters capitalized

Should mostly be fine, but there are words that should not be capitalised (at least in some languages), like 'van' in "Ludwig van Beethoven" and similar constructs. I don't know what to do with such words.
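
As an illustration of the three accepted forms, here is a rough Python sketch of a lookup against a lexicon that stores canonical case only. The function names and the toy lexicon are hypothetical, and a plain set stands in for the fst; the real case handling in libvoikko may well differ.

```python
def case_candidates(word):
    """Return the canonical-case forms worth looking up for a surface word."""
    candidates = [word]  # the word may already be in canonical case
    if not word:
        return candidates
    if word.isupper() and len(word) > 1:
        # All letters capitalised: canonical form may be all lower case
        # ("TALO" -> "talo") or have an initial capital ("OSLO" -> "Oslo").
        candidates.append(word.lower())
        candidates.append(word[0] + word[1:].lower())
    elif word[0].isupper():
        # Initial letter capitalised, the rest already in canonical case.
        candidates.append(word[0].lower() + word[1:])
    return candidates


def is_accepted(word, lexicon):
    """Accept a word if any of its canonical-case candidates is in the lexicon."""
    return any(candidate in lexicon for candidate in case_candidates(word))


# Toy lexicon in canonical case only (stands in for the acceptor fst).
lexicon = {"talo", "Oslo", "van"}
```

With this lexicon, "talo", "Talo" and "TALO" are accepted, "tAlo" and "oslo" are rejected, and "Van" is accepted too, which is exactly the open question above. Canonical forms with internal capitals cannot be recovered from an all-caps spelling by this simple mapping.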

> - error model is able to produce the necessary character case corrections so
>   that capitalized first letter is suggested when it is necessary to
>   capitalize the first letter.

That should be no problem if the canonical form has an initial upper-case letter.
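
A small Python sketch of what I mean, with a set standing in for the fst and a made-up weight for the case correction; in the real setup this would be a set of lower-to-upper pairs in the error model, not separate code.

```python
CASE_FIX_WEIGHT = 1.0  # hypothetical cost of correcting the initial letter's case


def suggest_initial_capital(word, lexicon):
    """Suggest the capitalised form when only that form is in the lexicon.

    Returns a list of (suggestion, weight) pairs; the list is empty when
    no case correction applies.
    """
    if word and word[0].islower():
        candidate = word[0].upper() + word[1:]
        if candidate in lexicon:
            return [(candidate, CASE_FIX_WEIGHT)]
    return []


# "Oslo" is stored with its initial capital, "talo" in lower case.
suggestions = suggest_initial_capital("oslo", {"Oslo", "talo"})
```

Here "oslo" gets "Oslo" as a suggestion, while nothing is suggested for "talo", since "Talo" is not a canonical form in the lexicon.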

> I think it would be good to document this in the file format specification to 
> avoid confusion:
> 
>  http://www.divvun.no/no/future/proofing/lexfile-spec.html

Yes, I will update the documentation.

Sjur



