[libvoikko] Small patch to HFST2 backend stuff
Sjur Moshagen
sjurnm at mac.com
Sat Apr 10 10:16:38 EEST 2010
On 10 Apr 2010, at 03:11, Flammie Pirinen wrote:
>> There are many errors in hyphenation that are caused by compound
>> entries in Kotus-sanalista. For example
>>
>> <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
>>
>> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
>> correct hyphenation (and the one that Malaga backend produces)
>> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
>> completely from hyphenation transducer and using only Joukahainen
>> would improve things? It would also help to take advantage of
>> morpheme border hints that have been entered for many words in
>> Joukahainen.
>
> Yes, since kotus-sanalista gives no indication that there's a compound
> boundary involved, the schoolbook algorithm finds the wrong
> hyphenation, which automatic compounding disagrees with. The correct
> solution would be to introduce the compound boundaries into the
> kotus-sanalista data. Using the morpheme boundary hints from
> Joukahainen will also help.
Wouldn't it be possible to run kotus-sanalista through the Omorfi
analyser to find the word boundaries inside compounds
(semi-)automatically? That does of course require that Omorfi marks
those boundaries in its output, but one way or another this should be
possible, right?
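
To illustrate why a marked boundary matters for hyphenation, here is a minimal Python sketch. It is only an illustration under several assumptions: the "=" boundary marker and the function names are hypothetical (this is not Omorfi's or libvoikko's actual format or API), and the hyphenator implements just the basic schoolbook rule "break before a consonant that is followed by a vowel", so it does not reproduce the exact HFST output quoted above.

```python
# Simplified Finnish vowel set; input is assumed to be lower case.
VOWELS = set("aeiouyäö")


def hyphenate_simple(word):
    """Split a non-compound word using only the consonant+vowel rule.

    A break is placed before every consonant that is directly followed
    by a vowel, provided the current syllable already contains a vowel.
    Vowel-sequence rules (diphthongs, long vowels) are ignored here.
    """
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        is_cv = word[i] not in VOWELS and word[i + 1] in VOWELS
        has_vowel = any(ch in VOWELS for ch in word[start:i])
        if is_cv and has_vowel:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def hyphenate(word, boundary="="):
    """Hyphenate each compound part separately.

    `boundary` is a hypothetical marker for a compound boundary; every
    marked boundary becomes a hyphenation point of its own.
    """
    parts = []
    for segment in word.split(boundary):
        if segment:
            parts.extend(hyphenate_simple(segment))
    return "-".join(parts)


if __name__ == "__main__":
    print(hyphenate("päivällisaika"))   # no boundary information
    print(hyphenate("päivällis=aika"))  # boundary marked
```

Without the marker, the rule attaches the final "s" of "päivällis" to the following syllable; with the marker, the two parts are hyphenated independently and the break falls at the compound boundary, as in "päi-väl-lis-ai-ka".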
Best regards,
Sjur