[libvoikko] Small patch to HFST2 backend stuff

Sjur Moshagen sjurnm at mac.com
Sat Apr 10 10:16:38 EEST 2010


On 10 Apr 2010, at 03:11, Flammie Pirinen wrote:

>> There are many errors in hyphenation that are caused by compound
>> entries in Kotus-sanalista. For example
>> 
>>  <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
>> 
>> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
>> correct hyphenation (and the one that Malaga backend produces) 
>> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
>> completely from hyphenation transducer and using only Joukahainen
>> would improve things? It would also help to take advantage of
>> morpheme border hints that have been entered for many words in
>> Joukahainen.
> 
> Yes, since kotus-sanalista showed no indication that there's a compound
> boundary involved, the schoolbook algorithm finds the wrong
> hyphenation, which automatic compounding disagrees with. The correct
> solution would be to introduce the compound boundaries for
> kotus-sanalista data. And also using the morpheme boundary hints from
> Joukahainen will help. 

Wouldn't it be possible to run the Kotus-sanalista entries through the Omorfi analyser to find the compound boundaries (semi-)automatically? That would of course require that the boundaries between the parts of a compound are marked in Omorfi's output, but one way or another this should be possible, right?
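
A minimal sketch of that idea in Python, under stated assumptions: the analyser output has already been collected into a dictionary from each word to its candidate segmentations, "#" marks a compound boundary in that output, and "=" is the hint character written back into the word list. None of these are taken from Omorfi or Kotus-sanalista documentation; the function names are hypothetical too. Entries with exactly one segmentation are annotated, and entries with competing segmentations are flagged for manual review, which is the "semi" part.

```python
import re

BOUNDARY = "#"  # assumed boundary marker in the analyser output (hypothetical)


def boundary_candidates(word, analyses):
    """Return the distinct segmentations that contain a boundary and spell out `word`."""
    return {
        a for a in analyses
        if BOUNDARY in a and a.replace(BOUNDARY, "") == word
    }


def annotate_entry(entry, analyses_by_word, hint="="):
    """Insert a boundary hint into the <s> element of one word list entry.

    Returns (entry, status) where status is "annotated", "ambiguous"
    or "unchanged". Only unambiguous segmentations are applied.
    """
    match = re.search(r"<s>([^<]+)</s>", entry)
    if match is None:
        return entry, "unchanged"
    word = match.group(1)
    candidates = boundary_candidates(word, analyses_by_word.get(word, []))
    if not candidates:
        return entry, "unchanged"
    if len(candidates) > 1:
        # Competing segmentations: leave the entry alone for manual review.
        return entry, "ambiguous"
    segmented = next(iter(candidates)).replace(BOUNDARY, hint)
    return entry[:match.start(1)] + segmented + entry[match.end(1):], "annotated"


if __name__ == "__main__":
    entry = "<st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>"
    analyses = {"päivällisaika": ["päivällis#aika", "päivällisaika"]}
    print(annotate_entry(entry, analyses))
```

With the boundary recorded in the entry, the hyphenator could treat "päivällis" and "aika" as separate parts instead of applying the schoolbook rules across the whole string.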

Best regards,
Sjur



