[libvoikko] Small patch to HFST2 backend stuff

Flammie Pirinen flammie at iki.fi
Sat Apr 10 11:01:47 EEST 2010


2010-04-10, Sjur Moshagen wrote:

> On 10 Apr 2010 at 03:11, Flammie Pirinen wrote:
> 
> >> There are many errors in hyphenation that are caused by compound
> >> entries in Kotus-sanalista. For example
> >> 
> >>  <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
> >> 
> >> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
> >> correct hyphenation (and the one that Malaga backend produces) 
> >> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
> >> completely from hyphenation transducer and using only Joukahainen
> >> would improve things? It would also help to take advantage of
> >> morpheme border hints that have been entered for many words in
> >> Joukahainen.
> > 
> > Yes, since Kotus-sanalista gives no indication that a compound
> > boundary is involved, the schoolbook algorithm finds the wrong
> > hyphenation, which automatic compounding disagrees with. The correct
> > solution would be to introduce compound boundaries into the
> > Kotus-sanalista data. Using the morpheme boundary hints from
> > Joukahainen will also help.
> 
> Wouldn't it be possible to run the kotus-sanalista through the Omorfi
> analyser, to (semi-)automatically find the word boundaries? That does
> of course require that word boundaries are marked in the output from
> Omorfi, but in one way or another this should be possible, right?

Certainly. Either way it's not a hard or long task, just something I
haven't gotten around to, since it hasn't really been needed before.
I'll hopefully look into it in the next few days.
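
As a rough illustration of why the boundary matters, here is a minimal
Python sketch. It is not the HFST or Malaga implementation: the
consonant-vowel rule is a simplified stand-in for the schoolbook
algorithm, and the "=" boundary marker and the function names are
assumptions made up for this example. Without the boundary it gives a
wrong split (not the exact one HFST produces); with the boundary it
gives the correct "päi-väl-lis-ai-ka".

```python
VOWELS = set("aeiouyäöå")


def hyphenate_part(word):
    """Hyphenate one compound part with a simplified CV rule.

    A break goes before every consonant that is directly followed by a
    vowel, provided the current syllable already contains a vowel.
    Breaks between vowels (e.g. "ko-e") are deliberately not handled.
    """
    pieces = []
    start = 0
    for i in range(1, len(word) - 1):
        is_cv = word[i] not in VOWELS and word[i + 1] in VOWELS
        has_vowel = any(c in VOWELS for c in word[start:i])
        if is_cv and has_vowel:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return "-".join(pieces)


def hyphenate(word, boundary="="):
    """Hyphenate each compound part separately, then join the parts.

    The "=" boundary marker is a hypothetical convention for this sketch.
    """
    return "-".join(hyphenate_part(p) for p in word.split(boundary) if p)


print(hyphenate("päivällisaika"))   # no boundary information
print(hyphenate("päivällis=aika"))  # boundary marked
```

The only difference between the two calls is the boundary marker, which
is the information the Kotus-sanalista entry lacks.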

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>


