[libvoikko] Small patch to HFST2 backend stuff
Flammie Pirinen
flammie at iki.fi
Sat Apr 10 11:01:47 EEST 2010
2010-04-10, Sjur Moshagen wrote:
> On 10 Apr 2010, at 03:11, Flammie Pirinen wrote:
>
> >> There are many errors in hyphenation that are caused by compound
> >> entries in Kotus-sanalista. For example
> >>
> >> <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
> >>
> >> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
> >> correct hyphenation (and the one that Malaga backend produces)
> >> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
> >> completely from hyphenation transducer and using only Joukahainen
> >> would improve things? It would also help to take advantage of
> >> morpheme border hints that have been entered for many words in
> >> Joukahainen.
> >
> > Yes, since Kotus-sanalista gives no indication that a compound
> > boundary is involved, the schoolbook algorithm finds the wrong
> > hyphenation, which automatic compounding disagrees with. The correct
> > solution would be to introduce the compound boundaries into the
> > Kotus-sanalista data. Using the morpheme boundary hints from
> > Joukahainen will also help.
>
> Wouldn't it be possible to run the kotus-sanalista through the Omorfi
> analyser, to (semi-)automatically find the word boundaries? That does
> of course require that word boundaries are marked in the output from
> Omorfi, but in one way or another this should be possible, right?
Certainly. Either way it is neither a hard nor a long task, just
something I haven't got done yet, since it hasn't really been needed
before. I'll hopefully look into it in the next few days.
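
To illustrate why the missing compound boundary matters, here is a
minimal sketch. It is not the HFST or Malaga logic. It is a toy rule
that puts a hyphen before any consonant directly followed by a vowel,
and the "|" boundary marker and the function names are assumptions made
for this example only. Its output for the unmarked word is therefore
not the same string the HFST backend produced, but the mechanism is the
same: without a boundary, the final "s" of "päivällis" gets attached to
the following syllable.

```python
# Toy hyphenation rule, for illustration only (hypothetical helper names).
VOWELS = set("aeiouyäö")


def hyphenate_part(part):
    """Hyphenate one compound part: break before each consonant that is
    directly followed by a vowel, but never before the first vowel."""
    out = []
    for i, ch in enumerate(part):
        is_cv = ch not in VOWELS and i + 1 < len(part) and part[i + 1] in VOWELS
        seen_vowel = any(c in VOWELS for c in part[:i])
        if is_cv and seen_vowel:
            out.append("-")
        out.append(ch)
    return "".join(out)


def hyphenate(word, boundary="|"):
    """Hyphenate each compound part separately; the boundary marker
    (an assumed notation) always becomes a hyphenation point."""
    return "-".join(hyphenate_part(p) for p in word.split(boundary))


print(hyphenate("päivällisaika"))   # no boundary information
print(hyphenate("päivällis|aika"))  # boundary marked
```

With the boundary marked, the toy rule gives "päi-väl-lis-ai-ka";
without it, the "s" moves to the next syllable.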
--
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>