[libvoikko] Small patch to HFST2 backend stuff

Flammie Pirinen flammie at iki.fi
Sat Apr 10 03:11:25 EEST 2010

2010-04-09, Harri Pitkänen wrote:

> On Thursday 08 April 2010 06:53:05 Flammie Pirinen wrote:
> > Yes, in fact I used versions of factories where hfst backends were
> > used unconditionally #if HAVE_HFST, so I excluded them from
> > previous patch as that surely isn't wanted in the development
> > version.
> The factories are fixed now. A configuration file that enables all
> hfst backends is at
> http://www.puimula.org/htp/testing/hfst/voikko-fi_FI.pro

Ok, also updated in omorfi SVN.

> In the future I think it would be OK for you to commit your changes
> directly to libvoikko SVN. Just let me know the SourceForge account
> to which the commit access should be given.

That would be nice; the account name is 'mie'.

> There are many errors in hyphenation that are caused by compound
> entries in Kotus-sanalista. For example
>   <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
> correct hyphenation (and the one that Malaga backend produces) 
> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
> completely from hyphenation transducer and using only Joukahainen
> would improve things? It would also help to take advantage of
> morpheme border hints that have been entered for many words in
> Joukahainen.

Yes. Since Kotus-sanalista gives no indication that a compound
boundary is involved, the schoolbook algorithm finds the wrong
hyphenation, one that the automatic compounding analysis disagrees
with. The correct solution would be to add compound boundaries to the
Kotus-sanalista data. Using the morpheme boundary hints from
Joukahainen will help as well.
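
To illustrate why the boundary matters, here is a minimal sketch in
Python. It is not the HFST hyphenation transducer: it applies only a
simplified consonant-vowel rule (break before a consonant that is
directly followed by a vowel), and the "=" boundary marker and the
function names are assumptions made for this example. It does not
reproduce the exact HFST output quoted above, but it shows the same
kind of error: without the boundary, the "s" is pulled into the wrong
syllable.

```python
VOWELS = set("aeiouyäö")


def hyphenate_part(part: str) -> str:
    """Hyphenate one compound part using only a simplified CV rule."""
    pieces = []
    start = 0
    for i in range(1, len(part) - 1):
        # Candidate break: a consonant directly followed by a vowel.
        is_cv = part[i] not in VOWELS and part[i + 1] in VOWELS
        # Require a vowel in the syllable so far, so that a
        # word-initial consonant cluster is not split off.
        if is_cv and any(ch in VOWELS for ch in part[start:i]):
            pieces.append(part[start:i])
            start = i
    pieces.append(part[start:])
    return "-".join(pieces)


def hyphenate(word: str, boundary: str = "=") -> str:
    """Hyphenate each part separately; every boundary is a forced break."""
    return "-".join(hyphenate_part(p) for p in word.split(boundary) if p)


if __name__ == "__main__":
    print(hyphenate("päivällisaika"))   # no boundary information
    print(hyphenate("päivällis=aika"))  # boundary marked in the lexicon
```

With the boundary marked, each part is hyphenated on its own, so the
rule can no longer move the final consonant of "päivällis" across the
compound boundary.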

> Another common hyphenation problem with HFST backend appears in words
> ending with "minen", such as "juokseminen". This seems to be caused
> by Omorfi analyzing the word as "juoksemin+en":
> I suppose rejecting compounds ending with "ei" would be the correct
> thing to do here.

I thought I had already fixed that bug in r184, but apparently I still
missed some deverbals. It might be fixed in r186.

Also, I do not remember why I turned the comparison on for
participles, but there was some reason for it.

> Main problem with HFST suggestion code right now is that it is slow.
> For the test word above the time to produce the suggestions was 470
> ms with HFST and 18 ms with Malaga. My informal goal has been that
> suggestions should be produced within 100 ms on a modern low end
> laptop and the system used to run these tests had a 1.5 GHz CPU which
> is close to that standard. Presumably switching to optimized lookup
> will improve the performance so maybe it is too early to worry about
> this.

Yeah, if anything I was worried that the suggestion engine wasn't
working at all: when generating all possible suggestions within edit
distance 2, the delay on my Aspire One was more like a few seconds.
Anything below a second is still usable in interactive processing, so
while waiting for optimized lookup to be finalized I think this is
bearable.
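
As a rough illustration of why edit distance 2 gets expensive, here is
a minimal Python sketch that enumerates candidate strings naively. It
is an assumption-laden example, not how the HFST suggestion code works:
the alphabet, the function names, and the choice of edit operations
(delete, replace, insert) are all made up for this sketch.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # assumed alphabet


def edits1(word: str) -> set[str]:
    """All strings one deletion, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts


def edits2(word: str) -> set[str]:
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


if __name__ == "__main__":
    word = "kissa"
    print(len(edits1(word)), "candidates at distance 1")
    print(len(edits2(word)), "candidates within distance 2")
```

Every distance 1 candidate spawns its own full set of edits, so the
number of strings that would have to be checked against the lexicon
grows roughly with the square of the distance 1 count.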

Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>
