[libvoikko] Small patch to HFST2 backend stuff

Flammie Pirinen flammie at iki.fi
Sat Apr 10 03:11:25 EEST 2010


2010-04-09, Harri Pitkänen wrote:

> On Thursday 08 April 2010 06:53:05 Flammie Pirinen wrote:
> > Yes, in fact I used versions of factories where hfst backends were
> > used unconditionally #if HAVE_HFST, so I excluded them from
> > previous patch as that surely isn't wanted in the development
> > version.
> 
> The factories are fixed now. A configuration file that enables all
> hfst backends is at
> http://www.puimula.org/htp/testing/hfst/voikko-fi_FI.pro

Ok, also updated in omorfi SVN.

> In the future I think it would be OK for you to commit your changes
> directly to libvoikko SVN. Just let me know the SourceForge account
> to which the commit access should be given.

That would be nice; the account name is 'mie'.

> There are many errors in hyphenation that are caused by compound
> entries in Kotus-sanalista. For example
> 
>   <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
> 
> causes HFST backend to produce hyphenation "päi-väl-lisai-ka" when
> correct hyphenation (and the one that Malaga backend produces) 
> is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
> completely from hyphenation transducer and using only Joukahainen
> would improve things? It would also help to take advantage of
> morpheme border hints that have been entered for many words in
> Joukahainen.

Yes, since Kotus-sanalista gives no indication that a compound boundary
is involved, the schoolbook algorithm hyphenates the entry as if it were
a simple word, and the result disagrees with the one obtained through
automatic compounding. The correct solution would be to add the compound
boundaries to the Kotus-sanalista data. Using the morpheme boundary
hints from Joukahainen will also help.
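
To illustrate why the missing boundary matters, here is a minimal sketch
of a simplified schoolbook rule (break before every consonant that is
directly followed by a vowel). It is an assumption for illustration only,
not the rule set used in omorfi or libvoikko, and without a boundary it
gives a different wrong result than the HFST output quoted above. The
names `hyphenate_part` and `hyphenate` and the `|` boundary marker are
hypothetical.

```python
VOWELS = set("aeiouyäö")


def hyphenate_part(part):
    """Hyphenate one compound part with a simplified CV rule.

    A break is placed before a consonant that is directly followed by a
    vowel, provided a vowel has already been seen in this part.
    """
    out = []
    seen_vowel = False
    for i, ch in enumerate(part):
        next_is_vowel = i + 1 < len(part) and part[i + 1] in VOWELS
        if ch not in VOWELS and next_is_vowel and seen_vowel:
            out.append("-")
        if ch in VOWELS:
            seen_vowel = True
        out.append(ch)
    return "".join(out)


def hyphenate(word, boundary="|"):
    """Hyphenate each compound part separately and join the parts with '-'."""
    return "-".join(hyphenate_part(p) for p in word.split(boundary))


print(hyphenate("päivällisaika"))   # päi-väl-li-sai-ka (no boundary, wrong)
print(hyphenate("päivällis|aika"))  # päi-väl-lis-ai-ka (boundary marked)
```

Once the boundary is marked, the rule never sees "s" and "a" as
neighbours, so the break falls at the compound boundary.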

> Another common hyphenation problem with HFST backend appears in words
> ending with "minen", such as "juokseminen". This seems to be caused
> by Omorfi analyzing the word as "juoksemin+en":
> 
>   [BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]
>   [CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
>   [LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]
> 
> I suppose rejecting compounds ending with "ei" would be the correct
> thing to do here.

I thought I had already fixed that bug in r184, but apparently I still
missed some deverbals. It might be fixed in r186.
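
For checking the output, a post-filter along these lines could flag the
bad analyses. This is only a sketch that assumes the analysis is a flat
string in the tag format quoted above; the actual fix belongs in the
lexicon, and the function names and the non-compound example analysis
are hypothetical.

```python
import re

COMPOUND = "[BOUNDARY=COMPOUND]"
LEMMA_RE = re.compile(r"\[LEMMA='([^']*)'\]")


def last_compound_lemma(analysis):
    """Return the lemma of the final compound part, or None for non-compounds."""
    parts = analysis.split(COMPOUND)
    if len(parts) < 2:
        return None
    match = LEMMA_RE.search(parts[-1])
    return match.group(1) if match else None


def reject_ei_compounds(analyses):
    """Drop analyses whose last compound part has the lemma 'ei'."""
    return [a for a in analyses if last_compound_lemma(a) != "ei"]


# The analysis of "juokseminen" quoted above, joined into one string.
bad = ("[BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]"
       "[CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]"
       "[LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]")
# Hypothetical non-compound analysis of plain "en"; it must be kept.
good = ("[BOUNDARY=ULTIMATE][LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1]"
        "[BOUNDARY=ULTIMATE]")

kept = reject_ei_compounds([bad, good])
```

Here `kept` contains only the non-compound analysis, so "en" as a word
of its own is not affected.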

Also, I do not remember why I turned comparison on for participles, but
there was some reason for it.

> Main problem with HFST suggestion code right now is that it is slow.
> For the test word above the time to produce the suggestions was 470
> ms with HFST and 18 ms with Malaga. My informal goal has been that
> suggestions should be produced within 100 ms on a modern low end
> laptop and the system used to run these tests had a 1.5 GHz CPU which
> is close to that standard. Presumably switching to optimized lookup
> will improve the performance so maybe it is too early to worry about
> this.

Yeah, if anything I was worried that the suggestion engine wasn't working
at all, since when generating all the possible suggestions within edit
distance 2 the delay on my Aspire One was more like a few seconds.
Anything below a second is still usable in interactive processing, so
while waiting for optimized lookup to be finalized I think this is
bearable.
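
To give a feel for why distance 2 is so much more expensive, here is a
rough sketch of brute-force candidate generation. It is the well-known
naive approach, not the way the HFST backend or libvoikko produces
suggestions, and the alphabet and function names are assumptions.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def edits1(word):
    """All strings within one delete, transpose, replace or insert of word.

    Replacing a letter with itself is allowed, so the word is included too.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable with two successive single edits."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


word = "koira"
print(len(edits1(word)), len(edits2(word)))
```

Every distance 1 candidate spawns a full set of its own, so the number
of strings to look up grows roughly quadratically, and each of them has
to go through the analyzer.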

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>


