[libvoikko] Small patch to HFST2 backend stuff

Harri Pitkänen hatapitk at iki.fi
Fri Apr 9 18:36:49 EEST 2010


On Thursday 08 April 2010 06:53:05 Flammie Pirinen wrote:
> Yes, in fact I used versions of factories where hfst backends were used
> unconditionally #if HAVE_HFST, so I excluded them from previous patch
> as that surely isn't wanted in the development version.

The factories are fixed now. A configuration file that enables all HFST
backends is at http://www.puimula.org/htp/testing/hfst/voikko-fi_FI.pro

In the future I think it would be OK for you to commit your changes directly 
to libvoikko SVN. Just let me know the SourceForge account to which the 
commit access should be given.

Some issues I noticed:

There are many errors in hyphenation that are caused by compound entries in 
Kotus-sanalista. For example

  <st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>

causes the HFST backend to produce the hyphenation "päi-väl-lisai-ka",
whereas the correct hyphenation (and the one that the Malaga backend
produces) is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista
completely from the hyphenation transducer and using only Joukahainen would
improve things. It would also help to take advantage of the morpheme border
hints that have been entered for many words in Joukahainen.
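
To illustrate how a morpheme border hint would repair the example above, here
is a minimal sketch. It assumes a hint is written as a marker character inside
the word; the "=" marker, the function names and the index-set representation
are all assumptions made for illustration, not the actual Joukahainen format
or the libvoikko API.

```python
def hint_points(hinted, marker="="):
    """Return indices (in the unmarked word) that directly follow a marker."""
    points = []
    pos = 0
    for ch in hinted:
        if ch == marker:
            points.append(pos)
        else:
            pos += 1
    return points


def render(word, points):
    """Insert '-' before every character whose index is a hyphenation point."""
    return "".join(("-" if i in points else "") + ch
                   for i, ch in enumerate(word))


word = "päivällisaika"
# Hyphenation points behind the HFST result quoted above.
hfst_points = {3, 6, 11}
# Hypothetical hint entry: a border between "päivällis" and "aika".
hint = "päivällis=aika"

merged = hfst_points | set(hint_points(hint))
print(render(word, hfst_points))  # päi-väl-lisai-ka
print(render(word, merged))       # päi-väl-lis-ai-ka
```

Forcing a point at the border is enough for this word. A real implementation
would more likely hyphenate each compound part separately.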

Another common hyphenation problem with the HFST backend appears in words
ending with "minen", such as "juokseminen". This seems to be caused by Omorfi
analyzing the word as "juoksemin+en":

  [BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]
  [CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
  [LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]

I suppose rejecting compounds whose last part is the negative verb "ei" (here
in its first person singular form "en") would be the correct thing to do here.
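
A rough sketch of such a filter, working directly on analysis strings in the
tag format quoted above. The function names and the idea of filtering after
analysis (rather than inside the transducer) are assumptions for
illustration; nothing here is part of Omorfi, HFST or libvoikko.

```python
import re

# Matches tags such as [LEMMA='ei'] or [POS=VERB].
TAG_RE = re.compile(r"\[([A-Z]+)=([^\]]*)\]")
COMPOUND = "[BOUNDARY=COMPOUND]"


def last_part_lemma(analysis):
    """Return the lemma of the last compound part, or None if it has none."""
    last_part = analysis.split(COMPOUND)[-1]
    lemma = None
    for key, value in TAG_RE.findall(last_part):
        if key == "LEMMA":
            lemma = value.strip("'")
    return lemma


def is_acceptable(analysis, banned_last_lemmas=("ei",)):
    """Reject compound analyses whose last part has a banned lemma."""
    if COMPOUND not in analysis:
        return True  # not a compound, so the rule does not apply
    return last_part_lemma(analysis) not in banned_last_lemmas


# The analysis of "juokseminen" quoted above, joined into one string.
analysis = (
    "[BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]"
    "[CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]"
    "[LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]"
)
print(is_acceptable(analysis))  # False
```

Because the check only applies to compounds, the word "en" on its own would
still be accepted.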

I did not test suggestion generation very heavily, but it seems to work quite
well. The random test word I used was "*laminaattiattia" (correct would have
been "laminaattilattia"). Both HFST and Malaga backends were able to suggest
the correct form, HFST as the first suggestion, Malaga as the second.

The main problem with the HFST suggestion code right now is that it is slow.
For the test word above, producing the suggestions took 470 ms with HFST and
18 ms with Malaga. My informal goal has been that suggestions should be
produced within 100 ms on a modern low-end laptop. The system used to run
these tests had a 1.5 GHz CPU, which is close to that standard. Presumably
switching to optimized lookup will improve the performance, so maybe it is too
early to worry about this.
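
For anyone who wants to repeat this kind of measurement, a minimal timing
sketch follows. The helper names are hypothetical, and fake_suggest is only a
stand-in that returns a fixed list; in a real test it would be replaced by a
call into whichever backend is being measured.

```python
import time

BUDGET_MS = 100.0  # the informal goal mentioned above


def time_suggestions(suggest, word, repeats=5):
    """Call suggest(word) several times; return (result, best time in ms)."""
    best_ms = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = suggest(word)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best_ms = min(best_ms, elapsed_ms)
    return result, best_ms


def within_budget(elapsed_ms, budget_ms=BUDGET_MS):
    """True if the measured time meets the budget."""
    return elapsed_ms <= budget_ms


def fake_suggest(word):
    """Stand-in for a real backend call (assumption, returns a fixed list)."""
    return ["laminaattilattia"]


suggestions, ms = time_suggestions(fake_suggest, "laminaattiattia")
print(suggestions, within_budget(ms))

# The two figures reported above, checked against the budget.
print(within_budget(470.0), within_budget(18.0))  # False True
```

Taking the best of several runs reduces noise from caching and scheduling,
but the first (cold) call may be the more relevant number for users.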

Overall, the HFST backend appears quite promising. I do think it should be
possible to switch to Omorfi/HFST as the primary backend for Finnish at some
point in the future. I would really like to see that happen, since the Malaga
development tools are no longer actively maintained and nobody else seems to
be using Malaga anymore.

Harri


