[libvoikko] Small patch to HFST2 backend stuff
Harri Pitkänen
hatapitk at iki.fi
Fri Apr 9 18:36:49 EEST 2010
On Thursday 08 April 2010 06:53:05 Flammie Pirinen wrote:
> Yes, in fact I used versions of factories where hfst backends were used
> unconditionally #if HAVE_HFST, so I excluded them from previous patch
> as that surely isn't wanted in the development version.
The factories are fixed now. A configuration file that enables all HFST
backends is at http://www.puimula.org/htp/testing/hfst/voikko-fi_FI.pro
In the future I think it would be OK for you to commit your changes directly
to libvoikko SVN. Just let me know the SourceForge account to which the
commit access should be given.
Some issues I noticed:
There are many errors in hyphenation that are caused by compound entries in
Kotus-sanalista. For example
<st><s>päivällisaika</s><t><tn>9</tn><av>D</av></t></st>
causes the HFST backend to produce the hyphenation "päi-väl-lisai-ka" when the
correct hyphenation (and the one that the Malaga backend produces)
is "päi-väl-lis-ai-ka". I wonder if dropping Kotus-sanalista completely from
the hyphenation transducer and using only Joukahainen would improve things? It
would also help to take advantage of the morpheme border hints that have been
entered for many words in Joukahainen.
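To illustrate why the morpheme border hints would help, here is a minimal
sketch in Python. It is not code from libvoikko or HFST. It assumes a
hypothetical hint notation where "=" marks a morpheme border
(e.g. "päivällis=aika") and simply forces a hyphenation point at every such
border; the function name and the notation are made up for this example.

```python
def force_boundary_hyphens(hyphenated: str, hinted: str, hint_char: str = "=") -> str:
    """Add a hyphen at every morpheme border marked in `hinted`.

    `hyphenated` is a backend's output such as "päi-väl-lisai-ka".
    `hinted` is the same word with borders marked, using a hypothetical
    notation such as "päivällis=aika".
    """
    plain = hyphenated.replace("-", "")
    if plain != hinted.replace(hint_char, ""):
        raise ValueError("hyphenated form and hinted form spell different words")

    def offsets(text: str, marker: str) -> set:
        # Collect positions (counted in letters of the plain word)
        # at which `marker` occurs in `text`.
        found = set()
        position = 0
        for ch in text:
            if ch == marker:
                found.add(position)
            else:
                position += 1
        return found

    # Union of the existing hyphenation points and the hinted borders.
    breaks = offsets(hyphenated, "-") | offsets(hinted, hint_char)

    pieces = []
    for index, ch in enumerate(plain):
        if index in breaks and index > 0:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)
```

With the example above, the faulty output "päi-väl-lisai-ka" combined with the
hint "päivällis=aika" gives "päi-väl-lis-ai-ka". A real fix would of course
belong in the transducer itself, not in post-processing.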
Another common hyphenation problem with the HFST backend appears in words
ending with "minen", such as "juokseminen". This seems to be caused by Omorfi
analyzing the word as "juoksemin+en":
[BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]
[CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]
I suppose rejecting compounds ending with "ei" would be the correct thing to
do here.
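A rough sketch of such a filter, in Python, operating on analysis strings in
the tag format shown above. This is only an illustration and not how Omorfi or
libvoikko handle analyses internally; the function names are hypothetical, and
the second and third sample analyses are made up for the example.

```python
import re

# Matches a lemma tag such as [LEMMA='ei'] and captures the lemma.
LEMMA_RE = re.compile(r"\[LEMMA='([^']*)'\]")
COMPOUND_TAG = "[BOUNDARY=COMPOUND]"


def last_compound_lemma(analysis):
    """Return the lemma of the final compound part, or None if not a compound."""
    parts = analysis.split(COMPOUND_TAG)
    if len(parts) < 2:
        return None  # no compound boundary in this analysis
    match = LEMMA_RE.search(parts[-1])
    return match.group(1) if match else None


def reject_ei_compounds(analyses):
    """Drop analyses that are compounds whose last part has the lemma 'ei'."""
    return [a for a in analyses if last_compound_lemma(a) != "ei"]


# The analysis quoted above, joined into one string.
JUOKSEMIN_EN = (
    "[BOUNDARY=ULTIMATE][LEMMA='juosta'][POS=VERB][KTN=70][GEN=ACT][PCP=MA]"
    "[CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]"
    "[LEMMA='ei'][POS=VERB][GEN=ACT][PRS=SG1][BOUNDARY=ULTIMATE]"
)
# Hypothetical analyses, invented only to show what the filter keeps.
PLAIN_NOUN = "[BOUNDARY=ULTIMATE][LEMMA='juokseminen'][POS=NOUN][BOUNDARY=ULTIMATE]"
PLAIN_EI = "[BOUNDARY=ULTIMATE][LEMMA='ei'][POS=VERB][BOUNDARY=ULTIMATE]"
```

Calling reject_ei_compounds([JUOKSEMIN_EN, PLAIN_NOUN, PLAIN_EI]) drops only
the first analysis. The standalone "ei" is kept because it has no compound
boundary.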
I did not test suggestion generation very heavily, but it seems to work quite
well. The random test word I used was "*laminaattiattia" (the correct form
would have been "laminaattilattia"). Both the HFST and Malaga backends were
able to suggest the correct form, HFST as the first suggestion and Malaga as
the second.
The main problem with the HFST suggestion code right now is that it is slow.
For the test word above, the time to produce the suggestions was 470 ms with
HFST and 18 ms with Malaga. My informal goal has been that suggestions should
be produced within 100 ms on a modern low-end laptop. The system used to run
these tests had a 1.5 GHz CPU, which is close to that standard. Presumably
switching to optimized lookup will improve the performance, so maybe it is too
early to worry about this.
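For anyone who wants to repeat this kind of measurement, a minimal timing
sketch in Python follows. The `suggest` argument is a placeholder for whatever
callable produces suggestions in your setup; it is an assumption of this
example and not a libvoikko API. The 100 ms budget is the informal goal
mentioned above.

```python
import time

BUDGET_MS = 100.0  # informal goal for producing suggestions


def time_suggestions(suggest, word, repeats=5):
    """Return the best wall-clock time in milliseconds over `repeats` calls.

    `suggest` is any callable taking a word (placeholder for a real backend).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        suggest(word)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def over_budget_factor(elapsed_ms, budget_ms=BUDGET_MS):
    """Return how many times the budget a measurement is (above 1.0 = too slow)."""
    return elapsed_ms / budget_ms
```

Applied to the figures above, 470 ms is 4.7 times the budget and 18 ms is 0.18
times the budget, so HFST was roughly 26 times slower than Malaga for this
word.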
Overall, the HFST backend appears quite promising. I do think it should be
possible to switch to Omorfi/HFST as the primary backend for Finnish at some
point in the future. I would really like to see that happen, since the Malaga
development tools are no longer actively maintained and nobody else seems to
be using them anymore.
Harri