[libvoikko] Introducing VFSTs (Varissuo Finite State Transducers)

Wed Feb 29 22:14:53 EET 2012

There is a new experimental morphology backend available in libvoikko. The
primary purpose of this backend is to help us port the Malaga based
morphology for Finnish into finite state world while maintaining the
ability to operate in environments where memory is tight and use of swap
space is expensive.

There are only two tools available for VFST transducers. "voikkovfstc"
builds the transducer file from unweighted AT&T representation. The lookup
code is build into libvoikko. I have no plans to expand VFST into anything
more generic. You will still need to use HFST or other tools to create the
transducers and convert them into AT&T format. VFST format has significant
restrictions that make it unsuitable for some uses where ZHFST can be
used:

 - There is no support for weighted transducers.
 - Maximum file size for the transducers is about 128 megabytes.

== So what can it do? ==

VFST can handle the morphological transducer from Omorfi version 20110505.
I have compared the analysis output for a list of about 80000 Finnish
words and the results from voikkospell(VFST) and hfst-optimized-lookup are
almost identical. VFST produces about 30 extra analysis results. Most of
these result from the fact that VFST returns all successful paths from the
transducer even if the output symbols for two paths were identical. Few
differences seem to be actual bugs that still need to be investigated.
Performance measurements from the test run produce the following results:

                       VFST   hfst-optimized-lookup
Running time           9.8s   80.4s
Memory (total VM)      35Mb   183Mb
Memory (running set)   6.7Mb  143Mb
Memory (non-shareable) 0.5Mb  151Mb
File size              8.7Mb  30Mb

A thing to note is that my version of hfst-optimized-lookup is six months
old. I will try updating it soon and post new results if they differ from
the above. The numbers seem almost plausible for memory use but I was not
expecting to see similar difference in running time. The VFST lookup
implementation has not been optimized in any significant ways. VFST does
linear lookups where binary search would be possible etc. and that's why
I'm very surprised to see that hfst-optimized-lookup is slower. So don't
take these numbers as final yet.

== How to try it out? ==

No external dependencies are needed to enable VFST support in libvoikko.
Configuring with --enable-vfst is enough. In fact "voikkovfstc" and VFST
lookup functionality is currently built even if the switch is not present.
The switch only enables the backend implementation. The idea behind this
was to allow other parts of the code (such as grammar checker autocorrect
functionality) to take advantage of VFST transducers. I'm not sure if this
is really needed though so maybe I will make those parts conditional as
well.

HFST transducer can be converted to VFST format with the following command:

  hfst-fst2txt -D morphology.omor.hfst | voikkovfstc

There is no switch for output file name yet. This will create
"transducer.vfst" in the current working directory. To use it with
libvoikko rename it to "mor.vfst" and place it in
~/.voikko/2/mor-something. As usual, you will need configuration file
voikko-fi_FI.pro with

  info: Voikko-Dictionary-Format: 2
  info: Language-Code: fi_FI
  info: Language-Variant: omorfi
  info: Description: Omorfi-pohjainen VFST-morfologia
  info: Morphology-Backend: vfst

Testing lookup:

  $ echo kissa | voikkospell -M -d fi-x-omorfi
  A(kissa):1:CLASS=none
  A(kissa):1:FSTOUTPUT=[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM][CASECHANGE=NONE]
  A(kissa):1:SIJAMUOTO=none
  A(kissa):1:STRUCTURE==ppppp

== Things that don't work yet but will be fixed ==

This code does some low level stuff and is only known to work on little
endian Linux systems.

- Windows does not work because different code is needed there for mapping
files to memory. This just needs to be implemented, won't be a problem
since Malaga does that already.

- Big endian architectures may work but transducer files are not
compatible between little endian and big endian systems. This is again
similar to Malaga. I will write some code to do byte swapping on-the-load
so that we can use little endian transducers on all systems. Benefit of
sharing memory between processes is then lost on big endian architectures
but since those are quite rare these days I don't think anyone will care.
Having platform independent transducers is more important in my opinion
(hopefully distro packages feel the same way).

- The code may already work on Intel OS X. Please let me know if you try
it and have problems.

- Lookup code does not verify the consistency of transducer image on load.
Thus you should (as with Malaga) treat VFST transducers as code when
evaluating the security of the system. It is possible to verify the
transducer without impacting lookup performance. This will add some delay
to initialization and causes the whole transducer to be paged into memory
so I'm not sure if it is worth doing it by default.

- voikkovfstc has some requirements on the order of entries in AT&T input.
Hfst-fst2txt seems to satisfy those requirements but of course I need to
fix voikkovfstc to work with unordered input as well.

- Results from morphological analysis are not propagated to Malaga style
analysis attributes yet. Thus spelling, hyphenation and grammar checking
will not work correctly yet (but may work up to some degree).

== Other observations ==

- hfst-fst2txt seems to produce different results when run on .hfst and
.hfstol files. I would expect that AT&T representations of
morphology.omor.hfstol and morphology.omor.hfst were the same but that
does not seem to be the case. Or perhaps it is just the order of entries
that causes problems for voikkovfstc. Anyway, I had to convert from .hfst
to get a working VFST transducer, conversion from .hfstol resulted in a
transducer that accepted no input. I'll return to this once voikkovfstc is
fixed.

- Originally "V" in "VFST" referred to Voikko. Then I realized that it is
customary to name transducers formats after a place and I also don't want
that VFST would become "the transducer format" for libvoikko. The format
was not developed in a place starting with V but Varissuo (a suburb in
Turku) is near enough.

== Downloads for convenience ==

If you want to try VFST out quickly, you can dowload a transducer and
configuration file from here:

  http://www.puimula.org/htp/testing/vfst/

This transducer has been converted from Omorfi 20110505
(morphology.omor.hfst). Using this you won't need any HFST tools, just
compiling libvoikko with --enable-vfst should be enough.

Harri