[libvoikko] The future of zhfst [Was: hfst-ospell and Windows]

Sjur Moshagen sjurnm at mac.com
Wed Nov 20 15:16:55 EET 2013

On 4 July 2013 at 08:12, Flammie Pirinen <flammie at iki.fi> wrote:

> 2013-06-30, Harri Pitkänen wrote:
>> Then I found TinyXML2: http://www.grinninglizard.com/tinyxml2/
>> It might be used as a replacement for libxml++ and would kill the
>> need for all its dependencies, reducing the above list of
>> dependencies from 8 to only 3 libraries.
> Seems like a reasonable option. I made a configure switch for it and
> started to hack, but ran into a dead end: the tinyxml2 documentation
> isn't very good, and I don't have time right now to dig solutions out
> of the examples or the source code. If I understand the system
> correctly, it should be mostly copy-paste-and-replace from the libxml++
> version, but still a fair bit of work. I've left libxml++ as the
> default for now. I suppose libxml would be another option; it looks
> absolutely horrible in C++ but is so widely used that it should be
> relatively easy.

Tommi has now fully implemented reading of the XML files with tinyxml2. But we still have a long chain of dependencies because of libarchive (at least on the Mac):

$ otool -L /opt/local/lib/libarchive.13.dylib 
	/opt/local/lib/libarchive.13.dylib (compatibility version 15.0.0, current version 15.2.0)
	/opt/local/lib/libnettle.4.dylib (compatibility version 4.0.0, current version 4.5.0)
	/opt/local/lib/liblzo2.2.dylib (compatibility version 3.0.0, current version 3.0.0)
	/opt/local/lib/liblzma.5.dylib (compatibility version 6.0.0, current version 6.5.0)
	/opt/local/lib/libcharset.1.dylib (compatibility version 2.0.0, current version 2.0.0)
	/opt/local/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.6)
	/opt/local/lib/libxml2.2.dylib (compatibility version 12.0.0, current version 12.1.0)
	/opt/local/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.8)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
	/opt/local/lib/libiconv.2.dylib (compatibility version 8.0.0, current version 8.1.0)

The first line is libarchive's own install name and libSystem.B.dylib can be ignored, but the remaining eight libraries are real dependencies.

We have been discussing on IRC (#hfst) other ways to reduce the dependencies while keeping the one-file-to-download feature. The discussion is by no means settled, but here are a couple of options:

1. Go back to the voikko/2/ format (or something similar)

* one or more transducer files
* metadata in a text file, possibly in a more .ini-like format

Pros:
* simple, already supported
* easy metadata lookup

Cons:
* multiple files to download
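
To make the "easy metadata lookup" point concrete, here is a minimal sketch of reading an .ini-like metadata file with nothing but the C++ standard library. The function names, the "section.key" naming and the "key = value" syntax are assumptions for illustration only; this is not the actual voikko/2 layout.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strip ASCII whitespace from both ends of a string.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical parser for an .ini-like metadata file (not the real
// voikko/2 format). Lines look like "key = value"; "[section]" lines
// prefix the following keys as "section.key"; lines starting with
// '#' or ';' are comments. Malformed lines are skipped silently.
std::map<std::string, std::string> parse_metadata(std::istream& in) {
    std::map<std::string, std::string> meta;
    std::string line, section;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#' || line[0] == ';') continue;
        if (line.front() == '[' && line.back() == ']') {
            section = trim(line.substr(1, line.size() - 2));
            continue;
        }
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        if (key.empty()) continue;
        meta[section.empty() ? key : section + "." + key] =
            trim(line.substr(eq + 1));
    }
    return meta;
}

// Convenience overload for metadata that is already in memory.
std::map<std::string, std::string> parse_metadata_string(const std::string& text) {
    std::istringstream in(text);
    return parse_metadata(in);
}
```

With an input such as "[info]" followed by "locale = se", the value would be found under the key "info.locale". A parser like this needs no XML or archive library at all, which is the attraction of this option.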

2. Use one single fst archive

The hfst file format can pack several transducers into one file. With standardised transducer names, we could store both the acceptor and the error model in a single hfst file. We could even store the metadata as a transducer of key:value pairs, so that looking up a key returns its value.

Pros:
* just one file to download
* no dependencies beyond hfst-ospell

Cons:
* reading the metadata requires the transducers to be read from disk, which could be slow
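
The key:value idea can be sketched with a toy structure: the key is consumed symbol by symbol along a path of arcs, and the state reached at the end carries the value as output. The class below is only a stand-in written for this illustration (the name MetadataFst and its methods are hypothetical); it does not use the hfst or hfst-ospell API and says nothing about how the real file format would store such a transducer.

```cpp
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a "metadata transducer" (hypothetical, not the
// hfst API): each key is a path of input symbols from the start
// state, and the final state at the end of the path holds the value.
class MetadataFst {
public:
    // Add a key:value pair, creating states and arcs as needed.
    void add(const std::string& key, const std::string& value) {
        Node* n = &root_;
        for (char c : key) {
            std::unique_ptr<Node>& next = n->arcs[c];
            if (!next) next.reset(new Node());
            n = next.get();
        }
        n->is_final = true;
        n->output = value;
    }

    // Follow the key through the arcs. Returns true and fills
    // `value` only if the whole key is consumed in a final state.
    bool lookup(const std::string& key, std::string& value) const {
        const Node* n = &root_;
        for (char c : key) {
            const auto it = n->arcs.find(c);
            if (it == n->arcs.end()) return false;
            n = it->second.get();
        }
        if (!n->is_final) return false;
        value = n->output;
        return true;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> arcs;
        bool is_final = false;
        std::string output;
    };
    Node root_;
};
```

After add("locale", "se"), a lookup of "locale" yields "se", while a lookup of the prefix "loc" fails because that state is not final. The lookup itself is cheap; the cost noted above comes from having to load the transducer from disk before any key can be queried.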

In any case, it is quite important to get this moving forward. The long list of dependencies has in practice become a showstopper for getting the hfst-based spellers fully supported in libvoikko and the LO voikko oxt, and there is a growing list of users and developers of the Divvun/Giellatekno languages who are just waiting to start using such spellers for their languages.

Feedback would be very welcome.

