[libvoikko] HFST speller lexicon spec - RC1

Sjur Moshagen sjurnm at mac.com
Sun Jul 3 22:45:49 EEST 2011

On 3 Jul 2011, at 21:01, Harri Pitkänen wrote:

> On Sunday 03 July 2011, Sjur Moshagen wrote:
>> I also added an option to support encrypted zip archives, in addition to
>> unencrypted ones. All implementations *must* support unencrypted zip
>> files, but they can also support encrypted ones if so desired. The
>> motivation is to make it possible to support both OSS-based transducers as
>> well as closed-source transducers in one and the same implementation. This
>> will e.g. make it easier for the Sámi languages to be added in a
>> distribution of otherwise closed-source and commercial tools.
> I think that the encryption method used for zip archives is not very useful 
> for protecting any trade secrets since the same shared key must be known by 
> the providers of the transducer and shipped to the end user. And even without 
> encryption it is quite difficult to reverse engineer the source code from an 
> optimized transducer. So I don't see the benefit, but if this is what some
> providers want, I'm fine with it.

No, no one has explicitly required this; it is just my naïve idea of what could be useful :)

I have reformulated that section to say that any required encryption is applied directly to the transducers, and that implementors who need such encryption should add it to their own HFST runtimes/engines/libraries. Supporting unencrypted transducers is REQUIRED.
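
As a rough illustration only: assuming archives are meant to be plain zip files after this change, a reader could check that no member has the zip "encrypted" flag set before trying to load it. The function name is hypothetical and this is a sketch, not part of the spec or of any HFST library:

```python
import zipfile

def encrypted_members(archive_path):
    """Return the names of archive members whose zip 'encrypted' flag is set."""
    with zipfile.ZipFile(archive_path) as archive:
        # Bit 0 of the general purpose flags marks an encrypted member.
        return [info.filename for info in archive.infolist()
                if info.flag_bits & 0x1]
```

An empty list means every member can be read without a password; what to do with a non-empty list is left to the implementation.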

The updated spec is released as RC2 - the links are as in my previous mail, but I'll repeat here for Cc readers:

https://victorio.uit.no/langtech/trunk/plan/proof/doc/lexfile-spec.xml (immediately updated)
http://divvun.no/no/future/proofing/lexfile-spec.html (updated after a couple of hours)

>>> And then we need to figure out
>>> how to set the default speller for a language when there are more than
>>> one available, and how this default can be changed by the user in case
>>> there is no application level support for selecting the variant.
>> What about reserving one filename for this use? There can be only one file
>> in the required location with this name, which would then be picked in
>> case of conflicting zip file content. The solution to the end user if
>> there is no other way of specifying which variant to use is then to just
>> rename the preferred speller file to this exact name. Manual renaming is
>> error prone, but I see no better solution at the moment.
> This is more or less how Voikko handles this in the current dictionary 
> format and seems sufficient to me.
> In fact Voikko has an additional method for setting the default variant 
> through symbolic links (or registry keys on Windows) which does not require 
> renaming the original dictionary and can therefore be used in situations where 
> the user has no right to modify the original file. Since the HFST speller format 
> will quite likely be a subformat of "Voikko dictionary format version 3" all 
> mechanisms provided by Voikko will also be available for HFST dictionaries 
> when libvoikko is used.


This point is still needed in the spec, since the spec shouldn't assume libvoikko, only HFST.
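
A minimal sketch of the reserved-filename idea, under stated assumptions: the reserved name `default.zhfst` and the `.zhfst` extension are placeholders I made up for illustration (the spec defines the real ones), and `pick_speller` is a hypothetical helper, not an HFST or libvoikko function:

```python
from pathlib import Path

# Hypothetical reserved name; the spec, not this sketch, defines the real one.
DEFAULT_NAME = "default.zhfst"

def pick_speller(directory, default_name=DEFAULT_NAME):
    """Pick the speller archive to load from `directory`.

    Returns the reserved-name file if it exists, the only archive if there
    is exactly one, and None if the choice is ambiguous or nothing is found.
    """
    directory = Path(directory)
    reserved = directory / default_name
    # is_file() follows symbolic links, so a link with the reserved name works.
    if reserved.is_file():
        return reserved
    candidates = sorted(directory.glob("*.zhfst"))
    if len(candidates) == 1:
        return candidates[0]
    return None
```

Because the check follows symbolic links, a user who cannot rename the original file could in principle point a link with the reserved name at the preferred variant, similar to the Voikko mechanism mentioned above.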

>> I also received feedback from other implementors, and at least one
>> commercial entity has backed this specification.
> Seems good to me. I'm willing to back this specification too but before 
> implementing it I would like to wait until there is at least one compliant 
> open source speller transducer available for any language that has been 
> demonstrated to perform better than other open source spellers for that 
> language. If you have compared Northern Sámi HFST speller with Hunspell and 
> have some numbers (speed, memory use, false negatives, false positives etc.) 
> I'd be very interested in seeing those.

Unfortunately that part has been lagging too, but I'll try to do some preliminary work before I go on vacation.

I would suggest the following procedure going forward:

* I start preparing such testing using our North Sámi Hunspell and HFST lexicons
* the HFST group adds support for this zipped transducer format in hfst-ospell (both lib and frontend)
* as soon as hfst-ospell has been updated to support the zhfst format, I'll run some tests and report back

Does this seem reasonable? What does the HFST gang think? I will need some help to test memory consumption.
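
To make the comparison concrete, the error counts and timing could be collected with something like the sketch below. It assumes each speller can be wrapped as a callable `accepts(word)` returning True or False; that wrapper, the function name and the result keys are all hypothetical, and memory use is not measured here:

```python
import time

def evaluate(accepts, correct_words, misspelled_words):
    """Score a speller given as a callable `accepts(word) -> bool`."""
    start = time.perf_counter()
    # Correct words that the speller rejects.
    wrongly_rejected = [w for w in correct_words if not accepts(w)]
    # Misspellings that the speller lets through.
    wrongly_accepted = [w for w in misspelled_words if accepts(w)]
    elapsed = time.perf_counter() - start
    return {
        "wrongly_rejected": len(wrongly_rejected),
        "wrongly_accepted": len(wrongly_accepted),
        "seconds": elapsed,
    }
```

Which of the two counts is called "false positives" and which "false negatives" depends on the convention chosen, so the sketch uses neutral names. Running it with the same word lists against both the Hunspell and the HFST speller would give directly comparable numbers.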

Thanks a lot for your feedback.

