[libvoikko] HFST speller lexicon spec - RC1

Harri Pitkänen hatapitk at iki.fi
Sun Jul 3 22:01:01 EEST 2011


On Sunday 03 July 2011, Sjur Moshagen wrote:
> I also added an option to support encrypted zip archives, in addition to
> unencrypted ones. All implementations *must* support unencrypted zip
> files, but they can also support encrypted ones if so desired. The
> motivation is to make it possible to support both OSS-based transducers as
> well as closed-source transducers in one and the same implementation. This
> will e.g. make it easier for the Sámi languages to be added in a
> distribution of otherwise closed-source and commercial tools.

I think that the encryption method used for zip archives is not very useful 
for protecting any trade secrets since the same shared key must be known by 
the providers of the transducer and shipped to the end user. And even without 
encryption it is quite difficult to reverse engineer the source code from an 
optimized transducer. So I don't see the benefit, but if this is what some 
providers want, I'm fine with it.

> Two more small changes: I changed the filename part that identifies the zip
> file as a speller file from "-spl" to "-speller" - I see no reason to be
> too cryptic, and the filename length is not increased a lot (four more
> characters).

Ok.

> I also renamed the @version to @dtdversion, where the dtd version is
> identical to the version of this specification.

Ok.

> > We might of course want to support using HFST based and other tools for
> > some languages, like for example using an acceptor implemented with HFST
> > and spelling suggestion error model or hyphenator coded directly in C.
> > Such things would require an extra configuration file, something that
> > would replace voikko-fi_FI.pro in current v2 configuration.
> 
> I think we can leave this out for now, and rather add such a feature to an
> update of the specification.

Yes.

> > And then we need to figure out
> > how to set the default speller for a language when there are more than
> > one available, and how this default can be changed by the user in case
> > there is no application level support for selecting the variant.
> 
> What about reserving one filename for this use? There can be only one file
> in the required location with this name, which would then be picked in
> case of conflicting zip file content. The solution to the end user if
> there is no other way of specifying which variant to use is then to just
> rename the preferred speller file to this exact name. Manual renaming is
> error prone, but I see no better solution at the moment.

This is more or less how Voikko handles this in the current dictionary 
format, and it seems sufficient to me.

In fact Voikko has an additional method for setting the default variant 
through symbolic links (or registry keys on Windows) which does not require 
renaming the original dictionary and can therefore be used in situations where 
the user has no right to modify the original file. Since the HFST speller format 
will quite likely be a subformat of "Voikko dictionary format version 3" all 
mechanisms provided by Voikko will also be available for HFST dictionaries 
when libvoikko is used.

> The proposed default name would be:
> 
> LOCALE-default-speller.zhfst
> 
> where LOCALE is restricted to specifying the language only. Example:
> 
> fi-default-speller.zhfst
> 
> would take precedence over other speller files for Finnish if they all
> contain a default accepting transducer.
>
> I have added this to the specification.

Ok.
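
To make the precedence rule concrete, here is a minimal sketch of how an 
implementation might pick a speller file from a directory listing. It is not 
libvoikko code. The function name pick_speller, the fallback to the 
alphabetically first remaining file, and the way a file is matched to a 
language (name starts with the language code followed by "-" or "_") are all 
assumptions made for illustration. It also ignores the condition that the 
competing archives all contain a default accepting transducer, since checking 
that would require opening the archives.

```python
from typing import Iterable, Optional

DEFAULT_SUFFIX = "-default-speller.zhfst"
SPELLER_SUFFIX = "-speller.zhfst"


def pick_speller(filenames: Iterable[str], language: str) -> Optional[str]:
    """Return the speller file to use for `language`, or None if there is none.

    `language` is a bare language code such as "fi" (no country part),
    matching the restriction on LOCALE in the reserved default name.
    """
    names = sorted(filenames)

    # The reserved name takes precedence over every other speller file.
    default_name = language + DEFAULT_SUFFIX
    if default_name in names:
        return default_name

    # Assumption: other speller files for the language start with the
    # language code followed by "-" or "_", e.g. "fi_FI-speller.zhfst".
    candidates = [
        name
        for name in names
        if name.endswith(SPELLER_SUFFIX)
        and (name.startswith(language + "-") or name.startswith(language + "_"))
    ]

    # Assumption: without a reserved default, take the first name in
    # alphabetical order so that the choice is at least deterministic.
    return candidates[0] if candidates else None
```

With a listing that contains both fi-default-speller.zhfst and another Finnish 
speller file, the reserved name is returned regardless of ordering. Renaming a 
different file to the reserved name is then the only step a user has to take 
to change the default.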

> I also received feedback from other implementors, such that at least one
> commercial entity has backed this specification.

Seems good to me. I'm willing to back this specification too but before 
implementing it I would like to wait until there is at least one compliant 
open source speller transducer available for any language that has been 
demonstrated to perform better than other open source spellers for that 
language. If you have compared the Northern Sámi HFST speller with Hunspell and 
have some numbers (speed, memory use, false negatives, false positives etc.), 
I'd be very interested in seeing those.

Harri


