[libvoikko] HFST speller lexicon spec - RC1

Sjur Moshagen sjurnm at mac.com
Sun Jul 3 17:28:42 EEST 2011


Hello,

I am sorry it has taken me this long to follow up on the discussion. Based on the last posts before this long break, I have now made available what I consider release candidate 1. The changes compared to version 0.2 are detailed below:

On 7 Nov 2010, at 09:25, Harri Pitkänen wrote:

> On Saturday 06 November 2010, Flammie Pirinen wrote:
>> Ideally there would be some small library
>> available on all systems for this use as to not get any more
>> dependencies for hfst-ospell.
> 
> Would http://zziplib.sourceforge.net/ work? I did not look at it closer, I 
> just found it with "apt-cache search unzip" but it seems to handle the 
> necessary stuff and is under LGPL.

I added this link as a possible implementation to use (but implementors are of course free to choose their own zip library, as long as the required features are supported).

I also added an option to support encrypted zip archives, in addition to unencrypted ones. All implementations *must* support unencrypted zip files, but they may also support encrypted ones if so desired. The motivation is to make it possible to support both OSS-based and closed-source transducers in one and the same implementation. This will, for example, make it easier to add the Sámi languages to a distribution of otherwise closed-source and commercial tools.
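
To illustrate the minimal requirement, here is a rough sketch in Python of an implementation that accepts unencrypted archives and refuses encrypted ones. It is only an illustration, not part of the specification: the function names and the member name "example.txt" are made up for this example, and it relies on bit 0 of each zip member's general purpose bit flag marking encryption.

```python
import io
import zipfile


def archive_is_encrypted(archive):
    """Return True if any member of the zip archive has the encryption flag set."""
    with zipfile.ZipFile(archive) as zf:
        # Bit 0 of the general purpose bit flag marks an encrypted member.
        return any(info.flag_bits & 0x1 for info in zf.infolist())


def open_speller_archive(archive):
    """Open an unencrypted archive; refuse encrypted ones (hypothetical policy)."""
    if archive_is_encrypted(archive):
        raise NotImplementedError("encrypted speller archives are not supported")
    return zipfile.ZipFile(archive)


# Usage: build a small unencrypted archive in memory and open it.
# "example.txt" is a placeholder, not a file name required by the specification.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("example.txt", "placeholder content")

speller_archive = open_speller_archive(buffer)
```

An implementation that does support encrypted archives would replace the refusal with its own decryption step.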

Two more small changes: I changed the filename part that identifies the zip file as a speller file from "-spl" to "-speller" - I see no reason to be too cryptic, and the filename length is not increased a lot (four more characters).

I also renamed the @version to @dtdversion, where the dtd version is identical to the version of this specification.

>> By the way, on voikko part of the world, can I expect that spellers can
>> be tossed to $voikkodir/3/*.zhfst?  
> 
> Yes, I think so. I have not yet had time to think about this but ideally we 
> should make this as simple as possible so that no extra configuration would be 
> needed.
> 
> We might of course want to support using HFST based and other tools for some 
> languages, like for example using an acceptor implemented with HFST and 
> spelling suggestion error model or hyphenator coded directly in C. Such things 
> would require an extra configuration file, something that would replace 
> voikko-fi_FI.pro in current v2 configuration.

I think we can leave this out for now, and rather add such a feature to an update of the specification.

> And then we need to figure out 
> how to set the default speller for a language when there are more than one 
> available, and how this default can be changed by the user in case there is no 
> application level support for selecting the variant.

What about reserving one filename for this use? There can be only one file with this name in the required location, and it would then be picked in case of conflicting zip file content. If there is no other way of specifying which variant to use, the solution for the end user is simply to rename the preferred speller file to this exact name. Manual renaming is error-prone, but I see no better solution at the moment.

The proposed default name would be:

LOCALE-default-speller.zhfst

where LOCALE is restricted to specifying the language only. Example:

fi-default-speller.zhfst

would take precedence over other speller files for Finnish if they all contain a default accepting transducer.

I have added this to the specification.
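
As a rough sketch of how an implementation might apply this rule (in Python, purely as an illustration): the function name, the assumption that every speller file for a language starts with "LANGUAGE-" and ends in ".zhfst", and the fallback to the first name in sorted order are my own assumptions here, not something the specification defines.

```python
def pick_speller(filenames, locale):
    """Return the speller file to use for `locale`, or None if there is none."""
    # LOCALE in the default name is restricted to the language only,
    # so "fi_FI" and "fi-FI" both reduce to "fi".
    language = locale.replace("_", "-").split("-")[0].lower()
    default_name = f"{language}-default-speller.zhfst"

    # Assumption: speller files for a language start with "<language>-"
    # and end in ".zhfst".
    candidates = sorted(
        name for name in filenames
        if name.startswith(language + "-") and name.endswith(".zhfst")
    )

    # The reserved default name takes precedence over all other candidates.
    if default_name in candidates:
        return default_name

    # Assumption: without a default file, fall back to the first name
    # in sorted order.
    return candidates[0] if candidates else None
```

With the files "fi-x-standard-speller.zhfst" and "fi-default-speller.zhfst" both present (the first name is invented for the example), a request for "fi_FI" would select "fi-default-speller.zhfst".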

> But I believe none of 
> these requirements prevent us from handling the simplest use case the way you 
> suggested.

Good:)

> I will publish the first release candidate libvoikko 3.1 very soon, maybe 
> tomorrow. Then it would be possible to start coding the v3 configuration and 
> really add support for all these things. Hopefully we can then release 
> libvoikko 3.2 with non-experimental support for HFST based spellers.

I also received feedback from other implementors, and at least one commercial entity has backed this specification.

With the above modifications I announce the specification as RC1. It is immediately available at (with rough formatting):

https://victorio.uit.no/langtech/trunk/plan/proof/doc/lexfile-spec.xml

and after a couple of hours with better formatting at:

http://divvun.no/no/future/proofing/lexfile-spec.html

I would like to publish the final 1.0 version of the spec before I go on vacation, which means that I would like to receive feedback within the next couple of days, for a final release towards the end of this week.

On the other hand, having waited this long, I probably could wait till after the summer:)

Best regards,
Sjur



