[libvoikko] Aligning development of hfst-based proofing tools

Sjur Moshagen sjurnm at mac.com
Fri Sep 24 10:31:21 EEST 2010


Thanks to you both for the valuable feedback.

On 21 Sep 2010, at 01:12, Flammie Pirinen wrote:

> 2010-09-20, Harri Pitkänen wrote:
> 
>> - File name conventions: If it is expected that hyphenation lexicon
>> format is specified separately, should we have an explicit suffix in
>> the speller file name that makes it clear that this is a speller
>> archive? For example LOCALE-spl.zhfst. Or would the future
>> hyphenation lexicon be embedded within this same archive? 
> 
> In general I'd say we can leave this at the level of saying that any
> unrecognised names should be considered by implementations as
> unsupported features? This allows indeed arbitrary future extensions.
> If we want to reserve space for future file names, we could say that
> -*.hfst is the file format and * defines file type, for example,  but I
> don't know if we need to specify that yet.

Whether hyphenation should live in the same archive or in a separate one is an open question. Some implementations use one and the same lexicon to encode both the accepted language and the hyphenation patterns, whereas others separate the two. This is partly technology-based (e.g. the TeX hyphenation patterns are clearly separate from any *spell lexicons and implementations) and partly dictated by the host applications (e.g. in MS Office there are separate APIs and files for spellers and hyphenators, even though the present (non-hfst-based) Sámi speller and hyphenation lexicons are one and the same).

I tend to think that it will be a cleaner setup if we separate the two into different files, even though this might in some cases duplicate information or files. That would also entail writing a separate (but obviously quite similar) specification for the hyphenation file.
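
To make the naming question concrete, here is a minimal sketch of how an implementation might classify archive names if a LOCALE-TYPE.zhfst convention were adopted, treating anything unrecognised as an unsupported feature, as suggested above. Only the "spl" tag comes from this thread; the "hyp" tag and the function name are assumptions made purely for illustration.

```python
# Hypothetical type tags: "spl" is taken from the thread, "hyp" is an assumption.
KNOWN_TYPES = {"spl": "speller", "hyp": "hyphenator"}

ARCHIVE_SUFFIX = ".zhfst"


def classify_archive(filename):
    """Return (locale, kind) for a recognised archive name, else None.

    Unrecognised names are reported as None so that callers can treat
    them as unsupported features instead of failing.
    """
    if not filename.endswith(ARCHIVE_SUFFIX):
        return None
    stem = filename[: -len(ARCHIVE_SUFFIX)]
    # The type tag is whatever follows the last hyphen; the locale itself
    # may contain hyphens (for example "sv-FI").
    locale, sep, type_tag = stem.rpartition("-")
    if not sep or not locale:
        return None
    kind = KNOWN_TYPES.get(type_tag)
    if kind is None:
        return None
    return locale, kind
```

With this scheme a speller and a hyphenator for the same locale can sit side by side as two files, and a future file type simply adds one entry to the table.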

>> - Similarly some way of specifying some variants (standard
>> vocabulary, medical vocabulary etc.) would be needed. Within this
>> specification BCP 47 private use subtags in the locale string could
>> be used for this purpose. Even then we would need to give at least a
>> recommended naming scheme so that medical vocabularies for different
>> languages could be detected without human help.
> 
> This is something that is very useful. Also it may be possible to
> separate dictionaries from the one used in accepting and suggesting;
> usually the latter can be a bit more restricted for practical purposes
> (obscenities, special dictionaries). (Of course also possible with
> analyzer approach).

Good points. It should also be possible to add specialised dictionaries as a separate archive file, so that one wouldn't need to redistribute the whole package just to add one specialised dictionary, or build many separate archives for different combinations of general and specialised dictionaries.
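
As an illustration of the BCP 47 idea, the sketch below pulls the private use subtags out of a locale string so that, say, medical vocabularies for different languages can be found mechanically. The subtag value "medical" and both function names are hypothetical; no naming scheme has been agreed on yet.

```python
def private_use_subtags(tag):
    """Return the private use subtags of a BCP 47 tag, lowercased.

    In BCP 47 the singleton "x" introduces the private use part, and
    everything after it belongs to that part.
    """
    parts = tag.lower().split("-")
    if "x" not in parts:
        return []
    return parts[parts.index("x") + 1:]


def has_variant(tag, variant):
    """True if the locale tag carries the given vocabulary variant.

    The variant names themselves (e.g. "medical") are assumptions here.
    """
    return variant.lower() in private_use_subtags(tag)
```

For example, has_variant("fi-x-medical", "medical") and has_variant("sv-FI-x-medical", "medical") both hold, while a plain "fi" carries no variant.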

>> - Specification for the exact file format of HFST transducers (both
>> one and two level) should be included at least by reference. In
>> particular if there are multiple versions of these we need to know
>> which format versions an implementation must support. Preferably just
>> one format unless there are technical reasons such as memory/speed
>> tradeoffs that justify supporting multiple formats.
> 
> I'd leave it at requiring HFST 3 metadata header with strong suggestion
> towards optimized lookup weighted.

Tommi's suggestion seems fine, but it would be good for (backwards) compatibility checks that this is specified in the index.xml file.
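
A rough sketch of what such a compatibility check could look like follows. The draft does not yet define how index.xml would record the transducer format, so the "transducer-format" element and its "name" and "version" attributes are invented here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Formats this (imaginary) implementation can load. The thread only
# mentions HFST 3; the tuple layout is an assumption.
SUPPORTED_FORMATS = {("hfst", 3)}


def transducer_format_supported(index_xml):
    """Check a hypothetical <transducer-format> entry in index.xml."""
    root = ET.fromstring(index_xml)
    node = root.find("transducer-format")  # hypothetical element name
    if node is None:
        return False
    try:
        version = int(node.get("version", ""))
    except ValueError:
        return False
    return (node.get("name"), version) in SUPPORTED_FORMATS


example = """<index>
  <transducer-format name="hfst" version="3"/>
</index>"""
```

An implementation could run a check like this before opening any transducer file and refuse the archive cleanly when the format is newer than what it supports.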

[Side note: requiring HFST 3 headers presupposes that HFST 3 is available on the relevant platforms. Are there any plans for a public release? There are still build problems on the Mac ;) ]

I will update the specification draft based on your feedback, and announce the new draft version when I'm done.

Sjur



