[libvoikko] Aligning development of hfst-based proofing tools

Flammie Pirinen flammie at iki.fi
Tue Sep 21 02:12:51 EEST 2010

2010-09-20, Harri Pitkänen wrote:

> - File name conventions: If it is expected that hyphenation lexicon
> format is specified separately, should we have an explicit suffix in
> the speller file name that makes it clear that this is a speller
> archive? For example LOCALE-spl.zhfst. Or would the future
> hyphenation lexicon be embedded within this same archive? 

In general, I'd say we can leave this at saying that implementations
should treat any unrecognised names as unsupported features. This
indeed allows arbitrary future extensions. If we want to reserve space
for future file names, we could say, for example, that -*.hfst is the
file format and * defines the file type, but I don't know if we need
to specify that yet.
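
To illustrate the "unrecognised names are unsupported features" rule, here
is a minimal sketch. The LOCALE-TYPE.zhfst pattern follows Harri's example
above; the KNOWN_TYPES table and the classify function are hypothetical
names made up for this illustration, not part of any agreed specification.

```python
# Hypothetical registry mapping a file-type suffix to a feature name.
# Only "spl" appears in this thread; anything else is treated as unknown.
KNOWN_TYPES = {"spl": "speller"}


def classify(filename):
    """Return (feature, locale), or ("unsupported", None) for unknown names."""
    stem, _, ext = filename.rpartition(".")
    if ext != "zhfst" or not stem:
        return ("unsupported", None)
    # The type is whatever follows the last hyphen; the rest is the locale.
    locale, dash, ftype = stem.rpartition("-")
    if not dash or not locale or ftype not in KNOWN_TYPES:
        return ("unsupported", None)
    return (KNOWN_TYPES[ftype], locale)


if __name__ == "__main__":
    for name in ("fi-spl.zhfst", "fi-hyp.zhfst", "README"):
        print(name, classify(name))
```

An implementation written this way simply skips a future hyphenation file
instead of failing on it.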

> - Similarly some way of specifying some variants (standard
> vocabulary, medical vocabulary etc.) would be needed. Within this
> specification BCP 47 private use subtags in the locale string could
> be used for this purpose. Even then we would need to give at least a
> recommended naming scheme so that medical vocabularies for different
> languages could be detected without human help.

This would be very useful. It may also be possible to use separate
dictionaries for accepting and for suggesting; the latter can usually
be a bit more restricted for practical purposes (obscenities, special
dictionaries). (Of course this is also possible with the analyzer
approach.)
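
As a rough sketch of how the private use subtags Harri mentions could be
picked out of a locale string: in BCP 47 the singleton "x" starts the
private use part of a tag. The subtag name "medical" below is only an
assumed example, since no naming scheme has been recommended yet, and
split_private_use is a hypothetical helper.

```python
def split_private_use(tag):
    """Split a BCP 47 tag into (base tag, list of private use subtags)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" in lowered:
        # Everything after the first "x" singleton is private use.
        i = lowered.index("x")
        return "-".join(subtags[:i]), lowered[i + 1:]
    return tag, []


if __name__ == "__main__":
    # "medical" is an assumed variant name, not an agreed convention.
    print(split_private_use("fi-x-medical"))
    print(split_private_use("fi"))
```

With a shared list of such subtags, a medical vocabulary could be detected
the same way for every language.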

> - Specification for the exact file format of HFST transducers (both
> one and two level) should be included at least by reference. In
> particular if there are multiple versions of these we need to know
> which format versions an implementation must support. Preferably just
> one format unless there are technical reasons such as memory/speed
> tradeoffs that justify supporting multiple formats.

I'd leave it at requiring an HFST 3 metadata header, with a strong
recommendation to use the optimized-lookup weighted format.
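
A minimal sketch of what checking for such a header might look like. The
byte layout used here is my assumption and should be verified against the
HFST documentation: the bytes "HFST" plus a NUL, a 16-bit little-endian
length, another NUL, then NUL-terminated key/value pairs such as
"type" / "HFST_OLW". The function name read_hfst3_header is hypothetical.

```python
import struct


def read_hfst3_header(data):
    """Parse an (assumed) HFST 3 header from bytes into a dict of fields."""
    if not data.startswith(b"HFST\x00"):
        raise ValueError("missing HFST 3 header")
    # Assumed: 2-byte little-endian header length, then a NUL separator.
    (length,) = struct.unpack_from("<H", data, 5)
    if data[7:8] != b"\x00":
        raise ValueError("malformed header length field")
    body = data[8:8 + length]
    if len(body) != length or not body.endswith(b"\x00"):
        raise ValueError("truncated or unterminated header")
    fields = body[:-1].split(b"\x00")
    if len(fields) % 2:
        raise ValueError("header is not a list of key/value pairs")
    return {k.decode(): v.decode() for k, v in zip(fields[0::2], fields[1::2])}


def is_weighted_optimized_lookup(data):
    """True if the header declares the optimized-lookup weighted type."""
    return read_hfst3_header(data).get("type") == "HFST_OLW"
```

An implementation could accept any file with a valid header and just warn
when the type is something other than the recommended one.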

Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>
