[libvoikko] Aligning development of hfst-based proofing tools

Flammie Pirinen flammie at iki.fi
Tue Sep 21 02:00:20 EEST 2010


2010-09-20, Sjur Moshagen wrote:

> This is a followup on a discussion that started with a small
> suggestion to some people earlier this summer. After I received this
> and other feedback, I have now put together a first draft of what
> could become a specification for a speller lexicon file format for
> hfst-based spellers (see attached pdf file).

Seems good. For my project of converting other formats to automata and
testing them, I've been using the convention of installing automata into
$HFSTPREFIX/LOCALE/prefix.name.suffix, where HFSTPREFIX is one of
SYSTEMPREFIX/share/hfst or $HOMEDIR/.hfst, LOCALE is as in BCP 47, prefix
describes the automaton as one of dictionary, morphology, error, or
hyphenation, name is an arbitrary string (preferably without full stops,
though), and suffix is hfst. Maybe parts of this structure can be
integrated into the zip file contents, even though it's just a quickly
concocted personal convention to help my testing.
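
To make the convention above concrete, here is a minimal Python sketch of
building and parsing such paths. The function names and the example prefix
/usr/share/hfst are my own illustrations, not part of any HFST tool, and the
parser assumes the prefix itself contains no full stops.

```python
from pathlib import PurePosixPath

# The four automaton kinds named in the convention above.
KINDS = {"dictionary", "morphology", "error", "hyphenation"}


def automaton_path(hfst_prefix, locale, kind, name):
    """Build HFSTPREFIX/LOCALE/prefix.name.suffix (hypothetical helper)."""
    if kind not in KINDS:
        raise ValueError(f"unknown automaton kind: {kind!r}")
    if not name:
        raise ValueError("name must not be empty")
    return PurePosixPath(hfst_prefix) / locale / f"{kind}.{name}.hfst"


def parse_automaton_filename(filename):
    """Return (kind, name) for a conforming filename, else None."""
    # Split the kind off the front and the suffix off the back, so a name
    # that does contain full stops is still recovered whole.
    kind, first_dot, rest = filename.partition(".")
    name, last_dot, suffix = rest.rpartition(".")
    if not first_dot or not last_dot or not name:
        return None
    if kind not in KINDS or suffix != "hfst":
        return None
    return kind, name


if __name__ == "__main__":
    print(automaton_path("/usr/share/hfst", "fi", "dictionary", "default"))
    print(parse_automaton_filename("error.typos.hfst"))
```

Nothing here needs more than string handling, which is the point of putting
the information in the filename.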

> The source xml file (which renders quite ok in modern web browsers)
> is available at our svn repository:
> 
> https://victorio.uit.no/langtech/trunk/techdoc/proofdoc/spell/hfst/lexfile-spec.xml

A few specific suggestions, up for discussion, so I won't provide a patch
yet:

> Base file
> format - zip archive The base file format is a zip archive containing
> a number of single files that together comprises the full speller
> lexicon file package. 

Zip is a very nice format. IIRC most of our transducers also compress
nicely with zip's compression algorithm (at least they do with
gzip/bzip/xz).
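
One rough way to check this for a given transducer is to compare compressed
and original sizes with the codecs in the Python standard library. This is
only a sketch: zlib gives the same deflate algorithm that zip normally uses,
not the zip container itself, and the sample bytes below are a repetitive
stand-in, not a real transducer, so the printed ratios say nothing about
actual HFST files.

```python
import bz2
import lzma
import zlib


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for three codecs."""
    if not data:
        raise ValueError("need at least one byte of input")
    sizes = {
        "deflate (zip/gzip)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data)),
    }
    return {codec: size / len(data) for codec, size in sizes.items()}


if __name__ == "__main__":
    # Stand-in data; replace with the bytes of a real transducer file.
    stand_in = b"state-transition-row;" * 5000
    for codec, ratio in compression_ratios(stand_in).items():
        print(f"{codec}: {ratio:.3f}")
```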

> LOCALE.hfst is the acceptance transducer; filename
> as for the zip file, although there are NO exceptions to the filename
> convention for this file (ie it MUST follow BCP 47).

With the possibility of multiple dictionaries and such, I'd think the
naming of the acceptance automata could also use the error-model style of
naming convention. If I understand correctly, we are already within
LOCALE.zhfst, so LOCALE.hfst would be redundant, although potentially
usable as a sanity check, of course.

> The transducer
> can either be a one-level or a two-level transducer; if it is a
> two-level transducer, the other level must produce analyses of the
> correct words or the accepted suggestions produced by the error
> models, and the fact that it is a two-level transducer producing
> analyses must be expressed in the index.xml file's description of
> this file. It is an error to archive an analysing speller transducer
> without stating so in the index.xml file.

If possible, I would strongly favor also encoding the nature of the
automaton systematically in the filenames. It would relieve some of the
burden of discovery from both humans and software. It would be nice to
have some tools doing basic things with installed HFST spellers without
requiring XML libraries.

> errmodel1.hfst...errmodeln.hfst is one or more transducers containing
> spelling error corrections ("error models") for the speller

Here too, some human-readable names in the file names would be nice.
Wildcard matching on filenames should be easily available on all
systems?
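
As a sketch of what wildcard matching on archive members could look like,
here is a Python version using only the standard library. The member names
errmodel.typos.hfst and errmodel.keyboard.hfst are hypothetical examples of
human-readable names, not something the draft specification defines, and
the archive is built in memory just to keep the example self-contained.

```python
import fnmatch
import io
import zipfile


def find_members(archive, pattern):
    """List archive members whose names match a shell-style wildcard."""
    with zipfile.ZipFile(archive) as zf:
        # fnmatchcase keeps matching case-sensitive on every platform.
        return sorted(
            name for name in zf.namelist()
            if fnmatch.fnmatchcase(name, pattern)
        )


def build_demo_archive():
    """Create an in-memory zip with placeholder members (illustrative)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in ("index.xml", "se.hfst",
                     "errmodel.typos.hfst", "errmodel.keyboard.hfst"):
            zf.writestr(name, b"placeholder")
    return buffer


if __name__ == "__main__":
    print(find_members(build_demo_archive(), "errmodel.*.hfst"))
```

No XML parsing is needed to find the error models this way; index.xml
would only be consulted for details the filename cannot carry.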

> <info>

For the info metadata section it might be useful to embrace XML
namespaces and just use external standards to define the metadata, e.g.
Dublin Core. I suppose the most typical use for these is human-readable
metadata, perhaps displayed in a GUI.
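
To illustrate the idea, here is a small Python sketch that reads Dublin
Core elements out of an <info> section. The sample document is made up for
the example: the root element name, the choice of dc:title and dc:language,
and their values are assumptions, not taken from the draft specification.
Only the namespace URI is the standard Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

# Dublin Core Metadata Element Set, version 1.1 namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical index.xml fragment using Dublin Core inside <info>.
SAMPLE = """\
<index xmlns:dc="http://purl.org/dc/elements/1.1/">
  <info>
    <dc:title>Northern Sami speller</dc:title>
    <dc:language>se</dc:language>
  </info>
</index>
"""


def read_dc_info(xml_text):
    """Collect Dublin Core children of <info> as {local_name: text}."""
    root = ET.fromstring(xml_text)
    info = root.find("info")
    if info is None:
        return {}
    prefix = "{" + DC + "}"
    return {
        child.tag[len(prefix):]: (child.text or "").strip()
        for child in info
        if child.tag.startswith(prefix)
    }


if __name__ == "__main__":
    for key, value in read_dc_info(SAMPLE).items():
        print(f"{key}: {value}")
```

A GUI could show these fields directly, and any tool that already
understands Dublin Core would not need to learn a speller-specific schema.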

> <acceptor analyzing="true"
> id="se.hfst"/>

If the ids are unique, they could be xml:id attributes (or were those
discouraged?).

All in all, it seems a very usable suggestion to me; mostly, it's easily
extensible to all the uses I can think of right now.

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>
