[libvoikko] HFST speller lexicon spec - RC1

Tue Aug 9 05:53:33 EEST 2011

2011-07-03, Sjur Moshagen wrote:

> * the hfst group adds support for this zipped transducer format in
> hfst-ospell (both lib and frontend)
> * as soon as hfst-ospell has been updated to support the zhfst
> format, I'll run some tests and report back
> 
> Does it seem reasonable? What do the HFST gang think? I will need
> some help to test memory consumption.

I committed a proof-of-concept sketch of an implementation, with
libarchive (BSD licence) and libXML (MIT licence) as requirements.
Here's a usage example; ignore the metadata, I just filled it in with
test garbage:

$ hfst-ospell fi/fi-speller.zhfst 
Following metadata was read from ZHFST archive:
locale: fi
version: 2011-09-01 [vcsrev: 500]
date: 2011-09-23
producer: Univ. Helsinki & omorfi contributors[email: <tommi.pirinen at helsinki.fi>, website: <http://home.gna.org/omorfi/>]
title [en]: Finnish spell-checking (omorfi)
title [fi]: Suomen kielen oikaisuluin (omorfi)
title [sv]: Finsk stavningskontrol (omorfi)
description [fi]: 
        Suomen kielen oikaisuluku, joka perustuu HFST-tekniikaan ja
        omorfin kielimalleihin. Aloitettu Helsingin yliopistossa, avoimen 
        lähdekoodin projekti.

description [sv]: 
        En finsk stavningskontroll som änvänds HFST teknologi och
        omorfi. Som är byggd i universitetet Helsingfors.

talo
"talo" is in the lexicon

talö
Corrections for "talö":
tali    1024
talo    1024
Daly    2048
Hali    2048
Kalu    2048
Valu    2048
taio    2048
valo    2048
vale    2048
vala    2048
töllö    2048
työ    2048
tyly    2048
tulu    2048
tulo    2048
tuli    2048
tule    2048
tola    2048
tili    2048
tila    2048
telo    2048
teli    2048
tele    2048
tela    2048
tavu    2048
tavi    2048
taut    2048
taus    2048
taun    2048
taulu    2048
taula    2048
tau    2048
valu    2048
tase    2048
tasa    2048
taru    2048
taro    2048
tapa    2048
taot    2048
taos    2048
taon    2048
taoa    2048
tao    2048
tanu    2048
talvi    2048
taltu    2048
talsi    2048
talot    2048
talon    2048
taloa    2048
jalo    2048
talma    2048
tallo    2048
talli    2048
talla    2048
talka    2048
talja    2048
talit    2048
talin    2048
talia    2048
ale    2048
talas    2048
taka    2048
tain    2048
taju    2048
tai    2048
taho    2048
tahi    2048
tagi    2048
tag    2048
tae    2048
taco    2048
tabu    2048
taat    2048
taas    2048
taan    2048
taala    2048
taa    2048
salo    2048
taso    2048
sala    2048
palo    2048
pala    2048
mali    2048
kalu    2048
kali    2048
kale    2048
kala    2048
Pal    2048
halu    2048
halo    2048
hali    2048
ali    2048
Mali    2048
Valo    2048
ala    2048
Vale    2048
Vala    2048
Talo    2048
Tali    2048
Salo    2048
Sali    2048
sali    2048
Ralf    2048
Palo    2048
Pala    2048
Halo    2048
Malmö    2048
Malm    2048
Kali    2048
Hall    2048
Kale    2048
Kala    2048
Sala    2048
Halu    2048
Bali    2048
Hal    2048
Falk    2048
Jalo    2048
Gale    2048
Aale    2048

Please try your best to break it; I haven't performed even the most
rudimentary cleanups to the code so it must at the very minimum leak
memory and pollute cwd at the moment. The example zip files are at:
<http://www.helsinki.fi/%7Etapirine/tmp/zhfst/>.

Implementation notes w.r.t. standards spec:

> If encryption is needed for protection/commercial reasons, that
> should be done directly on the transducer(s), and it is up to the
> vendor in question to add support for the encryption in their HFST
> lib code. Supporting unencrypted transducers is REQUIRED.

An encrypted automaton must declare its encryption scheme in the HFST3
metadata header (tbd). Implementations without encryption support can
then gracefully ignore encrypted automata, possibly with a neat error
message.
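
To illustrate what "gracefully ignore" could look like, here is a rough
Python sketch. It is not the hfst-ospell code. The header field is still
tbd, so the key name "encryption" and the representation of a header as a
plain dict are assumptions made up for this example.

```python
def select_usable_automata(automata):
    """Split automata into usable ones and human-readable skip notices.

    `automata` is an iterable of (name, header) pairs, where `header` is a
    dict standing in for the HFST3 metadata header. The "encryption" key is
    hypothetical: the real header field is not yet specified.
    """
    usable = []
    notices = []
    for name, header in automata:
        scheme = header.get("encryption")  # hypothetical field name
        if scheme is None:
            # Unencrypted automata must always be supported.
            usable.append(name)
        else:
            # No decryption support here: skip, but say why.
            notices.append(
                "%s: skipped, unsupported encryption scheme '%s'" % (name, scheme)
            )
    return usable, notices
```

A caller would load only the names in the first list and print the
notices, instead of failing on the whole archive.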

On the file naming conventions for files contained in the zip archive,
the current parsing algorithm works as follows:

* The DESCR is assumed to be any sequence of characters except NUL
  and full stop, so a filename is assumed to contain at least two full
  stops.
* The DESCR is always assumed to exist; for the filename
  acceptor.hfst, DESCR := hfst.
* The initial speller is built from the acceptor and errmodel with DESCR
  == default; if there is none, models are picked arbitrarily from the
  archive (currently the first one in C++ std::less<std::string>
  order).
* If there are more than two full stops in the file name, the parts after
  DESCR may be reserved for future extensions.
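
The naming rules above can be sketched roughly as follows, in Python for
brevity. This is not the committed C++ code. The function names are
invented for illustration, and it assumes that the type prefixes are
spelled exactly "acceptor" and "errmodel" and that models are ordered by
their DESCR.

```python
def split_member_name(filename):
    """Split 'TYPE.DESCR[.more...]' into (type, descr, rest).

    The second dot-separated part is always taken as DESCR, so
    'acceptor.hfst' yields DESCR == 'hfst'. Anything after DESCR is
    returned untouched, since it may be reserved for future extensions.
    """
    parts = filename.split(".")
    if len(parts) < 2:
        raise ValueError("no full stop in member name: %r" % filename)
    return parts[0], parts[1], parts[2:]


def pick_initial(filenames, kind):
    """Pick the archive member of the given type for the initial speller.

    Prefers DESCR == 'default'; otherwise falls back to the smallest DESCR
    in string order, which matches std::less<std::string> for ASCII names.
    Returns None when the archive has no member of that type.
    """
    by_descr = {}
    for name in filenames:
        try:
            member_kind, descr, _rest = split_member_name(name)
        except ValueError:
            continue  # not a TYPE.DESCR name at all
        if member_kind == kind:
            by_descr.setdefault(descr, name)
    if not by_descr:
        return None
    if "default" in by_descr:
        return by_descr["default"]
    return by_descr[min(by_descr)]


members = ["index.xml", "acceptor.fast.hfst",
           "acceptor.default.hfst", "errmodel.edit1.hfst"]
print(pick_initial(members, "acceptor"))
print(pick_initial(members, "errmodel"))
```

With these members the acceptor is chosen by the "default" rule, while
the errmodel falls back to the only candidate in the archive.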

The content of index.xml is currently only used to store and display
info.

Probably more details I've forgotten, but at least it should be
testable now.

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>