[libvoikko] HFST speller lexicon spec - RC1
Flammie Pirinen
flammie at iki.fi
Tue Aug 9 05:53:33 EEST 2011
2011-07-03, Sjur Moshagen said:
> * the hfst group adds support for this zipped transducer format in
> hfst-ospell (both lib and frontend)
> * as soon as hfst-ospell has been updated to support the zhfst
> format, I'll run some tests and report back
>
> Does it seem reasonable? What do the HFST gang think? I will need
> some help to test memory consumption.
I committed a proof-of-concept quality sketch of an implementation with
libarchive (BSD licence) and libXML (MIT licence) as requirements.
Here's a usage example; ignore the metadata, I just filled it in with
test garbage:
$ hfst-ospell fi/fi-speller.zhfst
Following metadata was read from ZHFST archive:
locale: fi
version: 2011-09-01 [vcsrev: 500]
date: 2011-09-23
producer: Univ. Helsinki & omorfi contributors[email: <tommi.pirinen at helsinki.fi>, website: <http://home.gna.org/omorfi/>]
title [en]: Finnish spell-checking (omorfi)
title [fi]: Suomen kielen oikaisuluin (omorfi)
title [sv]: Finsk stavningskontrol (omorfi)
description [fi]:
Suomen kielen oikaisuluku, joka perustuu HFST-tekniikaan ja
omorfin kielimalleihin. Aloitettu Helsingin yliopistossa, avoimen
lähdekoodin projekti.
description [sv]:
En finsk stavningskontroll som änvänds HFST teknologi och
omorfi. Som är byggd i universitetet Helsingfors.
talo
"talo" is in the lexicon
talö
Corrections for "talö":
tali 1024
talo 1024
Daly 2048
Hali 2048
Kalu 2048
Valu 2048
taio 2048
valo 2048
vale 2048
vala 2048
töllö 2048
työ 2048
tyly 2048
tulu 2048
tulo 2048
tuli 2048
tule 2048
tola 2048
tili 2048
tila 2048
telo 2048
teli 2048
tele 2048
tela 2048
tavu 2048
tavi 2048
taut 2048
taus 2048
taun 2048
taulu 2048
taula 2048
tau 2048
valu 2048
tase 2048
tasa 2048
taru 2048
taro 2048
tapa 2048
taot 2048
taos 2048
taon 2048
taoa 2048
tao 2048
tanu 2048
talvi 2048
taltu 2048
talsi 2048
talot 2048
talon 2048
taloa 2048
jalo 2048
talma 2048
tallo 2048
talli 2048
talla 2048
talka 2048
talja 2048
talit 2048
talin 2048
talia 2048
ale 2048
talas 2048
taka 2048
tain 2048
taju 2048
tai 2048
taho 2048
tahi 2048
tagi 2048
tag 2048
tae 2048
taco 2048
tabu 2048
taat 2048
taas 2048
taan 2048
taala 2048
taa 2048
salo 2048
taso 2048
sala 2048
palo 2048
pala 2048
mali 2048
kalu 2048
kali 2048
kale 2048
kala 2048
Pal 2048
halu 2048
halo 2048
hali 2048
ali 2048
Mali 2048
Valo 2048
ala 2048
Vale 2048
Vala 2048
Talo 2048
Tali 2048
Salo 2048
Sali 2048
sali 2048
Ralf 2048
Palo 2048
Pala 2048
Halo 2048
Malmö 2048
Malm 2048
Kali 2048
Hall 2048
Kale 2048
Kala 2048
Sala 2048
Halu 2048
Bali 2048
Hal 2048
Falk 2048
Jalo 2048
Gale 2048
Aale 2048
Please try your best to break it; I haven't performed even the most
rudimentary cleanups to the code so it must at the very minimum leak
memory and pollute cwd at the moment. The example zip files are at:
<http://www.helsinki.fi/%7Etapirine/tmp/zhfst/>.
Implementation notes w.r.t. standards spec:
> If encryption is needed for protection/commercial reasons, that
> should be done directly on the transducer(s), and it is up to the
> vendor in question to add support for the encryption in their HFST
> lib code. Supporting unencrypted transducers is REQUIRED.
An encrypted automaton must have its encryption scheme declared in the
HFST3 metadata header (tbd). Implementations without encryption support
can then gracefully ignore encrypted automata, possibly with a neat
error message.
Regarding "Zip file content - file naming conventions of contained
files", in the current parsing algorithm:
* The DESCR is assumed to be any sequence of characters excepting NUL
and full stop => filename is assumed to contain at least two full
stops
* The DESCR is always assumed to exist; in case of
filename acceptor.hfst the DESCR := hfst
* the initial speller is built from the acceptors and errmodels with
DESCR == default; failing that, models are picked arbitrarily from
the archive (currently the first one according to C++
std::less<std::string>).
* if there are more than two full stops in file name, the parts after
DESCR may be reserved for future extensions
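
To make the rules above concrete, here is a minimal sketch in Python (not
the actual hfst-ospell code, which is C++). The function names parse_name
and pick are hypothetical, the "errmodel" file name prefix is assumed from
the wording above, and I assume the fallback ordering is applied to DESCR;
Python's min() on ASCII strings gives the same order as
std::less<std::string>.

```python
def parse_name(filename):
    """Split an archive member name into (kind, descr, extras).

    kind   -- part before the first full stop, e.g. "acceptor"
    descr  -- part after the first full stop (DESCR)
    extras -- parts between DESCR and the last part, reserved for later use
    """
    if "\0" in filename:
        raise ValueError("NUL is not allowed in file names")
    parts = filename.split(".")
    if len(parts) < 2:
        raise ValueError("file name needs at least one full stop")
    # For "acceptor.hfst" this yields DESCR == "hfst", as described above.
    return parts[0], parts[1], parts[2:-1]


def pick(names, kind):
    """Pick the member of the given kind: DESCR 'default' or else the first."""
    candidates = {}
    for name in names:
        try:
            found_kind, descr, _extras = parse_name(name)
        except ValueError:
            continue  # skip names that do not follow the convention
        if found_kind == kind:
            candidates[descr] = name
    if not candidates:
        return None
    if "default" in candidates:
        return candidates["default"]
    # Arbitrary fallback: smallest DESCR in plain string order.
    return candidates[min(candidates)]


if __name__ == "__main__":
    members = ["index.xml", "acceptor.big.hfst", "acceptor.default.hfst",
               "errmodel.default.hfst"]
    print(pick(members, "acceptor"))
    print(pick(members, "errmodel"))
```

Names that do not parse are simply skipped here; whether the real
implementation should warn about them is left open.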
The content of index.xml is currently only used to store and display
info.
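
For anyone who wants to poke at the metadata without building the C++
code, something along these lines should work. It is only a sketch using
Python's standard zipfile and xml.etree modules instead of libarchive and
libXML; the element names in the demo archive (info, locale, title) are
made up for illustration and are not taken from the spec.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_index(archive):
    """Return (tag, text) pairs for every element in index.xml with text."""
    with zipfile.ZipFile(archive) as zf:
        root = ET.fromstring(zf.read("index.xml"))
    pairs = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:
            pairs.append((elem.tag, text))
    return pairs


def make_demo_archive():
    """Build a tiny in-memory archive; the XML layout is hypothetical."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "index.xml",
            "<info><locale>fi</locale>"
            "<title>Finnish spell-checking (omorfi)</title></info>",
        )
        zf.writestr("acceptor.default.hfst", b"")  # empty placeholder
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for tag, text in read_index(make_demo_archive()):
        print(f"{tag}: {text}")
```

The walk is deliberately schema-agnostic, so it only displays whatever
it finds, which matches how index.xml is used at the moment.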
Probably more details I've forgotten, but at least it should be
testable now.
--
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>