[libvoikko] Aligning development of hfst-based proofing tools

Mon Sep 20 22:13:22 EEST 2010

On Monday 20 September 2010, Sjur Moshagen wrote:
> Feedback on any part of this specification, and suggestions for expansions,
> additions etc are very welcome.

Thanks. Looks mostly OK to me, but here are some comments and questions:

- The ZIP format has lots of options for compression, encryption etc. We 
should figure out which ones are actually useful and allow only the minimal 
subset of features that meets our requirements. This would make it easier to 
produce a compatible implementation. I'm not sure if all compression methods 
that have been used within ZIP archives are free of patent issues. I can study 
this further and propose something. The ODF specification may contain 
something similar that we could just borrow.
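
To illustrate what checking such a minimal subset could look like, here is a 
small Python sketch. The allowed subset (stored or deflated entries, no 
encryption) is only an assumption for the sake of the example, not something 
the specification has decided, and the function name is made up.

```python
import io
import zipfile

# Assumption for illustration: only "stored" and "deflated" entries are allowed.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}


def find_zip_violations(archive):
    """Return (member name, reason) pairs for entries outside the assumed subset."""
    violations = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.compress_type not in ALLOWED_METHODS:
                violations.append(
                    (info.filename, "compression method %d" % info.compress_type))
            # Bit 0 of the general purpose bit flag marks an encrypted entry.
            if info.flag_bits & 0x1:
                violations.append((info.filename, "encrypted"))
    return violations


def build_example_archive():
    """Build a small in-memory archive that stays within the assumed subset."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.xml", "<index/>")
        zf.writestr("acceptor.hfst", b"\x00\x01")
    buf.seek(0)
    return buf


print(find_zip_violations(build_example_archive()))
```

An implementation could refuse to load an archive for which this list is not 
empty, which keeps the reader side simple.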

- File name conventions: If the hyphenation lexicon format is expected to be 
specified separately, should we have an explicit suffix in the speller file 
name that makes it clear that this is a speller archive? For example 
LOCALE-spl.zhfst. Or would the future hyphenation lexicon be embedded within 
this same archive? I'm also not so sure that the requirement to follow this 
naming convention needs to be so strong. There might be reasonable, even if 
not compelling, reasons to deviate, for example if someone wants to store this 
stuff in a database instead of a file system. Renaming a file is quite easy 
after all.

- What is the canonical method for determining the locale here? It is not 
explicitly specified in index.xml. You can find it out by looking at the 
archive file name (which is unreliable, see above) or from the name of the 
acceptance transducer. Somehow it would seem cleaner to me if the locale was 
specified within <info>:

 <info>
   <locale>se</locale>
   <title>....
   ...
 </info>
 <acceptor analyzing="true" />

and then the acceptor could have a fixed name like "acceptor.hfst". We would 
also avoid the (at the moment entirely theoretical) corner case where some 
locale string would clash with a file name of some other transducer within the 
archive. And I believe this would result in a bit cleaner code for the 
implementations.

The downside is that parsing the XML is then needed to determine the locale, 
whereas in your proposal it can be found out directly from the ZIP index.
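
For a sense of how much work the XML route would be, here is a minimal Python 
sketch that reads the locale from index.xml inside the archive. The <locale> 
element is only the suggestion made above, and the root element name <index> 
and the helper names are assumptions made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_locale(archive):
    """Return the text of <info><locale>, or None if it is missing or empty."""
    with zipfile.ZipFile(archive) as zf:
        root = ET.fromstring(zf.read("index.xml"))
    element = root.find("./info/locale")
    text = (element.text or "").strip() if element is not None else ""
    return text or None


def make_archive(index_xml):
    """Build a small in-memory archive for demonstration purposes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.xml", index_xml)
        zf.writestr("acceptor.hfst", b"")  # fixed acceptor name, as proposed above
    buf.seek(0)
    return buf


# Hypothetical index.xml; the <index> root element is an assumption.
EXAMPLE_INDEX = """<index>
  <info>
    <locale>se</locale>
    <title>Example speller</title>
  </info>
  <acceptor analyzing="true" />
</index>"""

print(read_locale(make_archive(EXAMPLE_INDEX)))
```

With a fixed acceptor name the reader no longer needs to inspect file names in 
the archive at all, at the cost of always parsing index.xml first.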

- I think it would be reasonable to allow an archive without any error models. 
It would still be enough for a speller that detects errors, only without 
giving spelling suggestions. Technically, supporting this case is trivial; in 
fact it is probably easier than failing with an error if no error models are 
found.

- I'm unsure about the open case of handling multiple error models. 
Unconditionally applying them all does not seem right. Libvoikko already 
supports choosing between two models (typo/OCR correction) so I'd like to be 
able to do that within this format too.

It would also be nice to have some way of figuring out, without having a human 
read the description, whether an error model is designed for typing errors, 
OCR errors or something else. This would be needed to properly support the 
current libvoikko API. An optional element to specify 1-n machine-readable 
"suitable uses" for the error model would help here. Maybe something like this 
(totally theoretical example):

<errormodel>
 (title and description as already specified)
 <type>OCR</type>
 <type>handwriting</type>
 <model>errormodel1.hfst</model>
 <model>errormodel2.hfst</model>
</errormodel>
<errormodel>
 (title and description as already specified)
 <type>typo</type>
 <model>errormodel1.hfst</model>
 <model>errormodel3.hfst</model>
</errormodel>

meaning that when we want to correct errors from OCR or handwriting 
recognition we should apply error models 1 and 2 in parallel (perhaps we would 
like an option to chain them too?) and if we want typo suggestions we should 
apply error models 1 and 3.
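
A minimal Python sketch of how an implementation might pick the models for a 
given use from such an index follows. The <type> and <model> elements are the 
theoretical ones from the example above, and the <index> root element and the 
function name are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical index fragment following the theoretical example above.
INDEX_XML = """
<index>
  <errormodel>
    <type>OCR</type>
    <type>handwriting</type>
    <model>errormodel1.hfst</model>
    <model>errormodel2.hfst</model>
  </errormodel>
  <errormodel>
    <type>typo</type>
    <model>errormodel1.hfst</model>
    <model>errormodel3.hfst</model>
  </errormodel>
</index>
"""


def models_for_use(index_xml, use):
    """Return the model files of the first <errormodel> that lists the given use."""
    root = ET.fromstring(index_xml)
    for errormodel in root.iter("errormodel"):
        types = [t.text.strip() for t in errormodel.findall("type") if t.text]
        if use in types:
            return [m.text.strip() for m in errormodel.findall("model") if m.text]
    # No matching error model: the speller can still detect errors,
    # it just cannot offer suggestions for this use.
    return []


print(models_for_use(INDEX_XML, "typo"))
```

How the returned models are then combined (in parallel or chained) is left 
open here, just as it is in the text above.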

- Similarly, some way of specifying variants (standard vocabulary, medical 
vocabulary etc.) would be needed. Within this specification BCP 47 private use 
subtags in the locale string could be used for this purpose. Even then we 
would need to give at least a recommended naming scheme so that medical 
vocabularies for different languages could be detected without human help.
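
As a sketch of how such private use subtags could be detected by software, 
the Python function below splits a locale string at the "x" singleton that 
starts the private use part in BCP 47. The subtag "medical" and the function 
name are hypothetical; no naming scheme has been agreed on.

```python
def split_private_use(tag):
    """Split a BCP 47 tag into (base tag, list of private use subtags).

    Everything after the first "x" singleton is private use. Private use
    subtags are lowercased because BCP 47 tags are case-insensitive.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" in lowered:
        i = lowered.index("x")
        return "-".join(subtags[:i]), lowered[i + 1:]
    return tag, []


# "medical" is a made-up variant name used only for illustration.
base, variants = split_private_use("fi-x-medical")
print(base, variants)
```

With a recommended naming scheme, a client could then look for the same 
variant name across archives for different languages.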

- A specification of the exact file format of HFST transducers (both one and 
two level) should be included, at least by reference. In particular, if there 
are multiple versions of these, we need to know which format versions an 
implementation must support. Preferably just one format, unless there are 
technical reasons such as memory/speed tradeoffs that justify supporting 
multiple formats.

Harri