[libvoikko] Aligning development of hfst-based proofing tools
Harri Pitkänen
hatapitk at iki.fi
Mon Sep 20 22:13:22 EEST 2010
On Monday 20 September 2010, Sjur Moshagen wrote:
> Feedback on any part of this specification, and suggestions for expansions,
> additions etc are very welcome.
Thanks. Looks mostly OK to me but here are some comments and questions:
- The ZIP format has lots of options for compression, encryption etc. We should
figure out which ones are actually useful and allow only the minimal subset of
features that covers those needs. This would make it easier to produce a
compatible implementation. I'm not sure if all compression
methods that have been used within ZIP archives are free of patent issues. I
can study this further and propose something. The ODF specification may
contain something similar that we could just borrow.
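
A minimal sketch of what checking an archive against such a restricted subset
could look like. The whitelist used here (stored and deflate only, no
encryption) is purely an assumption for illustration; nothing in this thread
has settled on it. The file name and contents in build_archive are placeholders.

```python
import io
import zipfile

# Assumed whitelist: only "stored" and "deflate" entries are accepted.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}
# Bit 0 of the general purpose flags marks an encrypted entry.
ENCRYPTED_FLAG = 0x1


def check_archive(data):
    """Return a list of problems; an empty list means the archive fits the subset."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.compress_type not in ALLOWED_METHODS:
                problems.append(
                    "%s: compression method %d not allowed"
                    % (info.filename, info.compress_type)
                )
            if info.flag_bits & ENCRYPTED_FLAG:
                problems.append("%s: encrypted entries not allowed" % info.filename)
    return problems


def build_archive(method):
    """Build a tiny in-memory archive with one placeholder entry (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("index.xml", "placeholder")
    return buffer.getvalue()
```

An implementation could run a check like this before loading anything and
refuse archives that report problems.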
- File name conventions: If it is expected that the hyphenation lexicon format
is specified separately, should we have an explicit suffix in the speller file
name that makes it clear that this is a speller archive? For example
LOCALE-spl.zhfst. Or would the future hyphenation lexicon be embedded within this
same archive? I'm also not so sure if the requirement to follow this naming
convention needs to be so strong. There might be non-mandatory but still
reasonable reasons to deviate, for example if someone wants to store this
stuff in a database instead of file system. Renaming a file is quite easy
after all.
- What is the canonical method for determining the locale here? It is not
explicitly specified in index.xml. You can find it out by looking at the
archive file name (which is unreliable, see above) or from the name of the
acceptance transducer. Somehow it would seem cleaner to me if the locale was
specified within <info>:
<info>
<locale>se</locale>
<title>....
...
</info>
<acceptor analyzing="true" />
and then the acceptor could have a fixed name like "acceptor.hfst". We would
also avoid the (at the moment entirely theoretical) corner case where some
locale string would clash with a file name of some other transducer within the
archive. And I believe this would result in a bit cleaner code for the
implementations.
The downside is that parsing the XML is then needed to determine the locale,
whereas in your proposal it can be found out directly from the ZIP index.
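
To give an idea of the cost, here is a rough sketch of reading the locale from
a <locale> element inside <info>, as suggested above. The root element name
"index" and the helper names are assumptions made for the example; the actual
layout of index.xml is whatever the specification defines.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_locale(archive_bytes):
    """Return the text of <info>/<locale> from index.xml, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("index.xml"))
    # Assumes <info> is a descendant of the (hypothetical) root element.
    element = root.find(".//info/locale")
    if element is None or element.text is None:
        return None
    return element.text.strip()


def make_archive(index_xml):
    """Pack the given index.xml text into an in-memory archive (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("index.xml", index_xml)
    return buffer.getvalue()


# Hypothetical index.xml; only <info>/<locale> matters for this sketch.
SAMPLE_INDEX = (
    "<index><info><locale>se</locale></info>"
    '<acceptor analyzing="true" /></index>'
)
```

With this layout, read_locale(make_archive(SAMPLE_INDEX)) yields the locale
string without looking at any file names in the archive.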
- I think it would be reasonable to allow an archive without any error models.
It would still be enough for a speller that detects errors, only without
giving spelling suggestions. Technically supporting this case is trivial, in
fact probably easier than failing with an error if no error models are found.
- I'm unsure about the open case of handling multiple error models.
Unconditionally applying them all does not seem right. Libvoikko already
supports choosing between two models (typo/OCR correction) so I'd like to be
able to do that within this format too.
It would also be nice to have some way of figuring out, without having a human
read the description, whether an error model is designed for typing errors,
OCR errors or something else. This would be needed to properly support the
current libvoikko API. An optional element to specify 1-n machine readable "suitable
uses" for the error model would help here. Maybe something like this (totally
theoretical example):
<errormodel>
(title and description as already specified)
<type>OCR</type>
<type>handwriting</type>
<model>errormodel1.hfst</model>
<model>errormodel2.hfst</model>
</errormodel>
<errormodel>
(title and description as already specified)
<type>typo</type>
<model>errormodel1.hfst</model>
<model>errormodel3.hfst</model>
</errormodel>
meaning that when we want to correct errors from OCR or handwriting
recognition we should apply error models 1 and 2 in parallel (perhaps we would
like an option to chain them too?) and if we want typo suggestions we should
apply error models 1 and 3.
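
A rough sketch of how an implementation might pick the error models for a
requested use from the theoretical markup above. The root element name "index"
and the function name models_for are assumptions for illustration, and the
title and description elements are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical index fragment following the theoretical example above.
SAMPLE = """
<index>
  <errormodel>
    <type>OCR</type>
    <type>handwriting</type>
    <model>errormodel1.hfst</model>
    <model>errormodel2.hfst</model>
  </errormodel>
  <errormodel>
    <type>typo</type>
    <model>errormodel1.hfst</model>
    <model>errormodel3.hfst</model>
  </errormodel>
</index>
"""


def models_for(index_xml, wanted_type):
    """Return the model files of the first <errormodel> that lists wanted_type."""
    root = ET.fromstring(index_xml)
    for errormodel in root.iter("errormodel"):
        types = {t.text.strip() for t in errormodel.findall("type") if t.text}
        if wanted_type in types:
            return [m.text.strip() for m in errormodel.findall("model") if m.text]
    # No matching error model: the caller can still detect errors,
    # it just cannot offer suggestions.
    return []
```

Here models_for(SAMPLE, "typo") gives errormodel1.hfst and errormodel3.hfst,
and an unknown type gives an empty list, which also covers an archive that
ships no error models at all.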
- Similarly some way of specifying some variants (standard vocabulary, medical
vocabulary etc.) would be needed. Within this specification BCP 47 private use
subtags in the locale string could be used for this purpose. Even then we
would need to give at least a recommended naming scheme so that medical
vocabularies for different languages could be detected without human help.
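
As an illustration of the private use subtag idea, a small sketch that splits
a locale string into its base tag and private use subtags. The variant name
"medical" is a made-up example, not a proposed naming scheme, and the function
name is hypothetical.

```python
def split_private_use(locale_tag):
    """Split a BCP 47 tag at the private use singleton "x".

    Returns (base_tag, private_subtags); private_subtags is empty when the
    tag has no private use part.
    """
    subtags = locale_tag.split("-")
    lowered = [subtag.lower() for subtag in subtags]
    if "x" in lowered:
        # Everything after the "x" singleton is private use.
        position = lowered.index("x")
        return "-".join(subtags[:position]), subtags[position + 1:]
    return locale_tag, []
```

With a recommended naming scheme, software could then test for a variant with
something like "medical" in split_private_use("fi-x-medical")[1], regardless
of the language.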
- Specification for the exact file format of HFST transducers (both one and
two level) should be included at least by reference. In particular if there
are multiple versions of these we need to know which format versions an
implementation must support. Preferably just one format unless there are
technical reasons such as memory/speed tradeoffs that justify supporting
multiple formats.
Harri