[libvoikko] Another language for voikko: Avar!

Mon Mar 10 11:17:18 EET 2014

On 9 March 2014 at 19:51, Harri Pitkänen <hatapitk at iki.fi> wrote:

> On Sunday 09 March 2014 12:43:05 Francis Tyers wrote:
>> It would be good to be able to release spellcheckers with a kind of
>> spellrelax where the Latin characters do not cause spelling errors
>> (really this isn't a spelling error, it's an encoding error).
>> 
>> Any thoughts on how to do this ? -- Most of the errors you see in that
>> text are because of this problem.
> 
> Libvoikko does have some code to handle similar (language independent) 
> situations where certain Unicode characters are essentially equal from the 
> point of view of spell checking. These are mostly related to combining 
> diacritical marks, ligatures and hyphens. We normalize the words before 
> sending them to the speller backend. We also attempt (in a very limited way) 
> to ensure that spelling suggestions do not contain encoding changes that are 
> unrelated to fixing the actual spelling error. That is, if the word contains a 
> non-breaking hyphen and there is a spelling error, the suggestions will just 
> fix the spelling error without changing the non-breaking hyphen into a hyphen-
> minus.
> 
> I think that supporting similar language dependent rules within libvoikko 
> would be useful. But if we want ZHFST spellers to be fully self-contained then 
> the information about such relaxed spelling rules or transformations would 
> need to be stored in the ZHFST file. So it might be easier to handle this 
> completely within hfst-ospell. And the third option is to modify the 
> transducers so that the alternative characters are recognized directly.

My first reaction is that language-dependent behavior should be handled by the language-specific component, the FSTs. The C++ code should be as language-independent as possible.

After a short discussion with Fran, here is what I suggest:

* add support for another error model in the zhfst file, tentatively named errmodel.encoding.hfst
* add a check box to the speller configuration dialog to allow automatic correction of encoding errors
* if the check box is checked, every string that the acceptor rejects and that this error model can turn into an accepted string is automatically replaced by that string; other errors are treated the usual way
* if the check box is _not_ checked, behave as now, and let encoding errors be handled by the default error model
* if no such error model is found, the check box is greyed out or otherwise not accessible/settable

That is, using a special error model it should be possible to implement a safe autocorrect mode for encoding errors. Care has to be taken to ensure that the error model only generates one suggestion for each input.
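
To make the proposal concrete, here is a minimal Python sketch of the decision logic only. It is not how hfst-ospell or libvoikko work today: the lexicon and the lookalike table are toy stand-ins for the acceptor and for errmodel.encoding.hfst, and the function names are hypothetical.

```python
from typing import Optional

# Toy stand-in for errmodel.encoding.hfst (assumption): Latin letters that
# look like Cyrillic ones, mapped to the Cyrillic code point.
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # Cyrillic a
    "c": "\u0441",  # Cyrillic es
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
    "x": "\u0445",  # Cyrillic ha
    "y": "\u0443",  # Cyrillic u
}

# Toy stand-in for the acceptor (assumption): a set of accepted words.
# This is placeholder data, not a real Avar lexicon.
LEXICON = {"\u0445\u043e\u0440"}


def encoding_candidates(word: str) -> set:
    """Return the strings the toy encoding error model produces for word."""
    fixed = "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)
    return {fixed}


def autocorrect(word: str, enabled: bool) -> Optional[str]:
    """Return the word to keep in the text, or None to flag a spelling error.

    A rejected word is replaced only when the option is enabled and the
    encoding model yields exactly one accepted candidate.
    """
    if word in LEXICON:
        return word
    if not enabled:
        return None  # leave it to the default error model
    accepted = sorted(c for c in encoding_candidates(word) if c in LEXICON)
    if len(accepted) == 1:
        return accepted[0]
    return None  # ambiguous or not an encoding error: treat as usual


if __name__ == "__main__":
    # Latin "xop" is silently fixed to the Cyrillic lexicon entry.
    print(autocorrect("xop", enabled=True) == "\u0445\u043e\u0440")
```

The check for exactly one accepted candidate is what keeps the mode safe: anything ambiguous falls back to the normal suggestion list.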

Does this sound like a viable option?

Support for an OCR error model could be added at the same time. (I believe the zhfst code doesn’t support multiple error models now; I don’t know whether this limitation is in hfst-ospell or in libvoikko.)
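
As a rough illustration of how a front end could decide whether to enable the check box, the sketch below lists the error models shipped in a speller archive. It assumes the zhfst file is an ordinary zip archive whose error models are named errmodel.<name>.hfst; the function names are hypothetical and nothing here is part of the existing hfst-ospell or libvoikko API.

```python
import zipfile

# Tentative file name from the proposal above.
ENCODING_MODEL = "errmodel.encoding.hfst"


def list_error_models(zhfst):
    """Return the error model file names found in a zhfst archive.

    zhfst may be a path or a binary file-like object. Assumption: the
    archive is a plain zip and error models match errmodel.*.hfst.
    """
    with zipfile.ZipFile(zhfst) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("errmodel.") and name.endswith(".hfst")
        )


def encoding_autocorrect_available(zhfst):
    """True if the archive ships the encoding error model."""
    return ENCODING_MODEL in list_error_models(zhfst)
```

If encoding_autocorrect_available returns False, the dialog would grey out the option; the same listing would also show whether an OCR model is present.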

Sjur