[libvoikko] Another language for voikko: Avar!

Mon Mar 10 19:05:21 EET 2014

On Monday 10 March 2014 11:17:18 Sjur Moshagen wrote:
> After a short discussion with Fran, here is what I suggest:
> 
> * add support for another error model in the zhfst file, tentatively named
> errmodel.encoding.hfst

OK.

> * add a check box to the speller configuration dialog, to allow automatic
> corrections of encoding errors

OK.

> * if the check box is checked, when the text is run through the acceptor,
> every unaccepted string that can be automatically turned into an accepted
> string using this error model is automatically changed to that string;
> other errors are treated the usual way

OK.

> * if the check box is _not_ checked, behave as now, and let encoding errors
> be handled by the default error model

OK. Should it be checked by default? And more generally, should the old or new 
behavior be the default for applications that do not know about this new 
setting? For many applications we cannot provide a settings dialog at all.

> * if such an error model is not found, the check box is greyed out or
> otherwise not accessible/settable

This sounds like a minor detail but would in fact be quite hard to implement. 
Currently the preferences for libreoffice-voikko are the same for all 
languages, while such an error model might be available for only some of them.

> That is, using a special error model it should be possible to implement a
> safe autocorrect mode for encoding errors. Care has to be taken to ensure
> that the error model only generates one suggestion for each input.
> 
> Does this sound like a viable option?

Yes, and if we are willing to accept that the check box is never grayed out, it 
should be relatively easy to implement.
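
To make the proposed behavior concrete, here is a minimal sketch in Python. It 
is not libvoikko or hfst-ospell code: check_word, the acceptor and error model 
callables, and the toy lexicon are all made up for illustration. It assumes the 
setting is global and is simply ignored for languages without an encoding 
error model, which is why the check box would never need to be grayed out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-ins: an acceptor answers "is this word valid?", and an
# error model returns candidate corrections for a word.
Acceptor = Callable[[str], bool]
ErrorModel = Callable[[str], Sequence[str]]


def check_word(
    word: str,
    accepts: Acceptor,
    default_model: ErrorModel,
    encoding_model: Optional[ErrorModel] = None,
    autocorrect_encoding: bool = False,
) -> Tuple[str, List[str]]:
    """Return (possibly corrected word, suggestions to show the user)."""
    if accepts(word):
        return word, []
    # The setting only has an effect when the language ships an encoding
    # error model; otherwise it is silently ignored.
    if autocorrect_encoding and encoding_model is not None:
        candidates = {c for c in encoding_model(word) if accepts(c)}
        # Autocorrect only when the fix is unambiguous (exactly one result).
        if len(candidates) == 1:
            return candidates.pop(), []
    # Usual behavior: keep the word and offer default suggestions.
    return word, [s for s in default_model(word) if accepts(s)]


# Toy example (assumed data): palochka typed as a Latin "I", "l" or digit "1".
PALOCHKA = "\u04c0"
lexicon = {"г" + PALOCHKA + "адан"}


def toy_encoding_model(word: str) -> List[str]:
    fixed = word
    for wrong in ("I", "l", "1"):
        fixed = fixed.replace(wrong, PALOCHKA)
    return [fixed] if fixed != word else []


def no_suggestions(word: str) -> List[str]:
    return []


misspelled = "г" + "I" + "адан"  # Latin capital I instead of palochka
corrected, suggestions = check_word(
    misspelled, lexicon.__contains__, no_suggestions,
    encoding_model=toy_encoding_model, autocorrect_encoding=True,
)
```

With the setting off, or with encoding_model=None, the same call returns the 
word unchanged and leaves it to the default error model, which matches the 
"behave as now" case.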

> Also support for the OCR error model could be added at the same time (I
> believe multiple error models aren’t supported by the zhfst code now, don’t
> know whether this limitation is in the hfst-ospell or libvoikko code).

We could add this one as well but I believe these should still be independent 
settings? Even with text produced by OCR software you might wish to choose 
whether encoding errors should be corrected or not. I'm not really familiar 
with OCR software and don't know if it is generally possible to force them to 
only produce characters in a specific subset of Unicode.

The setting is currently ignored in libvoikko if the HFST suggestion backend is 
used.

Harri