[libvoikko] Relation between spelling and grammar checkers

Sjur Moshagen sjurnm at mac.com
Thu Sep 19 23:19:53 EEST 2013


On 19 Sep 2013, at 21:55, Harri Pitkänen <hatapitk at iki.fi> wrote:

> 1) During dictionary loading both dictionaries are loaded and wired into the
>    same voikko_options_t structure. This would be quite easy to implement.
>    * There is one complicated corner case though: what if the language tags
>      represent different variants of the same dictionary? If user requested
>      "sme-x-medicine" and we have only medical spell checker but standard
>      grammar checker, can the standard grammar checker be used as a
>      substitute?
> 
> 2) We can also require that all grammar checkers must also provide a spell
>    checker. This will simplify the logic: format 4 would always hide a
>    format 3 dictionary if they have the same language tag.
>    * The variant issue would still be present. We might have a spell checker
>      in format 3 for some variant and only standard dictionary in format 4.

I would go for 2). The main reason is that one can then ship a spell checker together with the grammar checker, tailored to work with it. For example, such a spell checker can be made a bit more relaxed if one knows that certain error patterns are better handled by the grammar checker. This would not be possible with option 1), because in that scenario we have no control over which format 3 dictionary is installed.

The corner case can be broken down into (at least) two scenarios:

a) content variant
b) dialect/geographic variant

The example mentioned (sme-x-medicine) is of type a). The assumption here is that the dictionary contains additional lexical material not covered by the standard speller dictionary (and possibly an error model adapted to the content). It would be reasonable to use both dictionaries in this situation: every word is checked against the format 4 dictionary first, and words unknown to it are then checked against the format 3 content-variant dictionary. Only if the word is also unknown to the format 3 dictionary is it flagged as unknown.
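
To make the lookup order concrete, here is a minimal sketch in Python. It is only an illustration: the two dictionaries are modelled as plain sets of words, the function name and the sample words are made up, and nothing here is libvoikko API.

```python
def is_known(word, standard_words, variant_words):
    """Check the standard (grammar checker) dictionary first, then the content variant.

    Both arguments are plain sets here; real dictionaries would be
    morphological analysers, which this sketch does not model.
    """
    if word in standard_words:
        return True
    # Fall back to the content-variant dictionary (e.g. the medical one).
    return word in variant_words


# Toy data, purely illustrative.
standard = {"house", "doctor"}
medicine = {"stethoscope"}

for w in ("doctor", "stethoscope", "stethoscpoe"):
    status = "ok" if is_known(w, standard, medicine) else "unknown"
    print(w, status)
```

Only the last word is reported as unknown, because neither dictionary accepts it.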

In case b) the question is trickier. Since the variants are not split according to lexical coverage, both could presumably be equally valid choices for the user, so ideally the choice should be left to the user. One could argue that a more specific variant (in terms of locale) should take precedence over a more general one, on the assumption that the user deliberately installed the more specific variant in preference to the general one. I am not yet sure what effect this would have on a grammar checker.
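
One possible reading of the "more specific wins" rule, sketched in Python. The assumptions are mine: language tags are compared subtag by subtag, an installed dictionary is a candidate only if its tag and the requested tag agree on all subtags they have in common, and among the candidates the one with the most subtags is chosen. The function names and the region tags are hypothetical.

```python
def compatible(a, b):
    """True if one subtag list is a prefix of the other."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]


def pick_variant(requested, installed):
    """Return the most specific installed tag compatible with the request, or None."""
    req = requested.lower().split("-")
    best, best_len = None, 0
    for tag in installed:
        parts = tag.lower().split("-")
        # Prefer the candidate with the most subtags; the first one wins ties.
        if compatible(req, parts) and len(parts) > best_len:
            best, best_len = tag, len(parts)
    return best


print(pick_variant("sme", ["sme", "sme-FI"]))
print(pick_variant("sme-NO", ["sme", "sme-FI"]))
```

With only "sme" requested, the regional dictionary is preferred because it is the more specific of the two installed ones; with "sme-NO" requested, the "sme-FI" dictionary is not a candidate and the general one is used. Breaking ties by installation order is an arbitrary choice in this sketch.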

Sjur



