[libvoikko] Voikko, cyrillic and case handling

Harri Pitkänen hatapitk at iki.fi
Fri Jan 27 19:45:32 EET 2012


Fri 27.1.2012 11:04 Sjur Moshagen wrote:
> A pattern has emerged for the discrepancy between voikkospell and
> ooovoikko/LibreOffice:
>
> In all cases where there is no suggestion in LibreOffice, the original
> input string contains the Latin character ö instead of the corresponding
> Cyrillic one.

Thanks for the hint, this helped me to find a possible cause for the problem.

Initially, when I tested this with your transducer, I did not get
suggestions for words containing ö in either LibreOffice or voikkospell.
Then I ran "svn update" and "make install" for hfst-ospell, and suddenly it
just worked, in LibreOffice as well.

So my guess is that your build of libreoffice-voikko is linked against an
outdated hfst-ospell. Try rebuilding it against the latest version and see
if that helps.

By the way, does hfst-ospell or your Komi transducer support canonically
decomposed forms of these Unicode characters? Normally Cyrillic ö is
written as U+04E7, but it can also be written in decomposed form as
U+043E U+0308. Libvoikko automatically converts decomposed forms into the
more widely used composed form for Latin letters, so that underlying
morphologies don't have to care about this issue. This is not yet done for
Cyrillic letters, though. If you don't already support decomposed forms,
let me know and I can add the necessary mappings.
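
To illustrate the two spellings mentioned above, here is a minimal Python
sketch that uses only the standard unicodedata module. It is not libvoikko
or hfst-ospell code, and the helper name to_composed is hypothetical; it
just shows that canonical composition (NFC) maps 043E 0308 to 04E7 and
that the Latin ö remains a separate character.

```python
import unicodedata

CYRILLIC_O_DIAERESIS = "\u04e7"   # composed Cyrillic o with diaeresis
DECOMPOSED = "\u043e\u0308"       # Cyrillic o + combining diaeresis
LATIN_O_DIAERESIS = "\u00f6"      # composed Latin o with diaeresis


def to_composed(word: str) -> str:
    """Hypothetical helper: return the canonically composed (NFC) form."""
    return unicodedata.normalize("NFC", word)


# The decomposed Cyrillic sequence composes to the single code point 04E7.
assert to_composed(DECOMPOSED) == CYRILLIC_O_DIAERESIS

# Decomposing 04E7 (NFD) gives back the two-code-point sequence.
assert unicodedata.normalize("NFD", CYRILLIC_O_DIAERESIS) == DECOMPOSED

# Latin o + combining diaeresis composes to the Latin letter, which is a
# different code point from the Cyrillic one even though it looks the same.
assert to_composed("o\u0308") == LATIN_O_DIAERESIS
assert LATIN_O_DIAERESIS != CYRILLIC_O_DIAERESIS
```

A speller front end could apply such a step to the input word before the
lookup, so that the transducer only needs to know the composed forms.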

Harri



