[libvoikko] Voikko, cyrillic and case handling

Harri Pitkänen hatapitk at iki.fi
Tue Jan 24 22:23:23 EET 2012


On Tue 24.1.2012 at 21:22, Sjur Moshagen wrote:
> It seems that voikko is handling upper-cased text using internal code, but
> only for Latin-scripted languages:

This is correct.

> Now, we are testing the voikko+hfst combo with a couple of cyrillic
> languages as well, and it seems that voikko is not able to handle
> uppercasing for those languages. Is this correct?

Yes.

> In which file(s) is uppercasing defined? Would it be ok to add it (and
> send in a patch if it seems to work ok)? Or do you prefer a different
> solution for handling case in non-Latin (or all) casing languages/scripts?

Extending the current mappings and sending patches is fine. Case mappings are
defined in the first two functions in src/character/SimpleChar.cpp. There
is a TODO comment at the end of both functions; you can add your ranges
just before that comment.
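
As a rough illustration of what such ranges could look like, here is a
standalone sketch. It is not the code in SimpleChar.cpp: the function names
cyrillicToLower and cyrillicToUpper are made up for this example, and it only
covers the two simplest Cyrillic ranges, where upper and lower case differ by
a fixed offset. The extended ranges from U+0460 onwards pair letters
differently and are left out.

```cpp
// Hypothetical helpers for illustration only; not libvoikko's actual API.

// Map an uppercase Cyrillic letter to lowercase, leave everything else as is.
wchar_t cyrillicToLower(wchar_t c) {
    if (c >= 0x0410 && c <= 0x042F) {
        // U+0410..U+042F (basic uppercase) -> U+0430..U+044F
        return static_cast<wchar_t>(c + 0x20);
    }
    if (c >= 0x0400 && c <= 0x040F) {
        // U+0400..U+040F (e.g. U+0401, capital IO) -> U+0450..U+045F
        return static_cast<wchar_t>(c + 0x50);
    }
    return c;
}

// Map a lowercase Cyrillic letter to uppercase, leave everything else as is.
wchar_t cyrillicToUpper(wchar_t c) {
    if (c >= 0x0430 && c <= 0x044F) {
        // U+0430..U+044F (basic lowercase) -> U+0410..U+042F
        return static_cast<wchar_t>(c - 0x20);
    }
    if (c >= 0x0450 && c <= 0x045F) {
        // U+0450..U+045F -> U+0400..U+040F
        return static_cast<wchar_t>(c - 0x50);
    }
    return c;
}
```

In the real functions the same range checks would go just before the TODO
comment, adapted to whatever structure the existing mappings use.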

Additionally, you should make sure that get_char_type in
src/character/charset.cpp returns CHAR_LETTER for Cyrillic letters. I
don't think it does that yet.
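
The kind of check needed there might look roughly like the following. Again,
this is only a sketch under assumptions: isCyrillicLetter is a made-up helper,
not an existing libvoikko function, and it only covers the main Cyrillic block
(U+0400..U+04FF). It skips U+0482..U+0489, which are a numeric sign and
combining marks rather than letters.

```cpp
// Hypothetical helper for illustration only; not part of libvoikko.
// Returns true for letters in the main Cyrillic block (U+0400..U+04FF).
bool isCyrillicLetter(wchar_t c) {
    if (c >= 0x0400 && c <= 0x0481) {
        return true;  // basic and historic letters
    }
    if (c >= 0x048A && c <= 0x04FF) {
        return true;  // extended letters used by non-Slavic languages
    }
    // U+0482..U+0489 (thousands sign, combining marks) and everything
    // outside the block are not treated as Cyrillic letters here.
    return false;
}
```

get_char_type could then return CHAR_LETTER whenever a check like this
succeeds.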

Harri



