[libvoikko] Voikko, cyrillic and case handling
hatapitk at iki.fi
Tue Jan 24 22:23:23 EET 2012
On Tue, 24 Jan 2012 at 21:22, Sjur Moshagen wrote:
> It seems that voikko is handling upper-cased text using internal code, but
> only for Latin-scripted languages:
This is correct.
> Now, we are testing the voikko+hfst combo with a couple of cyrillic
> languages as well, and it seems that voikko is not able to handle
> uppercasing for those languages. Is this correct?
> In which file(s) is uppercasing defined? Would it be ok to add it (and
> send in a patch if it seems to work ok)? Or do you prefer a different
> solution for handling case in non-latin (or all) casing languages/scripts?
Extending the current mappings and sending patches is fine. Case mappings are
defined in the first two functions in src/character/SimpleChar.cpp. There
is a TODO comment at the end of both functions; you can add your ranges
just before that comment.
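As a rough illustration of what such ranges could look like, here is a
minimal sketch of range-based case mapping for the two most regular parts
of the Cyrillic block. The function names cyrillicToLower and
cyrillicToUpper are hypothetical and are not the names used in
SimpleChar.cpp; the real functions may be structured differently. Only
U+0400..U+045F is covered. The extended Cyrillic letters from U+0460
upwards follow other pairing patterns and are left unchanged here.

```cpp
// Sketch only: hypothetical helpers, not the actual libvoikko functions.

// Map an uppercase Cyrillic letter to lowercase; other characters are
// returned unchanged.
wchar_t cyrillicToLower(wchar_t c) {
    // U+0410..U+042F (A..YA) map to U+0430..U+044F, offset 0x20.
    if (c >= 0x0410 && c <= 0x042F) {
        return static_cast<wchar_t>(c + 0x20);
    }
    // U+0400..U+040F (IE WITH GRAVE..DZHE) map to U+0450..U+045F, offset 0x50.
    if (c >= 0x0400 && c <= 0x040F) {
        return static_cast<wchar_t>(c + 0x50);
    }
    return c;
}

// Map a lowercase Cyrillic letter to uppercase; other characters are
// returned unchanged.
wchar_t cyrillicToUpper(wchar_t c) {
    // Inverse of the two ranges handled in cyrillicToLower.
    if (c >= 0x0430 && c <= 0x044F) {
        return static_cast<wchar_t>(c - 0x20);
    }
    if (c >= 0x0450 && c <= 0x045F) {
        return static_cast<wchar_t>(c - 0x50);
    }
    return c;
}
```

For example, cyrillicToLower(0x0416) (ZHE) gives 0x0436, and Latin
characters pass through untouched.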
Additionally, you should make sure that get_char_type in
src/character/charset.cpp returns CHAR_LETTER for Cyrillic letters. I
don't think it does that yet.
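The kind of check that get_char_type would need can be sketched as a
simple range test. The helper name isCyrillicLetter is hypothetical and
does not exist in charset.cpp; how the result is wired into the real
function is left open. The sketch assumes only the main Cyrillic block
(U+0400..U+04FF) matters and skips U+0482..U+0489, which are a symbol and
combining marks rather than letters.

```cpp
// Sketch only: hypothetical helper, not part of libvoikko.

// Return true if c is a letter in the main Cyrillic block.
bool isCyrillicLetter(wchar_t c) {
    // U+0400..U+0481: basic and historic letters.
    if (c >= 0x0400 && c <= 0x0481) {
        return true;
    }
    // U+0482..U+0489 are the thousands sign and combining marks: skipped.
    // U+048A..U+04FF: extended letters used by many non-Slavic languages.
    if (c >= 0x048A && c <= 0x04FF) {
        return true;
    }
    return false;
}
```

A character classification function could call such a helper and return
its letter type when the result is true.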