[libvoikko] libvoikko and list of Finnish words

Sun Jan 12 12:38:18 EET 2020

Harri Pitkänen <hatapitk at iki.fi> wrote:

> We do not have anything for this particular purpose but I see some
> options that could work.
>
> - For internal testing I wrote a script to extract words from Finnish
>   Wikipedia. The script has some comments on how to filter it to get
>   the N most common words. After that you can use the voikkospell
>   command line tool to filter out English words etc. if you like. But
>   this is not going to give good results for mobile use, where people
>   often write SMS / WhatsApp style messages that contain words that
>   are not likely to be found in Wikipedia. It might work OK for more
>   formal document editing.
>
> - Get access to a Finnish corpus that contains samples of less formal
>   language and use that (perhaps combining it with Wikipedia texts or
>   something similar). Maybe the Suomi24 corpus would be good
>   (http://urn.fi/urn:nbn:fi:lb-2017021630)? Unfortunately that corpus
>   is not freely available at the moment, but it might be possible to
>   get the list of most common words extracted from it. If anyone on
>   this list knows more about this, please let us know.
>
> The resources that are used to build dictionaries for Voikko mostly
> lack the information on how common particular words are so they won't
> be very useful for this on their own.

Thank you! Spellchecking is quite different from word prediction, so I
guessed that, as Voikko has focused on spellchecking, it probably has no
data about the frequency of words.

My idea for future improvement of word prediction in ibus-typing-booster
is to parse Wikipedia or other corpora and get frequency data for words
in context that way. As you say, Wikipedia is probably much more formal
than what is used in chats. But even frequency data for word groups from
Wikipedia should already give much better results than using single
words from spellchecking dictionaries. And Wikipedia is available in
many languages. So I think I’ll go with Wikipedia first. I have to
implement lots of stuff in ibus-typing-booster anyway, and frequency data
parsed from Wikipedia should be better than nothing and good enough to
test my implementation. If everything works, I can replace the data
parsed from Wikipedia with data parsed from more suitable corpora later.
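
To make the "word groups" idea concrete, here is a minimal sketch of
counting word pairs and using them to rank candidates for the next word.
It is not the ibus-typing-booster implementation; the function names
(tokenize, count_bigrams, predict), the tokenizer regex, and the three
sample sentences are all made up for illustration, and real input would
be plain text extracted from a Wikipedia dump.

```python
import re
from collections import Counter, defaultdict

# Assumption: a token is a run of Unicode letters (so "ä" and "ö" are kept).
# Digits, punctuation and hyphens act as separators in this sketch.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def count_bigrams(lines):
    """Map each word to a Counter of the words seen directly after it.

    Pairs are only counted within a line, never across line boundaries.
    """
    following = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for prev_word, next_word in zip(tokens, tokens[1:]):
            following[prev_word][next_word] += 1
    return following


def predict(following, prev_word, prefix="", limit=3):
    """Return up to `limit` words seen after prev_word that start with prefix.

    Candidates are ordered by descending count, ties broken alphabetically.
    """
    candidates = following.get(prev_word.lower(), Counter())
    wanted = prefix.lower()
    ranked = sorted(
        (item for item in candidates.items() if item[0].startswith(wanted)),
        key=lambda item: (-item[1], item[0]),
    )
    return [word for word, _count in ranked[:limit]]


# Hypothetical stand-in for text extracted from Finnish Wikipedia.
sample = [
    "Helsinki on Suomen pääkaupunki.",
    "Tampere on Suomen kolmanneksi suurin kaupunki.",
    "Suomen pääkaupunki on Helsinki.",
]
model = count_bigrams(sample)
print(predict(model, "suomen"))
print(predict(model, "on", prefix="h"))
```

With this toy input, "pääkaupunki" is ranked before "kolmanneksi" after
"suomen" because it follows that word twice instead of once. A
dictionary without frequency data cannot make that distinction.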

-- 
Mike FABIAN <mfabian at redhat.com>
Lack of sleep is the enemy of good work.