[libvoikko] libvoikko and list of Finnish words

Harri Pitkänen hatapitk at iki.fi
Sat Jan 11 18:15:25 EET 2020


Marko Myllynen wrote on 2020-01-10 15:02:
> Hi,
> ibus-typing-booster
> (https://mike-fabian.github.io/ibus-typing-booster/)
> is a completion input method to speed up typing.
> Mike (CC'ed) kindly added Finnish spelling check support to
> ibus-typing-booster recently using the Python interface to libvoikko,
> and this seems to be working nicely.
> However, it is not clear whether libvoikko can be used to provide a
> list of Finnish words for word completion with ibus-typing-booster. The
> latest ibus-typing-booster uses the Finnish ispell dictionary from
> http://ispell-fi.sourceforge.net/finnish.dict.bz2 but using this almost
> 20-year-old dict file doesn't seem ideal.
> Would providing a list of Finnish words for word completion be in
> scope for libvoikko and, if so, how would one use it?

We do not have anything for this particular purpose but I see some 
options that could work.

- For internal testing I wrote a script to extract words from Finnish 
Wikipedia. The script has some comments on how to filter its output to 
get the N most common words. After that you can use the voikkospell 
command line tool to filter out English words etc. if you like. But this 
is not going to give good results for mobile use, where people often 
write SMS or WhatsApp-style messages containing words that are unlikely 
to be found in Wikipedia. It might work OK for more formal document 
editing.

- Get access to a Finnish corpus that contains samples of less formal 
language and use that (perhaps combining with Wikipedia texts or 
something similar). Maybe the Suomi24 corpus would be good 
(http://urn.fi/urn:nbn:fi:lb-2017021630)? Unfortunately that corpus is 
not freely available at the moment but it might be possible to get the 
list of most common words extracted from it. If anyone on this list 
knows more about this, please let us know.
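
To illustrate the general idea behind both options, here is a minimal 
Python sketch. It is not the Wikipedia extraction script mentioned above; 
the function names, the tokenizing regular expression and the lowercasing 
step are assumptions made for illustration. It counts words in plain-text 
lines, walks them from most to least common, and keeps the first N that 
pass an optional validity check. The check can be the spell() method of 
the libvoikko Python interface, which needs libvoikko and a Finnish 
dictionary installed.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional

# Runs of Unicode letters (covers å, ä, ö). Hyphenated words and
# apostrophes are split apart: a simplification for this sketch.
WORD_RE = re.compile(r"[^\W\d_]+")


def most_common_words(
    lines: Iterable[str],
    n: int,
    is_valid: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return up to n most frequent words that pass is_valid (if given)."""
    counts: Counter = Counter()
    for line in lines:
        # Lowercasing merges "Talo" and "talo", but it also lowercases
        # proper nouns, which a spell checker may then reject.
        counts.update(word.lower() for word in WORD_RE.findall(line))

    result: List[str] = []
    for word, _count in counts.most_common():
        if len(result) >= n:
            break
        if is_valid is None or is_valid(word):
            result.append(word)
    return result


def voikko_predicate() -> Callable[[str], bool]:
    """Build a validity check from libvoikko (hypothetical helper name)."""
    from libvoikko import Voikko  # imported lazily; optional dependency

    return Voikko("fi").spell
```

With libvoikko available, something like 
most_common_words(open("corpus.txt", encoding="utf-8"), 50000, 
voikko_predicate()) would produce the filtered list; corpus.txt is a 
placeholder for whatever text source is used.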

The resources that are used to build dictionaries for Voikko mostly lack 
the information on how common particular words are so they won't be very 
useful for this on their own.

