[libvoikko] libvoikko and list of Finnish words
Harri Pitkänen
hatapitk at iki.fi
Sat Jan 11 18:15:25 EET 2020
Hi!
Marko Myllynen wrote on 2020-01-10 15:02:
> Hi,
>
> ibus-typing-booster
> (https://mike-fabian.github.io/ibus-typing-booster/)
> is a completion input method to speedup typing.
>
> Mike (CC'ed) kindly added Finnish spelling check support to
> ibus-typing-booster recently using the Python interface to libvoikko
> and this seems to be working nicely.
>
> However, it is not clear is it possible to use libvoikko to provide
> list of Finnish words for word completion with ibus-typing-booster.
> The latest ibus-typing-booster utilizes the Finnish ispell dictionary
> from http://ispell-fi.sourceforge.net/finnish.dict.bz2 but using this
> almost 20 years old dict file doesn't seem ideal.
>
> Would this kind of functionality of providing list of Finnish words for
> word completion be in scope for libvoikko and if so how to use it?
We do not have anything for this particular purpose but I see some
options that could work.
- For internal testing I wrote a script to extract words from Finnish
Wikipedia. The script has some comments on how to filter its output to
get the N most common words. After that you can use the voikkospell
command-line tool to filter out English words etc. if you like. But this
is not going to give good results for mobile use, where people often
write SMS/WhatsApp-style messages containing words that are unlikely to
be found in Wikipedia. It might work OK for more formal document editing.
- Get access to a Finnish corpus that contains samples of less formal
language and use that (perhaps combining with Wikipedia texts or
something similar). Maybe the Suomi24 corpus would be good
(http://urn.fi/urn:nbn:fi:lb-2017021630)? Unfortunately that corpus is
not freely available at the moment but it might be possible to get the
list of most common words extracted from it. If anyone on this list
knows more about this, please let us know.
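Both options above come down to the same two steps: count word
frequencies in some corpus, then keep the most common words that pass
the spell checker. Below is a rough sketch of that idea, not the
Wikipedia script mentioned above. The function names, the tokenization
regex and the lowercasing are my own assumptions; the only libvoikko
calls used are the Voikko("fi") constructor and spell() from the Python
interface.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

# Assumed tokenization: runs of letters, optionally joined by a hyphen
# or an apostrophe. Real corpora will need more careful handling.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def most_common_words(
    lines: Iterable[str],
    n: int,
    is_valid: Callable[[str], bool] = lambda word: True,
) -> List[str]:
    """Return up to n of the most frequent words accepted by is_valid."""
    counts: Counter = Counter()
    for line in lines:
        # Lowercasing merges "Talo" and "talo", but it also means proper
        # nouns are checked in lowercase form (a known simplification).
        counts.update(m.group(0).lower() for m in WORD_RE.finditer(line))

    result: List[str] = []
    for word, _count in counts.most_common():
        if is_valid(word):
            result.append(word)
            if len(result) == n:
                break
    return result


def voikko_checker() -> Callable[[str], bool]:
    """Build a validity check backed by libvoikko (must be installed)."""
    from libvoikko import Voikko  # imported lazily so the rest runs without it

    voikko = Voikko("fi")
    return voikko.spell
```

With libvoikko and a Finnish dictionary installed, something like
most_common_words(open("corpus.txt", encoding="utf-8"), 50000,
voikko_checker()) would produce the list (corpus.txt is a placeholder
file name). The ordering is only as good as the corpus: word counts
taken from Wikipedia will still under-represent informal language.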
The resources that are used to build dictionaries for Voikko mostly lack
the information on how common particular words are so they won't be very
useful for this on their own.
Harri