[libvoikko] Voikko and upcoming changes to Firefox extensions

Sjur Moshagen sjurnm at mac.com
Fri Apr 7 16:58:14 EEST 2017


I’ll comment only on those parts I have an opinion on:

On 5 Apr 2017, at 13:07, Henri Sivonen <hsivonen at hsivonen.fi> wrote:
> At this point, I'd like to understand if scenario #2 can be made
> feasible.

I like that :)

> Can the code size impact be made smaller (reduce cons)? Can
> the addressable audience be larger than what the present situation of
> Finnish and Greenlandic having Firefox extensions suggests (increase
> pros)?

Yes, see below.

> For the addressable audience:
> * Does there exist languages with very large numbers of users for
> which a Hunspell dictionary cannot exist or for which a Hunspell
> dictionary necessarily results in a poor user experience but for which
> a VFST/HFST dictionary is likely to come into existence and would
> yield a markedly better user experience?

For the purposes of this discussion, you can split the world’s languages into three groups:

1) languages with simple morphology and simple (morpho)phonology
2) languages with complex but concatenative morphology and simple (morpho)phonology
3) the rest

1) This group can be served (more or less, depending on things like Unicode support) by Aspell and similar limited solutions. English and the Romance languages fall into this group, as do many others.

2) This group can be served by Hunspell, although the picture is not always straightforward. E.g. Norwegian, which is in most respects more like group 1), has a somewhat complex morphology when it comes to compounding, and so far no one has been able to build a reliable and linguistically sound handling of compounding for it using Hunspell. There is a general point to this that I will return to below.

3) The rest of the languages of the world. This includes languages such as Finnish and Greenlandic, all Sámi languages, most of the First Nations/indigenous languages of the Americas, most Uralic languages in Russia, all Turkic languages, aboriginal languages in Australia, and many more. Many of these languages have small speaker communities and are threatened, but e.g. the speakers of the Turkic languages number in the tens of millions.

That is, the number of people who would benefit from language technology and spellers based on fst technology easily approaches hundreds of millions.

Will anyone build spellers for them using fst technology? That is a very different question, and in many cases the answer will be no, for several reasons, among others:

- people with existing Hunspell-based solutions won’t switch even if fst technology is superior - too much work
- Hunspell is perceived as simple, and has a certain traction among programmers without linguistic background, so they tend to use Hunspell even in cases where it is far from suitable
- fst-based solutions (I can only speak for hfst-ospell, that’s the only one I have extensive experience with) have so far not met expectations when it comes to suggestion quality, and partially speed:

-- suggestion quality: I see this partly as a result of lack of experience, and partly as a result of (so far) trying to build everything using general fst solutions. It might be that general fst solutions are too heavy for a specialised application like spellers, but it is also certainly the case that we in the Giella community know too little about how to build good error model fsts for speller corrections.

-- speed: for our most complex languages hfst-ospell is sometimes too slow. This is partially a result of the complexity of the language, and partially a result of the size of the lexicon. It could also be the result of how the fst is constructed - years of different people doing things in different ways will not always help :) The complexity and size of the lexicon also means that any approximation to a Hunspell-based solution would most likely also have speed issues. The only solution to this will be traditional programming work: trying to find execution bottlenecks, and then optimise and improve based on the findings.
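To make the "error model" idea above a bit more concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how hfst-ospell works and not its API: the lexicon is a plain set of a few Finnish word forms instead of an fst acceptor, and the error model is brute-force edit distance 1 instead of an error model fst. The names LEXICON, ALPHABET, accepts, edits1 and suggest are all made up for illustration.

```python
import string

# Toy lexicon standing in for a real acceptor (hypothetical data).
# A real fst lexicon would accept far more inflected forms than can be listed.
LEXICON = {"talo", "talon", "talossa", "kala", "kalan"}

# Characters the toy error model may insert or substitute (an assumption).
ALPHABET = string.ascii_lowercase + "äö"


def accepts(word):
    """Return True if the toy lexicon accepts the word."""
    return word in LEXICON


def edits1(word):
    """Return all strings one edit away: delete, transpose, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def suggest(word):
    """Return sorted corrections for a rejected word, or [] if it is accepted."""
    if accepts(word):
        return []
    return sorted(c for c in edits1(word) if accepts(c))


print(suggest("tallo"))
print(suggest("kaln"))
```

Every candidate here has the same weight, so the suggestions can only be sorted alphabetically. Ranking the candidates well (which errors are likely for this language, this keyboard, this kind of writer) is the part where experience with building good error models matters.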

I am certain that fst-based spellers (hfst-ospell, vfst, or something else) will develop and only improve from here, and the technology has proved itself to be both good enough and the only working solution for most of the languages we work on within the Giella infrastructure.

That is, I really hope you can make the case that we need voikkospell/hfst-ospell or some such as an alternative to (and in parallel with) Hunspell in Firefox. There are just too many language communities that would not be served otherwise :)

> * In the light of the previous question, what's the deal with Russian
> showing up at https://gtsvn.uit.no/langtech/trunk/langs/rus/ ?
> (Russian does appear to have a Hunspell dictionary.)

It is accidental. We normally only do minority languages in Tromsø, but from time to time a majority language is also included, either because we need it for a minority language application (machine translation) or because someone wants to work on it and at the same time likes our infrastructure and what it provides. This is the case for Russian in the Giella infrastructure :)


> P.S. http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/ talks
> about GPLv3 rather than the Apache License 2.0. GitHub indicates that
> LGPLv3 applies to the hfst repo while Apache License 2.0 applies to
> hfst-ospell repo. It would be good to clarify this on helsinki.fi.
> -- 
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
> _______________________________________________
> Libvoikko mailing list
> Libvoikko at lists.puimula.org
> http://lists.puimula.org/listinfo/libvoikko
