[libvoikko] Voikko and upcoming changes to Firefox extensions

Tue Apr 25 16:08:22 EEST 2017

On Sun, Apr 23, 2017 at 11:42 AM, Harri Pitkänen <hatapitk at iki.fi> wrote:
> Henri Sivonen wrote on 2017-04-05 13:07:
>>
>> My original questions about Voikko licensing got Warnocked, but
>> assuming that my reading (as expressed in this thread) is correct, it
>> seems that
>>  1) None of the obvious technical solutions for continued
>> Firefox/Voikko interop are ruled out by licensing.
>
> Yes. Just make sure not to enable the legacy Malaga backend that is GPL
> only (I have already deleted it in Git master so there will be no risk of
> accidentally enabling that in the next release.)

OK. Thanks.

>> At this point, I'd like to understand if scenario #2 can be made
>> feasible. Can the code size impact be made smaller (reduce cons)? Can
>> the addressable audience be larger than what the present situation of
>> Finnish and Greenlandic having Firefox extensions suggests (increase
>> pros)?
>>
>> For code size:
>>
>>  * What's the purpose of the VFST vs. HFST distinction for Finnish vs.
>> everything else in libvoikko? Is VFST superior for Finnish and,
>> therefore, going to stay? Or is HFST more capable than VFST and,
>> therefore, a migration from VFST to HFST expected for Finnish? (Sorry
>> about basic questions like these. I have no clue about the underlying
>> tech.)
>
> This is a good question that should be in our FAQ at our web site. I will
> try to answer it briefly:
>
> The transducer formats (VFST and HFST) are not really language dependent.
> Their main difference is that HFST is designed for speed and VFST is
> designed for low memory footprint and small runtime code size. We can
> convert spellers between these formats so this alone does not explain
> why Finnish uses VFST and the rest use HFST.
>
> Libvoikko dictionary format 3 (or the "HFST backend" as we often say)
> provides spell checking and spelling suggestions using two transducers:
> an acceptor (determines which words are "correct") and an error model
> (how to produce suggestions for a misspelled word). To generate
> suggestions we need to run these two in parallel. Testing has shown that
> for this to work fast enough in the real world we really need to use
> HFST. We have experimented with a similar arrangement with VFST
> transducers and it works but may be too slow to use in the real world.
>
> The Finnish speller in libvoikko was not originally based on finite state
> technologies. Thus it does some things like generating the spelling
> suggestions with language dependent C++ code. It also has features that
> allow us to do things that cannot be expressed using regular expressions
> (this is the limitation with FSTs). These features are implemented in
> dictionary format 5 (the "Finnish VFST backend"). So going back to your
> question: yes, we can say that "VFST is superior for Finnish" but that
> is not really due to the file format but the other features in that
> backend code.
>
> So could we drop VFST format just by changing the file format in dictionary
> format 5 without dropping the other Finnish specific code? Yes we could.
> But this would have very small impact on the code size. The total length of
> *.cpp files in
>
>   https://github.com/voikko/corevoikko/tree/master/libvoikko/src/fst
>
> is just around 1000 lines. Half of that is for weighted VFST which is an
> experimental feature and already disabled by default. So dropping support
> for VFST format would decrease the code size just by 500 lines. I don't
> think that is worth the effort. After all, this is the format that is
> optimized for low memory use :)

Do you mean that in the scenario of dropping only 500 lines for VFST,
Finnish-specific suggestion code would still be retained elsewhere
(outside the fst directory)? If the VFST code in libvoikko is so
small, how come hfst-ospell is so large?
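
For illustration, the acceptor plus error model arrangement described in
the quoted text above can be sketched as follows. This is a toy model
under loud assumptions: plain Python sets and generators stand in for the
two transducers, the word list and the weights are made up, and none of
the names (ACCEPTED, error_model, suggest) come from libvoikko or
hfst-ospell.

```python
import string

# Toy stand-in for the acceptor transducer: the set of "correct" words.
# (Made-up word list, not a real Voikko dictionary.)
ACCEPTED = {"talo", "talot", "tulo", "sana", "sanat"}

ALPHABET = string.ascii_lowercase + "äö"


def error_model(word):
    """Yield (candidate, weight) pairs one edit away from `word`.

    Toy stand-in for the error-model transducer; the weights are invented.
    """
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        for ch in ALPHABET:
            yield head + ch + tail, 1.0  # insertion
        if tail:
            yield head + tail[1:], 1.0  # deletion
            for ch in ALPHABET:
                if ch != tail[0]:
                    yield head + ch + tail[1:], 1.0  # substitution
        if len(tail) > 1:
            yield head + tail[1] + tail[0] + tail[2:], 0.5  # transposition


def spell(word):
    """Spell checking only needs the acceptor."""
    return word in ACCEPTED


def suggest(word, limit=5):
    """Keep error-model candidates that the acceptor also accepts."""
    best = {}
    for candidate, weight in error_model(word):
        if candidate in ACCEPTED and weight < best.get(candidate, float("inf")):
            best[candidate] = weight
    # Lowest weight first, ties broken alphabetically.
    return sorted(best, key=lambda c: (best[c], c))[:limit]
```

Here suggest("tlao") returns ["talo"]. A real implementation would walk
both transducers together instead of enumerating every candidate, which
is presumably where the speed difference between the formats shows up.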

>>  * To what extent do the grammar checking and hyphenation functions
>> rely on the analysis code from spellchecking? That is, can substantial
>> code size reductions be expected from excluding grammar checking and
>> hyphenation from the build? (No disrespect for grammar checking or
>> hyphenation implied. It just happens that currently Firefox doesn't
>> support grammar checking at all and the hyphenation infrastructure in
>> Firefox already supports Finnish.)
>
>
> Dictionary format 3 (language independent HFST spellers) does not support
> these features so nothing can be removed there.
>
> Dictionary format 5 provides grammar checker and hyphenation functions
> for Finnish. These share the analysis code that the spell checker uses.
> They cannot currently be disabled at build time. I would expect that
> excluding these would indeed reduce the code size. There are also other
> supporting functions such as tokenizer that could be disabled. If you
> are seriously considering including libvoikko in Firefox I can add build
> switches to make these optional. It is hard to estimate the impact without
> trying because this is object oriented C++ code where SLOC count is not
> necessarily a good estimate for final code size.

I'd really like to get libvoikko included in Firefox, and if it's not much
work for you to test the code size without grammar checking and
hyphenation, it would be great to have the code size data.

However, I'm shy to ask you to do a lot of work to investigate,
because, on the one hand, I have a hard time seeing how Finnish-only
libvoikko would be accepted as a statically-linked part of Firefox,
but, on the other hand, I also have a hard time seeing how any code
size reduction in libvoikko proper could be enough if hfst-ospell adds
125 KB. Is there any opportunity to make hfst-ospell smaller than
that? In particular, does libvoikko use only a part of hfst-ospell?
I.e. is there effectively dead code (in the no grammar checking and no
hyphenation scenario) on the hfst-ospell side?

If the code size ends up in the ballpark that I currently think it's
in (212 KB for libvoikko itself and 125 KB for hfst-ospell), it's
quite likely that outright inclusion (in the sense of static linking)
won't fly, and I should move on to the dynamic linking scenarios from
my earlier email. It would be sad to abandon the static linking
scenario based on a mistaken idea of the code size impact, though.

A new technical question that arose due to changes that we intend to
apply to Hunspell integration:
Does libvoikko care which thread it is being called on if all the
calls are on the same thread? I.e. can it be called on a non-main
thread as long as it's consistently always the same thread?
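
To make the question concrete, the calling pattern I have in mind looks
roughly like the sketch below: one dedicated worker thread creates the
speller handle and is the only thread that ever touches it, while other
threads hand it work through a queue. ToySpeller and SpellerThread are
hypothetical names invented for this illustration. They are not libvoikko
or Firefox APIs, and whether libvoikko tolerates this pattern is exactly
what is being asked.

```python
import queue
import threading
from concurrent.futures import Future


class ToySpeller:
    """Hypothetical stand-in for a native speller handle (not libvoikko)."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the handle

    def spell(self, word):
        # Fail loudly if called from any thread other than the owner.
        assert threading.get_ident() == self.owner
        return word in {"talo", "sana"}


class SpellerThread:
    """Runs every speller call on one dedicated non-main thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        speller = ToySpeller()  # created on the worker thread
        while True:
            job = self._jobs.get()
            if job is None:  # shutdown sentinel
                break
            word, future = job
            try:
                future.set_result(speller.spell(word))
            except BaseException as exc:
                future.set_exception(exc)

    def spell(self, word):
        """Callable from any thread; blocks until the worker answers."""
        future = Future()
        self._jobs.put((word, future))
        return future.result()

    def close(self):
        self._jobs.put(None)
        self._thread.join()
```

A caller on the main thread would create one SpellerThread, call its
spell() method as needed, and call close() at shutdown.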

On Fri, Apr 7, 2017 at 4:58 PM, Sjur Moshagen <sjurnm at mac.com> wrote:
> -- suggestion quality: I see this partly as a result of lack of experience, and partly as a result of (so far) trying to build everything using general fst solutions. It might be that general fst solutions are too heavy for a specialised application like spellers, but it is also certainly so that we in the Giella community know too little about how to build good error model fst’s for speller corrections.
>
> -- speed: for our most complex languages hfst-ospell is sometimes too slow. This is partially a result of the complexity of the language, and partially a result of the size of the lexicon. It could also be the result of how the fst is constructed - years of different people doing things in different ways will not always help :) The complexity and size of the lexicon also means that any approximation to a Hunspell-based solution would most likely also have speed issues. The only solution to this will be traditional programming work: trying to find execution bottlenecks, and then optimise and improve based on the findings.

Hmm. These issues seem worrying. It's unfortunate if these issues
can't be easily quantified relative to Hunspell for a simple sales
pitch. :-/

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/