[libvoikko] Voikko and upcoming changes to Firefox extensions
Harri Pitkänen
hatapitk at iki.fi
Sun Apr 23 11:42:28 EEST 2017
Hi!
(Sorry for replying so late. I was in the middle of a significant
(positive) event in my personal life when this thread started and am
just starting to recover...)
Henri Sivonen wrote on 2017-04-05 13:07:
> My original questions about Voikko licensing got Warnocked, but
> assuming that my reading (as expressed in this thread) is correct, it
> seems that
> 1) None of the obvious technical solutions for continued
> Firefox/Voikko interop are ruled out by licensing.
Yes. Just make sure not to enable the legacy Malaga backend, which is
GPL only. (I have already deleted it in Git master, so there will be no
risk of accidentally enabling it in the next release.)
> At this point, I'd like to understand if scenario #2 can be made
> feasible. Can the code size impact be made smaller (reduce cons)? Can
> the addressable audience be larger than what the present situation of
> Finnish and Greenlandic having Firefox extensions suggests (increase
> pros)?
>
> For code size:
>
> * What's the purpose of the VFST vs. HFST distinction for Finnish vs.
> everything else in libvoikko? Is VFST superior for Finnish and,
> therefore, going to stay? Or is HFST more capable than VFST and,
> therefore, a migration from VFST to HFST expected for Finnish? (Sorry
> about basic questions like these. I have no clue about the underlying
> tech.)
This is a good question that should be in the FAQ on our web site. I
will try to answer it briefly:
The transducer formats (VFST and HFST) are not really language
dependent. Their main difference is that HFST is designed for speed and
VFST is designed for a low memory footprint and small runtime code size.
We can convert spellers between these formats, so this alone does not
explain why Finnish uses VFST and the rest use HFST.
Libvoikko dictionary format 3 (or the "HFST backend", as we often say)
provides spell checking and spelling suggestions using two transducers:
an acceptor (which determines which words are "correct") and an error
model (which describes how to produce suggestions for a misspelled
word). To generate suggestions we need to run these two in parallel.
Testing has shown that for this to work fast enough in the real world we
really need to use HFST. We have experimented with a similar arrangement
using VFST transducers; it works, but may be too slow for real-world
use.
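
As a rough illustration of how an acceptor and an error model cooperate,
here is a toy Python sketch. It is not how HFST or VFST work internally:
real spellers use finite state transducers, while this sketch stands in
a plain set for the acceptor and a one-edit candidate generator for the
error model. All names, weights and the tiny lexicon are made up for the
example.

```python
# Toy illustration only. Real HFST/VFST spellers use finite state
# transducers; the names, weights and lexicon here are hypothetical.

ALPHABET = "abcdefghijklmnopqrstuvwxyzäö"


def acceptor(word, lexicon):
    """Acceptor: decides whether a word is 'correct'."""
    return word in lexicon


def error_model(word):
    """Error model: yield (candidate, weight) pairs one edit away from word."""
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            yield head + tail[1:], 1.0                  # deletion
            for c in ALPHABET:
                if c != tail[0]:
                    yield head + c + tail[1:], 1.0      # substitution
        if len(tail) > 1:
            # transposition of two adjacent letters, assumed cheaper
            yield head + tail[1] + tail[0] + tail[2:], 0.5
        for c in ALPHABET:
            yield head + c + tail, 1.0                  # insertion


def suggest(word, lexicon, limit=5):
    """Keep the error model candidates that the acceptor approves,
    ordered by lowest weight first (ties broken alphabetically)."""
    best = {}
    for candidate, weight in error_model(word):
        if acceptor(candidate, lexicon):
            if candidate not in best or weight < best[candidate]:
                best[candidate] = weight
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return [candidate for candidate, _ in ranked][:limit]


LEXICON = {"kissa", "kisa", "koira", "kissat"}
print(suggest("kisas", LEXICON))
```

The point of the sketch is only that every candidate from the error
model has to be checked against the acceptor, which is why the speed of
running the two together matters so much.
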
The Finnish speller in libvoikko was not originally based on finite
state technologies. Thus it does some things, like generating the
spelling suggestions, with language-dependent C++ code. It also has
features that allow us to do things that cannot be expressed using
regular expressions (this is the limitation of FSTs). These features are
implemented in dictionary format 5 (the "Finnish VFST backend"). So
going back to your question: yes, we can say that "VFST is superior for
Finnish", but that is not really due to the file format but to the other
features in that backend code.
So could we drop the VFST format just by changing the file format in
dictionary format 5, without dropping the other Finnish-specific code?
Yes, we could. But this would have a very small impact on the code size.
The total length of the *.cpp files in
https://github.com/voikko/corevoikko/tree/master/libvoikko/src/fst
is just around 1000 lines. Half of that is for weighted VFST, which is
an experimental feature and already disabled by default. So dropping
support for the VFST format would decrease the code size by only about
500 lines. I don't think that is worth the effort. After all, this is
the format that is optimized for low memory use :)
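
For anyone who wants to check a line count like this against a local
checkout, a minimal Python sketch follows. The function name is made up
for the example, and it only looks at *.cpp files directly inside the
given directory (no subdirectories), which is an assumption about how
the count should be taken.

```python
from pathlib import Path


def count_cpp_lines(directory):
    """Return the total number of lines in the *.cpp files that sit
    directly in `directory` (subdirectories are not searched)."""
    total = 0
    for path in sorted(Path(directory).glob("*.cpp")):
        # errors="replace" so that an odd byte does not abort the count
        with path.open(encoding="utf-8", errors="replace") as source:
            total += sum(1 for _ in source)
    return total


# Example call, assuming a checkout in the current directory:
# count_cpp_lines("corevoikko/libvoikko/src/fst")
```

Keep in mind that this counts raw lines, including comments and blank
lines, so it is only a rough proxy for compiled code size.
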
>
> * To what extent do the grammar checking and hyphenation functions
> rely on the analysis code from spellchecking? That is, can substantial
> code size reductions be expected from excluding grammar checking and
> hyphenation from the build? (No disrespect for grammar checking or
> hyphenation implied. It just happens that currently Firefox doesn't
> support grammar checking at all and the hyphenation infrastructure in
> Firefox already supports Finnish.)
Dictionary format 3 (language independent HFST spellers) does not
support these features, so nothing can be removed there.
Dictionary format 5 provides grammar checking and hyphenation functions
for Finnish. These share the analysis code that the spell checker uses.
They cannot currently be disabled at build time. I would expect that
excluding these would indeed reduce the code size. There are also other
supporting functions, such as the tokenizer, that could be disabled. If
you are seriously considering including libvoikko in Firefox, I can add
build switches to make these optional. It is hard to estimate the impact
without trying, because this is object-oriented C++ code where the SLOC
count is not necessarily a good estimate of the final code size.
Harri