[libvoikko] grammar checking in libvoikko
Harri Pitkänen
hatapitk at iki.fi
Tue Sep 17 18:23:15 EEST 2013
On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> How would you recommend the data for the grammar checker be
> distributed? In a single file, as with .zhfst? The checker needs two files:
> (1) the descriptive morphological analyser, (2) the grammar checker
> rules. The first will be an HFSTOL transducer, and the second a VISLCG3
> binary format file. Do you have any preference for how it should be laid
> out? E.g. a zip file in ~/.voikko/ or something else?
The preferred structure would be something like this:
~/.voikko/4/sme/index
~/.voikko/4/sme/transducer.hfstol
~/.voikko/4/sme/vislcg3_binary_file
~/.voikko/4/sme/...
- Number 4 is the format version (that is the next unused version number at
the moment).
- Under that you have a directory for each language.
- "index" would be a plain text file with the necessary dictionary metadata
(similar to voikko-fi_FI.pro in format 2 but possibly simpler).
- Other files you can name as you wish, and you may have as many as you like.
If you want to put everything in a zip file you could just zip the structure
above and have ~/.voikko/4/sme.zip. But personally I prefer using unpacked
formats since all distribution tools (rpm, deb, oxt, msi, apk, ...) will
handle packing and compressing anyway so it rarely matters for the end users.
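
As an illustration of the layout above, here is a minimal sketch of how
the per-language paths could be built. The helper names dictionaryDir and
indexPath are hypothetical (they are not existing libvoikko functions),
and the home directory is passed in rather than detected.

```cpp
#include <string>

// Hypothetical helper, not part of libvoikko: returns the directory that
// would hold the format 4 data for one language, e.g. ~/.voikko/4/sme.
std::string dictionaryDir(const std::string & home, const std::string & language) {
    const std::string formatVersion = "4"; // next unused format version
    return home + "/.voikko/" + formatVersion + "/" + language;
}

// Hypothetical helper: path of the plain text "index" metadata file.
// The other files in the directory can be named freely, so only the
// index has a fixed name in this sketch.
std::string indexPath(const std::string & home, const std::string & language) {
    return dictionaryDir(home, language) + "/index";
}
```

With home set to /home/user and language set to sme, indexPath gives
/home/user/.voikko/4/sme/index.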
> > To do all this without breaking the existing code we need to build an
> > abstract GrammarChecker superclass and extend it with two subclasses, one
> > for the existing implementation and another for your new implementation.
> > Exactly the same has been done with Analyzer, SpellChecker and others. I
> > can help you with that and some other small things that will be needed
> > such as changes to the LibreOffice plugin.
>
> Great, thanks! When you say lines 84-134 do you mean this method:
>
> void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> wchar_t * text, size_t textlen) {
Yes. There is some general purpose stuff happening at the start of that
function but the rest of the function is where the implementation specific
things happen.
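
To make the superclass idea quoted above more concrete, here is a rough
sketch of how the split could look. It is only an outline under
assumptions: the names GcError, LegacyGrammarChecker, CgGrammarChecker,
findErrors and createGrammarChecker are made up for this example, and the
real class layout in libvoikko may end up different. The general purpose
part stays in the base class and the implementation specific part is a
virtual method.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical error record; the real libvoikko error type is not shown.
struct GcError {
    std::size_t startPos;
    std::size_t length;
    std::string description;
};

class GrammarChecker {
public:
    virtual ~GrammarChecker() {}

    // General purpose part shared by all implementations.
    std::vector<GcError> checkParagraph(const std::wstring & text) {
        if (text.empty()) {
            return std::vector<GcError>();
        }
        return findErrors(text);
    }

protected:
    // Implementation specific part, provided by each subclass.
    virtual std::vector<GcError> findErrors(const std::wstring & text) = 0;
};

// Stand-in for the existing implementation (body is a placeholder).
class LegacyGrammarChecker : public GrammarChecker {
protected:
    std::vector<GcError> findErrors(const std::wstring &) override {
        return std::vector<GcError>();
    }
};

// Stand-in for the new HFST + CG implementation (body is a placeholder).
class CgGrammarChecker : public GrammarChecker {
protected:
    std::vector<GcError> findErrors(const std::wstring &) override {
        return std::vector<GcError>();
    }
};

// Hypothetical factory: picks the implementation by backend name.
std::unique_ptr<GrammarChecker> createGrammarChecker(const std::string & backend) {
    if (backend == "cg") {
        return std::unique_ptr<GrammarChecker>(new CgGrammarChecker());
    }
    return std::unique_ptr<GrammarChecker>(new LegacyGrammarChecker());
}
```

Callers would only see GrammarChecker, so the existing code path keeps
working while the new one is developed.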
> As far as I can see, I need to replace:
>
> analysis.cpp : gc_analyze_paragraph
>                gc_analyze_sentence
>                gc_analyze_token
>
> with methods that use the HFST optimised lookup library to analyse
> individual words. Actually, probably only gc_analyze_token.
>
> Then I need to replace:
>
> cache.cpp : gc_paragraph_to_cache
>
> with a method that takes the sentences with analyses from HFST and
> passes each one through the CG and collects the error tags.
Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the
built-in tokenizer in libvoikko is good enough for you. You can try
    echo "paragraph text" | voikkogc --tokenize
to see what happens. Originally the tokenizer was built for Finnish only, which
may or may not be a problem.
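
For reference, the flow described in the quoted plan (analyse each token,
then pass the analysed sentence through the rules and collect the error
tags) could be outlined roughly like this. Everything here is a
placeholder: AnalyzedToken, TokenAnalyzer, SentenceChecker and
checkSentence are names invented for the sketch, and the two callbacks
stand in for the HFST lookup and the CG rule application without using
either library's real API.

```cpp
#include <functional>
#include <string>
#include <vector>

// One token together with the readings the analyser returned for it.
struct AnalyzedToken {
    std::wstring surface;
    std::vector<std::wstring> readings;
};

// Stand-in for the per-token analysis step (e.g. a transducer lookup).
typedef std::function<std::vector<std::wstring>(const std::wstring &)> TokenAnalyzer;

// Stand-in for the rule step: takes an analysed sentence, returns error tags.
typedef std::function<std::vector<std::wstring>(const std::vector<AnalyzedToken> &)>
    SentenceChecker;

// Analyses every token of one sentence, then hands the whole sentence to
// the rule step and returns whatever error tags it reports.
std::vector<std::wstring> checkSentence(const std::vector<std::wstring> & tokens,
                                        const TokenAnalyzer & analyze,
                                        const SentenceChecker & applyRules) {
    std::vector<AnalyzedToken> sentence;
    for (const std::wstring & token : tokens) {
        AnalyzedToken analyzed;
        analyzed.surface = token;
        analyzed.readings = analyze(token);
        sentence.push_back(analyzed);
    }
    return applyRules(sentence);
}
```

The tokens could come from the built-in tokenizer or from a separate one;
the sketch does not depend on which.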
Harri