[libvoikko] grammar checking in libvoikko

Tue Sep 17 18:23:15 EEST 2013

On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> How would you recommend the data for the grammar checker be
> distributed ? In a file like with .zhfst ? The checker needs two files,
> (1) the descriptive morphological analyser, (2) the grammar checker
> rules. The first will be an HFSTOL transducer, and the second a VISLCG3
> binary format file. Do you have any preference for how it should be laid
> out ? e.g. a zip file in ~/.voikko/ or something else ?

The preferred structure would be something like this:

  ~/.voikko/4/sme/index
  ~/.voikko/4/sme/transducer.hfstol
  ~/.voikko/4/sme/vislcg3_binary_file
  ~/.voikko/4/sme/...

 - Number 4 is the format version (that is the next unused version number at 
the moment).
 - Under that you have a directory for each language.
 - "index" would be a plain text file with the necessary dictionary metadata 
(similar to voikko-fi_FI.pro in format 2 but possibly simpler).
 - Other files you can name as you wish, and you may have as many as you like.
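
As a rough sketch of how that layout maps to paths in code: the helper names below (languageDir, indexPath, dataFilePath) are made up for illustration and are not part of libvoikko, and the root directory is passed in rather than discovered.

```cpp
#include <string>

// Hypothetical helper (not libvoikko API): directory holding the data for
// one language under a given format version, e.g. <root>/4/sme.
std::string languageDir(const std::string & root, int formatVersion,
                        const std::string & language) {
    return root + "/" + std::to_string(formatVersion) + "/" + language;
}

// Hypothetical helper: path of the plain-text "index" metadata file.
std::string indexPath(const std::string & langDir) {
    return langDir + "/index";
}

// Hypothetical helper: path of any other data file in the language
// directory; the file names themselves are up to the dictionary author.
std::string dataFilePath(const std::string & langDir,
                         const std::string & fileName) {
    return langDir + "/" + fileName;
}
```

For example, languageDir(root, 4, "sme") followed by dataFilePath(dir, "transducer.hfstol") yields the second path in the listing above when root is the ~/.voikko directory.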

If you want to put everything in a zip file you could just zip the structure 
above and have ~/.voikko/4/sme.zip. Personally I prefer unpacked formats,
since all distribution tools (rpm, deb, oxt, msi, apk, ...) handle packing
and compression anyway, so it rarely matters to end users.

> > To do all this without breaking the existing code we need to build an
> > abstract GrammarChecker superclass and extend it with two subclasses, one
> > for the existing implementation and another for your new implementation.
> > Exactly the same has been done with Analyzer, SpellChecker and others. I
> > can help you with that and some other small things that will be needed
> > such as changes to the LibreOffice plugin.
> 
> Great, thanks! When you say lines 84-134 do you mean this method:
> 
> void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> wchar_t * text, size_t textlen) {

Yes. There is some general-purpose code at the start of that function, but
the rest of the function is where the implementation-specific things
happen.
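
A minimal sketch of that split, assuming an abstract base class with one subclass per implementation. All class, method and field names here are illustrative assumptions, not the actual libvoikko interfaces, and both subclasses are empty placeholders.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative error record; the real libvoikko type will differ.
struct GrammarError {
    std::size_t startPos;
    std::size_t length;
    int errorCode;
};

// Hypothetical abstract base: the shared, general-purpose part lives in
// checkParagraph, the implementation-specific part in analyzeParagraph.
class GrammarChecker {
public:
    virtual ~GrammarChecker() {}

    std::vector<GrammarError> checkParagraph(const std::wstring & text) {
        if (text.empty()) {
            return std::vector<GrammarError>(); // nothing to check
        }
        return analyzeParagraph(text);
    }

protected:
    virtual std::vector<GrammarError> analyzeParagraph(const std::wstring & text) = 0;
};

// Placeholder for the existing implementation.
class LegacyGrammarChecker : public GrammarChecker {
protected:
    std::vector<GrammarError> analyzeParagraph(const std::wstring &) override {
        return std::vector<GrammarError>(); // existing rules would run here
    }
};

// Placeholder for the new implementation: HFST analysis followed by the
// CG rules would go here. No HFST or VISLCG3 calls are shown on purpose.
class CgGrammarChecker : public GrammarChecker {
protected:
    std::vector<GrammarError> analyzeParagraph(const std::wstring &) override {
        return std::vector<GrammarError>();
    }
};
```

Callers would then hold a GrammarChecker pointer or reference and never need to know which implementation is behind it.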

> As far as I can see, I need to replace:
> 
> analysis.cpp : gc_analyze_paragraph
>                gc_analyze_sentence
>                gc_analyze_token
> 
> with methods that use the HFST optimised lookup library to analyse
> individual words. Actually, probably only gc_analyze_token.
> 
> Then I need to replace:
> 
> cache.cpp : gc_paragraph_to_cache
> 
> with a method that takes the sentences with analyses from HFST and
> passes each one through the CG and collects the error tags.

Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the 
built-in tokenizer in libvoikko is good enough for you. You can try

  echo "paragraph text" | voikkogc --tokenize

to see what happens. The tokenizer was originally built for Finnish only,
which may or may not be a problem.

Harri