[libvoikko] grammar checking in libvoikko

Francis Tyers ftyers at prompsit.com
Tue Sep 17 18:44:30 EEST 2013


On Tue, 17 Sep 2013 at 18:23 +0300, Harri Pitkänen wrote:
> On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> > How would you recommend the data for the grammar checker be
> > distributed? In a file, as with .zhfst? The checker needs two files:
> > (1) the descriptive morphological analyser and (2) the grammar checker
> > rules. The first will be an HFSTOL transducer, and the second a VISLCG3
> > binary format file. Do you have any preference for how it should be laid
> > out, e.g. a zip file in ~/.voikko/ or something else?
> 
> The preferred structure would be something like this:
> 
>   ~/.voikko/4/sme/index
>   ~/.voikko/4/sme/transducer.hfstol
>   ~/.voikko/4/sme/vislcg3_binary_file
>   ~/.voikko/4/sme/...
> 
>  - Number 4 is the format version (that is the next unused version number at 
> the moment).
>  - Under that you have a directory for each language.
>  - "index" would be a plain text file with the necessary dictionary metadata 
> (similar to voikko-fi_FI.pro in format 2 but possibly simpler).
>  - Other files you can name as you wish, and you may have as many as you like.
> 
> If you want to put everything in a zip file you could just zip the structure 
> above and have ~/.voikko/4/sme.zip. But personally I prefer using unpacked 
> formats since all distribution tools (rpm, deb, oxt, msi, apk, ...) will 
> handle packing and compressing anyway so it rarely matters for the end users.

Having it outside a zip file is fine by me. One thing: could we call it

~/.voikko/4/sme-x-standard/index
etc.?

There may be different grammar checkers for L1 and L2 speakers, although
I don't know how the language codes would handle that.
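
To make the naming concrete, here is a minimal C++ sketch of how a
directory name such as sme-x-standard could be split into a language
code and a variant, and how the dictionary directory could be located.
The names DictionaryId, parseDictionaryId and dictionaryDir are made up
for illustration and are not part of libvoikko. Falling back to the
variant "standard" when there is no "-x-" part is also only an
assumption.

```cpp
#include <string>

// Hypothetical helper type, not part of libvoikko.
struct DictionaryId {
    std::string language;  // e.g. "sme"
    std::string variant;   // e.g. "standard"
};

// Split a name like "sme-x-standard" at the private-use marker "-x-".
// Assumption: a name without "-x-" gets the variant "standard".
DictionaryId parseDictionaryId(const std::string & name) {
    const std::string marker = "-x-";
    const std::string::size_type pos = name.find(marker);
    if (pos == std::string::npos) {
        return DictionaryId{name, "standard"};
    }
    return DictionaryId{name.substr(0, pos), name.substr(pos + marker.size())};
}

// Build "<home>/.voikko/<formatVersion>/<name>", following the proposed layout.
std::string dictionaryDir(const std::string & home, int formatVersion,
                          const std::string & name) {
    return home + "/.voikko/" + std::to_string(formatVersion) + "/" + name;
}
```

With something like this, an L1 checker and an L2 checker could sit side
by side under ~/.voikko/4/ and be told apart by the variant part alone.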

> > > To do all this without breaking the existing code we need to build an
> > > abstract GrammarChecker superclass and extend it with two subclasses, one
> > > for the existing implementation and another for your new implementation.
> > > Exactly the same has been done with Analyzer, SpellChecker and others. I
> > > can help you with that and some other small things that will be needed
> > > such as changes to the LibreOffice plugin.
> > 
> > Great, thanks! When you say lines 84-134 do you mean this method:
> > 
> > void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> > wchar_t * text, size_t textlen) {
> 
> Yes. There is some general purpose stuff happening at the start of that 
> function but the rest of the function is where the implementation specific 
> things happen.
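
As a rough sketch of the superclass arrangement described above, the
general-purpose part could stay in the base class while only the
implementation-specific step is virtual. Apart from the name
GrammarChecker, every class, method and field below is my own guess, and
the real libvoikko interfaces will certainly differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical error record; libvoikko's real error structure is richer.
struct GrammarError {
    std::size_t startPos;
    std::size_t length;
    std::string code;
};

// Abstract base: shared handling lives here, the checking itself is left to subclasses.
class GrammarChecker {
public:
    virtual ~GrammarChecker() {}

    // General-purpose part: skip empty input, then delegate.
    std::vector<GrammarError> checkParagraph(const std::wstring & text) {
        if (text.empty()) {
            return std::vector<GrammarError>();
        }
        return analyzeParagraph(text);
    }

protected:
    // Implementation-specific part.
    virtual std::vector<GrammarError> analyzeParagraph(const std::wstring & text) = 0;
};

// Placeholder for the existing Finnish implementation.
class FinnishGrammarChecker : public GrammarChecker {
protected:
    std::vector<GrammarError> analyzeParagraph(const std::wstring &) override {
        return std::vector<GrammarError>();  // real rules omitted
    }
};

// Placeholder for an HFST + CG based implementation.
class CgGrammarChecker : public GrammarChecker {
protected:
    std::vector<GrammarError> analyzeParagraph(const std::wstring & text) override {
        // Stand-in for: analyse tokens with HFST, run CG, collect error tags.
        // Here a single dummy error spanning the whole paragraph is returned.
        std::vector<GrammarError> errors;
        errors.push_back(GrammarError{0, text.size(), "dummy-error"});
        return errors;
    }
};
```

The caller would only ever see GrammarChecker, so the existing code path
stays untouched when the new subclass is added.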
> 
> > As far as I can see, I need to replace:
> > 
> > analysis.cpp : gc_analyze_paragraph
> >                gc_analyze_sentence
> >                gc_analyze_token
> > 
> > with methods that use the HFST optimised lookup library to analyse
> > individual words. Actually, probably only gc_analyze_token.
> > 
> > Then I need to replace:
> > 
> > cache.cpp : gc_paragraph_to_cache
> > 
> > with a method that takes the sentences with analyses from HFST and
> > passes each one through the CG and collects the error tags.
> 
> Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the 
> built in tokenizer in libvoikko is good enough for you. You can try
> 
>   echo "paragraph text" | voikkogc --tokenize
> 
> to see what happens. Originally the tokenizer was built for Finnish only, which 
> may or may not be a problem.

I tried that, but got the following error...

E: Initialization of Voikko failed: Specified dictionary variant was not found

I think it is because I can't get Suomi-malaga installed...

sed -e "s/VANHAHKOT_MUODOT/yes/; s/VANHAT_MUODOT/no/;
s/VOIKKO_MURRE/no/; s/SUKIJAN_MUODOT/no/; s/SM_VOIKKO_VARIANT/standard/;
s/SM_VOIKKO_DESCRIPTION/suomi (perussanasto)/; s/SM_VERSION/1.14/;
s/SM_PATCHINFO//; s/SM_BUILDCONFIG/GENLEX_OPTS= EXTRA_LEX=/;
s/SM_BUILDDATE/Tue, 17 Sep 2013 15:31:59 +0000/" <
voikko/voikko-fi_FI.pro.in > voikko/voikko-fi_FI.pro
echo "define @voikko_debug := no;" > voikko/config.inc
/bin/sh: malmake: command not found

Any idea where I can find 'malmake'?

Fran



