[libvoikko] grammar checking in libvoikko

Tue Sep 17 23:47:05 EEST 2013

On Tue, 17 Sep 2013 at 15:44 +0000, Francis Tyers wrote:
> On Tue, 17 Sep 2013 at 18:23 +0300, Harri Pitkänen wrote:
> > On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> > > How would you recommend the data for the grammar checker be
> > > distributed? In a file, as with .zhfst? The checker needs two files:
> > > (1) the descriptive morphological analyser and (2) the grammar checker
> > > rules. The first will be an HFSTOL transducer, and the second a VISLCG3
> > > binary format file. Do you have any preference for how it should be laid
> > > out, e.g. a zip file in ~/.voikko/ or something else?
> > 
> > The preferred structure would be something like this:
> > 
> >   ~/.voikko/4/sme/index
> >   ~/.voikko/4/sme/transducer.hfstol
> >   ~/.voikko/4/sme/vislcg3_binary_file
> >   ~/.voikko/4/sme/...
> > 
> >  - Number 4 is the format version (that is the next unused version number at 
> > the moment).
> >  - Under that you have a directory for each language.
> >  - "index" would be a plain text file with the necessary dictionary metadata 
> > (similar to voikko-fi_FI.pro in format 2 but possibly simpler).
> >  - Other files you can name as you wish, and you may have as many as you like.
> > 
> > If you want to put everything in a zip file you could just zip the structure 
> > above and have ~/.voikko/4/sme.zip. But personally I prefer using unpacked 
> > formats since all distribution tools (rpm, deb, oxt, msi, apk, ...) will 
> > handle packing and compressing anyway so it rarely matters for the end users.
> 
> Having it outside a zipfile is fine by me. One thing: could we call it
> 
> ~/.voikko/4/sme-x-standard/index 
> etc.
> 
> It could be that there are different grammar checkers for L1 and L2
> speakers. Although I don't know how that would be handled by the codes.
> 
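
For reference, the layout discussed above could be resolved in code roughly
as follows. This is only a sketch: dictionaryDir and indexFile are names made
up for illustration, not existing libvoikko functions, and the language tag
is treated as an opaque directory name.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds <home>/.voikko/<formatVersion>/<languageTag>,
// e.g. /home/fran/.voikko/4/sme-x-standard.
std::string dictionaryDir(const std::string& home, int formatVersion,
                          const std::string& languageTag) {
    assert(formatVersion > 0 && !languageTag.empty());
    return home + "/.voikko/" + std::to_string(formatVersion) + "/" + languageTag;
}

// Hypothetical helper: path of the plain-text "index" metadata file
// inside a dictionary directory.
std::string indexFile(const std::string& dictDir) {
    return dictDir + "/index";
}
```

With a variant in the directory name, an L1 and an L2 checker would simply
live in two sibling directories under the same format version.
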
> > > > To do all this without breaking the existing code we need to build an
> > > > abstract GrammarChecker superclass and extend it with two subclasses, one
> > > > for the existing implementation and another for your new implementation.
> > > > Exactly the same has been done with Analyzer, SpellChecker and others. I
> > > > can help you with that and some other small things that will be needed
> > > > such as changes to the LibreOffice plugin.
> > > 
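
To make the superclass idea concrete, here is a minimal sketch of what such
a hierarchy might look like. Every name in it (GrammarError, checkParagraph,
the two subclasses and the factory) is an assumption for illustration; the
real libvoikko classes and signatures may well differ.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical error record; the real libvoikko type may look different.
struct GrammarError {
    std::size_t startPos;  // offset of the error within the paragraph
    std::size_t length;    // number of characters covered by the error
    std::string code;      // error code or tag
};

// Abstract interface shared by both implementations.
class GrammarChecker {
public:
    virtual ~GrammarChecker() = default;
    virtual std::string name() const = 0;
    virtual std::vector<GrammarError> checkParagraph(const std::wstring& text) = 0;
};

// Placeholder for the existing Finnish implementation.
class FinnishGrammarChecker : public GrammarChecker {
public:
    std::string name() const override { return "finnish"; }
    std::vector<GrammarError> checkParagraph(const std::wstring&) override {
        return {};  // stub: the existing rule code would run here
    }
};

// Placeholder for the new HFST + CG implementation.
class CgGrammarChecker : public GrammarChecker {
public:
    std::string name() const override { return "cg"; }
    std::vector<GrammarError> checkParagraph(const std::wstring&) override {
        return {};  // stub: analyse with HFST, run CG, collect error tags
    }
};

// Hypothetical factory that picks the implementation by backend name.
std::unique_ptr<GrammarChecker> makeGrammarChecker(const std::string& backend) {
    if (backend == "cg") {
        return std::make_unique<CgGrammarChecker>();
    }
    return std::make_unique<FinnishGrammarChecker>();
}
```

Callers would only ever hold a GrammarChecker pointer, so the existing code
paths would not need to know which implementation is behind it.
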
> > > Great, thanks! When you say lines 84-134 do you mean this method:
> > > 
> > > void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> > > wchar_t * text, size_t textlen) {
> > 
> > Yes. There is some general purpose stuff happening at the start of that 
> > function but the rest of the function is where the implementation specific 
> > things happen.
> > 
> > > As far as I can see, I need to replace:
> > > 
> > > analysis.cpp : gc_analyze_paragraph
> > >                gc_analyze_sentence
> > >                gc_analyze_token
> > > 
> > > with methods that use the HFST optimised lookup library to analyse
> > > individual words. Actually, probably only gc_analyze_token.
> > > 
> > > Then I need to replace:
> > > 
> > > cache.cpp : gc_paragraph_to_cache
> > > 
> > > with a method that takes the sentences with analyses from HFST and
> > > passes each one through the CG and collects the error tags.
> > 
> > Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the 
> > built-in tokenizer in libvoikko is good enough for you. You can try
> > 
> >   echo "paragraph text" | voikkogc --tokenize
> > 
> > to see what happens. Originally the tokenizer was built for Finnish only, 
> > which may or may not be a problem.
> 
> I tried that, but got the following error...
> 
> E: Initialization of Voikko failed: Specified dictionary variant was not
> found
> 
> I think it is because I can't get suomimalaga installed...
> 
> sed -e "s/VANHAHKOT_MUODOT/yes/; s/VANHAT_MUODOT/no/;
> s/VOIKKO_MURRE/no/; s/SUKIJAN_MUODOT/no/; s/SM_VOIKKO_VARIANT/standard/;
> s/SM_VOIKKO_DESCRIPTION/suomi (perussanasto)/; s/SM_VERSION/1.14/;
> s/SM_PATCHINFO//; s/SM_BUILDCONFIG/GENLEX_OPTS= EXTRA_LEX=/;
> s/SM_BUILDDATE/Tue, 17 Sep 2013 15:31:59 +0000/" <
> voikko/voikko-fi_FI.pro.in > voikko/voikko-fi_FI.pro
> echo "define @voikko_debug := no;" > voikko/config.inc
> /bin/sh: malmake: command not found
> 
> Any idea where I can find 'malmake'?

Fixed this (by googling, *shame*); now the tokenisation works:

$ echo "Helga Pedersen (riegádan ođđajagimánu 13. b. 1973
Mátta-Várjjagis) lea Norgga politihkkár Bargiidbellodagas. Son
válljejuvvui Stuorradiggái Finnmárkkus jagi 2009, ja lea leamaš
parlamentáralaš jođiheaddji seamma guhká. Son lea Bargiidbellodaga
nubbinjođiheaddji ja lei Norgga guolástan- ja riddoministtar Jens
Stoltenberga nubbi ráđđehusas 2005-2009. Dalle lei son ráđđehusa
nuoramus áirras. Pedersenis lea sámi kultuvralaš duogáš, ja vuosttaš
sámegielat nubbinjođiheaddji Bargiidbellodagas." | voikkogc --tokenize
W: "Helga"
S: " "
W: "Pedersen"
S: " "
P: "("
W: "riegádan"
S: " "
W: "ođđajagimánu"
S: " "
W: "13"
P: "."
S: " "

...

It looks fine, but with an eye to the future it might pay to have a
language-dependent tokenisation module (e.g. for languages that are
written without spaces).
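
As a rough sketch of what I mean by a language-dependent module: a small
tokenizer interface with one implementation per writing system. All names
here (Tokenizer, SpaceDelimitedTokenizer, TokenType) are hypothetical and
not part of libvoikko, and the character classification is deliberately
naive: whitespace runs, single punctuation characters, and everything else
counts as part of a word.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

enum class TokenType { Word, Whitespace, Punctuation };

struct Token {
    TokenType type;
    std::wstring text;
};

// Hypothetical interface; each language (or script) supplies its own subclass.
class Tokenizer {
public:
    virtual ~Tokenizer() = default;
    virtual std::vector<Token> tokenize(const std::wstring& text) const = 0;
};

// Naive tokenizer for languages that separate words with spaces.
class SpaceDelimitedTokenizer : public Tokenizer {
public:
    std::vector<Token> tokenize(const std::wstring& text) const override {
        std::vector<Token> tokens;
        std::size_t i = 0;
        while (i < text.size()) {
            const std::size_t start = i;
            if (std::iswspace(text[i])) {
                // A run of whitespace becomes one token.
                while (i < text.size() && std::iswspace(text[i])) ++i;
                tokens.push_back({TokenType::Whitespace, text.substr(start, i - start)});
            } else if (std::iswpunct(text[i])) {
                // Each punctuation character is its own token.
                ++i;
                tokens.push_back({TokenType::Punctuation, text.substr(start, 1)});
            } else {
                // Anything else (letters, digits) is accumulated into a word.
                while (i < text.size() && !std::iswspace(text[i]) &&
                       !std::iswpunct(text[i])) ++i;
                tokens.push_back({TokenType::Word, text.substr(start, i - start)});
            }
        }
        return tokens;
    }
};
```

A language written without spaces would provide a different subclass (for
instance one driven by a dictionary or a transducer) behind the same
interface, and the rest of the checker would not have to change.
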

Fran