[libvoikko] grammar checking in libvoikko
Francis Tyers
ftyers at prompsit.com
Tue Sep 17 23:47:05 EEST 2013
On Tue, 17 Sep 2013 at 15:44 +0000, Francis Tyers wrote:
> On Tue, 17 Sep 2013 at 18:23 +0300, Harri Pitkänen wrote:
> > On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> > > How would you recommend the data for the grammar checker be
> > > distributed? In a single file, as with .zhfst? The checker needs two
> > > files: (1) the descriptive morphological analyser, and (2) the grammar
> > > checker rules. The first will be an HFSTOL transducer, and the second a
> > > VISLCG3 binary format file. Do you have any preference for how it should
> > > be laid out, e.g. a zip file in ~/.voikko/ or something else?
> >
> > The preferred structure would be something like this:
> >
> > ~/.voikko/4/sme/index
> > ~/.voikko/4/sme/transducer.hfstol
> > ~/.voikko/4/sme/vislcg3_binary_file
> > ~/.voikko/4/sme/...
> >
> > - Number 4 is the format version (that is the next unused version number at
> > the moment).
> > - Under that you have a directory for each language.
> > - "index" would be a plain text file with the necessary dictionary metadata
> > (similar to voikko-fi_FI.pro in format 2 but possibly simpler).
> > - Other files you can name as you wish, and you may have as many as you like.
> >
> > If you want to put everything in a zip file you could just zip the structure
> > above and have ~/.voikko/4/sme.zip. But personally I prefer using unpacked
> > formats since all distribution tools (rpm, deb, oxt, msi, apk, ...) will
> > handle packing and compressing anyway so it rarely matters for the end users.
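
For illustration, here is a minimal sketch of how the paths in that layout could be put together. The helper names (languageDir, indexPath) are hypothetical and not part of libvoikko; the only things taken from the description above are the base directory, the format version number, the per-language directory and the "index" file name.

```cpp
#include <string>

// Hypothetical helper (not libvoikko API): directory holding one language's
// data, following the layout <base>/<format version>/<language>.
std::string languageDir(const std::string & base, int formatVersion,
                        const std::string & language) {
    return base + "/" + std::to_string(formatVersion) + "/" + language;
}

// Hypothetical helper: path of the plain-text "index" metadata file
// inside a language directory.
std::string indexPath(const std::string & base, int formatVersion,
                      const std::string & language) {
    return languageDir(base, formatVersion, language) + "/index";
}
```

The language is passed as a plain string, so a longer tag would work the same way as "sme". The other file names are left out on purpose, since they can be chosen freely.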
>
> Having it outside a zipfile is fine by me. One thing, could we call it
>
> ~/.voikko/4/sme-x-standard/index
> etc.
>
> It could be that there are different grammar checkers for L1 and L2
> speakers, although I don't know how the language codes would handle that.
>
> > > > To do all this without breaking the existing code we need to build an
> > > > abstract GrammarChecker superclass and extend it with two subclasses, one
> > > > for the existing implementation and another for your new implementation.
> > > > Exactly the same has been done with Analyzer, SpellChecker and others. I
> > > > can help you with that and some other small things that will be needed
> > > > such as changes to the LibreOffice plugin.
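
As a rough sketch of the class structure described above, assuming nothing about the real libvoikko headers: the class names other than GrammarChecker, the GrammarError record, the method names and the factory function are all hypothetical, and both subclasses are empty placeholders.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical error record; the real libvoikko types differ.
struct GrammarError {
    std::size_t startPos;
    std::size_t length;
    std::string code;
};

// Abstract superclass: callers only ever see this interface.
class GrammarChecker {
public:
    virtual ~GrammarChecker() {}
    virtual std::string backendName() const = 0;
    virtual std::vector<GrammarError> checkParagraph(const std::wstring & text) const = 0;
};

// Placeholder standing in for the existing implementation.
class BuiltinGrammarChecker : public GrammarChecker {
public:
    std::string backendName() const override { return "builtin"; }
    std::vector<GrammarError> checkParagraph(const std::wstring &) const override {
        return std::vector<GrammarError>();  // stub: reports nothing
    }
};

// Placeholder standing in for the new HFST + CG implementation.
class CgGrammarChecker : public GrammarChecker {
public:
    std::string backendName() const override { return "cg"; }
    std::vector<GrammarError> checkParagraph(const std::wstring &) const override {
        return std::vector<GrammarError>();  // stub: reports nothing
    }
};

enum class Backend { Builtin, Cg };

// Hypothetical factory: picks the subclass, returns it as the superclass.
std::unique_ptr<GrammarChecker> makeGrammarChecker(Backend backend) {
    if (backend == Backend::Cg) {
        return std::unique_ptr<GrammarChecker>(new CgGrammarChecker());
    }
    return std::unique_ptr<GrammarChecker>(new BuiltinGrammarChecker());
}
```

With something like this, the rest of the library keeps calling checkParagraph on a GrammarChecker pointer and does not need to know which implementation is behind it.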
> > >
> > > Great, thanks! When you say lines 84-134 do you mean this method:
> > >
> > > void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> > > wchar_t * text, size_t textlen) {
> >
> > Yes. There is some general purpose stuff happening at the start of that
> > function but the rest of the function is where the implementation specific
> > things happen.
> >
> > > As far as I can see, I need to replace:
> > >
> > > analysis.cpp : gc_analyze_paragraph
> > > gc_analyze_sentence
> > > gc_analyze_token
> > >
> > > with methods that use the HFST optimised lookup library to analyse
> > > individual words. Actually, probably only gc_analyze_token.
> > >
> > > Then I need to replace:
> > >
> > > cache.cpp : gc_paragraph_to_cache
> > >
> > > with a method that takes the sentences with analyses from HFST and
> > > passes each one through the CG and collects the error tags.
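
The flow described above (analyse each word, run the sentence through the rules, collect the error tags) could be sketched roughly as follows. This is only an outline under stated assumptions: the analyser and the rule applier are passed in as plain callbacks instead of calling the HFST or VISLCG3 libraries, all type and function names are hypothetical, and treating a tag that starts with '&' as an error tag is an assumption made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One reading = the list of tags for one analysis of a word.
typedef std::vector<std::string> Reading;

// One word with all its readings.
struct Cohort {
    std::string wordform;
    std::vector<Reading> readings;
};
typedef std::vector<Cohort> Sentence;

// Stand-ins for the real components (hypothetical signatures).
typedef std::function<std::vector<Reading>(const std::string &)> TokenAnalyzer;
typedef std::function<Sentence(const Sentence &)> RuleApplier;

struct ErrorTag {
    std::size_t tokenIndex;
    std::string tag;
};

std::vector<ErrorTag> checkSentence(const std::vector<std::string> & words,
                                    const TokenAnalyzer & analyze,
                                    const RuleApplier & applyRules) {
    // Step 1: analyse every word on its own.
    Sentence sentence;
    for (std::size_t i = 0; i < words.size(); ++i) {
        sentence.push_back(Cohort{words[i], analyze(words[i])});
    }
    // Step 2: let the rules add or remove tags on the whole sentence.
    Sentence tagged = applyRules(sentence);
    // Step 3: collect tags that look like error tags.
    // ASSUMPTION: error tags start with '&'.
    std::vector<ErrorTag> errors;
    for (std::size_t i = 0; i < tagged.size(); ++i) {
        for (std::size_t r = 0; r < tagged[i].readings.size(); ++r) {
            const Reading & reading = tagged[i].readings[r];
            for (std::size_t t = 0; t < reading.size(); ++t) {
                if (!reading[t].empty() && reading[t][0] == '&') {
                    errors.push_back(ErrorTag{i, reading[t]});
                }
            }
        }
    }
    return errors;
}
```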
> >
> > Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the
> > built in tokenizer in libvoikko is good enough for you. You can try
> >
> > echo "paragraph text" | voikkogc --tokenize
> >
> > to see what happens. Originally the tokenizer was built for Finnish only,
> > which may or may not be a problem.
>
> I tried that, but got the following error...
>
> E: Initialization of Voikko failed: Specified dictionary variant was not
> found
>
> I think it is because I can't get suomimalaga installed...
>
> sed -e "s/VANHAHKOT_MUODOT/yes/; s/VANHAT_MUODOT/no/;
> s/VOIKKO_MURRE/no/; s/SUKIJAN_MUODOT/no/; s/SM_VOIKKO_VARIANT/standard/;
> s/SM_VOIKKO_DESCRIPTION/suomi (perussanasto)/; s/SM_VERSION/1.14/;
> s/SM_PATCHINFO//; s/SM_BUILDCONFIG/GENLEX_OPTS= EXTRA_LEX=/;
> s/SM_BUILDDATE/Tue, 17 Sep 2013 15:31:59 +0000/" <
> voikko/voikko-fi_FI.pro.in > voikko/voikko-fi_FI.pro
> echo "define @voikko_debug := no;" > voikko/config.inc
> /bin/sh: malmake: command not found
>
> Any idea where I can find 'malmake'?

Fixed this (googling *shame*), now the tokenisation works:

$ echo "Helga Pedersen (riegádan ođđajagimánu 13. b. 1973
Mátta-Várjjagis) lea Norgga politihkkár Bargiidbellodagas. Son
válljejuvvui Stuorradiggái Finnmárkkus jagi 2009, ja lea leamaš
parlamentáralaš jođiheaddji seamma guhká. Son lea Bargiidbellodaga
nubbinjođiheaddji ja lei Norgga guolástan- ja riddoministtar Jens
Stoltenberga nubbi ráđđehusas 2005-2009. Dalle lei son ráđđehusa
nuoramus áirras. Pedersenis lea sámi kultuvralaš duogáš, ja vuosttaš
sámegielat nubbinjođiheaddji Bargiidbellodagas." | voikkogc --tokenize
W: "Helga"
S: " "
W: "Pedersen"
S: " "
P: "("
W: "riegádan"
S: " "
W: "ođđajagimánu"
S: " "
W: "13"
P: "."
S: " "
...
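
In case it is useful for comparing tokenisations, here is a small sketch that reads one line of the output above back into a token. It relies only on the line shape visible in the output (a type letter, a colon, a space, then the text in double quotes). Reading W, S and P as word, whitespace and punctuation is my interpretation of the output, and the names used are hypothetical, not libvoikko's.

```cpp
#include <string>

enum class TokenType { Word, Whitespace, Punctuation };

struct Token {
    TokenType type;
    std::string text;
};

// Parses one line such as:  W: "Helga"
// Returns false for lines that do not have that shape (for example "...")
// or that start with a letter other than W, S or P.
bool parseTokenLine(const std::string & line, Token & out) {
    // Shortest valid line is 5 characters: letter, ':', ' ', '"', '"'.
    if (line.size() < 5) return false;
    if (line.compare(1, 3, ": \"") != 0) return false;
    if (line[line.size() - 1] != '"') return false;

    switch (line[0]) {
        case 'W': out.type = TokenType::Word; break;
        case 'S': out.type = TokenType::Whitespace; break;
        case 'P': out.type = TokenType::Punctuation; break;
        default: return false;
    }
    // The text sits between the opening quote (index 3) and the final quote.
    out.text = line.substr(4, line.size() - 5);
    return true;
}
```

The text is kept as raw bytes, so non-ASCII words such as "riegádan" pass through unchanged.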

It looks fine, but perhaps it might pay to have a language-dependent
tokenisation module, thinking of the future (e.g. for languages which
don't write spaces).

Fran