[libvoikko] grammar checking in libvoikko

Francis Tyers ftyers at prompsit.com
Wed Sep 18 00:57:09 EEST 2013


On Tue, 17 Sep 2013 at 18:23 +0300, Harri Pitkänen wrote:
> On Tuesday 17 September 2013 13:43:55 Francis Tyers wrote:
> > How would you recommend the data for the grammar checker be
> > distributed? In a single file, as with .zhfst? The checker needs two
> > files: (1) the descriptive morphological analyser and (2) the grammar
> > checker rules. The first will be an HFSTOL transducer, and the second a
> > VISLCG3 binary format file. Do you have any preference for how it should
> > be laid out, e.g. a zip file in ~/.voikko/ or something else?
> 
> The preferred structure would be something like this:
> 
>   ~/.voikko/4/sme/index
>   ~/.voikko/4/sme/transducer.hfstol
>   ~/.voikko/4/sme/vislcg3_binary_file
>   ~/.voikko/4/sme/...
> 
>  - Number 4 is the format version (that is the next unused version number at 
> the moment).
>  - Under that you have a directory for each language.
>  - "index" would be a plain text file with the necessary dictionary metadata 
> (similar to voikko-fi_FI.pro in format 2 but possibly simpler).
>  - Other files you can name as you wish, and you may have as many as you like.
> 
> If you want to put everything in a zip file you could just zip the structure 
> above and have ~/.voikko/4/sme.zip. But personally I prefer using unpacked 
> formats since all distribution tools (rpm, deb, oxt, msi, apk, ...) will 
> handle packing and compressing anyway so it rarely matters for the end users.
> 
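As a rough illustration of that layout, file locations could be derived from
the dictionary root, the format version and the language code. This is only a
sketch: dictionaryFilePath is a hypothetical helper, not an existing libvoikko
function, and the root directory is assumed to be passed in by the caller.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of libvoikko): builds the path of a file
// inside a format-4 dictionary directory, i.e. <root>/4/<language>/<file>.
std::string dictionaryFilePath(const std::string & root,
                               const std::string & language,
                               const std::string & fileName) {
    // Both components are required to form a meaningful path.
    assert(!language.empty() && !fileName.empty());
    const std::string formatVersion = "4"; // format version from the layout above
    return root + "/" + formatVersion + "/" + language + "/" + fileName;
}
```

For example, dictionaryFilePath("/home/user/.voikko", "sme", "index") would
name the metadata file for the sme dictionary.
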
> > > To do all this without breaking the existing code we need to build an
> > > abstract GrammarChecker superclass and extend it with two subclasses, one
> > > for the existing implementation and another for your new implementation.
> > > The exact same has been done with Analyzer, SpellChecker and others. I
> > > can help you with that and some other small things that will be needed
> > > such as changes to the LibreOffice plugin.
> > 
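The superclass arrangement described above might look roughly like the
following. All names here (GrammarError, checkParagraph, MalagaGrammarChecker,
CgGrammarChecker, createGrammarChecker and the backend strings) are
assumptions made for illustration; the real libvoikko interfaces may well
differ, and the two subclasses are empty stubs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical error record; the actual libvoikko type may differ.
struct GrammarError {
    std::size_t startPos;
    std::size_t length;
    std::string code;
};

// Abstract interface shared by all grammar checker back ends.
class GrammarChecker {
public:
    virtual ~GrammarChecker() = default;
    virtual std::vector<GrammarError> checkParagraph(const std::wstring & text) = 0;
};

// Stub standing in for the existing implementation.
class MalagaGrammarChecker : public GrammarChecker {
public:
    std::vector<GrammarError> checkParagraph(const std::wstring &) override {
        return {}; // the existing checks would run here
    }
};

// Stub standing in for the new HFST + CG implementation.
class CgGrammarChecker : public GrammarChecker {
public:
    std::vector<GrammarError> checkParagraph(const std::wstring &) override {
        return {}; // HFST analysis and CG rules would run here
    }
};

// Hypothetical factory: picks a back end by name, nullptr if unknown.
std::unique_ptr<GrammarChecker> createGrammarChecker(const std::string & backend) {
    assert(!backend.empty());
    if (backend == "malaga") {
        return std::make_unique<MalagaGrammarChecker>();
    }
    if (backend == "hfst-cg") {
        return std::make_unique<CgGrammarChecker>();
    }
    return nullptr;
}
```

Callers would then only hold a GrammarChecker pointer, so existing code would
not need to know which back end is in use.
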
> > Great, thanks! When you say lines 84-134 do you mean this method:
> > 
> > void gc_paragraph_to_cache(voikko_options_t * voikkoOptions, const
> > wchar_t * text, size_t textlen) {
> 
> Yes. There is some general-purpose stuff happening at the start of that 
> function but the rest of the function is where the implementation-specific 
> things happen.
> 
> > As far as I can see, I need to replace:
> > 
> > analysis.cpp : gc_analyze_paragraph
> >                gc_analyze_sentence
> >                gc_analyze_token
> > 
> > with methods that use the HFST optimised lookup library to analyse
> > individual words. Actually, probably only gc_analyze_token.
> > 
> > Then I need to replace:
> > 
> > cache.cpp : gc_paragraph_to_cache
> > 
> > with a method that takes the sentences with analyses from HFST and
> > passes each one through the CG and collects the error tags.
> 
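The flow described above (analyse each word, run each analysed sentence
through the rules, collect the error tags) can be sketched as below. The
analyser and the rule application are passed in as plain callbacks, because
the actual HFST and VISLCG3 calls are not shown in this thread; AnalysedToken,
TokenAnalyser, SentenceChecker and collectErrorTags are all hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One token together with the readings returned by the analyser.
struct AnalysedToken {
    std::string surface;
    std::vector<std::string> readings;
};

// Stand-in for the HFST lookup: word form in, readings out.
using TokenAnalyser =
    std::function<std::vector<std::string>(const std::string &)>;

// Stand-in for the CG step: analysed sentence in, error tags out.
using SentenceChecker =
    std::function<std::vector<std::string>(const std::vector<AnalysedToken> &)>;

// Analyses every word, runs each sentence through the rules and
// gathers all error tags in sentence order.
std::vector<std::string> collectErrorTags(
        const std::vector<std::vector<std::string>> & sentences,
        const TokenAnalyser & analyse,
        const SentenceChecker & applyRules) {
    assert(analyse && applyRules);
    std::vector<std::string> errors;
    for (const auto & sentence : sentences) {
        std::vector<AnalysedToken> analysed;
        for (const auto & word : sentence) {
            analysed.push_back({word, analyse(word)});
        }
        const std::vector<std::string> tags = applyRules(analysed);
        errors.insert(errors.end(), tags.begin(), tags.end());
    }
    return errors;
}
```

Keeping both steps behind callbacks means the tokenizer question stays
separate: the sentences could come from the built-in tokenizer or from
anywhere else.
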
> Using gc_analyze_paragraph and gc_analyze_sentence might be possible if the 
> built-in tokenizer in libvoikko is good enough for you. You can try

I think I'd like to start by creating Hfst/CG versions of analysis.cpp and
cache.cpp (HfstAnalysis and VislCache) and moving the existing ones to
MalagaAnalysis and MalagaCache.

Is there a reason why some source file names are in upper case and others
in lower case? Does it basically depend on whether they are classes or not?

Does this seem reasonable?

Fran



