[libvoikko] Test cases for libvoikko/HFST needed

Harri Pitkänen hatapitk at iki.fi
Tue Jan 19 20:08:14 EET 2010


On Tuesday 19 January 2010, Flammie Pirinen wrote:
> Ah yes, that's one thing that isn't entirely trivial, or at least,
> ideal solution exceeds my C++ skills. Since HFST is just a bridge-like
> wrapper over underlying libraries, and currently it includes the
> external libraries in source tree, and some of the definitions leak to
> public installed headers of hfst. Easy way out would be to fix
> underlying libraries from using e.g. deprecated data structures in
> their respective public interfaces, but I suppose there must be
> something in the proper bridge etc. design patterns that do the hiding
> more elegantly without need to modify the external library code.

I think the correct solution depends on what sort of applications are supposed 
to use this API. If the applications do not need to know anything about the 
underlying libraries you can just remove all the functionality that depends on 
SFST/OpenFST headers and stop including those headers. This is most certainly 
the case for libvoikko since we basically do only lookups and nothing else.

If, on the other hand, HFST is acting as a helper library for applications that 
need to perform operations with specific backends, then including the headers 
as is done in the current version may be OK. This has serious disadvantages, 
though, because your API and ABI will break whenever any of the underlying 
libraries changes its API/ABI.

I have not studied these headers very carefully but it seems that the problem 
may be that HFST is not really providing an abstraction layer. It seems to 
equate weighted transducers with OpenFST and unweighted transducers with SFST 
and use the backend data types directly in public headers. Often such types 
can be replaced with pointers to incomplete types or abstract base classes.
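
As a minimal sketch of the "pointer to incomplete type" idea, assuming a 
lookup-only use case like libvoikko's: every name below (sketch::Lookup, 
analyze, the stand-in lexicon) is hypothetical and not part of the real HFST 
API. The point is only that the public part mentions no SFST/OpenFST types, so 
a backend change would not leak into the installed headers.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// ---- Public header part (hypothetical "lookup.hpp") ----
// No SFST or OpenFST header is included here.
namespace sketch {

class Lookup {
public:
    explicit Lookup(const std::string& backendName);
    ~Lookup();  // defined in the implementation part, where Impl is complete
    Lookup(const Lookup&) = delete;
    Lookup& operator=(const Lookup&) = delete;

    // Returns the analyses of a word; empty if the word is unknown.
    std::vector<std::string> analyze(const std::string& word) const;

private:
    struct Impl;                  // incomplete type in the public header
    std::unique_ptr<Impl> impl_;  // callers only ever see a pointer
};

}  // namespace sketch

// ---- Private implementation part (hypothetical "lookup.cpp") ----
// A real implementation would include the backend headers here and nowhere
// else. A std::map stands in for the transducer (assumption for the sketch).
namespace sketch {

struct Lookup::Impl {
    std::string backend;
    std::map<std::string, std::vector<std::string>> lexicon;
};

Lookup::Lookup(const std::string& backendName) : impl_(new Impl) {
    impl_->backend = backendName;
    impl_->lexicon["kissa"].push_back("kissa+N+Sg+Nom");  // made-up entry
}

Lookup::~Lookup() = default;

std::vector<std::string> Lookup::analyze(const std::string& word) const {
    const auto it = impl_->lexicon.find(word);
    if (it == impl_->lexicon.end()) {
        return std::vector<std::string>();
    }
    return it->second;
}

}  // namespace sketch
```

An application would construct sketch::Lookup and call analyze() without ever 
including a backend header. An abstract base class with one subclass per 
backend would achieve the same hiding at the cost of a virtual call.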

> > - Make sure that HFST can be built on Windows using MS Visual C++.
> 
> Yes, that's definitely up for grabs for anyone who has experience with
> the environment and possesses one. I tend to steer clear of Microsoft
> products if at all possible and I think even my colleague who may
> implement Windows support will limit it to MinGW/Cygwin. As far as I
> know we don't even have licences for Visual Studio programs here.

I don't really like working with MS tools either and usually use them only 
when I'm paid to do so. Libvoikko can be built with both GCC and MS Visual C++ 
on Windows but since these compilers are not entirely ABI compatible some 
things will break if you try to use a library built with GCC in an application 
built with MSVC. In particular, official versions of OpenOffice.org and Python 
are built with MSVC and work correctly only with the MSVC version of libvoikko.

Since the HFST API is using quite complicated C++ structures, I guess you need 
to build HFST with the same compiler as libvoikko. The Express edition of MSVC 
is free-as-in-beer, but you need to register with some MS services and spend a 
lot of time clicking through dialogs and accepting licenses to get it. The 
Express edition has all the required features for this type of development. 
The only restriction that matters to me is that it does not come with a 
license to redistribute the MSVC runtime "redistributable", meaning that if 
you publish the binaries you must ask users to download the runtime libraries 
directly from MS.

I have not paid for the professional version either, which also explains why 
the Windows port of openoffice.org-voikko requires the separate download of 
the MSVC runtime. The professional version of MSVC costs something like 
500-1000 euros.

By the way, support for MSVC (or for Windows in general) is not required to 
have code included in libvoikko. We do not yet support Windows in any formal 
way.

> > implement checking of correct
> > capitalisation.
> 
> Is it enough if implementers of morphologies are encouraged to make a
> suggestion mechanism, which always prefers (initial) capitalisation
> over anything else, given that the language in question contains
> capitalisation of any form?
>
> Assuming the suggestion mechanism will
> eventually be fast enough, it possibly won't give much advantage to
> check capitalisation separately. Of course on the user interface side it
> should still be trivial to check if the capitalisation is the first
> suggestion in the list and inform the user appropriately.

The advantage is significant at least with Malaga since we are now able to 
implement various modes for checking capitalisation while doing only one 
analysis operation per word.

If the advantage is not significant with HFST, you can just create an adapter 
that uses suggestions for determining the spelling result. If I understood 
correctly, that would lead to roughly the same performance as the solution 
you suggested.
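
A rough sketch of such an adapter, under stated assumptions: the result names, 
the Suggester callback type and the ASCII-only case folding are all made up 
for illustration and are not the libvoikko or HFST API. It derives the 
spelling result purely from the first suggestion returned for the word.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical result codes for the illustration.
enum class SpellResult { Ok, CapitalizeFirst, CapitalizationError, Failed };

// Hypothetical suggestion backend: returns suggestions, best one first.
using Suggester = std::function<std::vector<std::string>(const std::string&)>;

// ASCII-only lowercasing; real code would need language-aware case mapping.
inline std::string toLowerAscii(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Determines the spelling result from the first suggestion only.
inline SpellResult spellViaSuggestions(const Suggester& suggest,
                                       const std::string& word) {
    const std::vector<std::string> suggestions = suggest(word);
    if (suggestions.empty()) {
        return SpellResult::Failed;
    }
    const std::string& best = suggestions.front();
    if (best == word) {
        return SpellResult::Ok;
    }
    if (toLowerAscii(best) != toLowerAscii(word)) {
        return SpellResult::Failed;  // differs in more than letter case
    }
    // Same letters, different case. If only the first letter differs and the
    // suggestion has it in upper case, initial capitalisation fixes the word.
    const bool tailsMatch =
        best.compare(1, std::string::npos, word, 1, std::string::npos) == 0;
    if (tailsMatch && std::isupper(static_cast<unsigned char>(best[0]))) {
        return SpellResult::CapitalizeFirst;
    }
    return SpellResult::CapitalizationError;
}

}  // namespace sketch
```

With a suggester that proposes "Helsinki" for "helsinki", this returns 
CapitalizeFirst; every check costs one suggestion call, which is the 
performance trade-off discussed above.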

> One reason for this question is also that capitalisation of course has
> a few language dependent cases. E.g. i in Turkish, ij in Dutch, ss in
> German and so forth. Also I'm not sure but I think some languages may
> have more complex capitalisation rules than word initial?

For these it will be necessary to come up with better definitions of 
capitalisation. It is always possible to decide that complex capitalisation 
errors should be treated just as any other spelling error.

> 
> > - Provide Debian packages for HFST and Sámi morphology.
> 
> Is there anything blocking Debian packages of HFST? It uses mostly
> standard autotools setup and dependencies I believe are documented in
> READMEs. At least Gentoo packaging went nicely in with defaults:
>  <http://git.overlays.gentoo.org/gitweb/?p=proj/sci.git;a=blob;f=sci-misc/hfst/hfst-2.2.ebuild;h=a688b29d682bc321b489faf0c0d399974df72151;hb=HEAD>

Probably nothing is blocking it. The included copies of SFST and OpenFST could 
be an issue for the distributions that care about such things. For me it is 
enough if I can build libvoikko with dpkg-buildpackage and have HFST support 
compiled in; I don't mind if the packages don't follow the Debian policy at 
this point.

Harri


