[libvoikko] Test cases for libvoikko/HFST needed
Harri Pitkänen
hatapitk at iki.fi
Tue Jan 19 20:08:14 EET 2010
On Tuesday 19 January 2010, Flammie Pirinen wrote:
> Ah yes, that's one thing that isn't entirely trivial, or at least,
> ideal solution exceeds my C++ skills. Since HFST is just a bridge-like
> wrapper over underlying libraries, and currently it includes the
> external libraries in source tree, and some of the definitions leak to
> public installed headers of hfst. Easy way out would be to fix
> underlying libraries from using e.g. deprecated data structures in
> their respective public interfaces, but I suppose there must be
> something in the proper bridge etc. design patterns that do the hiding
> more elegantly without need to modify the external library code.
I think the correct solution depends on what sort of applications are supposed
to use this API. If the applications do not need to know anything about the
underlying libraries you can just remove all the functionality that depends on
SFST/OpenFST headers and stop including those headers. This is most certainly
the case for libvoikko since we basically do only lookups and nothing else.
If on the other hand HFST is acting as a helper library for applications that
need to perform operations with specific backends then including the headers
as is done in the current version may be OK. This has serious disadvantages
though because then your API and ABI will break whenever any of the underlying
libraries changes their API/ABI.
I have not studied these headers very carefully but it seems that the problem
may be that HFST is not really providing an abstraction layer. It seems to
equate weighted transducers with OpenFST and unweighted transducers with SFST
and use the backend data types directly in public headers. Often such types
can be replaced with pointers to incomplete types or abstract base classes.
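To illustrate the "pointer to incomplete type" option, here is a minimal
sketch. Every name in it (LookupTransducer, Impl, lookup) is made up for the
example and is not part of the real HFST API, and the std::map is only a
stand-in for whatever SFST/OpenFST object the private part would really hold.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// ---- Public header part (hypothetical API, not the real HFST one) ----
// No backend header is included here; callers only see an opaque pointer.
class LookupTransducer {
public:
    LookupTransducer();
    ~LookupTransducer();
    LookupTransducer(LookupTransducer&&) noexcept;
    LookupTransducer& operator=(LookupTransducer&&) noexcept;

    // Returns all analyses of `word`, or an empty vector if it is unknown.
    std::vector<std::string> lookup(const std::string& word) const;

private:
    struct Impl;                  // incomplete type: layout hidden from users
    std::unique_ptr<Impl> impl_;
};

// ---- Private part, compiled into the library only ----
// In a real library this struct would wrap the backend transducer object.
struct LookupTransducer::Impl {
    std::map<std::string, std::vector<std::string>> analyses;
};

LookupTransducer::LookupTransducer() : impl_(new Impl) {
    // Toy data standing in for a compiled morphology.
    impl_->analyses["kissa"].push_back("kissa+N+Sg+Nom");
    impl_->analyses["kissat"].push_back("kissa+N+Pl+Nom");
}

// Defined here, where Impl is a complete type, so unique_ptr can delete it.
LookupTransducer::~LookupTransducer() = default;
LookupTransducer::LookupTransducer(LookupTransducer&&) noexcept = default;
LookupTransducer& LookupTransducer::operator=(LookupTransducer&&) noexcept = default;

std::vector<std::string> LookupTransducer::lookup(const std::string& word) const {
    assert(impl_ && "lookup() called on a moved-from object");
    const auto it = impl_->analyses.find(word);
    if (it == impl_->analyses.end()) {
        return {};
    }
    return it->second;
}
```

With a layout like this the backend could be swapped or upgraded without
changing the installed header, as long as the public member functions stay the
same.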
> > - Make sure that HFST can be built on Windows using MS Visual C++.
>
> Yes, that's definitely up for grabs for anyone who has experience with
> the environment and possesses one. I tend to steer clear of Microsoft
> products if at all possible and I think even my colleague who may
> implement Windows support will limit it to MinGW/Cygwin. As far as I
> know we don't even have licences to Visual Studio programs here.
I don't really like working with MS tools either and usually use them only
when I'm paid to do so. Libvoikko can be built with both GCC and MS Visual C++
on Windows but since these compilers are not entirely ABI compatible some
things will break if you try to use a library built with GCC in an application
built with MSVC. In particular official versions of OpenOffice.org and Python
are built with MSVC and work correctly only with MSVC version of libvoikko.
Since the HFST API uses quite complicated C++ structures I guess you need to
build HFST with the same compiler as libvoikko. The Express edition of MSVC is
free-as-in-beer but you need to register to some MS services and spend a lot
of time clicking through dialogs and accepting licenses to get it. The Express
edition has all the required features for this type of development. The only
restriction that matters to me is that it does not come with a license to
redistribute the MSVC runtime "redistributable" meaning that if you publish
the binaries you must ask users to download the runtime libraries directly
from MS.
I have not paid for the professional version either which also explains why
the Windows port of openoffice.org-voikko requires the separate download of
MSVC runtime. The professional version of MSVC costs something like 500-1000
euros.
By the way, support for MSVC (or for Windows in general) is not required to
have code included in libvoikko. We do not yet support Windows in any formal
way.
> > implement checking of correct
> > capitalisation.
>
> Is it enough if implementers of morphologies are encouraged to make a
> suggestion mechanism, which always prefers (initial) capitalisation
> over anything else, given that the language in question contains
> capitalisation of any form?
>
> Assuming the suggestion mechanism will
> eventually be fast enough, it possibly won't give much advantage to
> check capitalisation separately. Of course on user interface side it
> should still be trivial to check if the capitalisation is first
> suggestion in the list and inform user of appropriately.
The advantage is significant at least with Malaga since we are now able to
implement various modes for checking capitalisation while doing only one
analysis operation per word.
If the advantage is not significant with HFST you can just create an adapter
that uses suggestions for determining the spelling result. If I understood
correctly that would lead to roughly the same performance as the solution
you suggested.
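A rough sketch of such an adapter follows. The names (SpellResult, AcceptFn,
SuggestFn, spellViaSuggestions) are invented for the example and are not
libvoikko or HFST API. It assumes the suggestion mechanism puts a
capitalisation fix first in the list, and it compares case for ASCII letters
only, so it does not handle language-dependent capitalisation.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result codes for the adapter.
enum class SpellResult { Ok, CapitalisationError, Failed };

// Hypothetical callbacks standing in for the backend's lookup and
// suggestion operations.
using AcceptFn = std::function<bool(const std::string&)>;
using SuggestFn = std::function<std::vector<std::string>(const std::string&)>;

// ASCII-only comparison; a simplification made for this sketch.
bool equalIgnoringAsciiCase(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (std::string::size_type i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) {
            return false;
        }
    }
    return true;
}

// Derives a spelling result from suggestions: if the word is rejected and
// the first suggestion differs from it only by letter case, report a
// capitalisation error instead of a plain failure.
SpellResult spellViaSuggestions(const std::string& word,
                                const AcceptFn& accepts,
                                const SuggestFn& suggest) {
    assert(accepts && suggest && "both callbacks must be set");
    if (accepts(word)) {
        return SpellResult::Ok;
    }
    const std::vector<std::string> suggestions = suggest(word);
    if (!suggestions.empty() &&
        equalIgnoringAsciiCase(suggestions.front(), word)) {
        return SpellResult::CapitalisationError;
    }
    return SpellResult::Failed;
}
```

The cost is one lookup for correct words and one lookup plus one suggestion
run for everything else, so whether this is acceptable depends on how fast
suggestions turn out to be.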
> One reason for this question is also that capitalisation of course has
> a few language dependent cases. E.g. i in turkish, ij in dutch, ss in
> german and so forth. Also I'm not sure but I think some language may
> have more complex capitalisation rules than word initial?
For these it will be necessary to come up with better definitions of
capitalisation. It is always possible to decide that complex capitalisation
errors should be treated just as any other spelling error.
>
> > - Provide Debian packages for HFST and Sámi morphology.
>
> Is there anything blocking debian packages of HFST? It uses mostly
> standard autotools setup and dependencies I believe are documented in
> README's. At least gentoo packaging went nicely in with defaults:
> <http://git.overlays.gentoo.org/gitweb/?p=proj/sci.git;a=blob;f=sci-misc/hfst/hfst-2.2.ebuild;h=a688b29d682bc321b489faf0c0d399974df72151;hb=HEAD>
Probably nothing is blocking it. The included copies of SFST and OpenFST could
be an issue for the distributions that care about such things. For me it is
enough if I can build libvoikko with dpkg-buildpackage and have HFST support
compiled in; I don't mind if the packages don't follow the Debian policy at
this point.
Harri