[libvoikko] Test cases for libvoikko/HFST needed

Flammie Pirinen flammie at iki.fi
Wed Jan 20 07:47:04 EET 2010

Harri Pitkänen wrote on 19.1.2010 at 20.08:

> On Tuesday 19 January 2010, Flammie Pirinen wrote:
>> Ah yes, that's one thing that isn't entirely trivial, or at least,
>> ideal solution exceeds my C++ skills. Since HFST is just a bridge-like
>> wrapper over underlying libraries, and currently it includes the
>> external libraries in source tree, and some of the definitions leak to
>> public installed headers of hfst. Easy way out would be to fix
>> underlying libraries from using e.g. deprecated data structures in
>> their respective public interfaces, but I suppose there must be
>> something in the proper bridge etc. design patterns that do the hiding
>> more elegantly without need to modify the external library code.
> I think the correct solution depends on what sort of applications are
> supposed to use this API. If the applications do not need to know
> anything about the underlying libraries you can just remove all the
> functionality that depends on SFST/OpenFST headers and stop including
> those headers.

Yes, that has been my impression of HFST, and I hope there is no
software that uses structures or functions of the underlying libraries
directly. The reason I believe that just removing the headers isn't
possible is that the public interface of hfst operates on some
structures or classes which have at least private members from the
underlying libraries, and this necessitates including the underlying
libraries' headers in the public headers of hfst (I hope that makes
sense; I haven't been actively developing the library side of things
myself).

> This is most certainly the case for libvoikko since we basically do
> only lookups and nothing else.

Hopefully, for libvoikko as well as many other end applications, we can
provide lightweight lookup transducers with specialised code for faster
lookup, as Krister said in another mail. This will cut the size of the
library to a fraction, and since it is entirely our own code it will
not have the licensing issues that may be problematic for some users.

> I have not studied these headers very carefully but it seems that the
> problem may be that HFST is not really providing an abstraction layer.
> It seems to equate weighted transducers with OpenFST and unweighted
> transducers with SFST and use the backend data types directly in
> public headers. Often such types can be replaced with pointers to
> incomplete types or abstract base classes.

Yes, that is certainly the current state of things; the only guarantee
in the current library is that it provides almost the same function
signatures for both back ends. Our svn contains a reformulation in
object-oriented terms that provides a framework for including more
backends, but it seemingly does not escape the requirement of including
the headers of the underlying libraries, as the implementation classes
contain private members whose types are data structures from the
backends.
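
For what it's worth, the "pointer to incomplete type" idea mentioned
above could look roughly like the sketch below. Everything in it is an
assumption made up for illustration: the namespace sketch, the class
LookupTransducer and the stand-in FakeBackendTransducer are hypothetical
and are not the actual HFST, SFST or OpenFST API. The point is only that
the public part mentions struct Impl without defining it, so no backend
header has to be installed.

```cpp
// ---- part 1: what would live in the public, installed header ----
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

namespace sketch {  // hypothetical namespace, not the real HFST API

class LookupTransducer {
public:
    LookupTransducer();
    ~LookupTransducer();  // defined in part 2, where Impl is complete
    LookupTransducer(const LookupTransducer&) = delete;
    LookupTransducer& operator=(const LookupTransducer&) = delete;

    // Register one surface form -> analysis pair (illustrative only).
    void add(const std::string& surface, const std::string& analysis);
    // Return all analyses stored for the given surface form.
    std::vector<std::string> lookup(const std::string& surface) const;

private:
    struct Impl;                  // incomplete type: no backend header needed
    std::unique_ptr<Impl> impl_;  // the only private member clients ever see
};

}  // namespace sketch

// ---- part 2: what would live in the library's private source file ----
namespace sketch {

// Stand-in for a backend data structure (assumption: a real backend type
// would be included here, in the private source file only).
struct FakeBackendTransducer {
    std::multimap<std::string, std::string> pairs;
};

struct LookupTransducer::Impl {
    FakeBackendTransducer backend;  // backend type is hidden from clients
};

LookupTransducer::LookupTransducer() : impl_(new Impl) {}
LookupTransducer::~LookupTransducer() = default;

void LookupTransducer::add(const std::string& surface,
                           const std::string& analysis) {
    assert(impl_ != nullptr);
    impl_->backend.pairs.emplace(surface, analysis);
}

std::vector<std::string> LookupTransducer::lookup(
        const std::string& surface) const {
    assert(impl_ != nullptr);
    std::vector<std::string> analyses;
    auto range = impl_->backend.pairs.equal_range(surface);
    for (auto it = range.first; it != range.second; ++it) {
        analyses.push_back(it->second);
    }
    return analyses;
}

}  // namespace sketch
```

A client would only construct a sketch::LookupTransducer and call add()
and lookup(); swapping the stand-in for a real backend type would change
part 2 but not the installed header. The cost is one extra heap
allocation and pointer indirection per object.
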
>>> implement checking of correct
>>> capitalisation.
>> Is it enough if implementers of morphologies are encouraged to make a
>> suggestion mechanism, which always prefers (initial) capitalisation
>> over anything else, given that the language in question contains
>> capitalisation of any form?
>> Assuming the suggestion mechanism will
>> eventually be fast enough, it possibly won't give much advantage to
>> check capitalisation separately. Of course on user interface side it
>> should still be trivial to check if the capitalisation is first
>> suggestion in the list and inform user of appropriately.
> The advantage is significant at least with Malaga since we are now
> able to implement various modes for checking capitalisation while
> doing only one analysis operation per word.

Theoretically, the different modes of capitalisation would require
either their own suggestion transducers or some short passage of code
allowing or skipping entries depending on some settings. Of course, the
version where you capitalise first and test whether that alone results
in correct spelling will be cheaper, since it requires neither
different suggestion relations (e.g. you might have edit distance plus
initial capitalisation and edit distance without caps as separate
transducers) nor handling the suggestions in C code.
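
The "capitalise first and test" variant could be sketched roughly as
follows. This is only an illustration under loud assumptions: the
lexicon is a plain std::set standing in for a real lookup transducer,
the names CapResult, capitalise_first and check_capitalisation are
hypothetical, and std::toupper is applied to a single byte, so it only
handles ASCII initials (a real implementation would need proper Unicode
case mapping for letters like ä).

```cpp
#include <cctype>
#include <set>
#include <string>

// Outcome of the cheap capitalisation test (hypothetical names).
enum class CapResult { Correct, NeedsInitialCapital, Unknown };

// Upper-case the first byte of the word. ASCII only: multi-byte UTF-8
// initials are left untouched by this sketch.
std::string capitalise_first(std::string word) {
    if (!word.empty()) {
        word[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[0])));
    }
    return word;
}

// First look the word up as written; if that fails, retry with an
// initial capital before falling back to any suggestion mechanism.
CapResult check_capitalisation(const std::set<std::string>& lexicon,
                               const std::string& word) {
    if (lexicon.count(word) > 0) {
        return CapResult::Correct;
    }
    if (lexicon.count(capitalise_first(word)) > 0) {
        return CapResult::NeedsInitialCapital;
    }
    return CapResult::Unknown;
}
```

With a lexicon containing "Helsinki" and "kissa", the input "helsinki"
would be classified as NeedsInitialCapital at the cost of at most two
lookups, without consulting a separate suggestion relation.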

>> Is there anything blocking debian packages of HFST?
>> [...]
> Probably nothing is blocking it. The included copies of SFST and
> OpenFST could be an issue for the distributions that care about such
> things.

Oh yes, of course: bundling certainly prevents the HFST ebuild from
entering Gentoo's main repository; I hadn't even thought of having it
outside the science repo. Hopefully, if the library gains enough
importance, experience will become available for debundling the backend
libraries as well.