[libvoikko] Strange bug in the interface between hfst-ospell and libvoikko

Børre Gaup borre.gaup at uit.no
Sun Dec 22 07:08:58 EET 2013


On Thu, 2013-12-12 at 01:57 +0100, Sjur Moshagen wrote:
> Hello,
> 
> The following bug has me puzzled:
> 
> $ voikkospell -l -p tools/spellcheckers/fstbased/hfst/
> libc++abi.dylib: terminating with uncaught exception of type hfst_ol::ZHfstMetaDataParsingError
> Abort trap: 6
> 

The fix for this is to apply this patch to hfst-ospell:
--- ZHfstOspellerXmlMetadata.cc (revision 3658)
+++ ZHfstOspellerXmlMetadata.cc (working copy)
@@ -900,7 +900,7 @@
 ZHfstOspellerXmlMetadata::read_xml(const char* xml_data, size_t xml_len)
   {
     tinyxml2::XMLDocument doc;
-    if (doc.Parse(xml_data) != tinyxml2::XML_NO_ERROR)
+    if (doc.Parse(string(xml_data).substr(0, xml_len).c_str()) != tinyxml2::XML_NO_ERROR)
       {
         throw ZHfstMetaDataParsingError("Reading XML from memory");
       }

This works on Mac OS X 10.6, 10.8 and Linux, and doesn't affect the
libxmlpp code at all (which works on all three platforms I have tested).
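
One caveat about the patch: string(xml_data) still scans for a terminating
NUL before substr() trims the result, so it relies on a NUL byte being
present somewhere after the XML. A possible variant, sketched here under the
assumption that xml_len is the exact number of valid bytes, copies only that
many bytes. The helper name bounded_xml_copy is hypothetical and is not part
of hfst-ospell:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of hfst-ospell: copy exactly xml_len bytes
// from the buffer, never reading past them to look for a terminating NUL.
std::string bounded_xml_copy(const char* xml_data, std::size_t xml_len)
{
    assert(xml_data != nullptr);
    return std::string(xml_data, xml_len);
}
```

The c_str() of the returned string is NUL-terminated right after the last
valid byte, so it could be passed to doc.Parse() in place of the substr()
expression.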

The reason this error occurred is this:
voikkospell reads the .zhfst lexicon twice every time it is started up.

The first time in src/spellchecker/HfstSpeller.cpp:38
speller->read_zhfst(zhfstFileName.c_str());
The second time in src/setup/V3DictionaryLoader.cpp:63:
speller->read_zhfst(fullPath.c_str());

I found that the first time, when the buffer from extract_from_mem is
converted to a string, it is always as long as the file. Since
hfst-ospell only reads the .zhfst once, it works as it should.
The second time around, for some reason I wasn't able to find, the
buffer is converted to a string that is longer than the file.

tinyxml2 reads the given string and finds junk after the end of the XML.
Sometimes it crashes, other times it throws an
hfst_ol::ZHfstZipReadingError.
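
To illustrate the symptom: when the byte right after the XML is not a NUL, a
plain C-string conversion of the buffer picks up whatever follows it in
memory. A small sketch of a check for this, using a hypothetical helper
trailing_junk_bytes (not part of hfst-ospell or libvoikko) and assuming a NUL
byte exists somewhere within the readable buffer:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical diagnostic: number of bytes a NUL-terminated read of
// xml_data would pick up beyond the xml_len valid bytes.
// Assumes a NUL byte exists within the readable buffer.
std::size_t trailing_junk_bytes(const char* xml_data, std::size_t xml_len)
{
    const std::size_t c_string_len = std::strlen(xml_data);
    return c_string_len > xml_len ? c_string_len - xml_len : 0;
}
```

A non-zero result means the string handed to the parser is longer than the
XML itself, which matches what I saw on the second read.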

Before I applied the tiny patch above, voikkospell + hfst-ospell with
tinyxml2 behaved differently depending on the platform, the compression
level used when zipping the .zhfst file, and whether the index.xml
file was placed first or last in the .zhfst file.
voikkospell + libxmlpp worked as it should.

I used this little script to test voikkospell with all these variables:

i=0
while [ $i -lt 10 ]
do
    echo $i
    # index.xml last in the archive
    rm -f smj.zhfst
    zip -q -$i smj.zhfst acceptor.default.hfst errmodel.default.hfst index.xml
    cp smj.zhfst 3/
    voikkospell -s -p ./ -d smj < word.txt
    # index.xml first in the archive
    rm -f smj.zhfst
    zip -q -$i smj.zhfst index.xml acceptor.default.hfst errmodel.default.hfst
    cp smj.zhfst 3/
    voikkospell -s -p ./ -d smj < word.txt
    i=$((i+1))
done

and these were the findings (p = pass, f = fail):

OS X 10.6, hfst-ospell r3657
zip -#                    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
tinyxml2, index.xml first | p | f | f | f | f | f | f | f | f | f
tinyxml2, index.xml last  | f | f | f | f | f | f | f | f | f | f
libxmlpp, index.xml first | p | p | p | p | p | p | p | p | p | p
libxmlpp, index.xml last  | p | p | p | p | p | p | p | p | p | p

Linux (Kubuntu 14.04), hfst-ospell r3657
zip -#                    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
tinyxml2, index.xml first | p | p | p | p | p | p | p | p | p | p
tinyxml2, index.xml last  | f | f | f | f | f | f | f | f | f | f
libxmlpp, index.xml first | p | p | p | p | p | p | p | p | p | p
libxmlpp, index.xml last  | p | p | p | p | p | p | p | p | p | p

OS X 10.8, hfst-ospell r3657
zip -#                    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
tinyxml2, index.xml first | p | f | f | f | f | f | f | f | f | f
tinyxml2, index.xml last  | f | f | f | f | f | f | f | f | f | f
libxmlpp, index.xml first | p | p | p | p | p | p | p | p | p | p
libxmlpp, index.xml last  | p | p | p | p | p | p | p | p | p | p

After applying the patch, voikkospell + tinyxml2 behaved as it
should.

Børre


