[libvoikko] Strange bug in the interface between hfst-ospell and libvoikko

Gaup Børre borre.gaup at uit.no
Mon Dec 16 17:11:18 EET 2013


On Thursday 12. of December 2013 16.32.21 Sjur Moshagen wrote:
> 12. des. 2013 kl. 16:16 skrev Sjur Moshagen <sjurnm at mac.com>:
> > 12. des. 2013 kl. 15:52 skrev Harri Pitkänen <hatapitk at iki.fi>:
> >> On Thursday 12 December 2013 01:57:07 Sjur Moshagen wrote:
> >>> Configuration:
> >>> * svn HEAD of hfst-ospell
> >>> * newest revision of the master branch of libvoikko
> >> 
> >> I cannot reproduce this bug with this configuration on Linux. Does
> >> switching to different XML parser make any difference?
> 
> [...]
> 
> > So - while both xml libraries are somewhat broken, tinyxml seems to be
> > more broken.
> > 
> > This is all tested on MacOSX 10.6.
> 
> Here’s the result for MacOSX 10.9:
> 
> Using TinyXML2:
> 
> $ cd ../sma/
> $ voikkospell -l -p tools/spellcheckers/fstbased/hfst/
> Segmentation fault: 11
> 
> $ cd ../sms/
> $ voikkospell -l -p tools/spellcheckers/fstbased/hfst/
> libc++abi.dylib: terminating with uncaught exception of type
> hfst_ol::ZHfstMetaDataParsingError Abort trap: 6
> 
> Using libxml2++:
> $ cd ../sms/
> $ voikkospell -l -p tools/spellcheckers/fstbased/hfst/
> und-x-standard:
> $ cd ../sma/
> $ voikkospell -l -p tools/spellcheckers/fstbased/hfst/
> und-x-standard:
> 

I have been debugging hfst-ospell.

I found out that hfst-ospell often gives this error:

"Reading XML from memory"

when using this configure line:
./configure  --enable-zhfst --enable-xml=tinyxml2

It does not fail when using these configure lines:
./configure  --enable-zhfst --enable-xml=tinyxml2 --with-extract=tmpdir
./configure  --enable-zhfst
./configure  --enable-zhfst --with-extract=tmpdir

What I found out was the following:
After the call to extract_to_mem at about line 270 in ZHfstOspeller.cc, the
resulting buffer full_data sometimes contains junk at the end of the XML
string.

The junk appears at the end of full_data because libarchive somehow drops the
trailing white space at the end of lines in the xml file, but still reads the
number of bytes (correctly) reported as the size of the xml file.

So if an xml file contains this:
«<abc> »
«</abc> »
«»

then full_data contains
«<abc>»
«</abc>»
«(two bytes of junk read further on from the archive)»

libxmlpp copes with this, but tinyxml2 reports an error when it receives
this kind of data.

A hack to solve this problem is found in the patch below:
---
 hfst-ospell/ZHfstOspeller.cc | 11 ++++++++++-
 hfst-ospell/ZHfstOspeller.h  |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/hfst-ospell/ZHfstOspeller.cc b/hfst-ospell/ZHfstOspeller.cc
index 000942b..15a68d7 100644
--- a/hfst-ospell/ZHfstOspeller.cc
+++ b/hfst-ospell/ZHfstOspeller.cc
@@ -268,7 +268,8 @@ ZHfstOspeller::read_zhfst(const string& filename)
 #elif ZHFST_EXTRACT_TO_MEM
             size_t xml_len = 0;
             void* full_data = extract_to_mem(ar, entry, &xml_len);
-            metadata_.read_xml(reinterpret_cast<char*>(full_data), xml_len);
+            std::string data = remove_junk_at_end(string(reinterpret_cast<char*>(full_data)));
+            metadata_.read_xml(data.c_str(), data.length());
             free(full_data);
 #endif
 
@@ -325,6 +326,14 @@ ZHfstOspeller::read_zhfst(const string& filename)
 #endif // HAVE_LIBARCHIVE
   }
 
+std::string
+ZHfstOspeller::remove_junk_at_end(std::string data)
+  {
+    std::string::size_type last_gt = data.find_last_of('>');
+
+    return data.substr(0, last_gt + 1);
+  }
+
 void
 ZHfstOspeller::read_legacy(const std::string& path)
   {
diff --git a/hfst-ospell/ZHfstOspeller.h b/hfst-ospell/ZHfstOspeller.h
index cdf40c6..3d43cbe 100644
--- a/hfst-ospell/ZHfstOspeller.h
+++ b/hfst-ospell/ZHfstOspeller.h
@@ -64,6 +64,7 @@ namespace hfst_ol
             //! @brief create string representation of the speller for
             //!        programmer to debug
             std::string metadata_dump() const;
+            std::string remove_junk_at_end(std::string data);
         private:
             //! @brief file or path where the speller came from
             std::string filename_;
-- 

-- 
Børre Gaup
http://divvun.no/ - http://uit.no
Mob: +47 41 08 03 64

