[libvoikko] Status of North Sámi (SME) hfst speller

Thu Sep 29 13:42:27 EEST 2011

> Memory consumption
> ------------------
> I have not measured the real consumption with any tools, but the acceptor
> transducer is 11 Mb, and the error model transducer is 1,4 Mb. The zipped
> zhfst file is 3.0 Mb, which is not bad at all. Based on file size this
> looks definitely acceptable. Others will have to measure real memory
> consumption during use (or give me instructions on how to do it).

On Linux you can measure the memory use by letting voikkospell run for
a while with some test data, then stop it with Ctrl+Z and run

  $ cat /proc/`pidof voikkospell`/status

to get the numbers. Assuming that there are no memory leaks and there is
plenty of free memory available on the system this should give quite
consistent results. With the same test material I used in my previous mail
voikko/HFST/fi gives me

VmPeak:   173800 kB
VmSize:   161428 kB
VmLck:         0 kB
VmHWM:    114528 kB
VmRSS:    101232 kB
VmData:   112808 kB
VmStk:       136 kB
VmExe:        24 kB
VmLib:      5864 kB
VmPTE:       324 kB
VmSwap:        0 kB

and voikko/Malaga/fi

VmPeak:    56384 kB
VmSize:    56384 kB
VmLck:         0 kB
VmHWM:      9664 kB
VmRSS:      9664 kB
VmData:      500 kB
VmStk:       136 kB
VmExe:        24 kB
VmLib:      5864 kB
VmPTE:       132 kB
VmSwap:        0 kB

There are lots of things you can read from these numbers but for me the
most important are VmRSS (resident set size, amount of physical memory the
program is currently using) and VmData (amount of non-shareable memory the
program has reserved). VmRSS is roughly the memory needed to run one
instance of the speller and VmData tells you how much additional memory is
needed for each duplicate of the same speller (if you for example use the
same speller in Firefox and LibreOffice).
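
The two fields can also be picked out of the status file with a short script. This is only a sketch: the names `parse_vm_fields` and `read_status` are made up for illustration, and it assumes the usual Linux layout of `/proc/<pid>/status`, where each memory line looks like `VmRSS:   101232 kB`.

```python
from pathlib import Path


def parse_vm_fields(status_text):
    """Return the Vm* fields of a /proc/<pid>/status dump as {name: kB}."""
    fields = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not name.startswith("Vm"):
            continue
        parts = rest.split()
        # Vm* lines look like "VmRSS:    101232 kB"; skip anything else.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name] = int(parts[0])
    return fields


def read_status(pid):
    """Read and parse /proc/<pid>/status for a running process (Linux only)."""
    return parse_vm_fields(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_status(pid)` with the number printed by `pidof voikkospell` and looking up `"VmRSS"` and `"VmData"` in the result gives the two values discussed above.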

As you can see here, Malaga is roughly ten times as efficient as HFST for
the first instance and about 200 times as efficient for additional
instances.
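
The arithmetic behind these two ratios, as a small sketch using the figures from the listings above (the dictionary and function names are only illustrative):

```python
# Figures in kB, copied from the voikko/HFST/fi and voikko/Malaga/fi listings.
hfst_fi = {"VmRSS": 101232, "VmData": 112808}
malaga_fi = {"VmRSS": 9664, "VmData": 500}


def ratio(a, b, field):
    """How many times larger `field` is in a than in b."""
    return a[field] / b[field]


# VmRSS approximates the cost of the first instance,
# VmData the cost of each additional one.
first_instance = ratio(hfst_fi, malaga_fi, "VmRSS")
extra_instance = ratio(hfst_fi, malaga_fi, "VmData")
print(f"first instance: {first_instance:.1f}x")
print(f"each additional instance: {extra_instance:.1f}x")
```

This prints about 10.5x for the first instance and about 225.6x for each additional one.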

I would assume that the difference VmRSS-VmData should be about the same
for all HFST spellers and VmData might be directly proportional to the
size of the zhfst file. Finnish zhfst is 5.5 Mb so if SME zhfst is 3.0 Mb
I could estimate that for HFST/SME the numbers would be

VmData: 80000 kB
VmRSS:  70000 kB

Actually I think it is a bit strange to have VmData larger than VmRSS...
this could be due to a memory leak or some part of the allocated memory
being used very infrequently.
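
For comparison, here is what the two stated assumptions give when applied strictly linearly. This is a sketch only: the function name `estimate_hfst` is hypothetical, and the model (VmData proportional to zhfst size, constant VmRSS-VmData) is just the assumption made above, not a measured relationship.

```python
# Measured Finnish HFST figures (kB) and zhfst file size (MB) from the text above.
FI_VMRSS_KB = 101232
FI_VMDATA_KB = 112808
FI_ZHFST_MB = 5.5


def estimate_hfst(zhfst_mb):
    """Linear estimate of (VmData, VmRSS) in kB for a zhfst of the given size.

    Assumes VmData scales with file size and VmRSS - VmData stays constant.
    """
    vmdata = FI_VMDATA_KB * zhfst_mb / FI_ZHFST_MB
    vmrss = vmdata + (FI_VMRSS_KB - FI_VMDATA_KB)
    return round(vmdata), round(vmrss)


sme_vmdata, sme_vmrss = estimate_hfst(3.0)
print("VmData:", sme_vmdata, "kB")
print("VmRSS: ", sme_vmrss, "kB")
```

Taken literally, the model gives roughly 61500 kB for VmData and 50000 kB for VmRSS, so the round numbers quoted above err on the high side.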

With the SME Hunspell dictionary I get

VmPeak:   176684 kB
VmSize:   176680 kB
VmLck:         0 kB
VmHWM:    153568 kB
VmRSS:    153568 kB
VmData:   152020 kB
VmStk:       216 kB
VmExe:        52 kB
VmLib:      4324 kB
VmPTE:       360 kB
VmSwap:       96 kB

So it seems like HFST is roughly twice as efficient as Hunspell when it
comes to memory use with the SME dictionary. Of course it would be more
reliable to test with the actual SME zhfst file. Would it be possible to
have it uploaded somewhere?

Harri