[libvoikko] HFST backend performance observations

Harri Pitkänen hatapitk at iki.fi
Thu Oct 6 16:22:56 EEST 2011


> The memory allocations are not really due to initialization, but the way
> states are handled. The speller is always in a triple (error-state,
> lexicon-state, lexicon-flag-state), and these states are generated and
> placed on a queue. When they're processed, they get removed. This causes
> a certain amount of allocating and deallocating small amounts of memory.
>
> This is likely to be a bottleneck, and would (I suppose) be remedied by
> writing our own memory handling for this process.

So it looks like most of the cost is coming from these two members of
TreeNode class:

  SymbolVector string;
  FlagDiacriticState flag_state;

TreeNodes are often (copy)constructed in the current implementation. Since
these two members of the class are STL vectors we will allocate two arrays
from the heap each time even if TreeNode itself would be a stack variable.

I hacked the implementation a bit to remove the SymbolVector and
mutator_state completely from TreeNode (this breaks spelling suggestions),
added a few const qualifiers and changed some places in the code to use
pass-by-reference. The improvement in spell checking performance was
significant: run time decreased from 31.3 s down to 16.8 s without any
change in the output of voikkospell, making it faster than Malaga for
Finnish.

Of course you don't want to break spelling correction this way so you will
need a better fix. This just demonstrates that the allocations are quite
expensive.

Harri




More information about the Libvoikko mailing list