[libvoikko] VFST file format and weights

Harri Pitkänen hatapitk at iki.fi
Tue Jan 20 18:29:19 EET 2015


On Tuesday 20 January 2015 09:35:46 Sjur Moshagen wrote:
> I have a recollection that Harri mentioned that the VFST format could be
> extended with simple weights, something like a one-byte weight (0-255
> integer) - but I can find no mention of this in any e-mail. Is that
> still something that could be done? The main reason for this would be to be
> able to order speller suggestions.

VFST format is described here:

  https://github.com/voikko/corevoikko/wiki/vfst-fileformat

The 8-byte transition cell has no unused bytes left, but there are at
least three approaches that could be used:


1) Double the cell size to 16 bytes for weighted transducers. The layout for
transition cells could then be:
 - 4 bytes for input symbol
 - 4 bytes for output symbol
 - 4 bytes for target state
 - 2 bytes for weight
 - 1 byte for number of outgoing transitions + 1 byte for future use OR
   2 bytes for number of outgoing transitions

Using 4 bytes for the target state (instead of the 3 we have now) would allow
building much larger transducers. Currently we are limited to about 16 million
(2^24) transitions, which is not always enough. The downside is that doubling
the size of the transducer makes it harder to use CPU caches efficiently.
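
A minimal sketch of the 16-byte layout above, using the variant with 2 bytes
for the number of outgoing transitions. The field names, the field order and
the little-endian byte order are assumptions made for illustration; none of
them come from the VFST specification.

```python
import struct

# Hypothetical 16-byte weighted transition cell:
#   4 bytes input symbol, 4 bytes output symbol, 4 bytes target state,
#   2 bytes weight, 2 bytes number of outgoing transitions.
# "<" selects little-endian with no padding (an assumption).
WEIGHTED_CELL = struct.Struct("<IIIHH")


def pack_cell(sym_in, sym_out, target_state, weight, more_transitions):
    """Serialize one weighted transition cell into 16 bytes."""
    return WEIGHTED_CELL.pack(sym_in, sym_out, target_state, weight,
                              more_transitions)


def unpack_cell(data):
    """Return (sym_in, sym_out, target_state, weight, more_transitions)."""
    return WEIGHTED_CELL.unpack(data)


# A target state above 2**24 does not fit in 3 bytes but fits in 4.
cell = pack_cell(5, 7, 20_000_000, 300, 2)
```

With this layout a weight can range from 0 to 65535 rather than 0 to 255.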


2) Use two cells for transitions that have a weight != 1. We can mark such
transitions by setting symOut = 0xFFFF, which is "reserved for future use" in
the current format version. This would work well if such transitions are
relatively rare.
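
A rough sketch of how the two-cell idea could look. Cells are modelled here as
plain tuples rather than bytes, and storing the weight in the marker cell's
payload (target state) field is purely an assumption; the proposal above does
not say where the weight would go.

```python
WEIGHT_MARKER = 0xFFFF  # symOut value marked "reserved for future use"
DEFAULT_WEIGHT = 1      # transitions with this weight need no extra cell


def encode(transitions):
    """Turn (sym_in, sym_out, target, weight) tuples into a list of cells.

    Each cell is (sym_in, sym_out, payload). A transition whose weight
    differs from DEFAULT_WEIGHT is preceded by a marker cell whose payload
    holds the weight (hypothetical placement).
    """
    cells = []
    for sym_in, sym_out, target, weight in transitions:
        if weight != DEFAULT_WEIGHT:
            cells.append((0, WEIGHT_MARKER, weight))
        cells.append((sym_in, sym_out, target))
    return cells


def decode(cells):
    """Inverse of encode(): rebuild (sym_in, sym_out, target, weight)."""
    transitions = []
    pending = DEFAULT_WEIGHT
    for sym_in, sym_out, payload in cells:
        if sym_out == WEIGHT_MARKER:
            pending = payload  # weight applies to the next real cell
            continue
        transitions.append((sym_in, sym_out, payload, pending))
        pending = DEFAULT_WEIGHT
    return transitions


example = [(1, 1, 10, 1), (2, 3, 11, 42), (4, 4, 12, 1)]
example_cells = encode(example)
```

Only the one weighted transition in the example costs an extra cell, which is
why this approach depends on weighted transitions being rare.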


3) We can reserve a range from the alphabet for representing weights. This
would also require that most transitions have weight = 1, and it would lead to
splitting each weighted transition into two (a weight transition followed by
an unweighted transition). So this would be similar to 2), just implemented at
a higher level in the code.
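
As an illustration of the reserved-range idea, the sketch below maps weights
0-255 onto a block of symbol numbers and attaches each weight to the symbol
that follows it on a path. The range start 0xFE00 and all function names are
hypothetical, chosen only to stay clear of 0xFFFF.

```python
WEIGHT_BASE = 0xFE00   # hypothetical start of the reserved symbol range
WEIGHT_COUNT = 256     # one symbol per weight value 0..255
DEFAULT_WEIGHT = 1     # weight of a transition with no weight symbol


def weight_symbol(weight):
    """Return the reserved symbol that represents the given weight."""
    if not 0 <= weight < WEIGHT_COUNT:
        raise ValueError("weight out of range")
    return WEIGHT_BASE + weight


def is_weight_symbol(symbol):
    """True if the symbol lies in the reserved weight range."""
    return WEIGHT_BASE <= symbol < WEIGHT_BASE + WEIGHT_COUNT


def attach_weights(symbols):
    """Pair each ordinary symbol on a path with its weight.

    A weight symbol applies to the next ordinary symbol; ordinary symbols
    without a preceding weight symbol get DEFAULT_WEIGHT.
    """
    result = []
    pending = DEFAULT_WEIGHT
    for symbol in symbols:
        if is_weight_symbol(symbol):
            pending = symbol - WEIGHT_BASE
        else:
            result.append((symbol, pending))
            pending = DEFAULT_WEIGHT
    return result


path = [10, weight_symbol(42), 11, 12]
```

Here the file format stays unchanged and only the code that walks the
transducer needs to know about the reserved range.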


Harri

