[libvoikko] VFST file format and weights
Harri Pitkänen
hatapitk at iki.fi
Tue Jan 20 18:29:19 EET 2015
On Tuesday 20 January 2015 09:35:46 Sjur Moshagen wrote:
> I have a recollection that Harri mentioned that the VFST format could be
> extended with simple weights, something like a one-byte weight (0-255
> integer) - but I can find no mentioning of this in any e-mail. Is that
> still something that could be done? The main reason for this would be to be
> able to order speller suggestions.
VFST format is described here:
https://github.com/voikko/corevoikko/wiki/vfst-fileformat
The 8-byte transition cell has no unused bytes left. But there are at
least three approaches that could be used:
1) Double the cell size to 16 bytes for weighted transducers. The layout for
transition cells could then be
- 4 bytes for input symbol
- 4 bytes for output symbol
- 4 bytes for target state
- 2 bytes for weight
- either 1 byte for the number of outgoing transitions plus 1 byte reserved
  for future use, or 2 bytes for the number of outgoing transitions
Using 4 bytes for target state (instead of 3 as we have now) would allow for
building much larger transducers. Currently we are limited to about 16 million
(2^24) transitions, which is not always enough. The downside is that doubling
the size of the transducer will make it more difficult to use CPU caches
efficiently.
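A rough sketch of what such a 16-byte cell could look like, written in Python
only for illustration. The field order follows the list above with the
2-byte transition count variant; the little-endian byte order and all names
are my assumptions here, nothing of this exists in the current format.

```python
import struct

# Hypothetical 16-byte weighted transition cell (not an existing format):
#   symIn (4 bytes), symOut (4 bytes), targetState (4 bytes),
#   weight (2 bytes), number of outgoing transitions (2 bytes).
# "<" selects little-endian with no padding; the byte order is an assumption.
WEIGHTED_CELL = struct.Struct("<IIIHH")


def pack_cell(sym_in, sym_out, target_state, weight, more_transitions):
    """Serialize one weighted transition cell into 16 bytes."""
    return WEIGHTED_CELL.pack(sym_in, sym_out, target_state, weight,
                              more_transitions)


def unpack_cell(data):
    """Return (sym_in, sym_out, target_state, weight, more_transitions)."""
    return WEIGHTED_CELL.unpack(data)


# A target state above 2**24 does not fit in the current 3-byte field
# but fits here.
example = pack_cell(5, 7, 20_000_000, 300, 2)
```

The 2-byte weight gives a range of 0-65535, more than the one-byte weight
that was asked for.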
2) Use two cells for transitions that have a weight != 1. We can mark such
transitions by setting symOut = 0xFFFF, which is "reserved for future use" in
the current format version. This would work well if such transitions are
relatively rare.
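One possible way to do this, sketched in Python. I assume here that the
current 8-byte cell is 2 bytes symIn, 2 bytes symOut and a 32-bit word holding
the 24-bit target state plus an 8-bit transition count, in little-endian
order. How the second cell would store the real output symbol and the weight
is not decided anywhere, so that part is purely hypothetical.

```python
import struct

# Assumed 8-byte cell: symIn (2), symOut (2), then a 32-bit word with the
# target state in the low 24 bits and the transition count in the high 8.
CELL = struct.Struct("<HHI")
WEIGHT_MARKER = 0xFFFF  # symOut value "reserved for future use"


def pack_info(target, more):
    return (target & 0xFFFFFF) | (more << 24)


def encode(sym_in, sym_out, target, more, weight=1):
    """Encode a transition in one cell, or two if it carries a weight."""
    if weight == 1:
        return CELL.pack(sym_in, sym_out, pack_info(target, more))
    # Hypothetical extension cell: real symOut plus the weight.
    marked = CELL.pack(sym_in, WEIGHT_MARKER, pack_info(target, more))
    extension = CELL.pack(0, sym_out, weight)
    return marked + extension


def decode(buf, offset=0):
    """Return ((sym_in, sym_out, target, more, weight), next_offset)."""
    sym_in, sym_out, info = CELL.unpack_from(buf, offset)
    target, more = info & 0xFFFFFF, info >> 24
    if sym_out != WEIGHT_MARKER:
        return (sym_in, sym_out, target, more, 1), offset + CELL.size
    _, real_out, weight = CELL.unpack_from(buf, offset + CELL.size)
    return (sym_in, real_out, target, more, weight), offset + 2 * CELL.size
```

Unweighted transitions stay at 8 bytes, so the file only grows by one cell
per weighted transition.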
3) We can reserve a range from the alphabet for representing weights. This
would also require that most transitions have weight = 1 and would lead to
splitting weighted transitions into two (weight + unweighted transition). So
this would be similar to 2) but just implemented at a higher level in the code.
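To illustrate the idea, a small Python sketch. The reserved range
(0xFE00-0xFEFF, one symbol per weight 0-255) and all function names are made
up for this example.

```python
# Hypothetical reserved symbol range: one symbol for each weight 0..255.
WEIGHT_BASE = 0xFE00
WEIGHT_COUNT = 256


def weight_to_symbol(weight):
    """Map a weight 0..255 to its reserved alphabet symbol."""
    if not 0 <= weight < WEIGHT_COUNT:
        raise ValueError("weight out of range")
    return WEIGHT_BASE + weight


def is_weight_symbol(symbol):
    return WEIGHT_BASE <= symbol < WEIGHT_BASE + WEIGHT_COUNT


def split_path(symbols):
    """Separate ordinary symbols from the weights found along one path."""
    plain = [s for s in symbols if not is_weight_symbol(s)]
    weights = [s - WEIGHT_BASE for s in symbols if is_weight_symbol(s)]
    return plain, weights
```

The code that walks the transducer would then strip the weight symbols from
the output and use the collected weights for ordering the suggestions.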
Harri