[libvoikko] VFST file format and weights

Sjur Moshagen sjurnm at mac.com
Fri Mar 20 00:07:40 EET 2015


On 20 Jan 2015, at 17:29, Harri Pitkänen <hatapitk at iki.fi> wrote:
> 
> On Tuesday 20 January 2015 09:35:46 Sjur Moshagen wrote:
>> I have a recollection that Harri mentioned that the VFST format could be
>> extended with simple weights, something like a one-byte weight (0-255
>> integer) - but I can find no mention of this in any e-mail. Is that
>> still something that could be done? The main reason for this would be to be
>> able to order speller suggestions.
> 
> VFST format is described here:
> 
>  https://github.com/voikko/corevoikko/wiki/vfst-fileformat
> 
> The 8 byte transition cell does not have unused bytes left. But there are at 
> least three approaches that could be used:
> 
> 
> 1) Doubling the cell size to 16 bytes for weighted transducers. The layout for 
> transition cells could then be
> - 4 bytes for input symbol
> - 4 bytes for output symbol
> - 4 bytes for target state
> - 2 bytes for weight
> - 1 byte for # of outgoing transitions + 1 byte for future use OR
>   2 bytes for # of outgoing transitions
> 
> Using 4 bytes for target state (instead of 3 as we have now) would allow for 
> building much larger transducers. Currently we are limited to about 16 million 
> transitions which is not always enough. The downside is that doubling the size 
> of the transducer will make it more difficult to use CPU caches efficiently.
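
For reference, the 16-byte layout described in option 1 can be sketched roughly as below. This is only an illustration of the field sizes listed in the quoted text, not the actual VFST format: the field order, the little-endian byte order, the choice of the "2 bytes for # of outgoing transitions" variant, and all names (CELL, pack_cell, unpack_cell) are assumptions made for the example.

```python
import struct

# Hypothetical 16-byte weighted transition cell (assumed layout, little-endian):
#   4 bytes input symbol, 4 bytes output symbol, 4 bytes target state,
#   2 bytes weight, 2 bytes number of outgoing transitions.
CELL = struct.Struct("<IIIHH")

# Largest target index representable with the current 3-byte field
# (2**24 = 16,777,216 values, i.e. "about 16 million").
MAX_TARGET_3_BYTES = 2**24 - 1


def pack_cell(input_sym, output_sym, target_state, weight, n_transitions):
    """Serialize one transition cell into exactly 16 bytes."""
    return CELL.pack(input_sym, output_sym, target_state, weight, n_transitions)


def unpack_cell(data):
    """Return (input_sym, output_sym, target_state, weight, n_transitions)."""
    return CELL.unpack(data)


# A target state beyond the 3-byte limit fits in the 4-byte field,
# and a 2-byte weight is not restricted to the 0-255 range.
example = pack_cell(5, 7, 20_000_000, 300, 2)
```

With a 4-byte target field the index limit rises from 2**24 to 2**32, at the cost of each cell taking twice the memory.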

I don’t really have the knowledge to hold a firm opinion on this, but to me option 1 seems the better choice, since it would support larger transducers.

Sjur


