[libvoikko] Using BCP 47 language tags in libvoikko

Mon Apr 12 20:24:21 EEST 2010

I am in the process of extending the library API to work with multiple 
languages. The original plan was to extend the voikkoInit function so that the 
single langcode parameter would be replaced by two parameters, one for the 
language and one for the language variant.

Previous versions of libvoikko have operated on the assumption that "langcode" 
always refers to a variant of the Finnish language. "", "default" and "fi_FI" 
have all referred to the standard vocabulary, and other values have been used 
freely for other variants. Dialects, standard vocabulary + special vocabulary 
from the medical field, standard vocabulary + extended morphological data and 
Omorfi have all been made available through this mechanism.

After having considered this again, I'm starting to think that splitting the 
parameter into two parts is not necessary and would even limit things in the 
future. Instead of doing that, I'm now proposing that we adopt IETF BCP 47 
language tags for identifying the available vocabularies:

  http://tools.ietf.org/rfc/bcp/bcp47.txt

This standard seems to cover all the needs that I can think of for the 
foreseeable future. Non-standard variants such as our medical vocabulary would 
be described with private use subtags:

  fi-x-medicine
  fi-x-hfst
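
To make the shape of such tags concrete, here is a minimal Python sketch that
splits a tag into its language part and its private use subtags. The function
name split_private_use is hypothetical and not part of libvoikko. The check
follows the private use grammar of BCP 47 ("x" followed by one or more subtags
of 1 to 8 ASCII letters or digits) and does not validate the language part.

```python
import re

# Private use sequence per the BCP 47 grammar: "x", then one or more
# subtags of 1 to 8 ASCII letters or digits.
_PRIVATE_USE = re.compile(r"x(-[a-z0-9]{1,8})+")


def split_private_use(tag):
    """Return (language_part, private_subtags) for a tag, lower-cased.

    Hypothetical helper for illustration only; the language part itself
    is not validated here.
    """
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return "-".join(subtags), []
    i = subtags.index("x")
    private = "-".join(subtags[i:])
    if not _PRIVATE_USE.fullmatch(private):
        raise ValueError("malformed private use sequence: " + private)
    return "-".join(subtags[:i]), subtags[i + 1:]


print(split_private_use("fi-x-medicine"))  # ('fi', ['medicine'])
```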

Unfortunately, the length of an individual private use subtag is limited to 
eight characters, which is an additional restriction compared to our previous 
rules for langcode. This can be worked around by adding multiple private use 
subtags, which will lead to rather weird compatibility mappings between the 
old and new API:

  reallylongvariantname <-> fi-x-reallylo-ngvarian-tname

I don't think this will matter much in practice. We just need to implement 
this mapping to remain compatible with our current format for vocabulary data.
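
A rough Python sketch of such a compatibility mapping could look like the
following. Everything here is an assumption made for illustration: the function
names are hypothetical, the three legacy default values are mapped to the bare
tag "fi", and legacy names are assumed to consist of ASCII letters and digits
only. The sketch emits one "x" singleton followed by all chunks, since the
BCP 47 grammar lets a single "x" introduce any number of private use subtags.

```python
MAX_SUBTAG = 8  # BCP 47 limit for a single private use subtag

# Legacy values that have all meant "standard Finnish vocabulary".
LEGACY_DEFAULTS = {"", "default", "fi_FI"}


def legacy_to_tag(langcode):
    """Map an old-style langcode to a tag (hypothetical helper)."""
    if langcode in LEGACY_DEFAULTS:
        return "fi"  # assumption: the plain tag denotes the standard vocabulary
    if not (langcode.isascii() and langcode.isalnum()):
        raise ValueError("unsupported legacy langcode: " + langcode)
    # Cut the name into pieces of at most eight characters.
    chunks = [langcode[i:i + MAX_SUBTAG]
              for i in range(0, len(langcode), MAX_SUBTAG)]
    return "fi-x-" + "-".join(chunks)


def tag_to_legacy(tag):
    """Map a tag back to an old-style langcode (hypothetical helper)."""
    if tag == "fi":
        return "default"
    prefix = "fi-x-"
    if not tag.startswith(prefix):
        raise ValueError("no legacy equivalent for: " + tag)
    # Joining the private use subtags restores the original name.
    return tag[len(prefix):].replace("-", "")


print(legacy_to_tag("reallylongvariantname"))  # fi-x-reallylo-ngvarian-tname
print(tag_to_legacy("fi-x-medicine"))          # medicine
```

Note that the reverse direction is not unique: "fi-x-abc-def" and "fi-x-abcdef"
would both map to "abcdef", so a real implementation would have to decide
which forms it accepts.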

One benefit of using BCP 47 is that it incorporates RFC 4647 (Matching of 
Language Tags), which would provide us with an algorithm for filtering the 
available dictionaries (voikko_list_dicts) and for looking up the most 
appropriate vocabulary when an incomplete language tag is specified in 
voikkoInit.
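
For reference, the two RFC 4647 schemes that seem most relevant here are basic
filtering and lookup. The Python sketch below is a simplified illustration and
not libvoikko code: the function names and the list of available tags
(including "sv") are made up, and wildcard handling in lookup is left out.

```python
def basic_filter(available, language_range):
    """RFC 4647 basic filtering: keep tags equal to the range or
    starting with the range followed by "-" (case-insensitive)."""
    wanted = language_range.lower()
    if wanted == "*":
        return list(available)
    return [tag for tag in available
            if tag.lower() == wanted or tag.lower().startswith(wanted + "-")]


def lookup(available, language_range, default=None):
    """RFC 4647 lookup: truncate the range from the end until a tag
    matches exactly, otherwise return the default."""
    by_lower = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()
        # A single-character subtag such as "x" must not be left at the
        # end after truncation, so it is removed as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical set of installed vocabularies.
available = ["fi", "fi-x-medicine", "fi-x-hfst", "sv"]

print(basic_filter(available, "fi"))    # ['fi', 'fi-x-medicine', 'fi-x-hfst']
print(lookup(available, "fi-x-dialect"))  # fi
print(lookup(available, "fi-FI"))         # fi
```

Filtering would presumably fit voikko_list_dicts and lookup would fit
voikkoInit, but that split is only my reading of the two algorithms.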

Do you have better ideas or standards in mind that could be used for 
identifying languages?

Harri