[libvoikko] Using BCP 47 language tags in libvoikko
Harri Pitkänen
hatapitk at iki.fi
Mon Apr 12 20:24:21 EEST 2010
I am in the process of extending the library API to work with multiple
languages. The original plan was to extend the voikkoInit function so that the
single langcode parameter would be replaced by two parameters, one for the
language and one for the language variant.
Previous versions of libvoikko have operated on the assumption that "langcode"
always refers to a variant of the Finnish language. "", "default" and "fi_FI"
have all referred to the standard vocabulary, and other values have been used
freely for other variants. Dialects, the standard vocabulary plus special
vocabulary from the medical field, the standard vocabulary plus extended
morphological data, and Omorfi have all been made available through this
mechanism.
After having considered this again, I'm starting to think that splitting the
parameter into two parts is not necessary and would even limit things in the
future. Instead, I'm now proposing that we adopt IETF BCP 47 language tags for
identifying the available vocabularies:
http://tools.ietf.org/rfc/bcp/bcp47.txt
This standard seems to cover all the needs that I can think of for the
foreseeable future. Non-standard variants such as our medical vocabulary would
be described with private use subtags:
fi-x-medicine
fi-x-hfst
Unfortunately, the length of an individual private use subtag is limited to
eight characters, which is an additional restriction compared to our previous
rules for langcode. This can be worked around by adding multiple private use
subtags, which will lead to rather weird compatibility mappings between the
old and new API:
reallylongvariantname <-> fi-x-reallylo-x-ngvarian-x-tname
I don't think this will matter much in practice; we just need to implement
this mapping to remain compatible with our current format for vocabulary data.
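As a rough illustration, the compatibility mapping could be sketched in Python
along the following lines. This is only a sketch and not part of libvoikko:
the names langcode_to_tag, tag_to_langcode and STANDARD_CODES are hypothetical,
and it assumes that old variant names consist of ASCII letters and digits only
and that the plain tag "fi" maps back to "default". It repeats the "x" before
every chunk to match the example above; the BCP 47 grammar would also allow a
single "x" followed by several subtags.

```python
import re

MAX_SUBTAG_LEN = 8  # BCP 47 limit for one private use subtag

# Assumption: these legacy values all mean the standard Finnish vocabulary.
STANDARD_CODES = {"", "default", "fi_FI"}


def langcode_to_tag(langcode, language="fi"):
    """Map an old-style langcode to a language tag (hypothetical helper)."""
    if langcode in STANDARD_CODES:
        return language
    if not re.fullmatch(r"[A-Za-z0-9]+", langcode):
        raise ValueError("subtags may contain only ASCII letters and digits")
    # Cut the variant name into pieces of at most eight characters.
    chunks = [langcode[i:i + MAX_SUBTAG_LEN]
              for i in range(0, len(langcode), MAX_SUBTAG_LEN)]
    return language + "".join("-x-" + chunk for chunk in chunks)


def tag_to_langcode(tag, language="fi"):
    """Reverse mapping for tags produced by langcode_to_tag."""
    if tag == language:
        return "default"  # assumption: pick one of the legacy spellings
    prefix = language + "-x-"
    if not tag.startswith(prefix):
        raise ValueError("not a private use variant of " + language)
    # Every chunk except the last is eight characters long, so "-x-" can
    # only occur as a separator between chunks.
    return "".join(tag[len(prefix):].split("-x-"))
```

With these helpers, langcode_to_tag("medicine") gives "fi-x-medicine", and
tag_to_langcode undoes the split for longer names.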
One benefit of using BCP 47 is that it incorporates RFC 4647 (Matching of
Language Tags), which would provide us with an algorithm for filtering the
available dictionaries (voikko_list_dicts) and for looking up the most
appropriate vocabulary when an incomplete language tag is specified in
voikkoInit.
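To give an idea of what RFC 4647 matching involves, here is a minimal Python
sketch of its basic filtering (section 3.3.1) and lookup (section 3.4) schemes.
The function names basic_filter and lookup are hypothetical and do not exist in
libvoikko, the list of tags is assumed to come from something like
voikko_list_dicts, and wildcard handling in lookup is left out.

```python
def basic_filter(language_range, tags):
    """Return all tags that match the range (RFC 4647 basic filtering)."""
    wanted = language_range.lower()
    if wanted == "*":
        return list(tags)
    # A tag matches if it equals the range or starts with the range
    # followed by "-". Comparison is case-insensitive.
    return [tag for tag in tags
            if tag.lower() == wanted or tag.lower().startswith(wanted + "-")]


def lookup(language_range, tags, default=None):
    """Return the single best match (RFC 4647 lookup), or the default."""
    by_lower = {tag.lower(): tag for tag in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        # Truncate from the end; a single-character subtag such as "x"
        # is dropped together with the subtag that followed it.
        subtags.pop()
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default
```

For example, with the tags "fi", "fi-x-medicine" and "fi-x-hfst" available,
basic_filter("fi", ...) would return all three, while lookup("fi-x-dialect",
...) would fall back to "fi".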
Do you have better ideas or standards in mind that could be used for
identifying languages?
Harri