[libvoikko] Using BCP 47 language tags in libvoikko

Flammie Pirinen flammie at iki.fi
Tue Apr 13 07:29:03 EEST 2010


2010-04-12, Harri Pitkänen wrote:

> After having considered this again I'm starting to think that
> splitting the parameter in two parts is not necessary and would even
> limit things in the future. Instead of doing that I'm now proposing
> that we adopt IETF BCP 47 language tags for identifying the available
> vocabularies:
> 
>   http://tools.ietf.org/rfc/bcp/bcp47.txt

I agree that BCP 47 is the most suitable standard for naming
languages. It has been in use, in its various incarnations, for a
reasonably long time, and I have yet to see any real shortcomings in
any of its applications.

> Unfortunately length of an individual private use subtag is limited
> to eight characters which is an additional limitation to our previous
> rules for langcode. This can be worked around by adding multiple
> private use subtags which will lead to rather weird compatibility
> mappings between the old and new API:
> 
>   reallylongvariantname <-> fi-x-reallylo-x-ngvarian-x-tname

I might be reading the ABNF wrong, but doesn't

  privateuse    = "x" 1*("-" (1*8alphanum))

mean that you could just as well use fi-x-really-long-variant-name (or
fi-x-reallylo-ngvarian-tname, assuming automatic mapping, of course)?
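
To make that reading concrete, here is a rough Python sketch of the
automatic mapping, assuming the privateuse production quoted above. The
function names are made up for illustration and are not part of
libvoikko; the regular expression only checks the private use part,
not a whole language tag.

```python
import re

# privateuse = "x" 1*("-" (1*8alphanum)), as quoted from the BCP 47 ABNF.
PRIVATEUSE = re.compile(r"x(-[A-Za-z0-9]{1,8})+")


def is_privateuse(text):
    """Return True if text is a well-formed private use sequence."""
    return PRIVATEUSE.fullmatch(text) is not None


def variant_to_privateuse(variant, width=8):
    """Hypothetical mapping: cut an old-style variant name into subtags
    of at most `width` characters behind a single "x" singleton."""
    if not (variant and variant.isascii() and variant.isalnum()):
        raise ValueError("variant must be non-empty ASCII alphanumerics")
    chunks = [variant[i:i + width] for i in range(0, len(variant), width)]
    return "x-" + "-".join(chunks)


def privateuse_to_variant(privateuse):
    """Reverse mapping: join all subtags after the leading "x"."""
    if not is_privateuse(privateuse):
        raise ValueError("not a private use sequence")
    return "".join(privateuse.split("-")[1:])


mapped = variant_to_privateuse("reallylongvariantname")
print("fi-" + mapped)
print(privateuse_to_variant("x-really-long-variant-name"))
```

Under this reading, the form with a repeated "x" also matches the
production, since "x" is itself a valid one-character subtag; it is
just longer than it needs to be.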

> One benefit of using BCP 47 is that it incorporates RFC 4647
> (Matching of  Language Tags) which would provide us with an algorithm
> for filtering available dictionaries (voikko_list_dicts) and looking
> up most appropriate vocabulary when incomplete language tag is
> specified in voikkoInit.

I haven't checked the algorithm, but I suppose that it can do
something reasonable if, say, only an HFST variant and a medical
variant are available. Of course, in the end a good user interface is
always required.
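
For reference, a minimal Python sketch of the two RFC 4647 schemes
mentioned in the quoted text, basic filtering (section 3.3.1) and
lookup (section 3.4), as I understand them. The dictionary tags
fi-x-hfst and fi-x-medicine are invented for this example, and nothing
here is libvoikko's actual API.

```python
def basic_filter(available, language_range):
    """Basic filtering: keep tags equal to the range, or starting with
    the range followed by "-". The range "*" matches everything."""
    wanted = language_range.lower()
    return [
        tag for tag in available
        if wanted == "*"
        or tag.lower() == wanted
        or tag.lower().startswith(wanted + "-")
    ]


def lookup(available, language_range, default=None):
    """Lookup: truncate the range from the end until a tag matches.
    Single-character subtags such as "x" are dropped together with the
    subtag that followed them."""
    tags = {tag.lower(): tag for tag in available}
    candidate = language_range.lower()
    while candidate:
        if candidate in tags:
            return tags[candidate]
        parts = candidate.split("-")[:-1]
        while parts and len(parts[-1]) == 1:
            parts.pop()
        candidate = "-".join(parts)
    return default


# Hypothetical set of installed dictionaries: no plain "fi" available.
installed = ["fi-x-hfst", "fi-x-medicine"]
print(basic_filter(installed, "fi"))
print(lookup(installed, "fi-x-medicine-extra"))
print(lookup(installed, "fi"))
```

In this sketch, filtering on "fi" lists both variants, while lookup on
a bare "fi" only ever shortens the range and so falls through to the
default. Choosing between the two variants would still be left to the
caller or the user interface.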

-- 
Flammie, computer scientist bachelor, linguist master, free software
Finnish localiser, and more! <http://www.iki.fi/flammie/>


