[libvoikko] Follow up on BCP 47; make check
Harri Pitkänen
hatapitk at iki.fi
Sun Apr 18 23:42:43 EEST 2010
SVN trunk of libvoikko now partially supports BCP 47 language tags in the new
API. Partially means that in practice only tags of the form "fi", "fi-FI",
"fi-x-something" and "fi-x-some-thing" are supported.

Note that "fi-x-something" is still supported even though it is not a valid
tag (BCP 47 limits each subtag to eight characters). I decided to do the
following:
- voikko_dict_variant still returns the variant name exactly as given in the
  Language-Variant header of voikko-fi_FI.pro. No change here from the
  previous version of libvoikko.
- The old initialization API still accepts the language codes as before.
- The new initialization API (voikkoInit) accepts the variant name as a
  private use subtag. You may add the extra hyphens required by BCP 47 or
  leave them out. This does not cause ambiguities since the hyphen has not
  been an accepted character in variant names in version 2 of the dictionary
  format. It is recommended that applications create the language tag by
  concatenating "fi-x-" and the string returned by voikko_dict_variant, since
  this will also work with version 3 of the dictionary format (which has not
  yet been specified).
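
To illustrate the recommendation above, here is a minimal sketch in Python.
It is not libvoikko code: the helper names to_language_tag and
variant_from_tag are hypothetical, and the parsing only mirrors the rule
described in this message (hyphens after "fi-x-" carry no meaning because
variant names cannot contain them). It assumes a non-empty variant name.

```python
def to_language_tag(variant, language="fi", strict=True):
    """Build a language tag for a dictionary variant name.

    Hypothetical helper, not part of the libvoikko API.
    """
    if not strict:
        # The simple form recommended in the message: "fi-x-" + variant.
        return f"{language}-x-{variant}"
    # BCP 47 allows at most 8 characters per subtag, so split longer names.
    chunks = [variant[i:i + 8] for i in range(0, len(variant), 8)]
    return "-".join([language, "x"] + chunks)


def variant_from_tag(tag):
    """Recover the variant name from a tag, or None if there is none."""
    _language, separator, private = tag.partition("-x-")
    if not separator:
        return None
    # Hyphens inside the private use part are only there for BCP 47.
    return private.replace("-", "")


print(to_language_tag("something", strict=False))  # fi-x-something
print(to_language_tag("something"))                # fi-x-somethin-g
print(variant_from_tag("fi-x-some-thing"))         # something
```

Both spellings of the private use part map back to the same variant name,
which is why accepting the technically invalid "fi-x-something" is harmless.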

There is a naming inconsistency here since voikko_dict_variant corresponds to
the "private use subtag" of BCP 47, not the "variant subtag". I chose to
ignore this inconsistency since I don't believe we will have much use for BCP
47 variants, and associating our medical vocabularies with the term "private
use" would likely confuse people who have not read the standard.

I also added two new functions to the API that allow applications to find out
which languages are supported:
- voikkoListSupportedLanguages lists the currently supported languages,
  meaning that at least spell checking will work with these languages. Now
  that I think of it, I need to rename this function to
  voikkoListSupportedSpellingLanguages to avoid confusion in the future. The
  languages are listed in a way that suits typical applications that do not
  care about or cannot handle multiple dictionaries for one language. In
  practice the returned strings contain only the language subtag. Only in the
  few special cases where it is customary to show multiple options for one
  language in the user interface (for example en-US and en-GB) may we return
  codes containing the language subtag AND a region or script subtag.
- voikko_dict_language returns the language subtag for the given voikko_dict.
  Similar functions still need to be added for region and script subtags, but
  this does not necessarily need to happen for libvoikko 3.0.
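
The listing policy described above can be sketched roughly as follows. This
is only an illustration in Python, not how libvoikko implements it: the
function name list_spelling_languages and the KEEP_SECOND_SUBTAG set are
assumptions made for the example, and the input tags are invented.

```python
# Hypothetical: languages for which the listing keeps the region or script
# subtag because user interfaces customarily offer several options.
KEEP_SECOND_SUBTAG = {"en"}


def list_spelling_languages(dictionary_tags):
    """Collapse dictionary tags into the codes an application would see."""
    shown_codes = []
    for tag in dictionary_tags:
        subtags = tag.split("-")
        language = subtags[0]
        shown = language
        # Keep the second subtag only for the special-case languages, and
        # never treat the private use marker "x" as a region or script.
        if (language in KEEP_SECOND_SUBTAG
                and len(subtags) > 1 and subtags[1] != "x"):
            shown = f"{language}-{subtags[1]}"
        if shown not in shown_codes:
            shown_codes.append(shown)
    return shown_codes


print(list_spelling_languages(["fi", "fi-x-something", "en-US", "en-GB"]))
# ['fi', 'en-US', 'en-GB']
```

Several Finnish dictionaries collapse to a single "fi" entry, so an
application that cannot handle multiple dictionaries per language sees each
language only once.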

Note that while the API now appears to support multiple languages, it is not
actually possible to create a dictionary that would be advertised as
containing anything other than Finnish. It is best to leave that to libvoikko
3.1 or later. The important thing is that it is now possible to make
applications (openoffice.org-voikko, mozvoikko and Enchant) fully language
independent, so that a future update of libvoikko is enough to enable the
actual feature.

It is also still recommended that an installation of libvoikko comes with a
hard dependency on Suomi-malaga. This is because we still support the old API,
which allowed developers to assume that support for Finnish spell checking and
hyphenation is always present. At least openoffice.org-voikko still relies on
this assumption and will behave in a suboptimal way if it does not hold.

While making these API changes I added some tests to the "check" target of
the autotools build system; previously there were none. Not much is tested
there yet, but hopefully the situation will improve. Our own Debian packaging
files in SVN now run this test suite as part of the build process. It might
be a good idea to do this in the "real" distribution packaging scripts too,
since the tests could catch errors that would otherwise go unnoticed (for
example runtime issues on rarely tested architectures). The test suite
requires no additional dependencies compared to a normal build, and it has no
significant effect on build time either.

Harri