[libvoikko] Proposed interface for hyphenator components

Mon Nov 30 16:15:21 EET 2009

Hi,

Thanks for drafting the interface. I see in the svn log that you have coded a bit since you wrote this, so I hope my feedback isn't coming too late.

I have comments on the individual functions as well as some more general ones. I'll start with the general ones.

Linguistics and language independence
-------------------------------------
For the library to be truly language independent, the functions need to be as well. This means that all language-dependent behaviour should be moved to the linguistic components below/behind the library, i.e. Finnish behaviour should be in the Finnish component, Norwegian in the Norwegian component, Sámi in the Sámi one, etc.

Graded hyphenation points
-------------------------
In many cases some hyphenation points are clearly preferable to others. Which one to choose must be up to the editor (the human or application), though. To be able to guide the editor to choose the best hyphenation point in each case, each word should be given hyphenation points with different weights. The weight scale should be universal, i.e. applicable to all languages and tools, but could be given a linguistic motivation for each language. One example could be a scale from 0 to 9, where 0 is "never hyphenate" and 9 is best ("always hyphenate at this point"). For languages with free compounding a (semi)linguistic interpretation of the scale could be:

9 - hard hyphen (DNA-sekvens)
8 - manually inserted soft hyphen
7 - word boundary (eplehuset -> eple-huset)
5 - other morphological boundary (e.g. between stem and inflectional ending: hus-et)
2 - other hyphenation points (eple -> ep-le)
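
To make the idea concrete, here is a minimal sketch of how an application might use such weights. Everything in it (the enum names, the HyphenationPoint struct, the pickBestPoint function) is hypothetical and only illustrates the scale above; none of it exists in libvoikko.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical weight scale mirroring the list above (not a libvoikko type).
enum HyphenWeight {
    WEIGHT_NEVER = 0,
    WEIGHT_OTHER_POINT = 2,
    WEIGHT_MORPH_BOUNDARY = 5,
    WEIGHT_WORD_BOUNDARY = 7,
    WEIGHT_MANUAL_SOFT_HYPHEN = 8,
    WEIGHT_HARD_HYPHEN = 9
};

// A break may occur before the character at `position`.
struct HyphenationPoint {
    std::size_t position;
    int weight;
};

// Returns the index (into `points`) of the preferred break at or before
// maxPosition, or -1 if no break fits. Higher weight wins; on equal
// weight the later position wins, so that the line is filled better.
int pickBestPoint(const std::vector<HyphenationPoint>& points,
                  std::size_t maxPosition) {
    int best = -1;
    for (std::size_t i = 0; i < points.size(); ++i) {
        const HyphenationPoint& p = points[i];
        if (p.weight == WEIGHT_NEVER || p.position > maxPosition) {
            continue;
        }
        if (best < 0 || p.weight > points[best].weight ||
            (p.weight == points[best].weight &&
             p.position > points[best].position)) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

For eplehuset with points at positions 2 (weight 2), 4 (weight 7) and 7 (weight 5), an application with room for the whole word minus one character would pick the word boundary (eple-huset), while one with room for only three characters would have to fall back to ep-le.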

The fact that most (all?) applications today are not able to use such information is no reason not to start providing the information - without data, it isn't possible to build the functionality into applications, but by providing the data, we indicate the need for this functionality, and hopefully that will lead to better hyphenation functionality in the applications.

Context
-------
In some cases the correct hyphenation pattern can only be determined after disambiguation. For this to happen, the hyphenator would need the syntactic context, at least the full sentence.

In fact, there are cases where the correct hyphenation pattern can only be determined by full text classification combined with a similarly classified lexicon. One example from Swedish is bildrulle, which could be analysed as either bil#drulle (slow driver) or bild#rulle (photographic film roll). Although no one has built such a system, it is technically possible: all the components and technologies needed are available today.

To the actual interface draft
-----------------------------

/**
 * Hyphenate given word.
 * @param word word to hyphenate
 * @param wlen length of the word in wchar_t units
 * @return null-terminated character string containing the hyphenation
 * using the following notation:
 *    ' ' = no hyphenation point before or at this character
 *    '-' = hyphenation point before this character
 *          (character at this position
 *          is preserved in the hyphenated form)
 *    '=' = hyphenation point (character at this position
 *          is replaced with hyphen.)
 * Returns null on error.
 */
virtual char * hyphenate(const wchar_t * word, size_t wlen) = 0;

If I understand this correctly, the following could exemplify the input and output:

"eplehuset"
"  - -  - "
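
To check my reading of the notation, here is a small sketch of how a caller could expand the returned pattern into a readable form. The helper applyPattern is my own invention for illustration and is not part of the drafted interface; it also uses std::wstring and std::string instead of the raw pointers in the draft.

```cpp
#include <string>

// Hypothetical helper: combines a word with the pattern string returned by
// hyphenate(), following the notation in the comment above.
//   ' ' keeps the character unchanged
//   '-' inserts a hyphen before the character
//   '=' replaces the character with a hyphen
std::wstring applyPattern(const std::wstring& word,
                          const std::string& pattern) {
    std::wstring result;
    for (std::size_t i = 0; i < word.size(); ++i) {
        // Treat a pattern that is too short as "no hyphenation point".
        const char mark = i < pattern.size() ? pattern[i] : ' ';
        if (mark == '-') {
            result += L'-';
            result += word[i];
        } else if (mark == '=') {
            result += L'-';
        } else {
            result += word[i];
        }
    }
    return result;
}
```

With the example above, applyPattern(L"eplehuset", "  - -  - ") gives ep-le-hus-et.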

I assume this interface is based on the present malaga implementation of the Finnish hyphenation, but please correct me if I'm wrong.

There are a couple of issues with this. The first is the assumption that inserting hyphens or replacing single characters with hyphens is enough to achieve correct hyphenation. This assumption is wrong for many languages. In general, I think that the only language-independent way to return a hyphenated string is to return the string itself with the proper hyphenation points inserted. In Swedish and Norwegian, for example, a double-consonant sequence at a compound boundary turns into a triple-consonant sequence when hyphenated:

busskysstasjonen -> buss-skyss-sta-sjo-n-en

This is straightforward to implement in a transducer, but cannot be represented in the above data structure.

Other languages might have even more complicated hyphenation patterns, and the only way to isolate the library from these complexities is to assume nothing about the returned string, except it being a string with hyphenation points.

The second issue is that of hyphenation point preference. Using the above example and the 0-9 scale, it could look something like this:

busskysstasjonen -> buss-7skyss-7sta-2sjo-2n-5en

That is, better hyphenation can be provided if the different hyphenation points are weighted. The notation used is just an example, it could be something else.
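
Assuming the example notation above (a hyphen immediately followed by a single-digit weight), a consumer could split the returned string as in the following sketch. The WeightedSegment struct and the parseWeighted function are hypothetical names, and the sketch works on plain std::string for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: one piece of the word plus the weight of the
// break that follows it (0 for the last piece, which has no break after it).
struct WeightedSegment {
    std::string text;
    int weightAfter;
};

// Parses strings such as "buss-7skyss-7sta-2sjo-2n-5en".
// A hyphen that is not followed by a digit is kept as ordinary text.
std::vector<WeightedSegment> parseWeighted(const std::string& s) {
    std::vector<WeightedSegment> out;
    std::string current;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool isBreak = s[i] == '-' && i + 1 < s.size() &&
                             s[i + 1] >= '0' && s[i + 1] <= '9';
        if (isBreak) {
            out.push_back({current, s[i + 1] - '0'});
            current.clear();
            ++i;  // skip the weight digit
        } else {
            current += s[i];
        }
    }
    out.push_back({current, 0});
    return out;
}
```

Because the segments carry the full text of each piece, the restored s in buss-skyss survives without any special handling on the application side.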

The third issue is that of context mentioned above. There should be an optional additional argument containing the sentence of the word being hyphenated, so that it can be used to disambiguate between different competing analyses with corresponding competing hyphenation patterns.
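
A rough sketch of what such an optional argument could look like follows. The class names, the doHyphenate hook and the keyword test are all made up for illustration; in particular the "disambiguation" here is a toy stand-in for real linguistic analysis, and the return type is a plain hyphenated string rather than the pattern used in the draft.

```cpp
#include <string>

// Hypothetical interface: the sentence is optional and may be empty when
// the caller has no context to offer.
class ContextHyphenator {
public:
    virtual ~ContextHyphenator() {}
    std::wstring hyphenate(const std::wstring& word,
                           const std::wstring& sentence = std::wstring()) {
        return doHyphenate(word, sentence);
    }
protected:
    virtual std::wstring doHyphenate(const std::wstring& word,
                                     const std::wstring& sentence) = 0;
};

// Toy implementation for the Swedish example only. A real component would
// disambiguate with proper analysis, not by looking for a keyword.
class ToySwedishHyphenator : public ContextHyphenator {
protected:
    std::wstring doHyphenate(const std::wstring& word,
                             const std::wstring& sentence) override {
        if (word != L"bildrulle") {
            return word;  // no analysis available in this toy
        }
        const bool photoContext =
            sentence.find(L"kamera") != std::wstring::npos;
        return photoContext ? L"bild-rulle" : L"bil-drulle";
    }
};
```

Callers that have no sentence at hand simply omit the second argument and get whatever the component considers its default analysis.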

/**
 * Insert hyphenation positions that are considered to be ugly
 * but correct. Typically this option is not set in text processors that
 * use hyphenation for splitting words at the end of line. It is
 * used in applications that need to split words into syllables.
 * Default: true
 */
virtual void setUglyHyphenation(bool uglyHyphenation) = 0;

Could you give some examples of the kind of applications you have in mind? In my view, ugliness is a relative thing, and the cost of allowing "ugly" hyphenation is related to the layout requirements - in a narrow column you may want to allow uglier hyphenation than with broader or no columns.

If the scaled gradation of hyphenation points is implemented, I suggest that the scale would replace this function call.
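
As a sketch of how the scale could take over the role of this option: the component returns all points with their weights, and the caller applies whatever threshold suits its layout. The WeightedBreak struct and the renderBreaks function are hypothetical names used only for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical type: a break before the character at `position`.
struct WeightedBreak {
    std::size_t position;
    int weight;
};

// Inserts hyphens only at breaks whose weight is at least minWeight.
// A syllabification tool would pass a low threshold, a text processor
// with wide columns a high one.
std::string renderBreaks(const std::string& word,
                         const std::vector<WeightedBreak>& breaks,
                         int minWeight) {
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        for (const WeightedBreak& b : breaks) {
            if (b.position == i && b.weight >= minWeight) {
                out += '-';
                break;
            }
        }
        out += word[i];
    }
    return out;
}
```

With the eplehuset weights from the scale above (2 before "le", 7 before "huset", 5 before "et"), a threshold of 7 yields eple-huset, 5 yields eple-hus-et, and 1 yields ep-le-hus-et.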

/**
 * Hyphenate unknown words. Default: true
 */
virtual void setHyphenateUnknown(bool hyphenateUnknown) = 0;

Is this related to some user setting? It should be up to the user to determine this behaviour, but the default is ok.

/**
 * There are two possible rules that can be applied when hyphenating
 * compound words that can be split in more than one different way. We
 * either take the intersection of (1) all possible hyphenations or
 * (2) all hyphenations where the compound word has the minimal number
 * of parts (:= m) in it. The rule (1) is applied if and only if
 * m > voikko_intersect_compound_level. Default: 1
 */
virtual void setIntersectCompoundLevel(int level) = 0;

See my comments above about language independence, compounds, etc. Also, one could imagine a user interface for letting the user choose between two different hyphenations - some sort of grammar-checker like interface where the user is asked to pick one or the other in cases where the hyphenator isn't able to uniquely identify one pattern. If so, one would need to be able to return more than one hyphenation pattern.

Does this option also correspond to a user choice? Or am I missing something here?

/**
 * The minimum length for words that may be hyphenated. This limit is
 * also enforced on individual parts of compound words. Default: 2
 */
virtual void setMinHyphenatedWordLength(int length) = 0;

What is the need for this function? Usually this is a user setting in the application, and the length test can be done by the library before reaching the hyphenation code. That would mean the only use of this function is to inform the hyphenator that compound elements shorter than this should not be hyphenated. Right?

/**
 * Ignore extra dot at the end of word. This option is set when
 * the provider of words to be hyphenated cannot know whether a dot
 * at the end of a word is a part of that word. Default: false.
 */
virtual void setIgnoreDot(bool ignoreDot) = 0;

The provider of words to be hyphenated can never know whether a dot at the end of a word is part of the word or not - for that you need linguistic analysis of some sort, which again points to the linguistic component making the decision about possible hyphenation points.

I hope my comments make sense - please ask if not :)

Best regards,
Sjur N. Moshagen
Samediggi · Sametinget
Project Manager for the Divvun project
http://www.divvun.no/
http://www.samediggi.no/
+358-9-49 75 29 (w)
+358-505 634 319 (m)

On 24 Nov 2009 at 23:53, Harri Pitkänen wrote:

> I have drafted an interface for hyphenator implementations in libvoikko:
> 
> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/hyphenator/Hyphenator.hpp?view=markup
> 
> Currently this interface has no implementations. It consists of a single 
> method for hyphenating words and five methods for setting options that affect 
> subsequent calls to hyphenate method.
> 
> The reason why I want to get comments about this interface before implementing 
> it is that I know it is not flexible enough for languages that require 
> various non-standard hyphenation features. Apparently these are needed even 
> in some Nordic languages and we need to change the format of result strings 
> from Hyphenator::hyphenate to be able to support them. I'd like to know if 
> you have any suggestions on how to do that. If there is already an existing 
> format for describing such cases we could save some time by not re-inventing 
> the wheel here.
> 
> Support for non-standard hyphenation in libvoikko will require a change in the 
> external API as well and that cannot be done in the next release. But if we 
> can agree on the internal API first, it will be easier to change the external 
> API in libvoikko 3.0.
> 
> You may propose changes to hyphenator option methods as well. At least one of 
> them (setIntersectCompoundLevel) may seem weird to you. I am not aware of any 
> real-world use cases for it and could consider removing it if you feel that it 
> is not needed or too specific to the current implementation. The option is 
> currently available in the external API of libvoikko and when we once broke 
> it I did receive a bug report. So apparently someone was actually using it 
> for something. But if there is no publicly available source code 
> demonstrating its usefulness I could drop it in libvoikko 3.0.
> 
> Harri