[libvoikko] Proposed interface for hyphenator components

Mon Dec 7 02:55:46 EET 2009

If a word is very long and the column is narrow, a single word may have
to be broken across several lines. Unless every hyphenation point and
its side effects are explicitly linked to the input character sequence,
a language-independent device may be unable to insert multiple hyphens
into the string correctly.

To recapitulate the discussion:
 > We could make this more explicit by marking what needs to be
 > removed/inserted when hyphenating at given position: bus[/s-]skys[/s-]
 > sta[/-]sjo[/-]n[/-]en. But even that is not enough if there is a
 > language where hyphenating at some position requires changes to the
 > word in multiple places. For example if in some language we have a
 > word "abcdefg" that hyphenates as "ab-bcdeefg" and "abc-defg" there is
 > no way to describe the hyphenation using these formats.
 >
 > A more general alternative would be to return a list of strings
 > showing each possible hyphenation separately.
 >
 >> busskysstasjonen -> buss-7skyss-7sta-2sjo-2n-5en
 >>
 >> That is, better hyphenation can be provided if the different
 >> hyphenation points are weighted. The notation used is just an
 >> example, it could be something else.
 >
 > If we take all these issues into account, the result from hyphenate
 > method could be a List<HyphenatedWord> ...
 >
 > And for the example word the result would be (in pseudocode)
 > [("buss-skysstasjonen", 7),
 >  ("busskyss-stasjonen", 7),
 >  ("busskyssta-sjonen", 2),
 >  ("busskysstasjo-nen", 2),
 >  ("busskysstasjon-en", 5)]
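
To make the quoted list-of-candidates idea concrete, here is a minimal
Python sketch. The HyphenatedWord class and the best_break helper are
hypothetical names, not part of libvoikko, and the thread does not say
whether a larger weight means a better break; the sketch assumes it
does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HyphenatedWord:
    text: str    # the word with exactly one hyphen inserted
    weight: int  # ASSUMPTION: a larger weight means a better break


CANDIDATES: List[HyphenatedWord] = [
    HyphenatedWord("buss-skysstasjonen", 7),
    HyphenatedWord("busskyss-stasjonen", 7),
    HyphenatedWord("busskyssta-sjonen", 2),
    HyphenatedWord("busskysstasjo-nen", 2),
    HyphenatedWord("busskysstasjon-en", 5),
]


def best_break(candidates: List[HyphenatedWord],
               width: int) -> Optional[HyphenatedWord]:
    """Pick a candidate whose first part, hyphen included, fits in width.

    Prefers the higher weight, then the longer first part.
    Returns None when no candidate fits.
    """
    fitting = [c for c in candidates if c.text.index("-") + 1 <= width]
    if not fitting:
        return None
    return max(fitting, key=lambda c: (c.weight, c.text.index("-")))
```

Note that this form only handles one hyphen per word; breaking the word
twice would need a second round of candidates for the remainder.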

All of the above can be returned in a single weighted, transducer-like
character array in which a non-zero weight marks a possible hyphenation
point after the given character. Each row below gives the input
character, the output character and the weight; a blank input column
means the character appears only in the hyphenated form. This also
relieves the application of having to know which language is being
hyphenated and how to align the input string with the output string
when inserting multiple hyphens:

b b 0
u u 0
s s 0
s s 7
   s 0
k k 0
y y 0
s s 0
s s 7
   s 0
t t 0
a a 2
s s 0
j j 0
o o 2
n n 5
e e 0
n n 0
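
As a rough sketch of how an application might consume such an array,
the Python below stores each row as an (input, output, weight) tuple
and inserts any subset of hyphens without knowing the language. The
names Row, break_points and render are made up for illustration. The
weight 2 for the break in "busskyssta-sjonen" is attached to the "a"
row here, so that the hyphen falls after "sta" as in the quoted list.

```python
from typing import List, Set, Tuple

# (input char, output char, weight); "" in the input column marks a
# character that exists only in the hyphenated form.
Row = Tuple[str, str, int]

ROWS: List[Row] = [
    ("b", "b", 0), ("u", "u", 0), ("s", "s", 0), ("s", "s", 7), ("", "s", 0),
    ("k", "k", 0), ("y", "y", 0), ("s", "s", 0), ("s", "s", 7), ("", "s", 0),
    ("t", "t", 0), ("a", "a", 2), ("s", "s", 0), ("j", "j", 0), ("o", "o", 2),
    ("n", "n", 5), ("e", "e", 0), ("n", "n", 0),
]


def break_points(rows: List[Row]) -> List[Tuple[int, int]]:
    """Return (row index, weight) for every row that allows a hyphen."""
    return [(i, w) for i, (_, _, w) in enumerate(rows) if w > 0]


def render(rows: List[Row], chosen: Set[int]) -> str:
    """Insert a hyphen after each chosen row.

    Output-only characters are emitted only when they directly follow
    a chosen hyphen; otherwise they are skipped.
    """
    parts: List[str] = []
    after_hyphen = False
    for i, (inp, out, _) in enumerate(rows):
        if inp:
            parts.append(out)
            after_hyphen = False
        elif after_hyphen:
            parts.append(out)
        if i in chosen:
            parts.append("-")
            after_hyphen = True
    return "".join(parts)
```

For example, render(ROWS, {3}) gives "buss-skysstasjonen", and choosing
every index from break_points(ROWS) gives "buss-skyss-sta-sjo-n-en".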

There are also languages where more than one character can be added when
introducing a hyphenation point. This may look something like the
example below, if the intention is to indicate that something is added
both at the end of one syllable and at the beginning of the next:

a a 0
b b 0
   b 7
   c 0
c c 0
d d 0

Other effects on the spelling could also be modeled this way, but the
consolidated output assumes that the effects of hyphenation are always
local. That is, hyphenating at one point has no discontinuous side
effects further away in the string, and any changes to the input string
around an introduced hyphen belong to that hyphen: the first unchanged
character on either side of the hyphen ends the stretch that needs
modifying.

This also accommodates languages where characters disappear if the word
is hyphenated:

a a 0
b b 0
c c 7
d   0
e e 0
f f 0
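
The locality assumption can be turned into a small, language-independent
procedure. The Python sketch below is one possible reading of it, with
hypothetical names (Row, apply_breaks): a row counts as changed when its
input and output differ, and a chosen hyphen activates the unbroken run
of changed rows touching it on either side. It is applied to the
insertion and deletion examples above.

```python
from typing import List, Set, Tuple

# (input char, output char, weight); "" means the character is absent
# on that side.
Row = Tuple[str, str, int]


def apply_breaks(rows: List[Row], chosen: Set[int]) -> str:
    """Hyphenate after each chosen row, applying only the local changes."""
    active: Set[int] = set()
    for i in chosen:
        if rows[i][2] <= 0:
            raise ValueError(f"row {i} is not a hyphenation point")
        # Changed rows ending at the hyphen (left side, including row i).
        j = i
        while j >= 0 and rows[j][0] != rows[j][1]:
            active.add(j)
            j -= 1
        # Changed rows starting right after the hyphen (right side).
        j = i + 1
        while j < len(rows) and rows[j][0] != rows[j][1]:
            active.add(j)
            j += 1
    parts: List[str] = []
    for i, (inp, out, _) in enumerate(rows):
        # Rows not tied to a chosen hyphen keep their input spelling.
        parts.append(out if i in active else inp)
        if i in chosen:
            parts.append("-")
    return "".join(parts)


DOUBLING: List[Row] = [
    ("a", "a", 0), ("b", "b", 0), ("", "b", 7),
    ("", "c", 0), ("c", "c", 0), ("d", "d", 0),
]
DELETION: List[Row] = [
    ("a", "a", 0), ("b", "b", 0), ("c", "c", 7),
    ("d", "", 0), ("e", "e", 0), ("f", "f", 0),
]
```

With these tables, apply_breaks(DOUBLING, {2}) yields "abb-ccd" and
apply_breaks(DELETION, {2}) yields "abc-ef", while an empty set of
breaks returns the unmodified input in both cases.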

To fully model complex, overlapping, conflicting or non-local effects of
hyphenation, we may still need several outputs. (However, I am not
currently aware of a language with such complex or non-local hyphenation
effects.) In any case, we would still need to be able to relate each
output character to the original input character sequence in order to
consolidate the effects of multiple hyphenation points.

Krister