[libvoikko] Proposed interface for hyphenator components

Tue Dec 1 19:40:41 EET 2009

Thanks for your comments!

On Monday 30 November 2009, Sjur Moshagen wrote:
> Thanks for drafting the interface. I see in the svn log that you have coded
>  a bit since you wrote this, so I hope my feedback isn't coming too late.

No, it's not too late. I had some extra time to work on libvoikko and decided 
to implement the proposed interface just to see that it is actually sufficient 
for our current requirements. Changes can still be made.

> Linguistics and language independence
> -------------------------------------
> For the library to be truly language independent, the functions need to be
>  as well. This means that all language-dependent behaviour should be moved
>  to the linguistic components below/behind the library, ie Finnish
>  behaviour should be in the Finnish component, Norwegian in the Norwegian
>  component, Sámi in the Sámi one, etc.

Yes, this is the goal. It will just take some time to separate everything.

> Graded hyphenation points
> -------------------------
> 9 - hard hyphen (DNA-sekvens)
> 8 - manually inserted soft hyphen
> 7 - word boundary (eplehuset -> eple-huset)
> 5 - other morphological boundary (e.g. between stem and inflectional ending:
>  hus-et)
> 2 - other hyphenation points (eple -> ep-le)

I think this is a good idea. We have actually considered implementing 
something like this before (SourceForge bug #1647744), but I have not yet 
found time for it. I'll add this to the API for libvoikko 3.0.

We could also add an option to the OpenOffice.org extension that would allow 
the user to select the minimum level used in automatic hyphenation.
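
As a rough sketch of how an application might apply such a minimum level, 
assuming the grades from the scale quoted above: GradedPoint and 
allowedBreaks are hypothetical names used only for illustration, they are 
not part of the libvoikko API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type, not part of libvoikko: one possible hyphenation point.
struct GradedPoint {
    std::size_t position; // index of the character that would start the next line
    int grade;            // grade on the proposed scale (2 = weakest, 9 = hard hyphen)
};

// Return the positions of all points whose grade is at least minGrade.
std::vector<std::size_t> allowedBreaks(const std::vector<GradedPoint> & points,
                                       int minGrade) {
    assert(minGrade >= 0);
    std::vector<std::size_t> result;
    for (const GradedPoint & p : points) {
        if (p.grade >= minGrade) {
            result.push_back(p.position);
        }
    }
    return result;
}
```

For "eplehuset" with points ep-le (grade 2), eple-huset (grade 7) and hus-et 
(grade 5), a minimum level of 5 would leave only the last two.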

> Context
> -------
> In some cases the correct hyphenation pattern can only be determined after
>  disambiguation. For this to happen, the hyphenator would need the
>  syntactic context, at least the full sentence.

Are there any backend implementations or applications that support this yet? I 
think this can be added, but perhaps as a new interface that would be added 
later. Since this is a new feature that does not change existing 
functionality, it can be added any time in a minor release after 3.0 once 
there is a proof of concept implementation for some language.

> "eplehuset"
> "  - -  - "
> 
> I assume this interface is based on the present malaga implementation of
>  the Finnish hyphenation, but please correct me if I'm wrong.
> 
> There are a couple of issues with this. The first is the assumption that
>  only inserting hyphens or replacing single characters with hyphens is
>  enough to achieve correct hyphenation. This assumption is wrong for many
>  languages. In general, I think that the only language independent way to
>  return a hyphenated string, is to return the string itself with the proper
>  hyphenation points inserted. In Swedish and Norwegian, a double-consonant
>  sequence will turn into a triple-consonant sequence when hyphenated:
> 
> busskysstasjonen -> buss-skyss-sta-sjo-n-en
> 
> This is straightforward to implement in a transducer, but is not possible
>  to represent in the above datastructure.

I'm afraid this will not work in general. In fact the result is ambiguous even 
in the example above. Is it correct to hyphenate the word as 
"buss-skysstasjonen" or "buss-skyssstasjonen"? The first is correct, but 
unless I am missing something, the result could be interpreted either way.

We could make this more explicit by marking what needs to be removed/inserted 
when hyphenating at given position: bus[/s-]skys[/s-]sta[/-]sjo[/-]n[/-]en
But even that is not enough if there is a language where hyphenating at some 
position requires changes to the word in multiple places. For example if in 
some language we have a word "abcdefg" that hyphenates as "ab-bcdeefg" and 
"abc-defg" there is no way to describe the hyphenation using these formats.

A more general alternative would be to return a list of strings showing each 
possible hyphenation separately.
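
For the cases the bracketed notation does cover, it can be expanded into 
such a list mechanically. The sketch below is only an illustration: 
expandMarked is a hypothetical helper, and it assumes each bracket is read 
as [text used when not hyphenating here/text used when hyphenating here]. 
It uses plain std::string for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type: one bracket found in the marked-up word.
struct MarkedPoint {
    std::size_t offset;       // where the bracket's "plain" text starts in the plain word
    std::size_t removeLength; // length of the text dropped when hyphenating here
    std::string insert;       // text used instead, including the hyphen
};

// Expand e.g. "bus[/s-]skys[/s-]sta" into one string per hyphenation point.
std::vector<std::string> expandMarked(const std::string & marked) {
    std::string plain; // the word as written without any hyphenation
    std::vector<MarkedPoint> points;
    std::size_t i = 0;
    while (i < marked.size()) {
        if (marked[i] != '[') {
            plain += marked[i++];
            continue;
        }
        const std::size_t slash = marked.find('/', i);
        const std::size_t close = marked.find(']', i);
        assert(slash != std::string::npos && close != std::string::npos);
        assert(slash < close);
        const std::string remove = marked.substr(i + 1, slash - i - 1);
        const std::string insert = marked.substr(slash + 1, close - slash - 1);
        points.push_back(MarkedPoint{plain.size(), remove.size(), insert});
        plain += remove; // the unhyphenated word keeps this text
        i = close + 1;
    }
    std::vector<std::string> result;
    for (const MarkedPoint & p : points) {
        result.push_back(plain.substr(0, p.offset) + p.insert +
                         plain.substr(p.offset + p.removeLength));
    }
    return result;
}
```

Each bracket only changes the text at its own position, so the "abcdefg" 
case above, where one break changes the word in two places, still cannot be 
expressed this way.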

> busskysstasjonen -> buss-7skyss-7sta-2sjo-2n-5en
> 
> That is, better hyphenation can be provided if the different hyphenation
>  points are weighted. The notation used is just an example, it could be
>  something else.

If we take all these issues into account, the result of the hyphenate method 
could be a List<HyphenatedWord>, where HyphenatedWord is something like

struct HyphenatedWord {
  const wchar_t * word;  // the word hyphenated at exactly one position
  int weight;            // grade of that hyphenation point
};

And for the example word the result would be (in pseudocode)
[("buss-skysstasjonen", 7),
 ("busskyss-stasjonen", 7),
 ("busskyssta-sjonen", 2),
 ("busskysstasjo-nen", 2),
 ("busskysstasjon-en", 5)]
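
To show how an application could consume such a list, here is a minimal 
sketch. It assumes a variant of the struct that stores the word in a 
std::wstring and that every candidate contains exactly one '-' marking the 
break; bestBreak is a hypothetical function, not an existing libvoikko call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant of the struct, using std::wstring for storage.
struct HyphenatedWord {
    std::wstring word; // the word with exactly one '-' at the break (assumption)
    int weight;        // grade of that break, higher is better
};

// Pick the best candidate whose first part, hyphen included, fits in
// maxChars characters. Returns nullptr if no candidate fits.
const HyphenatedWord * bestBreak(const std::vector<HyphenatedWord> & candidates,
                                 std::size_t maxChars) {
    const HyphenatedWord * best = nullptr;
    std::size_t bestLength = 0;
    for (const HyphenatedWord & c : candidates) {
        const std::size_t hyphen = c.word.find(L'-');
        if (hyphen == std::wstring::npos) {
            continue; // malformed candidate, skip it
        }
        const std::size_t length = hyphen + 1; // characters left on the line
        if (length > maxChars) {
            continue;
        }
        // Prefer the higher weight; on a tie, prefer the fuller line.
        if (best == nullptr || c.weight > best->weight ||
            (c.weight == best->weight && length > bestLength)) {
            best = &c;
            bestLength = length;
        }
    }
    return best;
}
```

With the example list and 12 characters left on the line, this would pick 
"busskyss-stasjonen": both grade 7 candidates fit, and it fills more of the 
line. An application that cannot take line width into account could instead 
just filter the list by weight.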

> virtual void setUglyHyphenation(bool uglyHyphenation) = 0;
> 
> Could you exemplify a bit what kind of applications you have in mind? In my
>  view, ugliness is a relative thing, and the cost of allowing "ugly"
>  hyphenation is related to the layout requirements - in a narrow column you
>  may want to allow uglier hyphenation than with broader or no columns.

At least in Finnish there are certain syllables that must never be split off 
onto their own line, for example syllables consisting of a single vowel. These 
hyphenation points are totally forbidden in a text processor such as 
OpenOffice.org, but they must be present if you are writing words in fully 
hyphenated form (such as in text books for young children).

> If the scaled gradation of hyphenation points is implemented, I suggest
>  that the scale would replace this function call.

Yes, we can do that. We just have to standardise the scale values so that text 
processing applications know to skip the always forbidden positions by 
default. Some applications (at least OpenOffice.org) are not able to take 
column widths into account so they can only work with fixed settings.

> /**
>  * Hyphenate unknown words. Default: true
>  */
> virtual void setHyphenateUnknown(bool hyphenateUnknown) = 0;
> 
> Is this related to some user setting? It should be up to the user to
>  determine this behaviour, but the default is ok.

This setting is part of the external API of libvoikko and it is made available 
to the users of OpenOffice.org through openoffice.org-voikko.

> virtual void setIntersectCompoundLevel(int level) = 0;
>
> Does this option as well correspond to a user choice? Or am I missing
>  something here?

This setting is part of the external API of libvoikko but it is not used in 
any real world application that I know of. I think we can drop it in libvoikko 
3.0 unless someone objects (and provides good justification for leaving it 
in).

> virtual void setMinHyphenatedWordLength(int length) = 0;
> 
> What is the need for this function? Usually this is a user setting in the
>  application, and the length test can be done by the library before
>  reaching the hyphenation code. Which means that the use of this function
>  is to inform the hyphenator that compound elements shorter than this
>  should not be hyphenated. Right?

Exactly. Users of openoffice.org-voikko can use a boolean option for 
controlling whether the minimum length setting in OOo is applied only to the 
whole word or also to individual compound elements.

In fact, if your proposal of graded hyphenation points is implemented, we will 
not need this option either, as long as there is a fixed weight for compound 
word boundaries (7 in your example). Applications can then look for segments 
separated by hyphenation points with grade 7 or higher and decide to ignore 
the other hyphenation points within a segment if that segment is too short.
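
A minimal sketch of that application-side filtering, assuming grade 7 is the 
fixed weight for compound boundaries and that the points are sorted by 
position. GradedPoint, COMPOUND_GRADE and dropPointsInShortElements are 
hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type, not part of libvoikko: one possible hyphenation point.
struct GradedPoint {
    std::size_t position; // index of the character that would start the next line
    int grade;            // grade on the proposed scale
};

// Assumed fixed grade for compound word boundaries.
const int COMPOUND_GRADE = 7;

// Drop points that lie inside a compound element shorter than minLength.
// Points must be sorted by strictly increasing position.
std::vector<GradedPoint> dropPointsInShortElements(
        const std::vector<GradedPoint> & points,
        std::size_t wordLength,
        std::size_t minLength) {
    // Element boundaries: word start, each point graded 7 or higher, word end.
    std::vector<std::size_t> bounds;
    bounds.push_back(0);
    for (const GradedPoint & p : points) {
        assert(p.position > 0 && p.position < wordLength);
        if (p.grade >= COMPOUND_GRADE) {
            bounds.push_back(p.position);
        }
    }
    bounds.push_back(wordLength);

    std::vector<GradedPoint> kept;
    std::size_t element = 0; // index in bounds of the current element's start
    for (const GradedPoint & p : points) {
        // Advance to the element that contains (or starts at) this point.
        while (bounds[element + 1] <= p.position) {
            ++element;
        }
        const std::size_t elementLength = bounds[element + 1] - bounds[element];
        if (p.grade >= COMPOUND_GRADE || elementLength >= minLength) {
            kept.push_back(p);
        }
    }
    return kept;
}
```

For "eplehuset" with a minimum length of 5, the point in ep-le is dropped 
because "eple" has only four letters, while eple-huset and hus-et are kept.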

So let's just drop this option.

> virtual void setIgnoreDot(bool ignoreDot) = 0;
> 
> The provider of words to be hyphenated can never know whether a dot at the
>  end of a word is part of the word or not - for that you need linguistic
>  analysis of some sort, which again points to the linguistic component
>  making the decision about possible hyphenation points.

It is possible that the source of words to be hyphenated is a manually 
collected list of words and in that case we know that any dot in the word must 
be part of that word. In most real world applications this is not the case but 
it can be useful for testing and some special situations.

I would like to keep this option but the default value could be changed to 
true if this is what most applications need. That way the backend 
implementation could simply ignore the option if it is too complicated to 
implement it and the loss of functionality would be minimal.

> I hope my comments make sense - please ask if not :)

This was exactly what I was looking for.

Currently I expect that the next version of libvoikko (2.3) will be released 
in late January. No external API changes will be made in that release and that 
makes some of the proposed changes somewhat complicated (but not impossible) 
to implement. Once 2.3 is released I will break the external API compatibility 
and then it will be easy to have all these changes implemented.

Do you have working hyphenator code that you would like to integrate into 
libvoikko in the next few weeks? If so, we could create a branch in SVN for 
2.3 and start working on these changes quite soon. But if there is no code 
available yet, I'd like to wait until 2.3 is done.

Harri