bes island

Denis Popov’s website


September 13, 2010, 11:08 (edited February 25, 2014, 21:20)

I’ve set up automatic word hyphenation on my website. The plugin, inserted between generating the response and sending it, takes the contents of every XHTML element and, if it is character data, inserts soft-hyphen characters at the places appropriate for a line break.
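The post does not show the plugin’s code, but the idea can be sketched roughly as a tree walk that runs every piece of character data through a hyphenator. The element names and the toy hyphenate stub below are mine, not the actual plugin; the stub simply puts a soft hyphen between adjacent lowercase letters so the effect is visible:

```python
import re
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # the soft hyphen character

def hyphenate(text: str) -> str:
    # Toy stand-in: a soft hyphen between every two adjacent lowercase
    # letters. The real rule-based hyphenator comes later in the post.
    return re.sub(r"(?<=[a-z])(?=[a-z])", SHY, text)

def hyphenate_tree(elem: ET.Element) -> None:
    """Pass all character data (element text and tails) through the hyphenator."""
    if elem.text:
        elem.text = hyphenate(elem.text)
    for child in elem:
        hyphenate_tree(child)
        if child.tail:
            child.tail = hyphenate(child.tail)

root = ET.fromstring("<p>Some <em>character</em> data</p>")
hyphenate_tree(root)
```

A real plugin would presumably skip certain elements, such as title, which, as noted below, was exactly where soft hyphens caused trouble.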

I have not tested it in every popular browser yet; hopefully it will not cause problems nowadays. I remember that several years ago there was a bunch of problems, for example, when the contents of the title element (i.e. the page title) were soft-hyphenated. Well, let’s see.

Update: unfortunately, the problems were still there, so I’ve removed the automatic hyphenation from my website after all.

As for the hyphenation algorithm, I considered the Knuth-Liang algorithm but discarded the idea. It seems overcomplicated for such a simple task: it is difficult to understand, difficult to turn into a working program, and it requires a lot of prepared data. Yes, it produces fine results and may be considered an industry standard. Still, I do not share Liang’s extremely strict opinion on what correct and incorrect hyphenation is. Times have changed, and experience shows that almost any hyphenation is comfortable for a reader. The rules tend to become simpler, and that’s fine.

There is nothing wrong with breaking “selfadjoint” after “l”, or “Reagan” after “e”, or “homeowners” after “ho”, all of which are Liang’s examples of erroneous hyphenation in the introduction to his PhD thesis. “Exacting” with the hyphen after “ex”, “coincidence” with the hyphen after “coin”, or “legends” with the hyphen after “leg” (examples from another article on the subject) might seem odd, but only when written within a single line of text. If a line break occurs in such a place, no one will read these words as “ex acting”, “coin cidence” or “leg ends”. A human reader will understand the word correctly, if only because of the context.

To make a long story short, I prefer the simple rule-based approach. Liang points out:

The original TeX hyphenation algorithm was designed by Prof. Knuth and the author in the summer of 1977. It is essentially a rule-based algorithm, with three main types of rules: (1) suffix removal, (2) prefix removal, and (3) vowel-consonant-consonant-vowel (vccv) breaking. The latter rule states that when the pattern ‘vowel-consonant-consonant-vowel’ appears in a word, we can in most cases split between the consonants. There are also many special case rules; for example, “break vowel-q” or “break after ck”. Finally a small exception dictionary (about 300 words) is used to handle particularly objectionable errors made by the above rules, and to hyphenate certain common words (e.g. pro-gram) that are not split by the rules.

I am making things even simpler. The vccv rule is good; the other rules mentioned are excessive. They would require carrying a dictionary of suffixes, a dictionary of prefixes, a list of special-case rules, and a list of exceptions, and there is no real need for any of this.

I have thought out the following set of rules:

1. A soft hyphen is inserted between two vowels (v-v).

2. If a vowel is followed by a consonant followed by a vowel, a soft hyphen is inserted after the first vowel (v-cv).

3. In a vowel-consonant-consonant-vowel pattern, a soft hyphen is inserted between the consonants (vc-cv).

4. A soft hyphen is never inserted after the first letter of the word or before the last letter of the word.

Basically, hyphenation deals with syllables, and a vowel constitutes a syllable nucleus, so between two vowels we may insert a hyphen. This immediately leads to the 1st rule. The 2nd and 3rd rules are empirical. As for the situation where more than two consonants appear between two vowels, it is complicated, and for the sake of simplicity we skip it, inserting no hyphens there at all. The 4th rule is intended for better readability.
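These four rules can be sketched directly in code. The following is a minimal illustration, assuming a plain English vowel set (including “y”); the post does not give the plugin’s actual letter data:

```python
SHY = "\u00ad"          # the soft hyphen character
VOWELS = set("aeiouy")  # assumed vowel set; an illustration, not the plugin's data

def break_points(word: str) -> list[int]:
    """Indices after which a soft hyphen may be inserted."""
    v = [c in VOWELS for c in word.lower()]
    points = []
    # Rule 4: never after the first letter (i == 0) and never before
    # the last letter, so i runs from 1 to len(word) - 3 inclusive.
    for i in range(1, len(word) - 2):
        if v[i] and v[i + 1]:
            points.append(i)                                   # rule 1: v-v
        elif v[i] and not v[i + 1] and v[i + 2]:
            points.append(i)                                   # rule 2: v-cv
        elif v[i - 1] and not v[i] and not v[i + 1] and v[i + 2]:
            points.append(i)                                   # rule 3: vc-cv
    return points

def hyphenate_word(word: str) -> str:
    pts = set(break_points(word))
    return "".join(c + (SHY if i in pts else "") for i, c in enumerate(word))
```

For example, hyphenate_word("hyphenation") yields “hyp·he·na·ti·on” (with soft hyphens at the marked spots), while a three-letter word such as “via” is left untouched by the 4th rule.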

We do not need any dictionary data except for which letters are considered vowels (every other letter may quite safely be considered a consonant, though it is not always one). That makes the algorithm extremely simple to implement. First, we extract every contiguous run of letters. (Strictly speaking, this requires another piece of dictionary data, the list of letters, but it is easily obtained with the PCRE \p{L} sequence or similar means.) Such a run is considered a word for the purposes of hyphenation. If its length is less than or equal to 3 characters, we leave it untouched (because of the 4th rule); otherwise we insert soft hyphens according to the four rules.
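The extraction step can be sketched as follows. Python’s built-in re module has no \p{L}, so the equivalent character class [^\W\d_] stands in for it here; the sample words are mine:

```python
import re

def letter_runs(text: str) -> list[str]:
    # [^\W\d_] matches exactly the Unicode letters, mirroring PCRE's \p{L}+.
    return [m.group(0) for m in re.finditer(r"[^\W\d_]+", text)]

words = letter_runs("Ещё one small état test!")
# Runs of 3 letters or fewer are left untouched (the 4th rule);
# only the longer ones go on to the rule-based hyphenator.
to_hyphenate = [w for w in words if len(w) > 3]
```

Note that the regex-based extraction handles Cyrillic and accented letters the same way as ASCII ones, which matters on a bilingual site.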

As for the results, they are good. Not as good as Liang’s, but still good.
