Orthographic transliteration

5.4 Orthographic transliteration. (T) The motivation for transliteration is to make uniform the many writing systems that have been invented or borrowed in the many languages of the world. Insofar as these are directly derived from the alphabetic system of the Romans, the goal may be considered to have been attained. But the actual situation is much less straightforward. The systematic evolution of some languages has made them incompatible with each other on many dimensions, so that any conformity in their writing systems is usually more apparent than real.

Figure 57 lists examples of four functions that transform strings containing symbols of one system to strings of symbols of another. Their designations, a letter T with a single subscript (as in T_C), may be misleading, as each of these is a family of functions, or even a family of families. The first example transforms each letter of a Roman alphabet to a corresponding letter of a Cyrillic alphabet. Many languages use a Cyrillic alphabet, and each modifies it to suit itself. The same is true of the Roman alphabet: many languages use diacritical marks, additional letters, and other modifications to adapt it to the needs of their own system. This means that in transliterating between alphabets it is important to know the languages of the input and output strings, or there is no way to choose the inverse function to transliterate back again. The example T_E, called “borrowing,” shows that even when the language is known, information is often lost in the transliteration process. In this example the tilde over the “a” is lost, and no inverse function can restore it unless that function is given an intimate knowledge of the words, and sometimes the phrases, of the language.
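The two points above, that the inverse depends on the language and that borrowing loses information, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables are tiny hypothetical fragments, not the functions of Figure 57 and not a complete standard, and the names ROMAN_TO_CYRILLIC, transliterate, invert, and borrow are invented here.

```python
import unicodedata

# Hypothetical fragments of per-language tables, for illustration only.
# The same Cyrillic letter "г" corresponds to Roman "g" in the Russian
# fragment and to Roman "h" in the Ukrainian fragment.
ROMAN_TO_CYRILLIC = {
    "ru": {"a": "а", "d": "д", "g": "г"},
    "uk": {"a": "а", "d": "д", "g": "ґ", "h": "г"},
}


def transliterate(text, table):
    """Map each symbol through the table; unknown symbols pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table):
    """Build the inverse table, refusing to do so if the table is lossy."""
    inverse = {}
    for src, dst in table.items():
        if dst in inverse:
            raise ValueError(f"{dst!r} has more than one source symbol")
        inverse[dst] = src
    return inverse


def borrow(text):
    """Drop diacritical marks, as a borrowing language often does."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(transliterate("da", ROMAN_TO_CYRILLIC["ru"]))
    print(invert(ROMAN_TO_CYRILLIC["ru"])["г"])
    print(invert(ROMAN_TO_CYRILLIC["uk"])["г"])
    print(borrow("São"), borrow("Sao"))
```

Because borrow sends both "São" and "Sao" to the same string, no function of the output alone can tell which one was the input; restoring the tilde requires knowledge of the word itself.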

A third example represents the process of changing a syllabary. Whereas transliteration proper works on alphabets, this function works on syllabaries. Where the two syllabaries belong to the same language, it is very much like changing upper-case letters to lower-case letters. There is no example here, but tables have also been put together to mediate a transformation from these particular syllabaries to a Roman alphabet. Even then, the structure of the language motivates the use of at least one and sometimes more diacritical marks.
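As a sketch of how close a syllabary change within one language is to a case change, assume for illustration that the two syllabaries are Japanese hiragana and katakana (this passage does not name them). In Unicode the hiragana block U+3041 to U+3096 lines up with the katakana block U+30A1 to U+30F6 at a fixed offset of 0x60, so the conversion is a simple shift. The function names are hypothetical.

```python
# Assumption: the two syllabaries are hiragana and katakana.
KANA_OFFSET = 0x60  # distance between the two Unicode blocks


def hiragana_to_katakana(text):
    """Shift each hiragana symbol to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Shift each katakana symbol back to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


if __name__ == "__main__":
    word = "ひらがな"
    converted = hiragana_to_katakana(word)
    print(converted)
    print(katakana_to_hiragana(converted) == word)
```

On text written purely in one syllabary the round trip is exact; on text that mixes the two, the distinction is lost in the same way that case is lost when mixed-case text is forced to upper case.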

Even the rendering of letters of the alphabet in their case forms in English (and some other languages) is an example of transliteration. As with the others, information carried by case may easily be lost, or be impossible to restore without a good knowledge of the words and phrases of the language. In certain names, like DeSpain, it is impossible to restore the mixed case after the name has been put into upper case and merged with other case variations: deSpain, Despain, and even dEspain, where an apostrophe of the original alphabet has been lost. As long as the case variations are variants in the same name group, the bucketing problem is not exacerbated. The convention in genealogy of rendering surnames in all-caps has not helped to preserve the case of the records.
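A minimal sketch of the DeSpain example: upper-casing sends all four variants to one string, so a bucket keyed on the upper-case form holds them together but can no longer say which form a given record used. The helper name bucket_by_upper is hypothetical.

```python
from collections import defaultdict

VARIANTS = ["DeSpain", "deSpain", "Despain", "dEspain"]


def bucket_by_upper(names):
    """Group names under their all-caps rendering."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name.upper()].append(name)
    return dict(buckets)


if __name__ == "__main__":
    buckets = bucket_by_upper(VARIANTS)
    print(buckets)
    # One key, four originals: the all-caps form alone cannot recover them.
    print(len(buckets), len(buckets["DESPAIN"]))
```

The bucket is harmless for grouping, which is the point made above about the same name group, but any record stored only under the all-caps key has lost its original case for good.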

The distance function must be designed to make transcription as transparent as its ubiquitous nature suggests it should be. In other words, no transcribed version should be measured to be very far from the original. The presence of so many stylistic variants in the records has led many data managers to adopt the philosophy that case is unimportant and cannot be preserved in practice. Yet, the use of algorithms, such as VIEWEX, suggests that the preservation of case cannot but improve the efficiency of any matching algorithm.
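One way to meet both requirements, keeping case variants close while still preserving case, is an edit distance in which a case-only substitution costs much less than any other edit. The sketch below is an assumption made for illustration: it is a weighted Levenshtein distance, not the distance function of this work and not VIEWEX, and the cost of 0.1 and the function names are arbitrary choices.

```python
def substitution_cost(a, b, case_cost=0.1):
    """0 for identical symbols, a small cost for a case-only difference."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return case_cost
    return 1.0


def distance(s, t, case_cost=0.1):
    """Levenshtein distance with cheap case-only substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [float(i)]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1.0,       # delete a from s
                curr[j - 1] + 1.0,   # insert b into s
                prev[j - 1] + substitution_cost(a, b, case_cost),
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(distance("DeSpain", "DESPAIN"))  # case differences only
    print(distance("DeSpain", "Despair"))  # one case change, one real change
```

With these costs the all-caps rendering stays within half an edit of the original, while a genuinely different name is more than a full edit away, so case is neither ignored nor allowed to push variants of one name apart.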
