Section 4: CODING NAME VARIANTS


Sources are often unreliable because many kinds of error interfere with the production of an accurate record. Almost all errors in a source appear to be related to fallibility in 1) the clerk creating the source, 2) the informant providing the information to the clerk, or 3) the interpreter extracting information from the source. Of course, all three may be involved at once. Further, the reasons the source was produced, its history, and the relative persistence of the media involved can also help explain why errors occur. Concerned analysts have tried various algorithms and methodologies to remedy one or more of these errors. These methodologies fall into three general classes: 1) name coding, 2) table look-up, and 3) string comparators. It is unlikely that any one of them could eliminate the effects of all the kinds of errors that may be evident in a source.

We consider first several attempts that have been made to bring name variants together by establishing certain rules. The idea is to treat certain letters or letter combinations that appear in the different spellings as equivalent. The trick is to choose which letters are best treated the same and which are better to keep distinct. In long words this may not be as difficult. For example, suppose there are the following variant spellings:

Edinburgh, Edinburg, Edinbourg, Edinborg, Edenburg, Edinboro, Edinborough

One rule could say that e and i are the same letter. Another might say that o, u, and ou are the same letter. A third might say that -oro and -orough are the same ending. This last statement is rather specific. It is therefore weak, possibly too weak, whereas the other two are rather strong, possibly too strong. The first two might have unwanted effects by bringing together the same letters in other names where they are not confused. To weaken such statements, the linguist may want to make them depend on the environment. One could say that the first statement is true only between dental sounds like d and n. One could say that the second statement is true only after a b and before an r+g at the end of the word. In this way the linguist might arrive at a couple of statements that also hold for the alternate spellings of a number of other names. Rules stating that certain symbols are equivalent in certain stated environments are called rewrite rules.
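The three environment-dependent statements above can be sketched as rewrite rules in Python. This is only an illustration under stated assumptions: the regular expressions, the choice of "i", "u", and "oro" as the canonical forms, and the names REWRITE_RULES, apply_rules, and group_variants are all hypothetical and are not taken from any of the coding systems discussed in this section.

```python
import re

# Each rule is (pattern, replacement). Lookbehind and lookahead
# assertions express the environment in which the rule applies.
# The canonical forms chosen here are arbitrary assumptions.
REWRITE_RULES = [
    # e and i are equivalent between d and n
    (re.compile(r"(?<=d)e(?=n)"), "i"),
    # o, u, and ou are equivalent after b and before word-final rg
    (re.compile(r"(?<=b)(?:ou|o|u)(?=rg$)"), "u"),
    # -oro and -orough are the same ending
    (re.compile(r"orough$"), "oro"),
]


def apply_rules(name):
    """Return a lowercase key with every rewrite rule applied in order."""
    key = name.lower()
    for pattern, replacement in REWRITE_RULES:
        key = pattern.sub(replacement, key)
    return key


def group_variants(names):
    """Group spellings that share the same rewritten key."""
    groups = {}
    for name in names:
        groups.setdefault(apply_rules(name), []).append(name)
    return groups


variants = ["Edinburgh", "Edinburg", "Edinbourg", "Edinborg",
            "Edenburg", "Edinboro", "Edinborough"]
for key, members in group_variants(variants).items():
    print(key, members)
```

With these three rules alone the seven spellings fall into three groups rather than one: Edinburgh stays by itself (no rule relates -gh to -g), Edinburg, Edinbourg, Edinborg, and Edenburg share the key "edinburg", and Edinboro and Edinborough share "edinboro". Merging the remaining groups would take further, possibly stronger, rules, which is the trade-off described above.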

In the following sections we outline the details of three code-producing systems. The most popular system is the Russell Soundex system. A second, more elaborate system is the New York State Identification and Intelligence System (NYSIIS). These two coding systems seem to have been accepted by many as an academic standard. A third system is the Henry Code. This set of rules is very much like NYSIIS, but is designed for French names.

The Soundex rule set is extremely simple, containing about a dozen rules. Vowel letters are used to help determine which consonant letters are to be coded, and are then dropped. Consonant letters fall into six codes. Yet the resulting groups have not been efficient enough for many users. Hence came NYSIIS, followed by a couple of newer versions, each working better for names that appear in the official records kept by New York State. An enhanced NYSIIS is the most popular version. At some point in its processing, this system neutralizes all vowel letters to “A” and reduces many consonant letter clusters to single general consonant letters.
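As a rough illustration of the Soundex idea described above, the following Python sketch implements the widely used American Soundex convention: keep the first letter, code the remaining consonants into six digit classes, let vowels separate repeated codes before being dropped, and pad or truncate to one letter plus three digits. The exact rule set intended in this section may differ in detail, so the table CODES and the function soundex are assumptions made for illustration only.

```python
# Six consonant classes of the common American Soundex convention.
# Letters not listed (A, E, I, O, U, Y, H, W) carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a four-character code: first letter plus three digits."""
    # Keep only the letters A-Z; other characters are ignored.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate two consonants of the same class.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # A vowel resets prev to "", so a repeated class after a
        # vowel is coded again.
        prev = code
    return (first + "".join(digits) + "000")[:4]


for name in ["Edinburgh", "Edenburg", "Edinborough", "Robert", "Rupert"]:
    print(name, soundex(name))
```

Under this convention all seven Edinburgh spellings listed earlier receive the code E351, and Robert and Rupert both receive R163, which shows both how readily the code brings variants together and how coarse the resulting groups can be.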

After some experience with the deficiencies of these codes, it becomes clear that a different set of rules would be appropriate for each separate language. The Family and Church History department has supported the development of a number of rule sets built according to this principle. These are the various rule sets of the Phondex system.