Section 6: Name Spelling


An Automatic Name Grouping Process for Name Authority

[20 Jan 2005]

Central to standardizing personal names is the establishment of a lexicon containing a pool of given names and a pool of surnames. The ODM standard catalogs are examples of such lexicons for various regions and cultures of the world. The stated goal is to become less dependent on culturally trained and experienced name experts when deciding whether a particular spelling is a variant of some other spelling, i.e., whether the two are variants of the same name. Previous designs have been proposed to help guide the expert in making the final decision. This brief outlines the elements required and gives a direction toward building a model for a process that makes the decision in an automated fashion.

The process envisioned is a variety of machine learning. As with all such schemes, it is necessary to provide the machine with a set of acceptable name groups that are known to be effective in practice. The machine is also given a set of features of the spellings, from which it may infer which features are relevant to the grouping decision and to what degree. The machine is then trained with these true groups, from which it establishes values for scoring a comparison. These scores are combined and a threshold is established so that, on any new comparison presented, it imitates its own performance on the training set to the desired degree of accuracy. This process requires: 1) a training set, 2) a set of features, 3) a support vector machine (SVM), i.e., a special kind of learning machine that is able to do the statistical calculations required while minimizing the risk that it will be trained incorrectly.
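The train-then-threshold loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the brief does not name a kernel or a solver, so the sketch assumes a linear soft-margin SVM fitted by plain sub-gradient descent on the hinge loss. The function names, the hyperparameters (lam, lr, epochs), and the six training vectors (spelling similarity, phonetic similarity) are invented for the example and do not come from any catalog.

```python
# Minimal linear soft-margin SVM, trained by full-batch sub-gradient descent.
# Assumption: linear kernel and hinge loss; all data below is made up.

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=10000):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b)))."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                 # hinge loss is active for this pair
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (same name group) or -1 (different groups)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical comparisons: [spelling similarity, phonetic similarity]
X_TRAIN = [[0.95, 1.0], [0.90, 0.9], [0.85, 1.0],   # pairs judged same group
           [0.30, 0.0], [0.40, 0.2], [0.50, 0.1]]   # pairs judged different
Y_TRAIN = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_TRAIN, Y_TRAIN)
print("weights:", w, "bias:", b)
print("new comparison [0.9, 0.95] ->", predict(w, b, [0.9, 0.95]))
```

The learned weights play the role of the "values for scoring a comparison", and the bias acts as the threshold. A production system would more likely rely on an established SVM library than on this hand-rolled loop.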

The data used for name grouping needs to conform to a model of the following form: $f : \mathbb{R}^N \to \{\pm 1\}$. This means that the function takes as input a vector of N real numbers characterizing the two name spellings to be compared and tells us whether the spellings are in the same group (yes/no, i.e., ±1). Each training example is an ordered pair: the first element, $x_i$, holds the feature values and the second element, $y_i$, holds the decision. There are in all $l$ pairs in the training set. It is convenient to express the $x_i$ as a vector x and the $y_i$ as a (Boolean) variable y, i.e., (x, y), so that the learned function also gives f(x) = y for examples not in the training set.
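As an illustration of the (x, y) form, the sketch below turns known name groups into labeled spelling pairs: +1 when both spellings come from the same group, -1 otherwise. The two groups shown are invented for the example and are not taken from any catalog; labeled_pairs is a hypothetical helper name. Each spelling pair would still have to be mapped to a feature vector before training.

```python
from itertools import combinations

def labeled_pairs(groups):
    """Build ((spelling_a, spelling_b), y) examples from known name groups.

    y is +1 when both spellings belong to the same group, -1 otherwise.
    """
    tagged = [(spelling, group_id)
              for group_id, spellings in groups.items()
              for spelling in spellings]
    pairs = []
    for (a, group_a), (b, group_b) in combinations(tagged, 2):
        pairs.append(((a, b), 1 if group_a == group_b else -1))
    return pairs


# Hypothetical groups, for illustration only.
GROUPS = {
    "SMITH": ["SMITH", "SMYTH", "SMYTHE"],
    "SCHMIDT": ["SCHMIDT", "SCHMITT"],
}

pairs = labeled_pairs(GROUPS)
for (a, b), y in pairs:
    print(f"{a:8s} {b:8s} {y:+d}")
```

With five spellings this produces ten pairs, four of them positive. On a real catalog the negative pairs would vastly outnumber the positive ones, so some sampling of negatives would probably be needed.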

1.  Training set. For the training set it would seem quite appropriate to use pairs drawn from one of the ODM standard name catalogs. The cultures are presently synonymous with the world regions. Name buckets are classified by sex (male/female) and by name type (given name/surname), and are ranked by size. The pairs could be selected and cleaned manually as needed.

2.  Features. The model requires that the features of a spelling be reduced to a real number. Here are some of the features that the name expert uses in making a judgement as to name group membership:

a)  Actual spelling. This is a concatenation of alphabetic letters, sometimes with case distinctions (case is not available in the training set). Two spellings may be compared directly with an algorithm (Jaro-Winkler) that calculates a similarity score in the manner of an edit distance, i.e., a real number between 0 and 1 expressing how near the two representations are to having the same spelling, with 1 meaning identical.

b)  Phonetic spelling. This is a concatenation of alphabetic letters and a few other symbols representing the way the spelling is pronounced in a particular language. Some of the cultures have a set of production rules (Phondex®) that will generate a distinctive spelling meeting these requirements to a greater or lesser extent. The comparison yields one when the phonetic spellings are identical and a value less than one otherwise: either zero (no agreement) or an edit-distance score, which here would serve as a phonetic distance.

c)  Frequency. This is how often the spelling is used in the general population. The training data records this number for the region covered by the catalog. The more common spellings tend to be in separate name groups, whereas very unusual spellings tend to belong to the groups of more popular names.

d)  Nickname/Abbreviation. This is a special matching form occurring with given names that does not score well with either (a) or (b) and possibly not (c). There is an algorithm that will generate potential abbreviated and pet forms from a standard given name by rule. These can be associated with each standard name. The comparison would favor full agreement with one of these forms to the complete exclusion of (a) and (b).

e)  Parsing class. This is the category of name piece, a constant in the present design. Names that contain certain affixes or are regularly associated with certain pre-posed and post-posed particles are variants of those versions without such appendages. These may be eliminated by pre-processing the training set and by ignoring them during the automatic grouping process.
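The features (a) through (e) can be combined into one real-valued vector per comparison, roughly as sketched below. Several parts are assumptions made for illustration: the PARTICLES, COUNTS, and NICKNAMES tables are invented; crude_phonetic_key is a crude stand-in and does not reproduce the Phondex rules; the frequency feature is expressed as a min/max ratio, which is only one possible encoding. The Jaro-Winkler code uses the commonly cited prefix scale of 0.1 over at most four leading characters.

```python
PARTICLES = {"VAN", "DER", "DE", "LA", "JR", "SR"}           # hypothetical list
COUNTS = {"SMITH": 1000, "SMYTH": 20, "WILLIAM": 800, "BILL": 300}  # made up
NICKNAMES = {"WILLIAM": {"BILL", "WILL", "WM"}}              # made up

def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0

def jaro_winkler(s, t, scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)

def strip_particles(name):
    """(e) Upper-case the name and drop particle tokens before comparison."""
    tokens = [tok for tok in name.upper().split() if tok not in PARTICLES]
    return " ".join(tokens) if tokens else name.upper()

def crude_phonetic_key(name):
    """Stand-in encoder (NOT Phondex): first letter, then consonants, no repeats."""
    key = name[:1]
    for ch in name[1:]:
        if ch not in "AEIOUY" and ch != key[-1]:
            key += ch
    return key

def feature_vector(a, b, counts, nicknames, phonetic_key=crude_phonetic_key):
    """Return [spelling, phonetic, frequency, nickname], each in [0, 1]."""
    a, b = strip_particles(a), strip_particles(b)              # (e)
    spelling = jaro_winkler(a, b)                              # (a)
    key_a, key_b = phonetic_key(a), phonetic_key(b)
    phonetic = 1.0 if key_a == key_b else jaro_winkler(key_a, key_b)  # (b)
    freq_a, freq_b = counts.get(a, 0), counts.get(b, 0)
    top = max(freq_a, freq_b)
    frequency = min(freq_a, freq_b) / top if top else 0.0      # (c)
    is_nick = b in nicknames.get(a, ()) or a in nicknames.get(b, ())
    nickname = 1.0 if is_nick else 0.0                         # (d)
    return [spelling, phonetic, frequency, nickname]

print(feature_vector("Smith", "Smyth", COUNTS, NICKNAMES))
print(feature_vector("William", "Bill", COUNTS, NICKNAMES))
```

The sketch keeps the nickname match as a separate feature and leaves it to the learner to weight it. The rule in (d), under which a nickname match excludes (a) and (b) entirely, could instead be applied as a fixed override before the vector reaches the SVM.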