1.5 Classifier.
The standardization process must assign to each token its possible class membership.
Each token is an individual meaning-carrying portion of a name, or a combination of name pieces, that is found as an entry in the LEXICON.
There are also situations where multiple tokens together form a phrase which has an entry in the LEXICON.
Hence, the classifier first takes the full phrase and analyzes it piecemeal, considering smaller and smaller combinations as it goes.
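The lookup order just described can be sketched as a greedy longest-match search. This is only an illustration under stated assumptions: the miniature `LEXICON` dictionary, the `classify` function, the left-to-right scan, and the `unknown` fallback are hypothetical and are not taken from the system described here.

```python
# Hypothetical miniature lexicon: lower-cased phrase -> possible categories.
LEXICON = {
    "also known as": ["separator"],
    "doctor of divinity": ["postnominal initials"],
    "doctor": ["rank"],
    "of": ["preposition"],
    "thomas jefferson": ["given name"],
    "thomas": ["given name", "surname"],
    "smith": ["surname"],
}


def classify(tokens):
    """Try the widest span first, then smaller and smaller combinations."""
    result = []
    i = 0
    while i < len(tokens):
        # Shrink the candidate span from the right until an entry is found.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j]).lower()
            if phrase in LEXICON:
                result.append((phrase, LEXICON[phrase]))
                i = j
                break
        else:
            # Assumption: a token with no entry at all is flagged as unknown.
            result.append((tokens[i].lower(), ["unknown"]))
            i += 1
    return result
```

Under this sketch, `classify("Thomas Jefferson Smith".split())` matches the two-word entry `thomas jefferson` before it ever considers `thomas` on its own, which is the behaviour a phrase-first classifier is meant to have.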
The following table lists a number of possible lexical entries.
It illustrates the variety of entries that may be found in English and in related languages and cultures.
Token | Category | Class | Gender | Standard |
---|---|---|---|---|
| hyphen | deviant | | - |
, | comma | standard | | |
- | hyphen | standard | | |
/ | separator | abbreviation | | or |
-son | patronymic | standard, affix | M | |
aka | separator | standard, abbreviation | | |
also known as | separator | full form | | aka |
ap | particle | standard | M | |
Archbishop | position | standard | M | |
Bachelor of Science | postnominal initials | full form | | B.S. |
Baker | occupation name | standard | M | |
Baker | surname | standard | | |
Baron | position | standard | M | |
Baron | surname | standard | | |
beatitude | abstraction | standard | M | |
Bishop | position | standard | | |
Bishop | surname | standard | | |
Black | epithet | standard | | |
Black | surname | standard | | |
Brother | rank | standard | M | |
B. S. | postnominal initials | standard, abbreviation | | |
Buttons | occupation name | standard | M | |
Buttons | surname | standard | | |
Canterbury | domain name | standard | M | |
Cardinal | cardinal | standard | M | |
Christianson | surname | standard | | |
Cobler | occupation name | standard | M | |
Cobler | surname | standard | | |
Cowper | occupation name | standard | M | |
Cowper | surname | standard | | |
D. D. | postnominal initials | standard, abbreviation | | |
de | nobiliary | standard | | |
dit | separator | standard | | |
Doctor | rank | standard | | |
Doctor of Divinity | postnominal initials | full form | | D. D. |
Doctor of Philosophy | postnominal initials | interpretive form of Philosophiae Doctor | | Ph. D. |
Donald | given name | standard | M | |
Donald | surname | standard | | |
eminence | abstraction | standard | M | |
Father | rank | standard | M | |
fifth | ordinal phrase | full form | M | V |
geb. | separator | standard, abbreviation | | |
genannt | separator | standard | | |
her | possessive | standard | F | |
his | possessive | standard | M | |
holiness | abstraction | standard | M | |
House of Representatives | domain name | standard | | |
Howard | surname | standard | | |
Howard | title | standard | M | |
II | ordinal phrase | standard, Roman numeral | | |
John | given name | standard | M | |
John | surname | standard | | |
Jones | surname | standard | | |
Junior | comparative adjective | standard | M | |
King | position | standard | M | |
King | surname | standard | | |
Little | epithet | standard | | |
Longespee | epithet | standard | M | |
Lord | position | standard | M | |
Lord | surname | standard | | |
Mac- | patronymic particle | standard, affix | | |
Macedonia | domain name | standard | M | |
Mary | given name | standard | F | |
Mayor | epithet | standard | M | |
McDonald | surname | standard | | |
Monsignor | rank | standard | M | |
most | quantifier | standard | | |
Mother | rank | standard | F | |
O- | patronymic particle | standard, affix | | |
of | preposition | standard | | |
or | separator | standard | | |
Patriarch | position | standard | M | |
Peace | epithet | standard | | |
Ph. D. | postnominal initials | standard, abbreviation | | |
Philosophiae Doctor | postnominal initials | full form | M | Ph. D. |
Pope | rank | standard | M | |
Pope | surname | standard | | |
reverend | attribute | standard | M | |
right | quantifier | standard | | |
royal | qualifier | standard | | |
second | ordinal phrase | full form | | II |
Senior | comparative adjective | standard | M | |
S. C. | postnominal initials | standard, abbreviation | | |
Sister | rank | standard | F | |
S. J. | postnominal initials | standard, abbreviation | M | |
Smith | surname | standard | | |
Society of Jesus | postnominal initials | full form | M | S. J. |
Stewart | title | standard | M | |
Stewart | surname | standard | | |
Superior | comparative adjective | standard | | |
Supreme Court | postnominal initials | full form | | S. C. |
the | determiner | standard | | |
Thomas | given name | standard | M | |
Thomas | surname | standard | | |
Thomas Jefferson | given name | standard | M | |
Thos. | given name | abbreviation | M | Thomas |
T. J. | given name | nickname, abbreviation | M | Thomas Jefferson |
V | ordinal phrase | standard, Roman numeral | M | |
v. | patronymic particle | abbreviation | F | verch |
v. | nobiliary | abbreviation | | van |
van | nobiliary | standard | | |
venerable | attribute | standard | M | |
verch | patronymic particle | standard | F | |
very | quantifier | standard | | |
vulgo | separator | standard | M | |
Weaver | occupation name | standard | M | |
Weaver | surname | standard | | |
Windsor | title | standard | | |
Windsor | surname | standard | | |
Xtianson | surname | abbreviation | | Christianson |
York | surname | standard | | |
York | title | standard | | |
There are, of course, many other features (we have shown only gender) that may appear in a lexical entry.
In addition to these, some entries need to carry relative frequency information.
The parser requires some means of ranking ambiguous structures according to their likelihood of occurrence.
For example, John is far and away more frequent as a given name than as a surname.
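A minimal sketch of how a parser might use such information to order competing readings follows. The `Entry` record, the `rel_freq` field, and the `ranked_readings` function are hypothetical, and the frequency values are placeholders chosen only to reflect the ordering stated above, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    token: str
    category: str
    rel_freq: float  # placeholder share of this token's uses, not real data


# Illustrative values only: "John" is assumed far more common as a given name.
ENTRIES = [
    Entry("John", "surname", 0.05),
    Entry("John", "given name", 0.95),
    Entry("Jones", "surname", 1.0),
]


def ranked_readings(token, entries=ENTRIES):
    """Return all entries for a token, most frequent reading first."""
    candidates = [e for e in entries if e.token == token]
    return sorted(candidates, key=lambda e: e.rel_freq, reverse=True)
```

Here `ranked_readings("John")` places the given-name reading ahead of the surname reading, so a parser that tries readings in this order would explore the likelier structure first.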
When record linkage uses the personal name for identification, relative frequency is also needed to calculate the weights of individual records.
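One common way to turn a relative frequency into a weight, assumed here rather than taken from the text, is to score agreement on a name by the negative base-2 logarithm of its frequency, so that agreement on a rare name counts for more than agreement on a common one. The function name `agreement_weight` is hypothetical.

```python
import math


def agreement_weight(rel_freq):
    """Weight for agreement on a name value: -log2 of its relative frequency.

    Assumption: rel_freq is the share of records carrying this name,
    with 0 < rel_freq <= 1. Rarer names receive larger weights.
    """
    if not 0.0 < rel_freq <= 1.0:
        raise ValueError("rel_freq must be in the interval (0, 1]")
    return -math.log2(rel_freq)
```

With this formula, a name carried by a quarter of all records would receive a weight of 2, and a name carried by half of them a weight of 1; the actual weighting scheme used in a given linkage system may differ.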