1.3 Pre-processor. The third stage in standardizing a name string applies a set of rules called character transforms. The characters that usually undergo this initial clean-up are apostrophes, dashes (hyphens), periods, and underscores; in the past the rules have been called the ADPU rules after these four symbols. The specific rules depend on how the symbols are interpreted in the culture and language of the data. The following table shows how they are handled in names in files that use the GIANT system of standardization.
Name | Symbol | Signification | Treatment |
---|---|---|---|
Apostrophe | ' | a distinct sound in Hawaiian, or omission of one or more letters | leave in |
Paired single quotes | ' ' | preceded or followed by whitespace are same as quote marks (") | leave in |
Hyphen | - | pieces on either side class as parts of a compound | save class & strip |
Period | . | preceding piece classes as an abbreviation | save class & strip |
Underscore | _ | preceding piece has its gender opposite to the sex of the person assigned the name | save class & strip |
Underscore | _ | preceding piece is given-name but has form of a prepositive title | save class & strip |
Underscore | _ | preceding piece with apocope to following piece is given-name abbreviation | save class & strip |
Underscore | _ | preceding piece is title of noble house when following piece is RN (royalty/nobility) | save class & strip |
Underscore | _ | preceding piece is prepositive title with the (deviant) form of the piece following | save class & strip |
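As a rough illustration of the "save class & strip" treatment in the table above, the Python sketch below splits a name into pieces, strips hyphens, periods, and underscores, and records a class label on the affected pieces, while leaving apostrophes in place. The function name `preprocess`, the `SYMBOL_CLASS` mapping, and the label strings are hypothetical and are not taken from the GIANT system. In particular, the single label `underscore-flag` stands in for all five underscore meanings, which would need further context to tell apart, and paired single quotes are not handled.

```python
import re

# Hypothetical class labels; the actual GIANT class codes are not given in the text.
SYMBOL_CLASS = {
    "-": "compound",         # pieces on either side are parts of a compound
    ".": "abbreviation",     # preceding piece is an abbreviation
    "_": "underscore-flag",  # preceding piece carries one of the underscore meanings
}


def preprocess(name):
    """Strip '-', '.', '_' from a name and save the class each symbol signalled.

    Returns a list of (piece, classes) tuples. Apostrophes are left in the pieces.
    """
    pieces, classes = [], []
    pending = None  # a hyphen also marks the piece that follows it
    # Split on whitespace and on the three symbols, keeping the symbols as tokens.
    tokens = [t for t in re.split(r"([-._])|\s+", name) if t]
    for tok in tokens:
        if tok in SYMBOL_CLASS:
            label = SYMBOL_CLASS[tok]
            if pieces and label not in classes[-1]:
                classes[-1].append(label)  # mark the preceding piece
            if tok == "-":
                pending = label            # mark the following piece too
        else:
            pieces.append(tok)
            classes.append([pending] if pending else [])
            pending = None
    return list(zip(pieces, classes))


print(preprocess("Mary-Ann Wm. O'Brien"))
# [('Mary', ['compound']), ('Ann', ['compound']), ('Wm', ['abbreviation']), ("O'Brien", [])]
```

Keeping the symbol-to-class mapping in a separate dictionary is one way to let each data set supply its own conventions without changing the splitting logic.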
Other cultures and other data sets may use other conventions. The 1881 Census of Canada, for example, uses the following:
Name | Symbol | Signification | Treatment |
---|---|---|---|
Apostrophe | ' | omission of one or more letters in French | leave in |
Hyphen | - | pieces on either side class as parts of a compound | save class & strip |
Hyphen | - | pieces on either side are parts of a multiple syllable name (Native American, Chinese, interpretive) | save class & strip |
Period | . | preceding piece classes as an abbreviation | save class & strip |
Period | . | pieces on either side are parts of a multiple syllable name (Native American, Chinese) | save class & strip |
Underscore | _ | pieces on either side are parts of a multiple syllable name (Native American, Chinese) | save class & strip |
The treatment of other special characters, such as #, &, and %, likewise depends on the culture and data set involved. Possible treatments include: (1) stripping the character, (2) converting it to whitespace before tokenizing, (3) treating it as a valid character, or (4) transliterating it to some other Unicode character.
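The four treatments can be expressed as a per-data-set character policy. The Python sketch below is a minimal illustration using the standard `str.maketrans` and `str.translate` methods. The function name `apply_special_chars` and both example policies are hypothetical; which treatment each character receives is an assumption made for the example, not a rule stated for any particular data set.

```python
def apply_special_chars(name, policy):
    """Apply a per-character policy, then collapse runs of whitespace.

    In `policy`, each key is a single character and each value is:
      None            -> (1) strip the character
      " "             -> (2) convert it to whitespace before tokenizing
      the same char   -> (3) keep it as a valid character
      any other char  -> (4) transliterate it to that character
    """
    cleaned = name.translate(str.maketrans(policy))
    return " ".join(cleaned.split())


# Hypothetical policy for one data set: strip '#', split on '%', keep '&'.
policy_a = {"#": None, "%": " ", "&": "&"}
# Hypothetical variant for another data set: transliterate '&' to '+'.
policy_b = {"#": None, "%": " ", "&": "+"}

print(apply_special_chars("Smith #2 & Sons%Co", policy_a))  # Smith 2 & Sons Co
print(apply_special_chars("Smith #2 & Sons%Co", policy_b))  # Smith 2 + Sons Co
```

Characters that do not appear in the policy pass through unchanged, so an omitted character is treated as valid by default.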