1.3 Pre-processor. The third stage in standardizing a name string applies a set of rules called character transforms. The characters that usually undergo this initial clean-up are apostrophes, dashes (hyphens), periods, and underscores. In the past these rules have been called the “ADPU rules.” The specific rules depend on how these symbols are interpreted in the culture and language of the data. The following examples show how they are handled in names in files that use the GIANT system of standardization.

| Name | Symbol | Signification | Treatment |
|---|---|---|---|
| Apostrophe | ' | a distinct sound in Hawaiian, or omission of one or more letters | leave in |
| Paired single quotes | ’ ’ | preceded or followed by whitespace are same as quote marks | leave in |
| Hyphen | - | pieces on either side class as parts of a compound | save class & strip |
| Period | . | preceding piece classes as an abbreviation | save class & strip |
| Underscore | _ | preceding piece has its gender opposite to the sex of the person assigned the name | save class & strip |
| | | preceding piece is given-name but has form of a prepositive title | save class & strip |
| | | preceding piece with apocope to following piece is given-name abbreviation | save class & strip |
| | | preceding piece is title of noble house when following piece is “RN” (royalty/nobility) | save class & strip |
| | | preceding piece is prepositive title with the (deviant) form of the piece following | save class & strip |
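
The "leave in" and "save class & strip" treatments can be illustrated with a minimal Python sketch. It is not the GIANT implementation: the names `Piece`, `SYMBOL_CLASS`, and `preprocess`, and the class labels themselves, are hypothetical and chosen only for illustration. The underscore carries several meanings in the table, so the sketch records only a generic marker for it and leaves the interpretation to a later stage.

```python
import re
from dataclasses import dataclass, field

# Hypothetical class labels; the table names the treatments, not these identifiers.
SYMBOL_CLASS = {
    "-": "compound-part",      # pieces on either side are parts of a compound
    ".": "abbreviation",       # preceding piece is an abbreviation
    "_": "underscore-marked",  # several possible meanings; not resolved here
}


@dataclass
class Piece:
    text: str
    classes: list = field(default_factory=list)

    def add_class(self, label: str) -> None:
        # Record each class once, in the order first seen.
        if label not in self.classes:
            self.classes.append(label)


def preprocess(name: str) -> list:
    """Split a name into pieces, saving the class implied by each
    hyphen, period, or underscore and then stripping the symbol.
    Apostrophes are left in place."""
    pieces = []
    for token in name.split():
        # The capture group keeps the symbols: text, symbol, text, symbol, ..., text
        parts = re.split(r"([-._])", token)
        local = [Piece(text) for text in parts[0::2]]
        for i, symbol in enumerate(parts[1::2]):
            label = SYMBOL_CLASS[symbol]
            local[i].add_class(label)          # the preceding piece
            if symbol == "-":
                local[i + 1].add_class(label)  # a hyphen also classes the following piece
        # Drop empty pieces left by leading or trailing symbols.
        pieces.extend(p for p in local if p.text)
    return pieces


if __name__ == "__main__":
    for piece in preprocess("Wm. Mary-Ann O'Brien"):
        print(piece.text, piece.classes)
```

In this sketch the symbol itself never reaches the output; only the class it implied is carried forward on the piece, which is one reading of "save class & strip."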

Other cultures and other data sets may use other conventions. The 1881 Census of Canada includes the following:

| Name | Symbol | Signification | Treatment |
|---|---|---|---|
| Apostrophe | ' | omission of one or more letters in French | leave in |
| Hyphen | - | pieces on either side class as parts of a compound | save class & strip |
| | | pieces on either side are parts of a multiple syllable name (Native American, Chinese, interpretive) | save class & strip |
| Period | . | preceding piece classes as an abbreviation | save class & strip |
| | | pieces on either side are parts of a multiple syllable name (Native American, Chinese) | save class & strip |
| Underscore | _ | pieces on either side are parts of a multiple syllable name (Native American, Chinese) | save class & strip |

How to handle other special characters such as #, &, and % likewise depends on the culture and data set involved. Possible treatments include: (1) strip it, (2) convert it to whitespace before tokenizing, (3) consider it a valid character, or (4) transliterate it to some other Unicode character.
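
The four treatments can be expressed as a per-character policy table, sketched below in Python. The names `Treatment`, `EXAMPLE_POLICY`, and `apply_treatments` are hypothetical, and the assignment of a treatment to each character is arbitrary: the right choice depends on the data set. The backtick-to-apostrophe mapping is an added illustration of transliteration and does not come from any particular rule set.

```python
from enum import Enum


class Treatment(Enum):
    STRIP = 1           # (1) remove the character
    TO_WHITESPACE = 2   # (2) replace with a space before tokenizing
    KEEP = 3            # (3) treat as a valid character
    TRANSLITERATE = 4   # (4) replace with another Unicode character


# Illustrative policy only; a real data set would define its own.
EXAMPLE_POLICY = {
    "#": (Treatment.STRIP, None),
    "&": (Treatment.TO_WHITESPACE, None),
    "%": (Treatment.KEEP, None),
    "`": (Treatment.TRANSLITERATE, "'"),  # assumed example mapping
}


def apply_treatments(text: str, policy: dict) -> str:
    """Apply the per-character policy; characters not listed are kept."""
    out = []
    for ch in text:
        treatment, replacement = policy.get(ch, (Treatment.KEEP, None))
        if treatment is Treatment.STRIP:
            continue
        if treatment is Treatment.TO_WHITESPACE:
            out.append(" ")
        elif treatment is Treatment.TRANSLITERATE:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    cleaned = apply_treatments("Smith & Jones #2 50% O`Neil", EXAMPLE_POLICY)
    print(cleaned.split())
```

Keeping the policy in a dictionary means a different culture or data set can swap in its own table without changing the function.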