Pre-processor

1.3 Pre-processor. The third stage in standardizing a name string applies a set of rules called character transforms. The characters that usually undergo this initial clean-up are apostrophes, dashes (hyphens), periods, and underscores. In the past these rules have been called the “ADPU rules.” How each symbol is interpreted depends on the culture and language of the data. Here are examples of how the symbols are handled in names in files that use the GIANT system of standardization.

Name	Symbol	Signification	Treatment
Apostrophe	’	a distinct sound in Hawaiian, or omission of one or more letters	leave in
Paired single quotes	’ ’	when preceded or followed by whitespace, the same as double quote marks (”)	leave in
Hyphen	-	pieces on either side class as parts of a compound	save class & strip
Period	.	preceding piece classes as an abbreviation	save class & strip
Underscore	_	preceding piece has a gender opposite to the sex of the person bearing the name	save class & strip
		preceding piece is given-name but has form of a prepositive title	save class & strip
		preceding piece with apocope to following piece is given-name abbreviation	save class & strip
		preceding piece is title of noble house when following piece is “RN” (royalty/nobility)	save class & strip
		preceding piece is prepositive title with the (deviant) form of the piece following	save class & strip
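
The hyphen and period rows of the table above can be sketched in Python as follows. This is only a minimal illustration, not the GIANT implementation: the function name `preprocess`, the `MARKER_CLASS` labels, and the output shape are assumptions made for the example. The underscore has several possible meanings in the table, so the sketch only flags it rather than deciding which one applies.

```python
import re

# Hypothetical labels for the class that is saved before a symbol is stripped.
MARKER_CLASS = {
    "-": "compound",          # pieces on either side are parts of a compound
    ".": "abbreviation",      # preceding piece is an abbreviation
    "_": "underscore-flag",   # preceding piece needs one of the underscore readings
}

def preprocess(name):
    """Return a list of (piece, tags) with hyphens, periods, underscores stripped.

    Apostrophes are not in the split pattern, so they are left in the piece.
    """
    # Split on whitespace and on the three markers, keeping markers as tokens.
    tokens = [t for t in re.split(r"([-._])|\s+", name) if t]
    result = []
    pending = None  # tag to carry over to the piece that follows a hyphen
    for tok in tokens:
        tag = MARKER_CLASS.get(tok)
        if tag is None:
            # An ordinary piece: start it with any tag carried over from a hyphen.
            result.append((tok, [pending] if pending else []))
            pending = None
        else:
            # A marker: save its class on the preceding piece, then drop the symbol.
            if result and tag not in result[-1][1]:
                result[-1][1].append(tag)
            if tok == "-":
                pending = tag  # the hyphen also classes the following piece
    return result

print(preprocess("Mary-Ann J. O'Brien"))
```

For the sample string, "Mary" and "Ann" are both tagged as parts of a compound, "J" is tagged as an abbreviation, and "O'Brien" keeps its apostrophe and receives no tag.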

Other cultures and other data sets may use other conventions. The 1881 Census of Canada includes the following:

Name	Symbol	Signification	Treatment
Apostrophe	’	omission of one or more letters in French	leave in
Hyphen	-	pieces on either side class as parts of a compound	save class & strip
		pieces on either side are parts of a multiple syllable name (Native American, Chinese, interpretive)	save class & strip
Period	.	preceding piece classes as an abbreviation	save class & strip
		pieces on either side are parts of a multiple syllable name (Native American, Chinese)	save class & strip
Underscore	_	pieces on either side are parts of a multiple syllable name (Native American, Chinese)	save class & strip

The handling of other special characters, such as #, &, and %, also depends on the culture and data set involved. Possible treatments include: (1) stripping the character, (2) converting it to whitespace before tokenizing, (3) treating it as a valid character, or (4) transliterating it to some other Unicode character.
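
As a rough sketch of those four treatments, a per-data-set policy could be written as a Python translation table. The particular assignments below are assumptions chosen only to show one character per treatment; a real data set would set its own. The fullwidth ampersand (U+FF06) is included purely as an example of transliteration, and the names `POLICY` and `clean_special` are hypothetical.

```python
# Hypothetical policy for one data set; each entry illustrates one treatment.
POLICY = str.maketrans({
    "#": None,       # (1) strip the character
    "%": " ",        # (2) convert to whitespace before tokenizing
    # "&" has no entry, so it is (3) kept as a valid character
    "\uff06": "&",   # (4) transliterate fullwidth ampersand to "&"
})

def clean_special(text):
    """Apply the policy to the string, then tokenize on whitespace."""
    return text.translate(POLICY).split()

print(clean_special("Smith#Jones%Brown & Co"))
```

Because the policy is plain data, a different culture or data set can swap in its own table without changing the function.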
