Tokenizer

1.4 Tokenizer. Once the special characters that appear within tokens have been transformed into uniformly interpretable characters, the next step is to identify the portions of the name string that constitute tokens. Tokens are the individual meaning-carrying pieces of a name, or combinations of pieces, and they are generally separated from one another by spaces. Other characters may supplant or augment the blank space, and with the advent of computer-coded character strings the possibilities have multiplied. Notable among such white-space characters are: (1) blank space, (2) tab, (3) new line, (4) carriage return, (5) form feed or new page, (6) null, and (7) backspace. Unicode adds still more characters of this sort to keep track of.
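As a minimal sketch of this first step, the function below splits a name string on the seven characters listed above plus whatever else Python's regular-expression engine classifies as Unicode white space. The function name `split_on_whitespace` is hypothetical, and treating null and backspace as separators follows the list above rather than any Python default.

```python
import re

# The seven listed characters (space, tab, new line, carriage return,
# form feed, null, backspace) plus \s, which for str patterns also
# covers Unicode white space such as the no-break space.
WHITESPACE_RUN = re.compile(r"[ \t\n\r\f\x00\x08\s]+")

def split_on_whitespace(name: str) -> list[str]:
    """Split a name string into pieces separated by runs of white space."""
    # Leading or trailing separators produce empty strings; drop them.
    return [piece for piece in WHITESPACE_RUN.split(name) if piece]

if __name__ == "__main__":
    print(split_on_whitespace("John\tQuincy\x00Adams\u00a0Jr"))
```

Runs of mixed separators collapse into a single break, so the result contains only the name pieces themselves.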

There appear to be at least three ways that the tokenizer can treat a delimiter: (1) Handle the symbol as though it were white space. (2) Acknowledge its role in separating tokens but treat it as part of one of them; this is the so-called “bound” delimiter. (3) Treat the symbol by itself as a separate token; this is a “delimiting token.” A delimiter must be of this third class when it lends a special interpretation to the name piece it sets off.
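The three treatments can be sketched as a small character-by-character tokenizer. This is an illustration under stated assumptions, not a definitive implementation: the names `DelimiterClass`, `EXAMPLE_CLASSES`, and `tokenize` are hypothetical, the class assignments are chosen only for the example, and a bound delimiter is assumed to attach to the token that precedes it.

```python
from enum import Enum

class DelimiterClass(Enum):
    SPACE = "space"  # treated as white space and dropped
    BOUND = "bound"  # ends a token but stays attached to it
    TOKEN = "token"  # emitted as a separate, delimiting token

# Hypothetical assignments for illustration only.
EXAMPLE_CLASSES = {
    ".": DelimiterClass.BOUND,
    "/": DelimiterClass.TOKEN,
    "_": DelimiterClass.SPACE,
}

def tokenize(name, classes=EXAMPLE_CLASSES):
    """Split a name string according to the class of each delimiter."""
    tokens = []
    current = ""
    for ch in name:
        kind = classes.get(ch)
        if kind is None and ch.isspace():
            kind = DelimiterClass.SPACE
        if kind is None:
            current += ch          # ordinary character: extend the token
            continue
        if kind is DelimiterClass.BOUND:
            current += ch          # assumption: bind to the preceding token
        if current:
            tokens.append(current)
            current = ""
        if kind is DelimiterClass.TOKEN:
            tokens.append(ch)      # the delimiter stands as its own token
    if current:
        tokens.append(current)
    return tokens

if __name__ == "__main__":
    print(tokenize("Wm. Henry /Smith/"))
```

With these assignments, the period stays attached to the abbreviation it follows, while each slash comes out as its own token, so a later stage can interpret the enclosed piece as a surname.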

The following table lists a number of possible delimiters. It illustrates the various treatments that may be found in different data sets, cultures, languages, and writing styles.

Name	Symbol	Signification	Class	Comments
period	.	abbreviation precedes	bound	from pre-processor
		syllables of a native name	token	certain census returns
colon	:	abbreviation precedes	bound	old European writing styles
comma	,	jurisdictions in a hierarchy		locality name
hyphen	-	parts of compound name or syllables of a native name	token	like –, —, and many similar Unicode characters
semi-colon	;	postpositive titles	bound	Ancestral File
forward slash	/	surname follows or precedes	token	paired with itself; GEDCOM standard
double quote	"	epithet or nickname follows or precedes	token	paired with itself; Ancestral File
question mark	?	uncertain character	token
		uncertain characters precede	token	in Spanish paired with ¿
		preceding characters in parens are uncertain	token
		uncertain tokens precede	token	in Spanish paired with ¿
less than	<	propagated data follows	token	paired with >
greater than	>	propagated data precedes	token	paired with <
left paren	(	optional data follows	token	paired with )
right paren	)	optional data precedes	token	paired with (
left bracket	[	postpositive title follows	token	paired with ]
right bracket	]	postpositive title precedes	token	paired with [
left brace	{	optional data follows	token	paired with }
right brace	}	optional data precedes	token	paired with {
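Several rows in the table note that a delimiter is paired with itself or with a counterpart. One way a tokenizer might check those pairings, once delimiting tokens have been separated out, is with a stack. The sketch below is an assumption about how such a check could work; the names `PAIRS` and `unmatched_delimiters` are hypothetical, and the Spanish ¿ ? pairing is left out because ? also occurs unpaired.

```python
# Opening delimiter -> closing delimiter, from the "paired with" comments.
PAIRS = {"<": ">", "(": ")", "[": "]", "{": "}", "/": "/", '"': '"'}

def unmatched_delimiters(tokens):
    """Return stray closers, followed by openers that were never closed."""
    closers = set(PAIRS.values())
    stack = []      # opening delimiters still waiting for a partner
    unmatched = []  # closing delimiters that had no opener
    for tok in tokens:
        if stack and tok == PAIRS[stack[-1]]:
            stack.pop()            # closes the most recent opener
        elif tok in PAIRS:
            stack.append(tok)      # self-paired symbols open when not closing
        elif tok in closers:
            unmatched.append(tok)
    return unmatched + stack

if __name__ == "__main__":
    print(unmatched_delimiters(["John", "/", "Smith", "/"]))
    print(unmatched_delimiters(["(", "Jr", "]"]))
```

An empty result means every paired delimiter found its partner; anything returned marks a name string that may need review before its pieces are interpreted.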
