1.4 Tokenizer. Once the special characters that appear within tokens have been transformed into uniformly interpretable characters, the next step is to identify the portions of the name string that constitute tokens. These are the individual meaning-carrying name pieces, and combinations of name pieces, that are generally separated from one another by spaces. Other characters may supplant or augment the white space, and with the advent of computer-coded character strings the possibilities have multiplied. Notable among such white-space characters are: (1) blank space, (2) tab, (3) new line, (4) carriage return, (5) form feed or new page, (6) null, and (7) backspace. Unicode adds still more characters of this sort to keep track of.
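As a rough illustration of the separator list above, the following Python sketch splits a name string at runs of white-space characters. The function name `split_on_white_space` is hypothetical, and the choice to rely on the regular-expression class `\s` (which covers Unicode white space for `str` patterns) plus explicit null and backspace is an assumption of this sketch, not a prescribed method.

```python
import re

# Separator characters, following the list in the text.
# \s covers blank space, tab, new line, carriage return, form feed,
# and further Unicode white space; null (\x00) and backspace (\x08)
# are not matched by \s, so they are added explicitly (sketch assumption).
_SEPARATOR_RUN = re.compile(r"[\s\x00\x08]+")


def split_on_white_space(name: str) -> list[str]:
    """Split a name string into pieces at runs of separator characters."""
    # Empty pieces arise from leading or trailing separators; drop them.
    return [piece for piece in _SEPARATOR_RUN.split(name) if piece]


if __name__ == "__main__":
    print(split_on_white_space("John\tQuincy  Adams\n"))
    print(split_on_white_space("Mary\x00Ann\x08Lee"))
```

Treating a run of separators as a single break keeps doubled spaces or mixed tab-and-space sequences from producing empty name pieces.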

There appear to be at least three ways that the tokenizer can treat a delimiter: (1) Handle the symbol as though it were white space. (2) Acknowledge its role in separating tokens but consider it part of one of them. This is the so-called “bound” delimiter. (3) Consider the symbol itself a separate token. This would be a “delimiting token.” A delimiter must be of this class when it lends a special interpretation to the name piece thus separated.
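The three treatments can be sketched in Python as follows. This is a minimal illustration only: the names `DelimiterClass`, `DELIMITERS`, and `tokenize` are hypothetical, the underscore is an invented example of a symbol handled as white space, and a bound delimiter is assumed to attach to the token that precedes it.

```python
from enum import Enum


class DelimiterClass(Enum):
    WHITE_SPACE = "white space"  # (1) symbol is discarded like a space
    BOUND = "bound"              # (2) symbol stays attached to a token
    TOKEN = "token"              # (3) symbol becomes a token of its own


# Illustrative classification only (assumption of this sketch).
DELIMITERS = {
    "_": DelimiterClass.WHITE_SPACE,
    ".": DelimiterClass.BOUND,
    "/": DelimiterClass.TOKEN,
}


def tokenize(name: str, delimiters=DELIMITERS) -> list[str]:
    """Split a name into tokens according to each delimiter's class."""
    tokens: list[str] = []
    current = ""
    for ch in name:
        cls = delimiters.get(ch)
        if ch.isspace() or cls is DelimiterClass.WHITE_SPACE:
            # End the current token; the separator itself is dropped.
            if current:
                tokens.append(current)
                current = ""
        elif cls is DelimiterClass.BOUND:
            # End the current token, keeping the delimiter as its last character.
            tokens.append(current + ch)
            current = ""
        elif cls is DelimiterClass.TOKEN:
            # End the current token, then emit the delimiter separately.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(tokenize("Wm. H. /Smith/"))
```

Keeping the classification in a lookup table rather than in the loop means a different data set can supply its own mapping without changing the tokenizer itself.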

The following table lists a number of possible delimiters. It illustrates the various possibilities that may be found in different data sets across cultures, languages, and writing styles.

| Name | Symbol | Signification | Class | Comments |
|---|---|---|---|---|
| period | . | abbreviation precedes | bound | from pre-processor |
| | | syllables of a native name | token | certain census returns |
| colon | : | abbreviation precedes | bound | old European writing styles |
| comma | , | jurisdictions in a hierarchy | | locality name |
| hyphen | - | parts of compound name or syllables of a native name | token | like , , and many similar Unicode characters |
| semi-colon | ; | postpositive titles | bound | Ancestral File |
| forward slash | / | surname follows or precedes | token | paired with itself; GEDCOM standard |
| double quote | " | epithet or nickname follows or precedes | token | paired with itself; Ancestral File |
| question mark | ? | uncertain character | token | |
| | | uncertain characters precede | token | in Spanish paired with ¿ |
| | | preceding characters in parens are uncertain | token | |
| | | uncertain tokens precede | token | in Spanish paired with ¿ |
| less than | < | propagated data follows | token | paired with > |
| greater than | > | propagated data precedes | token | paired with < |
| left paren | ( | optional data follows | token | paired with ) |
| right paren | ) | optional data precedes | token | paired with ( |
| left bracket | [ | postpositive title follows | token | paired with ] |
| right bracket | ] | postpositive title precedes | token | paired with [ |
| left brace | { | optional data follows | token | paired with } |
| right brace | } | optional data precedes | token | paired with { |