1.4 Tokenizer. Once the special characters that appear within tokens have been transformed into uniformly interpretable characters, the next step is to identify the portions of the name string that constitute tokens. These are the individual meaning-carrying name pieces, and combinations of name pieces, that are generally separated from one another by spaces. Other characters may supplant or augment the blank space, and with the advent of computer-coded character strings the possibilities have multiplied. Notable among such white-space characters are: (1) blank space, (2) tab, (3) new line, (4) carriage return, (5) form feed or new page, (6) null, and (7) backspace. Unicode adds still more characters of this sort to keep track of.
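
As a rough illustration of the white-space step, the sketch below splits a name string on the seven characters listed above. The names `EXTRA_SEPARATORS` and `split_on_separators` are hypothetical and chosen only for this example. Python's built-in `str.split()` already treats the Unicode white-space characters as separators, so only null and backspace need to be mapped to a blank first.

```python
# Null and backspace are not white space to Python, so they are
# mapped to a blank before splitting (assumption: they should
# separate tokens exactly as a blank does).
EXTRA_SEPARATORS = "\0\b"
_TO_BLANK = str.maketrans({ch: " " for ch in EXTRA_SEPARATORS})


def split_on_separators(name):
    """Return the pieces of `name` found between runs of separators.

    str.split() with no argument splits on any run of white space
    (blank, tab, new line, carriage return, form feed, and the
    Unicode white-space characters) and drops empty pieces.
    """
    return name.translate(_TO_BLANK).split()


print(split_on_separators("John\tQuincy\0 Adams\n"))
```

A run of several separators counts as a single break, so no empty tokens are produced.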
There appear to be at least three ways that the tokenizer can treat a delimiter: (1) Handle the symbol as though it were white space. (2) Acknowledge its role in separating tokens but consider it part of one of them. This is the so-called bound delimiter. (3) Consider the symbol by itself as a separate token. This is a delimiting token. A delimiter must be of this class when it lends a special interpretation to the name piece thus separated.
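
The three treatments can be sketched as a single pass over the characters of a name. This is a minimal sketch, not a definitive implementation: the function `tokenize`, the table `DELIMITER_CLASS`, and the class labels `"space"`, `"bound"`, and `"token"` are hypothetical names, and the underscore entry is an assumption added only to show the first treatment. The bound delimiter is attached to the piece it follows.

```python
# Hypothetical class assignments, for illustration only.
DELIMITER_CLASS = {
    "_": "space",  # assumption: handled as though it were white space
    ".": "bound",  # separates tokens but stays attached to the piece before it
    "/": "token",  # emitted as a separate, delimiting token
}


def tokenize(name):
    """Split `name` into tokens according to DELIMITER_CLASS."""
    tokens = []
    current = ""
    for ch in name:
        kind = DELIMITER_CLASS.get(ch)
        if ch.isspace() or kind == "space":
            # Treatment 1: the symbol only ends the current piece.
            if current:
                tokens.append(current)
            current = ""
        elif kind == "bound":
            # Treatment 2: the symbol ends the piece and is kept with it.
            tokens.append(current + ch)
            current = ""
        elif kind == "token":
            # Treatment 3: the symbol becomes a token in its own right.
            if current:
                tokens.append(current)
            tokens.append(ch)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize("Wm.H. /Smith/"))
```

Keeping the class assignments in a table rather than in the code lets the same loop serve data sets that use the symbols differently.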
The following table lists a number of possible delimiters. It illustrates the various possibilities that may be found in different data sets across cultures, languages, and writing styles.
Name | Symbol | Signification | Class | Comments |
---|---|---|---|---|
period | . | abbreviation precedes | bound | from pre-processor |
period | . | syllables of a native name | token | certain census returns |
colon | : | abbreviation precedes | bound | old European writing styles |
comma | , | jurisdictions in a hierarchy | locality name | |
hyphen | - | parts of compound name or syllables of a native name | token | like , , and many similar unicode characters |
semi-colon | ; | postpositive titles | bound | Ancestral File |
forward slash | / | surname follows or precedes | token | paired with itself; GEDCOM standard |
double quote | " | epithet or nickname follows or precedes | token | paired with itself; Ancestral File |
question mark | ? | uncertain character | token | |
question mark | ? | uncertain characters precede | token | in Spanish paired with ¿ |
question mark | ? | preceding characters in parens are uncertain | token | |
question mark | ? | uncertain tokens precede | token | in Spanish paired with ¿ |
less than | < | propagated data follows | token | paired with > |
greater than | > | propagated data precedes | token | paired with < |
left paren | ( | optional data follows | token | paired with ) |
right paren | ) | optional data precedes | token | paired with ( |
left bracket | [ | postpositive title follows | token | paired with ] |
right bracket | ] | postpositive title precedes | token | paired with [ |
left brace | { | optional data follows | token | paired with } |
right brace | } | optional data precedes | token | paired with { |
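
Several of the delimiting tokens in the table come in pairs. As a sketch of how a later stage might use that fact, the function below collects the tokens enclosed by each matched pair. The names `PAIRS` and `group_pairs` are hypothetical, nesting is not handled, and an opener with no matching closer is simply skipped; these are simplifying assumptions, not part of any standard.

```python
# Opening symbol -> closing symbol, taken from the "paired with" comments
# in the table. The slash and double quote are paired with themselves.
PAIRS = {"<": ">", "(": ")", "[": "]", "{": "}", "/": "/", '"': '"'}


def group_pairs(tokens):
    """Return a list of (opener, enclosed_tokens) for each matched pair."""
    groups = []
    i = 0
    while i < len(tokens):
        closer = PAIRS.get(tokens[i])
        if closer is not None:
            try:
                # Look for the nearest closer after the opener.
                j = tokens.index(closer, i + 1)
            except ValueError:
                j = None  # unmatched opener: leave it alone
            if j is not None:
                groups.append((tokens[i], tokens[i + 1:j]))
                i = j + 1
                continue
        i += 1
    return groups


print(group_pairs(["John", "/", "Smith", "/", "[", "Esq.", "]"]))
```

The opener returned with each group identifies which row of the table applies, for example a slash for a surname or a left bracket for a postpositive title.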