3.2 Phonetic, dialect, & compressed. The Phondex rule sets comprise four divisions of rules: affix, phonetic, dialect, and compressed. The rules execute so that the output of each level becomes the input of the next level. Since the output of any level may serve as the basis for the grouping decision, each is designated a similarity code.

Phondex rules allow the environment in which a rule is valid to be specified as follows. The rule first specifies the OLD element, the sequence that must appear in the input string. It also specifies the NEW element, the string that replaces the OLD element. If NEW is blank, the OLD element is deleted. Only rarely is replacement unconditional. The WHERE element of the rule places one such condition on the rule’s execution: OLD must appear string-initially (I), string-finally (F), medially, i.e., neither initially nor finally (M), or anywhere (A). A more precise means of specifying these conditions makes use of two additional fields, ANTE and POST. In practice it is entirely possible to make the WHERE element redundant by stating the conditions appropriately in these two fields.

 OLD    NEW   ANTE  POST
Figure 3. Format of Rewrite Rule
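The effect of the WHERE element can be illustrated with a minimal Python sketch. The `Rule` class and the `apply_where` function are hypothetical names introduced here, not part of Phondex, and the sketch assumes that every position is judged against the original input string and that the ANTE and POST conditions are ignored.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical container for the OLD, NEW, and WHERE elements of a rule."""
    old: str
    new: str
    where: str = "A"  # I = initial, F = final, M = medial, A = anywhere


def apply_where(rule: Rule, s: str) -> str:
    """Replace OLD by NEW wherever the WHERE condition holds (assumed semantics)."""
    out = []
    i = 0
    n = len(rule.old)
    while i < len(s):
        if n and s.startswith(rule.old, i):
            initial = i == 0
            final = i + n == len(s)
            allowed = (
                rule.where == "A"
                or (rule.where == "I" and initial)
                or (rule.where == "F" and final)
                or (rule.where == "M" and not initial and not final)
            )
            if allowed:
                out.append(rule.new)  # a blank NEW deletes OLD
                i += n
                continue
        out.append(s[i])
        i += 1
    return "".join(out)


print(apply_where(Rule("PH", "F", "I"), "PHILIPH"))  # FILIPH
print(apply_where(Rule("E", "", "F"), "ELISE"))      # ELIS
print(apply_where(Rule("T", "D", "M"), "TATTT"))     # TADDT
```

The example rules are invented for illustration and are not taken from any actual Phondex rule table.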

Environments are specified in terms of a character or character sequence: 1) one that must appear before the OLD element (ANTE) and/or 2) one that must follow it (POST). The power of these conditions resides in cover symbols (variables). Alphabetic characters are taken literally (literals), and multiple elements are concatenated into a single string by placing a plus sign (+) between them. The underscore character ( _ ) is the cover symbol for (stands for) any legitimate vowel or vowel sequence. The vertical bar, i.e., pipe or Sheffer stroke ( | ), stands for any legitimate sequence of non-vowels. The linguist decides beforehand, in building the cover symbol table, which symbols (numbers, lower-case letters, etc.) are to stand for the various phonetic elements of the language. Multiple ANTE strings and POST strings are allowed in a single rule, provided they are separated by commas. The pound (#) character in ANTE indicates that OLD may be word-initial, and in POST it indicates that OLD may be word-final. The minus sign (–) indicates that the class, vowel or non-vowel, is to be diminished by removing the symbol following it. Minus signs can be accumulated when the class is to be diminished further.
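One way to read this notation is as a small pattern language that can be translated into regular expressions. The sketch below is an illustration under several assumptions rather than the Phondex implementation: the vowel and non-vowel inventories are guesses, the minus sign is written as an ASCII hyphen, a diminished class is taken to mean a sequence drawn only from the remaining members, and the function names are hypothetical.

```python
import re

VOWELS = "AEIOUY"                      # assumed vowel inventory
NON_VOWELS = "BCDFGHJKLMNPQRSTVWXZ"    # assumed non-vowel inventory


def _term_to_regex(term: str, edge: str) -> str:
    """Translate one environment string into a regex fragment."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch in "_|":
            members = VOWELS if ch == "_" else NON_VOWELS
            i += 1
            # Accumulated "-X" exclusions diminish the class.
            while i + 1 < len(term) and term[i] == "-":
                members = members.replace(term[i + 1], "")
                i += 2
            # "(?!)" never matches; used if the class has been emptied.
            parts.append(f"[{members}]+" if members else "(?!)")
        elif ch == "#":
            parts.append(edge)  # word boundary: "^" in ANTE, "$" in POST
            i += 1
        elif ch == "+":
            i += 1              # concatenation marker, no regex needed
        else:
            parts.append(re.escape(ch))  # literal character
            i += 1
    return "".join(parts)


def environment_holds(s: str, start: int, end: int,
                      ante: str = "", post: str = "") -> bool:
    """Check whether the OLD match s[start:end] sits in the ANTE/POST environment."""
    before, after = s[:start], s[end:]
    if ante:
        alts = [_term_to_regex(t.strip(), "^") for t in ante.split(",")]
        if not re.search("(?:" + "|".join(alts) + ")$", before):
            return False
    if post:
        alts = [_term_to_regex(t.strip(), "$") for t in post.split(",")]
        if not re.match("(?:" + "|".join(alts) + ")", after):
            return False
    return True


# "GH" in KNIGHT occupies positions 3..5.
print(environment_holds("KNIGHT", 3, 5, ante="_", post="T"))    # True
print(environment_holds("KNIGHT", 3, 5, ante="_-I", post="T"))  # False
print(environment_holds("KNIGHT", 0, 1, ante="#", post="N"))    # True
```

Comma-separated alternatives become a regex alternation, so a single rule can accept several environments at once.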

There are some special conditions that may appear in the two positions of the SPECIAL field of a rule. In the first position, the vee (V) means that one syllable (a vowel or vowel cluster along with its following consonant or consonant cluster) must come before the string in OLD. If the vee is replaced by a two (2), there must be at least two leading syllables. The second position of SPECIAL indicates whether, and in which of the two execution cycles, the rule should be activated.
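The leading-syllable condition can be sketched as a count of vowel-plus-consonant units in the part of the string that precedes OLD. This is a minimal illustration: the vowel inventory, the treatment of V as "at least one", and the names `leading_syllables` and `special_holds` are all assumptions, and the second position of SPECIAL is not modeled.

```python
import re

VOWELS = "AEIOUY"  # assumed vowel inventory
# A syllable as defined in the text: a vowel or vowel cluster followed by
# a consonant or consonant cluster.
SYLLABLE = re.compile(f"[{VOWELS}]+[^{VOWELS}]+")


def leading_syllables(s: str, start: int) -> int:
    """Count the syllables lying wholly before the OLD match at s[start:]."""
    return len(SYLLABLE.findall(s[:start]))


def special_holds(flag: str, s: str, start: int) -> bool:
    """Evaluate the first SPECIAL position: '' (none), 'V' (one), or '2' (two)."""
    required = {"": 0, "V": 1, "2": 2}[flag]
    return leading_syllables(s, start) >= required


# OLD = "SON", found at index 5 in ANDERSON and index 4 in JOHNSON.
print(special_holds("2", "ANDERSON", 5))  # True:  AND + ER
print(special_holds("V", "JOHNSON", 4))   # True:  OHN
print(special_holds("2", "JOHNSON", 4))   # False: only one leading syllable
```

Under this reading, a vowel cluster whose following consonant belongs to OLD itself, as in LEESON, does not count as a leading syllable.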

The Phondex system allows rules to be made as specific as desired. If there were a rule for each possible spelling of a name, the rule system would be rather inefficient; it would amount to indexing the name groups thus defined on every abnormal spelling. Generalizing the rules compromises full coverage, but the coding process becomes much more efficient. It is, of course, possible to make very specific rules for common names and more general ones for the rarer names. It seems that, with sufficient tuning, these rules could be capable of yielding a final grouping decision.

There are a number of processes by which the data described above interrelate in the production of name groups. The main process takes as input an actual spelling of a name as found in the data base. This form is then modified by the application of rules, producing the various coded forms. Figure 3 outlines the two loops involved in code generation. Each process modifying the data is represented by a light green rectangle. Rules tables, which are collections of identically structured data items, are represented by the yellow arrows. The normal form is the code that serves as input. In the outer loop, each code transition is identified by successive values of j and is represented by a green circle. In the inner loop, each rule, identified by the i value, is checked to see whether the OLD string and other conditions hold for the string at that stage.
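The two loops can be sketched in a few lines of Python. This is a simplified illustration, not the Phondex code: rules are reduced to (OLD, NEW) pairs applied unconditionally, the WHERE, ANTE, POST, and SPECIAL conditions are omitted, and the function name and sample rules are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]            # (OLD, NEW); all conditions omitted here
Table = Tuple[str, List[Rule]]    # (name of the resulting code, its rules)


def generate_codes(normal_form: str, tables: List[Table]) -> Dict[str, str]:
    """Run every rules table in order, keeping the code produced by each."""
    codes: Dict[str, str] = {}
    current = normal_form
    for label, rules in tables:          # outer loop: code transitions (j)
        for old, new in rules:           # inner loop: rules in the table (i)
            if old in current:           # the OLD string must be present
                current = current.replace(old, new)
        codes[label] = current           # output feeds the next transition
    return codes


# Invented rules for illustration only.
tables = [
    ("affix", [("SON", "")]),
    ("phonetic", [("PH", "F"), ("CK", "K")]),
    ("dialect", [("Y", "I")]),
    ("compressed", [("LL", "L")]),
]
print(generate_codes("PHILLYPSON", tables))
# {'affix': 'PHILLYP', 'phonetic': 'FILLYP', 'dialect': 'FILLIP', 'compressed': 'FILIP'}
```

Keeping every intermediate code, rather than only the last, reflects the point that any level's output may serve as a similarity code.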