String comparator

4.6 String comparator. Besides coding and tables there is a third idea. It requires the ability to take two spellings, compare them letter by letter, and measure the difference. The success of this method depends on how the differences are weighted. For example, a single metathesis (the switching of two adjacent letters) typically contributes much less weight than a second instance of a letter. The relative positions of letters on the keyboard might also contribute to the weight of the difference: adjacent letters (same hand position or finger choice) weigh much less than more distant letters (different finger choice and hand position). The analyst also chooses a threshold, the weight above which the accumulated difference becomes significant. The analyst may choose this method when errors arise from keyboard entry or when the letters are romanized. I have not seen any attempt to remedy the error that comes from the incorrect duplication of syllables, though it is not hard to think of an extension to the string comparator method that might help.
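The weighting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a published method: the function names, the numeric weights (0.3 for a metathesis, 0.5 for a substitution between neighbouring keys, 1.0 otherwise), the default threshold of 0.75, and the simplified QWERTY grid are all hypothetical choices made for the example.

```python
# Hypothetical sketch: a weighted letter-by-letter comparison with a threshold.
# All weights and the keyboard model are assumptions made for illustration.

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c)
           for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}


def substitution_weight(a, b, near=0.5, far=1.0):
    """Weight of replacing letter a with letter b."""
    if a == b:
        return 0.0
    pa, pb = KEY_POS.get(a), KEY_POS.get(b)
    # Keys at most one row and one column apart count as neighbours.
    # This ignores the physical stagger of a real keyboard.
    if pa is not None and pb is not None \
            and abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1:
        return near
    return far


def weighted_difference(s, t, transpose=0.3, indel=1.0):
    """Accumulated weighted difference between two spellings."""
    s, t = s.lower(), t.lower()
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j] + indel,      # letter missing from t
                d[i][j - 1] + indel,      # extra letter in t
                d[i - 1][j - 1] + substitution_weight(s[i - 1], t[j - 1]),
            )
            # Metathesis: two adjacent letters switched.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                best = min(best, d[i - 2][j - 2] + transpose)
            d[i][j] = best
    return d[m][n]


def same_name(s, t, threshold=0.75):
    """True when the accumulated difference stays at or below the threshold."""
    return weighted_difference(s, t) <= threshold
```

With these assumed weights, "smith" and "smiht" (one metathesis) fall under the threshold, while "smith" and "smitth" (a second instance of a letter) do not. In practice the weights and the threshold would have to be tuned on real data.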

The best-known string comparator is the Jaro string comparator. It is usually applied when actual spellings are compared, to neutralize the effects of typographical error, namely the metathesis and substitution of letters. Winkler improved it significantly by weighting such errors more heavily when they occur near the beginning of the string (cf. Winkler, 1990). He has shown it to be more effective than the standard Damerau-Levenshtein metric (Winkler, 1985, 1990). Belin (1993) judged it the best way to improve record linkage when name fields contain significant amounts of minor typographical error.
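For readers who want to see the comparator at work, here is a minimal Python sketch of the Jaro measure and of Winkler's adjustment in the form in which they are commonly published. It is not taken from the papers cited above. The function names are my own, and the prefix scale of 0.1 and the prefix length of 4 are the conventional defaults, assumed here. Note that this common form rewards agreement in the first few letters, which has the same effect as penalizing errors near the beginning.

```python
# Sketch of the commonly published Jaro and Jaro-Winkler similarity.
# Scores run from 0.0 (nothing in common) to 1.0 (identical).

def jaro(s, t):
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Letters match only if they are equal and not too far apart.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_matched = [False] * ls
    t_matched = [False] * lt
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched letters that appear in a different order count as
    # half a transposition each.
    s_seq = [s[i] for i in range(ls) if s_matched[i]]
    t_seq = [t[j] for j in range(lt) if t_matched[j]]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (matches / ls + matches / lt
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    score = jaro(s, t)
    # Count agreeing letters at the start, up to max_prefix.
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The pair MARTHA and MARHTA differs by a single metathesis, so the Jaro score is already high, and the shared prefix "MAR" raises it further under Winkler's adjustment.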
