Appendix A: RECORD LINKAGE ENTITIES

Search and retrieval fall within the domain of probabilistic record linkage (Despain, 2003). This technology depends critically on avoiding the “comparison of apples to oranges.” When we compare the data in two records of the same entity, it is critical that the fields correspond, so we must classify the contents correctly. The degree to which the classification and form of the data harmonize directly affects the effectiveness of the comparison. In general, when a person's name is spelled the same way on both of the records we compare, the name will be recognized as the same. Record linkage has two measures of effectiveness. The first is recall, the share of true matches that are found; it improves when all the spellings of a name are the same. The second is precision, the share of linked pairs that are true matches; it improves when we use an identifier that differs between individuals. Typically we are able to say that comparing certain name fields is more effective than comparing others. The record linkage systems we have used in the past code names so that differences in actual spelling are, to some degree, neutralized.
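
The effect of name coding on recall and precision can be sketched in a few lines of Python. The sketch below is an illustration only: the source does not say which name code the earlier systems used, so American Soundex is swapped in as a familiar stand-in, and the five records, the field names `person` and `surname`, and the helper `evaluate` are all hypothetical.

```python
from itertools import combinations

# Soundex digit groups. Soundex is an assumed stand-in for the
# unspecified name code mentioned in the text.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")  # vowels and y map to the empty code
        if code and code != prev:
            out.append(code)
        prev = code
    return (letters[0].upper() + "".join(out) + "000")[:4]


def evaluate(records, key):
    """Link every pair whose keyed surnames agree; return (recall, precision)."""
    true_pairs = linked = correct = 0
    for a, b in combinations(records, 2):
        same_person = a["person"] == b["person"]
        is_linked = key(a["surname"]) == key(b["surname"])
        true_pairs += same_person
        linked += is_linked
        correct += same_person and is_linked
    recall = correct / true_pairs if true_pairs else 0.0
    precision = correct / linked if linked else 0.0
    return recall, precision


# Hypothetical toy records: "person" is the true identity.
RECORDS = [
    {"person": 1, "surname": "Smith"},
    {"person": 1, "surname": "Smyth"},
    {"person": 2, "surname": "Johnson"},
    {"person": 2, "surname": "Johnson"},
    {"person": 3, "surname": "Jonson"},
]

print("exact spelling:", evaluate(RECORDS, str.lower))
print("coded names:   ", evaluate(RECORDS, soundex))
```

On this toy data, exact spelling yields a recall of 0.5 and a precision of 1.0, while the coded comparison yields a recall of 1.0 and a precision of 0.5. The numbers mean nothing beyond the example, but they show the direction of the trade: coding recovers the Smith/Smyth match at the cost of linking Johnson to Jonson.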

We have tried different codes for the names of localities, but with limited success. The reason seems to be that the linkage records available to us had normally already been partitioned and sampled according to the standardized locality data in them, so any records whose localities had been handled incorrectly would not have been seen. It is also possible to use codes for dates (cf. Despain, op. cit., §1-6). In general, classifying the data increases comparability, and providing a standard code yields a field with increased reliability. Standardizing also increases the coincidence value of the field, making it less precise. However, it appears that the increase in reliability more than compensates for any loss in precision.
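
The trade-off between reliability and coincidence can be made concrete with a small Python sketch. It rests on an assumption: "coincidence value" is read here as the chance that two records drawn independently agree on a field, which may not be the exact definition in Despain. The locality strings, the `STANDARD` table, and the function name `coincidence` are hypothetical.

```python
from collections import Counter


def coincidence(values):
    """Chance that two independent random draws from `values` agree.

    Assumed reading of "coincidence value": the sum of squared
    relative frequencies of the distinct field values.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical raw locality strings as they might appear on records.
RAW = ["St. Louis", "Saint Louis", "St Louis",
       "Boston", "Boston, Mass.", "Albany"]

# Hypothetical standardization table mapping each variant to one code.
STANDARD = {
    "St. Louis": "STLOUIS", "Saint Louis": "STLOUIS", "St Louis": "STLOUIS",
    "Boston": "BOSTON", "Boston, Mass.": "BOSTON",
    "Albany": "ALBANY",
}

coded = [STANDARD[v] for v in RAW]

print("raw coincidence:  ", round(coincidence(RAW), 3))
print("coded coincidence:", round(coincidence(coded), 3))
```

With six distinct raw strings the chance of agreement is 1/6, about 0.17; after coding it rises to 14/36, about 0.39. Agreement on the coded field is therefore weaker evidence of a match, while variant spellings of the same locality now agree, which is the gain in reliability described above.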