Importance of Field Data Dependence

Chapter 5: IMPORTANCE OF FIELD DATA DEPENDENCE

Outcome probabilities
Accounting for presence dependence
Accounting for value dependence
Value specificity & co-dependence

In the process of record linkage it is important to compare the values in the fields of the query to the values in the corresponding fields of the record being compared. This chapter and the next address in detail the problems arising from the fact that there is sometimes a dependence between the values in the various fields within either of these records. We mentioned earlier (cf. ¶ 4-3.1) that we do not have a model for the frequency density distribution of the outcome weights. This is one place where it is important to take dependence into account: doing so allows us to estimate the probability that a particular comparison weight will occur for a pair of records, whether matched or unmatched. Another place is in actually calculating the record comparison weight. The principle of adding together the agreement, disagreement (and missing) weights of the fields of the record to get a record comparison weight (cf. ¶ 4-1.5) depends critically on the assumption that the probabilities represented are independent (cf. ¶ 2-2.5) and that we may therefore multiply their corresponding odds together to get a value proportional to the probability of the combination occurring. And yet we will see that there are situations where relying on the assumption that the field values are independent will mislead us into accepting incorrect record comparison weights.
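
The link between adding weights and multiplying odds can be illustrated with a minimal sketch. It assumes that a field weight is the base-2 logarithm of the ratio between the outcome's probability in matched pairs and its probability in unmatched pairs; that definition, the function name field_weight, and the sample probabilities are assumptions made for illustration only and are not taken from this chapter.

```python
import math


def field_weight(p_matched: float, p_unmatched: float) -> float:
    """Weight of one field outcome, assumed here to be log2 of its odds."""
    return math.log2(p_matched / p_unmatched)


# Hypothetical (matched, unmatched) probabilities for the outcomes
# observed in three fields of one record comparison.
outcomes = [(0.95, 0.10), (0.90, 0.02), (0.05, 0.60)]

# Record comparison weight: the sum of the field weights.
record_weight = sum(field_weight(m, u) for m, u in outcomes)

# If the fields are independent, the joint odds are the product of the
# per-field odds, so the summed weight equals the log of that product.
joint_odds = math.prod(m / u for m, u in outcomes)

print(record_weight, math.log2(joint_odds))
```

The two printed values coincide only because the joint odds were computed as a plain product. When the fields are dependent, the true joint odds differ from that product, and the summed weight no longer represents them.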

Outcome probabilities. Suppose that the reliability of the fields we decide to weight on is 1.00. This means that the comparison weight for a matched pair would consist solely of agreement weights. On the other hand, suppose the reliability of the fields is 0.00. In this case the comparison weight for the matched pair would consist solely of disagreement weights. As it happens, the reliability falls somewhere in between. Now suppose that every field that we decided to compare always had data present. This would make it possible to make a good estimate of the probability that a given comparison weight occurred in a matched pair by simply using the reliability of the fields (r_i). The same argument for using the reliability as an estimate of the probability that a particular comparison weight will occur in a matched pair holds for using the coincidence (c_i) as an estimate of the probability that the particular comparison weight will occur in an unmatched pair.
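
As a rough sketch of the estimate described above, the following assumes that every field always has data present and that the fields are independent, so that the probability of a particular pattern of agreements and disagreements is a simple product. The function name outcome_probability and the values chosen for r_i and c_i are hypothetical.

```python
import math
from itertools import product


def outcome_probability(pattern, agree_probs):
    """Probability of one agree/disagree pattern, assuming independent
    fields that always have data present."""
    return math.prod(
        p if agrees else 1.0 - p for agrees, p in zip(pattern, agree_probs)
    )


reliability = [0.95, 0.90, 0.85]  # hypothetical r_i (matched pairs)
coincidence = [0.10, 0.02, 0.30]  # hypothetical c_i (unmatched pairs)

# Enumerate every pattern of agreement (True) and disagreement (False).
for pattern in product([True, False], repeat=len(reliability)):
    matched = outcome_probability(pattern, reliability)
    unmatched = outcome_probability(pattern, coincidence)
    print(pattern, round(matched, 5), round(unmatched, 5))
```

Each pattern corresponds to one comparison weight, so the two columns approximate how often that weight would occur among matched and among unmatched pairs under these simplifying assumptions.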

Besides reliability and coincidence there are at least two additional factors that influence the outcome probability of a given comparison weight. One factor is the presence of data in a field (p_i). We need to account for the fact that the presence of data in one particular field may well depend on (or affect) whether there is data present in some other field. Expecting to have a value in one field of a record may not make sense whenever there is no value in some other field. We say then that the first field is dependent on the presence of data in the second; there is presence dependence between the two fields. For example, the day of the month is seldom present unless the month is also present, and the month of the year is seldom present without the year also being given.

A second factor is value dependence, i.e., when there is data present in the fields of a record, the value of the data in one field may depend on or affect what the value may be in another field. Agreement when the values are compared can then depend on what the specific values in the other field of the record may be. The relation between a name standard or code and the actual spelling of the name is an example of this phenomenon. The actual spelling field in two records won't agree unless the name code also agrees. However, the reverse does not hold. The power of a name code is that it may agree when the actual spelling disagrees in a certain way. Disagreement in name codes implies disagreement in actual spelling. The possible combinations of agreement and disagreement among such related fields are therefore not free and independent.

Accounting for presence dependence. The presence of the various fields in a date is clearly dependent. As an example consider the death date field in a database of records from Akershus, Norway. The day (d) appears 9.03% of the time, the month (m) appears 9.32% of the time, and the year (y) appears 10.9% of the time. The probability of data being present in a field in both of the records being compared is the square of the corresponding value. Since it is outcome probabilities for comparisons that we are interested in, we make these squared values the respective p values.
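
The effect of presence dependence on these figures can be sketched numerically. The presence rates below are the ones quoted above; the dependent case rests on a simplifying assumption made only for illustration, namely that the day never appears without the month and the year, so that a complete date is present exactly when the day is present.

```python
# Per-record presence rates for the death date fields, as quoted above.
present = {"day": 0.0903, "month": 0.0932, "year": 0.109}

# p values: probability that the field is present in both records compared.
p = {field: rate ** 2 for field, rate in present.items()}

# Naive estimate: treat the three fields as independent within a record.
naive_record = present["day"] * present["month"] * present["year"]
naive_pair = naive_record ** 2

# Illustrative assumption: day implies month and year, so a complete
# date is present whenever the day is present.
dependent_record = present["day"]
dependent_pair = dependent_record ** 2

# How far the naive estimate understates a complete date in both records.
ratio = dependent_pair / naive_pair

print(p)
print(naive_pair, dependent_pair, ratio)
```

Under that assumption the independent estimate understates the chance of a complete date in both records by a factor in the thousands.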

It is clear that not accounting for presence dependence can give outcome probabilities that are wildly incorrect. This means that an adjustment of presence within a record would need to be made. When multiple fields are dependent, such an adjustment must be algorithmically generalized across all the fields involved. It is also possible that fields in multiple records being compared exhibit field presence dependence. Similarly when multiple fields are being compared, this adjustment, too, may be algorithmically generalized.

Accounting for value dependence. The value in a particular field may not be free to vary and instead may depend on the value in some other field. One of the ways that such dependence makes itself most apparent is when a weighting field shows only agreement. For example, suppose the investigator chooses to block on the Principal’s Given Name Code. The code is derived from the actual spelling of the name. The custom in many countries is to choose a name for an individual according to the sex of the child. For example, “Ann” is a given name for girls and “William” is a given name for boys. Now what happens if we weight on the Principal’s Sex? Every record in the block already has the same name code. Since the name was chosen according to the sex of the individual, then weighting on Principal’s Sex simply adds a particular additional weight to every record in the block. However, if the same code were given to “Harry” and “Harriet,” for example, then weighting on Principal’s Sex could be helpful.
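
A small sketch of the blocking example: if every record in the block shares the query's name code, and that code determines sex, then the sex weight is the same constant for every candidate and cannot change their relative order. The records, the name code "WLM", the weight values, and the function rank are all hypothetical.

```python
# Hypothetical block: every candidate shares the query's given name code,
# and (by assumption) that code determines the sex.
query = {"name_code": "WLM", "sex": "M"}
block = [
    {"id": 1, "sex": "M", "other_weight": 4.2},
    {"id": 2, "sex": "M", "other_weight": -1.3},
    {"id": 3, "sex": "M", "other_weight": 7.8},
]
SEX_AGREEMENT_WEIGHT = 1.0  # hypothetical agreement weight for sex


def rank(records, use_sex):
    """Return record ids ordered from highest to lowest total weight."""
    def total(record):
        agrees = record["sex"] == query["sex"]
        bonus = SEX_AGREEMENT_WEIGHT if use_sex and agrees else 0.0
        return record["other_weight"] + bonus
    return [r["id"] for r in sorted(records, key=total, reverse=True)]


print(rank(block, use_sex=False))
print(rank(block, use_sex=True))
```

Both calls give the same ordering, because the sex weight shifts every total by the same amount. If one name code covered names of both sexes, as with "Harry" and "Harriet," the bonus would differ between candidates and the ordering could change.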

As with presence dependence, it is equally clear that not accounting for value dependence can give outcome probabilities that are wildly incorrect. Sometimes there are rather specific conditions that signal such a dependence. This means that an adjustment to the outcome probabilities needs to be made. Sometimes the dependence involves non-uniform partitioning of the entity by its values. It is also possible that field values may be standardized in a non-uniform way. One kind of value dependence may be termed constructive in its nature.

As with presence dependence, the probabilities by which the outcomes are estimated may be adjusted. This involves an algorithm for adjusting coincidence and one for adjusting reliability. These algorithms may also be generalized.

There are several varieties of value dependence that are easily exemplified. One is of a relatively innocuous variety, another is significantly more important, and a third should not be ignored.

Value specificity & co-dependence. There is little if any dependence on another field’s values for the values in a given name field. In some cultures there is a preponderant tendency for the oldest son to have the same name as the father’s father. Yet both names do not usually appear in one record so as to be applied as individual identifiers. However, the chances that there will be agreement in unmatched comparisons depend greatly on what the value of the name is. A common name will have a higher coincidence value than a rare name, i.e., agreement in a common name is not as coincidental as in a rare name.
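
Value specificity can be illustrated with a minimal sketch in which the coincidence for agreement on a particular name is approximated by that name's relative frequency in the file. That approximation, the log2 form of the weight, the name counts, and the reliability value are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical given names drawn from a file of 100 records.
names = ["John"] * 60 + ["William"] * 30 + ["Ann"] * 9 + ["Ebenezer"] * 1
freq = Counter(names)
total = sum(freq.values())

RELIABILITY = 0.95  # hypothetical r for the given name field


def coincidence(name: str) -> float:
    """Assumed value-specific coincidence: the name's relative frequency."""
    return freq[name] / total


def agreement_weight(name: str) -> float:
    """Assumed value-specific agreement weight, log2(r / c)."""
    return math.log2(RELIABILITY / coincidence(name))


for name in freq:
    print(name, coincidence(name), round(agreement_weight(name), 2))
```

In this sketch the rare name earns a much larger agreement weight than the common one, which is the sense in which agreement on a common name is less informative.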

Works of Wonder | Science of Genealogy