Section 6-2 CALCULATIONS FOR DEATH DATE EXAMPLE



It will be instructive to take our example of fields in the death date and calculate the various weights and their probabilities of occurrence. This set of fields is not entirely typical, since their low presence values will result in very small probabilities. However, if we were to use the birth data, we would not be able to illustrate all the combinations because the birth year is always present. We will be adjusting presence, but using reliability and coincidence as measured. Value dependency involves a rather straightforward adjustment to these values, which we will cover in §6-3.

Comparison vectors.   In what follows, the year is field one, the month is field two, and the day is field three. One problem we have not yet addressed is a suitable way to refer to the subscripts k in Rk and Ck. Each k is a different combination of agreement, disagreement, and missing. It is customary to refer to such combinations as vectors. A vector in this sense is simply a one-dimensional array of values. In this case the values are ternary, i.e., each takes one of three possible values: [Agreement, Disagreement, Missing]. We need to adjust the presence values when the data is not missing. For three fields there are eight combinations (vectors) when each field is either present or absent:

1 = [   ], i.e., no fields present
2 = [ 1 ], i.e., only year present
3 = [ 2 ], i.e., only month present
4 = [ 3 ], i.e., only day present
5 = [ 1, 2 ], i.e., year and month present
6 = [ 1, 3 ], i.e., year and day present
7 = [ 2, 3 ], i.e., month and day present
8 = [ 1, 2, 3 ], i.e., year, month, and day all present

As pointed out in §5-2.4, there is a presence dependence, so the adjusted presence is non-zero for only four of these: 1 = [   ] (no fields present), 2 = [ 1 ] (only year present), 5 = [ 1, 2 ] (year and month present), and 8 = [ 1, 2, 3 ] (all three date fields present). We also explained in §5-2.4 that when we compare two records, the possible comparisons fall into four cases, which we refer to as ap0, ap1, ap2, and ap3. Table 1 includes these values as part of the raw data needed before beginning the probability calculations.
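The numbering of the eight presence combinations, and the four that survive presence dependence, can be reproduced with a short sketch. The function names are hypothetical, and the dependence rule (month requires year, day requires month) is an assumption read from the four combinations listed above rather than a statement of §5-2.4.

```python
from itertools import combinations

FIELDS = (1, 2, 3)  # 1 = year, 2 = month, 3 = day

def presence_combinations(fields=FIELDS):
    """Return every subset of fields that could be present, smallest first."""
    combos = []
    for size in range(len(fields) + 1):
        combos.extend(combinations(fields, size))
    return combos

def respects_presence_dependence(combo):
    """Assumed rule: a valid combination is a prefix of (year, month, day)."""
    return combo == FIELDS[:len(combo)]

all_combos = presence_combinations()
# Number the combinations 1..8 in the same order as the text.
numbered = {i + 1: combo for i, combo in enumerate(all_combos)}
valid = {n: combo for n, combo in numbered.items()
         if respects_presence_dependence(combo)}

for n, combo in numbered.items():
    marker = "kept" if n in valid else "zero adjusted presence"
    print(n, list(combo), marker)
```

Generating subsets by increasing size happens to reproduce the numbering used in the text, so the surviving combinations come out as 1, 2, 5, and 8.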
Case   Comparisons             Adjusted Presence, Eq. (4.1)   Value
0      1:1   1:2   1:5   1:8   ap0 = 1 – p1²                  1.0000 – 0.0119 = 0.9881
1      2:2   2:5   2:8         ap1 = p1² – p2²                0.0119 – 0.0087 = 0.0032
2      5:5   5:8               ap2 = p2² – p3²                0.0087 – 0.0082 = 0.0005
3      8:8                     ap3 = p3²                      0.0082

Reliability                            Coincidence
Agreement      Disagreement            Agreement      Disagreement
r1 = 0.9506    1 – r1 = 0.0494         c1 = 0.0040    1 – c1 = 0.9960
r2 = 0.9063    1 – r2 = 0.0937         c2 = 0.0178    1 – c2 = 0.9822
r3 = 0.8525    1 – r3 = 0.1475         c3 = 0.0326    1 – c3 = 0.9674
Table 1 — Preliminary Calculations
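The adjusted presence column of Table 1 is a chain of differences, which the following sketch reproduces. It assumes that 0.0119, 0.0087, and 0.0082 are the probabilities that the year, month, and day are present in both records of a pair; the function name `adjusted_presence` is hypothetical.

```python
# Joint presence values from Table 1 (assumed: field present in both records).
both_year = 0.0119
both_month = 0.0087
both_day = 0.0082

def adjusted_presence(joint):
    """Return [ap0, ap1, ..., apn] from joint presence values.

    `joint` is assumed to be non-increasing, which is what presence
    dependence implies: each later field is present only if the
    earlier ones are.
    """
    bounds = [1.0] + list(joint) + [0.0]
    return [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]

ap = adjusted_presence([both_year, both_month, both_day])
for case, value in enumerate(ap):
    print(f"ap{case} = {value:.4f}")
print(f"sum = {sum(ap):.4f}")
```

Because the four cases are mutually exclusive and exhaustive, the values should sum to 1, which is a convenient check on the table.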

Calculating all possible weights.   In the first stage we would use equations 4.4 and 4.5 to calculate the weights (x) for each combination of fields compared (k). The next step is to use equation 6.4 to calculate P( xk | M ), the probability that each comparison weight is observed for a matched pair. Then comes equation 6.5 to calculate the analogous probability for an unmatched pair, P( xk | U ).

Case 0: data missing.

The first adjusted presence value, ap0 = 0.9881, tells how often the death date fails to contribute anything to the total weight. Since no field is present in both records, no comparison is possible and the weight (x0 or cw) is 0.

Case 1: only year present.

The next case is case 1. Here only the year allows comparison. There are two (2¹) "combinations," one of agreement and one of disagreement. The most straightforward way to characterize the subscripts for the various combinations (k) seems to be to use plus (+) for agreement and minus ( – ) for disagreement. (Dependence of presence allows us to omit a symbol (0) for missing.) When the year agrees, we use equation 3.4, and when it disagrees we use equation 3.6:

$$x_{+} = \log_2(r_1 / c_1) = \log_2(0.9506 / 0.0040) = \log_2(237.7) = +7.89$$

$$x_{-} = \log_2[(1 - r_1) / (1 - c_1)] = \log_2(0.0494 / 0.9960) = \log_2(0.0496) = -4.33$$
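As a check on the arithmetic for case 1, here is a minimal sketch of the two single-field weights. The function names are hypothetical; the formulas are the log-ratio forms used in the two equations above.

```python
import math

def agreement_weight(r, c):
    """Weight for an agreeing field: log2 of reliability over coincidence."""
    return math.log2(r / c)

def disagreement_weight(r, c):
    """Weight for a disagreeing field: log2 of (1 - r) over (1 - c)."""
    return math.log2((1 - r) / (1 - c))

r1, c1 = 0.9506, 0.0040  # year, from Table 1

x_plus = agreement_weight(r1, c1)
x_minus = disagreement_weight(r1, c1)
print(f"x+ = {x_plus:+.2f}")
print(f"x- = {x_minus:+.2f}")
```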

Case 2: year & month present.

In case 2 the year and month are both present in the two records and can be compared, while the day is not. There are four (2²) combinations of agreement (+) and disagreement ( – ). When both agree, we can use equation 3.4, multiplying the ratios for the two fields:

$$x_{++} = \log_2[(r_1 / c_1)(r_2 / c_2)] = \log_2(237.7 \times 0.9063 / 0.0178) = \log_2(12100) = +13.56$$

When the first agrees and the second disagrees, we have (with 1 – r2 = 1 – 0.9063 = 0.0937):

$$x_{+-} = \log_2\{(r_1 / c_1)[(1 - r_2) / (1 - c_2)]\} = \log_2(237.7 \times 0.0937 / 0.9822) = \log_2(22.68) = +4.50$$

When the first disagrees and the second agrees, we have:

$$x_{-+} = \log_2\{[(1 - r_1) / (1 - c_1)](r_2 / c_2)\} = \log_2(0.0496 \times 0.9063 / 0.0178) = \log_2(2.525) = +1.34$$

When they both disagree, we have:

$$x_{--} = \log_2\{[(1 - r_1) / (1 - c_1)][(1 - r_2) / (1 - c_2)]\} = \log_2(0.0496 \times 0.0937 / 0.9822) = \log_2(0.0047) = -7.73$$
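Since the logarithm of a product is the sum of the logarithms, each case 2 weight is the sum of one year term and one month term. The sketch below computes the four weights that way. The names `field_weight` and `pattern_weight` are hypothetical, and 1 – r2 is computed directly from r2 = 0.9063. Results can differ from the hand calculations in the last digit because the text rounds intermediate ratios.

```python
import math

# Reliability and coincidence for year (1) and month (2), from Table 1.
R = {1: 0.9506, 2: 0.9063}
C = {1: 0.0040, 2: 0.0178}

def field_weight(field, agrees):
    """Log-ratio weight contributed by one field."""
    if agrees:
        return math.log2(R[field] / C[field])
    return math.log2((1 - R[field]) / (1 - C[field]))

def pattern_weight(pattern):
    """Total weight for a pattern such as '+-' (year first, then month)."""
    return sum(field_weight(i + 1, symbol == "+")
               for i, symbol in enumerate(pattern))

for pattern in ("++", "+-", "-+", "--"):
    print(pattern, f"{pattern_weight(pattern):+.2f}")
```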

Case 3: year, month, & day present.

In case 3 all three fields have data present, and there are eight (2³) equations to work out. Together with the two weights of case 1 and the four of case 2, this makes a total of 14 possible weights.
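The eight case 3 weights follow the same pattern as cases 1 and 2, so they can be enumerated rather than written out by hand. This is a sketch under the assumption that each pattern weight is the sum of the per-field log ratios, as in the earlier cases; the helper names are hypothetical.

```python
import math
from itertools import product

R = (0.9506, 0.9063, 0.8525)  # reliability: year, month, day (Table 1)
C = (0.0040, 0.0178, 0.0326)  # coincidence: year, month, day (Table 1)

def pattern_weight(pattern):
    """Sum the log-ratio weights for a tuple of booleans (True = agree)."""
    total = 0.0
    for i, agrees in enumerate(pattern):
        if agrees:
            total += math.log2(R[i] / C[i])
        else:
            total += math.log2((1 - R[i]) / (1 - C[i]))
    return total

def case_weights(case):
    """All 2**case agreement patterns over the first `case` fields."""
    return {
        "".join("+" if a else "-" for a in pattern): pattern_weight(pattern)
        for pattern in product((True, False), repeat=case)
    }

weights = {case: case_weights(case) for case in (1, 2, 3)}
total = sum(len(w) for w in weights.values())
print("number of weights:", total)
for label, x in weights[3].items():
    print(label, f"{x:+.2f}")
```

Counting the dictionary entries across the three cases gives 2 + 4 + 8 = 14, the total stated above.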

Comparison weight probabilities.   Each of the 14 comparison weights above occurs with a particular probability in matched pairs. These are calculated as the y or P( x=X | M ) values in equation 6.4. The sum of these probabilities is the probability that the death date contributes to the comparison weight in matched pairs: Σk P( xk=X | M ) = 1 – 0.9881 = 0.0119. Each of the 14 comparison weights also has a probability of occurrence in unmatched pairs. These are the z or P( x=X | U ) values given by equation 6.5.
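Equations 6.4 and 6.5 are not reproduced in this section, so the following is only a sketch under an assumed form: the probability of a pattern is the adjusted presence of its case multiplied by r or 1 – r for each field (matched pairs), or by c or 1 – c (unmatched pairs). That assumption is at least consistent with the stated total of 0.0119. All names are hypothetical.

```python
from itertools import product

R = (0.9506, 0.9063, 0.8525)  # reliability: year, month, day (Table 1)
C = (0.0040, 0.0178, 0.0326)  # coincidence: year, month, day (Table 1)
AP = {1: 0.0032, 2: 0.0005, 3: 0.0082}  # adjusted presence for cases 1 to 3

def pattern_probability(pattern, rates, ap):
    """Assumed form: ap times rate (agree) or 1 - rate (disagree) per field."""
    prob = ap
    for agrees, rate in zip(pattern, rates):
        prob *= rate if agrees else 1 - rate
    return prob

y = {}  # assumed P(x = X | M) per pattern
z = {}  # assumed P(x = X | U) per pattern
for case, ap in AP.items():
    for pattern in product((True, False), repeat=case):
        label = "".join("+" if a else "-" for a in pattern)
        y[label] = pattern_probability(pattern, R, ap)
        z[label] = pattern_probability(pattern, C, ap)

print("patterns:", len(y))
print("sum of y:", round(sum(y.values()), 4))
print("sum of z:", round(sum(z.values()), 4))
```

Within each case the agreement and disagreement factors sum to 1 across patterns, so both totals reduce to ap1 + ap2 + ap3 regardless of the r and c values.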