Section 5-2 ACCOUNTING FOR PRESENCE DEPENDENCE



Assumption of independence.   If we assumed that the fields were independent, we would use these probabilities to calculate the expected proportion of every combination of fields. In Table 1, however, the combinations marked with an asterisk do not actually occur, so their presence should be zero. It seems safe to assume that a day of month does not occur without a month of year, and that a month of year does not occur without a year. Exceptions may exist in real data, but they would be rare. This observation allows us to adjust the presence for the combinations that do occur.

| Death Date Combination of Fields | Calculation of Occurrence When Fields Are Independent | Presence Assuming Independence | Calculation of Occurrence When Fields Are Fully Dependent | Presence Assuming Full Dependence |
|---|---|---|---|---|
| [ ] | ( 1 – d ) × ( 1 – m ) × ( 1 – y ) | 0.7350 | 1 – y | 0.8910 |
| [ D ]* | ( 1 – m ) × ( 1 – y ) | 0.0730 | | |
| [ M ]* | ( 1 – d ) × ( 1 – y ) | 0.0755 | | |
| [ Y ] | ( 1 – d ) × ( 1 – m ) | 0.0899 | y – m | 0.0158 |
| [ D, M ]* | d × m | 0.0084 | | |
| [ D, Y ]* | d × y | 0.0098 | | |
| [ M, Y ] | m × y | 0.0102 | m – d | 0.0029 |
| [ D, M, Y ] | d × m × y | 0.0009 | d | 0.0903 |
Table 1 — Estimating Presence of Death Date Fields
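
The contrast between the two assumptions can be checked with a little arithmetic. The following is a minimal sketch, not part of the original analysis: the values of d, m and y are an assumption, back-calculated from the full dependence column of Table 1 (d = 0.0903, m – d = 0.0029, y – m = 0.0158), and the variable names are illustrative only.

```python
# Presence of the day (d), month (m) and year (y) of death.
# ASSUMPTION: these values are back-calculated from the "full dependence"
# column of Table 1; they are not stated directly in the text.
d, m, y = 0.0903, 0.0932, 0.1090

# Under independence, an entirely empty death date has probability
# (1 - d)(1 - m)(1 - y).
empty_if_independent = (1 - d) * (1 - m) * (1 - y)

# Under full dependence (day needs month, month needs year) only four
# combinations can occur, and each presence is a simple difference.
fully_dependent = {
    "[ ]": 1 - y,
    "[ Y ]": y - m,
    "[ M, Y ]": m - d,
    "[ D, M, Y ]": d,
}

print(f"[ ] assuming independence: {empty_if_independent:.4f}")
for combination, presence in fully_dependent.items():
    print(f"{combination:<12} {presence:.4f}")
print(f"total: {sum(fully_dependent.values()):.4f}")
```

Under full dependence the four valid combinations account for all of the probability, whereas the independence calculation assigns a noticeably smaller share to the empty death date because it spreads probability over combinations that do not occur.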

Adjusting presence within a record.   Because of presence dependence, we cannot estimate the probability that a field occurs directly from that field's presence value. Instead, we must adjust the presence value according to which of the fields it depends on are also present. Suppose the fields are fully dependent, so that field 3 depends on field 2, which in turn depends on field 1. For any combination of fields we must then ask whether the fields can actually occur together: if a field is present, say field 3, but a field it depends on, say 2 or 1, is absent, the combination is impossible. It simplifies the notation to subscript each combination by the number of fields it contains, so that n refers to the combination [ 1, 2, . . . , n ] and also gives the length of the dependency chain. The adjusted presence for combination k is $ap_k$, where k is the number of fields present in that combination.
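
The test for an impossible combination can be sketched in a few lines. This is an illustration under the stated assumption of a single dependency chain in which field i depends on field i – 1; the function names `is_valid` and `valid_combinations` are hypothetical and do not come from the text.

```python
from itertools import combinations


def is_valid(present):
    """Return True if the present fields form an unbroken chain 1..k.

    Field i is assumed to depend on field i - 1, so a combination can occur
    only if it is exactly {1, 2, ..., k} for some k (k = 0 is the empty one).
    """
    present = set(present)
    return present == set(range(1, len(present) + 1))


def valid_combinations(n):
    """List the combinations of fields 1..n that can actually occur."""
    fields = range(1, n + 1)
    every = [c for size in range(n + 1) for c in combinations(fields, size)]
    return [c for c in every if is_valid(c)]


print(valid_combinations(3))
```

For a chain of three fields this keeps 4 of the 8 possible combinations, and in general n + 1 of the 2^n combinations survive.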

Generalizing the adjustment algorithm within a record.   By induction we may determine an adjusted presence value ($ap_k$) for each of the four valid combinations of three presence-dependent fields, as follows:
[ 1, 2, 3 ]   $ap_3 = p_3$
[ 1, 2 ]   $ap_2 = p_2 - p_3$
[ 1 ]   $ap_1 = p_1 - p_2$
[   ]   $ap_0 = 1 - p_1$
The generalization of this algorithm allows us to calculate the adjusted presence values for combinations of fields in dependency chains of arbitrary length (n). In equation 5.1 the index i equals k, with the conventions $p_0 = 1$ and $p_{n+1} = 0$ covering the two ends of the chain.
$$ ap_k = p_i - p_{i+1} \tag{5.1} $$
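
Equation 5.1 can be sketched as a short function for a chain of any length. This is a minimal illustration rather than a definitive implementation: the name `adjusted_presence` is hypothetical, and the check that presence values never increase along the chain is an added assumption (a dependent field cannot be present more often than the field it depends on).

```python
def adjusted_presence(p):
    """Apply equation 5.1 to a dependency chain.

    p is [p_1, ..., p_n], where field i depends on field i - 1.
    Returns [ap_0, ..., ap_n].
    """
    padded = [1.0] + list(p) + [0.0]  # conventions p_0 = 1 and p_{n+1} = 0
    # ASSUMPTION (not stated in the text): presence must not increase
    # along the chain, otherwise an adjusted presence would be negative.
    if any(a < b for a, b in zip(padded, padded[1:])):
        raise ValueError("presence values must not increase along the chain")
    return [padded[k] - padded[k + 1] for k in range(len(p) + 1)]


# Death date example: field 1 = year, field 2 = month, field 3 = day.
# ASSUMPTION: values back-calculated from Table 1.
for k, ap in enumerate(adjusted_presence([0.1090, 0.0932, 0.0903])):
    print(f"ap_{k} = {ap:.4f}")
```

Because the differences telescope, the adjusted presence values always sum to 1, which is a convenient sanity check on the input.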
Field presence dependence between two records.   Comparing two records involves a judgement of agreement or disagreement only for fields that are present in both records. Of all the possible pairs of field combinations, therefore, some admit a comparison even when the two combinations are not identical, and others admit none. Table 2 lists every pairing for the death date example. Pairings marked with the same number are equivalent in the sense that the same fields can be compared. Every pairing must be counted as an outcome, even where missing data rules out comparison. Adding the equivalent pairings together gives a simple calculation for each comparison.
| Death Date Comparison, First Record | Death Date Comparison, Second Record | Group | Calculation of Occurrence of Each Combination | Presence Assuming Alone | Calculation of Occurrence In Pairwise Comparison | Presence of Comparison Assuming Pairs |
|---|---|---|---|---|---|---|
| [   ] | [   ] | 1 | ( 1 – y ) × ( 1 – y ) | 0.7939 | (1)   1 – y² | 0.9881 |
| [   ] | [ Y ] | 1 | ( 1 – y ) × ( y – m ) | 0.0141 | | |
| [   ] | [ M, Y ] | 1 | ( 1 – y ) × ( m – d ) | 0.0026 | | |
| [   ] | [ D, M, Y ] | 1 | ( 1 – y ) × d | 0.0805 | | |
| [ Y ] | [   ] | 1 | ( y – m ) × ( 1 – y ) | 0.0141 | | |
| [ Y ] | [ Y ] | 2 | ( y – m ) × ( y – m ) | 0.0003 | (2)   y² – m² | 0.0032 |
| [ Y ] | [ M, Y ] | 2 | ( y – m ) × ( m – d ) | 0.0001 | | |
| [ Y ] | [ D, M, Y ] | 2 | ( y – m ) × d | 0.0014 | | |
| [ M, Y ] | [   ] | 1 | ( m – d ) × ( 1 – y ) | 0.0026 | | |
| [ M, Y ] | [ Y ] | 2 | ( m – d ) × ( y – m ) | 0.0001 | | |
| [ M, Y ] | [ M, Y ] | 3 | ( m – d ) × ( m – d ) | 0.0000 | (3)   m² – d² | 0.0005 |
| [ M, Y ] | [ D, M, Y ] | 3 | ( m – d ) × d | 0.0003 | | |
| [ D, M, Y ] | [   ] | 1 | d × ( 1 – y ) | 0.0805 | | |
| [ D, M, Y ] | [ Y ] | 2 | d × ( y – m ) | 0.0014 | | |
| [ D, M, Y ] | [ M, Y ] | 3 | d × ( m – d ) | 0.0003 | | |
| [ D, M, Y ] | [ D, M, Y ] | 4 | d × d | 0.0081 | (4)   d² | 0.0081 |

Table 2 — Presence in Death Date Comparisons
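
The grouping in Table 2 can be reproduced by enumerating all sixteen pairings and adding up the ones that share a group. The sketch below is illustrative only: the values of d, m and y are an assumption, back-calculated from Table 1, and the dictionary names are hypothetical. Group (k + 1) in the table corresponds to k fields being comparable.

```python
from itertools import product

# ASSUMPTION: presence values back-calculated from Table 1
# (d = 0.0903, m - d = 0.0029, y - m = 0.0158).
d, m, y = 0.0903, 0.0932, 0.1090

# Adjusted presence of each valid combination within a single record,
# keyed by the number of fields present (0 = empty, 3 = day, month, year).
single = {0: 1 - y, 1: y - m, 2: m - d, 3: d}

# Two records can be compared only on the fields present in both, so each
# pairing belongs to the group given by the smaller of the two field counts.
groups = {k: 0.0 for k in single}
for (k1, ap1), (k2, ap2) in product(single.items(), repeat=2):
    groups[min(k1, k2)] += ap1 * ap2

for k, total in groups.items():
    print(f"group ({k + 1}): {total:.4f}")
```

Since the inputs here are rounded to four decimals, the last printed digit may differ slightly from the figures in Table 2.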

Generalizing the adjustment algorithm between records.   The generalization of the algorithm allows the calculation of presence proportions for combinations of fields in dependence chains of arbitrary length (n), where $p_0^2 = 1$ and $p_{n+1}^2 = 0$. The adjustment in equation 5.2 applies to all comparisons between records, where k is the least number of fields present in either record.
$$ ap_k = p_i^2 - p_{i+1}^2 \tag{5.2} $$
This relationship is straightforward to derive by induction: simply add up the calculations for all the cases belonging to each comparison. For example, the cases numbered 3 in Table 2 sum to ( m – d )² + 2 × ( m – d ) × d = m² – d².
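
The closed form in equation 5.2 can also be checked numerically against the case-by-case sum for a chain of any length. The following is a sketch under the same conventions ($p_0 = 1$, $p_{n+1} = 0$); both function names are hypothetical, and the test values are arbitrary rather than taken from the data.

```python
def pairwise_adjusted_presence(p):
    """Equation 5.2: ap_k = p_k^2 - p_{k+1}^2 for k = 0..n."""
    padded = [1.0] + list(p) + [0.0]  # conventions p_0 = 1 and p_{n+1} = 0
    return [padded[k] ** 2 - padded[k + 1] ** 2 for k in range(len(p) + 1)]


def pairwise_by_enumeration(p):
    """Add up every pairing of single-record combinations, case by case."""
    padded = [1.0] + list(p) + [0.0]
    # Single-record adjusted presence from equation 5.1.
    ap = [padded[k] - padded[k + 1] for k in range(len(p) + 1)]
    totals = [0.0] * len(ap)
    for a, ap_a in enumerate(ap):
        for b, ap_b in enumerate(ap):
            # k is the least number of fields present in either record.
            totals[min(a, b)] += ap_a * ap_b
    return totals


chain = [0.5, 0.25, 0.125]  # arbitrary illustrative presences
closed = pairwise_adjusted_presence(chain)
summed = pairwise_by_enumeration(chain)
print(all(abs(c - s) < 1e-12 for c, s in zip(closed, summed)))
```

The agreement follows because both records must contain at least k fields for k fields to be comparable, which happens with probability $p_k^2$; subtracting the probability that both contain at least k + 1 leaves the pairings in which exactly k fields are comparable.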