Accounting for Presence Dependence

Section 5-2 ACCOUNTING FOR PRESENCE DEPENDENCE

Assumption of independence
Adjusting presence within a record
Generalizing the adjustment algorithm within a record
Field presence dependence between two records
Generalizing the adjustment algorithm between records

Assumption of independence. If we were to assume independence, we would use these probabilities to calculate the expected proportion of every combination of fields. In Table 1, d, m, and y stand for the presence of the day, month, and year fields. The combinations marked with an asterisk do not actually occur, so their presence should be zero. It seems safe to assume that a day of month does not occur without a month of year, and that a month of year does not occur without a year. There may be exceptions in reality, but such combinations would be truly rare. This observation allows us to adjust the presence for the combinations that do occur.

Death Date Combination of Fields	Calculation of Occurrence When Fields Are Independent	Presence Assuming Independence	Calculation of Occurrence When Fields Are Fully Dependent	Presence Assuming Full Dependence
[ ]	( 1 – d ) × ( 1 – m ) × ( 1 – y )	0.7350	1 – y	0.8910
[ D ]*	d × ( 1 – m ) × ( 1 – y )	0.0730
[ M ]*	( 1 – d ) × m × ( 1 – y )	0.0755
[ Y ]	( 1 – d ) × ( 1 – m ) × y	0.0899	y – m	0.0158
[ D, M ]*	d × m × ( 1 – y )	0.0075
[ D, Y ]*	d × ( 1 – m ) × y	0.0089
[ M, Y ]	( 1 – d ) × m × y	0.0092	m – d	0.0029
[ D, M, Y ]	d × m × y	0.0009	d	0.0903

Table 1 — Estimating Presence of Death Date Fields
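As a rough check on Table 1, the following Python sketch recomputes both presence columns. The values of d, m, and y are not stated in this section; the figures used here are back-calculated from the full-dependence column of the table (for example, y = 1 – 0.8910) and should be read as assumptions, as should the function names.

```python
from itertools import product

# Assumed presence values, back-calculated from Table 1 (not given in the text).
PRESENCE = {"D": 0.0903, "M": 0.0932, "Y": 0.1090}


def independent_presence(present):
    """Probability of exactly this set of fields if presence were independent."""
    prob = 1.0
    for field, p in PRESENCE.items():
        # Multiply by p when the field is present, by (1 - p) when it is absent.
        prob *= p if field in present else (1.0 - p)
    return prob


def dependent_presence(present):
    """Probability under full dependence: D requires M, and M requires Y."""
    d, m, y = PRESENCE["D"], PRESENCE["M"], PRESENCE["Y"]
    valid = {
        frozenset(): 1.0 - y,
        frozenset("Y"): y - m,
        frozenset("MY"): m - d,
        frozenset("DMY"): d,
    }
    # Any combination missing from the table of valid ones is impossible.
    return valid.get(frozenset(present), 0.0)


# All eight combinations of the three fields.
combos = [
    frozenset(f for f, keep in zip("DMY", flags) if keep)
    for flags in product([False, True], repeat=3)
]
for combo in combos:
    label = ", ".join(f for f in "DMY" if f in combo)
    print(f"[ {label} ]",
          round(independent_presence(combo), 4),
          round(dependent_presence(combo), 4))
```

Under either assumption the eight proportions sum to 1, which is a quick way to catch a missing factor in any row.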

Adjusting presence within a record. Because of presence dependence, we cannot estimate the probability that a field occurs directly from that field's presence value. Instead, we must adjust the presence value according to which of the fields it depends on are also present. Suppose the fields are fully dependent, so that field 3 depends on field 2, which in turn depends on field 1. For any combination of fields, we then ask whether the fields can actually occur together. If one field is present, say field 3, but a field it depends on, say 2 or 1, is absent, then the combination is impossible. The valid combinations are therefore exactly those of the form [ 1, 2, . . . , k ], so each can be identified by the number k of fields present, where k runs from 0 (no fields) to n, the length of the dependency chain. The adjusted presence for combination k is written ap_k.
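The validity rule can be illustrated with a short Python sketch. The function names are hypothetical, and fields are simply numbered 1 to n in chain order, as in the paragraph above.

```python
from itertools import combinations


def is_valid(present, n):
    """A combination is possible only if it is exactly fields 1..k for some k <= n."""
    k = len(present)
    return k <= n and set(present) == set(range(1, k + 1))


def valid_combinations(n):
    """Enumerate every subset of fields 1..n and keep the possible ones."""
    fields = range(1, n + 1)
    subsets = (c for size in range(n + 1) for c in combinations(fields, size))
    return [list(c) for c in subsets if is_valid(c, n)]


print(valid_combinations(3))  # [[], [1], [1, 2], [1, 2, 3]]
```

Of the 2^n possible subsets, only n + 1 survive, which is why a single index k is enough to label them.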

Generalizing the adjustment algorithm within a record. By induction, we can determine an adjusted presence value (ap_k) for each of the four valid combinations of three presence-dependent fields, where p_k is the unadjusted presence of field k:

	[ 1, 2, 3 ]	ap₃ = p₃

	[ 1, 2 ]	ap₂ = p₂ – p₃

	[ 1 ]	ap₁ = p₁ – p₂

	[ ]	ap₀ = 1 – p₁

This algorithm generalizes to dependency chains of arbitrary length (n). Setting i = k and adopting the boundary values p₀ = 1 and p_n₊₁ = 0, the adjusted presence for every k from 0 to n is:

ap_k = p_i – p_i₊₁	(5.1)
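Equation 5.1 can be sketched in a few lines of Python. The function name and the input check are illustrative assumptions, and the sample values for year, month, and day are back-calculated from Table 1 rather than given in the text.

```python
def adjusted_presence(p):
    """Apply ap_k = p_k - p_(k+1) for k = 0..n.

    `p` lists p_1..p_n in chain order (field 1 first); the boundary
    values p_0 = 1 and p_(n+1) = 0 are added here.
    """
    if any(later > earlier for earlier, later in zip(p, p[1:])):
        raise ValueError("a dependent field cannot be present more often than its parent")
    padded = [1.0, *p, 0.0]
    return [padded[k] - padded[k + 1] for k in range(len(p) + 1)]


# Assumed values for the death date chain: year, month, day.
ap = adjusted_presence([0.1090, 0.0932, 0.0903])
print([round(x, 4) for x in ap])  # [0.891, 0.0158, 0.0029, 0.0903]
```

Because the differences telescope, the adjusted values always sum to p₀ – p_n₊₁ = 1, whatever the chain length.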

Field presence dependence between two records. Comparing two records yields a judgement of agreement or disagreement on a field only when that field is present in both records. Of all the possible pairs of field combinations, some therefore admit a comparison even when the two combinations are not identical, and others admit none. Table 2 lists every pair of combinations for the death date example. The number beside each pair marks pairs that are equivalent in the sense that the same fields can be compared. Pairs in which missing data rules out any comparison must still be counted as outcomes; they carry the number 1. Adding the occurrences of the equivalent pairs together gives a simple calculation for each kind of comparison.

Death Date Comparison First Record	Death Date Comparison Second Record	Comparison Group	Calculation of Occurrence of Each Combination	Presence Assuming Alone	Calculation of Occurrence in Pairwise Comparison	Presence of Comparison Assuming Pairs
[ ]	[ ]	1	( 1 – y ) × ( 1 – y )	0.7939	(1) 1 – y²	0.9881
[ ]	[ Y ]	1	( 1 – y ) × ( y – m )	0.0141
[ ]	[ M, Y ]	1	( 1 – y ) × ( m – d )	0.0026
[ ]	[ D, M, Y ]	1	( 1 – y ) × d	0.0805
[ Y ]	[ ]	1	( y – m ) × ( 1 – y )	0.0141
[ Y ]	[ Y ]	2	( y – m ) × ( y – m)	0.0003	(2) y² – m²	0.0032
[ Y ]	[ M, Y ]	2	( y – m ) × ( m – d )	0.0001
[ Y ]	[ D, M, Y ]	2	( y – m ) × d	0.0014
[ M, Y ]	[ ]	1	( m – d ) × ( 1 – y )	0.0026
[ M, Y ]	[ Y ]	2	( m – d ) × ( y – m )	0.0001
[ M, Y ]	[ M, Y ]	3	( m – d ) × ( m – d )	0.0000	(3) m² – d²	0.0005
[ M, Y ]	[ D, M, Y ]	3	( m – d ) × d	0.0003
[ D, M, Y ]	[ ]	1	d × ( 1 – y )	0.0805
[ D, M, Y ]	[ Y ]	2	d × ( y – m )	0.0014
[ D, M, Y ]	[ M, Y ]	3	d × ( m – d )	0.0003
[ D, M, Y ]	[ D, M, Y ]	4	d × d	0.0081	(4) d²	0.0081

Table 2 — Presence in Death Date Comparisons
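The grouping in Table 2 can be reproduced with a small Python sketch. It assumes the two records are independent of each other, as the products in the table imply, and it takes the single-record values from the full-dependence column of Table 1; the variable names are hypothetical.

```python
from collections import defaultdict

# Assumed adjusted presence of each valid combination in a single record,
# taken from the full-dependence column of Table 1.
SINGLE = {"": 0.8910, "Y": 0.0158, "MY": 0.0029, "DMY": 0.0903}

groups = defaultdict(float)
for first, p_first in SINGLE.items():
    for second, p_second in SINGLE.items():
        # Only fields present in both records can be compared, so the
        # shorter combination decides how many fields are comparable.
        comparable = min(len(first), len(second))
        # Table 2 numbers its groups from 1 (nothing comparable) to 4.
        groups[comparable + 1] += p_first * p_second

for group in sorted(groups):
    print(group, round(groups[group], 4))
```

With these rounded inputs, the group totals should agree with the last column of Table 2 to within about 0.0001.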

Generalizing the adjustment algorithm between records. The algorithm generalizes to give presence proportions for comparisons in dependence chains of arbitrary length (n), with the boundary values p₀² = 1 and p_n₊₁² = 0. The adjustment in equation 5.2 applies to all comparisons between two records, where k is the smaller of the numbers of fields present in the two records and i = k.

ap_k = p_i² – p_i₊₁²	(5.2)

This relationship is straightforward to derive by induction: simply add up the calculations for all the pairs of combinations that share the same value of k.
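The summation can also be done directly, as sketched below in LaTeX. To keep the two equations apart, the sketch writes a_k = p_k – p_{k+1} for the single-record adjusted presence of equation 5.1 and c_k for the pairwise presence of equation 5.2; both symbols are introduced here for illustration only. A pair of records has k comparable fields when both records have exactly k fields, or when one has exactly k and the other has more. Assuming the two records are independent of each other, and using the telescoping sum a_{k+1} + ... + a_n = p_{k+1}:

```latex
\begin{align*}
c_k &= a_k^2 + 2\,a_k \sum_{j=k+1}^{n} a_j \\
    &= a_k \left( a_k + 2\,p_{k+1} \right) \\
    &= (p_k - p_{k+1})(p_k + p_{k+1}) \\
    &= p_k^2 - p_{k+1}^2
\end{align*}
```

For the death date example with k = 1, this gives y² – m², the expression shown for group 2 in Table 2.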
