Probabilities for a comparison weight

Section 6-1 PROBABILITIES FOR A COMPARISON WEIGHT

Probability of a matched comparison weight occurring
Probability of an unmatched comparison weight occurring
Probability of a comparison weight being matched
Probability of a comparison weight being unmatched

There are two probabilities of a particular comparison weight occurring. The first is when the comparison is between matched records and the second is when it is between unmatched records. In order to make these probabilities independent of whether the comparison weight is matched or unmatched we need to multiply by the probability that it falls into one or the other of these categories.

Probability of a matched comparison weight occurring. We may state the probability of occurrence for a particular comparison weight as directly proportional to the adjusted presence (ap_k) for the combination (k) of individual field weights. The probability of the comparison weight occurring in matched records when all fields agree is related to the adjusted reliability as in equation 6.1. When they disagree this probability relates to the complement of the adjusted reliability as in equation 6.2. When the data are all missing it is independent of either the reliability or the coincidence, but equal to the adjusted presence as in equation 6.3.

	P(aw_i = X_k | M) = ap_k × ar_ki		(6.1)
	P(dw_i = X_k | M) = ap_k × (1 – ar_ki)		(6.2)
	P(0 = X_k | M) = ap_k		(6.3)

It is possible to combine this set of equations into a single simpler one if we symbolize those parts that are different with the same letter, say R. We then make the stipulation that R consists of three factors: R₁ = ar_ki, being the k₁ fields that agree; R₂ = (1 – ar_ki), being the k₂ fields that disagree; and R₃ = 1 for the k₃ fields where at least one value is missing. This allows one to combine equations 6.1 and 6.2 with 6.3 as equation 6.4, the probability of occurrence (y_Rk) for a particular matched comparison weight (cw):

	y_Rk = P(cw_i = X_k | M) = ap_k × Π_j R_j		(6.4)

With these equations in mind we can write an algorithm to calculate the probability that a particular comparison weight will occur in a matched comparison.
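One way such an algorithm might look is sketched below in Python. It assumes that the product in equation 6.4 runs over the individual fields, each contributing ar_ki when it agrees, (1 – ar_ki) when it disagrees, and 1 when a value is missing. The function names, the state labels and the numeric values are hypothetical and serve only as an illustration.

```python
from math import prod


def r_factor(state, ar):
    """Return the R factor for one field (assumed reading of equation 6.4)."""
    if state == "agree":
        return ar            # R1: adjusted reliability
    if state == "disagree":
        return 1.0 - ar      # R2: complement of the adjusted reliability
    if state == "missing":
        return 1.0           # R3: independent of the reliability
    raise ValueError(f"unknown field state: {state!r}")


def matched_weight_probability(ap_k, fields):
    """y_Rk = ap_k times the product of the R factors.

    ap_k   -- adjusted presence for the combination k
    fields -- iterable of (state, adjusted_reliability) pairs
    """
    return ap_k * prod(r_factor(state, ar) for state, ar in fields)


# Hypothetical values, for illustration only.
example_fields = [("agree", 0.9), ("disagree", 0.8), ("missing", 0.95)]
y_Rk = matched_weight_probability(0.5, example_fields)
print(y_Rk)
```

With every field in agreement the function reduces to equation 6.1 for a single field, and with every field missing it returns the adjusted presence alone, as in equation 6.3.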

Probability of an unmatched comparison weight occurring. In a completely analogous fashion it is possible to estimate the corresponding probability for unmatched comparison weights. In this case each agreement weight occurs in proportion to the coincidence value (ac_ki), each disagreement weight occurs in proportion to one minus the coincidence value, and, as before, a field with a missing value contributes a factor of 1. Here we symbolize the three cases with C and express the probability of a particular unmatched comparison weight (z_Ck) as in equation 6.5:

	z_Ck = P(cw_i = X_k | U) = ap_k × Π_j C_j		(6.5)
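The unmatched case can be sketched in the same way. The Python below assumes, by analogy with the matched case, that each field contributes a C factor equal to the coincidence value when it agrees, one minus that value when it disagrees, and 1 when a value is missing. The names and figures are hypothetical.

```python
from math import prod

# Assumed C factors for the three cases of equation 6.5.
C_FACTORS = {
    "agree": lambda ac: ac,           # coincidence value
    "disagree": lambda ac: 1.0 - ac,  # one minus the coincidence value
    "missing": lambda ac: 1.0,        # independent of the coincidence
}


def unmatched_weight_probability(ap_k, fields):
    """z_Ck = ap_k times the product of the C factors.

    ap_k   -- adjusted presence for the combination k
    fields -- iterable of (state, coincidence_value) pairs
    """
    return ap_k * prod(C_FACTORS[state](ac) for state, ac in fields)


# Hypothetical values, for illustration only.
example_fields = [("agree", 0.1), ("disagree", 0.05), ("missing", 0.2)]
z_Ck = unmatched_weight_probability(0.5, example_fields)
print(z_Ck)
```

Because coincidence values are normally much smaller than reliabilities, a combination dominated by agreements should come out far less probable here than under the matched calculation.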

Probability of a comparison weight being matched. In order to make these probabilities independent of whether the comparison weight is matched or unmatched, we multiply each by the probability that a comparison weight falls into the corresponding category. The first is the probability that a comparison weight is matched, P(M). Its numerator, the number of matched comparison weights (W_M), equals the number of matched records in blocks (M_Q) from equation 2.9, and its denominator is the sum of the matched and unmatched comparison weights (W_M + W_U).

	P(M) = W_M ÷ (W_M + W_U), where W_M = M_Q		(6.6)

Probability of a comparison weight being unmatched. The second is the probability that a comparison weight is unmatched, P(U). The numerator of this value is the number of unmatched comparison weights (W_U). We can derive this value from equation 6.6 as soon as we realize that P(M), the probability of a weight being matched, is, in our situation, exactly equal to the blocking precision, which we estimated with equation 3.12 (see also equation 6.10 below).

	W_U = [W_M ÷ P(M)] – W_M, where W_M = M_Q		(6.7)

The second probability would therefore be:

	P(U) = W_U ÷ (W_M + W_U)		(6.8)
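Equations 6.6 to 6.8 can be traced with a short Python sketch. It takes the number of matched records in blocks (M_Q) and the blocking precision as given, uses the precision as P(M) to recover W_U, and then recomputes both probabilities from the counts. The function name and the input figures are hypothetical.

```python
def weight_class_probabilities(m_q, blocking_precision):
    """Return (W_M, W_U, P(M), P(U)) following equations 6.6 to 6.8."""
    if m_q <= 0:
        raise ValueError("M_Q must be positive")
    if not 0.0 < blocking_precision <= 1.0:
        raise ValueError("blocking precision must lie in (0, 1]")
    w_m = m_q                              # W_M = M_Q
    w_u = w_m / blocking_precision - w_m   # equation 6.7, with P(M) = precision
    p_m = w_m / (w_m + w_u)                # equation 6.6
    p_u = w_u / (w_m + w_u)                # equation 6.8
    return w_m, w_u, p_m, p_u


# Hypothetical inputs: 200 matched records in blocks, blocking precision 0.25.
w_m, w_u, p_m, p_u = weight_class_probabilities(m_q=200, blocking_precision=0.25)
print(w_m, w_u, p_m, p_u)
```

The recomputed P(M) returns the blocking precision that was supplied, and P(M) and P(U) sum to one, which is a quick consistency check on the two counts.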

Works of Wonder | Science of Genealogy