Chapter 4: SELECTING A WEIGHTING SCHEME & WEIGHTING THRESHOLDS



Blocking efficiency is measured by recall and precision, and the same holds true of weighting efficiency. Before measuring that efficiency, however, we must describe how to derive the weights themselves. Weighting is the process by which we determine how strongly a comparison indicates a match. The more reliable a field is, the more likely that a disagreement in the data will indicate that the comparison is a non-match. The higher the coincidence value (agrees non-coincidentally), the more likely that agreement in the data will indicate that the comparison is a match. The record comparison weight calculations allow us to express this relationship more precisely.

Depending on the agreement or disagreement of each field we may assign the appropriate weight to the comparison. A sufficiently high weight should indicate a high probability that the comparison represents a matched record. Ideally an unmatched pair should be assigned a rather low weight. The threshold is the comparison weight above which the pair is to be classed as linked and below which the pair is to be classed as unlinked. The proportion of matched pairs that we thereby class as linked is the weighting recall. The proportion of linked pairs that are actually matched is the weighting precision.

Using odds for weighted probabilities.   Record comparison weights are based on odds for the kind of comparison being made between the fields of the record. Odds is an expression related to probability (π) as follows:
odds = π ÷ ( 1 − π )    (4.1)
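
As a minimal sketch of the relationship between probability and odds, assuming the usual definition in which the odds are the probability divided by its complement, the conversion can be written in both directions. The function names are hypothetical and chosen only for illustration.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds, p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Invert the conversion: odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)
```

A probability of 0.75, for example, corresponds to odds of 3 (three to one), and converting those odds back returns 0.75.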

It is possible to measure the conditional probability that a field agrees, where the condition is that the comparison is matched and data is present. We do this by isolating a set of records that we know to belong to duplicate groups. We have been calling this measure reliability. We do not need the duplicate groups to measure the probability of agreement when the comparison is unmatched and data is present. This is what we have called general coincidence. We use the duplicate groups if the additional accuracy of an entity coincidence is desired.
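
The two measurements can be sketched as follows. This is an illustration under stated assumptions, not the procedure prescribed here: the function names are hypothetical, a missing value is represented by `None`, agreement means exact equality, and general coincidence is approximated from value frequencies alone as the chance that two distinct records drawn at random carry the same value.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple


def reliability(pairs: Iterable[Tuple[Optional[str], Optional[str]]]) -> float:
    """Share of known-matched pairs whose field values agree,
    counting only pairs in which both values are present."""
    agree = present = 0
    for a, b in pairs:
        if a is None or b is None:
            continue  # data missing: excluded from the condition
        present += 1
        if a == b:
            agree += 1
    if present == 0:
        raise ValueError("no matched pair has both values present")
    return agree / present


def general_coincidence(values: Sequence[Optional[str]]) -> float:
    """Assumed estimate: chance that two distinct records drawn at random
    agree on the field, computed from value frequencies only."""
    counts = Counter(v for v in values if v is not None)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two present values")
    agreeing_pairs = sum(c * (c - 1) for c in counts.values())
    return agreeing_pairs / (n * (n - 1))
```

Only the first function needs the duplicate groups; the second works on the field values of the whole file.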

It will be important, therefore, to figure the odds of field value agreement, disagreement, and missing. These odds are then expressed as weights for each field. The weights of the fields chosen as important are then combined to derive a record comparison weight.


  1. Odds of field value agreement
  2. Odds of field value disagreement
  3. Odds of field value missing
  4. Changing odds to weights
  5. Calculating record comparison weights
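
The steps listed above can be sketched in a few lines. The formulas here are assumptions in the style of common probabilistic record linkage practice, not necessarily the ones developed in the sub-sections: agreement odds are taken as reliability ÷ coincidence, disagreement odds as (1 − reliability) ÷ (1 − coincidence), a missing value is given neutral odds of 1, a weight is the base-2 logarithm of the odds, and the record comparison weight is the sum of the field weights. All names are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def field_weight(reliability: float, coincidence: float,
                 a: Optional[str], b: Optional[str]) -> float:
    """Weight of one field comparison as log2 of the assumed odds."""
    if not (0.0 < reliability < 1.0 and 0.0 < coincidence < 1.0):
        raise ValueError("reliability and coincidence must lie strictly between 0 and 1")
    if a is None or b is None:
        return 0.0  # assumption: missing data carries odds of 1, weight 0
    if a == b:
        return math.log2(reliability / coincidence)
    return math.log2((1.0 - reliability) / (1.0 - coincidence))


def record_weight(rec_a: Dict[str, Optional[str]],
                  rec_b: Dict[str, Optional[str]],
                  params: Dict[str, Tuple[float, float]]) -> float:
    """Sum the field weights; params maps field -> (reliability, coincidence)."""
    return sum(field_weight(r, c, rec_a.get(f), rec_b.get(f))
               for f, (r, c) in params.items())
```

Because the weights are logarithms, adding them corresponds to multiplying the underlying odds, which is why a disagreement on a reliable field pulls the record comparison weight below zero.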

Weighting efficiency.   Weighting recall is the probability that a matched weight will be above threshold and weighting precision is the probability that a particular weight above threshold is matched. We are relating four classes of weights:
1) all those that are accepted above threshold, i.e., linked (Lt),
2) all those that are matched (M or WM),
3) those that are matched above threshold (Mt), and
4) those that are unmatched (WU) above threshold (Ut).

recall = P ( Lt | M ) = Mt ÷ M    (4.11)

precision = P ( M | Lt ) = Mt ÷ ( Mt + Ut )    (4.12)

Selecting a comparison weight as threshold.   The best choice of threshold depends on our ability to maximize both weighting recall and weighting precision. Normally if we try to increase recall, the precision is reduced, and if we take steps to increase precision, the recall is reduced.
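
The trade-off can be made visible by computing recall and precision at several candidate thresholds. This is only a sketch: it assumes the match status of each comparison is known, treats a weight equal to the threshold as unlinked, and uses hypothetical names. It does not choose a threshold; it only tabulates the two measures so they can be compared.

```python
from typing import List, Optional, Sequence, Tuple


def sweep_thresholds(matched_weights: Sequence[float],
                     unmatched_weights: Sequence[float],
                     thresholds: Sequence[float]
                     ) -> List[Tuple[float, float, Optional[float]]]:
    """Return (threshold, recall, precision) for each candidate threshold."""
    if not matched_weights:
        raise ValueError("at least one matched weight is required")
    rows = []
    for t in sorted(thresholds):
        m_t = sum(1 for w in matched_weights if w > t)
        u_t = sum(1 for w in unmatched_weights if w > t)
        recall = m_t / len(matched_weights)
        precision = m_t / (m_t + u_t) if (m_t + u_t) else None
        rows.append((t, recall, precision))
    return rows


if __name__ == "__main__":
    # Invented weights, for illustration only.
    matched = [6.0, 4.0, 2.0, -1.0]
    unmatched = [3.0, 0.5, -2.0, -5.0]
    for t, r, p in sweep_thresholds(matched, unmatched, [-3.0, 0.0, 2.5, 5.0]):
        print(f"threshold {t:5.1f}  recall {r:.2f}  precision {p}")
```

On the invented data in the example, raising the threshold from -3 to 5 lowers recall from 1.00 to 0.25 while precision rises from about 0.57 to 1.00.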

Weighting Recall & Precision

We now discuss how to plot the distribution of comparison weights with an eye to setting the most appropriate threshold. Once this is done the parameters for the matching algorithm will be complete. Refer to figure 2 below. This figure is actually an idealized version of a frequency polygon based on the histogram constructed by charting weights for all the comparisons made in blocks. The negative weights, which represent fractional powers of two, tend to be those for unmatched comparisons. The positive weights represent in large part the matched comparisons. Thus the curve is bimodal.

Figure 2  Comparison Weight Distribution
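
The histogram behind a figure of this kind can be built by counting comparison weights in fixed-width bins. The sketch below assumes the weights are available as a list of numbers and uses a bin width of one weight unit; both the function name and the bin width are illustrative choices.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def weight_histogram(weights: Sequence[float],
                     bin_width: float = 1.0) -> List[Tuple[float, int]]:
    """Return (bin lower edge, count) pairs covering the full weight range,
    including empty bins so the gap between the two modes stays visible."""
    if not weights:
        raise ValueError("no weights to chart")
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter(math.floor(w / bin_width) for w in weights)
    lo, hi = min(counts), max(counts)
    return [(i * bin_width, counts.get(i, 0)) for i in range(lo, hi + 1)]


if __name__ == "__main__":
    # Invented weights, for illustration only.
    sample = [-4.2, -3.9, -3.5, -3.1, 0.2, 5.5, 5.9, 6.3]
    for edge, count in weight_histogram(sample):
        print(f"{edge:6.1f} | {'*' * count}")
```

Joining the tops of the bars at the bin midpoints gives the frequency polygon; with real comparison data one cluster of bars would be expected below zero and another above it.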

In this case we do not have a model for the frequency density distribution. This means that we can only guess at the two curves underlying the two modes. A conservative estimate is outlined on the figure with the dotted lines. The green line is a pretty much minimal estimate of the intrusion of the unmatched on the matched, and the orange, the matched on the unmatched. Aiding this guess is an important observation: the proportion of linked to unlinked at zero should be about the same as the proportion of the two totals.