3-3.6 Measuring coincidence.

A field's coincidence (coincidental agreement) is defined as the probability that the field values agree in a non-matched pair, i.e., by chance. This probability is different for each specific value.

In practice it is not straightforward to measure the quantities in the above equation directly. So we first estimate the probability of each field value occurring in a record by measuring its relative frequency in the database (its B-value). When the duplication rate is not too high, this is very close to the value's relative frequency in one record of a non-matched pair.

The square of the B-value for each field value is then an estimate of its value-specific coincidental agreement in a comparison, since both records of a non-matched pair must carry that value independently.

We then sum the squares of the B-values of all possible field values to estimate the field's coincidental agreement in a comparison.
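The estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name `general_coincidence`, the use of `None` for missing data, and the sample surname list are assumptions made for the example and do not come from the text.

```python
from collections import Counter

def general_coincidence(values):
    """Sum of squared B-values for one field.

    `values` holds the field value of every record in the database;
    None marks a record with no data in the field (an assumption of
    this sketch). Returns the coincidence and the B-values.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no data present in the field")
    total = len(present)
    # B-value: relative frequency of each value among records with data.
    b_values = {v: n / total for v, n in Counter(present).items()}
    # Value-specific coincidence is B squared; the field's coincidence
    # is the sum of these squares over all values.
    return sum(b * b for b in b_values.values()), b_values

# Hypothetical surname field with one missing value.
surnames = ["Smith", "Smith", "Jones", "Smith", "Lee", None, "Jones", "Smith"]
coincidence, b_values = general_coincidence(surnames)
print(coincidence)
```

With this sample data the B-values are 4/7, 2/7 and 1/7, so the common value "Smith" contributes by far the largest share of the field's coincidence.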

Note that the revision to reliability that we made in ¶ 3-1.3, equation 3.3, does not improve the theoretical precision. However, just as the measures of duplication rate and reliability are ambiguous, so too is the measure of coincidence. As we see above, coincidence may be the proportion of agreeing comparisons as tallied either 1) across non-matched comparisons, or 2) across all possible comparisons (with data present). In the latter case we might make the following refinement: count a fractional agreement when the comparison includes records out of a duplicate group, i.e., where there are several possible comparisons within the same linkage entity. We might then call the coincidence based on the unrefined tally the general coincidence, i.e., the probability that a field value agrees in a comparison taken at random, provided only that data is present in the field. The refinement would then be the entity coincidence, i.e., the probability that a field value agrees in a comparison taken at random, provided that 1) data is present in the field, and 2) each comparison is counted only in proportion as it represents a uniquely significant record linkage entity.

Accidental agreement depends on the specific value in the field. A value is more likely to agree when it is common than when it is rare. In fact, the probability of the field agreeing in a comparison is simply the sum of the probabilities of each specific value agreeing. But the calculation of the B-value and of the general coincidence does not take into account whether a record is a singleton or belongs to a duplicate group. For the entity coincidence we first tally the presence (P = occurrences = tokens) of each specific value (the index j = type) within a duplicate group (entity), weighting the group as a single entity, and then tally the entity frequencies. The index on the innermost sigma is k, running from 1 to Cijl (combinations), which equals 1 for singletons. The index on the second sigma is l, running through the duplicate groups, including singletons, to NU (uniquely significant records). In this way each entity counts its value only once. The entity frequencies are then squared, as comparisons, and summed to give the total agreements in comparisons.

(3.16)

Using the total number of uniquely significant records ignores the fact that we want to exclude all comparisons of a record with itself. For precision we subtract one (1) from this number. Typically, and especially when there are many duplicates, the entity coincidence is smaller than the general coincidence. Specific data values are thus shown to be more distinctive than they would otherwise be estimated to be.
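The entity weighting can be illustrated with a rough Python sketch. It is one possible reading of the description, not a transcription of equation 3.16: it assumes that each duplicate group contributes a total weight of one, shared among the values it contains in proportion to their occurrences, and it leaves out the subtract-one correction for self-comparisons. The function name `entity_coincidence`, the `(entity_id, value)` record layout, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict

def entity_coincidence(records):
    """Entity-weighted coincidence for one field (illustrative only).

    `records` is an iterable of (entity_id, value) pairs; None marks
    missing data. Assumption: every entity counts once, split across
    its values in proportion to their occurrences in the group.
    """
    by_entity = defaultdict(list)
    for entity_id, value in records:
        if value is not None:
            by_entity[entity_id].append(value)
    n_entities = len(by_entity)  # entities with data present
    if n_entities == 0:
        raise ValueError("no data present in the field")
    entity_freq = Counter()
    for group_values in by_entity.values():
        for value, n in Counter(group_values).items():
            # Fractional share of this value inside its duplicate group.
            entity_freq[value] += n / len(group_values)
    # Square the entity-level relative frequencies and sum them.
    return sum((f / n_entities) ** 2 for f in entity_freq.values())

# Hypothetical data: entity A has three duplicates, D has two spellings.
records = [
    ("A", "Smith"), ("A", "Smith"), ("A", "Smith"),
    ("B", "Smith"),
    ("C", "Jones"),
    ("D", "Lee"), ("D", "Li"),
]
entity_c = entity_coincidence(records)

# Unweighted (general) coincidence over the same records, for comparison.
all_values = [v for _, v in records if v is not None]
general_c = sum((n / len(all_values)) ** 2 for n in Counter(all_values).values())
print(entity_c, general_c)
```

In this small example the three duplicate "Smith" records of entity A count only once, so the entity-weighted figure comes out lower than the unweighted one, in line with the tendency noted in the text.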

Probabilistic Record Linkage Principle of Field Coincidence