Two definitions for duplication rate. There are at least two ways to define the duplication rate. One is the probability that, when you choose a record at random, at least one other record in the file represents the same entity. We might call this the redundancy rate. In this case we divide the number of records in groups (NG), less the number of groups (G), by the total number of records in the file (Ntotal), as in equation 3.6:
$$P(RN) = \frac{N_G - G}{N_{total}} \tag{3.6}$$
Another way is to define the duplication rate as the probability that, when you choose an entity at random, more than one record in the file represents it. In this case we divide the number of duplicate groups in the file (G) by the number of unique linkage entities represented (NU):
$$P(DE) = \frac{G}{N_U} \tag{3.7}$$
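The two rates can be sketched as small helper functions (a minimal sketch; the function names are illustrative, not from the source):

```python
def redundancy_rate(n_g: int, g: int, n_total: int) -> float:
    """P(RN), equation 3.6: redundant records as a share of all records."""
    return (n_g - g) / n_total


def duplication_rate(g: int, n_u: int) -> float:
    """P(DE), equation 3.7: duplicate groups as a share of unique entities."""
    return g / n_u
```

With the 1736-1755 Akershus sample from Table 1 below (NG = 1213, G = 591, Ntotal = 10849, NU = 10227), these give 0.0573 and 0.0578 respectively.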
Proportions of records. |
Figure 1 diagrams a file containing Ntotal records. A typical database contains some records that are unique in the sense that no other record refers to the same entity. These are the singletons (NS), one record per entity. If the whole database consisted of singletons, the duplication rate would be zero. The two definitions of duplication rate correspond to two very different ratios.
Calculating an entity duplication rate. Here is how to calculate the entity duplication rate from numbers that can easily be derived from the data. One number needed is how many individuals all the records in the file together represent (NU). We may call these the uniquely significant records, meaning that this is the smallest number of records required to represent every entity exactly once. To get this figure we subtract the redundant records (NR) from the total (Ntotal):
$$N_R = N_G - G, \qquad N_U = N_{total} - N_R \tag{3.8}$$
With these figures, the duplication rate of equation 3.7 follows by dividing the number of duplicate groups (G) by the number of uniquely significant (non-redundant) records (NU, equation 3.8). Table 1 below shows these numbers for a test database from Akershus, Norway.
Table 1. Duplicates in five samples from Akershus, Norway (estimated counts in parentheses).

| Years in Sample | Total (Ntotal) | Unique (NU) | Pairs | Triples | Quadruples | In groups (NG) | P(DE) | P(RN) |
|---|---|---|---|---|---|---|---|---|
| 1736-1755 | 10849 | 10227 | 563 (557) | 25 (32) | 3 (2) | 1213 | 0.0578 | 0.0573 |
| 1781-1794 | 9772 | 9465 | 270 (279) | 17 (9) | 1 (0) | 595 | 0.0304 | 0.0314 |
| 1805-1814 | 6465 | 6458 | 151 (154) | 7 (4) | | 323 | 0.0245 | 0.0255 |
| 1836-1845 | 11249 | 11088 | 141 (149) | 10 (2) | | 312 | 0.0136 | 0.0143 |
| 1866-1875 | 7198 | 7062 | 126 (128) | 5 (3) | | 267 | 0.0185 | 0.0189 |
| Total | 45533 | 44142 | 1251 (1279) | 64 (39) | 4 (1) | 2710 | 0.0299 | 0.0305 |
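As a check, the first row of Table 1 can be reproduced from its raw counts alone (a sketch under the definitions above; variable names are illustrative):

```python
# 1736-1755 sample from Table 1
n_total = 10849                    # all records in the sample
n_u = 10227                        # unique entities, N_U
pairs, triples, quadruples = 563, 25, 3

g = pairs + triples + quadruples   # duplicate groups, G
n_r = n_total - n_u                # redundant records, N_R (eq. 3.8)
n_g = n_r + g                      # records in duplicate groups, N_G

print(n_g)                         # 1213, as tabulated
print(round(g / n_u, 4))           # P(DE) = 0.0578
print(round(n_r / n_total, 4))     # P(RN) = 0.0573
```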
Estimating numbers of duplicate n-tuples. Based on the duplication rate for each sample it is possible to estimate the expected number of duplicate pairs, triples, and quadruples. For example, when there are 10227 unique records we would expect 591 (0.0578 times 10227) to be in duplicate groups. Of these, 34 (0.0578 times 591) are in groups having more than two members, and 2 (0.0578 times 34) are in groups with more than three members. The estimates in table 1 come from subtracting out the numbers in the overlapping classes: 2 in quadruples, 34 minus 2 = 32 in triples, and 591 minus 34 = 557 in pairs.
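The subtraction scheme above can be sketched as follows (the function name is an assumption, not from the source):

```python
def estimate_ntuples(rate: float, n_unique: int) -> tuple[int, int, int]:
    """Estimate the number of duplicate pairs, triples, and quadruples
    implied by a duplication rate, by peeling off overlapping classes."""
    in_groups = rate * n_unique   # records expected in duplicate groups
    in_3_plus = rate * in_groups  # of those, in groups of three or more
    in_4_plus = rate * in_3_plus  # of those, in groups of four or more
    pairs = round(in_groups - in_3_plus)
    triples = round(in_3_plus - in_4_plus)
    quadruples = round(in_4_plus)
    return pairs, triples, quadruples
```

For the 1736-1755 sample, `estimate_ntuples(0.0578, 10227)` returns (557, 32, 2), the parenthesized estimates in Table 1.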
Anomalous duplication rates. The observed numbers of triples and quadruples generally exceed their estimates, which makes it appear that a query is more likely to duplicate an entity that is already duplicated than a singleton. Indeed, this may well be so. In the real world, of which the records are traces, some individuals are in fact more copiously documented than others. Recording a famous person or popular ancestor could well increase the odds of the entity being duplicated. The fact that duplicated individuals are more likely to have appropriate identifiers present must also play a role.