3-2.3 Calculating an entity duplication rate.
The entity duplication rate can be calculated from numbers that are easily derived from the data.
First we need the number of individuals (linkage entities) that all the records in the file
together represent (NU). We call these the uniquely significant records: the smallest set of
records needed to represent every individual exactly once. To get this figure we subtract the
number of redundant records (NR) from the total number of records (Ntotal).
unique records: $N_U = N_{\text{total}} - N_R$    (3.8)
The redundant records are all in duplicate groups. We can count them by subtracting one from the
size of each duplicate group and summing the results. Equivalently, we subtract the number of
groups (G) from the number of records in groups (NG).
redundant records: $N_R = N_G - G$    (3.9)
With these figures, the duplication rate of equation 3.7 is found by dividing the number of
duplicate groups (G) by the number of uniquely significant (non-redundant) records (NU,
equation 3.8). Table 1 below shows some of the numbers for a test database from Akershus, Norway.
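The two steps above can be combined so that the rate follows directly from the three counts. The lines below only restate equations 3.8 and 3.9 and the division just described; they add no new assumptions beyond the symbols already defined.

```latex
\begin{align*}
N_R &= N_G - G \\
N_U &= N_{\text{total}} - N_R = N_{\text{total}} - N_G + G \\
\text{duplication rate} &= \frac{G}{N_U} = \frac{G}{N_{\text{total}} - N_G + G}
\end{align*}
```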
| Years in sample | Total (Ntotal) | Unique (NU) | Duplicate pairs | Duplicate triples | Duplicate quadruples | Total (NG) | Duplication rate P(DE) | Duplication rate P(RN) |
|-----------------|----------------|-------------|-----------------|-------------------|----------------------|------------|------------------------|------------------------|
| 1736-1755       | 10849          | 10227       | 563 (557)       | 25 (32)           | 3 (2)                | 1213       | 0.0578                 | 0.0573                 |
| 1781-1794       | 9772           | 9465        | 270 (279)       | 17 (9)            | 1 (0)                | 595        | 0.0304                 | 0.0314                 |
| 1805-1814       | 6465           | 6458        | 151 (154)       | 7 (4)             |                      | 323        | 0.0245                 | 0.0255                 |
| 1836-1845       | 11249          | 11088       | 141 (149)       | 10 (2)            |                      | 312        | 0.0136                 | 0.0143                 |
| 1866-1875       | 7198           | 7062        | 126 (128)       | 5 (3)             |                      | 267        | 0.0185                 | 0.0189                 |
| Total           | 45533          | 44142       | 1251 (1279)     | 64 (39)           | 4 (1)                | 2710       | 0.0299                 | 0.0305                 |

Table 1. Sample duplication
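As a check on the arithmetic, the calculation can be sketched in a few lines of Python. This is a minimal illustration, not code from the study: the function name `duplication_stats` and the dictionary layout are hypothetical, and the input numbers are the 1736-1755 row of Table 1 (total records and the unparenthesized counts of pairs, triples and quadruples).

```python
def duplication_stats(n_total, group_counts):
    """Compute duplication figures from duplicate-group counts.

    n_total      -- total number of records in the file (Ntotal)
    group_counts -- hypothetical layout: {group size: number of groups of that size}
    """
    g = sum(group_counts.values())                              # number of duplicate groups (G)
    n_g = sum(size * n for size, n in group_counts.items())     # records in groups (NG)
    n_r = n_g - g                                               # redundant records, equation 3.9
    n_u = n_total - n_r                                         # unique records, equation 3.8
    rate = g / n_u                                              # duplication rate: G divided by NU
    return {"G": g, "NG": n_g, "NR": n_r, "NU": n_u, "rate": rate}


# 1736-1755 row of Table 1: 563 pairs, 25 triples, 3 quadruples
stats = duplication_stats(10849, {2: 563, 3: 25, 4: 3})
print(stats["G"], stats["NG"], stats["NR"], stats["NU"], round(stats["rate"], 4))
```

For this row the sketch gives G = 591, NG = 1213, NR = 622, NU = 10227 and a rate of about 0.0578, in agreement with the NG, NU and P(DE) entries of the table.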