Chapter 3: SELECTING A BLOCKING SCHEME



Generally the files of records to be linked are too large to allow comparing every record with every other record. The use of an index is usually unavoidable; in probabilistic record linkage this indexing is called blocking the records. Blocking cuts down on the number of comparisons that need to be made: all comparisons are made among records within a block, that is, among records sharing the same key value. The question is which field or fields would be suitable as a key. We want to choose one or more fields that bring back as many of the matches as possible rather than losing them, and at the same time do not bring back so many candidates that we spend inordinate amounts of time comparing non-matches. These are the two measures of efficiency, respectively recall and precision. The actual comparing of the records in the block is called weighting.
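
In implementation terms, blocking is simply an inverted index on the key field. The following minimal sketch (the record layout and field names are illustrative, not taken from the text) groups a file into blocks so that only records sharing a key value are ever compared:

    from collections import defaultdict

    def build_blocks(records, key_field):
        # Group records into blocks keyed on the value of one field.
        blocks = defaultdict(list)
        for record in records:
            key = record.get(key_field)
            if key is not None:   # a record missing the field joins no block
                blocks[key].append(record)
        return blocks

    records = [
        {"id": 1, "surname": "Smith"},
        {"id": 2, "surname": "Smith"},
        {"id": 3, "surname": "Smyth"},
    ]
    blocks = build_blocks(records, "surname")
    # Only records inside the same block are ever compared (weighted):
    # blocks["Smith"] holds ids 1 and 2; blocks["Smyth"] holds id 3.

A compound key would simply combine the values of several fields into one key value.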

Knowing the definition of blocking efficiency, it is possible to run a test and obtain the measures empirically. Having a set of matched duplicates, we simply take one record at random from each group, use it as a query, and count how many of the other records in its group fall inside the block it defines and how many fall outside. As useful as these numbers are, we cannot tell their certainty without repeating the procedure several times with other query sets chosen at random. However, we can also determine the blocking efficiency from other measures that we need to take on our sample anyway. As mentioned above in ¶ 2-2.7, there are three principal field characteristics of the database of interest for which we need measures: 1) presence, 2) reliability, and 3) coincidence. Based on these measures we may select certain fields as best suited for either blocking or weighting; these choices constitute the blocking and weighting schemes. In order to optimize the blocking scheme we need an algorithm that maximizes recall and precision, taking into account the cost of calculating the weight of a comparison. The greater the noise (the lower the precision), the less desirable the blocking scheme. In this chapter we also develop the equations for the duplication rate, which is important in getting at the blocking precision.
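
The empirical test just described is easy to operationalize. In this sketch (an assumed formulation, with hypothetical record dictionaries and a caller-supplied block_key function) one query is drawn at random from each known duplicate group; recall counts the group mates that land in the query's block, and precision weighs them against the noise that lands there too:

    import random

    def block_efficiency(groups, block_key):
        # groups:    known duplicate groups, each a list of record dicts
        # block_key: function mapping a record to its blocking-key value
        blocks = {}                    # index the whole test file by key value
        for group in groups:
            for rec in group:
                blocks.setdefault(block_key(rec), []).append(rec)

        found = missed = noise = 0
        for group in groups:
            query = random.choice(group)       # one query per group, at random
            block = blocks[block_key(query)]
            mates = [r for r in group if r is not query]
            hits = sum(1 for r in mates if block_key(r) == block_key(query))
            found += hits                      # group mates inside the block
            missed += len(mates) - hits        # group mates the block loses
            noise += (len(block) - 1) - hits   # non-matches the block retrieves
        recall = found / (found + missed) if found + missed else 1.0
        precision = found / (found + noise) if found + noise else 1.0
        return recall, precision

    # e.g. block_efficiency(duplicate_groups, lambda r: r["surname"])

As the text notes, a single run gives no sense of the estimates' certainty; repeating the procedure with fresh random queries and examining the spread of results does. Note also that this sketch indexes only the grouped records; in a real test file, unmatched singletons would be indexed as well and would add to the noise.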

Blocking recall.   Suppose we make a query against a database of records representing individuals, and that the records are ordered by personal name. We query by personal name, and suppose one or more records for the queried name are found. All of the records with the same value for the personal name will be retrieved, and these constitute a block of records: the retrieval set. It may well be that there are other records for the same individual that simply do not share that value for the personal name. In that case the recall is imperfect. A first-order model of recall in terms of the field characteristics is sketched after the list below.


  1. Estimating blocking recall
  2. An iterative method for estimating reliability
  3. Measuring reliability
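
The subsections above estimate recall from the measured field characteristics. As a rough first-order model (an assumption for orientation only, not necessarily the form of equation 3.1): let p_A be the probability that field A is present on a record, and r_A the probability that two records for the same entity agree on A when both carry it. Two matched records then fall into the same block on A roughly when the field is present on both and reliably recorded, giving

    \mathrm{recall}_A \;\approx\; p_A^{2}\, r_A ,
    \qquad
    \mathrm{recall}_{A \wedge B} \;\approx\; \bigl(p_A^{2}\, r_A\bigr)\bigl(p_B^{2}\, r_B\bigr)

for a compound key on fields A and B, assuming the fields behave independently. Under this model every factor is at most one, so compounding a key can only lower recall; that is the price paid for the smaller blocks that improve precision.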

Duplication rate.   In order to get at precision we need to estimate the noise, and the noise depends on how many duplicates there are in the test database. Duplicates come in groups: most of the time a duplicate group is a pair, but sometimes it is a triple, a quadruple, or an even larger group. The distribution of pairs, triples, and so on depends on the duplication rate and the size of the file. The larger the sample file and the higher the duplication rate, the more likely it is that larger duplicate groups will occur. Two illustrative ways of defining the rate are sketched after the list below.


  1. Two definitions for duplication rate
  2. Proportions of records
  3. Calculating an entity duplication rate
  4. Estimating numbers of duplicate n-tuples
  5. Anomalous duplication rates
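
As a concrete illustration of the two definitions listed above (these particular forms are assumptions for illustration; the chapter's own definitions may differ in detail), one rate can be taken over records and one over entities:

    def duplication_rates(group_sizes, n_records):
        # group_sizes: sizes of the duplicate groups in the file
        #              (each pair contributes 2, each triple 3, ...)
        # n_records:   total number of records in the file
        n_grouped = sum(group_sizes)                  # records with a duplicate
        n_entities = n_records - sum(s - 1 for s in group_sizes)
        record_rate = n_grouped / n_records           # 1) record-based rate
        entity_rate = len(group_sizes) / n_entities   # 2) entity-based rate
        return record_rate, entity_rate

    # A file of 100 records containing two pairs and one triple:
    print(duplication_rates([2, 2, 3], 100))   # -> (0.07, 0.03125)

The two rates diverge most when large groups are common, which, per the paragraph above, is what happens in large files with high duplication.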

Blocking precision.   In contrast to blocking recall in equation 3.1, precision is rather more difficult to capture in an equation. This measure is the proportion of matched records in a block. Of course, all matched records that are recalled are in blocks, but there are usually unmatched records there too; these unmatched records are the noise. Now suppose we are searching a very large file. If we use the same blocking fields as we did on a small file, agreement in these fields will probably define larger blocks. The only way to retain blocks of the same size is to choose a field whose number of distinct values grows in proportion to the size of the file. A pair-based sketch of precision follows the list below.


  1. A definition for blocking precision
  2. Matched records in blocks
  3. Unmatched records (noise) in blocks
  4. A specific expression for blocking precision
  5. An iterative approach to blocking precision
  6. Measuring coincidence
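
A minimal sketch of the pair-based view of precision (an assumed formulation, using the blocks index built earlier and a hypothetical same_entity predicate): the measure is the fraction of the record pairs brought together by the blocks that are true matches, the remainder being the noise:

    from itertools import combinations

    def block_precision(blocks, same_entity):
        # blocks:      dict mapping key value -> list of records in that block
        # same_entity: predicate telling whether two records are duplicates
        matched = total = 0
        for recs in blocks.values():
            for a, b in combinations(recs, 2):   # every comparison weighting makes
                total += 1
                matched += bool(same_entity(a, b))
        return matched / total if total else 1.0

The scaling remark above can also be made quantitative under a uniform-value assumption: a field with v distinct values splits a file of N records into blocks of expected size N/v, so v must grow in proportion to N to hold block sizes, and with them the comparison workload, constant.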