Section 5-4 VALUE SPECIFICITY & CO-DEPENDENCE



Using generic agreement/disagreement weights.   It is possible to use the coincidence value for a field as a whole rather than the individual coincidence value of each possible value in the field. In our test database of individuals in Norway the coincidence of the Principal's Given Name code was 0.0250 and its reliability 0.9617. The field is present in both records being compared 96.79% of the time. This means that the generic agreement weight from equation 4.6 would be +5.27, corresponding to odds of about 75 to 2. Since “Anna” and its variants are more common than average, the coincidence value of 0.0732 for that name reduces its agreement weight to +3.76 (odds of about 27 to 2). A unique name has a coincidence value of 0.0000072, which increases its agreement weight to +17.07, or adjusted odds of 137,588 to one.
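
The comparison between generic and value-specific agreement weights can be sketched in a few lines. Equation 4.6 is not reproduced in this section, so the sketch assumes the simple form aw = log2(r ÷ c) that appears later in equation 5.8; the function names are illustrative only. Under that assumption the generic weight comes out at about +5.27, while the value-specific results land close to, but not exactly on, the figures quoted above, which suggests that equation 4.6 carries a term not modeled here.

```python
import math

def agreement_weight(reliability, coincidence):
    """Agreement weight in bits, ASSUMING aw = log2(r / c) as in equation 5.8."""
    return math.log2(reliability / coincidence)

def odds(weight):
    """Convert a weight in bits back to odds (to one)."""
    return 2.0 ** weight

RELIABILITY = 0.9617  # Principal's Given Name code, from the text

generic = agreement_weight(RELIABILITY, 0.0250)     # field-level coincidence
common = agreement_weight(RELIABILITY, 0.0732)      # "Anna" and its variants
unique = agreement_weight(RELIABILITY, 0.0000072)   # a unique name

for label, w in (("generic", generic), ("common", common), ("unique", unique)):
    print(f"{label:8s} weight {w:+.2f}  odds about {odds(w):,.0f} to 1")
```

The point of the sketch is the ordering: a common value earns less than the generic weight, and a rare value earns far more.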

Sensitivity of agreement/disagreement weights.   The disagreement weights are sensitive to reliability and insensitive to coincidence. Without accounting for value specificity, the PGN-Code disagreement weight according to equation 4.7 is 7.30, corresponding to odds of about 6 in 1,000. The weight varies only slightly with the values that disagree. For example, suppose one record has “Anna” and the second record has “Karen,” both very common names. The complement of the coincidence value of the first is 0.9268 and of the second 0.9557. The reliability is not value specific, and its difference from presence is 0.0062. The disagreement weight in this case calculates to 7.22. Now suppose both names are unique, with the very small coincidence value given above (a highly coincidental event). The complement is 0.9999927, making the weight 7.33.
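
The figures in this paragraph can be checked with a short calculation. Equation 4.7 is not shown in this section, so the form used below, log2 of (presence minus reliability) divided by the complement of the coincidence, is an assumption inferred from the numbers quoted above. It yields negative values whose magnitudes match 7.30, 7.22, and 7.33; the 7.22 figure is reproduced when the complement for “Anna” (0.9268) is used alone.

```python
import math

PRESENCE = 0.9679     # field present in both records, from the text
RELIABILITY = 0.9617  # from the text

def disagreement_weight(presence, reliability, coincidence):
    """Disagreement weight in bits.

    ASSUMED form (inferred, not quoted from equation 4.7):
    log2((presence - reliability) / (1 - coincidence)).
    """
    return math.log2((presence - reliability) / (1.0 - coincidence))

generic = disagreement_weight(PRESENCE, RELIABILITY, 0.0250)
anna = disagreement_weight(PRESENCE, RELIABILITY, 0.0732)
unique = disagreement_weight(PRESENCE, RELIABILITY, 0.0000072)

# The results are negative; the text quotes their magnitudes.
for label, w in (("generic", generic), ("Anna", anna), ("unique", unique)):
    print(f"{label:8s} {w:.2f}")
```

The spread between the three results is small compared with the spread in agreement weights, which is the insensitivity to coincidence described above.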

Effects of grouping assumption.   Related to value specificity is the bias introduced by the very grouping of variants. Newcombe (1993) calls this the “grouping assumption.” Ideally one might want to track separate statistics for occurrences of “Anna,” “Anne,” “Ann,” etc., on the idea that a disagreement in actual spelling may be significant. We cannot use actual spelling as a separate field without taking the steps outlined in ¶¶ 5-3.4 through 5-3.9, since doing so would violate the independence assumption.

Measuring value specificity.   In one study we were able to measure the actual frequency of occurrence in a test database of the 50 most common values for selected fields. Using this information it was possible to model some of the effects of value specificity. However, the sheer number of different values that may occur in certain fields precludes calculating every possibility. The Norwegian individual test database contained 1850 different Principal’s Given Name Codes (names for both sexes), and about half that many Father’s Given Name Codes (for males only). There were 200 different Year values. There were 45 different Event Town values (all the towns in the one county of Akershus).

Using a statistical distribution.   The idea of calculating every combination, even using a computer, is daunting. Two principles work together to simplify the task without obliging us to ignore the effects of value specificity completely. The first is that we may predict the relative frequency distribution of the specific value frequencies. The second is that we may use the distribution to make our own relative frequency categories as needed for the desired accuracy. Since any ordering of the specific values is arbitrary, we order them by relative frequency. The problem reduces to modeling the distribution of the specific values and partitioning the predicted coincidence values into as many or as few values as may be practical to use.

Fitting to the decay curve as a model.   There is a formula commonly used for modeling growth and decay in science and economics that appears to model our personal name data quite well. In those applications the formula describes change (c) over time (t) from a starting value (y0). In our case we look at the change in relative frequency from rank to rank (r). In its most general form the relationship is given by equation 5.5, where e = 2.718 . . . is the base of the natural logarithms.
$$f(r) = a \times e^{b(r - 1)} + c \tag{5.5}$$
We choose to rank the relative frequencies from greatest to least, and so the constant b is negative. When r = 1 the exponential factor equals 1, so a + c is the relative frequency of the most common field value (simply a when c = 0). Each spelling of a name is called a type. When we take the token count of each type and divide by the total number of tokens of all types, we arrive at a relative frequency for that type. Knowing the relative frequencies of the various names, we can find the best fit by choosing appropriate values for a, b, and c.
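
As a minimal sketch of such a fit, the code below turns token counts into ranked relative frequencies and estimates a and b by ordinary least squares on the logarithm of the frequencies. It makes a simplifying assumption that is not in the text: the offset c is held fixed (zero by default), which makes the problem linear; fitting all three constants at once would need a nonlinear least-squares routine. Here r is the rank, and the function names and the synthetic data are hypothetical.

```python
import math

def relative_frequencies(token_counts):
    """Token count of each type divided by the total, ranked greatest to least."""
    total = sum(token_counts)
    return sorted((n / total for n in token_counts), reverse=True)

def decay(r, a, b, c=0.0):
    """Equation 5.5: f(r) = a * e^(b(r - 1)) + c."""
    return a * math.exp(b * (r - 1)) + c

def fit_decay(freqs, c=0.0):
    """Fit a and b with c held fixed (an ASSUMPTION made to keep the fit linear).

    Uses least squares on ln(f - c) = ln(a) + b(r - 1); needs f > c throughout.
    """
    xs = list(range(len(freqs)))            # r - 1 for r = 1, 2, ...
    ys = [math.log(f - c) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic check: 50 ranked frequencies generated from known constants.
synthetic = [decay(r, 0.07, -0.05) for r in range(1, 51)]
a_hat, b_hat = fit_decay(synthetic)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

With real data the fitted curve will not pass through every point, and the residuals give a rough indication of how well the decay model suits a particular field.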

Partitioning relative frequencies into categories.   By integrating formula 5.5 it becomes possible to partition the values of Given Name Codes, for example, into, say, five categories: very common, common, familiar, unusual, and rare. The category boundaries are chosen so as to divide the spectrum into five equally likely divisions. Each category then has a representative coincidence value for use with comparisons that agree when unmatched or disagree when matched. If we choose a value specific field as a blocking field, the precision will depend on which representative coincidence value defines the block; hence, there are five precision values. Similarly, weighting precision is influenced by as many additional comparisons as there are weighted fields that have multiple coincidence values. Having multiple coincidence values affects both the number of weights calculated and the number of probabilities for those weights to be unmatched.
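
One way to carry out this partition is sketched below. The integral of formula 5.5 from rank 1 to rank r has a closed form, and the boundary of each category is the rank at which that integral reaches the next fifth of the total. Two choices here are assumptions rather than anything stated in the text: the boundaries are found by bisection, and the representative coincidence value of a category is taken to be the mean height of the curve across it. The constants a, b, and c are hypothetical; only the count of 1850 name codes comes from the text.

```python
import math

def cumulative(r, a, b, c):
    """Integral of a*e^(b(x - 1)) + c for x from 1 to r."""
    return (a / b) * (math.exp(b * (r - 1)) - 1.0) + c * (r - 1)

def partition(a, b, c, n_ranks, k=5):
    """Rank boundaries that split the area under the curve into k equal parts."""
    total = cumulative(n_ranks, a, b, c)
    bounds = [1.0]
    for i in range(1, k):
        target = total * i / k
        lo, hi = bounds[-1], float(n_ranks)
        for _ in range(60):                      # bisection on a rising integral
            mid = (lo + hi) / 2
            if cumulative(mid, a, b, c) < target:
                lo = mid
            else:
                hi = mid
        bounds.append((lo + hi) / 2)
    bounds.append(float(n_ranks))
    return bounds

def representative_values(bounds, a, b, c):
    """Mean curve height per category (an ASSUMED choice of representative)."""
    reps = []
    for left, right in zip(bounds, bounds[1:]):
        mass = cumulative(right, a, b, c) - cumulative(left, a, b, c)
        reps.append(mass / (right - left))
    return reps

# Hypothetical constants for a field with 1850 distinct values.
A, B, C, N = 0.05, -0.06, 0.00009, 1850
bounds = partition(A, B, C, N)
reps = representative_values(bounds, A, B, C)
labels = ["very common", "common", "familiar", "unusual", "rare"]
for label, left, right, rep in zip(labels, bounds, bounds[1:], reps):
    print(f"{label:12s} ranks {left:7.1f} to {right:7.1f}  coincidence {rep:.6f}")
```

Because each category carries the same share of the total, the first category spans only a handful of ranks while the last spans most of the name list.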

Co-dependence in personal names of relatives.   There are some additional sources of bias involved in choosing blocking and weighting fields. One of these is the use of a family identifier to identify an individual. The surname is typically a family name, and so belongs to all the siblings in a family (Newcombe, 1992). The odds that an agreement indicates a match when comparing such family identifiers are inversely proportional to the size of the family.

Use of a family identifier.   If an individual uses his father's name as one of his own identifiers, the coincidence value as a relative frequency does not really express the probability that the names will agree by chance (and not represent the same individual). The chance that the father's name (either given name or surname) or the mother's name refers to the same individual in both records depends on the number of siblings the name might refer to instead. To approximate this factor we could assume that the average number of siblings that a person might have is around six. In this case the surname would be one-sixth as reliable as we would otherwise expect. Hence, we have something like equations 5.6 and 5.7 for the surname field and the parents' name fields.
$$\text{aw} = \log_2 \left[ (r \div 6) \div c \right] \tag{5.6}$$
$$\text{dw} = \log_2 \left[ (1 - [r \div 6]) \div (1 - c) \right] \tag{5.7}$$
The siblings within a family are also prone to have different given names. Our sense is that the reliability of the given name would increase when the names of the parents agree. Yet the fact that some children may receive the names of previously (or shortly to be) deceased children complicates this consideration. These are all factors judged manually by the genealogist on a case-by-case basis.
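
The effect of the sibling adjustment in equations 5.6 and 5.7 can be seen with a small calculation. Here r is the reliability (not the rank used in formula 5.5), the reliability and coincidence values are hypothetical, and the function name is illustrative. The sketch simply evaluates the two equations with the divisor of six exposed as a parameter.

```python
import math

def family_weights(reliability, coincidence, children=6):
    """Agreement and disagreement weights for a family identifier.

    Follows equations 5.6 and 5.7, with reliability divided by the
    assumed number of children sharing the name (six in the text).
    """
    shared = reliability / children
    aw = math.log2(shared / coincidence)
    dw = math.log2((1.0 - shared) / (1.0 - coincidence))
    return aw, dw

# Hypothetical surname statistics, for illustration only.
r, c = 0.95, 0.01
aw_family, dw_family = family_weights(r, c)
aw_plain = math.log2(r / c)  # the same field without the sibling adjustment

print(f"agreement weight: {aw_plain:+.2f} unadjusted, {aw_family:+.2f} adjusted")
print(f"disagreement weight (adjusted): {dw_family:+.2f}")
```

Dividing the reliability by six lowers the agreement weight by log2(6), a little over two and a half bits, whatever the coincidence value happens to be.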

Compound surname by marriage.   Another specific case to consider is a person who carries a name such as “Barker-Cuthbert” for a surname. What are the consequences of viewing this as two surnames? Maybe some record keeper analyzes this surname into two pieces and chooses one of them to record. In this case there seems little difficulty in estimating a value specific comparison weight for “Barker-Cuthbert.” There would be a moderate weight for <Barker> and another one for <Cuthbert>. The sum of these two weights would be comparable to a single high field weight, much as we would guess the weight for <Barker-Cuthbert> ought to be. But two common names would weigh too little together, and two very rare names would be more like having two field weights.

The key to combining the weights is the realization that it is the coincidence value, not the reliability, that needs adjusting. The coincidence (c) for a combination of names ought to be the product of the coincidences of each component name (c1 × c2).
$$\text{aw} = \log_2 \left[ r \div (c_1 \times c_2) \right] \tag{5.8}$$
$$\text{dw} = \log_2 \left[ (1 - r) \div (1 - [c_1 \times c_2]) \right] \tag{5.9}$$
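
It may help to see how the combined agreement weight relates to the weights of the two component names. The short derivation below assumes only that each component, taken alone, would have the agreement weight log2(r ÷ c_i) with the same reliability r; the symbols aw_1 and aw_2 are introduced here for illustration.

```latex
\begin{aligned}
\text{aw} &= \log_2 \frac{r}{c_1 c_2} \\
          &= \log_2 \frac{r}{c_1} + \log_2 \frac{r}{c_2} - \log_2 r \\
          &= \text{aw}_1 + \text{aw}_2 - \log_2 r
\end{aligned}
```

Since r is at most 1, the term −log2 r is never negative, and it is small when r is close to 1 (about 0.06 for r = 0.9617, the reliability quoted earlier for given name codes). Under this assumption the combined weight is therefore only slightly more than the sum of the two component weights.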

Latin surname compounding.   People in Latin cultures usually carry as a surname either the first part of the surname (mostly Spanish) or the last part of the surname (Portuguese) of each of their parents. These customs were more consistent in the past. What are the consequences of viewing this as two surnames? No native record keeper would knowingly analyze such a surname into two pieces and choose one of them to record. But here there is a different justification for splitting the name apart. Keeping value specific weights for every possible combination would seem impractical. Thousands of name pieces become millions of combinations. Apparently we can estimate the value specific weight for the combined form in the surname field from that of its constituents as in equations 5.10 and 5.11.
$$\text{aw} = \log_2 \left[ r \div (c_1 \times c_2 \times 6) \right] \tag{5.10}$$
$$\text{dw} = \log_2 \left[ (1 - [r \div 6]) \div (1 - [c_1 \times c_2]) \right] \tag{5.11}$$

We use the same logic as in ¶ 5-4.10 above, and a factor of 1/6 as discussed in ¶ 5-4.9, since there is not just one person carrying the name; all the children of the union share the same compound surname. (NB: In certain cultures another fraction would be more appropriate.)
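
The two kinds of compound surname differ only in whether the sharing factor applies, so both can be sketched with one function. With a sharing factor of 1 the code evaluates equations 5.8 and 5.9; with a factor of 6 it evaluates equations 5.10 and 5.11. The reliability and coincidence values are hypothetical, and the function and parameter names are illustrative.

```python
import math

def compound_weights(reliability, c1, c2, sharing=1):
    """Weights for a two-part surname with combined coincidence c1 * c2.

    sharing=1 gives equations 5.8 and 5.9 (compound by marriage);
    sharing=6 gives equations 5.10 and 5.11 (name shared by six children).
    """
    coincidence = c1 * c2
    shared = reliability / sharing
    aw = math.log2(shared / coincidence)
    dw = math.log2((1.0 - shared) / (1.0 - coincidence))
    return aw, dw

# Hypothetical statistics for the two name pieces.
r, c1, c2 = 0.95, 0.01, 0.02
aw_marriage, dw_marriage = compound_weights(r, c1, c2)
aw_latin, dw_latin = compound_weights(r, c1, c2, sharing=6)

print(f"by marriage: aw {aw_marriage:+.2f}, dw {dw_marriage:+.2f}")
print(f"Latin:       aw {aw_latin:+.2f}, dw {dw_latin:+.2f}")
```

Exposing the factor as a parameter also accommodates the closing note that another fraction may suit other cultures better.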