Section 1-5 RECORD LINKAGE ALGORITHMS & DUPLICATE GROUPS



Entity class transformations of the lineage-linked model and engine.   The matrix in Table 2 diagrams the class of transformations needed to view and access the individuals of a lineage-linked model. This matrix is much simpler than the one for the FGRA because the single distinctive family has only four positions that an individual can occupy. It allows the record linkage system to define an individual either as a child (C) in a family or as a parent (P).

Columns 1 to 3; cells of each row listed left to right.

Row 1:   CM
Row 2:   PM    PM/PF    PF
Row 3:   CF

Table 2: Transformation Matrix; Lineage-linked Model

Figure 4  Symbols for Individuals in a Lineage-Linked Family

Algorithm I or P* linkage.   The central sibship (FA) on the family group sheet consists of a husband (H=PM) and a wife (W=PF). If a husband or wife marries more than once, there may also be cross-reference information to other wives of the husband (OW) or other husbands of the wife (OH) in secondary sibships (FB). In that case the other family group in which this spouse participates may well be on file too, with a duplicate H (or W) and with the OW (or OH) represented as a W (or H). The goal of algorithm I is to detect this case of duplication. The algorithm identifies duplicate families (FA linked to FB) and individual spouses in multiple families of procreation (FP). On the FGRA transformation matrix, this involves rows 1 and 2 for the husband and rows 5 and 6 for the wife.
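The case described for algorithm I can be sketched as follows. This is a minimal illustration only: the `Sheet` type, the `norm` helper, and the rule that two entries match when their normalized names are identical are assumptions made for the example, not the matching criteria of the system described here, which would presumably also weigh dates and places.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    """Hypothetical reduced family group sheet (names only)."""
    fam_no: int
    husband: str                                         # H
    wife: str                                            # W
    other_wives: list = field(default_factory=list)      # OW entries for H
    other_husbands: list = field(default_factory=list)   # OH entries for W


def norm(name):
    """Assumed matching key: case- and whitespace-insensitive name."""
    return " ".join(name.lower().split())


def algorithm_one(sheets):
    """Return (fam_a, fam_b, role) links where the H (or W) of sheet A also
    heads sheet B with a partner that sheet A lists as OW (or OH)."""
    links = []
    for a in sheets:
        ow = {norm(n) for n in a.other_wives}
        oh = {norm(n) for n in a.other_husbands}
        for b in sheets:
            if a.fam_no == b.fam_no:
                continue
            # Same husband on both sheets; B's wife is an OW on sheet A.
            if norm(a.husband) == norm(b.husband) and norm(b.wife) in ow:
                links.append((a.fam_no, b.fam_no, "H"))
            # Same wife on both sheets; B's husband is an OH on sheet A.
            if norm(a.wife) == norm(b.wife) and norm(b.husband) in oh:
                links.append((a.fam_no, b.fam_no, "W"))
    return links


sheets = [
    Sheet(1, "John Smith", "Mary Jones", other_wives=["Ann Lee"]),
    Sheet(2, "john  smith", "Ann Lee"),
    Sheet(3, "Peter Brown", "Mary Jones"),
]
links = algorithm_one(sheets)
```

Here sheet 1 is linked to sheet 2 through the husband, while sheet 3 is not linked because sheet 1 lists no other husband for the wife.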

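Columns 1 to 9; cells of each row listed left to right.

Row 1:   KH    KH/QH    QH    H     H/O      OW
Row 2:   KH    KH/QH    QH    H     H/W      W
Row 3:   H     H/W      W     S     S/G      GF
Row 4:   GM    G/D      D     W     H/W      H
Row 5:   H     H/W      W     QW    KW/QW    KW
Row 6:   OH    O/W      W     QW    KW/QW    KW

Table 1: Transformation Matrix; FGRA Model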

Algorithm II or O* linkage.   Any child (C) of the central nuclear family (FA) may have an additional set of parents on some other sheet (FB). One example is when one or both of the natural parents are not involved in the upbringing of the child, as with adoption. Algorithm II detects this kind of duplication, S=CA as S=CB or D=CA as D=CB; that is, it finds duplicate individuals as children in multiple families of orientation (CA linked to CB). The same process may find H=CA as H=CB or W=CA as W=CB, but in this case the parents’ names are their only identifiers. No set of rows on the FGRA transformation matrix deals with this situation. In fact, any individual position that shows parents’ names may manifest discrepancies that leave it uncertain which set of parents is which. We single out S, D, H, and W because their identifying information speaks loudest for their being the same individual, and their parents’ information does likewise.
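A minimal sketch of the grouping step behind algorithm II is given below. The `ChildEntry` fields and the use of name plus birth year as the identity key are assumptions for illustration; the source does not specify which identifying information is compared.

```python
from collections import defaultdict
from typing import NamedTuple


class ChildEntry(NamedTuple):
    """Hypothetical S or D entry taken from a family group sheet."""
    fam_no: int
    name: str
    birth_year: int
    father: str
    mother: str


def algorithm_two(entries):
    """Map each child identity to the sheets on which it appears as a child,
    keeping only children found in more than one family of orientation."""
    by_person = defaultdict(list)
    for e in entries:
        # Assumed identity key: lower-cased name plus birth year.
        by_person[(e.name.lower(), e.birth_year)].append(e)
    result = {}
    for key, group in by_person.items():
        fams = sorted({e.fam_no for e in group})
        if len(fams) > 1:
            result[key] = fams
    return result


entries = [
    ChildEntry(10, "Ann Lee", 1850, "Sam Lee", "Kay Roe"),   # natural parents
    ChildEntry(11, "Ann Lee", 1850, "Ned Poe", "Ida Poe"),   # adoptive parents
    ChildEntry(10, "Tom Lee", 1852, "Sam Lee", "Kay Roe"),
]
multi = algorithm_two(entries)
```

In this example only Ann Lee is reported, as CA on sheet 10 and CB on sheet 11.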

Algorithm III or O(P) linkage.   Each child of the central sibship (FA) may begin its own families through single, sequential, or polygamous marriages. Such children may have cross-reference information to first spouses (G) and subsequent spouses (GO). In that case the family group sheet in which each spouse participates as H (or W) may well be on file, thus duplicating the S (or D) and the G and GOs, which are represented there as a W (or H). The goal of algorithm III is to detect this case of duplication. The algorithm identifies duplicate families (FA linked to FB) where the individual’s family of procreation (FP = FB) is referred to on the family group sheet for his/her family of orientation (FO = FA). On the FGRA transformation matrix, this involves rows 2 and 3 for the son/husband and rows 4 and 5 for the daughter/wife.
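The O(P) case can be sketched as a lookup from each child and listed spouse to a sheet headed by that couple. The `Sheet` layout and exact lower-cased name matching are assumptions for the example, not the comparison rules of the system itself.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    """Hypothetical reduced family group sheet."""
    fam_no: int
    husband: str
    wife: str
    # Each child: (name, "S" or "D", [spouse names: G first, then GOs])
    children: list = field(default_factory=list)


def algorithm_three(sheets):
    """Return (fo_fam, fp_fam, child_position, union_no) links where a child
    and a listed spouse on sheet FO appear as H and W on another sheet FP."""
    couples = {(s.husband.lower(), s.wife.lower()): s.fam_no for s in sheets}
    links = []
    for fo in sheets:
        for name, sex, spouses in fo.children:
            for union_no, spouse in enumerate(spouses, start=1):
                if sex == "S":      # son is H, his spouse is W
                    pair = (name.lower(), spouse.lower())
                else:               # daughter is W, her spouse is H
                    pair = (spouse.lower(), name.lower())
                fp = couples.get(pair)
                if fp is not None and fp != fo.fam_no:
                    links.append((fo.fam_no, fp, sex, union_no))
    return links


sheets = [
    Sheet(1, "John Smith", "Mary Jones", children=[
        ("Tom Smith", "S", ["Eve Hall", "Sue Park"]),
        ("Amy Smith", "D", ["Bob Ray"]),
    ]),
    Sheet(2, "Tom Smith", "Eve Hall"),
    Sheet(3, "Bob Ray", "Amy Smith"),
    Sheet(4, "Tom Smith", "Sue Park"),
]
links = algorithm_three(sheets)
```

The union number in each link distinguishes the first spouse (G) from later spouses (GO) of the same child.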

Algorithm IV or P(O) linkage.   The last algorithm finds duplicates of S (or D) with G, whether or not there is a corresponding H and W. This is the family of orientation that may or may not be indicated in a family of procreation. On the FGRA transformation matrix, this involves rows 3 and 4 for the son/spouse and rows 4 and 3 for the daughter/spouse.

An additional use of algorithm IV.   Every husband and wife may have a set of parents’ names given (K and Q). Where a husband or wife has duplicates by algorithm III and algorithm II, it is possible to determine that these parents are ambiguously natural or adopted, etc. Multiple parents’ names also result from siblings each forming his/her own family of procreation as either H or W. A family group for their parents as H and W may or may not be on file. Nothing on the FGRA transformation matrix deals in this way with the parents of husband and wife.
In either of the two cases, algorithm IV finds duplicates among the parents of husband and wife: an O-family indicated on the sheet for the individual’s P-family.
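As a rough sketch of this use, parent couples named on different sheets can be grouped under one key. The dictionary layout of a sheet, the `couple_key` helper, and matching on an unordered pair of normalized names are assumptions for illustration; the position labels H/W, KH/QH, and KW/QW are the sibship positions used elsewhere in this section.

```python
from collections import defaultdict


def couple_key(a, b):
    """Assumed key for a couple: unordered pair of normalized names."""
    return frozenset({" ".join(a.lower().split()), " ".join(b.lower().split())})


def algorithm_four(sheets):
    """Group sheet positions that name the same couple, as H/W of a sheet or
    as the parents of a husband (KH/QH) or wife (KW/QW) on another sheet."""
    groups = defaultdict(list)
    for s in sheets:
        groups[couple_key(s["H"], s["W"])].append((s["fam_no"], "H/W"))
        for position in ("KH/QH", "KW/QW"):
            parents = s.get(position)   # optional pair of parents' names
            if parents:
                groups[couple_key(*parents)].append((s["fam_no"], position))
    # Only couples named in more than one place are duplicate candidates.
    return {k: v for k, v in groups.items() if len(v) > 1}


sheets = [
    {"fam_no": 1, "H": "John Smith", "W": "Mary Jones"},
    {"fam_no": 2, "H": "Tom Smith", "W": "Eve Hall",
     "KH/QH": ("John Smith", "Mary Jones"), "KW/QW": ("Al Hall", "Bea Kay")},
    {"fam_no": 3, "H": "Bob Ray", "W": "Amy Smith",
     "KW/QW": ("Mary Jones", "John Smith")},
]
dups = algorithm_four(sheets)
```

The grouping works whether or not sheet 1 is on file: without it, the couple is still found twice, once as parents of a husband and once as parents of a wife.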

Possible structure for individual duplicate groups.   When record linkage detects duplicates, it makes each of the records that represent the same individual a member of the same individual duplicate group and each of those that represent the same family a member of the same sibship duplicate group.
Each individual duplicate record has the following identifiers: linked individual (LINo), family group sheet number (FamNo), position on sheet (K, Q, H, W, OH, OW, S, D, G, GO), sequence number, and union number. The husband is always sequence 1 and the wife sequence 2. Each of the n children then has a sequence number: 3, 4, ..., n+2. The union number is always 0 for H and W, but runs from 1 to m for O or G, where m is the number of additional marriages listed for H (or W) or for S (or D).
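The numbering rules can be illustrated with a short sketch. Several details are assumptions not stated in the text: that a spouse entry carries the sequence number of the person married, that a child's own union number is 0, and that the first listed spouse of a child is G with later ones GO. The function name and argument layout are hypothetical, and K and Q are omitted because no numbering is given for them.

```python
def number_positions(children, other_wives=0, other_husbands=0):
    """Return (position, sequence, union) triples for one sheet.

    children: list of (position, n_spouses), position being "S" or "D".
    """
    rows = [("H", 1, 0), ("W", 2, 0)]          # H is sequence 1, W is 2
    # Assumption: other spouses share the sequence number of H or W.
    rows += [("OW", 1, u) for u in range(1, other_wives + 1)]
    rows += [("OH", 2, u) for u in range(1, other_husbands + 1)]
    # Children take sequence numbers 3, 4, ..., n+2 in sheet order.
    for seq, (pos, n_spouses) in enumerate(children, start=3):
        rows.append((pos, seq, 0))             # assumption: union 0 for the child
        for u in range(1, n_spouses + 1):
            rows.append(("G" if u == 1 else "GO", seq, u))
    return rows


rows = number_positions([("S", 2), ("D", 0)], other_wives=1)
```

For a sheet with one other wife, a twice-married son, and an unmarried daughter, this yields sequence numbers 1 to 4 and union numbers 0 to 2.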

Possible structure of sibship duplicates.   Each sibship duplicate record has the following identifiers: linked sibship (LSNo), family group sheet number (FamNo), position of sibship union on sheet (KH/QH, KW/QW, H/W, H/O, W/O, S/G, D/G, S/GO, D/GO), sequence number, and union number. Because a sheet may list more than one male or female child, these positions need a sequence number. Because O and G may each be one of a list of spouses, they need a union number.
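One way to turn pairwise links into duplicate groups is a union-find pass that gives every linked record the same LSNo. This is a sketch under assumptions: records are plain (FamNo, position, sequence, union) tuples, the links are taken as given, group numbers are assigned in order of first appearance, and the sample sequence and union values are illustrative only.

```python
def assign_group_numbers(records, links):
    """Map each record to a group number (LSNo) so that all records joined
    directly or indirectly by a link share the same number."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    # Number groups 1, 2, ... in order of their first record.
    return {r: ls_no
            for ls_no, members in enumerate(groups.values(), start=1)
            for r in members}


records = [
    (1, "S/G", 3, 1),
    (2, "H/W", 1, 0),
    (3, "H/W", 1, 0),
    (4, "KH/QH", 1, 0),
]
links = [(records[0], records[1]), (records[1], records[3])]
ls_no = assign_group_numbers(records, links)
```

The same function would serve for individual duplicate groups, with LINo in place of LSNo.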