Chapter 1: LINKAGE RECORDS & LINKAGE ENTITIES



The first stage in performing record linkage is to analyze and define the resource data that must be linked. There are basically two kinds of record linkage projects:
1) making a resource accessible as records of entities to be queried against, and
2) bringing data about an entity in two resources together into one new resource.
In the first case, the entity corresponding to a query must be made accessible. In the second case, each entity in one resource becomes a query; when one or more additional records of that entity are found, their data can be reconciled against the query and a new record built as the best version. These best-version records form the new resource.

Genealogical Record Linkage Resources.   Record linkage is the technology necessary to discover matching records. For two records to match, they must by definition relate to the same entity. Challenges arise when the entities are of the same kind but are not identified in the same way. In a census schedule, for example, the individuals may be identified by name, sex, age, birth place, marital status, and relationship to the head of the household. In contrast, in a marriage return, the individuals may have only a name, sex, and age. If we are to link records of individuals, we need to make their identifiers in these two types of records compatible for comparison. Name and sex are often rather directly comparable (more about this later). An age, taken together with the date of the event (census versus marriage), can be transformed into an estimate of the individual’s birth date. Further, if we are to link records of the families, it will be necessary to transform the husband-wife relationship in the census to the groom-bride relationship implied by the marriage. There would be a requirement that the census event supporting this linkage chronologically follow the marriage event, since it is the marriage that begins the family. In addition, the successful linkage of the individuals would be another precondition on the family linkage. In other words, having the same principals on both records generally* implies that they form the same family.
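The transformation of an age and an event date into a comparable birth-date identifier can be sketched as follows. This is only a minimal illustration in Python, not the method of this paper: the names LinkageRecord, birth_year_range, to_linkage_record, and may_match are hypothetical, the age is assumed to be given in completed years, and name comparison is reduced to a case-insensitive equality test.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class LinkageRecord:
    """Hypothetical common record for comparing census and marriage entries."""
    name: str
    sex: str
    birth_year_min: int
    birth_year_max: int


def birth_year_range(event_date: date, age: int) -> Tuple[int, int]:
    # A person aged `age` (completed years) at the event was born in one of
    # two calendar years, depending on whether the birthday had passed.
    return (event_date.year - age - 1, event_date.year - age)


def to_linkage_record(name: str, sex: str, age: int, event_date: date) -> LinkageRecord:
    low, high = birth_year_range(event_date, age)
    return LinkageRecord(name.strip().casefold(), sex.upper(), low, high)


def may_match(a: LinkageRecord, b: LinkageRecord) -> bool:
    # Assumption: names and sexes must agree and the birth-year ranges overlap.
    ranges_overlap = (a.birth_year_min <= b.birth_year_max
                      and b.birth_year_min <= a.birth_year_max)
    return a.name == b.name and a.sex == b.sex and ranges_overlap


census = to_linkage_record("John Smith", "M", 30, date(1880, 6, 1))
marriage = to_linkage_record("john smith", "m", 25, date(1875, 3, 10))
print(may_match(census, marriage))
```

Both records in the usage lines yield the same birth-year range, so they remain candidates for individual linkage; a real system would weight the fields rather than require exact agreement.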

Linkage Of Genealogical Records.   It appears possible to classify records ready for linkage according to both the kind of entity that the record represents and the kind of identifiers it contains. In the example of the last paragraph we implied that one kind of linkage may depend on the success of another. In genealogy it will be best at first to confine ourselves to a limited number of entity types. This paper will distinguish three levels and three subtypes with records defined for each:
1) event record linkage for documents,
a) locality record linkage for event identifiers,
b) date record linkage for event identifiers,
2) individual record linkage for persons, and
3) family record linkage for nuclear families uniting persons (and events) together in a single structure,
c) name record linkage for identifiers of families, persons, places, and other entities.
Success in linking spellings may permit individuals and places to be identified in a more consistent way. Success in linking events and individuals will allow other individuals and families to be identified and related more consistently.

Models & Engines.   Genealogical resources contain records of different kinds and each may or may not correspond to a kind of genealogical record linkage. To link entities that are represented by different kinds of records we need to bring the identifying data into a single record — one which has the fields appropriate for the desired linkage result. For example, suppose the genealogical record linkage system is to link two records as representing the same event. The system needs to view and access two different documents and see them as recording the same event. To build compatible linkage records the system requires a model of the record structure. To view or access the records requires an engine. An automated system builds documents following the document model and when it accesses them, it uses the document engine. When this system attaches names to places, it must apply the event model and engine. When it links events and names into individuals, it must apply the individual model and engine. When it links events and individuals into families, it must add the capabilities of a lineage-linked model and engine.

This paper cannot discuss all of these models and engines in much detail, but here are brief introductions to a few topics that may shed light on their general outlines:
1) the lineage-linked model for understanding the role of models,
2) the basic genealogical relationships for modeling kinships,
3) some entity class transformations for relating linkage entities,
4) data propagation rules for overcoming the challenge of missing data, and
5) individuals and their merotypes for representing persons in their genealogical contexts.

FGRA Corpus Analysis.   The ideas of this section came about in developing algorithms required for use on a particularly large corpus of data. However, the principles of analysis used should be applicable in general for record linkage in any other collection. The analysis involved record linkage algorithms designed to derive a lineage-linked pedigree from the individuals in a set of family groups. The corpus comprised the family group sheets of the 1942–1970 archives collection — the family group record archives (FGRA). Universal data entry (UDE) had computerized these data and fractionated the families into individuals for inclusion in the International Genealogical Index (IGI). The first step of that process in effect translated the data in the documents from a family group sheet model into GEDCOM format. The second step, however, transformed the data into the IGI individual and family (marriage) model.

The following paragraphs are intended to give a cursory understanding of:
1) individuals and
2) families on the family group sheet,
3) linking the family group sheets, and
4) individual linkage entities.

Record Linkage Algorithms & Duplicate Groups.   This section first examines transformations involving the elements of the lineage-linked model, involving an analysis similar to the one for the family group model in the last section. The salient point is that there are various ways in which information on one family group sheet duplicates information on another. This means that the individuals identified by the information are duplicates. The two models become linkage records where pains have been taken to make the data match precisely. Moving between the family group model and the lineage-linked model requires an additional set of transformations. The following paragraphs outline processes that apply four algorithms. The different record linkage algorithms are the ways to produce linkage records from the information used to identify individuals and ways to compare the records with each other to detect the duplicate individuals.

The eight paragraphs of this section should clarify some of the details of the algorithms and the duplicate groups produced by them. The algorithms belong to the entity class transformations of the lineage-linked model and engine and include those for:
1) P* linkage,
2) O* linkage,
3) O(P) linkage, and
4) P(O) linkage, which has a secondary use.
The duplicate groups may be structured as:
1) individual duplicate groups and
2) sibship duplicate groups.
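One way to picture how pairwise duplicate detections become duplicate groups is sketched below in Python. It rests on an assumption not stated in this paper, namely that duplication is treated as transitive, so that groups are the connected components of the pairwise links. The function name duplicate_groups and the record identifiers are hypothetical, and the sketch does not model the P*, O*, O(P), or P(O) algorithms themselves.

```python
def duplicate_groups(record_ids, links):
    """Group records connected by pairwise duplicate links (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)

    # Only groups with more than one member are duplicate groups.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


ids = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("B", "C")]
print(duplicate_groups(ids, pairs))
```

In the usage lines, records A, B, and C fall into one group because A matches B and B matches C, while D and E remain unduplicated.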

Elements of Data Propagation.   In this section we consider the kinds of data that might be propagated, i.e., guessed at or estimated, in filling out the individual identifiers of a genealogical record linkage entity. In some cases it is necessary to undo the effects of any propagation that might have been performed by a researcher in order to get at the data that record linkage must take into consideration. The date of an event is one of the simplest data structures, yet its propagation is the most involved and precise. Next in complexity is the locality of an event. The simplest, and the least amenable to propagation, is the name of the individual. The semantic base for personal names is the individuals themselves, who are identified by vital events having dates and localities. The vital events of an individual that may be propagated are birth, marriage, and sometimes death. Just as it is possible to propagate a death from a probate or burial record, it is also possible to propagate the other way around: a possible probate or burial from a death. The record linkage algorithm must take all these possibilities into account when weighting the relative importance of the data in the fields of the records to be linked.

Here we consider only the most basic rules. The examples we give all relate data identifying the principal of an event document (P1) to a person whose identifying information is being generated (P2). There are examples of
1) date propagation,
2) locality propagation, and
3) personal name propagation.
To state the rules formally it is necessary to posit a model to structure the semantic base for these elements:
1) dates, and
2) localities.
All data propagation rules may require the following four elements:
1) the relationship of P1 to P2,
2) the marital status of P1 (conditional on value of relationship),
3) the sex of P1,
4) the precision of the identifiers for P1 and P2: date and locality.
In addition to the four basic elements, there are at least two others that are important to consider when devising a date propagation rule:
5) conditions on the age of P1, and
6) demographic parameters needed for calculations.
Propagation rules for personal names also require specification of the four basic parameters. In this case one indicator of precision may be a prepositive title, such as “Miss.”
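As a rough illustration of how the elements listed above might enter a date propagation rule, the following Python sketch estimates a birth year for P2 from the birth year of P1. It is an assumption-laden example rather than a rule from this paper: the names DemographicParams, EstimatedYear, and propagate_birth_year are hypothetical, the demographic values are placeholders and not measured parameters, only the relationships "parent" and "spouse" are handled, and marital status and age conditions are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemographicParams:
    # Placeholder values for illustration only; not taken from any corpus.
    father_age_at_birth: int = 30
    mother_age_at_birth: int = 26
    husband_older_by: int = 3


@dataclass(frozen=True)
class EstimatedYear:
    year: int
    precision: str  # "about" marks a propagated (estimated) value


def propagate_birth_year(p1_birth_year: int, p1_sex: str, relationship: str,
                         params: DemographicParams = DemographicParams()) -> EstimatedYear:
    """Estimate P2's birth year; `relationship` is the relationship of P1 to P2."""
    if p1_sex not in ("M", "F"):
        raise ValueError("sex of P1 must be 'M' or 'F'")
    if relationship == "parent":
        # P1 is a parent of P2: P2 is born about one generation gap later.
        gap = (params.father_age_at_birth if p1_sex == "M"
               else params.mother_age_at_birth)
        return EstimatedYear(p1_birth_year + gap, "about")
    if relationship == "spouse":
        # P1 is the spouse of P2: shift by the assumed age difference.
        shift = (params.husband_older_by if p1_sex == "M"
                 else -params.husband_older_by)
        return EstimatedYear(p1_birth_year + shift, "about")
    raise ValueError("no rule sketched for relationship: " + relationship)


print(propagate_birth_year(1820, "M", "parent"))
print(propagate_birth_year(1825, "F", "spouse"))
```

The result carries a precision marker so that a linkage algorithm can give a propagated date less weight than a recorded one.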