Dimensionality Reduction in Genealogy
Posted by: David A Stumpf, MD, PhD
Post Date: 2021-04-23 Modified: 2021-05-16
Topic : Graphs for Genealogy
This post expands on a prior post with a deeper dive into the strategy and methods. You should read that prior post for context.
Dimensionality reduction involves sequential steps to decrease the size of the data being managed. The goal is to create a manageable dataset out of an overwhelmingly complex starting set. Looking for distant ancestors is like finding a needle in a haystack. We want a magnet that will pull the needle out without sifting through all the hay. Genealogists have developed many methods for doing this, which we can transfer into a graph environment. One of the nice things about graph analytics is that they are intuitive to experts. This blog will make sense to genealogists because the principles and methods are familiar. The novelty is in the way the principles and methods are deployed. The tool kit infrastructure and execution distinguish the methods we will discuss.
The key infrastructure in Neo4j, a native graph database whose use in genealogy is discussed elsewhere. Neo4j uses the Cypher query language. With experience this becomes intuitive. Cypher has a rich array of capabilities, some of which will be illustrated in our dimensonality reduction project.
We start with several graphs:
- Family tree. Person nodes connected to each other and to union nodes.
- DNA Test Results. Kit nodes connected to match nodes (shared matches) and chromosome segments. Haplogroup results are also reported.
- Curated data. Information the expert genealogist organizes and submits to the analytics
- Link of family tree person node to the DNA Kit node.
- Haplotrees are created by linking SNPs to define branches and the clusters of SNPs at a branch.
- Surname synonyms: variations in the spellings
- Notes: information to help focus and constrain traditional genealogical research.
- FTDNA Project. FTDNA provides a summary file for group administrators that includes all members, testing done and haplogroups when available.
- Knowlege graphs.
- Y-haplotree block nodes and edges to i) SNPs in the block and ii) child blocks in the haplotree.
- Relationship ontology. Used to look up the English name of a relationship defined by the number of MRCAs and the hops in the family tree to reach them.
Your family tree has many branches. The analysis involves a specific distant branch, such as one of your 32 five-great grandfather. Appropriate kits were described previously. This selection avoids many testers and, more importantly, their matches which are mostly to persons in other branches of the famility tree. When dealing with multiple kits, there are often discrete sets of people (e.g., Ohio branch or Georgia branch). You know member of a group are related, but although DNA suggests it, you do not have proof of relatedness of those in different branches. For analytic purposes it is useful to tie them together with fiction MRCA ancestors. The green highlighted columns contain the anhentafels at the generation on the path to the MRCA.
FTDNA identifies matches. Particularly valuable are distant cousins in the patrilineal line because they share very few common ancestors and fewer but more relevant shared matches. While these cousins have smaller shared segments, you have many more of them because of the larger descendent count of distant ancestors. When you have over 100,000 matches, your odds are good for finding matches sharing 7 or more cM segments; often longer segments are found.
Your matches have many surnames. You are interested in those with various spellings of the targetted surname. You are also interested in women who assumed married surnames but still have their parents' DNA. As distant ancestors these women are deceased. But their descendants will have a relevant surname. If you have good guesses about the mother or other female relatives of the researched ancestor, use their maiden names too. An example will be helpful. In my Avitts surname project we hypothesized that the mother of William H. Averatts (c1813-) is Nancy McGhee (1797->1840). So our analytics include Avitts and variations (Averatts, etc.) and McGhee in its various spellings. William had daughters not in my direct line with married names of Spracklin, Rushing, and Tannhill. The other men in a surname project have road blocks with dead-end ancestors, but also mothers, sisters and daughters whose descendants carry their married surname. Queries using these surnames discover matches that will match more than one of the known descendant, including those in different branches of the common ancestor descendency chart.
Taken individually each of these three first steps contributes to dimensionality reduction. But consider them in a Veen diagram. At the center, the three circles all overlap, with an area that is small. We have matches to a small branch of our family tree, with a specific surname on one of its branches and whose DNA shares a chromosome segment. We can now recognize segments that have many matches; so many that it is likely a pile up region in the family which we can ignore. We next can see overlapping segments that are potential triangulation group.
We have used different graphs, each with many branches and traversed each of them to find the nodes in each that are shared by a small group of people who are now available for traditional genealogy research. The power of graph queries is seen because this can all be done by a single query that runs in a few seconds.
Triangulation groups (TGs) are chromosome regions inherited from a common ancestor by a group of descendants. The descendants often have overlapping segments that define the boundaries of the region. Presently, this is a curated manual process, but the goal is to identify criteria for automating the computation of the boundaries. Jim Bartlett developed a useful display of TG data with a row for each individual, their segment data (chromosome, start and end positions), the common ancestor(s), and the ahnentafel number of the path between the propositus and the common ancestor(s).
A single graph query can produce the requisite content. One component of the query aggregates the individuals and their segments within the TG boundaries. Another component traverses the graph between the propositus and the common ancestor with the individual on each row. Graph queries allow you to "collect" facts as you traverse the path up the two branches of the trees of the two individuals to the common ancestor(s). Our query creates a bit string of 0's and 1's for each male and female ancestor in the path. This is an ahnentafel number in base-2. The query also collects the names and identifiers of the ancestors in path order. Finally, it counts the hops from the propositus and individual to the common ancestor. Some quick post processing makes the display more agreeable to genealogists. The base-2 anhentafel is converted to base-10. Since each 1 and 0 is a generation, we can remove them sequentially and convert the result to base-10 to get the anhentafel of the ancestor at each generation. The hops on each limb define the relationship, which is looked up in the ontology.
|Triangulation Group Initial Report with Anonymized Individuals|
|The TG has 14 unique individuals (identified by their record number only), many have more than one row because they have more than one MRCA with the propositus (in this case me) and in some TGs the individual may have interupted segments. Because the rows are produced from multiple kit matches, the shared cm and snp may not be the same in each match; thus, the minumum and maximums are shown. This report is limited to kits of individuals who are in the family tree and thus have a record number from it.|
To facilitate research, it is helpful to see the names of the ancestors on the path to the MRCA and the geography of their births and deaths if known. This next chart provides this additional detail.
|Triangulation Group Initial Report with Added Path and Geographic Information|
|Another view of the TG adds the paths from the propositus (Path1) and the row's match (Path2) to the MRCA. The individuals are anonymized and only a record number shown. Only a few rows are shown.|
There are other views of the TG that allow one to access their complexity (see below) further. To aid subsequent research one report includes all the matches at the TG, including those not in the family tree. These individuals are not phased, but research on them may turn up genealogical clues about their ancestors and if or how they relate to the TG MRCAs and known members.
Most of the above steps involve information about known ancestors. But in Step 4 we identify matches who we previously had not placed within the family tree. This is a small number of matches which constrains the amount of traditional genealogical research. That research includes exploring whether these matches have family trees, familiar matches, geographic clues, etc. These matches are added to the curated file, with notes (3-d above) about the research. Notes such as "no family tree" help you avoid redundant dead end research work. The helpful discovered matches will have paths entered into the family tree, haplogroups if available, and might be recruited to join the project and provide their matches. Thus, this is an iterative effort which the steps repeated after adding new kits. Because the population of the Neo4j database is done with computer code, you get a cup of coffee while it's repopulated. With 40 kits it may take 10 to 20 minutes.
If there is available data, other dimensionality reduction steps may be useful. These include:
- Haplotrees. The analytics start with a patrilineal line. Some surname projects have many men who have Big Y results identifying haplogroups on small twigs or leaves of the hierarchy. They may have varying levels of testing and thus be at different points on the path or even have divergent branches at the twig level. The kits sharing a twig have matches that are likely more relevant than others in the analytics. Graph queries can quickly identify the men on these distal haplotree branches and then proceed with the steps above.
- Triangulation group segregation. A distant common ancestor passes triagulation groups down to descendants, but they will segregate differently on different branches of the descendancy tree. After many generations of descent the the triangulation groups may be small. Indeed, we desire small triangulation groups because that is what we expect from the distant ancestors these methods are seeking. The methods described facilitate identifying these small triangulation groups.
- Geography. This is a work in progress. New matches identified may be more useful in guiding research if their demographics are available. If they were in the same region as known ancestors, they can be prioritized in traditional genealogy research efforts.
|Triangulation Group Paths to MRCAs|
|This visualization of the same data found in the table above shows all the TG members and their paths to MRCAs. There are 105 Person nodes and 18 mother and 95 father edges; remember, we are starting with patrilineal lines. The TG members are generally around the perimeter and identified by having only one line point to a parent. The numbers are their record numbers in the family tree. Such renderings help one comprehend the complexity and focus research. The image is an svg file that will enlarge without losing resolution. The visualization is best seen by downloading the file.|
|Family Tree with Triangulation Groups|
|This anonymized family tree has the most distant ancestor at the top and the descendant branches ending with DNA testers. The testers have segments that map to multiple triangulation groups (TG), which are shown as numbers. These TG numbers are propagated up the tree and concatenated with those of cousins when their branches converge. The visualization is best seen by downloading the file.|
A DNA tester usually creates triangulation groups for themselves. The DNA results are theirs and limited to their matches whom they find share the TG segment. The TG ahnentafel paths refer to their family tree. The same is true for present day clustering algorithms.
The analytics described here use multiple kits. This design allows you to use many more matches that increase the likelihood you will find segments from distant ancestors. But it also confounds the analytics, requiring more perspectives to garner appropriate and accurate conclusions. Fortunately, one of the main benefits of a graph environment is the ease of altering the perspective. However, the logic and results are being scrutinized to define the limitations. More to follow ...