Graph Triangulation Methods
Post Date: 2020-09-29
Topic : Graphs for Genealogy
DRAFT -- please do not share with others or consider it as authoritative until it is peer-reviewed and this message is removed.A major computational challenge in genetic genealogy is overlapping chromosome segments. There is not a one-to-one correspondance of segments in different kits. Kits may share a portion of a larger chromosome segment. This problem, using traditional methods and relational databases, is computationally intense and in some analyzes prohibitive. Graph methods have advantage.
First, let's define the issues. The figure illustrates the 7 possibilities for segment overlap when comparing matches to a propositus. Full concordance of the segments exists only for match 1. Commonly the match has the same start or end position but not both (Matches 2 and 3). Match 6 is subsumed by the propositus and match 7 subsumes the propostus. The other scenarios overlap portions of the propositus segment.
|Illustration of overlapping segments|
|The propositus is the kit and the bars labelled 1 to 5 are matches to the kit.|
Analytics are constrained by the data available. Vendors differ in what they provide to genealogists. Ancestry.com provides no segment data. All other vendors provide their match results and segment (chromosome browser) data. This does NOT provide a complete picture of the segments, but only portions relevant to the match. You can only appreciate the larger extent of the segment by comparing more than one kit. Thus, consider the figure and appeciate how more kits gives a more complete picture of both the propositus and the individual kits. GEDmatch provides a visual phasing image that does portray the details of the entire chromosome. This is valuable, but is a method for reducing the amount of raw data retained for subsequent work such as that discussed here.
The design of the graph database schema is important in optimizing its use for genetic genealogy problems. So let's define what we desire in terms which are computable. Here are the goals:
- Identify all segments or portions of segments which matches share
- For each pair of matches, calculate the sum of the centimorgans of all shared segments or portions of segments to computer the shared centimorgans between the two kits.
- Create an edge in the graph between kit pairs sharing segments and set the sharedCM and segment count properties to the computed values.
- Link the segment data to individuals in the family tree to facilitate computations:
- Identify crossover points.
- Identify segments shared by multiple matches that may define triangulation groups.
- Most questions are resolved by triangulating segments for known persons in our family tree. For instance, we do not need a comprehensive list of overlapping segments for creating triangulaion groups.
- Phasing constrains analytics to specific family tree lines. We can do specific queries just involving relevant individuals when the anaytics are computationally intense.
- Creating exuberant edges creates options of more graph traversal paths which reduces computational work.
- Propagating properties to downstream nodes shortens traversals. For instance, putting a family tree record number in a chromosome browser match node creates an alternative to traversal in connecting the two tree. This brings us to an important feature in the platform: the ability of users to curate the links between the family tree and the DNA data. This is discussed elsewhere.
The schema optimizing these analytics is:
- Create match nodes from chromosome browser data (CB_Match)
- Create segment nodes from chromosome browser data (CB_Segment))
- Create match_segment edges for each match-segment with properties that enable constraining the path traversal to specific associated nodes.
- Create Person nodes from GEDCOM files and father mother edges for traversing the family tree.
- Create Gedcom_DNA edges to link the family tree Person node to the DNA CB_Match node for the family member. At the same time, add a RN (record number) property to the CB_Match node so traversals can start at this node.
Segment analytics uses matches identified in chromosome browser (CB) data from multiple kits. Understanding next steps requires attention to the details of the schema, shown in the figure. The FTDNA chromosome broswer file is used to create CB_Match nodes, their match_segment edges and the CB_Segment nodes. The FTDNA Family Finder file is migrated to Neo4j as FF_Match nodes. A user supplied GEDCOM populates Person nodes. The platform allows users to download a spreadsheet used to curate the link between Person and FF_Match nodes. On submission, the Gedcom_DNA and FF_KitMatch edges are created. The GEDCOM file provides the family tree relationships through father and mother edges.
|Schema enabling graph segment analytics.|
|The GEDCOM and two FTDNA files are the source of the data. People nodes are created from the GEDCOM. FF_Kit node are from the FTDNA Family Finder file. CB_Match and CB_Segment nodes are derived from the FTDNA chromosome browser file. By design, the platform uses files made available by FTDNA; there is no scraping of the webpages to collect data.|
Triangulation Analytics Simplified
We next need to look at how we will compute matches using segments and also clusters of matches to a segment that might form triangulation groups. The next figure illustrates a specific example. Queries to produce chromosome brower data for multiple kits run very fast and replicate those on the FTDNA platform with a few key exceptions. The platform segment metric is megabase pairs (mbp) rather than centimorgans. The platform produces its output with a singe click and not the cumbersome selection process at FTDNA. Extending these platform results is a key research project. It is hypothesized this framework with lead to automation of triangulation group identification with links to ancestral paths.
|Illustration of overlapping segments used in analytics|
|There are multiple matches defined by overlaping segments.|
Neo4j queries traverse the graph and quickly link kits and segments. These can take a few forms described individually here but used in combination where helpful:
- Common ancestors of kit owners are found by traversing the ancestral paths of two or more kits. With two kits, you have most recent common ancestor(s) (MRCA). With three kits you may have different common ancestors for each pairing by one more remote shared ancestor. A single Neo4j query identify these CAs. The queries can collect data along the traversal to produce ahnentafel numbers and path lengths. The path lengths and number of MRCAs define the relatedness (e.g., 2C2R, H3C, etc.). It's easy to computer pedigree collapse using these lenghts to get the coefficient of relatedness.
- Common segments of kit owners are found by traversing the path from kits to segments. If we know the relationship of the starting kit owners the query result will have a manageable list of segments for identifying triangulation groups.
- Descendant tree queries start with a specified ancestor and traverse the descendant graph to those who have DNA test results and continuing the traversal to segment data. This produces a segment report relevant to identifying triangulation groups (TG) aligned with that ancestor. It can also identify others who are not direct descendants and thereby provide phasing insights.
- With multiple kits, each with their own list of matches, you can identify shared matchs (often called in-common-with or ICW). By memorializing this with the shared_match edge we create a new graph useful in analytics.
- Similarly we can create the match_by_segment edge by finding shared segments, computing the overlap and summing up the segment lengths (as mbp). By creating propeties for each edge we create many more edges but enable traversals restricted to specific paths. This allows us to compute the shared mbp between two individuals, other kits at the segment (or overlapping segment) and also to recognize the source of overlapping segments in reporting. The boudaries of the two segments and the overlapping segement are returned by the query. The latter define crossover boundaries.
- An aspirational query or associated algorithm will automate the identification of triangulation groups.
- HTML tables with a link to export an Excel worksheet.
- Descriptive reports which summarize the number of nodes and edges.
- Analytic reports producing genealogically relevant information.
|Screenshot of the platform.|
|The user interface allows users to create their own database, upload their FTDNA files, see the counts of nodes and edges and then produce analytic reports such as that shown. This report is anonymized except the author. The user, at the top left, selects their database and enters two record numbers (RN) from their GEDCOM and then clicks on a link on the right, in this case the crossover boundary link. This populates the query textbox and executes the query, rendering the table. Note that this query using boolean logic to extract overlapping segments. There is a link to download the table as an Excel document. This image truncates the 630 rows that report segment overlap data for 22 autosomes and the X-chromosome. The entire column Match2 and r2.m are populated with a cousin who is RN2. Columns with r1 and r2 contain data in the properties of the match_segment edge (see other figure). While r1.m and r2.m are predefined, r2.m and r2.p are either the starting RN1 and RN2 or other kits that match the segment. All of those returned by the query have a common ancestor in one family tree branch, although this rendering only used family tree information in selecting RN1 and RN2. The segments are ordered by chr, statr and end position of the overlap region. The start and end positions and megabase pairs (mbp) are shown for the kits of RN1 and RN2 and the computed overlap segment. The boundaries of these overlap regions are exposed and the number of rows and mbp in some sequential rows may represent a triangulation group.|
Why is this important?
Genealogists work with many types of graphs. Most have little knowledge and even less experience with graph databases and methods. The platform provides a resource to address this and expand the horizon for innovation in genetic genealogy. There are many very creative citizen scientist in genealogy who can undoubtedly help formulate and develop this future state.
The project needs considerably more technology development, a business plan and skillful people. The platform cost is prohibitive for an individual. But the base infrastructure is the main cost and incremental costs by adding users will distribute these and make the individual cost palitable. Thus, the project needs not only genetic genealogy experts, but geeks, marketing and finance folks.
Please reach out with interest and ideas using the email in the footer of every page!