Graphs for Genealogists™
Powered by Neo4j® and a physician genealogist

Home     Blog     Log In

Graph Triangulation Methods

How graphs simplify the computations important to genealogists

DRAFT -- please do not share with others or consider it as authoritative until it is peer-reviewed and this message is removed.

A major computational challenge in genetic genealogy is overlapping chromosome segments. There is not a one-to-one correspondance of segments in different kits. Kits may share a portion of a larger chromosome segment. This problem, using traditional methods and relational databases, is computationally intense and in some analyzes prohibitive. Graph methods have advantage.

First, let's define the issues. The figure illustrates the 7 possibilities for segment overlap when comparing matches to a propositus. Full concordance of the segments exists only for match 1. Commonly the match has the same start or end position but not both (Matches 2 and 3). Match 6 is subsumed by the propositus and match 7 subsumes the propostus. The other scenarios overlap portions of the propositus segment.

Illustration of overlapping segments

Analytics are constrained by the data available. Vendors differ in what they provide to genealogists. provides no segment data. All other vendors provide their match results and segment (chromosome browser) data. This does NOT provide a complete picture of the segments, but only portions relevant to the match. You can only appreciate the larger extent of the segment by comparing more than one kit. Thus, consider the figure and appeciate how more kits gives a more complete picture of both the propositus and the individual kits. GEDmatch provides a visual phasing image that does portray the details of the entire chromosome. This is valuable, but is a method for reducing the amount of raw data retained for subsequent work such as that discussed here.

The design of the graph database schema is important in optimizing its use for genetic genealogy problems. So let's define what we desire in terms which are computable. Here are the goals:
  1. Identify all segments or portions of segments which matches share
  2. For each pair of matches, calculate the sum of the centimorgans of all shared segments or portions of segments to computer the shared centimorgans between the two kits.
  3. Create an edge in the graph between kit pairs sharing segments and set the sharedCM and segment count properties to the computed values.
  4. Link the segment data to individuals in the family tree to facilitate computations:
    • Identify crossover points.
    • Identify segments shared by multiple matches that may define triangulation groups.
  5. Create Excel documents summarizing shared segment data
There are a few other considerations that make computations more manageable:
  1. Most questions are resolved by triangulating segments for known persons in our family tree. For instance, we do not need a comprehensive list of overlapping segments for creating triangulaion groups.
  2. Phasing constrains analytics to specific family tree lines. We can do specific queries just involving relevant individuals when the anaytics are computationally intense.
  3. Creating exuberant edges creates options of more graph traversal paths which reduces computational work.
  4. Propagating properties to downstream nodes shortens traversals. For instance, putting a family tree record number in a chromosome browser match node creates an alternative to traversal in connecting the two tree. This brings us to an important feature in the platform: the ability of users to curate the links between the family tree and the DNA data. This is discussed elsewhere.

The schema optimizing these analytics is:
  1. Create match nodes from chromosome browser data (CB_Match)
  2. Create segment nodes from chromosome browser data (CB_Segment))
  3. Create match_segment edges for each match-segment with properties that enable constraining the path traversal to specific associated nodes.
  4. Create Person nodes from GEDCOM files and father mother edges for traversing the family tree.
  5. Create Gedcom_DNA edges to link the family tree Person node to the DNA CB_Match node for the family member. At the same time, add a RN (record number) property to the CB_Match node so traversals can start at this node.
Specific details about the Neo4j schema relevant to segment analytics

Segment analytics uses matches identified in chromosome browser (CB) data from multiple kits. Understanding next steps requires attention to the details of the schema, shown in the figure. The FTDNA chromosome broswer file is used to create CB_Match nodes, their match_segment edges and the CB_Segment nodes. The FTDNA Family Finder file is migrated to Neo4j as FF_Match nodes. A user supplied GEDCOM populates Person nodes. The platform allows users to download a spreadsheet used to curate the link between Person and FF_Match nodes. On submission, the Gedcom_DNA and FF_KitMatch edges are created. The GEDCOM file provides the family tree relationships through father and mother edges.

Schema enabling graph segment analytics.

Triangulation Analytics Simplified

We next need to look at how we will compute matches using segments and also clusters of matches to a segment that might form triangulation groups. The next figure illustrates a specific example. Queries to produce chromosome brower data for multiple kits run very fast and replicate those on the FTDNA platform with a few key exceptions. The platform segment metric is megabase pairs (mbp) rather than centimorgans. The platform produces its output with a singe click and not the cumbersome selection process at FTDNA. Extending these platform results is a key research project. It is hypothesized this framework with lead to automation of triangulation group identification with links to ancestral paths.

Illustration of overlapping segments used in analytics

Neo4j queries traverse the graph and quickly link kits and segments. These can take a few forms described individually here but used in combination where helpful:
  1. Common ancestors of kit owners are found by traversing the ancestral paths of two or more kits. With two kits, you have most recent common ancestor(s) (MRCA). With three kits you may have different common ancestors for each pairing by one more remote shared ancestor. A single Neo4j query identify these CAs. The queries can collect data along the traversal to produce ahnentafel numbers and path lengths. The path lengths and number of MRCAs define the relatedness (e.g., 2C2R, H3C, etc.). It's easy to computer pedigree collapse using these lenghts to get the coefficient of relatedness.
  2. Common segments of kit owners are found by traversing the path from kits to segments. If we know the relationship of the starting kit owners the query result will have a manageable list of segments for identifying triangulation groups.
  3. Descendant tree queries start with a specified ancestor and traverse the descendant graph to those who have DNA test results and continuing the traversal to segment data. This produces a segment report relevant to identifying triangulation groups (TG) aligned with that ancestor. It can also identify others who are not direct descendants and thereby provide phasing insights.
  4. With multiple kits, each with their own list of matches, you can identify shared matchs (often called in-common-with or ICW). By memorializing this with the shared_match edge we create a new graph useful in analytics.
  5. Similarly we can create the match_by_segment edge by finding shared segments, computing the overlap and summing up the segment lengths (as mbp). By creating propeties for each edge we create many more edges but enable traversals restricted to specific paths. This allows us to compute the shared mbp between two individuals, other kits at the segment (or overlapping segment) and also to recognize the source of overlapping segments in reporting. The boudaries of the two segments and the overlapping segement are returned by the query. The latter define crossover boundaries.
  6. An aspirational query or associated algorithm will automate the identification of triangulation groups.
The output of the platform queries include:
  1. HTML tables with a link to export an Excel worksheet.
    • Descriptive reports which summarize the number of nodes and edges.
    • Analytic reports producing genealogically relevant information.
  2. Graphic images rendered in D3 templates which can be downloaded as svg files. This is still in active development. The platform has a robust architecture to accelerate development. Rendering D3 documents requires several distinct steps which the platform represents in a computable json file. This architecture is flexible and scalable and not only applicable to genetic genealogy.
  3. Neo4j queries which can be editted and resubmitted, enabling more flexible reporting for those who familiarize themselves with the Neo4 cypher query language. In general, one of the advantages of Cypher is its more intuitive nature. However, some of the queries used on the platform are relatively complex.
  4. An API is aspirational but readily feasible with the current architecture. This would enable third party applications to use the graph database platform and leverage its advantages. The project is open source and will benefit from collaborator participation and contributions.
Screenshot of the platform.

Why is this important?

Genealogists work with many types of graphs. Most have little knowledge and even less experience with graph databases and methods. The platform provides a resource to address this and expand the horizon for innovation in genetic genealogy. There are many very creative citizen scientist in genealogy who can undoubtedly help formulate and develop this future state.

What's next?

The project needs considerably more technology development, a business plan and skillful people. The platform cost is prohibitive for an individual. But the base infrastructure is the main cost and incremental costs by adding users will distribute these and make the individual cost palitable. Thus, the project needs not only genetic genealogy experts, but geeks, marketing and finance folks.

Please reach out with interest and ideas using the email in the footer of every page!