Graph Strategy for Overlapping Segments
Posted by: David A. Stumpf, MD, PhD
Post Date: 2021-09-27
Topic : Neo4j User Defined Functions
Genetic genealogy data contains numerous overlapping segments (Figure). Computing thousands of overlaps by comparing start and end positions becomes computationally challenging when there are multiple kits. Graph methods eliminate the need to compute overlaps directly while still returning sets of results that share a set of overlapping segments. The raw data for DNA results consists of a family tree (person, union and place nodes from a GEDCOM file), tester kit nodes, match and segment nodes, and the edges connecting these nodes. After creating these baseline nodes and edges, the graph can be enhanced by annotating nodes and creating new edges that facilitate the desired queries.
Figure: Illustration of overlapping chromosome segments. The propositus is the kit, and the bars labelled 1 to 5 are matches to the kit.
Enhancements will vary with the questions being asked. This discussion addresses triangulation groups (TGs) for distant ancestors. We begin by annotating the match nodes with a property, ancestor_rn, which is the unique identifier for the matches' common ancestor. All the descendants of an ancestor are easily found by traversing the family tree. The nodes found this way receive the ancestor_rn property. The annotation can be updated easily and quickly.
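The annotation step can be sketched outside the database. The following is a minimal Python illustration, not the Neo4j implementation used here: it assumes the family tree has been flattened to a hypothetical parent-to-children mapping (ignoring union nodes) and that each match record carries a hypothetical `rn` field pointing to its person in the tree. Only the property name ancestor_rn comes from the text above.

```python
from collections import deque


def annotate_descendants(children, ancestor_rn, matches):
    """Tag every match whose person descends from ancestor_rn.

    children: dict mapping a person's rn to a list of child rns (assumed shape).
    matches:  dict mapping a match id to a dict with an 'rn' key (assumed shape).
    Returns the set of descendant rns that were found.
    """
    found = set()
    queue = deque([ancestor_rn])
    # Breadth-first walk down the tree from the ancestor.
    while queue:
        rn = queue.popleft()
        for child in children.get(rn, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    # Add the ancestor_rn property to matches that sit on a descendant.
    for match in matches.values():
        if match["rn"] in found:
            match["ancestor_rn"] = ancestor_rn
    return found


if __name__ == "__main__":
    children = {1: [2, 3], 2: [4], 3: [5], 9: [10]}
    matches = {"m1": {"rn": 4}, "m2": {"rn": 5}, "m3": {"rn": 10}}
    annotate_descendants(children, 1, matches)
    print(matches)
```

Re-running the function after the tree changes simply rewrites the property, which is one way to read the point that the annotation is easy to refresh.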
Next, we can create triangulation group nodes with properties for chromosome, start and end position, and common ancestor. At present, TGs are manually curated. With these TG properties, we can create tg_seg edges from each TG to all the segments it subsumes. Then, because we already have kit_segment and match_segment edges, we can create tg_kit and tg_match edges.
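As a rough sketch of how these edges could be derived, the Python below builds tg_seg, tg_kit and tg_match as sets of pairs. It assumes that "subsume" means a segment lies entirely within the TG's start and end positions on the same chromosome, and it represents the kit_segment and match_segment edges as hypothetical dictionaries keyed by segment id. The dictionary shapes and key names are illustrative, not the actual schema.

```python
def build_tg_edges(tgs, segments, kit_segment, match_segment):
    """Derive tg_seg, tg_kit and tg_match edges as sets of (tg, other) pairs.

    tgs, segments: dicts of id -> {'chr', 'start', 'end'} (assumed shape).
    kit_segment, match_segment: dicts of segment id -> list of kit or match ids.
    """
    tg_seg, tg_kit, tg_match = set(), set(), set()
    for tg_id, tg in tgs.items():
        for seg_id, seg in segments.items():
            # Assumption: a TG subsumes a segment that it fully contains.
            if (seg["chr"] == tg["chr"]
                    and tg["start"] <= seg["start"]
                    and seg["end"] <= tg["end"]):
                tg_seg.add((tg_id, seg_id))
    # Follow the existing segment edges one hop to reach kits and matches.
    for tg_id, seg_id in tg_seg:
        for kit in kit_segment.get(seg_id, []):
            tg_kit.add((tg_id, kit))
        for match in match_segment.get(seg_id, []):
            tg_match.add((tg_id, match))
    return tg_seg, tg_kit, tg_match


if __name__ == "__main__":
    tgs = {"tg1": {"chr": "7", "start": 100, "end": 500}}
    segments = {
        "s1": {"chr": "7", "start": 120, "end": 300},
        "s2": {"chr": "7", "start": 250, "end": 480},
        "s3": {"chr": "7", "start": 450, "end": 700},  # extends past the TG
    }
    print(build_tg_edges(tgs, segments, {"s1": ["kitA"]}, {"s2": ["m2"]}))
```

The comparison against the TG boundaries happens once per TG when the edges are created; later queries follow the stored edges instead of repeating it.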
After sorting the segments by start position, a seg-seq edge is added, linking them by their physical chromosomal locations. Triangulation groups are defined by a start and end position on a chromosome and a common ancestor. We now have a schema that links together the kits, matches, and specific sets of triangulation group segments in sorted order. A single query can traverse the graph from the common ancestor to all the segments in a triangulation group. There is no need to compute the overlaps; they simply emerge.
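The sorted linking and the final traversal can be illustrated with a small in-memory sketch. This is Python standing in for the graph, under the assumption that segments are plain records with `chr`, `start` and `end` fields and that the tg_seg edges are available as a list of pairs; the function names are hypothetical and an actual Neo4j query would look different.

```python
from collections import defaultdict


def link_seg_seq(segments):
    """Return seg-seq edges linking each segment to the next one on its chromosome."""
    by_chr = defaultdict(list)
    for seg_id, seg in segments.items():
        by_chr[seg["chr"]].append((seg["start"], seg["end"], seg_id))
    edges = []
    for rows in by_chr.values():
        rows.sort()  # order by start position, then end position
        ids = [seg_id for _, _, seg_id in rows]
        edges.extend(zip(ids, ids[1:]))
    return edges


def segments_for_ancestor(ancestor_rn, tgs, tg_seg, segments):
    """Walk ancestor -> TG -> segments, returning segment ids sorted by start."""
    result = {}
    for tg_id, tg in tgs.items():
        if tg["ancestor_rn"] != ancestor_rn:
            continue
        seg_ids = [s for t, s in tg_seg if t == tg_id]
        result[tg_id] = sorted(seg_ids, key=lambda s: segments[s]["start"])
    return result


if __name__ == "__main__":
    segments = {
        "s2": {"chr": "7", "start": 250, "end": 480},
        "s1": {"chr": "7", "start": 120, "end": 300},
        "s3": {"chr": "7", "start": 450, "end": 700},
    }
    tgs = {"tg1": {"chr": "7", "start": 100, "end": 500, "ancestor_rn": 1}}
    print(link_seg_seq(segments))
    print(segments_for_ancestor(1, tgs, [("tg1", "s2"), ("tg1", "s1")], segments))
```

Nothing in the lookup compares one segment against another: the segments that come back for a TG overlap because they were all attached to the same TG when the edges were built.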