Introduction to Graphs for Genealogists
Posted by: David A Stumpf, MD, PhD
Post Date: 2020-09-18 Modified: 2020-11-14
Topic : Graphs for Genealogy
The GFG web platform is designed to accelerate analytics for experienced genetic genealogists. GFG is an open source project that is still under development and not yet mature enough for general use; however, we are recruiting beta testers and developers. This post explains technical aspects of the project to help you appreciate its opportunities and challenges, and how well it aligns with your skills and interests. The database can be dedicated to specific projects. The data is NOT admixed with other user data, and access is restricted to the user or their designees. The data is encrypted at rest (on the server). Once data is uploaded, the website guides the user through a series of analytic tools.
GFG is a platform on Azure, Microsoft's cloud environment. Its core capability is a Neo4j version 4.x native graph database. The front end for user interaction is a Django (Python) website and its supporting SQL Server database. Neo4j v. 4.x enables multiple independent databases, access control and role management. Capabilities are still being developed; their current status can be found at this link.
The code is housed at GitHub, presently in a private repository. As the project matures, this will be made public. If you are interested in participating, please send an email to David A Stumpf, MD, PhD at firstname.lastname@example.org with a brief message about your interests, skills, datasets for testing, and availability.
GFG users, after registration, can create their own Neo4j database. Data is uploaded sequentially, typically a GEDCOM file followed by consumer DNA results from tests on multiple relatives. This creates two graphs, the genealogy family tree and the DNA graph. Linking these together requires curation. GFG provides a downloadable Excel document in which each DNA kit can be linked to its GEDCOM identifier. This document is then uploaded and processed, thereby linking the two graphs.
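The curation step can be pictured with a minimal Python sketch. This is not GFG's code: the identifiers, the sample data and the `link_kits` function are hypothetical. It only illustrates the idea of checking each curated row (kit, GEDCOM identifier) against both graphs before a link is created.

```python
# Hypothetical sample data; none of these identifiers come from GFG.
people = {"I1": "Person One", "I2": "Person Two"}        # GEDCOM id -> name
kits = {"K100": {"vendor": "VendorA"}, "K200": {"vendor": "VendorB"}}

# Rows as they might appear in the curated spreadsheet: (kit id, GEDCOM id).
curated_rows = [("K100", "I1"), ("K200", "I2"), ("K300", "I9")]


def link_kits(people, kits, rows):
    """Return (links, rejected).

    links maps kit id -> GEDCOM id for rows whose kit and person both exist;
    rejected lists the rows that reference an unknown kit or person.
    """
    links, rejected = {}, []
    for kit_id, gedcom_id in rows:
        if kit_id in kits and gedcom_id in people:
            links[kit_id] = gedcom_id
        else:
            rejected.append((kit_id, gedcom_id))
    return links, rejected


links, rejected = link_kits(people, kits, curated_rows)
print(links)
print(rejected)
```

Rows that fail the check are returned rather than silently dropped, so the curator can correct the spreadsheet and upload it again.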
GFG works with data readily available from vendors. Vendors allow their customers to download raw data and the results of some of their analytics such as matches and chromosome segment data. These files are used by GFG. GFG does not "scrape" data from the vendor platforms but relies on users to procure their data.
In a graph database, information is represented as nodes and edges (relationships). Queries traverse the graph and can collect information along the path. This is a very efficient method for many genealogy queries.
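To make the idea of traversal concrete, here is a small Python sketch that follows parent edges in an in-memory family tree. It stands in for a graph query and is not how GFG or Neo4j implements traversal; the `parents` data and the `lineage` function are assumptions made for illustration.

```python
# Hypothetical tree: person id -> ids of father and mother (None if unknown).
parents = {
    "I1": {"father": "I2", "mother": "I3"},
    "I2": {"father": "I4", "mother": "I5"},
    "I3": {"father": "I6", "mother": "I7"},
    "I4": {"father": None, "mother": None},
}


def lineage(start, parent_role, parents):
    """Follow one kind of parent edge ("father" or "mother") upward from start.

    Returns the list of person ids visited, beginning with start.
    """
    path = [start]
    current = start
    while True:
        nxt = parents.get(current, {}).get(parent_role)
        if nxt is None or nxt in path:  # stop at the end of the line or on a cycle
            break
        path.append(nxt)
        current = nxt
    return path


patrilineal = lineage("I1", "father", parents)
matrilineal = lineage("I1", "mother", parents)
print(patrilineal)  # ['I1', 'I2', 'I4']
print(matrilineal)  # ['I1', 'I3', 'I7']
```

Each step looks only at the edges of the current node, which is why this kind of query stays cheap even in a large tree.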
GFG is designed for analytics, summarized as follows:
- Basic lineage reports: Traversing family tree graphs quickly delivers patrilineal, matrilineal and X-linked trees in either ascending or descending order. These reports can be linked to and constrained to those who had DNA testing.
- Phasing: The curated data linking the GEDCOM and DNA graphs enables phasing, the initial step in distinguishing lineages in the DNA graph.
- Shared Matches: The multiple kits' datasets are used to identify shared matches, forming triangles of Kit A, Kit B and Kit C with edges between them (A->B, A->C and B->C), each carrying the shared centimorgans. These shared matches may or may not share chromosome segments.
- Shared Segments: The DNA graph connects kits to specific chromosome segments. Kits which share segments create triangles. These triangles may represent inherited segments. Identifying the triangles represents the first step in determining their relevance to the genealogist.
- Triangulation Groups: Multiple kits sharing a segment create a triangulation group (TG). GFG queries present the data in a format permitting the user to easily recognize TGs and curate them.
- Y-DNA + at-DNA: Men in a surname project have shared matches that provide clues about their relatedness.
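The shared-match triangles described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GFG's query: the kit names, the centimorgan values and the `find_triangles` function are hypothetical, and the brute-force search over kit triples is chosen for readability rather than speed.

```python
from itertools import combinations

# Hypothetical shared-match edges: (kit, kit) -> shared centimorgans.
shared_cm = {
    ("A", "B"): 120.5,
    ("A", "C"): 85.0,
    ("B", "C"): 40.2,
    ("A", "D"): 33.0,
}


def find_triangles(shared_cm):
    """Return every set of three kits in which all three pairs are matches.

    Each result is ((a, b, c), (cm_ab, cm_ac, cm_bc)).
    """
    # Treat edges as undirected so ("A", "B") and ("B", "A") are the same edge.
    edges = {frozenset(pair): cm for pair, cm in shared_cm.items()}
    kits = sorted({kit for pair in shared_cm for kit in pair})
    triangles = []
    for a, b, c in combinations(kits, 3):
        sides = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(side in edges for side in sides):
            triangles.append(((a, b, c), tuple(edges[side] for side in sides)))
    return triangles


triangles = find_triangles(shared_cm)
print(triangles)  # [(('A', 'B', 'C'), (120.5, 85.0, 40.2))]
```

Kit D matches only Kit A, so it forms no triangle. A shared-segment triangle would need an extra check that the three kits overlap on the same chromosome coordinates, which this sketch does not attempt.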