Autosomal DNA analytics in surname projects
Posted by: David A Stumpf, MD, PhD
Post Date: 2020-11-02 Modified: 2021-04-10
Topic : Graphs for Genealogy
Family Tree DNA supports surname projects. Participants either share a surname themselves or have Y-DNA matches to the surname. In the latter case, there was likely a surname change somewhere back in the lineage. Surname projects group men based on haplogroups and the pattern of single nucleotide polymorphisms (SNPs). It is not unusual for projects to have numerous groups. Within a group the project seeks to identify a common male ancestor (CMA). When there are well-documented family trees, the CMA may already be defined and the DNA is used to validate or extend the conclusions derived from historical records. However, many projects have groups of men whose common ancestor is unknown. This is a common conundrum in American families who arrived in colonial times and whose diaspora is lost to history. This document addresses that scenario.
Patrilineal trees are a convenient way to show relationships between men in a surname project group which presumably has a CMA. If individual trees are truncated before a CMA, one can assume the existence of the CMA and add him to the patrilineal tree. This allows one to analyse the group as a whole. In some cases a subgroup may have known common ancestors who presumably converge, by an unknown path, on the more distant CMA for all the men.
This document explores methods for using at-DNA results to help map the missing steps in the path to the CMA of all the men. The at-DNA may be derived from the men in the project or other individuals in their families who share a path to the surname under investigation. The figure illustrates the opportunity.
|Matches (blue nodes) may match more than one surname kit (pink nodes) and provide additional clues about how the kit owners are related. The number on the edge is the shared centimorgans|
|This graph uses a set of project kits from the Avitts surname project and matches whose surname is McGhee or McGee. McGhee is hypothesized to be the surname of the mother of the kit owner's most distant known Avitts ancestor. That kit owner has 8 McGhee matches, supporting the hypothesis. Equally important, the other project kits also have McGhee matches, some of which are shared by more than one kit. Discovering these bridging matches from a pool of, in this case, 49,977 matches is very easy with graph analytics and very difficult with other methods.|
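The idea of a bridging match can be illustrated without a graph database. The following is a minimal Python sketch, not the GFG implementation: the kit names, match names and centimorgan values are invented for illustration, and the edge list stands in for the kit-to-match relationships drawn in the figure.

```python
from collections import defaultdict

# Hypothetical (kit, match, shared_cM) edges; every name and value is invented.
edges = [
    ("kit_A", "McGhee_1", 42.0),
    ("kit_A", "McGhee_2", 18.5),
    ("kit_B", "McGhee_2", 23.1),
    ("kit_C", "McGhee_2", 11.0),
    ("kit_B", "McGee_3", 15.7),
    ("kit_C", "McGee_4", 9.3),
]


def bridging_matches(edges, min_kits=2):
    """Return {match: {kit: shared_cM}} for matches linked to at least min_kits kits."""
    kits_by_match = defaultdict(dict)
    for kit, match, cm in edges:
        kits_by_match[match][kit] = cm
    return {m: kits for m, kits in kits_by_match.items() if len(kits) >= min_kits}


bridges = bridging_matches(edges)
for match, kits in bridges.items():
    print(match, sorted(kits))
```

In a graph database the same question is a short pattern query; the dictionary above only shows what "a match shared by more than one kit" means in data terms.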
With this introduction, let's step back and outline the steps:
- Identify a suitable surname project. The group of men should have strong evidence for sharing a Y-chromosome lineage, typically because their most recent SNPs fall on the same recent branch of the haplotree. There must be a critical mass of at-DNA test results available for these men or for others who match them but have a different surname. Embedding an at-DNA project within a surname project encourages submission of these kits. One strategy is to include a group in the project for individuals matched to the surname but not in the patrilineal tree. The Modified Y-Utility can be used to determine the time to MRCAs, and this will help determine the kits most likely to benefit from at-DNA analytics.
- Collect the at-DNA result files. A project administrator can download the FTDNA Family Finder and Chromosome Browser comma-delimited files for each available kit.
- Curate a GEDCOM file. The Graphs for Genealogy (GFG) platform has tools allowing you to cross map matches to known individuals in the GEDCOM. The GEDCOM should incorporate hypothesized MRCAs who link known lines for groups of men in the project. This will enable queries of all putative descendants.
- Load the kits to the platform. This involves several steps described in another blog post.
- Discovery process. Graph methods are powerful because they enable discovery of insights that are otherwise difficult to find.
- Research Hypotheses. Discovered insights produce clues which will require in depth research. The research will also add individuals to the GEDCOM and, through curation, create new links between them and the project matches in the family tree graph.
- Peer review. Good research will hold up under scrutiny by peers knowledgeable about the lines investigated and the methods used.
The Stinnett surname project (see below), as of 8 Apr 2021, incorporates 32 kits of people descended from a putative common ancestor 10 generations back. These 32 kits produce 112,520 unique matches, most of which will be in lines other than the line of interest. According to David Reich, 50% of the descendants of an ancestor 10 generations back will inherit a segment of that ancestor's DNA. That leaves 16 of the kits and 56,260 matches. If we have only one common ancestor, he or she will be one of 1,024 (2^10) ancestors at 10 generations. Thus, of those 56,260 matches, only about 55 (56,260 / 1,024) would share that single common ancestor 10 generations back.
This defines the feasibility of this approach. Can we find those roughly 55 matches in a pool of 112,520, the 0.05% that we seek? With traditional methods that is a nearly impossible challenge. But graph methods are well known for their ability to "discover" such hidden insights. That is the mission and rationale of this project.
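The feasibility estimate is simple arithmetic and can be rechecked for any project. This short Python sketch uses the figures quoted above; the variable names are ours, and the 50% inheritance rate is the assumption attributed to Reich. Treat the result as an order-of-magnitude estimate (a few dozen matches out of more than a hundred thousand), not an exact count.

```python
total_matches = 112_520   # unique matches across the project kits (from the text)
kits = 32                 # kits in the project (from the text)
inherit_fraction = 0.5    # assumed share of 10th-generation descendants carrying a segment
generations = 10

ancestors = 2 ** generations                         # ancestors at that generation
kits_with_segment = kits * inherit_fraction          # kits expected to carry a segment
matches_in_play = total_matches * inherit_fraction   # matches expected to carry a segment
expected_target = matches_in_play / ancestors        # matches through one given ancestor
share_of_pool = expected_target / total_matches      # fraction of the whole match pool

print(f"ancestors at generation {generations}: {ancestors}")
print(f"kits with a segment: {kits_with_segment:.0f}")
print(f"matches in play: {matches_in_play:.0f}")
print(f"expected target matches: {expected_target:.1f}")
print(f"share of the pool: {share_of_pool:.3%}")
```

Changing `generations` or `inherit_fraction` shows how quickly the target group shrinks as the common ancestor moves further back.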
The strategy uses dimensionality reduction. We progressively segment the data to get to a manageable and relevant group of matches:
- Start with kits in a surname project or the relatives in the patrilineal lineage
- from this group: Select the shared matches with surnames associated with the patrilineal lineage
- from this group: Select matches who share overlapping segments with at least two other matches
- from this group: Select those with a shared common ancestor(s).
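The four steps above can be sketched as a chain of filters. This is an illustrative Python sketch under stated assumptions, not the platform's code: the `Match` record, its field names, the sample data and the ancestor identifier `I100` are all hypothetical, segments are simplified to (chromosome, start, end) tuples, and the last step is read as "descends from one specified ancestor in the curated tree".

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    name: str
    surname: str
    kits: set        # project kits this person matches
    segments: list   # (chromosome, start_bp, end_bp) tuples
    ancestors: set = field(default_factory=set)  # curated tree ancestor ids


def overlaps(a, b):
    """True when two segments are on the same chromosome and share positions."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_partners(match, pool):
    """Count other matches in pool with at least one segment overlapping match."""
    return sum(
        1
        for other in pool
        if other is not match
        and any(overlaps(s, t) for s in match.segments for t in other.segments)
    )


def reduce_matches(matches, project_kits, surnames, target_ancestor):
    """Apply the four filters in order and return the survivors of each stage."""
    stage1 = [m for m in matches if m.kits & project_kits]
    stage2 = [m for m in stage1 if m.surname in surnames]
    stage3 = [m for m in stage2 if overlap_partners(m, stage2) >= 2]
    stage4 = [m for m in stage3 if target_ancestor in m.ancestors]
    return stage1, stage2, stage3, stage4


# Invented sample data.
matches = [
    Match("m1", "McGhee", {"kit_A"}, [("1", 100, 500)], {"I100"}),
    Match("m2", "McGhee", {"kit_B"}, [("1", 300, 700)], {"I100"}),
    Match("m3", "McGee", {"kit_A", "kit_C"}, [("1", 400, 900)]),
    Match("m4", "McGhee", {"kit_B"}, [("2", 100, 500)], {"I100"}),
    Match("m5", "Smith", {"kit_A"}, [("1", 100, 900)], {"I100"}),
    Match("m6", "McGhee", {"kit_Z"}, [("1", 100, 900)], {"I100"}),
]
stages = reduce_matches(matches, {"kit_A", "kit_B", "kit_C"}, {"McGhee", "McGee"}, "I100")
print([len(s) for s in stages])
```

Printing the size of each stage makes the reduction visible: every filter should shrink the pool, and a stage that removes nothing or everything is a hint that its criterion needs adjusting.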
Discovery methods are still under development! Research is defining the best methods and then code is created to enable users to implement the methods. This page will be updated to reflect progress.
- Targeted surnames. Matches with the project surname are perhaps the best targets for finding clues of relatedness. Some of these may be men who can then be recruited to the project.
- Haplotree twigs. Men in a surname project may fall into groups representing different limbs on the Y-haplotree. We can use these limbs, when available for filtering the kits and matches.
- Spouse surnames. Men sharing a CMA may also share a common female ancestor.
- Sister surnames. The CMA's known sisters may have known husbands whose surnames are propagated to descendants who have at-DNA test results.
- Shared Segment(s). Matches are identified by shared chromosome segments and we can use this to find matches who share segments, thereby identifying the segments relevant to the patrilineal ancestor for a given match.
- Surname frequency. Counts of the number of matches with specific surnames may point you to best opportunities for research, particularly if the surname is not common in the general population but is prevalent in the project matches.
- Surname substitutes. Projects often have men whose surname is different from that of a distant patrilineal ancestor. Use these substituted surnames to discover insights.
- Clusters. Visualized graphs will show clusters of matches associated with various combinations of project kits. The matches in these clusters are bridges in the graph between two or more kits. Of particular interest are matches who match multiple kits, especially if the kits are in different known lines.
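The surname frequency idea in the list above can be made concrete by comparing how often a surname appears among the project matches with how common it is in the general population. The sketch below is a hedged Python illustration: the function name, the sample surnames and the population shares are invented, and a real analysis would need a population surname table from an outside source.

```python
from collections import Counter


def surname_enrichment(match_surnames, population_share, min_count=2):
    """Rank surnames by (share among matches) / (share in population).

    population_share maps lower-cased surnames to a fraction of the general
    population. Surnames without a baseline get a ratio of None and sort last.
    """
    counts = Counter(s.strip().casefold() for s in match_surnames if s.strip())
    total = sum(counts.values())
    rows = []
    for surname, n in counts.items():
        if n < min_count:
            continue  # skip surnames seen too rarely to be informative
        baseline = population_share.get(surname)
        ratio = (n / total) / baseline if baseline else None
        rows.append((surname, n, ratio))
    rows.sort(key=lambda r: (r[2] is None, -(r[2] or 0.0)))
    return rows


# Invented sample data.
match_surnames = ["McGhee", "mcghee", "Givens", "Givens", "Givens",
                  "Smith", "Smith", "Smith", "Smith", "Jones"]
population_share = {"mcghee": 0.0001, "givens": 0.0002, "smith": 0.01}

rows = surname_enrichment(match_surnames, population_share)
for surname, n, ratio in rows:
    print(surname, n, ratio)
```

In this toy data Smith has the highest raw count but the lowest enrichment, which is the point of the method: a rare surname that is prevalent among the matches is a better research lead than a common one.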
The Avitts surname project at FTDNA has several distinct groups of men with different spellings of their surname in the group with a haplogroup-defining SNP of BY39551. The patrilineal tree incorporates a hypothesized CMA, allowing the display of all project members and their lineages.
William H Averatt line. The author's great-great-great grandfather is William H. Averatts  (c1813 SC-). The number in  is a unique identifier. There were several hypotheses for his parents, but each lacked conclusive evidence. Among those was a suggestion his parents were John Dempsey Avit  (1785-) and Nancy McGhee  (1797->1840). Supporting this was the presence of a Nancy Avit next to a younger John T. Avit  (≬1815/20-) in the 1840 census of Gallatin Co., Illinois, where William  was known to live. This census indicated Nancy  was born between 1791 and 1800. A very simple and intuitive graph database query produced the graph (shown above) of McGhee matches linking multiple kits in the project. This included two men whose family trees mapped to a CMA, James H McGhee Sr  (1698-1774). This tree was known from previous research but assumed increased importance with the multiple McGhee matches in the project. Further study of the children of James H McGhee Sr  revealed a daughter Nancy, born in 1797, and a son William who died in Gallatin Co., IL in 1840. Furthermore, the geographic trajectory of the children of James McGhee  through North Carolina, South Carolina and Jackson Co., Alabama was aligned with that of the author's William H Averatts  line. William's oldest daughter was born near Birmingham, Alabama. This DNA and historical evidence added weight to the notion that Nancy  was the mother of William  and John T. .
Biddy Wilson line. One of the other participants in the project was a Wilson whose Y-DNA matched him to Avitts. Research provided an explanation. His great great great grandmother was Biddy Wilson  (c1795->1860). She is named in the probate record of her mother Susanna in Spartanburg Co., South Carolina. This record names Biddy's husband as William Evet  (c1800-). She was evidently estranged from him and gave her children her maiden name. This finding led to research on Avitts/Evit in Spartanburg Co. The 1790 census shows, on the same page, William Avit  (<1774-) and John Avit  (<1774-). In the 1820 census we find a William Evet who may be .
Samuel Givens Evetts line. A third line in the project tracks its lineage to the CMA Samuel Givens Evetts Sr  (1774 NC-1850 TX) whose hypothesized but not proven parents are George Evit  (1745-1830) and Elizabeth Givens  (-). Using the simple graph database query, six distinct Givens matches were found linked to each of the three lines described here. None of these matches were shared by all three lines.
The FF_Green surname project at FTDNA has several distinct groups of men with different spellings of their surname in the group with a haplogroup-defining SNP of BY96037. The patrilineal tree incorporates a hypothesized CMA, allowing the display of all project members and their lineages.
Robert Green line. The author's great-great-great-great grandfather is Robert Green  (c1774-1826). He was living in Abbeville Co., SC in the late 1700s. His parents are unknown. His haplogroup root is R and this distinguishes him from descendants of another Robert Green in that county who are in the I root branch.
Ezekiel Green line. Several project participants descend from Ezekiel Green  (1774-). It is hypothesized that he descended from Greens who settled initially in Maryland.
Ohio Green line. Two project participants are descended from Green lines that are traced back to Ohio but whose earlier members are unknown.
An early finding in this project was discovering, among the 80,000 matches, a match to three of the kits whose family came from Early Co., Georgia and who had 3 brothers who are potential Y-DNA testers.
The Stinnett surname project at FTDNA has two distinct groups, the I- and R-Stinnett lines, defined by their main haplotree branch. These branches separated many thousands of years ago. But these two lines lived together in colonial Maryland, leading to hypotheses that they are related, sharing at-DNA while a different Y-DNA was introduced from another line. The R-Stinnett kits match the Calvert/Colbert surname of the proprietors of Maryland, including the Y-DNA of an exhumed Calvert buried in colonial times. The math discussed above was from this project. Through extensive traditional genealogy research a hypothesis emerged that Margaret Stinnett (1667-1727) was the mother of two half-brothers, one an R-Stinnett [Benjamin Hughes Stinnett Sr. (c1710-c1773)] and the other an I-Stinnett [James Stinnett Sr (1706-1835)]. William Calvert (≬c1665/66-<1718) was married to Margaret; the other father is unknown but presumably, based on the Y-DNA, a Stinnett relative. The many at-DNA matches shared by descendants of I- and R-Stinnett suggested they are related. From the traditional genealogy research we know the maiden and married surnames of both William Calvert and Margaret Stinnett. We use these surnames to query for matches who 1) have the surname, 2) also share a DNA segment larger than 7 cM with the project kits and other matches and 3) match kits that share a known common ancestor. This is possible with a single Neo4j query bringing together the match, segment and family tree graphs. We then have the list of those who are in the small target group, a fraction of a percent of all the matches. The query discovers triangulation groups. Research on this discovered group of matches shows some have sufficient data to document their connection to the common ancestor line. Thus, we have a proof of concept. The questions remaining are whether this is scalable to other projects and whether the clues provided result in solid genealogical proofs using a broader set of research results.
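The segment condition in that query can be sketched outside Neo4j. The Python below is an assumption-laden illustration, not the project's query: the row layout, the names and the sample values are invented, and only the 7 cM threshold comes from the text. It groups segments that share a common overlapping region on one chromosome and keeps groups involving at least three people. True triangulation also requires that the group members match each other on that region, and the surname and known-common-ancestor conditions are left out here, so the output should be read as candidate groups only.

```python
from collections import defaultdict

MIN_CM = 7.0  # segment size threshold quoted in the text


def candidate_groups(rows, min_people=3):
    """rows: (kit, match, chromosome, start_bp, end_bp, cM) tuples.

    Greedy sweep per chromosome: a segment joins the current group only if it
    begins before the end of the region shared by every segment already in
    the group, so all members of a group overlap one common region.
    """
    by_chrom = defaultdict(list)
    for kit, match, chrom, start, end, cm in rows:
        if cm > MIN_CM:
            by_chrom[chrom].append((start, end, kit, match))

    groups = []

    def close(chrom, members, region):
        people = {person for pair in members for person in pair}
        if len(people) >= min_people:
            groups.append(
                {"chromosome": chrom, "region": region, "people": sorted(people)}
            )

    for chrom, segs in sorted(by_chrom.items()):
        segs.sort()
        members, common_start, common_end = [], None, None
        for start, end, kit, match in segs:
            if members and start < common_end:
                members.append((kit, match))
                common_start = start
                common_end = min(common_end, end)
            else:
                if members:
                    close(chrom, members, (common_start, common_end))
                members, common_start, common_end = [(kit, match)], start, end
        if members:
            close(chrom, members, (common_start, common_end))
    return groups


# Invented sample rows.
rows = [
    ("kit_A", "m1", "5", 1_000_000, 9_000_000, 12.4),
    ("kit_B", "m1", "5", 2_000_000, 8_000_000, 10.1),
    ("kit_A", "m2", "5", 3_000_000, 7_500_000, 8.2),
    ("kit_C", "m3", "5", 20_000_000, 26_000_000, 9.0),
    ("kit_B", "m4", "5", 3_500_000, 6_000_000, 5.5),  # below the threshold
]
groups = candidate_groups(rows)
print(groups)
```

The greedy sweep is a design shortcut: it guarantees a shared region within each group but may not report every possible grouping, which is acceptable for generating research leads that will be verified by hand.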
- Debbie Parker Wayne: Parker Study: Combining atDNA & Y-DNA. In Advanced Genetic Genealogy: Techniques and Case Studies. Debbie Parker Wayne, editor. Wayne Research, Cushing, Texas, 2019. pp 155-188.
- Jim Bartlett: Lessons Learned from Triangulating a Genome. In Advanced Genetic Genealogy: Techniques and Case Studies. Debbie Parker Wayne, editor. Wayne Research, Cushing, Texas, 2019. pp 1-26.
- Blaine Bettinger: A Triangulation Intervention. The Genetic Genealogist. Accessed 10 April 2021.
- Dimensionality Reduction, Wikipedia. Accessed 10 April 2021.