We use our DNA results to find relatives who match our DNA. Here we provide an understanding of how such DNA matching can work to help us identify a common ancestor (CA) with someone who shares parts of our DNA.
Autosomal DNA (atDNA or auDNA) is the recombining portion of our DNA: it is inherited as 22 pairs of autosomes (recombining chromosomes), each pair consisting of one chromosome from each parent.
A human germ cell (gamete) contains a single set of 23 chromosomes, each randomly chosen from the corresponding chromosome pair, so each chromosome comes from one of the parents. Yet both parents are represented in each of these 23 chromosomes. How can one chromosome from a single parent represent both parents?
Each of a gamete’s chromosomes contains a mixture of DNA segments inherited from the parents (a resulting child’s grandparents). This mixing of DNA segments is called crossover. It occurs in an early phase of the complex, multi-stage meiosis process, during gamete production. The details are not germane here; see ‘Fat Alberts’, or other online discussions.
When a gamete joins an oppositely-sexed gamete during fertilization, the resulting offspring’s atDNA consists of paired chromosomes, each consisting of mixed segments representing both sets of grandparents. Two or three segments are usually pseudo-randomly mixed on each chromosome during meiosis crossover. There are hotspots and coldspots on each chromosome where such splicing is more often or less often performed.
No ancestor gets left out of one’s DNA in near-by generations, but by the 5th generation and earlier, the probability of having all ancestors represented in one’s DNA becomes vanishingly small:
- 96% chance of representing all 16 g-g-grandparents
- 54% chance of representing all 32 g-g-g-grandparents
- 0.01% chance of representing all 64 g-g-g-g-grandparents
Combined with the 2^23 (about 8.4 million) different configurations of human gamete that can be expressed during meiosis (a father or mother choice for each chromosome), it is clear that meiosis is a source of great DNA shuffling, which explains why non-twinned children of the same parents are so unique.
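The count of gamete configurations is a power of two: one father-or-mother choice for each of the 23 chromosomes. A minimal sketch of that arithmetic follows; the function name is illustrative only, and crossover (which multiplies the variety much further) is deliberately ignored.

```python
def gamete_configurations(chromosomes: int = 23) -> int:
    """Count whole-chromosome parental combinations, ignoring crossover."""
    # Each chromosome is an independent two-way choice (father's or mother's copy).
    return 2 ** chromosomes


print(gamete_configurations())  # 8388608
```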
Segments Over The Generations – Finding A Common Ancestor (CA)
Each chromosome in a human gamete is inherited from one of the parents. As a result of crossover, at least one segment of DNA on a gamete chromosome, and on average 1.6 segments, is sourced from a chromosome provided by the other parent.
Each crossover event divides some of a chromosome’s segments into shorter segments. Thus, the length of a segment attributable to a specific ancestor is related to the number of ancestral generations that have intervened since that ancestor.
There are different measures of chromosome length: number of base pairs (BP), number of SNPs, and centiMorgans (cM). A cM is not actually a fixed measure of DNA length, but rather a probabilistic measure that accommodates the hot and cold spots mentioned previously. 1 cM is defined as the amount of location-specific DNA for which a 1% probability exists of its containing a segment crossover point in a single generation. The average chromosome length is ~160 cM. Since, by this definition, one crossover point is expected on average per 100 cM, a typical chromosome hosts at least one crossover segment; thus one can say that, in practice, no grandparent will be left out.
The resulting alternating-parent DNA segments each consist of many millions of DNA base pairs (BP) and several thousand SNPs, and measure many tens of centiMorgans (cM) in length, a count that varies widely in differing chromosome regions, and between male and female gametes.
If another person shares an entire chromosome after crossover, it is likely they will be a sibling; the parent will be the CA. If the other shares a significant part of a chromosome, it is likely they will be a first cousin, and a grandparent will be the CA. The further removed in ancestry one is from the genome being matched, the smaller the segment that will be shared.
When a shared segment is found in common between two relatives, the end points of each relative’s chromosome segment will likely be different. What is actually shared is the overlap between the two chromosome segments. Further, this overlap area may itself be shared by other relatives with the same or different CA. Triangulation is needed to define which relatives belong to which CA, a process called triangulation grouping.
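The idea that the shared part is only the overlap of two segments can be sketched in a few lines. This is an illustration, not any testing company's method; the `Segment` alias and the function name are hypothetical, and positions are assumed to be in one consistent unit on the same chromosome.

```python
from typing import Optional, Tuple

# (start, end) positions on one chromosome, in any single consistent unit
Segment = Tuple[float, float]


def shared_overlap(a: Segment, b: Segment) -> Optional[Segment]:
    """Return the region common to both segments, or None if they do not overlap."""
    start = max(a[0], b[0])  # the later of the two start points
    end = min(a[1], b[1])    # the earlier of the two end points
    return (start, end) if start < end else None


print(shared_overlap((10, 50), (30, 80)))  # (30, 50)
```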
In segmentology used for unknown CA determinations, one usually is looking at segment lengths between 25 and 125 cM, or 0.4% to 2% of shared DNA. The ancestral relations corresponding to these lengths are:
- 1.6% shared DNA or 110 cM – second cousins once removed, half second cousins, first cousin three times removed, half first cousin twice removed
- 0.8% shared DNA or 55 cM – third cousins, second cousins twice removed
- 0.4% shared DNA or 27 cM – third cousins once removed
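The percentages and cM figures in this list are consistent with a total of roughly 6,800 cM. That total is inferred here from the list itself; it is an assumption, not a value quoted from any testing service, and services differ in the totals they use. Under that assumption, the conversion can be sketched as:

```python
# Assumption: total inferred from the figures in the list above, not an official value.
ASSUMED_TOTAL_CM = 6800.0


def percent_shared(shared_cm: float, total_cm: float = ASSUMED_TOTAL_CM) -> float:
    """Convert a shared length in cM to a percentage of the assumed total."""
    return 100.0 * shared_cm / total_cm


for cm in (110, 55, 27):
    print(cm, "cM is about", round(percent_shared(cm), 1), "% shared")
```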
There are two qualities of match when comparing shared segments, Identity by Descent (IBD), and Identity by State (IBS). Above we exemplify IBD matching, where the matching genomes have at least one near-term CA. This is nirvana, finding an ancestor by segment matching.
However, it is possible to match a segment, but for the two parties not to share a recent CA. This is IBS, and can occur when a segment descends from multiple sources having different ancestry, but by chance matches the comparison genome. Triangulation matching is used to weed out such IBS false positives that wrongly suggest a recent CA.
Heuristically, for matching segments larger than ~10cM (usually indicating a 4th-5th cousin or closer relationship), there is scant evidence of false positives through IBS. It is generally safe to assume IBD and to pursue identification of the CA. But at ~7cM of matching DNA, evidence suggests there is a 50-50 chance the match is IBS. For such shorter segments, extra information is needed to determine if there is a CA, via the triangulation process described below.
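These rules of thumb can be written down as a small helper for annotating spreadsheet rows. The thresholds are the approximate ones quoted above, the labels are my own wording, and the function name is hypothetical; treat it as a sketch of the heuristic rather than a validated classifier.

```python
def match_confidence(segment_cm: float) -> str:
    """Label a single matching segment using the approximate thresholds in the text."""
    if segment_cm >= 10.0:
        return "likely IBD"   # scant evidence of IBS false positives at this size
    if segment_cm >= 7.0:
        return "uncertain"    # roughly a coin toss near 7 cM; triangulate first
    return "likely IBS"       # assumption: most segments this small are chance matches


print(match_confidence(12.3))  # likely IBD
```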
The three largest autosomal DNA testing companies are 23andMe, Ancestry, and FTDNA. GEDmatch is a free service that allows cross-comparing results from all three testing services; people simply download their raw results from their testing service and upload them to GEDmatch. Many people have migrated their results to GEDmatch, so it is the place to go to get the biggest bang for one’s efforts. GEDmatch also provides a facility for accepting and displaying paper genealogy trees via the GEDCOM standard. Using DNA to locate CAs cannot work if some relative doesn’t have the associated paper trails to identify the matches.
Many 23andMe customers have uploaded their DNA to GEDmatch, so most matching can be done at GEDmatch; only residual matching will be required at 23andMe directly. GEDmatch provides Tier 1 (paid) tools to assist in grouping DNA relatives based on shared common ancestors, a process called triangulation. There are likely other tools available also. But a spreadsheet and the standard (free) tools of 23andMe (where I tested DNA) and GEDmatch suffice for my purposes.
Triangulation requires, as a first step, finding all one’s DNA-registered relatives via search for shared DNA segments on all 22 autosome pairs. The second step is to compare all persons who match you with one another. Some will not match some others, even though they all match you, because we each inherit a chromosome from each parent; one person’s match on father’s chromosome, and another person’s match on mother’s chromosome, will both match you, but may not be related to one another.
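The two-step test can be sketched as follows. This is a simplified illustration with hypothetical names: it assumes you have already recorded, from one-to-one comparisons, which pairs of your matches also match each other on the region in question (the `mutual` set).

```python
from typing import FrozenSet, NamedTuple, Set


class Match(NamedTuple):
    name: str
    chromosome: int
    start: float  # segment shared with you, start position
    end: float    # segment shared with you, end position


def triangulates(a: Match, b: Match, mutual: Set[FrozenSet[str]]) -> bool:
    """True if a and b overlap on your chromosome AND also match each other."""
    overlap = (a.chromosome == b.chromosome
               and max(a.start, b.start) < min(a.end, b.end))
    # Without the mutual match, one may sit on the paternal copy and one on the maternal.
    return overlap and frozenset((a.name, b.name)) in mutual


alice = Match("Alice", 5, 10, 40)
bob = Match("Bob", 5, 30, 60)
carol = Match("Carol", 5, 20, 50)
mutual = {frozenset(("Alice", "Bob"))}
print(triangulates(alice, bob, mutual), triangulates(alice, carol, mutual))  # True False
```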
From the main page after logging into GEDmatch, request the One-to-many report (free) from the Data Analysis panel, with segment length threshold set to 10 cM (default is 7 cM). This produces a table of people matching your DNA at some level. Copy the data columns of this report to your spreadsheet.
The table has a select column, and one must check it for each row entry for which detailed chromosome comparisons are wanted, which should be all people above a certain threshold of total segment length of matching DNA. I chose all people with total segment matching length of >23cM. That was over 150 people, a tedious selection process. Save this page with selections made, so you don’t have to redo the selections if you want to re-run the analysis. The GEDmatch paid triangulation tool seems to relate all persons having total shared segment length >15.5 cM, but I didn’t need that much resolution in my initial manual attempt at triangulation.
When all selection boxes are checked, click the Submit button near the top of the report page. The next page offers a choice of 2-D or 3-D chromosome browser. Select 2-D, then on the next page click on the word HERE. The result is a report of all shared segments in chromosome order, with graphic representation as well.
Go through this report and enter the segment detailed data (segment start position, end position, and length) into new columns in the spreadsheet. For easier readability, convert all segment locations to mbp units by dividing the displayed locations in base pairs by 1 million. Some on the list are duplicates or siblings with the same DNA. Remove these for a cleaner result.
A person with multiple matching chromosome segments will need a separate row for each matching segment; repeat rows as required to hold the data for each new shared segment. Remove any rows corresponding to individual segments less than 10 cM in length. (I allowed segments >8.5 cM if the same person had an adjacent segment >10cM, gaining a few more data points). Most segments smaller than 7 cM seem to be IBS matches, and they will confuse the process going forward; it is recommended they be culled.
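For those who prefer a script to hand-editing rows, the culling rule might look like the sketch below. The `Segment` fields and function name are hypothetical, and "adjacent" is interpreted loosely as "the same person has a segment over 10 cM on the same chromosome", which is my assumption rather than a precise rule.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    person: str
    chromosome: int
    start_mbp: float
    end_mbp: float
    cm: float


def cull_segments(segments: List[Segment], keep_cm: float = 10.0,
                  relaxed_cm: float = 8.5) -> List[Segment]:
    """Keep segments of at least keep_cm, plus shorter ones backed by a long neighbor."""
    kept = []
    for seg in segments:
        # Assumption: "adjacent" means same person, same chromosome, longer than keep_cm.
        has_long_neighbor = any(
            other is not seg
            and other.person == seg.person
            and other.chromosome == seg.chromosome
            and other.cm > keep_cm
            for other in segments
        )
        if seg.cm >= keep_cm or (seg.cm > relaxed_cm and has_long_neighbor):
            kept.append(seg)
    return kept


rows = [
    Segment("Ann", 3, 10, 30, 14.2),
    Segment("Ann", 3, 31, 40, 9.0),   # kept: Ann has a longer segment on chromosome 3
    Segment("Bob", 7, 5, 12, 9.0),    # dropped: no longer segment to back it up
    Segment("Cy", 1, 1, 9, 6.5),      # dropped: too short
]
print([s.person for s in cull_segments(rows)])  # ['Ann', 'Ann']
```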
Return to 23andMe (or whatever other DNA sites were used) to pick up the data for those relatives who had not uploaded their DNA to GEDmatch. On 23andMe, I matched their DNA individually with mine using the DNA tab, then copied the detailed matching segment data to the spreadsheet.
The spreadsheet has now expanded, with a row for each segment shared with me, for each person in GEDmatch and 23andMe. I ended up with 115 distinct relatives sharing with me over 150 distinct DNA segments > 8.5 cM, spread over all 22 autosomes.
The next step is to consolidate (roll-up) and name the prospective shared segments, eliminating duplication and thus simplifying the process going forward. Consolidation consists of rolling up overlapping, nested segments on a chromosome and noting the least start location and greatest end location for each distinct set of nested segments. Some judgement will be required regarding whether adjacent areas should be included, or excluded into their own distinct shared segment.
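The roll-up is essentially interval merging, done per chromosome. A minimal sketch follows, with a hypothetical function name and segments given as (chromosome, start, end) tuples. It merges segments that overlap or touch, so the judgement call about nearby but separate areas still has to be made by hand.

```python
from typing import List, Tuple

Span = Tuple[int, float, float]  # (chromosome, start, end)


def roll_up(segments: List[Span]) -> List[Span]:
    """Merge overlapping or nested segments into one span per distinct set."""
    merged: List[Span] = []
    # Sorting puts segments in chromosome order, then by start position.
    for chrom, start, end in sorted(segments):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            # Extend the current span: least start is kept, greatest end wins.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


print(roll_up([(1, 10, 40), (1, 30, 60), (1, 80, 90), (2, 20, 50), (1, 35, 45)]))
# [(1, 10, 60), (1, 80, 90), (2, 20, 50)]
```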
I chose a numeric name format xxyyyzzz for my rolled-up segments, where xx is the chromosome#, yyy and zzz are the segment start and end locations (mbp). This numeric format allows sorting rows by segment names, enabling our end game – organizing the table into mutually exclusive family groups, each rolled-up segment corresponding to a prospective common ancestor (CA), a set of g-, g-g- or g-g-g-grandparents whose segments are inherited. I ended up with approximately 50 unified (rolled-up) segment names, call them triangulation segments (TG).
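The xxyyyzzz naming scheme is easy to generate with zero-padded formatting. In this sketch the function name is hypothetical, and rounding positions to the nearest whole mbp is my assumption about how the three-digit fields are filled.

```python
def tg_name(chromosome: int, start_mbp: float, end_mbp: float) -> str:
    """Build an xxyyyzzz name: 2-digit chromosome, 3-digit start and end (mbp)."""
    # Assumption: positions are rounded to whole mbp to fit three digits.
    return f"{chromosome:02d}{round(start_mbp):03d}{round(end_mbp):03d}"


names = [tg_name(12, 5, 48), tg_name(7, 23.4, 118.7)]
print(sorted(names))  # ['07023119', '12005048']
```

Because every name has the same width, a plain text sort orders the rows by chromosome and then by start location.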
Finally, one needs to consolidate TGs into a set of CA groups, each group associated with the same set of common ancestors. Find all relatives that share a TG and check which other TGs they share with anyone else. Then bring all people who share any of those TGs into the CA group and repeat, until there are no more connections to follow. My ~50 TGs comprise 11 identified, mutually-exclusive CA groupings, with about 25 uncategorized TGs left over. Each CA grouping is independent of the others (i.e. comparing people in different CA groupings revealed no IBD ancestry links).
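This "follow the connections until none are left" procedure amounts to finding connected groups in a network of people and TGs. The sketch below is a simplification with hypothetical names: it links two TGs whenever one person appears in both, taking (person, TG) pairs as input.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def ca_groups(shares: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group TG names that are linked, directly or indirectly, through shared people."""
    tgs_by_person = defaultdict(set)
    people_by_tg = defaultdict(set)
    for person, tg in shares:
        tgs_by_person[person].add(tg)
        people_by_tg[tg].add(person)

    groups: List[Set[str]] = []
    seen: Set[str] = set()
    for tg in sorted(people_by_tg):
        if tg in seen:
            continue
        group: Set[str] = set()
        frontier = [tg]
        while frontier:  # keep following links until nothing new turns up
            current = frontier.pop()
            if current in group:
                continue
            group.add(current)
            for person in people_by_tg[current]:
                frontier.extend(tgs_by_person[person] - group)
        seen |= group
        groups.append(group)
    return groups


shares = [("Ann", "01010060"), ("Ann", "03020050"),
          ("Bob", "03020050"), ("Cy", "07023119")]
print(ca_groups(shares))
```

Here Ann links the first two TGs into one prospective CA group, while Cy's TG stands alone.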
Now that all DNA relatives identified within 5 generations or so are sorted into mutually exclusive ancestry groups associated with specific ancestral lineages, we complete the process of triangulation by comparing the people in each group with one another. While doing these binary comparisons, additional segments are added as rows to the spreadsheet, identified by the two names that share them.
My spreadsheet, after sorting and colorizing by CA groupings, appears as follows:
where the column headers are:
Kit#, Sex, TG, Start, End, cM, Name
TG, Start, End, and Length (cM), are derived columns added to the original columns of the GEDmatch One-to-many report. A given named segment will span all the start/end segment boundaries nested within it.
In my case, I had determined three CAs by direct DNA match prior to the above sorting process. Two are common to the first large buff-colored CA group from my father’s side, one with my grandparents as CA, the other with my g-g-grandparents in the same lineage as CA. The remaining relative became associated with the sixth crimson-colored CA group, with g-grandparents as the CA from my mother’s side. Now I know with reasonable certainty that all the other people listed in those shared groups are associated with these same or related CAs.
For the other eight CA groups whose CAs are not yet identified, the plan is to contact each person, tell them what I know, point them to my extensive tree with most all ancestors identified back to g-g-g-grandparents, and see if any of these relatives have a paper tree sufficient in depth to identify our common ancestor.
A cell with non-paired chromosomes, such as a gamete, is called haploid. Cells with paired chromosomes are called diploid. The two chromosomes of each pair are homologs (from homologous), identical in structure but different in content.
The 23rd pair of human chromosomes are called allosomes, aka the X and Y sex chromosomes, and largely do not recombine during meiosis. Also, the cell’s mitochondria contain non-recombinant DNA.
An entire branch of genetic genealogy deals with such non-recombinant segments of our genomes, our Y-DNA and mt-DNA, where Y is the formal name of the male allosome, and ‘mt’ stands for mitochondrial, referring to the mitochondria, which contain non-recombinant DNA present in all cells and passed from a female to all her children via her ova.