Using DNA to Explore Ancestry

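DNA is stored in the nucleus of our cells as 23 pairs of chromosomes (22 pairs of autosomes plus one pair of sex chromosomes, XX in females and XY in males), and as mitochondrial DNA in the mitochondria, organelles in the cytoplasm of each cell. For genealogy, DNA comes in two types: recombinant (chiefly the autosomal DNA) in the chromosomes, and non-recombinant (the non-recombining portion of the Y chromosome, and the mitochondrial DNA).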

Recombinant DNA is mixed in every generation: each parent passes on a shuffled blend of their own parents’ DNA, so we carry some genes from our mother’s parents and some from our father’s parents. This shuffles the deck of our grandparents’ DNA, making us diverse and, through diversity, healthier and with greater promise for future beneficial traits. The majority of our recent ancestors are identifiable through autosomal DNA, although by six generations back there is only a 0.01% probability that all 64 ancestors in that generation are still represented in any descendant’s genes.
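
The figure of 64 ancestors at six generations follows from simple doubling, as the short Python sketch below shows. The function names are illustrative only, and the "average share" is just the expectation 1/2^n; the amount actually inherited from any one ancestor varies because recombination is random.

```python
def ancestors_in_generation(n: int) -> int:
    """Number of ancestor slots n generations back (n=1 is parents)."""
    if n < 0:
        raise ValueError("generation must be non-negative")
    return 2 ** n


def average_autosomal_share(n: int) -> float:
    """Expected fraction of autosomal DNA from one ancestor n generations back."""
    return 1.0 / ancestors_in_generation(n)


if __name__ == "__main__":
    for n in range(1, 7):
        count = ancestors_in_generation(n)
        share = average_autosomal_share(n)
        print(f"generation {n}: {count} ancestors, average share {share:.2%}")
```

The sketch counts slots in the family tree, not distinct people; where cousins married, the same person can fill more than one slot.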

Non-recombinant DNA covers only a tiny slice of our genetic heritage, so at best it can identify only a very small percentage of our current and past relatives. More on that later. Non-recombinant DNA is inherited essentially unchanged: Y-DNA passes from father to son, and mitochondrial DNA (mtDNA) from mother to all her children. Because mtDNA comes from the egg cytoplasm, only daughters can pass it on.

Recombinant (autosomal) DNA

Recombinant DNA identifies personal family relations through matched DNA segments on the chromosomes. Companies such as 23andMe, FamilyTreeDNA, and MyHeritage provide this service. They can determine how closely two people are related by comparing their autosomal DNA. For each test client, the company provides a list of DNA ‘relatives’ who have also tested with them. GEDmatch, another useful service that does no testing itself, accepts DNA data collected by other testing services and provides a set of tools for analysis.

Here ‘relative’ means a person who matches us on a segment of significant length on one or more chromosomes. Segment length is measured in centiMorgans (cM), a unit of genetic rather than physical distance: one cM corresponds to roughly a 1% chance of a recombination event (crossover) within the segment in a single generation. Thus two segments of equal physical length, measured in base pairs of nucleotides, may be assigned different lengths when measured in cM. It is functional length, not physical length, that is important to us.
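
To make the difference between physical and functional length concrete, here is a minimal Python sketch. The local recombination rates (in cM per megabase) are invented for illustration. The conversion from cM to crossover probability uses Haldane’s map function, one common textbook model that assumes crossovers occur independently; it is an assumption of this sketch, not something the testing companies are known to use.

```python
import math


def genetic_length_cm(length_mb: float, rate_cm_per_mb: float) -> float:
    """Genetic length of a segment, given its physical length in megabases
    and a local recombination rate (hypothetical values in this sketch)."""
    return length_mb * rate_cm_per_mb


def recombination_fraction(cm: float) -> float:
    """Haldane's map function: probability that a segment of the given
    genetic length is split by recombination in one generation."""
    morgans = cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))


if __name__ == "__main__":
    # Two segments of identical physical length (10 Mb) in regions with
    # different, made-up recombination rates.
    for label, rate in (("low-recombination region", 0.5),
                        ("high-recombination region", 2.0)):
        cm = genetic_length_cm(10.0, rate)
        print(f"{label}: {cm:.1f} cM, "
              f"recombination fraction {recombination_fraction(cm):.3f}")
```

For short segments the recombination fraction is close to the cM value divided by 100, which is why 1 cM is usually described as about a 1% chance of crossover.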

Adequate family histories, built through research in ancestral records, are the prerequisite for good genealogy results. This is equally true of DNA research for genealogy. DNA is simply the scientific evidence for genealogical relationships. Unless we can assign names and locations to our matches, DNA is not proof of any specific result. Thus when giving a service a sample of our DNA, we should also give them our best effort at a family tree, to get the most knowledge from our DNA.

Relative lists typically extend from current family members to distant relations (cut off at about 6th cousins), based on the cumulative length of the matching segments, which decreases with each additional generation of separation. Not all matches are equally valuable. Only matches that originate in a common ancestor are meaningful in ancestral analysis. So these companies also provide clients with online chromosome browsers to examine the details of the matching segments, which is necessary to confirm matches as meaningful in this context.
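
How quickly the shared total falls off can be sketched with the standard expectation that full nth cousins share, on average, (1/2)^(2n+1) of their autosomal DNA. The function name below is illustrative. These are averages only: actual amounts vary widely, and distant cousins may share no detectable segment at all.

```python
def expected_share_full_cousins(degree: int) -> float:
    """Average fraction of autosomal DNA shared by full cousins of the
    given degree (1 = first cousins, 2 = second cousins, ...)."""
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** (2 * degree + 1)


if __name__ == "__main__":
    for degree in range(1, 7):
        share = expected_share_full_cousins(degree)
        print(f"cousin degree {degree}: average shared fraction {share:.4%}")
```

Each step out in cousinship divides the expected share by four, which is why lists become unreliable somewhere around the 6th-cousin cutoff.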

Matching is complicated by genetic distance. As a general rule, when the relationship is more distant than 3rd cousin (or 2nd cousin once removed, etc.), a three-way match (triangulation) is required to assert certainty. Three is better than two because the three people must also match each other, yielding three two-way comparisons on the same segment and thus guaranteeing that any matching segment originates with a single ancestor. With triangulation, the three paths to the Most Recent Common Ancestor (MRCA) are genetically validated as actual ‘blood lines’, documented in the DNA.
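
The three two-way comparisons can be sketched as an interval check: the segment shared by A and B, by A and C, and by B and C must all overlap on the same chromosome. This is a simplified Python illustration with invented positions and hypothetical names (Segment, overlap, triangulated_region). A real analysis would also require the common region to exceed a minimum length in cM, which is omitted here.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chromosome: str
    start: int  # base-pair position where the match begins
    end: int    # base-pair position where the match ends


def overlap(a: Segment, b: Segment) -> Optional[Segment]:
    """Return the region common to both segments, or None if there is none."""
    if a.chromosome != b.chromosome:
        return None
    start, end = max(a.start, b.start), min(a.end, b.end)
    if start >= end:
        return None
    return Segment(a.chromosome, start, end)


def triangulated_region(ab: Segment, ac: Segment, bc: Segment) -> Optional[Segment]:
    """Region shared in all three pairwise comparisons, or None."""
    first = overlap(ab, ac)
    if first is None:
        return None
    return overlap(first, bc)


if __name__ == "__main__":
    # Invented example: three testers matching each other on chromosome 7.
    a_b = Segment("7", 10_000_000, 30_000_000)
    a_c = Segment("7", 15_000_000, 35_000_000)
    b_c = Segment("7", 12_000_000, 28_000_000)
    print(triangulated_region(a_b, a_c, b_c))
```

The B-to-C comparison is the step that matters: if B and C each match A on the same stretch but not each other, they may be matching A on different parental copies of the chromosome.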

If a shared segment is large enough (typically greater than 20 cM), a single identified relative is considered a ‘certain’ match, provided the paper trail identifies the common ancestor; no triangulation is required. This is the usual case for third cousins or closer relatives.

When a match is found, the other parties may have had more success in historical research, which can then be combined with our own. Research should be about sharing ideas and data. That’s what makes it fun, and borrowing from others’ analyses is how we all make sizable jumps in progress.

The author, through autosomal DNA triangulation at 23andMe (with results then transferred to GEDmatch), has validated both his mother’s and father’s blood lines back to most g-g-grandparents, and further in a few cases, in the process finding some previously unknown cousins.

Non-Recombinant DNA

While autosomal DNA can support general population studies of distant times via statistical means such as Principal Component Analysis (PCA), specific deep genetic ancestry is more definitively traced through non-recombinant DNA.

Beyond direct links to historical ancestors, non-recombinant DNA can indicate a broad history of one’s biological clans going back over millennia, via a unique genetic signature. Researchers study all public genetics databases containing non-recombinant DNA, to draw conclusions about where we come from and how we got here from there. 

The basic idea is that by studying the distribution of current peoples’ DNA, we can learn something about our distribution long ago. The resulting hypotheses can then be tested as technology improves for deriving DNA from ancient remains (paleogenomics). For background on the terminology used below, refer to the prior essay Genetics: Our Genetic Clanship.

We use non-recombinant DNA in personal ancestral research to identify the smallest clan sharing our signature (with which we are uniquely and unequivocally related). When we identify our ancestral clan, the population-wide research will have suggested where and when the clan originated. Thus, both personal and population-wide goals are advanced by testing non-recombinant DNA. Here, DNA stands in for a surname, since patronymic surnames likely were not yet used when these clans originated.

There are two types of non-recombinant DNA marker from which one can create an identifying signature: haplotype, the signature of a characteristic grouping of DNA Short Tandem Repeat (STR) markers, and haplogroup, a unique, clade-defining Single Nucleotide Polymorphism (SNP).

Historically, SNP haplogroup testing has been expensive and inefficient. Most people first do STR testing, which supports haplogroup inference without requiring explicit SNP tests. But ultimately it is the haplogroup that biologically asserts our genetic relatedness; the haplotype is only a rough proxy for this genetic fingerprint.

Non-recombinant DNA divides us into clans based on paternity and maternity. Our paternal and maternal clans’ ancestral wanderings through prehistory can be mapped from deep in the Upper Paleolithic down to recent times. This is why we record our non-recombinant genetic test results (a standard list of one’s DNA STR markers) in public databases. This permits us to look for others who match us. It further allows researchers to see where people of given DNA types cluster now, in order to extrapolate backward in time to where these types originated. The more people who register their results publicly, the greater the accuracy of such backward projections. Currently, 167 STR markers are available for testing, but the author’s 41 have proved sufficient to place him in the finest-resolution haplotype currently visible to us.
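
Looking for others who match us comes down to comparing lists of STR repeat counts. A minimal Python sketch is given below, using a simple stepwise count in which each one-repeat difference at a marker adds one to the genetic distance. The marker names are real Y-STR loci, but the repeat values are invented, and the function name is hypothetical. Testing companies apply their own refinements, so this is an illustration, not their method.

```python
from typing import Mapping


def str_genetic_distance(a: Mapping[str, int], b: Mapping[str, int]) -> int:
    """Sum of absolute repeat-count differences over markers both people tested."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("no markers in common")
    return sum(abs(a[marker] - b[marker]) for marker in shared)


if __name__ == "__main__":
    # Invented haplotypes for two hypothetical testers.
    person_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
    person_b = {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 10}
    print(str_genetic_distance(person_a, person_b))
```

A smaller distance over more markers suggests a more recent common paternal ancestor, which is why testing more markers sharpens the haplotype.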

Public databases for this purpose existed until recently, but they were removed from online access in 2018 because of the EU’s General Data Protection Regulation (GDPR). Some matching is still available to people who tested at FamilyTreeDNA (FTDNA). The author has used these results to locate two other lineages sharing his haplogroup.

Also see Paternal DNA Use Case
