The following is an introduction to personal genetics.
There are two main thrusts in personal genetics:
- understanding one’s ancestral group
- understanding one’s coordinates in the space of personalized medicine
This page concentrates on the former. The latter is addressed within pages having a medical flavor, under the Genetics menu.
A New Industry
Genetic technology is spawning various genetic testing services that collect DNA, analyze it, store it, and catalog the results in databases, making those results available to researchers and to people interested in analyzing their ‘deep’ genealogy. Participants typically give freely of their genetic material for testing, paying a small fee for the test itself.
The principal purpose of genealogical genetic testing services is to determine one’s current genetic group, the collection of people with similar DNA ‘fingerprints’. These fingerprints are a type of genetic signature comprised of unique mutations (markers) that distinguish one person’s DNA from another’s. Such testing returns the specific values for the DNA markers comprising one’s signature, which determines one’s grouping. Various numbers of DNA markers can be tested for, and the cost of the test varies accordingly. For a discussion of the rudiments of genetics, helpful for full comprehension of what follows, refer to my prior page Genetics Primer.
The genetic donors incidentally support the advancement of genetic discoveries useful to us all. By allowing their test results to be published in public databases, they enable researchers to glean much information about human history: how we got here, even the likely origins of all hominids and how we differ from non-human hominids.
These databases further allow us to speculate on much more recent migrations of humans. Genealogists recognize that, beyond a certain time some 15 or so generations ago, we must abandon the hope of finding reliable information about the majority of our individual ancestors, switching to population studies to learn more about our distant ancestral groups and more recent ‘tribes’ and ‘clans’.
Why do we care about our prehistoric clans? Isn’t it true that we all descend from a single male and female progenitor many thousands of generations ago? Aren’t those of us with European ancestry best described as Euromutts? Yes, it’s true, we are a homogenized mix of DNA types; other than the immediately related generations, we relate to many people equally well within our broad historical context. So study of ancient DNA patterns is not uniquely personal. It is an abstract quest for knowledge: of history, of how we came to be here, and of the struggles endured by our distant forebears.
Our great multitude of hypothetical ancestors is mathematically beyond comprehension when duplication is not considered. But as we go back more than a few generations, the populations in our ancestors’ locales perhaps numbered a few thousand individuals or less, so what appear to be separate people in our abstract ancestral trees are really not all distinct individuals, but common ancestors to many of our lines. Still, we all are products of unimaginable diversity.
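To make the arithmetic concrete, the short Python sketch below counts the slots in an idealized family tree, assuming every ancestor is a distinct person (the "no duplication" case). The local population figure of 5,000 is purely an illustrative assumption standing in for the "few thousand" mentioned above; the function names are likewise invented for this example.

```python
def ancestor_slots(generations_back: int) -> int:
    """Ancestor slots exactly `generations_back` generations ago,
    assuming no individual appears twice in the tree."""
    return 2 ** generations_back


def first_generation_exceeding(population: int) -> int:
    """Smallest number of generations back at which the tree has more
    slots than there were people, so some ancestors must repeat."""
    g = 0
    while ancestor_slots(g) <= population:
        g += 1
    return g


if __name__ == "__main__":
    # Illustrative local population size, not a measured value.
    local_population = 5_000
    for g in (5, 10, 15, 20, 30):
        print(f"{g:>2} generations back: {ancestor_slots(g):,} slots")
    print("Duplication is forced by generation",
          first_generation_exceeding(local_population))
```

Under these assumptions the tree outgrows a village of a few thousand people after little more than a dozen generations, which is why the same individuals must occupy many slots.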
The bulk of our DNA is largely homogenized via its recombinant (autosomal) nature; our genes are a mix and match combination of all our ancestors’ genes and thus reveal the extent of our ‘muttiness’. One mentions genes in this context because homogenization occurs only at a certain broad granularity. Beneath that granularity are the individual genes, which are inherited from either one parent or the other as a complete DNA module.
Genetics labs, in a supporting role for medical research, have long been evaluating autosomal DNA. As the cost of such testing continues to fall, the results have acquired genealogical utility as well, for it is statistically possible to disassemble our genetic homogeneity and to learn of the population groups that were involved in our near heritage.
There is, however, a wormhole through our homogenized, recombinant DNA. Through two small genetic windows of non-recombinant DNA, we can peer back directly to all our own male and female progenitors with matching DNA. Over time, heritable mutations in these genetic pathways permit mapping the individual paths taken by our paternal and maternal lineages. Although this non-recombinant DNA is peripheral, telling little about our overall genome, it has a primordial flavor that is truly ours alone. This is personal. We are able to reconstruct the history of the travels of our M-F lineages with some accuracy based on current population distributions, further informed by paleogenomic discoveries.
The genealogy DNA testing services have largely concentrated on these small windows. The rest of this discussion explores what we have learned from testing the pure M-F lineages. Usually the more markers tested, the more detailed the population differentiation possible. But even a minimal marker set is usually capable of placing one in a geographically well-defined population group (aka clade in genetic/biological terminology). As continuing research is able to further define and track population clades, one can add specific markers to one’s test result for an incremental fee; one’s DNA material is saved by the labs (25 years in one case).
There are two types of DNA marker that are informative of DNA group boundaries: short tandem repeats (STR) and single nucleotide polymorphisms (SNP). One can order either form of test from the testing companies. These are described in more detail later, together with their naming conventions.
STRs have been historically much less expensive to test. Also, they have predictive properties for identifying corresponding SNPs. Groupings based on STR tests are somewhat fuzzy, relying on statistical calculations of a modal STR value set for a hypothetical grouping, where no one in the associated group may exactly match that modal set of STR values.
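The fuzziness of a modal STR set can be seen in a small Python sketch. It assumes the modal haplotype is simply the most common repeat count at each marker, taken marker by marker; the three group members and their repeat counts are invented for illustration and are not real test results.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common repeat count at each marker across a group.

    `haplotypes` is a list of dicts mapping marker name -> repeat count;
    all dicts are assumed to cover the same markers.
    """
    markers = haplotypes[0].keys()
    return {
        m: Counter(h[m] for h in haplotypes).most_common(1)[0][0]
        for m in markers
    }


# Invented repeat counts for three hypothetical group members.
group = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 15},
    {"DYS393": 13, "DYS390": 23, "DYS19": 14},
    {"DYS393": 14, "DYS390": 24, "DYS19": 14},
]

modal = modal_haplotype(group)
print("Modal set:", modal)
print("Anyone matches it exactly?", modal in group)
```

Each member differs from the modal set at one marker, so the group is defined by a signature that none of its members carries.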
The most efficient testing strategy to date is to do an inexpensive, broad spectrum STR test, use the results to predict the most downstream corresponding SNP (using an internet SNP prediction calculator), then switch to SNP testing beginning at that SNP. Testing companies themselves use such a strategy internally if one requests a broad spectrum SNP test.
The principal value of SNPs is their uniqueness, enabling a precise DNA clade tree structure, with an SNP defining each branching node.
Identifying our Paternal/Maternal Genetic Clans
Currently two types of DNA are used for paternal and maternal population studies respectively. Paternal signatures derive from the non-recombinant Y chromosome DNA (NRY DNA or just Y-DNA). Y is the male sex chromosome that resides in the nucleus of males’ cells. Paternal NRY DNA is passed largely unchanged from father to son.
Maternal signatures derive from mitochondrial DNA (mtDNA). Maternal mtDNA is passed largely unchanged from mother to daughter (and to son, but only daughters can pass it on). A man has both paternal and maternal signatures present in each of his cells. A woman must look to her father, a paternal uncle, or a brother to determine her paternal signature.
One says non-recombinant DNA is ‘largely unchanged’ between generations because the inherited DNA does undergo rare mutations over time during the generational copying process. It is the aggregation of these mutated forms, the DNA markers, that identifies a specific DNA signature. A living person is related most closely to persons with his/her same signature. Thus two persons with identical paternal or maternal signatures descend from a most recent common ancestor (MRCA) who is relatively recent compared to two arbitrary persons’ MRCA.
Y-DNA and mtDNA genetic signatures can be registered in public databases, the first and largest being http://www.YSearch.org and http://www.Mitosearch.org. These sites permit DNA result matching from all sources that support sharing of DNA results. The Ysearch site has a database of over 75,000 unique DNA signatures (2011), mostly from FTDNA. The Sorenson Molecular Genetics Foundation also had a private database, which has since been sold to a commercial firm (bad Sorenson).
Newer whole genome direct-to-consumer testing services, such as 23andMe, provide autosomal DNA results as well as mtDNA and Y-DNA results. These yield medically useful information, as well as an analysis of personal ancestry components covering the last ten or so generations. Farther back than that, the individual autosomal DNA fragments become so small that they lose their predictive grouping potential.
Whole-genome research groups, such as 1000genomes, have further provided broad-based worldwide testing. Their data provides identification of many new SNPs, which they share with the research community. As commercial companies add these new SNPs to their catalogues, consumers can specifically test for these SNPs and refine their ancestral groups accordingly.
Adam is the name we give to the MRCA of all living males, based on the Y-DNA markers. A similar person, Eve, corresponds to the MRCA of all living females, based on the mtDNA. These genetic Adam and Eve lived roughly 150,000-200,000 years ago (150-200 ka), as estimated by genetic clocks (estimated rate of DNA mutation occurrence). The dates are open to wide variance, as all genetic data types of early antiquity must be considered to correctly calibrate the genetic clock used to estimate the date. Thus, considering only European genomes would yield a more recent MRCA than would including widely distributed small population groups in isolated parts of Africa.
Maternal genetic clans were popularized in a book describing the seven daughters of Eve in fictional detail, based on the seven basic mtDNA signatures known at the time. In these articles, we mainly discuss the paternal NRY heritage. Hypothetical Adam was an anatomically modern human (AMH) who lived in Africa. The earliest Adam descendants are the A and B haplogroups. Haplogroup A peoples of South Africa include the Khoisan (click language) people. The B peoples reside today in central sub-Saharan Africa, including Pygmy tribes of the forests.
As DNA analysis advances, researchers identify further useful genetic markers, resulting in a more detailed signature that supports further sub-group differentiation. The farther this can be pursued, the smaller our most immediate identifiable relationship group becomes. When combined with much larger databases of native populations by signature, such refined signatures should enable more precise mapping of the geographic migrations of populations over the last several millennia.
Much beyond that, though, and many lines have gone extinct, so a completely detailed picture cannot be inferred from DNA in the current population. Technology for gleaning DNA signatures from preserved human remains from the late paleolithic (paleogenomics) will be required to extend our knowledge into the distant past.
Such advancements may create the possibility to estimate where our paternal and maternal ancestors were living in each of the past 70 millennia. From this and archaeological findings, we may then know more about their phenotypes, cultural affinities, art, burial practices, housing, food sources, language, and general social structure, all while locating them in space and time.
Details of Our Genetic Signatures: Mutations, DNA Markers, Haplotype, Haplogroup
Our DNA contains large sets of paired molecules called nucleotides. There are four basic types of nucleotide: A, G, T, and C for short. The simplest copying mutation is the substitution of one nucleotide for another at a given location (locus) in the DNA strand. For example, a T might become a C in the next generation. Other nucleotide modifications include insertions, deletions, and repeats. Any single nucleotide mutation is called a Single Nucleotide Polymorphism (SNP, snip).
Because the NRY SNPs participating in human population genetics occur only once in our history, rather than oscillating or assuming multiple values, they belong to a subset of SNPs called Unique Event Polymorphisms (UEP). Such mutations occur rarely. UEPs on NRY DNA occur at a hypothesized rate of ~10^-8 per generation or one in 7,000 years presuming an average generation duration of ~25 years. The set of all SNPs in a signature is termed one’s haplogroup (short for haploid group, a technical word for signature). All haplogroups are based on SNP markers. Haplogroups are nested within other haplogroups, forming a binary tree whose nodes are the DNA markers. Earlier mutations are associated with larger haplogroups; successive mutations on the same limb of a tree define ever finer population groupings.
NRY SNP markers are named Qxxx, where ‘Q’ encodes the lab that discovered the SNP (M, P, L, S, Z,…) and xxx is a sequence number assigned when the marker is registered.
mtDNA mutation rates are highly variable, and on average are higher than for chromosomal DNA, suggesting that higher resolution is possible, and that fewer generations are necessary to produce an SNP. One consequence of their hyper-variability is that mtDNA genetic markers are not necessarily unique in our history, hence not UEPs. They are SNPs that can switch back and forth; their predictive value is complicated and careful analysis is required. Genetic markers in mtDNA are said to be located in the hypervariable region, so named because of the recurrent mutation potential (hypervariability) of this region. To deal with these unstable SNPs, a relatively recent and known haplogroup is chosen as the Cambridge Reference Sequence (CRS). SNPs are then defined relative to the CRS. Markers so identified can then become UEPs when localized in genome time with respect to the CRS.
The basic 470 nucleotide test offered by one testing company looks at an mtDNA region called HVS-I (Hypervariable Segment I). The corresponding extended 760 nucleotide test consults the adjacent HVS-II region as well. Based on the discovered SNPs, the service will then assign the sample to a haplogroup. Testing services were initially very conservative and would only predict the highest level haplogroup to which one belonged. But in most cases, there is more differentiation information available from the reported SNPs than simply the reported haplogroup. Using information on the Internet, one can use one’s reported raw marker values to considerably refine one’s haplogroup. Right now testing companies ask for more money to make such determinations. So it is efficient to research one’s results and self-determine the haplogroup structure to the degree possible, making use of Internet information.
mtDNA SNP marker names have a physically descriptive nature, formatted as a molecular locus followed by the nucleotide type at that locus (after the mutation occurred). These SNPs are relative to the reference CRS haplotype. For example, one might have five SNPs reported: 174T, 189C, 192T, 270T, 311C. The symbol 174T is an abbreviation for 16174T where 16174 is the sequential nucleotide location in the mtDNA region being tested, and T is the mutated nucleotide at that location (mutated with respect to the CRS).
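The abbreviation rule just described can be sketched in a few lines of Python. This assumes that every abbreviated HVS-I name is three digits followed by one nucleotide letter, and that the full locus is recovered by adding 16000, as in the 174T example; the function name and the offset constant are inventions for this illustration, and reports from a particular lab may use other formats.

```python
import re

# Offset implied by the example in the text: 174T stands for 16174T.
HVS1_OFFSET = 16000


def expand_hvs1_marker(short_name):
    """Turn an abbreviated HVS-I marker such as '174T' into
    (full_locus, nucleotide), e.g. (16174, 'T')."""
    match = re.fullmatch(r"(\d{3})([ACGT])", short_name)
    if match is None:
        raise ValueError(f"unrecognized marker name: {short_name!r}")
    return HVS1_OFFSET + int(match.group(1)), match.group(2)


# The five markers used as the example in the text.
reported = ["174T", "189C", "192T", "270T", "311C"]
for name in reported:
    locus, nucleotide = expand_hvs1_marker(name)
    print(f"{name} -> {locus}{nucleotide}")
```

Expanding the names to full loci makes it easier to compare a report against published haplogroup definitions, which usually quote the full position.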
NRY Haplogroup Naming
The following diagram (from Y-Chromosome Diversity, Human Expansion, Drift, and Cultural Evolution; Chiaroni, Underhill, Cavalli-Sforza; 2009) shows the Y-DNA haplogroups that descend from Y-DNA Adam, and are still available in the current human genome.
- All the major clades (haplogroups) above were known early-on, when they were assigned the letters shown above. To avoid mass confusion, these letter identifiers remain sacrosanct.
- Since these original canonical letter names were designated, SNPs have been discovered that link two or more basic letter clades together into a super-clade, called a macro-haplogroup, e.g. IJK.
- In the following, haplogroups are identified by a ‘new style’ hyphenated form, a canonical letter from the figure above followed by the name of the single nucleotide polymorphism (SNP) that identifies this binary branch of the genetic tree, for example I-M170. This naming style is extensible to reference lower levels in the cladistic tree, by substituting the lower level SNP name whose clade is being referenced, for instance I-M436.
- SNPs are named after the lab that identified them. ‘M’ represents P. Underhill (Stanford); ‘P’ represents M. Hammer (U. of Arizona); ‘L’ represents T. Krahn (FTDNA); ‘S’ specifies J. Wilson (U. of Edinburgh); CTS represents Chris Tyler-Smith (Sanger Institute); Z represents a research consortium associated with the 1000 Genomes Project. The complete list is maintained by ISOGG. In many cases, different labs find the same SNPs and name them differently. Due to the continued responsiveness and openness of the Underhill lab, ‘M’ SNPs are preferred here for identification whenever they exist. In the I2 tree, P37.2, L38, and L460 have no known equivalent ‘M’ SNPs.
- A notation such as IJ* indicates a residual population that is ‘ancestral’ (not derived) with respect to any known subclades downstream from the indicated clade’s SNP. (Such a population group is called a paragroup, short for paraphyletic haplogroup).
- A notation such as M170+ indicates that the referenced people are ‘derived’ for that SNP; i.e. their DNA has this mutation.
- Nested subclades of a haplogroup were originally identified by appending chronologically sequenced, alternating number and letter suffixes to the haplogroup letter; e.g. E1b1b, where E1 is earlier than E2, E1b is later than E1a, etc. (E1b1b refers to a subclade of haplogroup E that is M215+, new style E-M215.)
- In the past, discovery of new SNPs that change the overall tree structure has resulted in substantial renaming of subclades using this original naming convention. For instance, the new L460 SNP early in the I tree has potential to cause the ad-hoc standards organization, ISOGG-2011, to change all the old-style downstream I2 names. In old style naming, my clade’s name might change from I2b1 to I2a2a, where both names would then be in use in reference materials, a nonsensical situation. Using new-style names such as my I-M223, my position in the new and old I-tree naming schemes remains the same.
- mtDNA haplogroups are also identified by alphabetic letters to identify the major ancestral groups from the paleolithic, and use similar structured hierarchical names for the more recent clades. These letter names are unique within the NRY scheme and within the mtDNA scheme, but some letters are used in both. For instance, there is an NRY haplogroup I (Y-Hg I) and a mtDNA haplogroup I (M-Hg I).
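The naming conventions in the list above are regular enough to parse mechanically. The Python sketch below is one possible reading of them, not an official grammar: it assumes an SNP name is an uppercase lab prefix followed by a number with an optional decimal part (as in P37.2), and that a trailing '-' marks an ancestral call (the text itself only describes '+' for derived). All function names are invented for this example.

```python
import re

# Lab prefixes as listed in the text; other prefixes are reported as unknown.
LAB_PREFIXES = {
    "M": "Stanford",
    "P": "U. of Arizona",
    "L": "FTDNA",
    "S": "U. of Edinburgh",
    "CTS": "Sanger Institute",
    "Z": "1000 Genomes consortium",
}

SNP_RE = re.compile(r"([A-Z]+)(\d+(?:\.\d+)?)")


def parse_snp(name):
    """Split an SNP name such as 'P37.2' into (prefix, number)."""
    match = SNP_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an SNP name: {name!r}")
    return match.group(1), match.group(2)


def lab_for(name):
    """Look up the discovering lab from the SNP prefix."""
    prefix, _ = parse_snp(name)
    return LAB_PREFIXES.get(prefix, "unknown")


def parse_haplogroup(name):
    """Split a new-style name such as 'I-M170' into (clade, snp)."""
    clade, sep, snp = name.partition("-")
    if not sep or not clade.isalpha():
        raise ValueError(f"not a new-style haplogroup name: {name!r}")
    parse_snp(snp)  # validates the SNP part
    return clade, snp


def parse_call(call):
    """'M170+' -> ('M170', True). A trailing '-' is assumed to mean ancestral."""
    if not call or call[-1] not in ("+", "-"):
        raise ValueError(f"not an SNP call: {call!r}")
    parse_snp(call[:-1])
    return call[:-1], call[-1] == "+"


def is_paragroup(name):
    """'IJ*' style names mark a residual (paragroup) population."""
    return name.endswith("*")


if __name__ == "__main__":
    print(parse_haplogroup("I-M223"), lab_for("M223"))
    print(parse_snp("P37.2"), lab_for("P37.2"))
    print(parse_call("M170+"), is_paragroup("IJ*"))
```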
STRs – DNA SNP Proxies
For NRY tests, the detailed results from initial discovery tests do not usually provide values for the SNP marker names above. A different set of markers is returned which can predict the values of the above SNPs. This indirectness is pursued for economic reasons, and usually gets one to the same result.
The common screening NRY DNA mutation marker is called a Short Tandem Repeat (STR) or microsatellite. Some sequence of nucleotides (usually four in length) is repeated a variable number of times at a specified locus. That number of repeats becomes the principal characteristic of that locus’ allele, a different type of genetic marker.
The most basic and cost-effective NRY test typically looks at 12 STRs and reports the allele value of each (# of repeats) as the detailed test result. Alleles are named DYSxxx where xxx is an assigned sequence number and DYS stands for DNA Y-chromosome Segment. Depending on what lab does the testing, the values returned (# of repeats at each tested locus) might be different, so a normalization may be required to compare apples to apples.
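As a rough illustration of what an STR allele value means, the Python sketch below counts the longest run of back-to-back copies of a motif in a DNA string. The sequence and the GATA motif are invented toy data, not a real DYS locus, and real laboratory calling works from sequencing or fragment-length data rather than a clean string.

```python
def longest_tandem_run(sequence, motif):
    """Largest number of consecutive copies of `motif` found anywhere
    in `sequence`. Every start position is tried, so runs that begin
    at any offset are counted correctly."""
    if not motif:
        raise ValueError("motif must be non-empty")
    n = len(motif)
    best = 0
    for start in range(len(sequence) - n + 1):
        count = 0
        pos = start
        while sequence[pos:pos + n] == motif:
            count += 1
            pos += n
        best = max(best, count)
    return best


# Toy locus: invented flanking bases around 11 copies of a 4-base motif.
toy_locus = "ACGT" + "GATA" * 11 + "TTCA"
print("Repeat count (allele value):", longest_tandem_run(toy_locus, "GATA"))
```

The number printed is the kind of value a lab reports for each DYS marker.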
The set of STR alleles one has tested determines one’s Y-DNA haplotype (haploid genotype), a different genetic signature (but that can be associated with one’s haplogroup). The more alleles that are tested, the finer the resolution of one’s haplotype. People with the same surname who match at 25/25 STR markers may be confidently assumed to have a common ancestor within the last few generations (<10). Matching 37/37 STRs would make the common ancestor of both people very recent.
Once an NRY haplogroup has been predicted from the STR proxy set, specific related SNPs can then be tested economically. STR mutations occur much more frequently (10^-3 per generation) than SNP mutations, so they offer the prospect of finer time resolution when dating the events that separate lineages. It is also easier (less expensive) to test for STRs than SNPs.
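The link between exact STR matches and a recent common ancestor can be explored with a back-of-the-envelope Python sketch. It assumes every marker mutates independently at the 10^-3 per generation rate quoted above, and that two people whose common ancestor lived g generations ago are separated by 2g father-to-son transmissions. This is a simplified forward probability, not the calculation a testing company uses to estimate time to the MRCA.

```python
def prob_exact_match(markers, generations_to_mrca, rate=1e-3):
    """Probability that two men still match on all `markers` STRs,
    given their common ancestor lived `generations_to_mrca` generations
    ago. Assumes independent markers with one shared mutation rate."""
    transmissions = 2 * generations_to_mrca  # one line down to each man
    return (1.0 - rate) ** (markers * transmissions)


if __name__ == "__main__":
    for markers in (12, 25, 37):
        for g in (5, 10, 20, 40):
            p = prob_exact_match(markers, g)
            print(f"{markers:>2} markers, MRCA {g:>2} generations ago: "
                  f"P(exact match) = {p:.2f}")
```

Under these assumptions an exact match becomes steadily less likely as the common ancestor recedes, and it falls off faster when more markers are compared, which is the intuition behind treating a 37/37 match as more recent than a 25/25 match.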
Improved time resolution and less expense are good, but it is the set of associated SNP mutations that ultimately determines one’s haplogroup. So one needs to translate the haplotype into an associated haplogroup. In many cases, knowing one’s haplotype (set of STR alleles), it is possible to infer one’s haplogroup without ever testing for specific SNPs. This is accomplished by simple pattern-matching with known tested results.
A database of test results showing haplogroup based on STR alleles should be consulted. By matching the alleles from a test to the alleles in the database, one can in effect look up or infer the haplogroup without the expense of testing for individual SNPs. Determining the haplogroup from a database is much less costly than testing for a lot of SNPs. As a further aid, one can plug the tested alleles into a web haplogroup calculator, which accesses haplotype lookup databases, then returns the highest resolution haplogroup together with a related confidence factor.
Testing companies offer packages of over 100 STRs for testing; although not usually useful for genealogy purposes yet, such abundant data can be of significant value to researchers in population genetics. One’s haplotype can be a fine tuning mechanism that supplements one’s haplogroup, providing additional resolution for isolating one’s ancestral clan on the genotypic map of archaic populations.
Testing companies offer tests for individual SNPs and also for large sets of SNPs (expensive) called ‘deep clade’ testing. It is a better (less costly and more resource-efficient) strategy to test for a sufficient number (min. 25) of STR alleles to estimate which SNPs one likely may express. Then, join an online group researching your haplotype and learn from them which specific SNPs are predicted by your signature to be of further significance to you.
The Importance of the Haplogroup
Haplotypes are quickly and economically determined. But they do not by themselves define a binary tree structure that can elucidate population genetics. It is one’s SNPs, either directly tested or inferred from one’s NRY haplotype, that define one’s NRY haplogroup (aka clade). Haplogroup relatedness is represented in a tree structure. SNP marker names typically are shown at branching points (nodes) of a haplogroup tree.
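How a set of SNP results places someone on such a tree can be sketched in Python. The tree below is a toy: the SNP names X1 to X4 and their parent-child relationships are hypothetical placeholders, not part of any published Y-DNA tree. The sketch assumes the terminal haplogroup is the deepest node whose whole path back to the root is derived.

```python
# Toy tree: each hypothetical SNP maps to its parent SNP (None = root).
PARENT = {"X1": None, "X2": "X1", "X3": "X1", "X4": "X2"}


def path_to_root(snp):
    """List of SNPs from the root down to `snp`."""
    path = []
    while snp is not None:
        path.append(snp)
        snp = PARENT[snp]
    return path[::-1]


def terminal_snp(derived):
    """Deepest SNP whose entire ancestry is in the `derived` set,
    or None if even the root SNP is not derived."""
    best = None
    best_depth = 0
    for snp in PARENT:
        path = path_to_root(snp)
        if all(s in derived for s in path) and len(path) > best_depth:
            best, best_depth = snp, len(path)
    return best


# Hypothetical test result: derived for X1, X2 and X4.
result = {"X1", "X2", "X4"}
print("Terminal SNP:", terminal_snp(result))
print("Path:", " > ".join(path_to_root(terminal_snp(result))))
```

Each further SNP tested can only move the terminal node deeper along one limb, which mirrors the way successive mutations define ever finer population groupings.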
A haplogroup diagram is a phylogenetic tree, showing the genetic relatedness of different clades via the terminology and methods of cladistics. Cladistics uses genetic traits (genotype) to differentiate populations, a more accurate result than the prior taxonomy based on physical traits (phenotype). The haplogroup allows researchers to track population movement over time by looking at frequencies of occurrence for various clades across the world’s current populations.
Haplogroups and Race
The concept of Caucasian usually brings to mind a homogeneous population with a common background that makes Caucasians different from others. Genetics debunks this conception. Caucasoids do not form a monophyletic clade.
Genetically, the large majority of Caucasian Europeans or Americans (Y-Hg R1a/b) are more closely paternally related to a native of China (Y-Hg O) than to many of the people in rural Sweden (Y-Hg I), where descendants of the original population of Europe still appear in substantial numbers. Additionally, almost all Europeans are truly Euromutts, some combination of all ancestral lines who ever lived there.
There is also ongoing work to establish DNA relationships with human disease occurrence rates. The NRY DNA loci used for genetic testing are thought to be in ‘garbage’ areas of the chromosome and hence not influential in human pathology, although the Y chromosome is being studied in this regard (e.g. Y-genotype associations with male infertility). mtDNA analysis for disease susceptibility patterns is extensive, including prediction of cancer and longevity. Knowing such information can be beneficial for an individual, and at the same time detrimental to the individual if other people can access this information and discriminate accordingly.
Probably all genotypes have increased susceptibility to some disease or another, so in the long term there may be little of discriminatory value in these databases. And assuredly DNA redlining for insurance purposes will be made illegal. Yet caution is indicated, and anyone being tested should ensure no readily-discernible links associate personal identification and the DNA information maintained by the testing facility.
The search databases are more than simply statistical sources. They provide the ability to contact individuals whose data are made available there. But this contact is only permitted from one registered user to another and is fully mediated by the search service.