Monograph
Corresponding author: Hugo de Boer (h.de.boer@nhm.uio.no)
© 2022 Hugo de Boer, Marcella Orwick Rydmark, Brecht Verstraete, Barbara Gravendeel.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
de Boer H, Rydmark MO, Verstraete B, Gravendeel B (2022) Molecular identification of plants: from sequence to species. Advanced Books. https://doi.org/10.3897/ab.e98875
Names are the carriers of knowledge. Without names, much of science would be meaningless. Names give us insight into the diseases that affect our health; the objects that sustain our economies; the celestial bodies that travel in the Universe. Names solve ambiguity.
In botany, the name of a plant may provide the first clues as to its characteristics, also called traits. Is it edible, or poisonous? Beautiful, or ugly? While some traits are relative (edible by whom, ugly to whom?), others are absolute: thorny, succulent, epiphytic. Some are obvious, others elusive. From morphological descriptions and DNA sequences to historical accounts and traditional uses, they are all linked by the name.
Until recently, the reliable identification of plants was the task of a select few: the taxonomists. Today, this is less so. The molecular identification of plants through DNA barcodes has been shown to perform just as well as, and often better than, taxonomists for many taxa, particularly when specimens lack reproductive structures. Other techniques, such as image recognition through machine learning and the spectrophotometric signature of leaves, can yield similar results. Does this mean the demise of taxonomists is on the horizon?
Not at all. I believe it is very much the opposite: in the current environmental crisis, the need to document and protect the world’s biodiversity has never been more acute. At the same time, some 20% of all plant species have not yet been scientifically described, and many of them may disappear even before we have identified and characterized them. The work of taxonomists remains therefore critical, but as molecular identification of species is underway and set to become routine across the private and public sectors, expert time can now be reallocated from bulk identifications to the training of students, build-up of physical and digital reference collections, and further development of identification methods. Technologies are here to help – not replace – taxonomy, by complementing the human strengths and compensating for some of our human weaknesses: an insufficient memory, a biased brain, and lack of time.
This book is for you who are curious about how plants can be identified using DNA: the most powerful source of information to link a plant to a name. This may sound trivial, but it is not. Don’t despair in advance, though: it is doable, mostly fun, and always rewarding. You just need to learn how.
Here, you will not only learn how various types of materials containing plant fragments can be identified to species in the lab and how to execute sophisticated computer analyses, but also gain a deeper understanding of the complexities and challenges faced by taxonomy in general, and plant identification in particular, including the lack of comprehensive reference databases. Enforcing strict species concepts onto nature’s inherent fluidity doesn’t always work, and despite all recent advances in this field it still happens that some plant samples cannot be confidently named. Yet, if this ever happens to you, this initially frustrating insight can also be scientifically revealing, and help you design further experiments.
The applications of molecular identification are far more numerous and trans-disciplinary than most people would imagine. Several chapters take a deep dive into applications in fields as seemingly disparate as palaeobotany and healthcare, but as I argued at the start of this text, they are all unified by a common denominator: the name, the information-carrier.
I hope you will find this book as inspiring, informative, and revelatory as I have, and that you will choose to carry out your own projects using the molecular identification of plants. And if you do so, just don’t forget to cite the chapters that inspired you!
An estimated 340,000–390,000 vascular plant species are known to science (
Organismal diversity is the foundation of all biological research, but species discovery and delimitation require taxonomic skills. Even the most experienced taxonomists can rarely critically identify more than 0.01% of the estimated 10–15 million species (
The global scientific community lacks the expertise and continuity to identify all species diversity, and biodiversity is being lost faster than we can discover and describe new taxa (
DNA-based species identification, i.e., molecular identification, makes it possible to identify species precisely from trace fragments such as pollen (
These innovations in molecular identification enable us to detect and identify species in places and settings that were unimaginable only a few decades ago, or even in 2020 (
Deoxyribonucleic acid (DNA) is the blueprint of life. DNA encodes genes which carry instructions for the production of proteins, the fundamental components of a cell’s machinery. DNA was first isolated and confirmed as the genetic material in cells, and thereby the basis of heredity, in the 1940s (
A fundamental tenet of molecular biology is that DNA is transcribed into ribonucleic acid (RNA), and subsequently translated into amino acids that form a protein sequence. We now have a much more detailed understanding of this framework, including the varied roles of RNA in gene expression and regulation, and the role of epigenetics—heritable changes in DNA that do not alter the base sequence (e.g., methylation). Since the discovery of DNA, there has been a steady increase in the use of DNA sequences as molecular markers in varied biological contexts, including medical and forensic applications, elucidation of genes encoding adaptive traits, understanding population genomic processes, as well as systematics of prokaryotic and eukaryotic organisms.
Most DNA extraction protocols extract total cellular DNA. In certain experimental cases, it can also be preferable to target either DNA contained in the nucleus or DNA comprising organellar genomes (in plants: mitochondria and plastids). Organellar genomes are much smaller than any plant nuclear genome.
As with virtually all eukaryotes, plants have endosymbiotically derived mitochondria for cellular respiration and energy production. However, compared to other eukaryotic kingdoms (animals in particular), the mitochondrial genome of plants is quite large, ranging between 200 and 750 Kbp in size (
In contrast, plastid genomes (e.g.: found in chloroplasts of leaves or amyloplasts of cereal grains) have a very stable genomic structure and a size of around 150 Kbp in most cases (
Nuclear genomes, particularly in angiosperms, are highly variable in size, with the angiosperm mean and modal 1C value (the amount of DNA in an unreplicated gametic nucleus) both at around 5 pg (roughly 5 Gbp) (
The first isolation of DNA, by the Swiss physician Friedrich Miescher in 1869, happened accidentally while studying proteins from leukocyte nuclei (
Plants possess a tough cell wall made up of cellulose and other compounds such as lignin, in addition to a cell membrane. This necessitates a robust first step for plant DNA extraction that disintegrates the structure of the plant tissue and breaks down cell walls. In a low-throughput scenario (or for samples that are tougher to disrupt), this could involve flash freezing the tissue with liquid nitrogen followed by grinding with a pestle and mortar. For higher throughput of samples, tissue-disrupting machinery can be applied. The ground material should then be taken forward immediately to the chemical steps of the process, which involve breakdown of the cellular membrane to release the lysate containing the soluble DNA. This is then separated from cell debris and other insoluble material. Various methods are subsequently used to separate DNA molecules from the remaining material, which can contain soluble proteins, nucleic acids, and small molecular metabolites (
Numerous protocols and procedures have been developed to extract DNA from plant material of varying origins (
A major innovation in DNA extraction protocols from plant material was developed by Doyle and Doyle (
Plant material for any research project must be collected ethically and legally, and the preparation of DNA extracts is no exception. Permission, prior informed consent and mutually agreeable terms of use must be obtained before using plant tissue for DNA extraction according to the Convention on Biological Diversity. This includes the fair and equitable sharing of benefits arising from the utilisation of genetic resources (as outlined in the Nagoya Protocol). National and international law and conventions apply to derivatives of biological materials, including DNA extracts and their transportation. The same principles apply to botanical collections such as seeds, silica dried specimens stored in a tissue bank, herbarium specimens, or plants in living collections. The terms under which they are stored in a collection may restrict the use of specimens for research and require additional permissions (for instance, from the regulatory authority in the country of origin) before they can be used. The storage and future use of DNA extracts, likewise, must comply with the terms of the permissions granted, which could include being stored indefinitely for future research, returned to the country or institute of origin, or discarded. See Chapter 2 DNA from museum collections for guidance about your responsibilities as a researcher.
DNA can be extracted from healthy plant tissues including leaves, flowers, buds, seeds, roots, bark, and even spines. Young leaf tissue is the preferred starting material (
Successful extraction of high-quality DNA from any plant material depends on the material being prepared correctly, dried rapidly (without excessive heat treatment), and stored in a dark, dry place to minimise degradation of its DNA. DNA degradation prior to extraction is caused by the release of endogenous nucleases during cellular lysis, which may be accelerated by environmental factors such as heat and humidity (
The extraction method is determined by the plant material available. For most kit- and CTAB-based protocols, a 1 cm² section of herbaceous leaf tissue will suffice for a single extraction. Careful laboratory notes of the material used, including provenance data, sample weight, and extraction date, are vital for checking the quality of sequencing results against the specifics of the extraction process in the lab and for pinpointing reasons for variation between samples. For some protocols, weighed tissue can be placed straight into a 1.5 ml tube labelled with a unique number or laboratory code and other information, ready for the DNA extraction process.
Plant material dried and stored in silica gel – including as specimens stored in tissue banks specifically for the purpose of DNA extraction – tends to be a good source of high-quality DNA. Silica gel (silicon dioxide xerogel) is a desiccant that removes moisture from the atmosphere, drying out the plant tissue. Indicator silica gel crystals change colour when the silica is saturated, signalling when the silica gel should be regenerated or replaced. These crystals can be used in a mixture with non-indicating silica gel.
The use of silica gel is a popular approach to dry fresh plant material for DNA extraction because it is low cost and convenient compared to liquid nitrogen or lyophilization, especially when preparing tissue in the field. To effectively preserve the DNA in plant tissue, the recommended minimum ratio between plant material and silica is 1:10 (
One approach is to freeze plant tissue until needed for DNA extraction, preferably at –80 °C, and otherwise in a standard laboratory freezer at –20 °C, if the sample is properly sealed. Alternatively, material can be flash frozen in liquid nitrogen. The resulting rapidly frozen material can yield high-quality DNA extractions, but liquid nitrogen is impractical for some settings due to handling considerations and cost (
High-quality DNA can be extracted from lyophilized (or freeze-dried) tissue, such as leaves and roots (
After the plant material has been prepared by drying and/or freezing using one of the above-mentioned techniques, a DNA extraction protocol can be implemented. Although there are a multitude of available protocols, the general methodology involves the following steps, discussed in more detail below: grinding of the plant tissue; optional pre-treatment; cell lysis; isolation and precipitation of the DNA; and purification of the extract.
We place emphasis on the CTAB protocol due to its popularity, but also introduce other protocols that may be of interest to the reader.
The starting amount of plant tissue is important: too little will result in an unsatisfactory yield and too much may lead to poor grinding, saturation of the reaction and/or excessive debris which can also be detrimental to final yield. A useful starting ratio is a buffer quantity that is fivefold that of the weight of the leaf tissue (e.g., 0.2 g leaf tissue for 1 ml of buffer) (
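As a quick sanity check, this fivefold ratio can be expressed as a one-line calculation (an illustrative helper based on the ratio suggested above, not part of any published protocol):

```python
def buffer_volume_ml(tissue_g: float, ratio_ml_per_g: float = 5.0) -> float:
    """Volume of extraction buffer (ml) for a given mass of leaf tissue,
    using the suggested fivefold buffer-to-tissue ratio (5 ml per g)."""
    if tissue_g <= 0:
        raise ValueError("tissue mass must be positive")
    return tissue_g * ratio_ml_per_g

# 0.2 g of leaf tissue corresponds to 1 ml of buffer, as in the example above
print(buffer_volume_ml(0.2))
```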
Plant tissue must be finely ground to a powder such that the cell walls are disrupted and the cell membranes are more accessible for the chemical reagents in subsequent steps to act successfully. It is advisable to scrape hairs or wax from the surface of the plant tissue before weighing and grinding. For herbarium specimens, special care should be taken that any glue that may be present is removed since this can interfere with the reagents used during the DNA extraction. Sterilised sand can also be used to increase the friction and enhance the disruption of the tissue; it will be separated later in the DNA extraction protocol. Fleshy tissue can be flash frozen in a mortar with a little liquid nitrogen before grinding. The dewar for transporting the liquid nitrogen should be clean and free of potential contaminants.
Manual grinding is inexpensive, yet time consuming and requires a sterilised mortar, pestle, and spatula for each sample. Use of a mechanical homogenizer, also called a tissue lyser, is more efficient. A steel ball bearing is added to each tube with a sample and shaken at high frequency within the instrument. This allows multiple samples to be disrupted simultaneously with minimal degradation of the nucleic acids. It also minimises loss of material and the chances of contamination, as each sample is processed in the tube that it remains in for subsequent extraction steps. Metallic, ceramic, or silica beads of different sizes can be added to the sample tubes to increase the disruption of particularly tough or woody material. Metallic and ceramic beads must be removed before proceeding with the protocol, but silica beads can be separated later in the protocol.
This step can be included as an optimisation strategy for increased yield, quality, or purity of the extracted DNA. For example, when high amounts of polysaccharides and/or polyphenols in the plant material are a concern (as is the case for succulent plants and plants in high stress environments, respectively), the modified STE-CTAB protocol can be used (
In this stage, the goal is to release nucleic acids from the cell, whilst minimising the risk of nucleic acid degradation and commencing the segregation of unwanted cellular compounds from the DNA molecules.
The hallmark of the most widely adopted method for DNA extraction from plants, originally developed by Doyle and Doyle (
CTAB buffer is added to each sample tube containing ground plant tissue and the mixture is incubated at 60–65 °C for 15–60 minutes. This can be done in an automatic shaking incubator. Alternatively, the sample tubes can be periodically shaken manually.
Alternatively, methods involving an SDS buffer can be applied (
The goal of this stage is the separation of DNA from other molecules in the lysate, by making use of the differing polarity of these molecules. This is followed by DNA precipitation from the solution.
In the CTAB protocol, the methodology is phase separation using organic solvent(s), from which hydrophilic molecules, including DNA, can be isolated. A 24:1 solution of chloroform-isoamyl alcohol (SEVAG buffer) is added to the incubated CTAB/leaf tissue mixture. This solution is hazardous and must be prepared and added to the sample tubes in a fume hood to avoid inhalation. It is also highly volatile, so it should be handled quickly to limit evaporation during the work. The mixture is then centrifuged at room temperature, which results in the DNA becoming concentrated in the clear upper phase (i.e., the aqueous phase). The supernatant is very carefully drawn off with a pipette without disturbing or touching the organic phase (containing the chloroform with lipids, proteins, and other cellular debris) and transferred to a new tube. The supernatant is purified by adding RNase A and chilled isopropanol, where the latter induces precipitation of DNA. Samples are then transferred to a freezer at –20 °C, either overnight or for several days if sample input is low and maximum precipitation is desirable (at the cost of potential co-precipitation of salts).
In the SDS protocol, proteins and polysaccharides precipitate with the SDS itself. Sodium acetate in turn is used to precipitate the DNA; in solution this compound dissociates and the sodium ions (Na+) neutralise the negative ions on the sugar phosphate backbone of DNA molecules, thus making it less hydrophilic and amenable to precipitation (
As a final step to both methodologies, the samples are centrifuged to encourage the formation of a DNA pellet, optionally washed with 70% ethanol at least once and re-suspended, preferably in 10 mM Tris-EDTA buffer (which serves to protect the DNA from damage, as explained in the CTAB buffer recipe above).
The DNA isolation stage is not perfect. Since the extraction process involves steps that segregate compounds by binding properties and molecular weight, co-extraction of molecularly similar polysaccharides is common. Furthermore, the eluent can contain certain contaminants, including traces of chemicals added during the extraction process and precipitated salts, as well as endogenous proteins, tannins, polysaccharides, and other molecules. The presence of such compounds can negatively impact the downstream experimental use of the DNA (i.e., act as PCR inhibitors), and further purification of DNA using various clean-up steps may be necessary.
One strategy is using a silica column and centrifugation-based method, by adding a chaotropic agent (commonly guanidine hydrochloride), which disrupts the hydrogen bonds between water molecules, creating a more hydrophobic environment. This increases the solubility of non-polar compounds (often contaminants) and additionally breaks up the hydration shell that forms around the negatively charged DNA phosphate backbone and further promotes efficient adsorption to the column surface under high salt and moderately acidic conditions (
An alternative involves the use of Solid Phase Reversible Immobilisation (SPRI) beads (
When a DNA extraction protocol does not yield satisfactory results, in terms of quality or quantity of extracted DNA, modifications can be applied. A valuable strategy for this is conducting a search of the scientific literature for protocols that have been used for similar experimental purposes or have targeted the same taxonomic groups.
If using the CTAB protocol, understanding the biochemical actions and interactions of its components is a useful starting point for identifying what might need adjustment to help improve the outcome. CTAB acts according to the ionic strength of the solution; the concentration of NaCl must be at least 0.5 M so that CTAB does not bind to nucleic acids, but does bind to proteins and neutrally charged polysaccharides as desired. NaCl is most commonly used at a concentration of 1.4 M. When working with a plant group that has a high content of polysaccharides, experimenting with higher concentrations of NaCl may improve the purity of the final DNA. Sometimes, other reagents such as N-Lauroylsarcosine (sarkosyl) buffer can be added to enhance lysis (rupturing of the cell membrane) and to reduce the activity of DNase or RNase enzymes. Proteinase K can also be added to enhance the denaturation of proteins. The volume of 24:1 chloroform-isoamyl alcohol solution can also be adjusted. Phenol can be added as an additional non-polar, organic solvent that is highly effective in denaturing proteins and can aid in increasing the final DNA yield, as opposed to solely applying chloroform (
Tris-HCl and EDTA are present in nearly all protocols. ß-mercaptoethanol is toxic and should thus be handled with care, and always in a fume hood with an extractor fan. One may consider simply not adding this reagent to the solution for plant tissues low in phenolic compounds. However, it is important to note that phenolic compounds co-precipitate with DNA and thus can be problematic in downstream steps of DNA laboratory work. ß-mercaptoethanol can be replaced with less toxic alternatives such as PVP (polyvinylpyrrolidone). PVP attaches to phenolic compounds via hydrogen bonding and can be removed together with them after centrifugation (
Most commercial kit-based protocols use a combination of buffers that perform similar functions to the components of the CTAB protocol, with a final step of elution through silica-columns, which tends to yield relatively clean DNA extracts. An added benefit of column-based kits is the use of filter columns at an earlier stage for the separation of crude plant material. Silica-based columns bind DNA so that it can be washed multiple times with alcohol-containing solutions to wash away contaminants before DNA elution. This speeds up DNA extraction significantly, reducing the total time from multiple days – as is common in regular protocols – to 6 hours. Drawbacks of these approaches however include the reduced yields of purified DNA in comparison to CTAB + chloroform extractions, as well as the significantly higher (~3–4 fold greater) cost.
Commercial kits that use magnetic beads are also becoming increasingly popular. Magnetic bead extraction kits are highly versatile and provide high yields of DNA that are also highly pure, in the absence of the hazardous solvents chloroform and phenol. After plant tissue grinding and lysis with an appropriate buffer, DNA is bound to the surface of the magnetic particles. The magnetic particle-DNA system is then washed several times with alcohol-containing solutions before a final elution step with a low salt buffer or nuclease-free water. In contrast to the column-based extraction method, binding of DNA to the magnetic particles occurs in solution, thus enhancing the efficiency and kinetics of binding and simultaneously increasing the contact of the bead-DNA compounds with the wash buffer, which improves the purity of the DNA. Magnetic particle kits have also been applied in combination with steps from the CTAB extraction method to extract high quality DNA from sorghum leaves and seeds, cotton leaves and pine needles (
Finally, a less common commercial method involves the use of Whatman FTA® PlantSaver cards and custom reagents. This method is very practical in terms of collection of samples in the field and their transportation. Furthermore, immediate mechanical disruption of the plant tissue can eliminate the need for obtaining permits. While this method has been predominantly applied to agricultural plant taxa, its performance in 15 phylogenetically diverse non-agricultural taxa has been demonstrated, where DNA from these samples was found to be less fragmented than that from replicate samples extracted alongside with the CTAB method (
Assessment of the properties of each genomic DNA (gDNA) sample post-extraction – its integrity, quantity, and purity – is imperative for making decisions regarding downstream molecular work. The methods described below have some overlapping uses in terms of assessing these different properties, but we highlight which is most appropriate for each DNA quality-related aspect.
Agarose gel electrophoresis is an appropriate method for estimating DNA integrity, as well as for crudely estimating DNA concentration. This method requires a horizontal gel electrophoresis tank with an external power supply, agarose, a running buffer such as Tris-acetate-EDTA (TAE) or sodium borate (SB), a fluorescent intercalating DNA dye, a loading dye, and a DNA standard (‘ladder’). The intercalating dye is added to the buffer (or sometimes to the loading dye) and serves to visualise the DNA in the agarose gel at the end point of electrophoresis. Historically, ethidium bromide was the standard intercalating agent, but it has now mostly been superseded by safer dyes that are less carcinogenic and do not require complex disposal procedures. Nonetheless, it is recommended that any compound that intercalates DNA be handled with care. The DNA standard is referred to as a ladder, since it is a set of DNA fragments of known sizes and concentrations, which provide benchmarks of size and concentration for comparison.
Each DNA sample and the DNA standard (ladder) are combined with loading dye and then pipetted into a well of the agarose gel, to then be subjected to an electric field. Due to the negatively charged phosphate backbone, DNA molecules will migrate towards the positively charged anode. The DNA migration rate depends on the fragment size, where smaller DNA fragments migrate faster, leading to a size-associated separation of DNA molecules. Additionally, the percentage of agarose in the gel will determine the size range of DNA that will be resolved with the greatest clarity. A range of 0.5% to 3% encompasses most applications, where < 1% is best for examining the genomic DNA of plants and 3% would be suitable for examining fragments with small (e.g., ~20 bp) differences in length. Once the fragments have migrated sufficiently to ensure resolution of the DNA and ladder, the gel is transferred to a cabinet with a UV light and the DNA fragments are visualised due to the excitation of the intercalating dye when UV is applied. The approximate yield and concentration of genomic DNA in a gel are indicated by comparison of the sample’s intensity of fluorescence to that of a standard.
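The relationship between expected fragment size and gel percentage can be sketched as a simple rule of thumb (the thresholds below are illustrative, chosen to be consistent with the 0.5–3% range given above, and should be tuned to the application):

```python
def agarose_percent(expected_fragment_bp: int) -> float:
    """Rough agarose concentration (% w/v) for resolving DNA fragments of a
    given size; thresholds are illustrative rules of thumb."""
    if expected_fragment_bp > 10_000:  # high-molecular-weight genomic DNA
        return 0.5
    if expected_fragment_bp > 1_000:   # typical amplicons and digests
        return 1.0
    if expected_fragment_bp > 200:     # short amplicons
        return 2.0
    return 3.0                         # resolving ~20 bp length differences

print(agarose_percent(50_000), agarose_percent(100))
```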
Where a more precise estimation of the size of the DNA fragments is required, automated capillary electrophoresis can be used. Such systems (e.g., Agilent Bioanalyzer, Agilent TapeStation) are more expensive to use, but – aside from precision – offer faster preparation and analysis times.
Fluorescent measurements are considered the most accurate quantification method for measuring DNA concentration. These involve the addition of fluorescent dyes (in an accompanying buffer), which selectively intercalate into the DNA. Fluorescence measurements use excitation and emission values that vary depending on the dye used. The concentration of unknown samples is calculated by the fluorometer (e.g., Quantus™ or Qubit™) based on a comparison to a standard measurement from DNA of a known concentration (usually lambda bacteriophage DNA). Since the dyes are sensitive to light and degrade rapidly in its presence, sample tubes must be stored in the dark if readings are not taken imminently after their preparation in the buffer.
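In its simplest single-point form, the comparison against the standard reduces to a linear scaling (a simplified sketch: real instruments typically fit a calibration curve from two or more standards):

```python
def conc_from_fluorescence(f_sample: float, f_standard: float,
                           standard_conc_ng_per_ul: float) -> float:
    """Single-point estimate: sample concentration scales linearly with its
    fluorescence relative to a standard of known concentration."""
    if f_standard <= 0:
        raise ValueError("standard fluorescence must be positive")
    return (f_sample / f_standard) * standard_conc_ng_per_ul

# A sample fluorescing at half the standard's intensity has half its concentration
print(conc_from_fluorescence(f_sample=50.0, f_standard=100.0,
                             standard_conc_ng_per_ul=10.0))
```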
A rough estimate of DNA yield and a more useful estimate of DNA purity can be measured via absorbance with a spectrophotometer that emits UV light through a UV-transparent cuvette containing the sample. Absorbance readings are conducted at 260 nm (A260), the wavelength of maximum absorption for DNA. The A260 measurement is then adjusted for turbidity (measured by absorbance at 320 nm), multiplied by the dilution factor, and calibrated using the following conversion factor: A260 of 1.0 = 50 µg/ml pure dsDNA. This useful relationship between light absorption and DNA concentration can be defined according to the Beer-Lambert law. Total yield is obtained by multiplying the DNA concentration by the final total purified sample volume. However, it is key to note that RNA also has maximum absorbance at 260 nm and aromatic amino acids have a maximum absorbance at 280 nm. Both molecules can contribute to the total measured absorbance at 260 nm and thus provide a misleading overestimate of DNA yield.
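The conversion described above can be written out directly (a worked sketch of the calculation, using the standard A260 of 1.0 = 50 µg/ml factor for pure dsDNA; the function and variable names are illustrative):

```python
def dsdna_conc_ug_per_ml(a260: float, a320: float = 0.0,
                         dilution_factor: float = 1.0) -> float:
    """Concentration of pure dsDNA from UV absorbance: the turbidity-corrected
    A260 times the dilution factor times the 50 ug/ml conversion factor."""
    return (a260 - a320) * dilution_factor * 50.0

def total_yield_ug(conc_ug_per_ml: float, volume_ml: float) -> float:
    """Total yield: concentration times the final purified sample volume."""
    return conc_ug_per_ml * volume_ml

# Example: A260 = 0.55, A320 = 0.05, 10-fold dilution, 50 ul final volume
conc = dsdna_conc_ug_per_ml(0.55, 0.05, 10.0)  # ~250 ug/ml
print(round(conc, 1), round(total_yield_ug(conc, 0.05), 1))
```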
DNA purity is evaluated by measuring absorbance in the 230–320 nm range. Since proteins are the contaminant of primary concern, absorbance at 260 nm divided by absorbance at 280 nm is the standard metric. DNA can be considered of high quality and suitable for most genomic applications when it has an A260/A280 ratio of 1.7–2.0. As a further step, the ratio of absorbance at 260 nm to that at 230 nm can help evaluate the level of salt carryover in the purified DNA, where an A260/A230 ratio > 1.5 is considered indicative of good quality. Strong absorbance at around 230 nm, which would lower this ratio, suggests the presence of organic compounds or chaotropic salts.
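These two ratio checks can be combined into a small quality-control helper (thresholds taken from the ranges above; the function and its field names are illustrative):

```python
def purity_check(a260: float, a280: float, a230: float) -> dict:
    """Evaluate the standard spectrophotometric purity ratios of a DNA
    extract: A260/A280 of 1.7-2.0 suggests low protein contamination,
    and A260/A230 > 1.5 suggests low salt/organic-compound carryover."""
    r280 = a260 / a280
    r230 = a260 / a230
    return {
        "A260/A280": round(r280, 2),
        "A260/A230": round(r230, 2),
        "protein_ok": 1.7 <= r280 <= 2.0,
        "salt_ok": r230 > 1.5,
    }

# A reasonably clean extract passes both checks
print(purity_check(a260=1.0, a280=0.55, a230=0.5))
```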
Instruments such as the NanoDrop® 2000 spectrophotometer are well suited to evaluating the A260/A280 and A260/A230 ratios. For quantification, this method is not as accurate as fluorescence-based quantitation, but it is most suitable where information on DNA purity is sought and is also time efficient (the sample is loaded directly into the machine and requires no preparation of buffers).
Particularly challenging types of plant tissue, as well as degraded plant material, can still yield high-quality DNA if suitably optimised protocols are followed.
For instance, seeds can be a good source of DNA if specialised protocols are used (
Advances in the sensitivity of genomic sequencing and optimised DNA extraction methods make it possible to study herbarium and other dried botanical specimens (
Physical and chemical degradation is to be expected in herbarium and museum specimens; DNA in deceased tissue breaks down over time. The rate of physical fragmentation is related to temperature and other environmental variables, as well as the composition of the plant tissue itself. In a study of herbarium specimens, fragment length was shown to decline significantly with specimen age across samples spanning 300 years (
The CTAB extraction protocol is generally preferable for extracting fragmented DNA, as it typically gives higher yields of DNA than kit-based methods. Where fragment sizes are expected to be very short, a high volume of chaotropic salt used as a binding buffer in the latter stage of extraction can improve the recovery of DNA molecules (
A wide variety of DNA extraction protocols are available in the literature. The structural, biochemical, and genomic characteristics of plants present a particular set of challenges; isolating high purity, undamaged DNA from plant tissue is non-trivial and requires a careful and patient approach in the laboratory. Therefore, researchers must often optimise a chosen protocol for their specific experiment. Success in the primary step of a molecular workflow is crucial, unlocking the downstream steps of plant molecular identification and characterisation, and hence possibilities for addressing many exciting questions in molecular and evolutionary biology.
Absorbance – A measure of the quantity of light absorbed by a sample, also referred to as optical density, measured using an absorbance spectrophotometer.
Beer-Lambert law – For a material through which light is travelling, the path length of light and concentration of the sample are both directly proportional to the absorbance of the light.
Chaotropic agent – A chemical substance which in an aqueous solution destroys the hydrogen bonds between water molecules (e.g., guanidine hydrochloride).
Cryopreservation – A preservation treatment for biological material, which involves cooling to very low temperatures (at least –80 °C, or –196 °C using e.g., liquid nitrogen).
Desiccant – A substance with a high affinity for water, such that it attracts moisture from surrounding materials, resulting in a state of dryness in its vicinity (e.g., silica gel).
DNA integrity – The level of fragmentation of extracted DNA, where minimal fragmentation of the original chromosomes equates to high DNA integrity.
Intercalating dye – A dye whose molecular components stack between two bases of DNA; invaluable for DNA visualisation, yet at the same time it poses a hazard to human health and demands laboratory safety precautions.
Lysate – A typically fluid mixture of cellular contents resulting from the disruption of cell walls and membranes via cell lysis.
Molecular marker (in a genetic context) – A sequence of DNA, which can be a single base pair, a gene, or repetitive sequence, with a known location in the genome, which tends to exhibit variation amongst individuals or taxa, such that it has useful research applications.
Organellar genome – The genetic material present in a plastid or mitochondrion, typically in the form of a small and circular genome and often in multiple copies within each organelle. These are thought to be present in eukaryotic cells as a result of endosymbiosis.
Plastome – The total genetic information contained by the plastid (e.g., chloroplast) of a plant cell.
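The absorbance and Beer-Lambert entries above underpin routine spectrophotometric DNA quantification. A minimal sketch, assuming the standard conversion that an A260 of 1.0 over a 1 cm path corresponds to roughly 50 ng/µL double-stranded DNA, and the common rule of thumb that an A260/A280 ratio near 1.8 indicates pure DNA (the readings below are hypothetical):

```python
def dna_concentration(a260, dilution_factor=1.0, path_cm=1.0):
    """Estimate dsDNA concentration (ng/uL) from absorbance at 260 nm.

    Beer-Lambert law: absorbance is proportional to concentration and
    path length, so concentration scales as A260 / path length; an A260
    of 1.0 over a 1 cm path corresponds to ~50 ng/uL dsDNA.
    """
    return 50.0 * (a260 / path_cm) * dilution_factor

def purity_ratio(a260, a280):
    """A260/A280 ratio; ~1.8 is generally accepted as 'pure' DNA."""
    return a260 / a280

# Hypothetical spectrophotometer readings of a 10x-diluted extract
conc = dna_concentration(0.35, dilution_factor=10)   # 175 ng/uL
ratio = purity_ratio(0.35, 0.19)                     # ~1.84
```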
Museum collections of plant origin include herbaria (pressed plants), xylaria (woods), and economic botany (useful plant) specimens. They are not only places of history and display, but also of research, and contain rich repositories of molecules, including DNA. Such DNA, retrieved from historical or ancient tissue, carries unique degradation characteristics and regardless of its age is known as ancient DNA (aDNA). Research into aDNA has developed rapidly in the last decade as a result of an improved understanding of its biochemical properties, the development of specific laboratory protocols for its isolation, and better bioinformatic tools. Why are museum collections useful sources of aDNA? We identify three main reasons: 1) specimens can play a key role in taxonomic and macroevolutionary inference when it is difficult to sample living material, for example, by giving us snapshots of extinct taxa (Van de Paer et al. 2016); 2) they enable the accurate identification of specimens that were objects of debate or scientific mystery, as exemplified by misidentified type specimens of the watermelon’s progenitor (
However, extracting DNA does mean the destruction of a part of the specimen. Museum curators therefore face challenges in balancing the conservation of specimens for future research with the rising demand for aDNA analysis. Increasingly, curators are also considering legal and ethical issues in sampling (
With few exceptions, plant material found in museums originally grew on lands tended or owned by people for many millennia (
A first consideration is whether the plant species or artefacts (such as baskets or wooden objects) are of special significance (e.g., sacred) to the source community. Examples of sacred material include Banisteriopsis caapi, used to make ayahuasca in South America (
There are international conventions that usually apply when accessing, researching, and moving plant material between institutions and countries. Researchers must also be aware of country-specific laws that may require further permits and inspections, e.g., for plants that produce controlled substances, require phytosanitary checks, or are considered invasive species. Legal elements of the Convention on Biological Diversity (CBD), Nagoya Protocol, and Convention on Trade in Endangered Species (CITES) are covered in Chapter 27 Legislation and policy as well as in other published works (e.g.
Botanical gardens hold living specimens and distribute seeds of these via seed lists (Index Seminum). Their global collections can be searched via PlantSearch, hosted by Botanic Gardens Conservation International. Gene banks hold seeds, and sometimes also tissue and living plants. While they originally focused on crop plants and their wild relatives, many have now broadened in scope to include wild plants, such as Royal Botanic Gardens Kew’s Millennium Seed Bank. Many gene bank collections can be searched via Genesys. Herbaria hold dried plant specimens and can be located via Index Herbariorum. Although many herbaria are incompletely recorded in databases, substantial data can already be found in the Global Biodiversity Information Facility (GBIF) (
There are a number of pitfalls when searching online catalogues. It may be necessary to search for accepted names and common synonyms: the same species may appear under different botanical names in a single collection, and accuracy of specimen identification varies. In general, herbarium specimens are the most reliable, as they bear diagnostic criteria such as flowers on which taxonomists rely. Garden material and seeds are often misidentified, become confused in labelling, or are hybridised during repeated cultivations. Their identifications should be confirmed, for example by growing on the seeds or by using morphological criteria (
Research projects will benefit enormously from a close collaboration between researcher and curator. Museums should be approached early during a project, with the researcher providing sufficient detail about its background, aims, methodology, and timetable. Museums are often under-staffed and persistence may be required in making contact. Curators’ expertise will be crucial in identifying the most appropriate specimens for analysis, not only in their institutions, but in others with which they are familiar. The curator will also play a key role in assessing the provenance of specimens, using museum archives, and the implications for any of the ethical and legal issues addressed above. Curators often have good links to source communities and can advise on appropriate procedures.
After preliminary discussions, the researcher will usually need to fill in a ‘destructive sampling’ form. This acts as a permanent record of the justification for sampling, and allows the museum to make a detailed check on the aims and methodology of the project (see for example, British Museum form and policies). Requests that have unclear research aims or which employ inappropriate methodologies are unlikely to be approved. Researchers will likely need to sign a Material Transfer Agreement (MTA) or Material Supply Agreement (MSA) with the museum which sets out their legal responsibilities.
Sampling may be carried out by the researcher or the curator. If feasible, it is worthwhile for the researcher to carry out the sampling, as it allows for the investigation of the context of the specimen and for flexibility in choosing the samples. It may also speed up the process of obtaining samples, especially if a large number is required. It also allows samples to be safely hand-carried to the researcher’s laboratory. Where materials must be sent, it is safest to use a courier service, with specimens marked “Scientific specimens of no commercial value”.
It should be agreed with the museum whether, after sampling, surplus material should be returned or securely retained. Museums can require that they are informed about results and that they check manuscripts before publication. This is in any case good practice to ensure accurate reporting of sample details. Museum policies on co-authorship vary, and this topic should be discussed early. Significant contribution by the curator on the choice of appropriate samples, provenance research, or in technically complex sampling, merits co-authorship. Unless agreed otherwise, DNA sequencing data should be submitted to NCBI GenBank or other public repositories, taking care to give the correct specimen identifier. At a minimum, the museum’s unique catalogue number (if one exists), and the name of the museum should be cited. This allows the DNA sequence data to be linked directly with the specimen or object. Other museum and laboratory information may be included with the DNA sequence data or in publications (e.g., the collector name, collection number, dates, locations, and laboratory extraction numbers). Additionally, most museum collections will require that vouchers are annotated in a way that links them to DNA sequencing data (see below). Some museums have also started to permanently store DNA isolates, and we encourage researchers to share their stocks on request. Integrated data management and accessibility of the raw data and results will ultimately bolster curatorial practices, develop a more ethical science, and safeguard collections for future generations (
Sampling decisions will be determined both by the research design and the nature of the specimens, in addition to the legal and ethical factors mentioned above. Changes to agreed sampling lists are often necessary once specimens have been examined, for example when they are lost, in poor condition, inadequately annotated or georeferenced, present in small quantities, or of rare taxa. Bulk raw material is usually easy to sample, while objects are usually not subjected to destructive sampling unless the results will inform the history and significance of the object. For herbarium specimens, preserving the morphological features, especially those that are diagnostic, for future research, is critical. Sampling should be targeted towards tissue types or organs at a given developmental state that are most numerous. For example, if there are many flowers and few leaves, it may be preferable to sample a petal. Or if there are few cauline and many rosette leaves, it may be preferable to sample a rosette leaf.
Different parts of a specimen may yield varying amounts, quality, and types of DNA. Wood, husks, and other tissues that were undergoing senescence at the time of preservation may yield less DNA. Young, immature leaves will have higher cell densities, and therefore are expected to yield more DNA. Seeds are often excellent sources of nuclear DNA, although the genotype of the seed will differ from the parent plant and might be of inconsistent ploidy. It may be necessary to extract DNA from individual seeds or to remove maternal tissue such as the testa. Some herbarium sheets will contain multiple individuals and, in most cases, it is better to sample individuals rather than mixed material. If individuals are pooled for DNA extraction, it may complicate downstream analyses that depend on individual genotypes.
The method of specimen preservation is another consideration for DNA isolation. Desiccation has been shown to preserve plant DNA remarkably well, while charring or ethanol preservation destroys plant DNA almost completely (
Before sampling begins, the specimen’s identifying data, such as its herbarium ID, should be recorded with great care, and double-checked on both the sample label and typed list of specimens. Additionally, the museum may require that vouchers are annotated with the sampling date, tissue type, sample identifier, and information about the researchers. The voucher, including any labels, should be photographed, ideally before and after sampling. Digital links between herbarium vouchers, imaging, and DNA sequences are very useful; they can be included in herbarium and nucleotide databases.
For desiccated leaves, the most commonly sampled tissue, the process is usually straightforward. Using forceps and a scalpel or scissors, one can make a precise cut and remove 1 cm2 or less of tissue. Generally, between 2 and 10 mg of dry leaf tissue is sufficient for the isolation of complex mixtures of genomic DNA fragments. It is preferable to target leaves of lesser value, for example those that are damaged, folded, or hidden, avoiding possible contamination by mould, lichen, or fungi. The sampling of detached “pocket” material should be conducted with caution, and only if the researcher and curator are confident that the detached material truly belongs to the voucher. For other tissue types, such as wood, researchers may need to develop tailored sampling methods on contemporary material first. After sampling, material should immediately be sealed in a labelled tube or envelope and packaged for transport.
Potential contamination of the sample, specimen, or wider collection with exogenous DNA is an important consideration. For most museum collections, there will inevitably already be surface DNA contamination of specimens. Ask the curator about adhesives (e.g., wheat starch) and preservatives that were used with the specimen of interest. Curatorial staff and other users of the collections may not routinely wear gloves or, if they do, may not change them between specimens. In most cases, there is unlikely to be any benefit from the person undertaking sampling wearing protective equipment (e.g., face masks, hair nets) that is beyond that normally used by users of the collection. Contamination control is only as good as the weakest link.
Extra precautions may be taken for equipment that is used directly in the sampling process, for example, disposable scalpels that are changed between samples, or wiping of scalpel blades with bleach and ethanol. This will reduce the risk of cross-contamination between specimens. Further precautions may be beneficial if internal tissue is being sampled (e.g., inside a seed). In these cases, surface decontamination (see section below on pre-processing) followed by sampling with DNA-free equipment and while wearing personal protective equipment may be appropriate. In some cases where specialist equipment such as a microdrill is required, it may be beneficial for sampling to be undertaken within an ancient DNA laboratory, where contamination controls can be better implemented. However, bringing large amounts of plant material into the laboratory should be limited, as it is an additional contamination source.
Contamination of specimens and collections by ‘modern’ DNA and especially amplified DNA is perhaps the greatest risk, potentially compromising future research. Researchers are likely to have been using molecular laboratories, and steps should be taken to prevent the inadvertent transfer of modern DNA to museum collections. These precautions can include not visiting a collection directly from a modern laboratory, cleaning items that must move between modern laboratories and collections (e.g., clothes, phones, cameras), and using sampling equipment (scalpels, tubes, pens) that has not been taken from a modern laboratory.
Before starting any experiments with historical and ancient plant samples, it is important to recognize challenges arising from the degraded nature of aDNA. Unlike DNA isolated from fresh samples, DNA from preserved specimens is fragmented, damaged, and contaminated post mortem (
aDNA is also affected by “damage”, post mortem substitutions that convert cytosine to uracil residues through deamination (uracils are read by insensitive DNA polymerases as thymine, hence the commonly used term “C-to-T substitutions”) (
Finally, it is important to recognize that aDNA from plants is in fact a mixture of bona fide endogenous DNA, exogenous DNA introduced pre mortem, (e.g., from endophytic microbes), and exogenous DNA introduced post mortem (e.g., from microbes involved in decomposition, human-associated collection and museum practices; see above) (
Selected examples of successfully isolated and sequenced DNA from plant material. *BP: before present.

Species | Tissue | Age BP* | Endogenous DNA | Fragment length (bp) | Damage at 5’ end | Source |
---|---|---|---|---|---|---|
Thale cress (Arabidopsis thaliana) | Leaf | 184 | 83% | ~62 | 0.026 | |
Potato (Solanum tuberosum) | Leaf | 361 | 87% | ~45 | 0.047 | |
Maize (Zea mays) | Cobs | 1863 | 80% | ~52 | 0.052 | |
Wheat (Triticum durum) | Chaff | 3150 | 40% | ~53 | 0.095 | |
Barley (Hordeum vulgare) | Seeds | 4988 | 86% | ~49 | 0.138 | |
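Per-position C-to-T rates such as the “Damage at 5’ end” values above are estimated from reads aligned to a reference: at each position from the fragment end, one counts how often a reference cytosine was sequenced as thymine. A minimal sketch with toy, invented alignments (real pipelines such as mapDamage operate on BAM files, not strings):

```python
def ct_damage_rate(alignments, position=0):
    """Fraction of reference cytosines read as thymine at a given
    position from the 5' end, across aligned (reference, read) pairs.

    Deamination converts cytosine to uracil, which is sequenced as
    thymine, so elevated C-to-T rates at fragment ends are a hallmark
    of ancient DNA.
    """
    c_total = 0
    c_to_t = 0
    for ref, read in alignments:
        if len(ref) <= position or len(read) <= position:
            continue
        if ref[position] == "C":
            c_total += 1
            if read[position] == "T":
                c_to_t += 1
    return c_to_t / c_total if c_total else 0.0

# Toy aligned pairs (reference fragment, sequenced read)
pairs = [
    ("CGGA", "TGGA"),  # damaged: leading C read as T
    ("CTTA", "CTTA"),  # undamaged
    ("AGGT", "AGGT"),  # no C at position 0, not counted
    ("CATG", "TATG"),  # damaged
]
rate = ct_damage_rate(pairs)   # 2 of 3 leading C's read as T
```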
Given the characteristics of aDNA (
The isolation and pre-amplification manipulation of aDNA should be carried out in a dedicated laboratory that is physically separated from labs where post-amplification steps are carried out. Ideally the aDNA laboratory should be supplied with HEPA-filtered air under positive pressure. Users should not move from a ‘modern’ laboratory (where amplified DNA is handled) to the aDNA laboratory on the same day. Reagents and materials in an aDNA lab should be DNA-free, disposable where possible, and never taken out of the clean lab. Surfaces should be cleaned before and after every experiment with 3–10% bleach, 70% ethanol, and overnight UV-C irradiation. To minimise contamination and ensure a DNA-free laboratory environment, users should wear full body suits, foot protectors, slippers, facemasks, sleeves, and double gloves (
Material preparation is an essential step before DNA can be isolated. Optional pre-processing of dirty samples can be done by gently cleaning the surface with a very low concentration (~3%) of bleach, and rinsing twice with ddH2O (
DNA extraction methods for different tissues should be considered. While plant materials tend to contain inhibitory substances like polyphenols, proteins, and polysaccharides, ancient plant materials can additionally be rich in humic acids and salts. This set of macromolecules might prevent successful DNA amplification (
Here we will cover the basics of recovering the highest quality of DNA from ancient plant tissues. Using a two-day extraction protocol will greatly increase the recovery of endogenous DNA. The first day consists of grinding the plant material. Tissue can be disrupted by: grinding dry, grinding flash-frozen, or grinding material soaked in lysis buffer. In all cases, grinding to finer particles increases the recovery of aDNA. Ground tissue is incubated in a fresh lysis buffer. Three commonly used buffers include CTAB (
In contrast to primed amplification approaches, even low amounts of isolated DNA can be used for genomic library preparation (
To help decide which sampled material is most promising for further DNA analyses it is necessary to obtain good estimates for fragmentation, damage, and contamination. This can be achieved through sequencing genomic libraries in low-throughput mode (about 10,000 DNA reads per sample), commonly referred to as “screening”, and bioinformatic analyses that produce relevant summary statistics. Promising samples will contain aDNA with a median fragment length over 50 bp and endogenous content over 0.2. For samples of particular interest, mapping accuracy for short aDNA reads can be improved with specialised procedures (de Filippo et al. 2018), and endogenous content can be increased by targeted enrichment on hybridization arrays (
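The screening thresholds just described (median fragment length over 50 bp, endogenous content over 0.2) can be applied programmatically to the summary statistics. A small sketch; the sample names and values below are hypothetical:

```python
def passes_screening(median_fragment_bp, endogenous_fraction,
                     min_length=50, min_endogenous=0.2):
    """Screening rule described in the text: median fragment length
    over 50 bp and endogenous DNA content over 0.2."""
    return (median_fragment_bp > min_length
            and endogenous_fraction > min_endogenous)

# Hypothetical screening output: (sample, median fragment bp, endogenous fraction)
samples = [
    ("herbarium_leaf_1884", 62, 0.83),
    ("charred_seed_950BP", 38, 0.05),
    ("dry_cob_1863BP", 52, 0.80),
]
promising = [name for name, frag, endo in samples
             if passes_screening(frag, endo)]
```

Samples failing either threshold would typically be excluded from deeper sequencing, or rescued via targeted enrichment where justified.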
Characterising DNA fragmentation and damage is very useful for authentication and establishing historical provenance of degraded plant samples. DNA degradation advances with time (
Chapter 2 Infographic: Overview of sampling and obtaining DNA from museum collections. A team effort of communities, curators, and researchers. (1) Collection of botanical material should involve detailed consideration of its ethical and legal aspects, and consultation of source communities in advance, in accordance with the CITES, CBD, and Nagoya legal and ethical frameworks. (2) Curated botanical samples can be found in different types of museums, including botanic gardens, ethnobotany, and anthropological collections. The next step is to find relevant specimens, preferably with rich metadata, e.g., species identification, collection place, and date. (3) Once the specimens have been identified, they should undergo molecular analyses in clean facilities, where they will be pre-processed according to their traits, avoiding contamination with other samples, “modern” specimens, and amplicons. Then, it is crucial to identify which samples passed and which failed quality controls for endogenous DNA. Finally, the data produced should be linked to their respective vouchers and made available in public repositories like NCBI and BOLD.
Library-based methods assist with the responsible use of collections, as they preserve the total (non-selective) DNA and ‘immortalise’ it for future use. Immortalisation only has value if the DNA that has been amplified is truly historical/ancient and devoid of contemporary contamination and hence all the aforementioned precautions are necessary when working with aDNA. We recommend that extracts or library builds are precisely annotated with the methods used and are properly archived.
aDNA – Ancient DNA, DNA that exhibits biochemical characteristics typical for DNA from old degraded material, i.e., damage and fragmentation, regardless of age.
Artefact – An object made by humans that is of historical or cultural importance, examples include: clothing, ornaments, utensils.
Authentication – Bioinformatic analyses that quantify damage and fragmentation of sequenced DNA to help rule out that DNA is derived from contemporary contamination.
Collection – Repository of curated biological material arranged in a systematic fashion.
Contamination – Introduction of alien tissue or DNA to a specimen or DNA isolate, examples include: microbial colonisation, human epithelium, plant-based foods, etc.
Curator – Custodian of a collection with expert knowledge about specimens, their organisation, and preservation.
Destructive sampling – Permanent removal of a fragment of a specimen of any size that will be irretrievable after biochemical characterization.
DNA damage – Typically conversion of cytosine to uracil in DNA through deamination, which accumulates with time. During sequencing, uracil is replaced with thymine, hence the common synonym, C-to-T substitutions.
Endogenous DNA – Authentic DNA from targeted individuals of a species, in contrast to exogenous DNA from associated microbes and contemporary plant and human DNA contamination.
Fragmentation – Breaks in the DNA backbone, most frequently caused by depurination, leading to shorter DNA fragments with time.
Immortalization – Molecular manipulation of DNA, for example the attachment of DNA adapters, that allows infinite re-amplification of the original DNA from a biological specimen.
Type specimen – A preserved individual plant bearing the defining features of its taxon, used for the first taxonomic description of a species. This permanent feature-specimen link is recognized in a publication.
Voucher – A preserved botanical specimen kept in a permanent collection and cited in a research project. Vouchers will have been expertly identified and are usually annotated with collection time, place, and collector details.
The first studies conducted on DNA obtained from water samples were published in the 1990s. Cloning techniques were commonly used to investigate novel genes and functions of environmental communities at that time. Stein et al. (
Conventionally, biomonitoring of freshwater and marine environments is based on direct observation of indicator taxa to compute biotic metrics/indices. This can be time and labour intensive (
The main advantage of water is the ease of sample collection compared to other aquatic sample types such as sediments or biofilms, as these substrates usually require more sophisticated tools and longer sampling times (
Detecting DNA in water samples obtained from aquatic environments can be challenging because it is usually present at low concentrations with an uneven spatial distribution (
Natural processes influencing the composition and quantity of detectable DNA in a water sample can be categorised into 1) shedding of biological material from source organisms, 2) degradation, 3) transport across the water column, and 4) retention and resuspension (
Senescence in aquatic plants releases free cells into the water column that will eventually break down into organic compounds, including DNA. However, degradation in many cells begins via apoptosis before shedding. Apoptosis involves the shrinkage of the cell and its nucleus in a programmed way, in contrast to necrosis, which is uncontrolled cell death due to loss of osmotic control typically by swelling and bursting (
DNA is a highly stable molecule at neutral pH and moderate temperatures. However, there are several abiotic factors that directly and indirectly influence its stability in aquatic environments (
Hydrological characteristics of the water body are also critical to consider when inferring species presence and distribution. DNA can bind to particles of varying size in aquatic environments (less than 0.2 µm to greater than 180 µm) and this particle association is one of many parameters that affect DNA transport and diffusion (
Considering the higher dilution and the effects of currents and waves in marine waters, DNA is generally less concentrated and more quickly dispersed compared to freshwater ecosystems (
In rivers and streaming waters, the probability of DNA detection is strongly correlated with downstream transportation rates. Retention, rather than degradation, appears to be a more important factor that limits the transport of DNA in streaming waters (
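The degradation and dilution processes described above are often approximated with first-order decay. The sketch below computes a rough detection window under that assumption; the starting concentration, decay constant, and detection limit are hypothetical placeholders, not empirical values:

```python
import math

def edna_concentration(c0, k, t):
    """First-order decay: C(t) = C0 * exp(-k * t)."""
    return c0 * math.exp(-k * t)

def detection_window(c0, k, limit):
    """Time (in the units of 1/k) until concentration falls to the
    assay's detection limit, assuming first-order decay."""
    return math.log(c0 / limit) / k

# Hypothetical values: 1000 copies/L initially, decay rate 0.05 per
# hour, assay detection limit of 10 copies/L
hours = detection_window(1000, 0.05, 10)   # ~92 hours
```

In practice k varies strongly with temperature, UV exposure, pH, and microbial activity, so such a calculation only gives an order-of-magnitude expectation for survey design.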
Conventional sampling techniques often require a lot of time and effort for detecting indicator, rare, or invasive species. Keeping the target organism alive or intact might also be an important consideration in such cases. Detection of species via nucleic acids collected from environmental samples (eDNA/eRNA) is a relatively new approach that emerged in the last five years (Anglès d’Auriac et al. 2019). These methods offer a non-destructive and efficient complementary approach for the detection of aquatic organisms. They rely on reference sequences and the amount of available data varies among taxonomic groups and countries (Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding). For example, aquatic vascular plants used in biomonitoring are well represented in public databases (BOLD, GenBank), while this is hard to achieve for diatoms due to large proportions of undescribed species and the problems with cultivation of monoclonal cultures (
Although DNA from plant communities have been detected from environmental samples as parts of larger surveys (e.g., within coral reefs), biodiversity studies targeting a large number of plant species are still rare, possibly owing to issues with universal amplification and discriminatory power of single or multiple gene surveys in plants (
An important application for DNA-based methods is the quantification of species abundance and biomass since there are several environmental applications that rely on this information. Depending on the specific aim of the study, this information can be obtained at varying degrees of efficiency and reliability. Approaches employing species-specific methods are more suitable for abundance or biomass estimations (e.g., qPCR, ddPCR). However, they require a priori knowledge of the target group and are limited to already described species. On the other hand, high-throughput approaches can identify species that are rare or have low biomass (e.g., metabarcoding, metagenomics), but they suffer from biases introduced by downstream steps such as PCR amplification, sequencing (Chapter 9 Sequencing platforms and data types), availability of reference sequences, and even the bioinformatics analyses (Chapter 18 Sequence to species) (
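Species-specific quantification with qPCR, as mentioned above, relies on a standard curve relating Cq values to known starting quantities. A minimal sketch of the standard back-calculation, using a hypothetical curve (slope and intercept invented for illustration):

```python
def quantity_from_cq(cq, slope, intercept):
    """Back-calculate starting copy number from a qPCR Cq value using
    a standard curve of the form Cq = slope * log10(quantity) + intercept."""
    return 10 ** ((cq - intercept) / slope)

def amplification_efficiency(slope):
    """PCR efficiency from the standard-curve slope; a slope of about
    -3.32 corresponds to ~100% efficiency (doubling each cycle)."""
    return 10 ** (-1.0 / slope) - 1.0

# Hypothetical standard curve: slope -3.32, intercept 38
copies = quantity_from_cq(28.0, -3.32, 38.0)   # ~1000 starting copies
eff = amplification_efficiency(-3.32)          # ~1.0, i.e., ~100%
```

The same inhibition and reference-database biases discussed in the text affect these estimates, so efficiency should be verified per assay and sample matrix.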
Although molecular methods for species detection have been used as a tool for biodiversity management for more than a decade, only 2% of the available studies have focused on plants (
Recent studies that detect plant species in aquatic ecosystems via eDNA are mainly about methodological adjustments (
There are three main steps in a field study for the collection of aqueous eDNA: water collection, transportation, and filtering. In designing sampling strategies for species identification from water samples, there are many factors to consider. These include, but are not limited to, the field conditions, the distance between sampling point and laboratory, the amount of water that is required, and the morphology and life cycle of the target organism (
After the selection of the sampling location, the next step is to decide on the transportation strategy. Water samples can either be directly transported to the laboratory or filtered in the field. If direct transportation is the chosen method, the samples are usually collected with sterilised glass or plastic bottles or disposable plastic tubes. After that, DNA in the water samples can be captured by filtration or ethanol precipitation in the laboratory. This method reduces the effort and time spent in the field, and allows researchers to perform additional analyses on the water samples or store subsamples for further processing (
Precipitation using ethanol or isopropanol can be used for capturing DNA after water collection, but filtration is the more widely used method (
Pore sizes of filters used in eDNA studies range from 0.22 µm to 60 µm (
The type of filter is one of the most important decisions to be made when designing the sampling strategy. Filters can be classified as open or encapsulated/cartridge filters (
Contamination of samples and the degradation of DNA are two critical processes that should be avoided as much as possible from water collection in the field to DNA isolation in the lab (
Choosing the correct DNA extraction protocol can be crucial in ensuring that the effect of PCR inhibitors in water samples will be minimised. The chemical and physical characteristics of samples can vary considerably, and therefore the quantity and purity of isolated DNA also vary (
Chapter 3 Infographic: Summary of steps from field collection of water samples to DNA extraction in the laboratory. (1) Open or closed (encapsulated/cartridge) filters can be used for filtering water samples on-site. Large filters (e.g., plankton net with 60 μm pore size) are preferred for filtering larger volumes of water, while small pore size filters can usually process a few litres. Closed filters offer the advantage of preventing contamination; therefore, they are more commonly used for on-site filtration. (2) Degradation is another important issue that should be prevented until DNA extraction. Water or filter samples can either be preserved in a chemical buffer or transported in cold and dark conditions to the laboratory for further processing. (3) Plant DNA in water samples can be captured by filtration or precipitation. When using filtration, samples are usually incubated in a lysis solution to extract DNA, while in precipitation samples are mixed with ethanol and DNA is collected in the pellet. Commercial DNA isolation kits specifically designed for environmental sample types are commonly used with some small modifications.
DNA isolated from water samples can be used for several downstream applications based on the specific aim of the study or survey. Currently, qPCR methods are the most commonly used method for detecting specific target taxa in water samples, while metabarcoding is used for community analyses (Chapter 11 Metabarcoding). The studies comparing the efficiency of these DNA methods with more conventional methods show varying results. For some species or taxa, DNA-based detection methods appear to outperform more conventional methods (
Apoptosis – Controlled cell death which involves cell shrinkage, nuclear fragmentation, chromatin condensation, and chromosomal DNA fragmentation.
Biofilm – A consortium of microorganisms where cells stick to each other and often also to a surface.
Dimictic lake – A body of freshwater whose difference in temperature between surface and bottom layers becomes negligible twice per year.
Extracellular nucleases – Enzymes that can work outside of the cell and are capable of cleaving the phosphodiester bonds between nucleotides of nucleic acids.
Mesocosm – Any outdoor experimental system that simulates the natural environment under controlled conditions.
Necrosis – Uncontrolled cell death due to the loss of osmotic control typically by swelling and bursting.
PCR inhibitors – Any factor which prevents the amplification of nucleic acids through the polymerase chain reaction.
Primer – A short single-stranded nucleic acid sequence used by all living organisms in the initiation of DNA synthesis.
qPCR (Quantitative PCR) – An extension of the PCR technique which allows estimation of the initial quantity of nucleic acids in a biological sample.
Senescence – The gradual deterioration of functional characteristics with ageing (can be used both for organismal or cellular ageing).
Thermal stratification – The phenomenon in which lakes develop two discrete layers of water of different temperatures; warm on top (epilimnion) and cold below (hypolimnion).
Vector (i.e., cloning vector) – A small piece of DNA that can be stably maintained in an organism and into which a foreign DNA fragment can be inserted for cloning purposes.
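The qPCR entry above can be made concrete with a short calculation. Quantification relies on a standard curve: Cq values measured for a dilution series of known target quantities are fitted to a line in log space, and the starting quantity of an unknown sample is then read off that line. A minimal sketch in Python (the dilution series and Cq values below are invented for illustration):

```python
import math

def fit_standard_curve(log10_quantities, cq_values):
    """Least-squares fit of Cq = slope * log10(quantity) + intercept."""
    n = len(log10_quantities)
    mean_x = sum(log10_quantities) / n
    mean_y = sum(cq_values) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantities, cq_values))
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_quantity(cq, slope, intercept):
    """Invert the standard curve to estimate the starting quantity."""
    return 10 ** ((cq - intercept) / slope)

def amplification_efficiency(slope):
    """E = 10^(-1/slope) - 1; a slope near -3.32 means ~100% efficiency."""
    return 10 ** (-1.0 / slope) - 1.0

# Hypothetical 10-fold dilution series (copies per reaction) and measured Cq values.
dilutions = [1e6, 1e5, 1e4, 1e3, 1e2]
cqs = [15.1, 18.4, 21.7, 25.0, 28.3]
slope, intercept = fit_standard_curve([math.log10(q) for q in dilutions], cqs)
```

A slope near -3.32 corresponds to a doubling of product each cycle, i.e., about 100% amplification efficiency; real assays also report R² of the fit and use technical replicates.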
The natural presence of any plant entails the existence of a substrate where it can anchor itself and absorb nutrients for its development and survival (
Since the first isolation of DNA from soil bacteria, soil eDNA has gained attention for the assessment of terrestrial environments for several reasons: soil is virtually everywhere; it is easy to collect and transport; it harbors signals from aboveground and belowground biota, including both active and dormant cells; and its collection is non-invasive (
Further, studies may also refer to bulk soil DNA when using soil samples to identify unknown communities, especially in forensic contexts (
Molecular (plant) identification using soil or sediment eDNA relies on isolating DNA traces from roots, debris, seeds, and pollen (
The state of DNA in the soil is subject to intrinsic and extrinsic DNA properties related to the origins of the DNA as well as factors influencing its decay (
iDNA persists due to protection from the cell wall and membranes against abiotic processes. Cells are more likely to remain intact in the ground if there is decreased enzymatic activity as a result of rapid soil desiccation, low temperatures, or extreme pH values (
Plant eDNA bound to soil particles can originate from multiple taxa and multiple vegetative parts, each one with particular mechanisms to bind, persist and degrade in soil substrates. Plant DNA persistence within soil allows us to harvest its botanical memory for identifying vegetation through time. Indeed, comparisons of plant identifications through both visual vegetation surveys and soil eDNA assessments have shed light on the temporal signals stored in top soils. In boreal areas, plant identification through soil eDNA signal mostly registered contemporary vegetation (
The flora and study area are key in any study to ensure sound conclusions. Below you will find considerations that can help you to answer common questions when designing field and wet lab experiments.
Soil sampling can be done by scooping out the soil, drilling down a tube (e.g., a 50 ml falcon tube), or with a soil core sampler. We recommend using sampling protocols specifically validated in an environment similar to your study site, e.g., woodlands, grasslands, meadows, or boreal, temperate, and tropical forests (
Obtaining clean DNA samples as well as avoiding cross contamination is challenging when sampling soil eDNA. Collection instruments should therefore be decontaminated between each sample (e.g., flaming, chlorine cleaning), gloves and masks should be worn and changed regularly to avoid introduction of DNA, and samples should be stored in separate plastic bags. In order to stop (or greatly reduce) enzymatic activity, samples should be stored cold or frozen, preferably at -20 °C, if immediate sample processing is not possible (
DNA extraction is a key bottleneck when capturing molecular data, and protocols need to be tailored to both the study area and the question(s). At a minimum, you need to decide which fraction of the total soil eDNA (iDNA or exDNA) you want to isolate to answer your research question. In general, isolating exDNA is preferred when targeting non-microorganisms and avoiding diversity patterns across short temporal scales (
If (meta)barcoding is used for identification, there are three desired features for a barcode in any study: sufficient polymorphism for identification at the desired taxonomic resolution, conserved primer binding sites for universal amplification, and available reference sequences for the target organism. In many cases, not all features can be met. You may therefore need to decide on which features are most important for your research question. For more general information about choosing suitable markers and available reference databases, see Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding. Soil eDNA studies targeting plants have used markers found in chloroplast DNA (trnL P6 loop, matK, rbcL) and in ribosomal DNA (ITS2;
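One of the three barcode features above, conserved primer binding sites, can be screened in silico by counting mismatches between a candidate primer and each candidate reference sequence while honouring IUPAC ambiguity codes. A minimal sketch of such a screen (the primer and reference sequences below are invented, not real barcode data):

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they cover.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatches(primer, site):
    """Count positions where the binding site base is not covered by the
    primer's (possibly degenerate) base. Assumes equal lengths, no indels."""
    return sum(base not in IUPAC[p] for p, base in zip(primer, site))

def screen_primer(primer, references, max_mismatch=2):
    """Slide the primer along each reference, keep the best-matching site,
    and report which references would likely amplify."""
    results = {}
    for name, seq in references.items():
        best = min(
            mismatches(primer, seq[i:i + len(primer)])
            for i in range(len(seq) - len(primer) + 1)
        )
        results[name] = best <= max_mismatch
    return results

# Toy reference sequences (invented for illustration only).
refs = {
    "species_A": "TTGGGCAATCCTGAGCCAAATCCTG",
    "species_B": "TTGGGCAGTCCTGAGCCAAATCCTG",
    "species_C": "AAAAAAAAAAAAAAAAAAAAAAAAA",
}
hits = screen_primer("GGGCAATCCTGAGCCAA", refs)
```

In practice this check is run with dedicated tools against large curated reference sets, but the logic is the same: a universal primer should match nearly all target taxa with few mismatches.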
Bioturbation – The reworking and mixing of soil and sediment by living organisms, which can redistribute particles (and the DNA bound to them) across soil layers.
DNA degradation – Refers to the physical changes of the DNA molecule.
DNA decay – Refers to the reduction in detectable quantity of eDNA.
DNA persistence – Refers to the amount of DNA that remains detectable across time.
DNA polymorphism – Presence of two or more variants of a particular DNA sequence.
Horizon – A layer parallel to the soil surface whose physical, chemical and biological characteristics differ from the layers above and beneath.
Power analysis – An analysis of statistical power: the probability of detecting an effect, given that the effect really exists. It can also be seen as the probability of correctly rejecting the null hypothesis when it is in fact false.
Pedogenesis – The process of soil formation as regulated by the effects of place, environment, and history.
Rarefaction curves (in ecology) – A technique to assess species richness given the number of samples collected.
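The rarefaction entry above can be illustrated in a few lines of code: species richness is estimated by repeatedly drawing random subsamples of increasing size from the observed identifications and counting the unique taxa in each draw. A toy sketch with invented abundances:

```python
import random

def rarefaction_curve(observations, step=10, replicates=100, seed=42):
    """Mean number of unique taxa in random subsamples of increasing size."""
    rng = random.Random(seed)
    curve = []
    for n in range(step, len(observations) + 1, step):
        richness = [
            len(set(rng.sample(observations, n))) for _ in range(replicates)
        ]
        curve.append((n, sum(richness) / replicates))
    return curve

# Toy community: 5 taxa with uneven abundances (invented counts).
community = [
    "taxon_%d" % i
    for i, count in enumerate([50, 25, 15, 7, 3])
    for _ in range(count)
]
curve = rarefaction_curve(community)
```

A curve that flattens out suggests the sampling effort has captured most of the richness; a still-rising curve suggests more samples are needed.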
To identify pollen, spores, and other plant-related microremains, the field of palynology has traditionally relied on microscope-based analyses. This is a time-consuming process that requires highly trained specialists. Additionally, pollen grains from many plant families are morphologically indistinguishable using light microscopy (
These challenges highlight the necessity for innovative methods within the field of palynology, to increase both the speed and accuracy of pollen identifications. DNA-based methods for the molecular identification of pollen grains have the potential to be of complementary value. However, the extraction of DNA from pollen is non-trivial. This chapter therefore focuses on how DNA can be extracted from pollen, the common problems encountered, and the qualitative and quantitative molecular possibilities for analyses.
Using pollen grain DNA for identification has shown promising results in a number of applications, including the study of provenance and authentication of honey (
Collecting pollen for DNA analysis is mostly similar to collecting pollen for microscopic analysis, though more care should be taken to avoid contamination from other potential sources of DNA. This is because pollen generally contains low quantities of DNA and is therefore prone to contamination. Pollen grains can either be collected directly from the environment (air, water, soil, etc.) or from pollinators (pollen baskets, honey). Pollen collected from the environment will most often (though not always) be derived from anemophilous (wind pollinated) plants, while pollinators collect the majority of pollen from so-called entomophilous (insect pollinated) plants. Pollinators may, however, also have anemophilous pollen accidentally sticking to their bodies. For studies looking at pollen from pollinators, either all pollen grains on the animal’s body are collected by washing off the pollen or, when present, only the corbicular pollen baskets are collected (
Pollen grains can be referred to as “natural plastic”: they have a very hard outer cell wall called an exine, which is made of sporopollenin (
Overview of selected studies since 2017 that have used molecular techniques to identify pollen, including the aim, strategy for pollen lysis, extraction method, number of PCR cycles, sequencing method, and marker choice.
| Study | Aim | Pollen lysis step | Extraction method | PCR cycles | Sequencing method | Markers |
|---|---|---|---|---|---|---|
| | Airborne pollen identification | Bead beating (one 5 mm stainless steel bead), two 1-min cycles at 30 Hz | DNeasy Plant Mini Kit (Qiagen) and Nucleomag kit (Macherey-Nagel) | 30 | Sanger sequencing | trnL |
| | Pollen quantification | Bead beating (mix of 0.5 and 1 mm silica beads), 2 min | Wizard (Promega) | N/A | Genome skimming | N/A |
| | Pollen quantification | Bead beating (mini-bead beater), 3 min | FastDNA SPIN Kit for Soil (MP Biomedicals) | 30 | Metabarcoding | nrITS2, rbcL |
| | Pollen quantification | Bead beating (five 1 mm stainless steel beads), 2 min at 22.5 Hz | Adapted CTAB | N/A | Genome skimming | N/A |
| | Plant-pollinator interactions over time | Bead beating (one 3 mm stainless steel bead + lysis buffer), 2 min at 25 Hz | QIAamp DNA Micro Kit and DNeasy Plant Mini Kit (Qiagen), Nucleospin DNA Trace Kit (Macherey-Nagel) | 30 | Metabarcoding | nrITS1, nrITS2, rbcL |
| | Airborne pollen identification | Bead beating (3 mm tungsten beads), 4 min at 30 Hz | DNeasy Plant Mini Kit (Qiagen) | 35 | Metabarcoding | nrITS2, rbcL |
| | Bee pollen diet | Bead beating (3.355 mg 0.7 mm zirconia beads), 5 min | DNeasy Plant Mini Kit (Qiagen) | Three steps (55 cycles in total) | Metabarcoding | nrITS2, rbcL, trnL, trnH |
| | Insect migration analysis | Bead beating (five zirconium beads), 1 min at 30 Hz | No extraction; Phire Plant Direct Polymerase | Two steps (32 cycles in total) | Metabarcoding | nrITS2 |
| | Pollen quantification | CF lysis buffer (Nucleospin Food Kit) | DNeasy Plant Mini Kit (Qiagen) | 25, 30, 35 | Metabarcoding | nrITS1, trnL |
| | Airborne pollen identification | Bead beating (0.2 g 425–600 μm glass beads + lysis buffer), two 1-min cycles (3450 oscillations/min) | Adapted CTAB | 40 | Metabarcoding | rbcL |
| | Bee pollen diet | Bead beating (150 g mix of 1.4 mm ceramic and 3 mm tungsten beads + lysis buffer), two 45-second cycles at 6.5 m/s | DNeasy Plant Mini Kit (Qiagen) | 37 | Metabarcoding | nrITS2 |
It should be noted that other methods for DNA extraction from pollen exist in which the pollen grains are not destroyed, and in some specific cases, excluding the bead-beating step has even given better results (
Several commercially available DNA extraction protocols have been used for DNA extraction from pollen grains after the lysis step. Table
The quality of the DNA that can be extracted from pollen samples is critical for any molecular identification method, particularly when working with very small amounts of DNA. Avoiding contamination is therefore essential: work in a clean lab, keep windows closed, use sterilised tools in a laminar flow cabinet, and keep the DNA extraction lab separate from the post-PCR environment.
Molecular methods can contribute to the analysis of pollen both by identifying which species are present (qualitative) as well as by giving a measure of the abundance of different pollen species (quantification). While DNA metabarcoding methods are currently most often used (Table
Species-level resolution in pollen grain identification is critical for studies that aim to answer specific research questions, such as: which particular flower species does a common carder bee prefer? Which grass species is responsible for most of the pollen in the ambient air in early May? Species-specific markers and qPCR techniques can be used to identify particular species within a mixture of different pollen types (see Chapter 10 DNA barcoding). One study used custom-made primers for the nuclear Internal Transcribed Spacer (nrITS) to differentiate between mugwort (Artemisia vulgaris) and ragweed (Ambrosia artemisiifolia), two notoriously allergenic species from the Asteraceae family (
DNA barcoding can be used to target specific species, yet it is rare that a pollen sample contains only a single pollen species. DNA metabarcoding is therefore the most-often used method for the molecular identification of the different species of pollen grains from mixed samples (see Chapter 11 Amplicon metabarcoding). Both nuclear and chloroplast DNA can be amplified in pollen DNA (
Chapter 5 Infographic: Overview of pollen sources, DNA extraction, and downstream analytical methods for the molecular identification of plants from pollen DNA.
While research into targeting different barcoding regions and primers is ongoing (trnT-F;
It is important to use positive controls with known concentrations of different pollen species in any DNA metabarcoding study. This is because the amount of DNA that can be extracted from different pollen types has been shown to vary. For example, it can be easier to extract DNA from pollen with a thinner exine and from plant species that are richer in chloroplast DNA than from those having a more ‘sturdy’ exine (
Beyond identifying which pollen species are present in a particular sample, pollen grain quantification is equally important. For hay fever forecasts, for example, it is important to know not just whether certain allergenic pollen are in the air, but also how many pollen grains there are at a given point in time. The gold standard in palynology has been to count a certain number of pollen grains under the microscope (e.g., 200 to 500) to obtain a semi-quantitative measure of the pollen types in a sample. While DNA-based methods for pollen quantification are less developed than DNA-based methods for identification, DNA-based pollen quantification using metagenomics (reviewed in Chapter 12 Metagenomics) seems feasible, while there is still strong debate about using DNA metabarcoding reads for this purpose.
In a recent study on the use of DNA to quantify pollen grains, Bell and colleagues found a very weak correlation between pollen counts recorded by palynologists and the proportion of metabarcoding reads (
Other research groups, however, have reported more promising results when using DNA metabarcoding to quantify pollen grains.
Since using DNA metabarcoding approaches for pollen abundance may not give quantitative results with complex, multi-species samples, other molecular methods such as genome skimming and shotgun sequencing are being used to circumvent some of the drawbacks. The major advantage of these two methods is that they do not include a PCR-step and therefore do not introduce amplification bias (see Chapter 12 Metagenomics). Genome skimming has already been used to show that quantification is feasible, even for pollen from species that are very rare in mock mixtures (
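As a sketch of how such count-versus-read comparisons are made, the snippet below computes a Spearman rank correlation between microscope grain counts and metabarcoding read proportions for one hypothetical sample (all numbers are invented; real studies compare many samples and use dedicated statistics packages):

```python
def rank(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Invented example: microscope counts vs. read proportions for five pollen taxa.
grain_counts = [210, 80, 40, 15, 5]
read_props = [0.55, 0.10, 0.25, 0.08, 0.02]
rho = spearman(grain_counts, read_props)
```

A rank-based correlation is often preferred here because amplification bias distorts the absolute read proportions more than the ordering of abundant versus rare taxa.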
Anemophilous – Wind-pollinated.
Bead beating – The mechanical disruption of samples by rapid agitation with small beads, used here to break open the tough outer cell wall of pollen grains.
Hirst-type pollen trap – Volumetric air sampler that is one of the standard devices for monitoring airborne pollen and spores.
cpDNA – Chloroplast DNA.
Entomophilous – Insect-pollinated.
Exine – Outer wall of pollen grains. Composed mainly of sporopollenin that is extremely resistant to degradation. The exine of pollen grains has to be broken to release the DNA from the organic material within the grains.
Palynology – The science that studies both living and fossil spores, pollen grains, and other microscopic structures (e.g., chironomids, dinocysts, acritarchs, chitinozoans, scolecodonts).
Pollen grains – The male gametophyte of seed plants; source and carrier for the male gametes (spermatozoids or sperm cells).
Pollenkitt – The outermost hydrophobic lipid layer mostly present on entomophilous pollen grains.
Sporopollenin – A chemically inert biological polymer that is a component of the outer wall (see Exine) of a pollen grain.
Super-resolution microscopy – Technique in optical microscopy that allows visualisation of structures at resolutions down to about 140 nm, finer than the diffraction limit allows. This technique also allows visualisation of internal structures.
DNA-based methods for the molecular identification of plant products can help us to address food and medicine authenticity issues at each stage in the supply chain (Di Bernardo et al. 2007). Documentation and requirements for DNA-based detection methods for food authentication are defined in collaborative activities by the European Committee for Standardization (CEN) and the International Organization for Standardization (ISO). Both rapid and accurate identification of plant products are crucial for the herbal drug industry (
The majority of standardised DNA-based authentication methods for the inspection and regulation of food and plant-medicines use well-established PCR-based techniques for DNA amplification as these are sensitive, specific, and simple (
DNA barcoding methods are also established for the identification of unique medicinal and edible plant species (
High-throughput sequencing (HTS) methods such as amplicon metabarcoding are also powerful tools for the authentication of herbal end products, post-marketing control, pharmacovigilance, and the assessment of species composition in botanical medicines, such as in traditional Chinese medicines (TCMs) (
In addition to PCR-based techniques, the detection of single nucleotide polymorphisms (SNPs) is frequently used for the molecular identification and authentication of various food commodities using small DNA fragments (Di Bernardo et al. 2007;
DNA-based methods for molecular plant identification depend on well-curated nucleotide sequence repositories. In addition to GenBank (
Successful DNA extraction is the foundation for any further downstream analysis (
Four main factors that affect the efficacy of DNA isolation from food and medicine samples are the sample source and processing, collection and storage, homogenisation, and the presence of contaminants. Generally it is easier to extract high-quality DNA from fresh samples (
Removal of frequent contaminants that can reduce the yield of extracted DNA from edible and medicinal plants.
| What compounds define the chemical composition of your samples? | Proteins and RNA | Polysaccharides (starch, sugars) | Polyphenolics (plant secondary metabolites such as tannins, flavonoids, terpenoids, etc.) |
|---|---|---|---|
| Understand the specific properties of your samples for DNA extraction (depending on the age of the samples and how they were conserved) | Can co-purify with DNA | Can co-precipitate with DNA, giving the DNA pellet a sticky, viscous consistency after centrifugation; inhibit enzymes used in molecular techniques (restriction endonucleases, polymerases, and ligases); adhere to wells in agarose gels, resulting in long smears of bands | When bound to DNA, very hard to remove during extraction; result in contaminated pellets that are not usable for many downstream analyses |
| Consider applying mitigation strategies to overcome difficulties in extracting DNA from your samples | RNA can be removed with DNase-free RNase A or by ethanol precipitation using lithium chloride; proteins can be removed by (i) detergents (cetyltrimethylammonium bromide (CTAB), SDS) in the extraction buffer, (ii) protein denaturants, e.g., β-mercaptoethanol (BME), dithiothreitol (DTT), or (iii) enzymatic proteases, e.g., proteinase K | Removal via highly concentrated sodium chloride (NaCl) in extraction buffers, which increases their solubility in ethanol; a combination of NaCl and the cationic detergent CTAB; CTAB with differential precipitation | Binding compounds polyvinylpyrrolidone (PVP) or polyvinylpolypyrrolidone (PVPP) can be added to extraction buffers to absorb polyphenols before they bind to DNA; antioxidant compounds (BME, DTT, ascorbic acid, iso-ascorbate) in the buffer prevent oxidation and binding to DNA |
Although CTAB-based methods usually succeed in extracting DNA from plants and from processed food and medicine products, the yield is often quite low and the protocols are time consuming (
Several studies exist that compare commercial and in-house DNA isolation techniques for food and medicine (
Overview of different DNA extraction methods recommended for use with food by the European Union Reference Laboratory for GM Food and Feed (EU-RL GMFF).
| Plant source | Method of choice | Reference |
|---|---|---|
| Maize (seeds and grains) | CTAB precipitation (in-house) (Rogers and Bendich 1985). For isolation of genomic DNA from a wide variety of maize tissues and derived matrices; yields high-quality genomic DNA from processed plant tissue (e.g., leaf, grain, or seed). Lysis step: thermal lysis in the presence of Tris-HCl, EDTA, CTAB, and β-mercaptoethanol. Tissues are processed prior to the extraction procedure, e.g., with a mortar and pestle and liquid nitrogen (leaf) or a commercial blender (grain or seed). | CRLVL16/05XP, corrected version 2, 01/03/2018 |
| Soybean (seeds) | CTAB precipitation (in-house) (Dellaporta et al. 1983). This "Dellaporta-derived" method starts with a lysis step (thermal lysis in the presence of Tris-HCl, EDTA, NaCl, and β-mercaptoethanol), followed by isopropanol precipitation and removal of contaminants such as lipophilic molecules and proteins by extraction with phenol:chloroform:isoamyl alcohol. | CRLVL13/05XP, 14/05/2007 |
| Potato (freeze-dried tubers) | "CTAB/Microspin" method. Lysis step: thermal lysis in the presence of CTAB, EDTA, and proteinase K. RNA is removed by digestion with RNase A, and contaminants such as lipophilic molecules and proteins are removed by extraction with chloroform. Remaining inhibitors are removed by a gel filtration step using the commercially available S-300 HR Microspin Columns (Amersham Pharmacia). | CRLVL09/05XP, corrected version 1, 20/01/2009 |
| Rapeseed | CTAB precipitation (in-house) (Dellaporta et al. 1983). Lysis step: thermal lysis in the presence of Tris-HCl, EDTA, SDS, and β-mercaptoethanol. Contaminants such as lipophilic molecules and proteins are removed by extraction with phenol and chloroform. DNA is precipitated with isopropanol and the pellet is dissolved in TE buffer. Remaining inhibitors are removed by an anion exchange chromatography step using the DNA Clean & Concentrator 25 kit (Zymo Research). | CRLVL14/04XP, corrected version 1, 15/01/2007 |
| Multi-herbal products | CTAB precipitation (in-house) (Murray and Thompson 1980). Ideal for the rapid isolation of small amounts of DNA from many different species, and also useful for large-scale isolations. Lysis step: thermal lysis in the presence of Tris-HCl, EDTA, CTAB, and β-mercaptoethanol. Contaminants such as lipophilic molecules and proteins are removed by extraction with phenol and chloroform. Samples are processed prior to the extraction procedure (mortar and pestle, liquid nitrogen, or commercial blender). | Arulandhu et al. 2017 |
After DNA extraction, measuring both the DNA concentration and purity is important before continuing with further downstream analysis. Isolated DNA can be tested for quality using absorbance methods, agarose gel electrophoresis, and fluorescent DNA-intercalating dyes (
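As an illustration of the absorbance approach, widely used rules of thumb are that an A260 of 1.0 (1 cm path) corresponds to roughly 50 ng/μL of double-stranded DNA, an A260/A280 ratio near 1.8 suggests little protein carry-over, and an A260/A230 ratio of about 2.0 or higher suggests little contamination by polysaccharides, phenol, or salts. A small sketch (the spectrophotometer readings below are hypothetical, and the threshold values are common conventions rather than universal standards):

```python
def dsdna_concentration(a260, dilution_factor=1.0):
    """Approximate dsDNA concentration in ng/uL: an A260 of 1.0 in a 1 cm
    path corresponds to roughly 50 ng/uL of double-stranded DNA."""
    return a260 * 50.0 * dilution_factor

def purity_flags(a260, a280, a230):
    """Rules of thumb: A260/A280 near 1.8 suggests protein-free DNA;
    A260/A230 of ~2.0 or above suggests absence of polysaccharides,
    phenol, and salts."""
    return {
        "protein_ok": 1.7 <= a260 / a280 <= 2.0,
        "organics_ok": a260 / a230 >= 2.0,
    }

# Hypothetical spectrophotometer readings for one extract.
conc = dsdna_concentration(0.42)  # roughly 21 ng/uL
flags = purity_flags(a260=0.42, a280=0.23, a230=0.20)
```

Absorbance overestimates usable DNA in degraded or contaminated extracts, which is why fluorometric dyes and gel electrophoresis are used alongside it.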
It is in the interest of both biodiversity conservation and public safety that DNA-based techniques are further developed to screen food and medicine sourced from the global market (
Bioprospecting - The exploration of biodiversity for new resources of social and commercial value.
Pharmacophylogenomics - Plant pharmacophylogenomics is a field established by combining the fields of ethnopharmacology, plant systematics, phytochemistry, pharmacology, and bioinformatics. It is the application of phylogenomics to the study of pharmaceuticals.
Pharmacopoeia - From the obsolete typography pharmacopœia, literally, “drug-making”. In its modern technical sense, it is a book containing directions for the identification of compound medicines, and is published by the authority of a government or a medical or pharmaceutical society.
Pharmaphylogenetics - Field of research focusing on the phylogenetic correlation between phylogeny, chemical constituents, and pharmaceutical effects of medicinal plants.
Did you know that faeces are windows to the natural world? Although not the most glamorous material, faeces are worth their weight in gold when it comes to providing information about the host(s) they are derived from. Faeces, also commonly known as scat, poop, droppings, excreta, or stools, are the solid remains of ingested food that was not digested in the intestine. They are composed of water, protein, polysaccharides, fats, solids (e.g., fibres from plants), and bacteria (
Different types of information can be obtained from faeces. Chemical analyses provide information on hormonal changes that can occur from stress (
Non-molecular methods have traditionally been used for the analysis of contents from faecal samples. An example is microhistology, where small amounts of faecal samples are mounted on a microscope slide, and digested remains of plant cuticle fragments are identified based on morphology (
In plant molecular applications, a common use of faecal samples is in herbivore/omnivore diet studies. The goal of most plant-focused diet studies is to characterise the diet profile of the host, which can be used to answer research questions concerning for example, resource competition and partitioning (
The main advantage of using faecal samples for molecular plant identification as compared to other types of samples such as whole animals/insects (
One limitation when using faecal samples for molecular plant identification is that it can be difficult to obtain fresh faecal samples collected immediately after defecation, especially when working with wild animals. Age of samples can have an impact on the amount and quality of DNA that can be extracted due to DNA degradation caused by exposure to environmental conditions (
Advantages and limitations of using DNA from faeces to reconstruct plant communities.
| Advantages | Limitations |
|---|---|
| Non-invasive | Fresh samples may be challenging to obtain from wild animals |
| Easy to detect and collect | Presence of PCR inhibitors |
| Not considered rare | DNA degradation |
| Does not require capturing or locating the animal of interest | Faeces of closely related species can be hard to distinguish morphologically |
| | Additional molecular work needed |
| | Increased cost and time |
Before designing any sampling strategies for the collection of faecal samples, there are at least six factors that researchers must take into consideration:
Based on the research question(s) and objectives (i.e., quantitative, presence/absence, composition), researchers must decide how many samples and replicates are needed from each individual and/or population to sufficiently meet their research objectives. The choice of downstream molecular methods used for reconstructing herbivore/omnivore diet will also have an impact on how many samples are required. In quantitative studies where the objective is to quantify the ingested biomass, the number of different individuals sampled is not as important as in composition studies, where more individuals are required to obtain a better overview of the dietary range of the studied species. This is due to the effect of individual food preference, which can lead to biases in retrieving the whole range of a dietary profile for a given species if only a few individuals are studied (
Generally, the more ecological information gathered and incorporated into sampling strategies, the higher the chance of successful faecal collection. For wild species, prior ecological information regarding the species of interest is essential for designing sound sampling strategies, to optimise and streamline sample collection. Researchers can use the following questions as a guide in planning their sample collection strategy:
Without this information, it is challenging to narrow down specific study sites for field collection. Additionally, such information can reduce the necessary manpower, resources, and time spent in the field while increasing the probability of finding sufficient numbers of faecal samples. Knowledge of habitat range and population density can prevent excessive numbers of samples being collected from a single individual when the research question requires samples from multiple individuals. Differences in home range and diet between seasons can also impact sample collection strategy (
Sampling strategies are also heavily dependent on budget constraints, which may reduce the time spent on sample collection, the number of samples processed, and also the molecular techniques used in analysing the faecal samples. It is therefore prudent to ensure that the budget fits the research objectives, or that the research objectives are tailored to fit the research budget. While there are many different approaches to sampling, two commonly used approaches are systematic sampling and opportunistic sampling. In systematic sampling, the study area is divided into grids or transects, and samples are taken at each grid point or at fixed intervals (
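The systematic approach described above can be sketched as a simple grid generator: given a rectangular study area and a spacing, it returns the coordinates at which samples (or transect points) would be taken. A minimal example with a hypothetical 400 m × 200 m study area (real designs would work with projected GPS coordinates and irregular area boundaries):

```python
def grid_points(x_min, y_min, x_max, y_max, spacing):
    """Regular grid of sampling points covering a rectangular study area;
    a simple form of systematic sampling."""
    points = []
    y = y_min
    while y <= y_max:
        x = x_min
        while x <= x_max:
            points.append((x, y))
            x += spacing
        y += spacing
    return points

# Hypothetical 400 m x 200 m study area sampled on a 100 m grid.
plan = grid_points(0, 0, 400, 200, 100)
```

Halving the spacing roughly quadruples the number of points, which is where the budget constraint discussed above bites.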
Finally, ethical consideration of minimising distress to studied animals is one of the main concerns in animal studies and there are legal restrictions as implemented in the EU Directive 2010/63/EU on the protection of animals used for scientific purposes (
Once the sampling strategy has been determined, fieldwork can begin with locating faecal samples; once located, collection can start. Regardless of the animal and habitat the faecal samples are derived from, a few materials will always be needed: sterile tubes filled with, e.g., RNAlater™, silica beads, or 90% ethanol; gloves; and a device to collect the samples. Sterile tubes are necessary for sample storage and can have either removable screw-lids or hinged lids. Removable screw-lids have the advantage that they will not come off during transport. However, there is an increased risk of environmental contamination with these lids, since they are separate from the tube and must be placed somewhere during collection. Tubes with hinged lids are easier to work with in that sense, though they can open during transport if not sealed (e.g., with Parafilm™). Proper use of gloves and a collection device is also important, both to limit the risk of the collector becoming sick from directly handling faeces and to reduce the risk of sample contamination. The size and type of the sampling device can differ depending on the size of the faecal dropping and can range from a toothpick to a large spoon.
DNA-based diet analyses are very sensitive to contamination, and the trace amounts of digested plant material that can be extracted from faecal samples is easily contaminated. Contamination can occur between samples, by plant DNA from the surrounding environment, or even from the collector’s (plant-based) lunch (
To avoid DNA degradation, faecal samples should be preserved as soon as possible upon collection and stored under the same conditions (
To avoid contamination, extractions should be carried out in a room free of PCR amplified DNA. Due to the risk of zoonotic disease transmission, extraction should ideally be carried out in a flow-hood to avoid inhaling dust from dry faeces (
Faecal samples from plant-eating animals usually contain high levels of PCR inhibitors such as humic acid, which can lead to amplification failure during downstream analysis (
Depending on the research question(s), several different HTS methods can be used for analysing DNA extracted from faecal samples including metabarcoding (
Coprolites – Fossilised faeces.
Near-infrared spectroscopy (NIRS) – A non-destructive and fast technique utilising the near-infrared region of the electromagnetic spectrum.
RNAlater – Non-toxic aqueous reagent for storage purposes, preserving RNA and DNA.
Stable isotopes – Non-radioactive forms of a chemical element.
Zoonotic disease – Infectious disease caused by pathogens jumping from non-human hosts to humans.
Sedimentary ancient DNA studies aim to reconstruct the biology and ecology of past environments using the DNA present in the sediment record. Compared to modern soil and sedimentary DNA (see Chapter 4 DNA from soil), these analyses can be more challenging due to the prolonged exposure of the DNA to degradation processes. This has major implications for the scope of the study and the appropriate study design, which will be discussed in this chapter.
In order to use sedimentary ancient DNA for paleoecological studies (sedaDNA;
Ancient DNA is the hereditary genetic content of cells from organisms that died a long time ago. There is no consensus on how old DNA should be in order to be called ancient, as the age is generally less important than the exposure to degradation processes that make it more degraded than modern DNA. SedaDNA degradation processes are primarily related to environmental and sedimentary properties, such as temperature, pH, water content, oxygen levels, and minerals present in the sediment (
How does DNA end up in the sediment? Sediment is a result of erosion, weathering and biological processes and consists of organic and inorganic particles (e.g., sand and silt) that are transported by wind, water, or people (
Once exposed to the sedimentary environment, exDNA can undergo different post-depositional taphonomic processes that determine the quality of the DNA on longer timescales. ExDNA can be internalised by microbial cells (
Schematic overview of DNA degradation processes (hydrolysis, oxidation, alkylation and Maillard reaction) that can cause DNA damage in the form of cleavage, base modifications or cross-links. The major mechanism leading to miscoding lesions in aDNA is the hydrolysis of cytosine to uracil, which leads to G to A and C to T substitutions by DNA polymerases, whereas blocking lesions can obstruct the movement of DNA polymerases during PCR (
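The hallmark excess of C to T mismatches near read ends described above can be illustrated with a short, self-contained sketch: a toy damage profile that counts C to T mismatches by position from the 5' end of aligned reads. Real tools (e.g., mapDamage) operate on BAM alignments; the aligned string pairs and function name here are invented for illustration.

```python
# Toy aDNA damage profile: fraction of reference C positions read as T,
# per position from the 5' end of the read. Inputs are pre-aligned
# (read, reference) string pairs of equal length (invented example data).

def ct_damage_profile(pairs, window=5):
    """Return C->T mismatch rates for the first `window` read positions."""
    ct = [0] * window       # C->T mismatches observed at each position
    c_total = [0] * window  # reference C sites seen at each position
    for read, ref in pairs:
        for i in range(min(window, len(read))):
            if ref[i] == "C":
                c_total[i] += 1
                if read[i] == "T":
                    ct[i] += 1
    return [ct[i] / c_total[i] if c_total[i] else 0.0 for i in range(window)]

# Deaminated ancient reads typically show elevated C->T at the 5' terminus:
pairs = [("TACGT", "CACGT"), ("TTCGA", "CTCGA"), ("CACGT", "CACGT")]
profile = ct_damage_profile(pairs, window=3)  # elevated rate at position 0
```

A rising C to T rate towards the read terminus, compared to interior positions, is the signature used to authenticate ancient sequences.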
By analysing the ancient DNA present in the sediment (
Macrofossils and plant sedaDNA originate close to the sample location and give a similar local signal (
In general, palaeovegetation data are the result of the attributes of the original vegetation, combined with depositional factors and preservation, as well as the experimental procedures to produce the data. For sedaDNA analyses, this includes every step of the data generation itself: sampling, transport, storage, processing of the DNA in the laboratory, and finally, the bioinformatic pipelines used. In terms of the data generation, pollen analyses and macrofossil analyses rely on taxonomic identification by microscopy, which is labour-intensive and requires a high level of taxonomic knowledge. Although some training is needed to work in an ancient DNA laboratory, in principle, taxonomic identification by DNA can be carried out without prior taxonomic knowledge. However, familiarity with plant taxonomy, phylogenetic placement, and biology of different groups is invaluable in the interpretation of the automated identifications. For example, it is important to check if the automated DNA identifications make sense for the sample location, because contamination, DNA degradation, and the quality of the reference library can cause false DNA identifications (see Chapter 18 Sequence to species for details).
A combination of sedaDNA, macrofossils, and pollen proxies gives the most complete overview of plant diversity and community composition through time. The choice for these proxies is dependent on the aims of the study. Table
Comparison of pollen, plant macrofossils, and sedaDNA as proxies for palaeoecological reconstructions on the levels of: source and sediment, data generation, and data interpretation. Sources:
Category | Pollen | Plant macrofossils | SedaDNA |
---|---|---|---|
Source and sediment | |||
- Scale | Regional | Local | Local |
- Taxonomic groups | Pollen-producers | All plants | All organisms |
- Potential sources of bias | High pollen-producing plants; vegetation cover close to sampling area; differential preservation | Differential preservation of tissue-types and species | Differential DNA degradation and decay |
Data generation | |||
- Labour-intensive | Yes | Yes | No |
- Need for taxonomic knowledge | Yes | Yes | No |
- Taxonomic resolution | Limited to identifiable pollen types, generally to genus level | Generally to species-level | Depends on the marker, possible to species-level |
- Potential sources of bias | Identifiability of the remains | Identifiability of the remains; random occurrence | DNA contamination; choice of lab techniques; completeness of reference library |
Data interpretation | |||
- Qualitative | Yes | Yes | Yes |
- Quantitative | Partial | Limited | Debated |
The first study using sedaDNA of macroorganisms was published in 2003, demonstrating the possibility to detect plant and animal DNA in both permafrost sediments and temperate cave sediments (
Environmental reconstructions span polar, temperate, and tropical regions, although they are limited to sampling sites that allow preservation of sedaDNA, such as permafrost, lake sediments, and dry cave sediments. Permafrost sediment can be used to assess vegetational development in polar regions under climate change (e.g.,
Due to its low concentration, retrieving ancient DNA from sediment samples requires strict protocols to avoid contamination by modern DNA or further degradation (
The previous section described some sedaDNA studies focusing on palaeoecological and archaeological questions. In both cases, choices of location and methods are very much steered by the research focus and what is already known about the area, such as past changes in climate, geology, ecology, or human impacts. Although details in the study design can differ, all sedaDNA studies follow the same steps: site selection, collection of samples and metadata, DNA extraction, further processing of the DNA in the lab, sequencing, and finally, bioinformatic sequence quality filtering and data analyses (Figure
Simplified overview of the sedaDNA research process, including some of the major challenges and potential solutions indicated at each step.
Choices for the different options at each step depend on the aims of the study. For example, when performing a reconstruction of overall plant community dynamics with universal plant metabarcoding primers, the most common taxa and major trends in community change will be reliably retrieved in the first PCR performed (
The aims of the study define the temporal and spatial scale needed to achieve them, thereby steering the selection of relevant sampling sites. Lake sediments provide a record of the plants that occurred in the lake catchment, i.e., the area of land from which water and surface runoff drains into the lake (
General conditions under which sedaDNA preserves well are: cold and stable temperatures, neutral pH, and dry or anoxic sediments with a high mineral content. Sediments from rockshelters, dry caves, and lakes are generally preferred as they are protected and provide stable conditions: rockshelter and dry cave sediments are sheltered from rain and have stable temperatures, and there is some evidence that calcite has a high adsorption capacity for DNA (
Dating is important in any study that involves ancient samples. Only with accurate dating can the timing of events be compared and their rates of change estimated. Commonly applied sediment dating methods are radioisotopic dating (in particular 210Pb, 14C, and luminescence dating) and dating based on chemostratigraphy or marker minerals (in particular tephrochronology), and the choice for a method depends on the type and age of the sediments (see Table
Summary of sediment dating methods, their applicability and limitations. Sources:
Dating method | Suitable sample types | Age limit | Sources of error and uncertainty |
---|---|---|---|
210Pb dating | Materials from aquatic environments such as lacustrine and marine deposits | ~100 to 150 years | Complex sedimentation processes that break the dating model assumptions, such as compaction, local mixing, erosion etc. |
14C (radiocarbon) dating | Organic remains (charcoal, wood, animal tissue), carbonates (corals, sediments, stalagmites and stalactites), water, air and organic matter from various sediments, soil, paleosol and peat deposits | Up to 50,000 years | Atmospheric 14C content fluctuation due to changes in cosmogenic production rate and exchange between the atmosphere and ocean |
Luminescence dating: thermoluminescence (TL) and optically stimulated luminescence (OSL) | TL: materials containing crystalline minerals, such as sediments, lava, clay, and ceramics; OSL: materials containing quartz or potassium feldspar sand-sized grains, or fine-grained mineral deposits | TL: a few years to over 1,000,000 years; OSL: a few decades to ~150,000 years for quartz | Variations in environmental radiation dose; saturation of electron traps in sample minerals |
Tephrochronology | Terrestrial and lake sediments, marine deposits and ice cores that contain tephra | Up to 35,000 years, extendable under good conditions | Can only obtain indirect dates within the 14C age range |
Radioisotopic dating is based on the principle of radioactive decay. When a nucleus breaks down, it emits energy and forms a daughter product. The rate of this process is expressed as the half-life, i.e., the time it takes for 50% of a parent element to transmute into the daughter product. The relative quantity of a radioactive parent element in a sample can be used to infer its age. Relatively young aquatic sediments, with ages up to 150 years, are commonly dated with 210Pb (half-life: 22.27 years;
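The half-life principle translates directly into an age calculation; a minimal sketch (the 25% remaining fraction is an invented example):

```python
import math

def radiometric_age(parent_fraction, half_life_years):
    """Age from the fraction of parent isotope remaining: N/N0 = (1/2)^(t/T)."""
    return -half_life_years * math.log2(parent_fraction)

# 210Pb (half-life ~22.27 years): a sediment layer retaining 25% of its
# initial unsupported 210Pb is two half-lives old.
age = radiometric_age(0.25, 22.27)  # ~44.5 years
```

In practice, dating models such as CRS or CIC add assumptions about sedimentation, but the decay arithmetic underneath is exactly this.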
Luminescence dating is based on the phenomenon that mineral crystals absorb electrons from the ionising radiation of surrounding sediments over time, and when stimulated in a laboratory by heat or light, they release the accumulated radiation as luminescence. The intensity of the measured luminescence indicates the length of time between this in-lab stimulation and the last natural event of similar stimulation. Heat-stimulated or thermoluminescence (TL) dating is used to date baked pottery from archaeological sites or sediments once in contact with molten lava; optically stimulated luminescence (OSL) dating is used to date sediments once exposed to sunlight. The time range for luminescence dating can be from a few decades to over 1 Ma, depending on the ability of a mineral to absorb radiation over time. For studies concerning relatively young samples, OSL dating of quartz grains is generally used, covering from a few decades to ~150 ka.
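The underlying age equation for luminescence dating is simply the accumulated (equivalent) dose divided by the environmental dose rate; a minimal sketch with illustrative values:

```python
def luminescence_age(equivalent_dose_gy, dose_rate_gy_per_ka):
    """Age in ka = equivalent dose De (Gy) / environmental dose rate (Gy/ka)."""
    return equivalent_dose_gy / dose_rate_gy_per_ka

# Illustrative numbers, not from the chapter: a quartz sample with an
# equivalent dose of 60 Gy in sediment delivering 2.5 Gy per thousand years.
age_ka = luminescence_age(60.0, 2.5)  # 24 ka
```

Estimating the dose rate (from the radioactivity of the surrounding sediment) is where most of the real uncertainty lies.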
Tephrochronology uses the chemical signature of tephra (volcanic ash) to pinpoint the age of that specific layer in a sediment sequence by reference to known or unknown dated volcanic eruptions. Terrestrial sediments (
DNA is everywhere - including in the air - and contamination can come from many different sources. When collecting and working with sedaDNA samples, keep in mind that the DNA you are interested in will probably be present in very low concentrations. Contamination with modern DNA can easily overpower the sedaDNA signal you are interested in. It is therefore important to minimise the amount of modern DNA entering your samples and to limit further degradation of the sedaDNA.
The precautions you can take include working cleanly, using equipment that is free of DNA and nucleases, and keeping the samples in a stable, cold environment. In practice this is not so easy, which is why dedicated ancient DNA facilities are set up to avoid any form of contamination. These facilities should be physically isolated - ideally in a separate building - from any location where PCRs are performed (
You should assume that everything that you bring into the lab is contaminated with DNA. Therefore, before entering the lab, you should have showered and changed into clean clothes and everything you bring into the lab should be decontaminated. Inside the lab, you should wear a hairnet, face mask, full body suit with hood, shoe covers, and gloves at all times. Wearing two layers of gloves will allow you to change the outer gloves while still covering your hands, and you should change your outer gloves regularly while working. All tools and equipment should be decontaminated before use, and regular cleaning of the aDNA workspace is needed. Decontamination can be achieved by using a DNA decontamination product (e.g., 3-10% bleach or DNA-ExitusPlusTM) for surfaces, ideally supplemented with UV irradiation of the workspace. To prevent cross-contamination, tools should be cleaned between working with each sample or sample-extract. Tools should be left in a DNA decontamination product for at least 10 minutes, rinsed with UV irradiated milliQ water, and ideally also UV irradiated using a UV crosslinker with irradiation at the shortest distance possible to the UV source (
Choices for sampling and personal protective equipment will depend on the setting, as the sampling of sediments at an archaeological site can be very different from the sub-sampling of a lake sediment core in a lab facility. It is important to limit potential contamination, but practical considerations and the target DNA can also guide these choices. For example, a study aiming to recover human aDNA will require stricter use of personal protective equipment than a study focussing on plant aDNA. Sampling of sediments can be done directly in the field or by subsampling of sediment cores in a clean, sheltered environment. When collecting sediment cores for sedaDNA, closed-chamber piston-type corers are preferred (
A general sedaDNA sampling kit contains personal protective equipment, sampling equipment, and cleaning products, including: full bodysuits, face masks, hairnets, nitrile gloves, sterile scalpels, sample tubes, clean ziplock bags, DNA decontamination products, distilled water, 70% ethanol, trays or beakers for cleaning the tools, paper towels, trash bags and pens for labelling. To limit potential contamination, much of the preparation for the sampling kit takes place in the ancient DNA lab facility: making sure the sampling tools and collection tubes are prepared and DNA-free. Aluminium foil can be helpful for covering your workspace and provides a clean surface for all of the sampling materials at a sampling site. Sterile syringes with the tip cut off can be useful mini-corers, speeding up the sample-taking (
The sampling itself follows aDNA lab procedures where possible, even if it takes place elsewhere: clean the workspace, use personal protective equipment, do not hover over the sediment you are sampling, and change outer gloves and tools between each individual sample. To avoid contamination, sampling should start at the oldest part of the sediment, working your way up to the youngest parts, and subsamples from sediment cores should be taken from inside the undisturbed centre (
The choice for a specific DNA extraction protocol depends on a range of factors, including the aim of your study, sample characteristics, available laboratory facilities and equipment, and costs of the reagents or extraction kits. The latter can be a consideration of investing either time or finances as it can be cheaper to make the buffers needed for extraction yourself, but this also increases the preparation time and could introduce additional contamination to your samples. There are several protocols that can be used for sedaDNA extraction (see
Overview of the advantages and limitations of several commonly used extraction protocols and some example publications using these protocols.
Extraction protocol | Sample size | Advantages | Limitations | Used by |
---|---|---|---|---|
DNeasy PowerMax kit (Qiagen) | ≤ 10 g | Large initial sample volume; few inhibitors in the resulting extract | Expensive; DNA can be lost with inhibitor removal solution | |
DNeasy PowerSoil kit (Qiagen) | ≤ 250 mg | Few amplification and sequencing inhibitors in the resulting extract; easy processing of large sets of samples | DNA can be lost with inhibitor removal solution; smaller initial sample volume compared to the PowerMax kit | |
Rohland protocol ( | ≤ 50 mg | Developed to recover small DNA fragments; easy processing of large sets of samples | Small starting amount of sediment; potential coextraction of inhibitors; homemade buffers can increase contamination risk | |
Phosphate buffer + NucleoSpin® Soil kit ( | ≤ 15 g | Large initial sample volume; processes a 2 ml subsample of the phosphate buffer and sample mixture | Extracts only extracellular DNA | |
Murchie protocol ( | ≤ 250 mg | High DNA yields; uses a high volume binding buffer to improve the recovery of small DNA fragments | Optimised for permafrost samples and may not perform as well in lake sediment | |
All extraction protocols include similar steps for the isolation of sedimentary DNA (Figure
Common DNA extraction steps: (1) samples are first homogenised using a sterile scalpel, and then go through a step in which either (2a) extracellular DNA is washed off the sedimentary matrix (
SedaDNA studies employing protocols developed for the extraction of modern environmental DNA from soils and sediments generally add additional steps to increase the yield of DNA from low concentration ancient sediment samples. A lysis step can be added to extract iDNA from intact cells present in the samples through chemical lysis, and/or mechanical shearing of cell membranes using beads. Adding certain chemicals to the lysis buffer can also increase yield: N-phenacylthiazolium bromide (PTB) breaks down cross-links between DNA and proteins (
Chapter 8 Infographic: Visual representation of the content of this chapter. Top left image based on Pederson et al. (2015).
Be aware that certain substances may inhibit further amplification or sequencing steps. Humic substances (important components of humus) are commonly present in sediments and can inhibit downstream analysis. Moreover, the amount of humic substances is site-specific, and it might be necessary to repurify the samples or use inhibitor removal columns. During DNA extraction, contamination may be introduced from the laboratory facilities, tools, reagents, and other consumables. It is essential to track this contamination by including a negative control: add one such extraction control for each batch of 11 samples, and include it in all subsequent steps (e.g., metabarcoding, library preparation, sequencing;
After extracting the DNA, the sedaDNA needs to be further processed before sequencing; existing approaches are continuously being improved and new ones developed.
Most sedaDNA studies apply a DNA metabarcoding approach, using PCR amplification primers to target short DNA sequences (< 300 bp, preferentially around or below 100 bp) from taxonomic marker genes to identify specific taxonomic groups (see Chapter 11 Amplicon metabarcoding). It is relatively low cost and some of the metabarcoding primers give high taxonomic resolution. However, this method can introduce amplification bias (
Sequencing data can be processed using bioinformatic tools, where strict quality filtering of the sequence data is followed by taxonomic assignment. Further filtering allows removal of sequences with low identity scores, contaminants (i.e., sequences present in the controls), and false-positives (see Chapter 18 Sequence to species for details). False identifications can be caused by the quality of the reference library, but also by technical errors, contamination, or errors in the DNA sequences, especially as sedaDNA is generally highly degraded and of low concentration. It is therefore important to check if the identifications make sense for the sampling location and age before further analyses of the sedaDNA data.
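A hedged sketch of such post-identification filtering, removing low-identity assignments and sequences also present in the negative controls (the field names and the 0.95 identity threshold are illustrative, not prescribed by this chapter):

```python
# Toy post-identification filter for taxonomically assigned sequences.
# Each assignment is a dict with the sequence, its assigned taxon, and
# the identity score against the reference library (invented structure).

def filter_assignments(assignments, control_seqs, min_identity=0.95):
    """Drop low-identity hits and sequences also seen in negative controls."""
    control = set(control_seqs)
    return [a for a in assignments
            if a["identity"] >= min_identity and a["seq"] not in control]

records = [
    {"seq": "ATCG", "taxon": "Betula", "identity": 0.99},
    {"seq": "GGCC", "taxon": "Pinus", "identity": 0.90},   # below threshold
    {"seq": "TTAA", "taxon": "Homo", "identity": 1.00},    # in controls
]
kept = filter_assignments(records, control_seqs=["TTAA"])  # Betula only
```

Real pipelines (e.g., OBITools- or DADA2-based workflows) apply additional per-replicate and abundance filters, but the logic of comparing against controls is the same.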
Alkylation – Addition or substitution of an alkyl group (CnH2n+1) to an organic molecule.
Accelerator Mass-Spectrometry (AMS) dating – A dating method that determines the age of an organic material (i.e., macroscopic remains of plants or animals) by measuring their radiocarbon concentration.
Cell lysis – The process whereby the membrane(s) of a cell breaks down, thereby releasing the cell contents.
exDNA – Extracellular DNA; all DNA located outside cell membranes.
Geochemical fingerprinting – A method using chemical signals to infer the origin, the formation and/or the environment of a geological sample.
Half-life – The time necessary for half of a radioactive atom’s nucleus to decay by emission of matter and energy to form a new daughter product. The half-life is specific to a radioactive element, and can be used for dating purposes.
iDNA – Intracellular DNA; all DNA present within cell membranes.
Lake catchment – Area of land from which water and surface runoff drains into a lake.
Luminescence dating – A group of methods to determine how long ago mineral grains were last exposed to sunlight or sufficient heating by measuring the luminescence emitted by the mineral grain upon stimulation.
Metabarcoding – Method for the simultaneous identification of many taxa within the same complex DNA extract. This is achieved by high throughput sequencing (HTS) of amplicons from taxonomic marker genes (barcodes).
Next Generation Sequencing (NGS) – Massively parallel sequencing technology allowing high throughput of DNA.
Nucleases – Diverse group of enzymes able to hydrolyze the phosphodiester bonds of DNA and RNA thereby cleaving them into smaller fragments.
Optically stimulated luminescence (OSL) dating – Dating method that determines the age of a sample by measuring the luminescence it emits in response to visible or infrared light.
Palaeoecology – The study of the relationship between past organisms and their ancient environments.
Permafrost – Soil, sediment, or rock that is continuously exposed to temperatures of < 0 °C for at least two consecutive years.
Radioactive isotope – An atom with excess nuclear energy and prone to undergo radioactive decay.
Reference library – A database of known DNA sequences with their taxonomic identifications, used in bioinformatics as a reference to identify the DNA sequences obtained in a sedaDNA study.
sedaDNA – Sedimentary ancient DNA; this is the aged and degraded DNA from dead organisms now incorporated in the sediment record, either as iDNA in dead tissues, or as exDNA free in the sediment matrix or adsorbed to sediment particles.
Shotgun sequencing – A method for the random sequencing of all of the DNA within a DNA extract.
Taphonomic processes – The processes involved in the transfer, deposition, and preservation of organismal remains, including DNA.
Target capture – A technique that allows the capture of the DNA of interest by hybridization to target-specific probes (baits).
Tephrochronology – A geochronological technique that uses layers of tephra (volcanic ash from a single volcanic eruption) to create a chronological framework for the sedimentary record.
Thermoluminescence (TL) dating – Dating method that determines the age of a sample by measuring the luminescence it emits in response to heat.
Total DNA – The intracellular and extracellular DNA combined.
Tree-ring dating – Also called dendrochronology; a method of dating tree rings to the exact year they were formed.
The revolution in genome-wide screening has vastly reduced the price of sequencing, with enormous implications for the biomedical field, industry, biodiversity monitoring, and plant identification. The first plant genome (Arabidopsis thaliana L.) was sequenced using Sanger sequencing. This took 10 years to complete, at a cost of approximately $100,000,000 (Arabidopsis Genome Initiative 2000). With current high-throughput sequencing (HTS) methods, this same genome now takes 1 week to sequence and assemble, and costs $1000 (
Sanger sequencing was introduced in 1977 by Sanger and colleagues, and for over 40 years, it was the most commonly-used form of sequencing (
In the second step of Sanger sequencing, the oligonucleotides are separated by size using capillary gel electrophoresis. A laser excites the terminal fluorescent nucleotide in each oligonucleotide, resulting in fluorescence emission that is detected and read by a computer. By reading the gel bands from smallest to largest, the 5’ to 3’ sequence of the target DNA can be determined at single base pair resolution. The data output of Sanger sequencing is a chromatogram, which is automatically read by a computer to generate the DNA sequence. Primer sequences should be trimmed off the reads, as these are not part of the target DNA, and the quality of the chromatogram should be assessed to determine the reliability of the generated DNA sequence. There are a number of online tutorials from both industrial and academic sources that we refer the reader to for assessing chromatogram quality (University of Michigan, Biomedical Research Core Facilities, n.d.). Base calling accuracy can also be measured using Phred quality scores (
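The Phred relationship, Q = -10 log10(P), and the standard FASTQ ASCII encoding can be illustrated in a few lines (the helper function names are our own):

```python
# Phred quality scores: Q = -10 * log10(P_error), so Q20 means a 1-in-100
# error probability and Q40 a 1-in-10,000 error probability. In FASTQ
# (Sanger / Illumina 1.8+ encoding), Q is stored as the character ASCII(Q + 33).

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def phred_from_char(c, offset=33):
    """Decode one FASTQ quality character to its Phred score."""
    return ord(c) - offset

q = phred_from_char("I")   # 'I' decodes to Q40
p = error_probability(q)   # 1-in-10,000 chance the base call is wrong
```

This same scale is used by the HTS platforms discussed below, which is why Phred scores appear throughout downstream quality filtering.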
Sanger sequencing is not used today for large-scale genomic projects due to its low throughput. The requirement for specific primers for a region of interest limits its easy use and application across divergent plant taxa. Additionally, the amplification of multicopy genes, such as the commonly used DNA barcode ITS (see Chapter 10 DNA barcoding), as well as markers in taxa of allopolyploid hybrid origin, results in difficult-to-interpret chromatograms. This is because nucleotide polymorphisms between different copies result in double peaks in the resulting chromatogram (
Illumina was the second HTS technique that became commercially available in the early 2000s (
In Illumina sequencing, like in other high throughput sequencing approaches, the target DNA is initially broken into shorter fragments that match the optimal fragment sequencing length of the platform, if not already present as shorter segments. These fragments are then PCR-amplified with adaptors that can be individually chemically tethered to the flow cell surface. Using bridge amplification (
Dyed dNTPs are added in a controlled fashion through the use of reversible blocking group chemistry, so that the emission of each added fluorescent dNTP is read before the addition of the next fluorescently-labelled dNTP. This process is done on millions of fragments simultaneously, making it a far more efficient method than Sanger sequencing for large-scale genomic projects (
However, two limitations of Illumina sequencing should be considered: the reads produced are relatively short (50 to 300 bp), and, similarly to Sanger sequencing, most applications require a PCR amplification step. PCR-free library kits and protocols provide increasingly good results, however, and have the important advantage of reducing typical PCR-induced biases. Assembling whole genomes, especially highly repetitive ones, from short Illumina reads can be challenging (
Current examples of Illumina sequencing platforms, specifications, and suitability for different applications in plant identification.
Illumina sequencing platform | MiSeq | HiSeq 2500* | HiSeq 3000* | HiSeq 4000* | NextSeq 1000 and 2000 | NovaSeq 6000 |
---|---|---|---|---|---|---|
Specifications | ||||||
Maximum read length (pair ended) | 2 x 300 | 2 x 250 | 2 x 150 | 2 x 150 | 2 x 150 | 2 x 250 |
Maximum reads per run (single reads) | 25 million | 600 million | 2.5 billion | 5 billion | 1.1 billion | 20 billion |
Flow Cell output | 15 Gb | 300 Gb | 750 Gb | 1.5 Tb | 330 Gb | 6 Tb |
Method suitability | ||||||
Metabarcoding | +++ | +++ | + | + | + | ++ |
Target Capture | + | + | + | +++ | + | +++ |
Shotgun sequencing | + | ++ | +++ | +++ | ++ | +++ |
Genome skimming | + | ++ | +++ | +++ | ++ | +++ |
Organellar sequencing (plastids) | + | ++ | +++ | +++ | ++ | ++ |
Transcriptomics: gene targeted | +++ | +++ | + | + | + | ++ |
Transcriptomics: total RNA/mRNA seq | + | + | ++ | ++ | ++ | +++ |
Pacific Biosciences (PacBio) sequencing is based on single molecule real time (SMRT) technologies for reading DNA and RNA sequences. No PCR amplification is required, which can be advantageous for certain applications, for example when PCR inhibitors are or may be present, when the sequence is GC-rich, or when PCR bias should be avoided. Additionally, PacBio reads are considerably longer than in either Sanger or Illumina sequencing (up to 25 kb) (Pacific Biosciences, n.d.). This reduces computational challenges related to assembling contigs into full sequences. PacBio is considered a third generation sequencing technology, as it reads the nucleotide sequence both in real-time and at the single molecule level (
Similarly to Illumina and Sanger sequencing, PacBio also uses fluorescently-labelled dNTPs for determining a target DNA sequence. PacBio, however, employs a technology called zero mode waveguides (ZMWs) to read nucleotide sequences at the single molecule level. ZMWs are nanosized wells with attoliter (10⁻²¹ L) volumes that can be etched into different materials. ZMW technology differentiates a fluorescent molecule that is floating in solution from a fluorescently-labelled nucleotide that is located at the bottom of the well. A single DNA polymerase is tethered to the bottom of each well, and when a fluorescently-labelled dNTP is incorporated into the growing DNA strand, the fluorescent label is cleaved off. There is a unique fluorescent marker for each of the 4 nucleotides, and each cleavage event is read and directly linked to a specific nucleotide (van Dijk et al. 2018). Additionally, the rate of addition can be used to infer whether the target DNA is chemically modified (e.g., methylated), since a modified DNA strand moves more slowly through the DNA polymerase, resulting in a reduced incorporation rate for a fluorescent nucleotide. This information is extremely powerful for predicting epigenetic modifications that are critical for a variety of biological functions. In addition, chemical modifications that are often present in aDNA can also be detected, making PacBio a particularly useful technique for assessing aDNA damage (
While previously PacBio suffered from a high error rate in comparison to Illumina sequencing, this has been dramatically reduced by the introduction of circular consensus sequencing (CCS), also known as long high-fidelity (HiFi) reads (
Oxford Nanopore (or simply Nanopore) sequencing is also a third generation SMRT technology that is single-molecule based and measured in real time. Nanopore differs from the other sequencing technologies discussed here in that no DNA polymerase is required, and no expensive chemically modified dNTPs are necessary for reading the target sequence. The system consists of an electrolytic solution and a nanosized, biologically-derived pore in an insulating solid (a material that does not conduct electricity). The biological nanopores used in this technology are derived from proteins that form pores in biological membranes, where they naturally allow the passage of ions and biomolecules across the membrane. When an electric field is applied, ions in the electrolytic solution pass through the pore, resulting in a stable current that can be detected. When larger molecules pass through the pore, such as a DNA strand, detectable disruptions in the current occur. With a DNA strand, stretches of 6–7 nucleotides occupy the pore at a time, and the movement of these bases yields a changing, detectable disruption. This disruption has a unique signature, with a specific current change for a specific length of time, that can be linked to each of the four individual nucleotides. From the current disruption pattern it is possible to deduce the sequence. Moreover, since it is the change in current through the pore that is detected, no other chemical markers are necessary (
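As a purely illustrative sketch of the k-mer idea (real basecallers use neural networks on the raw current signal; the k-mer-to-current model and current levels below are invented):

```python
# Toy nanopore basecalling: each k-mer occupying the pore shifts the current
# to a characteristic level. Here a tiny invented model maps 3-mers to
# current levels; decoding picks the nearest modelled k-mer per measurement.

KMER_CURRENT = {"AAA": 80.1, "AAC": 75.3, "ACG": 68.9, "CGT": 60.2}

def nearest_kmer(level, model=KMER_CURRENT):
    """Assign a measured current level to the closest modelled k-mer."""
    return min(model, key=lambda k: abs(model[k] - level))

def decode(levels):
    """Stitch overlapping k-mer calls into a sequence (naive, no HMM)."""
    kmers = [nearest_kmer(level) for level in levels]
    seq = kmers[0]
    for km in kmers[1:]:
        seq += km[-1]   # successive k-mers overlap by k-1 bases
    return seq

seq = decode([80.0, 75.5, 69.0, 60.0])  # AAA, AAC, ACG, CGT -> "AAACGT"
```

The naive nearest-level lookup ignores noise and variable translocation speed, which is exactly what the neural-network basecallers are trained to handle.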
Nanopore technologies, with a read length up to 4 Mb, are rapidly becoming important due to their scalability and portability. The MinION sequencing platform (theoretical output up to 50 Gb/flow cell) is a portable and cost-effective option (87 g, available from $1000) that can be used in the field. Already, a number of excellent examples of biodiversity studies (and plant-based studies in particular) are available in the literature (
Library preparations are essential for all experiments involving HTS. General points to consider are discussed here and we also refer to Chapter 12 Metagenomics and Chapter 15 Transcriptomics for more details.
Unlike in other forms of sequencing, Ion Torrent technologies are not based upon optical outputs, but rather on changes in pH. When a DNA polymerase adds a nucleotide to a growing DNA strand, a proton is released upon each addition. It is this release of protons into solution, and the resulting change in the pH of the solution, that is detected in Ion Torrent technologies (
Similarly to Illumina sequencing, the target DNA is initially fragmented (200–600 bp) and PCR-amplified with adaptors that can be tethered to micro-machined wells on a semiconductor chip. The chip is then flooded with one of the four nucleotides. If a nucleotide is added across from the complementary base in the single-stranded DNA by the DNA polymerase, a proton is released and the solution pH shifts. This shift in solution pH is detected by an ion-sensitive field-effect transistor (ISFET), which senses changes in proton concentration. This is done in a massively parallel fashion, with thousands of microwells read simultaneously. The pH change that results from the addition of multiple nucleotides in a repetitive sequence is also detectable using this technology, as the incorporation of two identical nucleotides produces double the signal change of a single incorporation. The data output with Ion Torrent technologies can provide an approximate readout of 10 Mb in a single run with conventional machines, and up to 10 Gb with the newest models. The platform however struggles with base calling of homopolymers, and for these sequences it can be a challenge to obtain accurate reads.
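The flow-space logic described above can be sketched as a toy base caller (this is not Ion Torrent's actual software; the flow order and signal values are invented for illustration): each flow's signal is roughly proportional to the number of identical bases incorporated, so rounding the signal gives a homopolymer length.

```python
# Toy flow-space base calling: each flow floods the chip with one
# nucleotide, and the signal is roughly proportional to how many
# identical bases are incorporated in that flow.

FLOW_ORDER = "TACG"  # hypothetical cyclic flow order

def call_bases(signals, flow_order=FLOW_ORDER):
    """Round each flow's signal to an integer homopolymer length."""
    seq = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        seq.append(base * round(signal))
    return "".join(seq)

# A template of TTACGG would ideally give signals 2, 1, 1, 2
# on successive T, A, C, G flows:
print(call_bases([2.1, 0.9, 1.0, 1.9]))  # TTACGG
```

The rounding step also illustrates the homopolymer problem mentioned above: the relative difference between the signals for, say, seven versus eight identical bases is small, so noise easily shifts the rounded call.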
The Ion Torrent machine and sequencing chips are relatively inexpensive compared to Illumina and PacBio, and this made the platform popular in smaller labs without access to a high-throughput sequencing core facility, though its use is no longer as common.
The sequencing platform that is ultimately chosen by a scientist depends on a number of factors. This can include (but is not limited to) the scientific question being considered, the quality of target DNA (see Chapter 1 DNA from plant tissue), costs, as well as in-house expertise and/or availability of existing platforms. In all cases, however, the quality and sequencing depth of target DNA should be considered. For DNA that is primarily expected to exist in shorter sequences (i.e., samples that are expected to be degraded from herbarium or ancient sources), then technologies requiring long reads are often not necessary, and Illumina sequencing or Ion Torrent technologies may be sufficient. If however one wishes to avoid any PCR bias or acquire long reads, then using PacBio or Nanopore is advisable. Finally, it may even be useful to use two different types of sequencing to overcome each technology’s respective limitations. For example, in whole genome sequencing, hybrid methods combining Illumina with PacBio are commonly used to ensure long reads and high accuracy.
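The considerations above can be caricatured as a small decision helper (illustrative only: the categories and return strings are simplifications of the text, and real choices also weigh cost, in-house expertise, and platform availability):

```python
def recommend_platform(degraded=False, need_long_reads=False,
                       avoid_pcr_bias=False, whole_genome=False):
    """Simplified sketch of the platform-choice reasoning in the text."""
    if whole_genome:
        return "Illumina + PacBio hybrid"   # long reads plus high accuracy
    if need_long_reads or avoid_pcr_bias:
        return "PacBio or Nanopore"
    if degraded:
        return "Illumina (or Ion Torrent)"  # short reads suffice
    return "Illumina"

print(recommend_platform(degraded=True))  # Illumina (or Ion Torrent)
```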
Sequencing platform choices for different experimental questions and sample types.
Experiment or sample considerations | Recommended method(s) | Comments |
---|---|---|
Whole genome or organellar sequencing project (genome skimming, genome resequencing, de novo genome assembly) | Illumina, PacBio, or a combination of both | Illumina is the method of choice for high-throughput short-read resequencing projects due to its high read accuracy |
Barcoding | Sanger sequencing or PacBio CCS | Larger projects are moving to PacBio CCS to reduce costs. Multiplexing very large numbers of samples is necessary to optimise costs |
Metabarcoding/target capture | Illumina, MGI DNBSEQ, or Ion Torrent | PacBio and/or Nanopore may also be considered if the sequence is expected to be highly repetitive |
Heavily degraded samples (i.e., herbarium or ancient DNA samples) | Illumina (or Ion Torrent) | PacBio may also be relevant for the study of chemical modifications often found in ancient DNA samples, or if dealing with hard-to-phase sequences |
On-site sequencing | Nanopore (MinION); Hi-C/3C-Seq/Capture-C (Illumina) | |
In the last decades, developments in sequencing platforms have primarily focused on increasing the throughput and accuracy of sequencing output, increasing the length of reads, and reducing costs. We can expect the field to continue developing in this direction, with a particular focus on the miniaturisation of these platforms for more on-site work, as well as better automation and integration of analytical software and data analysis pipelines. In particular, miniaturisation and automation of data analysis can be expected to have major impacts in regulatory fields related to both food safety and trade, where the ability for non-specialists to rapidly test on-site for the presence/absence of species will be extremely useful (see Chapter 22 Healthcare and Chapter 23 Food safety). Further development of HTS technologies for use at the single-cell level and in functional studies can also be expected.
Allopolyploid hybrids – A polyploid species with multiple sets of chromosomes that originate from different species. If the hybrid is derived from two diploid species, the resulting tetraploid can be fertile. These allopolyploid hybrids may be at least partially reproductively isolated from the parent species from which they are derived, and allopolyploid speciation is the best known route to hybrid speciation in plants.
Bridge amplification – A method used in Illumina sequencing to create DNA clusters with 1000s of double-stranded copies of the target DNA in flow cells. After amplification and generation of these clusters is complete, the reverse strand is washed away and sequencing by synthesis takes place.
Capillary gel electrophoresis (CGE) – An analytical method for the separation of charged molecules. DNA is separated according to size with this technique, with only nanogram quantities necessary for the input. Single-base pair resolution can be achieved on fragments up to several hundred base pairs in length.
Circular consensus sequencing (CCS) – Developed by PacBio and also known as HiFi reads, involves the circularisation of a target DNA strand by ligating the ends of the strand (called a SMRTbell). This SMRTbell can be read multiple times by a DNA polymerase, dramatically reducing the error rate in the generated sequence.
Electrolytic solution – An electrically conductive solution. This conductivity is often due to the presence of ions in solution (for example dissociated Na+ and Cl- ions), though non-ionic solutions can also be conductive.
Epigenetic modifications – Alterations in gene expression and cellular function without changes to the original DNA sequence. Three mechanisms for epigenetic modifications so far identified include DNA methylation, histone modification, and non-coding RNA (ncRNA)-associated gene silencing.
Insulating solid – A solid material that an electric current cannot pass through.
Ion-sensitive field-effect transistor (ISFET) – A field-effect transistor that can measure ion concentrations in solution. Changes in H+ concentration alter the solution pH, which in turn changes the detected current. This technology is used in Ion Torrent sequencing platforms to identify when a nucleotide is added to a growing DNA strand and is the basis for identifying the target DNA sequence.
Phred quality scores – Scores that measure the confidence of the nucleobase identifications generated from DNA sequencing methods. A score Q corresponds to a base-calling error probability P via Q = −10 log₁₀(P), so Q20 and Q30 indicate error rates of 1% and 0.1%, respectively. They are widely accepted for assessing the quality of reads.
Rolling circle amplification (RCA) – A method in which a linear single-stranded DNA molecule is first circularised and then copied multiple times as a single continuous sequence (
Single-molecule real-time sequencing (SMRT) – A term coined by PacBio to describe their sequencing technologies. In contrast to second generation sequencing methods, SMRT technologies possess single-molecule sensitivity and provide the sequence readout in real time, dramatically increasing the sensitivity and turnaround times for DNA sequencing.
Zero mode waveguide (ZMW) – Nanosized wells that can be etched into different materials, with attoliter (10⁻²¹ L) volumes. ZMW technology differentiates a fluorescent molecule that is floating in solution from a fluorescently-labelled nucleotide that is located at the bottom of the well. This technology is used by PacBio for the single-molecule detection of fluorescently-labelled nucleotides that are added to immobilised DNA at the bottom of these wells so that nucleotide incorporation can be detected in real time.
The method of identifying living organisms to species level using DNA sequences has been coined DNA barcoding (
DNA-based typing for species identification focused first on microbial organisms (
In recent years, the barcoding movement has grown substantially, and worldwide efforts coordinated by the Consortium for the Barcode of Life (CBOL) are now being focused on barcoding all organisms (
Species delimitation is a central tenet of taxonomy (see Chapter 17 Species delimitation). Traditionally, species were identified, described and classified based mainly on their morphological characters. This is more difficult when it comes to cryptic, hybridising or highly convergent species (
Most species concepts agree on species being evolving metapopulations (de Queiroz 2007), and this implies that genetic variation exists both within and between species. Advanced approaches using many accessions as well as many loci, such as species delimitation based on multispecies coalescent theory, can enhance species identification resolution. However, more data also adds new challenges, and inferred structure due to population-level processes and that due to species boundaries are hard to distinguish (
To identify an unknown DNA barcode using a reference library, one can use several approaches to look at the interrelatedness of the samples (see Chapter 18 Sequence to species). Many databases including GenBank and BOLD (
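As a minimal sketch of this identification step (the species names and sequences below are invented, and real tools such as BLAST use proper alignments rather than this naive position-by-position comparison), a query barcode can be assigned to the reference with the highest pairwise identity:

```python
# Naive similarity-based identification: assign a query barcode to the
# reference sequence with the highest fraction of identical positions.

references = {
    "Species A": "ATGTCACCACAAACAGAGACTAAAGC",
    "Species B": "ATGTCACCACAAACAGAGACTAAATT",
}

def identity(a, b):
    """Fraction of identical positions between two sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def best_match(query, refs):
    return max(refs, key=lambda name: identity(query, refs[name]))

print(best_match("ATGTCACCACAAACAGAGACTAAAGC", references))  # Species A
```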
The mitochondrial genome in plants evolves far too slowly to allow it to distinguish between species (
The plastid marker rbcL has, for example, proven useful for inferring relationships between angiosperm families (
The nuclear ribosomal marker ITS, and specifically nrITS2, is used commonly in barcoding and metabarcoding studies (China Plant BOL Group et al. 2011;
The strict requirements for both universality and high variability for potential universal barcodes has led some to label DNA barcoding a “search for the Holy Grail” (
The core plant DNA barcoding markers are rbcL and matK (CBOL Plant Working Group 2009). nrITS (or nrITS2 only) is the third most commonly used barcode (China Plant BOL Group et al. 2011;
Barcode | Primer | Sequence (5’-3’) | Dir. | Reference |
---|---|---|---|---|
rbcLa | rbcLa_f | ATGTCACCACAAACAGAGACTAAAGC | F | |
rbcLa | rbcLa_rev | GTAAAATCAAGTCCACCRCG | R | |
matK | matk-3F | CGTACAGTACTTTTGTGTTTACGAG | F | CBOL Plant Working Group (2009) |
matK | matk-1R | ACCCAGTCCATCTGGAAATCTTGGTTC | R | CBOL Plant Working Group (2009) |
nrITS | ITS5a | CCTTATCATTTAGAGGAAGGAG | F | Wurdack in |
nrITS | ITS4 | TCCTCCGCTTATTGATATGC | R | |
nrITS2 | S2F | ATGCGATACTTGGTGTGAAT | F | |
nrITS2 | S3R | GACGCTTCTCCAGACTACAAT | R | |
trnL P6 | trnL-g | GGGCAATCCTGAGCCAA | F | |
trnL P6 | trnL-h | CCATTGAGTCTCTGCACCTATC | R | |
psbA-trnH | psbA | GTTATGCATGAACGTAATGCTC | F | |
psbA-trnH | trnH | CGCGCATGGTGGATTCACAATCC | R | |
When choosing appropriate markers for a plant DNA barcoding study it is important to consider the following questions:
What is the necessary taxonomic level of identification? For composition studies of a flora or vegetation, genus-level identifications are often sufficient. Species-level identification can however be important for other questions. Identifying all angiosperms in Greenland is more straightforward than in a Neotropical rainforest. Also, although family-level identifications in Greenland provide useful insights into the local flora, this information most often does not have meaningful applications in rainforests. After deciding on the appropriate level of identification, the researcher then needs to determine whether multiple markers are necessary to ensure that all species can be distinguished.
What kind of a reference library will you use to identify the target barcodes? Query identification in a database that contains all plants is more challenging than with a tailored reference library. For example, identifying a sequence of Oxalis (Oxalidaceae) is easy in a database of Scandinavian plant sequences because there is only a single native Oxalis species. Any queried Oxalis sequence would match the Scandinavian Oxalis acetosella because it would be the only reference Oxalis sequence in a local database. In contrast, a database with South American Oxalis species has hundreds of taxa, and identification requires a marker with sufficient variation to discriminate between these species. Thus, for Scandinavia, one could use a marker with limited variation but universal primers, whereas for South America a specific marker or markers should be sought that can distinguish all Oxalis species present in a global database. It is therefore critical to pick your marker(s) based on the expected diversity in your reference library.
What is your source of reference sequences? If you want to identify species, which is common in studies aiming to authenticate herbal drugs and supplements, you need to include all putative species in your reference library. For example, if your goal is to identify a European wild collected Hypericum, your reference library should ideally include all European Hypericum species that could be confused or substituted for Hypericum perforatum. A reference library can be compiled from de novo sequenced amplicons from voucher accessions or from reference sequences mined from public repositories.
Figure: DNA barcoding of plants encompasses two streams of data from organism to DNA, one for the query sequence that should be identified and one for the reference sequence that is part of the reference library for identification. DNA source, marker choice, primer choice, sequencing approach and identification strategy all influence the ability and resolution of identification.
After choosing one or several markers, it is important to consider the following:
Are universal primers available? If yes, this facilitates your project. However, are these primers really universal? Check this by seeing whether the study publishing the primers is cited by relevant studies, and look for larger studies and reviews that might provide more information about (1) amplification success with these primers; (2) the ability to amplify from degraded or poor DNA extracts, a common challenge when working with older herbarium vouchers or processed herbal products; and (3) the need to tweak amplification protocols to make these primers work. If no universal primers are available, try to find studies using this marker and see which primers were used, to find suitable primers that you can then test. If possible, use studies targeting the same order, family or genus. If there are no previously published primers for your marker, then it is necessary to design your own. If your primers target a widely used marker, primer performance can be assessed reliably by matching the novel primers against multiple sequence alignments of published data (in-silico testing). If only genomic data is available, however, the accuracy of in-silico testing will depend strongly on how closely related the reference genomes are.
Do the primers amplify the right part of the marker? Primers can target fragments of longer loci, i.e., parts of rbcL, matK, or nrITS. It is thus important that the segment the primers amplify is useful for your study: it should generate sequences that are identifiable in your reference library and variable enough for your intended level of identification. For example, targeting the trnL intron with the universal g-h primers will yield short amplicons, and these have less variation than the entire trnL-F region. Make sure you reassess your marker choice after selecting suitable primers.
How many primers per marker will you use? Long markers can be hard to amplify from degraded templates and can be split up into multiple primer pairs. Degraded DNA is a common challenge when working with older herbarium vouchers or processed herbal products. Different combinations of forward and reverse primers can also increase the chance of successful amplification, as having multiple primers increases the chance that one of them fits the organism being tested well. However, the primer pair with the best fit and the shorter target will amplify more effectively than other pairs or longer fragments, which can lead to amplification bias.
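The in-silico primer testing mentioned above can be sketched as counting mismatches between a primer and a candidate binding site taken from an alignment of published sequences (a simplification: only a subset of IUPAC degenerate codes is included, and primer-template orientation and alignment gaps are ignored). The rbcLa_rev primer from the table above ends in a degenerate R, which covers both A and G:

```python
# Count positions where the template base is not covered by the
# (possibly degenerate) primer base.

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def mismatches(primer, site):
    return sum(t not in IUPAC.get(p, "") for p, t in zip(primer, site))

print(mismatches("GTAAAATCAAGTCCACCRCG", "GTAAAATCAAGTCCACCACG"))  # 0
print(mismatches("GTAAAATCAAGTCCACCRCG", "GTAAAATCAAGTCCACCGCG"))  # 0
print(mismatches("GTAAAATCAAGTCCACCRCG", "GTAAAATCAAGTCCACCTCG"))  # 1
```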
Once a suitable combination of markers has been found and suitable primers or primer panels have been selected, it is important to test the primers on a sufficient number of your samples. Template DNA quality, DNA concentration, and the effects of inhibiting secondary metabolites can all influence the efficacy of the PCR and might require optimization to obtain the best possible results for the largest number of samples. This is beyond the scope of this book, but sufficient online resources are available to help you with optimization. In addition, there are many online discussion forums to troubleshoot PCR optimization.
The subsequent chapters in section 2 describe different sequencing platforms and approaches to obtain DNA sequences for downstream analysis, and section 3 provides an overview of applications of molecular identification of plants. Depending on whether one chooses standard DNA barcoding using Sanger sequencing, DNA metabarcoding using Ion Torrent, Illumina, or other platforms, or a variety of whole or reduced library representation genome sequencing approaches, one will need to choose different wet lab steps to create the relevant sequencing libraries. Check out the relevant chapter for your application to find out more.
matK – Plastid gene coding for maturase K. matK is one of the core plant DNA barcodes.
nrITS – The internal transcribed spacer (ITS) is a spacer situated between the small-subunit rDNA and large-subunit rDNA genes. In plants, it lies between the 18S and 26S rDNA genes. nrITS is split into two spacers, nrITS1 and nrITS2, with the 5.8S rDNA gene in between. nrITS is highly variable, and primers are designed in the conserved 18S, 5.8S, and 26S rDNA genes.
psbA-trnH – Plastid intergenic spacer region between the coding genes psbA and trnH. psbA-trnH has been advocated as a plant DNA barcoding marker.
Primer – Short DNA sequence used to amplify a marker.
rbcL – Plastid gene coding for ribulose-1,5-bisphosphate carboxylase-oxygenase. Most barcoding studies target the rbcLa region, but will refer to rbcL. rbcL is one of the core plant DNA barcodes. Plastids in plants are often incorrectly referred to as chloroplasts.
DNA metabarcoding is an approach where taxonomically informative regions in the DNA are amplified from mixed-template samples containing DNA from different taxa for identification (Pompanon et al. 2012;
Metabarcoding is based on the DNA barcoding concept (see Chapter 10 DNA barcoding). However, for metabarcoding, samples containing DNA from a mix of different taxa are typically used. One of the first studies that used metabarcoding on a parallel sequencing system (herein referred to as DNA barcoding) to identify plants was by Valentini and colleagues (
Plant metabarcoding is widely used to study the taxonomic composition of mixed template samples such as water (
Plant metabarcoding has been used in various types of applications including species delimitation (see Chapter 17 Species delimitation), archaeo- and palaeo-botany (
DNA metabarcoding is a cost-effective method as compared to metagenomics (
However, DNA metabarcoding also has its limitations, and the PCR amplification step has previously proven to be particularly problematic (
Another drawback of DNA metabarcoding is primer binding bias due to mismatches between the primer and the template DNA. This can result in discrepancies between the proportion of the original taxa in the DNA extract and the amplified DNA sequences (
Finally, the taxonomic assignment of sequences to species is heavily dependent on the DNA reference database used for sequence matching. When the reference database to which the resulting sequences are compared to is incomplete and/or consists of inaccurately identified species, this results in erroneously identified species and/or false negatives (
At the start of any (plant) metabarcoding study lies a clearly defined research question. A study design should furthermore encompass a clear sampling strategy, and identification of suitable DNA extraction techniques for the sample type used before carrying out downstream analysis (
Barcode choice is one of the most important aspects of metabarcoding studies as it will determine which taxa are identified and to what resolution. Considerable efforts have gone into constructing libraries for these plant barcodes and in assessing their limitations (CBOL Plant Working Group 2009;
In the metabarcoding laboratory workflow, unique nucleotide tags are added to amplicons, and these tags are used to assign sequences to the sample they originate from (
In the ‘one-step PCR’ approach, the metabarcoding barcode is amplified and built into libraries during one PCR. This is achieved through the use of metabarcoding primers that carry both adapters and library indexes (
In the ‘two-step PCR’ approach, sample extracts are PCR-amplified with metabarcoding primers that only carry 5’ tails. These are added to act as templates for the following second PCR and do not include any labelling. The second PCR is carried out on each PCR product with primers that carry adapters and indexes (
In the ‘tagged PCR’ approach, DNA extracts are PCR amplified with metabarcoding primers that carry 5’ unique nucleotide tags. Next, the individually 5’ tagged PCR products are pooled and library preparation is carried out on the pools (first demonstrated by (
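The demultiplexing step that follows tagging can be sketched minimally as follows (the tags, sample names, and reads are invented; real pipelines also tolerate sequencing errors in tags and filter out tag jumps):

```python
# Assign each read to a sample by its 5' nucleotide tag, then trim the tag.

tag_to_sample = {"ACGT": "sample_1", "TGCA": "sample_2"}
TAG_LEN = 4

def demultiplex(reads):
    assigned = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:TAG_LEN], read[TAG_LEN:]
        if tag in tag_to_sample:
            assigned[tag_to_sample[tag]].append(insert)
        else:
            unassigned.append(read)  # unknown tags (errors, tag jumps)
    return assigned, unassigned

assigned, unassigned = demultiplex(
    ["ACGTGGGCAATCC", "TGCAGGGCAATCC", "AAAAGGGCAATCC"])
print(assigned["sample_1"])  # ['GGGCAATCC']
print(len(unassigned))       # 1
```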
With the cost of sequencing decreasing exponentially, more effort can be put into applying technical PCR replicates to circumvent sequencing errors and other PCR-related issues. PCR replicates should be sequenced in separate locations on the same 96-well plate or, ideally, in separate plates.

Taxa identification lies at the core of any ecological research question, so it is crucial to use a reliable and reproducible identification workflow to ensure correct identification. In general, care should be taken to avoid cross-contamination between samples by working in clean laboratories with filter-tipped pipettes and separate pre- and post-PCR labs. Normalisation of the amplicons prior to library construction is crucial to avoid overamplification of the most represented taxa in the sample. Since some often-used plant-specific marker regions are very short (e.g., trnL P6 loop, 8 to 152 bp), they are prone to picking up the slightest contaminants from the environment. It is therefore recommended to work in a clean environment, e.g., an ancient DNA laboratory with protective clothing.
The preferred platforms for sequencing are currently Ion Torrent and Illumina. Both platforms require an additional post-ligation PCR step or PCR-free ligation of platform-specific adapters to the amplicons before sequencing. However, due to the different technologies behind the platforms, both the error rates and error types can differ. For Illumina (optical sequencing), a substitution error rate of 0.1% has been identified, while Ion Torrent (based on detection of hydrogen ions) can show up to 1% indel errors (
Several different bioinformatic tools can be used to analyse the sequence output. Some commonly used packages are OBITools (
Currently, metabarcoding is the dominant technique used in the identification of plants from mixed samples. Developments and improvements in addressing methodological challenges such as PCR bias may one day allow for unbiased quantitative inferences from metabarcoding datasets. This would be a huge step forward for the metabarcoding community since it is still controversial to use read counts as an indication for biomass (
Metabarcoding could potentially be used to determine plant composition in a landscape from bulk arthropod samples. Bulk arthropod samples have been used for biodiversity monitoring of vertebrates (
Adapters – Specific nucleotide sequences unique to different types of sequencing platforms that are added to amplicon libraries to allow for the attachment of library fragments to the flow cell for sequencing.
Amplicons – Products of PCR amplification.
ASVs – Amplicon sequence variants, also known as exact sequence variants or zero-radius OTUs. Although sometimes considered synonymous with OTUs, they correspond to all the unique reads in a dataset and do not require the clustering used in creating OTUs.
Barcode – Targeted gene region, see Locus.
Demultiplexing – Bioinformatics step of assigning sequences to samples based on assigned nucleotide tags and/or library indexes.
Epilithic – Plants growing on the surfaces of rocks, e.g., seaweeds.
Homopolymers – A tandem run of a single repeated nucleotide, usually more than 7 bases long.
Indel errors – Insertions or deletions in sequences resulting from mutations.
ITS – The internal transcribed spacer is a nuclear ribosomal region found between the small subunit ribosomal RNA (rRNA) and large-subunit rRNA genes.
Library indexes – Nucleotide index added to amplicon libraries to allow for the parallel sequencing of multiple libraries, which can be used bioinformatically to assign reads to the correct amplicon libraries.
Locus – Section and position in a chromosome where a particular DNA sequence is located. It can also be referred to as a barcode.
Macrofossils – Preserved plant remains large enough to be seen without a microscope.
matK – Maturase K is a gene found in the chloroplast genome.
Meta-phylogeography – Study of phylogeographic features and intraspecies variation.
Multiplexing – Parallel amplification of barcodes in one PCR reaction.
OTU – Operational taxonomic unit. The term is used to categorise clusters of similar sequences.
Overhangs – Stretch of unpaired nucleotides at the end of DNA fragments.
PCR – Polymerase chain reaction.
PCR stochasticity – Uneven amplification of molecules during PCR that can be a result of some sequences being present in lower copy numbers than others.
Phylogeography – The study of the origin of genetic variation within closely related species across a landscape.
Primers – A short single-stranded nucleic acid sequence that serves as a starting point for the DNA replication in the PCR.
Primer set – A pair of primers (see above) complementary to the 5’ and 3’ flanking regions of a locus.
Primer bias – Differences in DNA amplification due to a primer inefficiently binding to the target template. This can result from sequence divergence in the primer binding sites.
qPCR – Quantitative polymerase chain reaction, a PCR variant used for quantifying DNA.
rbcL – The ribulose-1,5-bisphosphate carboxylase large subunit gene is found in the chloroplast genome.
Singletons – Sequences present in only one copy in a dataset.
Nucleotide tags – Short nucleotide sequences added at the 5’ end of the primer in metabarcoding studies.
Tag jumps – Generation of amplicons with different tags than originally used, resulting in false positives in the data. For more detail see (
Taxa – Plural of taxon. A taxon is a group of organisms that form a taxonomic group.
Taxonomic assignment – Matching the obtained sequences to taxa names.
trnH-psbA – An intergenic spacer region found in the chloroplast genome.
trnL – The trnL gene is part of the trnL-F region of the chloroplast genome.
Metagenomics is the study of genetic material recovered directly from environmental samples such as air, water, soil, or sediments (
The term ‘metagenome’ was first coined in 1998 by Handelsman et al. (
With the development of high-throughput sequencing (HTS) technologies, the need for cloning to increase the amount of starting material was eliminated. An early study recovered the first near-complete genomes of five dominant members of a natural acidophilic biofilm using an insert plasmid library and shotgun sequencing (
The immense amount of data collected by these methods introduced challenges in data analysis, resulting in several innovations in comparative metagenomics such as clustering orthologs (
Similar to metabarcoding, substrates that can be used for metagenomics in plant identification include environmental samples, fragmented template materials (e.g., dental calculus and faeces) (
Several promising applications exist for plant-related metagenomics as compared to conventional targeted genomic approaches. Dietary studies are one such application. While dietary studies have been revolutionised by conventional metabarcoding (see Chapter 11 Amplicon metabarcoding;
Similar to metabarcoding (see Chapter 11 Amplicon metabarcoding), metagenomics can potentially be used to reconstruct plant compositions from bulk arthropods samples, and to solve crimes in forensic genetics (see Chapter 26 Forensic genetics, botany, and palynology), especially by uncovering taxa that are not normally amplified in metabarcoding studies. It can also potentially be applied to plant resources for the retrieval of plant population genetic information from mixed templates (which has already been shown in mammals;
Metagenomics is an untargeted method that captures all genetic material in a sample, which is advantageous over targeted methods as no prior knowledge of the taxa and their genes is required (
Metagenomics does, however, come with some disadvantages that need to be considered. The main downside is the taxonomic inefficiency of the method. Sequenced material can originate from any part of the genome, but full nuclear genome references for most species are currently lacking. Thus, only a small proportion of species can currently be identified (
DNA fragmentation is an essential step in the metagenomic workflow, and the size of the DNA fragments required depends on the sequencing platform used. Broadly speaking, there are two methods for DNA fragmentation to obtain size-controlled DNA fragments: enzyme-based and mechanical. Each method has its associated advantages and disadvantages (
Library preparation is another important step in the metagenomics workflow as it can affect the results of the sequencing output. The addition of adapters to the ends of DNA fragments lets them bind to the sequencing flow cell, which allows for the identification of the reads (
DNA sequencing has gradually shifted from Sanger to HTS technologies in the last decades. These new sequencing technologies can provide much higher yields of reads at a much lower cost (see Chapter 9 Sequencing platforms and data types). Initially, 454/Roche pyrosequencing (discontinued) was the most widely used platform (
Short reads are bioinformatically challenging for metagenomic assembly because genes and chromosomal regions can be difficult to span, especially if they are long or composed of repetitive elements. Certain protocols have been developed to overcome such challenges (e.g., assembly after binning and taxonomic assignment), but long-read sequencing technologies offer excellent alternatives for metagenomics. PacBio and Oxford Nanopore technologies offer longer read lengths but can be accompanied by higher error rates and higher costs. In contrast to the other platforms which introduce inherent systematic errors (e.g., homopolymer regions, index hopping), errors in these platforms are mostly random, which might be overcome with technological improvements (
The exact number of reads required to effectively characterise a sample using metagenomics is highly variable, and no universal recommendation can be given. In principle, the total number of species in the sample, their genome sizes, and the relative abundance of each species should be known to make such an estimation. As a rule of thumb, it is suggested to maximise the output to capture as many reads as possible from the rare members of the community (
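As a back-of-the-envelope illustration of this estimation, the sketch below computes how many total reads are needed before the rarest community member still reaches a target coverage. The function name `reads_needed` and all genome sizes, abundances, and read lengths are hypothetical values chosen for illustration, not recommendations.

```python
# Back-of-the-envelope estimate of total reads needed so that the rarest
# community member still reaches a target coverage. All numbers below are
# hypothetical illustrations.

def reads_needed(genome_sizes_bp, abundances, read_len_bp, target_coverage):
    """Total reads so that every member reaches target_coverage.

    genome_sizes_bp: genome size per species (bp)
    abundances: relative DNA abundance per species (sums to 1)
    """
    totals = []
    for size, frac in zip(genome_sizes_bp, abundances):
        reads_for_species = target_coverage * size / read_len_bp
        totals.append(reads_for_species / frac)  # scale by its share of the pool
    return max(totals)  # the rarest/largest member dictates the run size

# Three hypothetical species: the third is rare (1% of the DNA pool)
total = reads_needed([150e6, 300e6, 200e6], [0.60, 0.39, 0.01],
                     read_len_bp=150, target_coverage=10)
print(f"{total:.2e} reads")  # dominated by the 1% member
```

Note how the rare member dominates: reaching 10x coverage on a species contributing only 1% of the DNA requires roughly 100x more sequencing than its genome alone would need.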
There are currently two main strategies to identify the contents of a metagenomic sample: identification of individual reads by alignment to a reference, or assembly of the reads into longer contigs prior to identification.
The most straightforward method for identification is by aligning the reads to a known reference dataset. BLAST and related tools such as MegaBLAST (
Two alternative approaches aim to speed up the identification of metagenomic datasets. These either use more compressed reference databases in combination with more efficient aligners or rely on exact alignments of k-mers between the reads and the reference (
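The exact k-mer matching approach can be illustrated with a minimal sketch of the idea behind tools in this class (such as Kraken): an index maps each reference k-mer to a taxon, and a read is classified by majority vote over its taxon-specific k-mers. The reference sequences, read, and k-mer length below are toy values; real tools assign shared k-mers to the last common ancestor (LCA) rather than discarding them, and use far larger k.

```python
# Minimal sketch of k-mer-based read classification: build an index mapping
# every reference k-mer to a taxon, then classify a read by majority vote
# over its taxon-specific k-mers. All sequences are toy examples.

from collections import Counter

K = 5

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(references):
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq):
            # k-mers shared between taxa are ambiguous; a real tool would
            # assign them to the last common ancestor (LCA) instead
            index.setdefault(km, set()).add(taxon)
    return index

def classify(read, index):
    votes = Counter()
    for km in kmers(read):
        hits = index.get(km, set())
        if len(hits) == 1:            # count only taxon-specific k-mers
            votes[next(iter(hits))] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

refs = {"taxonA": "ATGGCGTACGTTAGC", "taxonB": "TTACGGATCCGATTA"}
idx = build_index(refs)
print(classify("GCGTACGTT", idx))  # read drawn from taxonA
```

Because lookups are exact dictionary queries rather than alignments, this scales to millions of reads far faster than BLAST-style searches, at the cost of sensitivity to sequencing errors within a k-mer.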
Assembly methods attempt to generate longer contigs before downstream analysis. These longer contigs can be used for gene identifications (
Each bioinformatic strategy has its pros and cons, and the decision about which strategy to use depends on the starting material available as well as the research questions to be answered. The alignment method works well when there is ample reference material available for the taxa of interest, when working with older and more fragmented material, or when the target taxa are sparse in a sample. The assembly method, on the other hand, performs best when there is abundant material available, which is often not the case for environmental datasets.
As sequencing costs continue to decrease, bioinformatics pipelines are optimised, and more comprehensive DNA reference libraries become available (
Basic Local Alignment Search Tool (BLAST) – An alignment tool commonly used in conjunction with the NCBI nucleotide reference database for sequence identifications. Different BLAST versions exist for nucleotide or protein alignments.
Binning – Clustering sequences based on their nucleotide composition or similarity to a reference database.
Burrows-Wheeler transform – Data transformation algorithm to make transformed data more compressible.
Community genetics – Study of genetic interactions between species and their environment in complex communities.
Contig – A longer DNA sequence assembled from shorter overlapping reads.
Coverage – The mean number of times a nucleotide is sequenced in a genome.
De Bruijn graphs – A popular method for the de novo assembly of contigs. The graph is built up out of k-mers that overlap, which can be solved to construct contigs.
De novo assembly – The assembly of contigs or genomes from sequenced data without the aid of a reference.
DNA fragmentation – Separating or breaking DNA molecules into smaller fragments.
DNA libraries – Collections of DNA fragments with specific sequencing-platform adapters ligated to both ends.
Ecogenomics – Study of the influence of environmental factors on the genome.
Environmental genomics – Prediction of organism responses at the genetic level.
FM-index – A compressed data structure for full-text pattern searching based on the Burrows-Wheeler transform.
Functional metagenomics – Study of gene functions from DNA extracted from mixed communities.
Hydrodynamic shearing – Fragmentation of DNA molecules by forcing them through a small tube or small gauge needle at high velocity.
K-mer – A short subsequence of length k that is generated from longer sequencing reads. The shorter k-mers allow for faster alignments and assemblies.
Last Common Ancestor (LCA) – A point on the tree of life from which a set of taxa are descended.
MegaBLAST – A faster, though less accurate, version of the BLAST tool.
Metagenome – All genetic material found in an environmental sample. It contains the genomes of many different organisms.
Nebulisation – Process of breaking DNA molecules into small fragments by passing DNA solution into a nebuliser unit, resulting in a fine mist that is collected.
Orthologs – Genes in different species that evolved from a common ancestral gene.
Paired-end sequencing – Sequencing of a DNA fragment from both ends. Both sequences can either be merged into a single larger fragment, if overlap is present, or kept separate.
Read – A DNA sequence generated by a sequencer.
Shotgun sequencing – A technique that randomly fragments DNA and then reassembles the fragments by searching for overlapping regions.
Sonication – Application of sound energy to break up DNA strands into smaller fragments.
Accurate species identification is fundamental for correct assessment of species diversity and for studying the functioning of their communities and ecosystems. Additional applications include the use of species identification in food product authentication and for diagnosing diseases. Species identification can be carried out using morphological, (bio)chemical, or molecular traits. Pertaining to the molecular-based approaches, both PCR and post-PCR analyses have been extensively used for species identification and genotyping. The most widely used PCR-based method is DNA barcoding, which is able to provide species-level identifications using the sequences of short standard DNA regions (
HRM data analysis is straightforward and does not require advanced bioinformatics skills, in contrast to other genetic analyses used for species identification. Furthermore, HRM is a cost-effective and high-throughput methodology. Due to its ability to discriminate between samples at the resolution of a single nucleotide (allowing for single nucleotide polymorphism (SNP) identification), HRM is commonly used for genotyping, mutation scanning, and DNA methylation analyses. The HRM analysis of DNA barcoding regions, e.g., ITS2, matK, trnL (see Chapter 10 DNA barcoding), is called Bar-HRM: Barcoding - High Resolution Melting analysis. It has been successfully introduced by
PCR amplification of the genetic region of interest is a prerequisite for HRM analysis and is done in the presence of a fluorescent dye that binds dsDNA. Such dyes intercalate into the dsDNA that is produced during a PCR reaction, without affecting PCR efficiency. Asymmetric cyanine dyes fluoresce strongly in the presence of dsDNA and are characterised by low intensity fluorescence in the unbound state (
The rate of dsDNA PCR product dissociation, and thus the shape of the HRM curve, depends on (1) the sequence itself and its length, (2) the GC content, (3) the complementarity, and (4) the nearest-neighbour thermodynamics of the amplicon (
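The influence of amplicon length and GC content (factors 1 and 2 above) can be illustrated with a simple approximation. HRM software relies on full nearest-neighbour thermodynamics; the sketch below instead uses only the basic Marmur-Doty-style formula Tm ≈ 64.9 + 41 × (GC count − 16.4) / length, with invented toy amplicons, to show how a single substitution shifts the predicted Tm.

```python
# Illustrative only: real HRM analysis uses full nearest-neighbour
# thermodynamics. This sketch shows the cruder length/GC dependence with a
# Marmur-Doty-style approximation on two invented toy amplicons.

def approx_tm(seq):
    """Approximate melting temperature (degC) for amplicons > 13 bp."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)

# A single G->A substitution lowers the estimated Tm of this toy amplicon
wild_type = "ATGCGGCTAGCTAGGCTACGATCGGCTAAGCT"
mutant    = "ATGCAGCTAGCTAGGCTACGATCGGCTAAGCT"
print(round(approx_tm(wild_type), 2), round(approx_tm(mutant), 2))
```

Even this crude model reproduces the Bar-HRM principle: sequence variants of the same locus melt at measurably different temperatures.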
An example of standard output curves of an HRM analysis. In both A and B, the three phases of the DNA melting profile are shown. (1) The pre-melting phase is characterised by an initial fluorescence given in relative fluorescence units (RFUs). Here all PCR products are double-stranded and the maximum amount of dye is bound. (2) In the active melting phase, the inflection point (i.e., Tm) is where 50% of the PCR product in the sample is denatured. (3) In the post-melting phase, as the temperature increases further, the PCR products denature completely, dye is released, and the fluorescent signal drops and plateaus. A. A normalised melting curve with indication of the inflection point. B. A derivative curve, which shows the inflection point on the slope as a melting peak.
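The derivative curve in panel B can be reproduced in outline: from raw fluorescence readings, the melting peak of −dF/dT marks the Tm. The sigmoidal melt profile below is synthetic stand-in data, not instrument output.

```python
# Sketch of how a derivative melting curve is obtained from raw fluorescence
# readings: the peak of -dF/dT marks the Tm. The sigmoid below is synthetic
# stand-in data, not real instrument output.

import math

true_tm = 82.0
temps = [70 + 0.1 * i for i in range(201)]            # 70-90 degC in 0.1 steps
# Synthetic melt profile: high plateau -> inflection at Tm -> low plateau
fluor = [1000.0 / (1.0 + math.exp((t - true_tm) / 0.8)) for t in temps]

# Negative first derivative via central differences
deriv = [-(fluor[i + 1] - fluor[i - 1]) / (temps[i + 1] - temps[i - 1])
         for i in range(1, len(temps) - 1)]

tm_estimate = temps[1:-1][deriv.index(max(deriv))]    # temperature at the peak
print(round(tm_estimate, 1))
```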
Identification of species by Bar-HRM relies on small genetic differences between DNA sequences, which will result in different melting curves. However, small differences between melting curve profiles may also arise from sources other than the DNA sequences, thus assay optimisation is a prerequisite for a successful Bar-HRM analysis. An example of a Bar-HRM workflow can be found in the infographic. Factors that could influence the outcome of an HRM analysis include genomic DNA quality, DNA extraction impurities, amplicon length, primer design, dye selection, PCR reagent choice (see Chapter 1 DNA from plant tissue), and the choice of Bar-HRM instruments and software (
The major factor associated with DNA quality is salt carryover, as this will change the thermodynamics of the DNA melting process. This could lead to lower reproducibility and higher error rates in the Bar-HRM results. A solution is to precipitate and resuspend the DNA extract in a buffer with a low salt concentration such as TE (10 mM Tris, 1 mM EDTA) prior to the PCR (
Amplicons up to 300 bp are generally preferable for Bar-HRM analysis since they are more suitable for the detection of DNA mutations such as SNPs, inversions, insertions, and deletions. The larger the fragment, the more likely it is to contain additional mutation sites that may complicate the discrimination between different sequence variants. On the other hand, amplicons that are too small (< 50 bp) may produce lower fluorescence signals, due to lower amounts of dye being incorporated into the PCR product (
The intercalating dyes used in the Bar-HRM analysis bind generically to any double-stranded DNA product. It is therefore important to design robust PCR primers that are specific to the region of interest, to ensure that this is the only region amplified in the PCR product. Each developed primer pair should be tested for specificity to the region of interest and should not produce any primer-dimers or non-specific products. PCR products from the developed primer pair should be assessed by gel electrophoresis, as the HRM software may not be able to detect all non-specific reaction products if their melting curves are similar (
HRM uses dsDNA-binding fluorescent dyes that do not interfere with the PCR reaction. The so-called “release-on-demand” dyes are preferred for HRM as they do not inhibit DNA polymerases or alter the Tm of the product (
Reagents for HRM analysis and reaction conditions should be optimised to reduce amplification biases as much as possible. Primer dimers and other non-specific products can significantly decrease the performance of the HRM analysis. So, in addition to optimising reactions, one must ensure that variation is not introduced by poor assay design or optimisation decisions (see Chapter 1 DNA from plant tissue;
HRM analysis requires a PCR thermal cycler and an instrument with optics capable of detecting fluorescence. This can either be a rotary design in which samples spin past an optical detector or a block-based instrument in which samples are read by a scanning head or stationary camera. This instrument should be coupled with a computer with appropriate HRM analysis software capable of handling the large amounts of data generated during the analysis. A good HRM software package should provide a view of the raw fluorescence data points and a process to both align the data and view melting curve differences between samples (
The chemical improvement of “release-on-demand” DNA dyes and the increased precision of instrumentation have widely expanded the use of Bar-HRM for genotyping (
Since the first description of the HRM methodology in 2003, it has been increasingly used as a research tool (
Medicinal plants and plant-based food products are often processed and lack the essential parts necessary for morphological identification when sold on the herbal market. In addition, the herbal market is highly competitive and lacks standardised methods for quality assessment. This has contributed to increasing problems with product adulteration and substitution. Numerous studies reported the substitution of costly ingredients in herbal products with plant material of inferior quality or unlabelled plant fillers (
Bar-HRM has been used for the identification and quantification of the ingredients in plant and animal food products, including Protected Designation of Origin (PDO) products. Olive oil for instance, which is one of the most adulterated vegetable oils on the market, has been successfully authenticated with Bar-HRM (
Apart from species identification, Bar-HRM can also be used for species quantification, which is also important for quality control, especially for quantifying adulterants in food or other processed products. Serial dilutions of a DNA sample mixed with adulterant DNA are made, corresponding to different known adulterant content percentages. These artificial serial admixtures are then used to create reference curves that can be used to quantify samples of unknown content (Lagiotis et al. 2020;
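This reference-curve quantification can be sketched as a simple least-squares calibration: melt-curve readouts from artificial admixtures of known adulterant content are fitted with a line, and an unknown sample is read off the fit. All admixture percentages and readout values below are invented for illustration.

```python
# Sketch of quantification via a calibration (reference) curve. Readouts
# from artificial admixtures of known adulterant content are fitted by
# least squares; an unknown is then read off the fitted line.
# All numbers are invented for illustration.

known_pct = [0.0, 5.0, 10.0, 20.0, 40.0]          # % adulterant in admixture
readout   = [0.02, 0.11, 0.21, 0.43, 0.82]        # hypothetical melt-curve metric

n = len(known_pct)
mean_x = sum(known_pct) / n
mean_y = sum(readout) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(known_pct, readout))
         / sum((x - mean_x) ** 2 for x in known_pct))
intercept = mean_y - slope * mean_x

unknown_readout = 0.30                             # measured on the unknown sample
estimated_pct = (unknown_readout - intercept) / slope
print(f"estimated adulterant content: {estimated_pct:.1f}%")
```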
Bar-HRM technology can provide taxonomic identification of plant taxa, the tracking of a wide range of raw and processed herbal products, and the detection of adulterants and poisonous contaminants in food products. As the precision of “release-on-demand” dyes and HRM instruments further increases, and melting curve reference databases are developed, we can expect Bar-HRM to be implemented as a routine analytical tool for species identification and authentication. Finally, the successful application of Bar-HRM as a tool for quality control in the food industry makes it suitable for use in a regulatory framework by the corresponding authorities.
Chapter 13 Infographic: Illustration of a typical HRM methodology workflow. The initial substrate can be a single or multi-ingredient product from raw or processed plant material. Following DNA extraction, taxon-specific primers are developed based on DNA barcodes or other molecular markers and tested in silico. An intercalating fluorescent dye is added to the PCR reaction, which allows the detection of the PCR amplicons by the HRM equipment during the melting process. The output consists of melting graphs and a statistical report including Genotype Confidence Percentages (GCPs) allowing accurate discrimination between the reference and the analysed taxa.
Bar-HRM – DNA Barcoding - High Resolution Melting analysis is an HRM analysis coupled with a barcoding region of interest (such as trnL-F, ITS, or matK for plants) that is primarily used for the identification of organisms at various taxonomic levels.
GCP – Genotype Confidence Percentage is a parameter calculated by the HRM software and represents the confidence that a sample is the same as the reference genotype, with a value of 100 indicating an exact match.
HRM analysis – High Resolution Melting analysis is a post-PCR analysis that is used to identify variations in nucleic acid sequences. The method is based on detecting small differences in PCR melting (dissociation) curves.
Nearest-neighbour method – The method is based on a model in which the thermodynamic stability of a base pair in a DNA strand is dependent on the identity of the adjacent base pairs. These thermodynamic properties can be used for predicting the melting temperature of the DNA strand.
PDO – Protected Designation of Origin is a registered designation of products that have the strongest links to their area of production and that are protected by intellectual property rights.
qPCR – Quantitative polymerase chain reaction used for quantifying DNA in real-time.
RFU – Relative Fluorescence Unit is a unit of measurement used in real-time PCR analysis, which employs fluorescence detection. The software determines the quantity or size of the fragments at each data point from the level of fluorescence intensity. Samples that contain higher quantities of amplified DNA will have higher corresponding RFU values.
SSR – Simple-sequence repeats (SSR), also known as microsatellites, are short tandem repeated nucleotide motifs that may vary in the number of repeats at a given locus.
Efforts to resolve the plant tree of life have led to the replacement of traditional DNA sequencing markers (
Unique challenges exist when trying to obtain sequences from plant nuclear genomes. Plant genomes are often large (
Thankfully, many nuclear genes have been discovered through an abundance of annotated transcriptomes (
Analysing SLCN genes requires multiple reads to cover the same genomic region (high coverage) to obtain high-quality assemblies. The goal of target capture, also called bait capture or hybrid capture, is to achieve high coverage on (nuclear) target loci by proportionally increasing (enriching) the target DNA fragments in a genomic library. The workflow is straightforward (see Infographic): DNA is extracted using tissue-specific protocols (see Section 1 Design, sampling, and substrates in this book), sheared to the desirable fragment length (e.g., 300–700 bp for Illumina sequencing, depending on the quality of the extract), processed into genomic libraries using indexing techniques for multiplex sequencing, enriched for the target genomic regions using specific baits (see below), and sequenced on a platform with high sequencing accuracy (e.g., Illumina or PacBio Sequel).
Target capture uses custom-designed short RNA- or DNA-baits (usually between 80 and 120 bp long), also called probes, that hybridise in solution to target loci with complementary sequences (
Besides target capture, there are other techniques for reducing genomic complexity in DNA samples (reduced representation sequencing) depending on the target loci, taxon, and/or application (
Target capture is a robust alternative for these applications as it works independently of specific PCR primers. The small size of the RNA-baits makes this method ideal for degraded DNA and baits do not need to be an exact match for the target to be captured. The hybridisation conditions can be modified for more or less permissive binding between bait and target, and locus capture can still be successful with up to 30% bait mismatch (
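This mismatch tolerance can be illustrated with a minimal check: a bait is taken to capture a target if no more than 30% of its positions mismatch. The function and sequences are purely illustrative; real hybridisation efficiency also depends on temperature, mismatch position, and chemistry.

```python
# Purely illustrative mismatch check for the ~30% tolerance described above.
# Real hybridisation also depends on temperature, mismatch position, and
# chemistry; this only counts mismatched positions.

def can_capture(bait, target, max_mismatch_frac=0.30):
    mismatches = sum(b != t for b, t in zip(bait, target))
    return mismatches / len(bait) <= max_mismatch_frac

bait   = "ATGCATGCATGCATGCATGC"          # 20 bp toy bait
target = "ATGCATGAATGCATGAATGC"          # 2 mismatches (10%)
print(can_capture(bait, target))
```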
Nuclear sequence data from a single clade member, either a whole-genome or transcriptome, are enough to design efficient target capture baits for application across the clade. As the number of complete plant genomes (e.g., 128 species, Phytozome v.13;
Annotated whole-genome sequences are preferred for RNA-bait panel design, since: 1) target loci can be more carefully selected with the detailed gene copy number information; 2) annotated whole-genomes provide intronic and intergenic regions (sequences immediately 5’ or 3’ of the gene), which can be included in the panel design.
The advantage of using introns and intergenic regions is the inclusion of highly variable sequences that are useful for phylogenetic inference of recent diversification events (
Target capture is a cost-efficient, high-throughput, and customizable solution for plant phylogenomics and systematics (
Examples of available universal and clade-specific target capture panels. Number of exons and introns as well as total number of bases targeted are as reported in the original publications.
Taxonomic level | Number of loci | Exons/introns | Total target size (bp) | Reference |
---|---|---|---|---|
Flagellate plants and gymnosperms | 248 | 451 exons | 150,369 |
|
Ferns | 25 | Exons only | 39,134 |
|
Angiosperms | 353 | Exons only | 260,802 |
|
Order | ||||
Saxifragales | 301 | Not reported | Not reported |
|
Family | ||||
Annonaceae | Not reported | 469 exons | 364,653 |
|
Apocynaceae | 853 | Exons only | 1,545,593 |
|
Arecaceae | 4,184 | Exons only | 4,287,662 | |
Asparagaceae - Agavoideae | 2,473 | 3,709 exons | Not reported |
|
Asteraceae | 1,061 | Exons only | Not reported |
|
Bromeliaceae | 1,776 | Exons only | ± 2,300,000 |
|
Cactaceae | 120 (+ A353) | 469 exons | 136,495 |
|
Fabaceae | 507 (423 SLCN) | Exons only | 737,309 (SLCN only) |
|
Fabaceae - Detarioideae | 289 | 1,021 exons | 359,269 |
|
Fabaceae - Mimosoideae | 964 | Exons only | 1,134,513 |
|
Gesneriaceae | 830 | Exons only | 776,754 |
|
Melastomataceae | 384 (266 from A353) | Exons only | Not reported |
|
Ochnaceae | 275 | Exons only | 660,000 |
|
Orchidaceae | 963 | 6,005 exons | Not reported |
|
Salicaceae | 972 (593 SLCN) | Exons only | Not reported |
|
Sapotaceae | 1,241 | Exons only | Not reported |
|
Genus | ||||
Aloe (Asphodelaceae) | 189 | 1,029 exons | 353,794 |
|
Anacyclus (Asteraceae) | 872 | Not reported | Not reported |
|
Begonia (Begoniaceae) | 1,239 | Exons + introns | Not reported |
|
Burmeistera (Campanulaceae) | 745 | Exons only | Not reported |
|
Cyrtandra (Gesneriaceae) | 570 | Exons only | 180,784 |
|
Dioscorea (Dioscoreaceae) | 260 | Exons only | 441,626 | |
Heuchera (Saxifragaceae) | 278 | Including introns | 378,553 |
|
Hibiscus (Malvaceae) | 87 | Exons only | Not reported |
|
Hosta (Asparagaceae) | 283 | 676 exons | 171,365 |
|
Inga (Fabaceae) | 276 | 907 exons | Not reported |
|
Lens (Fabaceae) | Full exome | Exons only | 85 Mbp |
|
Silene (Caryophyllaceae) | 50 | Exons (131) + introns | 104,374 | Cangren et al., unpublished |
Rubus (Rosaceae) | 926 | 8,963 exons | Not reported |
|
Single species | ||||
Euphorbia balsamifera (Euphorbiaceae) | 431 | Exons only | 709 kbp |
|
Quercus robur (Fagaceae) | 9,748 | Including introns | 150 bp per (sub)gene |
|
Technique | Principle | Application | Reference |
---|---|---|---|
Target capture | Target enrichment using in-solution hybridisation with specifically designed baits: short oligonucleotides complementary to target loci. | Phylogenomics, population genomics |
|
RAD-Seq + target capture (Rapture) |
Using custom baits to capture selected restriction-site associated DNA (RAD) tags. | Population genomics with museum specimens |
|
Target capture + genome skimming (Hyb-Seq) | Adding an unenriched library to the enriched sample before sequencing to obtain low-coverage sequencing results from non-target nuclear regions and organellar genomes. | Phylogenomics, population genomics, comparing chloroplast and nuclear phylogeny. |
|
Target capture + allele phasing | Estimation of ploidy level based on allelic frequency and allelic ratio from the number of reads for each allele. | Estimation of ploidy level from museum specimens. |
|
Target capture + molecular identification | Using target capture to obtain high-coverage sequence data for SLCN genes to identify unknown samples of traded plants. | Trade monitoring, authentication of medicinal plants. |
|
Target capture + repetitive sequence analysis | Using off-target reads to investigate levels of DNA repetition across a taxonomic clade. | Structural evolution of genomes, repetitive DNA analysis. |
|
The effective enrichment of degraded DNA samples, wide taxonomic range, and increased availability of custom bait panels make target capture ideally suited for molecular identification of plants (
The success of molecular identification depends on a curated database of taxonomically verified reference sequences with a corresponding comprehensive phylogeny (
Once sequenced, unknown material can be identified using either genomic distance (
Analysing target capture data from mixed samples, where material from different species is combined into one sample, is complex and requires long-read sequencing and rigorous phasing. Short sequence reads require assembly into longer fragments (contigs), increasing the risk of erroneous assemblies in mixed samples where reads belonging to different species might end up in the same contig. Long sequence reads can be sorted based on variable sites (phasing) and assigned to species directly, circumventing the assembly problem for mixed samples. If traditional markers give sufficient resolution, metabarcoding experiments (see Chapter 11 Amplicon metabarcoding) can be designed for a more cost-effective approach.
The research question will determine whether a customised bait panel is needed for a study, and the choice is a trade-off between cost and detail (Figure
Universal bait panels are commonly designed for resolving deeper phylogenetic nodes (e.g., angiosperms;
Customised bait panels offer greater recovery and detail but are sensitive to the taxonomic distance from the reference used in the design. For example, target recovery with the Aloe bait panel decreased from an average of 93.6% in Aloe samples to 74.3% in the sister clade Bulbine (
Designing a custom bait panel requires transcriptome or whole-genome sequences from at least one, but preferably more, taxon in the clade of interest or from a closely related clade (
SLCN genes can be retrieved from a set of transcriptomes using software such as Markerminer (
It is important to determine, as far as possible, the copy number of genes identified at this stage to avoid including paralogs in the target design. Annotated whole-genome sequences have an advantage here. When using transcriptomes, Markerminer can indicate the copy status of identified loci based on the curated dataset from De Smet et al. (2013). Additionally, a reciprocal BLAST of putative SLCN loci against the transcriptome can be used to identify near-identical matches, providing an indication of the presence of paralogs.
It is usually unnecessary to include all detected loci as they may vary in their phylogenomic value and there is a limit to what a bait panel can efficiently cover. The smallest RNA-bait panels from MYBaits (Arbor Biosciences), for example, include up to 20,000 baits between 80 to 120 bp in length. Larger bait panels are considerably more expensive. It is advisable to use 2–3x tiling in the bait design so that the whole set covers each base of the target loci with 2–3 baits or more (Figure
Schematic representation of tiling. The reference sequence at the top represents a hypothetical 800 bp exon with dotted blue lines indicating the intron-exon boundaries. Hypothetical baits are 80 bp long. This example is based on a 3x tiling strategy where each nucleotide is (on average) covered by three unique baits. The bait coverage decreases towards the ends of the exon as the target design of this hypothetical example did not include introns or intergenic regions.
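The tiling strategy in the schematic can be sketched in a few lines: successive baits start every bait_len/tiling bases, so that each interior position is covered by roughly `tiling` baits (fewer near the ends, as the caption notes). The function name `tile_baits` and the all-A 800 bp exon are hypothetical.

```python
# Sketch of ~3x tiling: successive 80 bp baits start every bait_len // tiling
# bases (26 bp here), so each interior position is covered by ~3 baits,
# with reduced coverage towards the exon ends.

def tile_baits(target, bait_len=80, tiling=3):
    step = bait_len // tiling                      # 26 bp between bait starts
    starts = range(0, len(target) - bait_len + 1, step)
    return [target[s:s + bait_len] for s in starts]

exon = "A" * 800                                   # hypothetical 800 bp exon
baits = tile_baits(exon)
print(len(baits), len(baits[0]))
```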
Firstly, prioritising target loci recovered across all taxa in the reference dataset is recommended to ensure consistent recovery and to include as much variation as possible in the bait panel design. If a locus is represented by only one taxon in the design, the resulting capture will be skewed towards samples more closely related to that taxon. This may result in uneven recovery among samples in a pooled capture experiment (see Multiplexing and pooling below).
Secondly, for phylogenomic and molecular identification purposes, loci with low variability should be excluded. The target loci should be variable enough to resolve challenging phylogenetic clades. When designing a bait panel for broader taxonomic applications (e.g., above genus-level), limiting the inclusion of highly variable genes can be considered to keep recovery levels consistent across taxonomic scales. The variability of loci can be assessed based on pairwise sequence identity (ideally < 95% between the reference sequences), phylogenetic resolution on gene trees, and the number of parsimony-informative sites (e.g., ≥ 20 per 1000 bp). Additionally, metrics such as the amount and proportion of missing data can be obtained in a useful summary diagram with the AMAS tool (
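Two of these locus-selection metrics can be computed directly from an alignment, as in the sketch below on a toy four-sequence alignment. A parsimony-informative column is taken here as one in which at least two character states each occur in at least two sequences.

```python
# Sketch of two locus-selection metrics: pairwise sequence identity and the
# number of parsimony-informative sites. The four-sequence alignment is a
# toy example.

from collections import Counter

def pairwise_identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def parsimony_informative_sites(alignment):
    """Columns where at least two states each occur in >= 2 sequences."""
    informative = 0
    for column in zip(*alignment):
        states = Counter(base for base in column if base != "-")
        if sum(1 for n in states.values() if n >= 2) >= 2:
            informative += 1
    return informative

aln = ["ATGCATGC",
       "ATGCATGA",
       "ATACATGA",
       "ATACGTGC"]

print(round(pairwise_identity(aln[0], aln[1]), 3))
print(parsimony_informative_sites(aln))
```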
Thirdly, it is important to make sure that target loci are long enough to be covered efficiently by the RNA-baits (i.e., > 400–500 bp is recommended). This is especially relevant when targeting exons. Exons shorter than the bait length will not be captured efficiently since bait sequences that span two intron-exon boundaries will have reduced recovery as they only bind partly to one exon.
Finally, target genes should be checked for repetitive regions such as microsatellites or transposons. These can be detected by the presence of short, repetitive sequences of low complexity (e.g., (ATC)n) and should not be included in the bait panel design. Baits spanning these repetitive elements will likely hybridise in many places in the genome, reducing the hybridisation efficiency and recovery for the target gene. Similarly, including highly conserved and high-copy regions common to plants, such as the MYB-domain (
Selected SLCN genes should first be aligned to a reference sequence from an annotated genome to indicate putative intron-exon boundaries. These alignments will form the basis of the RNA-bait design. An example pipeline can be found in the publication for the Angiosperms353 panel (
Firstly, baits need to be checked for potential overlap with high-copy sequences from organellar genomes (plastomes, mitochondrial genomes, and nuclear ribosomes) by mapping them to published sequences, which are typically available in NCBI databases. Secondly, the GC-content of baits should be normalised across the panel. The hybridisation temperature governs the specificity of capture, and baits with the same melting temperature (Tm) should hybridise evenly. Additionally, a high GC content in baits will lead to more off-target hybridisation, as these baits are more likely to bind efficiently to GC-rich areas in the genome. Baits with a GC-content > 75% should therefore be removed, though one might lower the threshold to 60%. Thirdly, identical, or nearly identical, baits should be removed to reduce redundancy in the dataset as well as bias towards regions covered by identical baits. This should be done carefully, however, so as not to reduce the desired tiling of the bait panel. For example, removing baits with > 90% identical sequence over 85% of the total bait sequence generally works for 3x tiling. The digital panel design provided by the company should then be checked for accuracy by mapping the designed baits against all reference target sequences (e.g., selected transcripts for the panel design). This is to make sure that the baits align with the target sequence, are not too divergent from it, and are tiled uniformly across the length of all genes.
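The GC-content filter can be sketched as follows; the candidate baits are toy sequences, and the 75% threshold follows the text above.

```python
# Sketch of the GC-content filter: drop candidate baits whose GC fraction
# exceeds the threshold (75% here, per the text; some designs use a
# stricter 60%). Candidate baits are toy sequences.

def gc_fraction(bait):
    bait = bait.upper()
    return (bait.count("G") + bait.count("C")) / len(bait)

def filter_by_gc(baits, max_gc=0.75):
    return [b for b in baits if gc_fraction(b) <= max_gc]

candidates = ["ATGCATGCATGC",      # 50% GC -> kept
              "GGGCCCGGGCCC",      # 100% GC -> removed
              "GCGCATGCGCGC"]      # ~83% GC -> removed
kept = filter_by_gc(candidates)
print(len(kept))
```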
A custom-designed bait panel needs to be tested to ensure it efficiently captures the target sequence prior to a large-scale study. A cost- and time-efficient pilot study can include up to 24 samples using an Illumina MiSeq platform (
Information on introns and flanking regions may be elucidated from the ‘splash zone’ in a pilot study that can subsequently be added to the bait panel design (Cangren et al., unpublished). In these cases where the bait panel is expanded, a second pilot study may be required. The results of a pilot study are generally published along with the design of a custom bait panel to the benefit of other researchers who may use the same custom bait panel (e.g., Dioscorea (
Target capture sequencing uses genomic DNA libraries prepared for sequencing on HTS platforms. These libraries consist of DNA fragments, usually of a controlled size, obtained from source plant material. The source DNA fragment is flanked by standardised identifier sequences (indexes, sometimes also called barcodes) to help identify the sample origin of a sequence read, and by a standardised adapter sequence to allow binding of the DNA fragment to the flow cell of the sequencer. The number of bp of DNA from the source genome in a library fragment (the insert size) is therefore smaller than the fragment itself:
insert size = average library fragment size − 2 × (length of adapter+index sequence)
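A worked example of this relation, with hypothetical Illumina-style numbers (a 58 bp adapter plus an 8 bp index on each end):

```python
# Worked example of the insert-size relation above. The 58 bp adapter and
# 8 bp index lengths are hypothetical Illumina-style values.

def insert_size(avg_fragment_size, adapter_len, index_len):
    return avg_fragment_size - 2 * (adapter_len + index_len)

print(insert_size(450, 58, 8))   # 450 bp library fragment -> 318 bp of source DNA
```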
The library preparation procedure is not discussed here, but details can be found in Chapter 9 Sequencing platforms and data types.
Sequences from different samples can be distinguished by labelling (indexing) each library with its own unique identifier. Combining differently labelled libraries into one sequencing run (multiplexed sequencing or multiplexing) is a common strategy to reduce per-sample costs. To further reduce the per-sample cost of target capture experiments, libraries from different samples can be combined in one tube for simultaneous enrichment (pooling). This reduces the number of RNA-baits necessary to enrich the same number of samples, and significantly reduces costs. Efficient target enrichment is routinely achieved with up to 48 samples per RNA-baits reaction and even 96-plexing strategies have been successful (
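Assigning reads back to their source libraries after multiplexed sequencing (demultiplexing) can be sketched as follows; the index sequences, sample names, and reads are all toy examples.

```python
# Sketch of demultiplexing: reads are assigned back to their source library
# by the index sequence attached during library preparation. All index
# sequences, sample names, and reads are toy examples.

index_to_sample = {"ACGTAC": "libraryA", "TGCATG": "libraryB"}

def demultiplex(reads_with_index):
    bins = {sample: [] for sample in index_to_sample.values()}
    for index, read in reads_with_index:
        sample = index_to_sample.get(index)
        if sample is not None:              # reads with unknown indexes are discarded
            bins[sample].append(read)
    return bins

reads = [("ACGTAC", "ATGGCT"), ("TGCATG", "TTACGG"), ("ACGTAC", "GGCTAA")]
bins = demultiplex(reads)
print(len(bins["libraryA"]), len(bins["libraryB"]))
```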
When deploying a universal bait panel, libraries from different taxonomic groups, particularly at the family rank and above, should be enriched in separate pools. The closer the sample is to the reference taxa, the higher the similarity between the bait sequences and the target sequences (
In all target capture experiments, libraries in the same pool should contain similar fragment sizes. Short fragments move through a solution much more easily and will thus encounter the RNA-baits more often, increasing their chances of capture. Mixing short and long fragments in the same target capture reaction can therefore skew the enrichment towards the shorter libraries (
The number of DNA fragments from each sample in the same pool should be equal, i.e., be present in equimolar quantities. A library with a higher number of DNA fragments than the others in the pool will be overrepresented and potentially bias the DNA sequences that are enriched. Diluting libraries to the same molarity (usually in nM) before pooling is therefore generally advised. An example of how to calculate pooling parameters is shown in Figure
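A minimal sketch of such a pooling calculation is shown below, assuming the standard average mass of 660 g/mol per bp of double-stranded DNA; the library concentrations and fragment sizes are invented example values. Conveniently, 1 nM equals 1 fmol/µL, which makes the volume calculation straightforward:

```python
# Equimolar pooling sketch. The concentrations (ng/uL) and average fragment
# sizes (bp) below are invented example values, not real measurements.
DS_DNA_G_PER_MOL_PER_BP = 660  # standard average mass of double-stranded DNA

def molarity_nM(conc_ng_per_ul: float, avg_fragment_bp: float) -> float:
    """Convert a library concentration in ng/uL to molarity in nM."""
    return conc_ng_per_ul * 1e6 / (DS_DNA_G_PER_MOL_PER_BP * avg_fragment_bp)

def volume_ul(target_fmol: float, conc_ng_per_ul: float, avg_fragment_bp: float) -> float:
    """Volume (uL) needed for a library to contribute target_fmol to the pool.
    Works because 1 nM equals 1 fmol/uL."""
    return target_fmol / molarity_nM(conc_ng_per_ul, avg_fragment_bp)

libraries = {"lib_A": (12.0, 350), "lib_B": (25.0, 340), "lib_C": (8.5, 360)}
for name, (conc, size) in libraries.items():
    print(f"{name}: {molarity_nM(conc, size):.1f} nM, "
          f"{volume_ul(10, conc, size):.2f} uL for 10 fmol")
```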
A target capture wet-lab protocol has three steps: denaturing the DNA libraries, hybridising with target-specific baits, and post-capture washing to remove unwanted DNA fragments. An example protocol using the Daicel Arbor MYBaits kit is detailed here.
In the first step, the genomic libraries are denatured at > 95 °C and ‘blocker’ oligonucleotides are added that bind to the adapter sequences. This is to keep the single-stranded fragments from hybridising back to their complementary strands. The blockers also reduce any interference of the adapter sequences during hybridisation, in case the baits themselves contain complementary sequences to the adapters and/or index primers used.
In the hybridisation reaction, the target-specific baits are added to each pool and hybridisation will occur at a constant temperature of 60–65 °C (depending on the specifics of the bait panel) for a minimum of 16 hours. These parameters should always be optimised when setting up a target capture protocol. Longer hybridisation times (≥ 24 hours) are needed for enrichment of more complex genomic libraries, such as those from larger genomes and from universal kits. In these reactions, the baits take longer to encounter the target DNA fragments.
For samples that are expected to underperform (short libraries, herbarium samples, or samples taxonomically distant from the target reference), the hybridisation temperature can be dropped to < 60 °C and the hybridisation time extended to 48 hours. To prevent evaporation and any potential loss of target DNA, a small amount of hydrophobic wax can be added on top of the hybridisation reaction. If using a thermocycler, the heated lid should also be kept at ± 105 °C to prevent evaporation.
Finally, the magnetic streptavidin beads are added to the reaction mixture to bind the target-bait hybrids. These streptavidin beads need to be washed to remove any storage buffer before they are added to the target capture pools. Once ready, the tube with magnetic beads and bead-bound target DNA can be placed on a magnetic tube rack to concentrate and anchor the beads, allowing the non-bound DNA fragments to be washed away.
The DNA in enriched pools often needs to be PCR amplified to generate sufficient fragments for detection and sequencing on HTS platforms. This is especially important when capturing loci from large genomes (e.g.,
Post-capture amplification can be done either with the DNA still bound to the beads, using a specific hot-start polymerase, or after removing the DNA from the beads, using a standard high-fidelity polymerase (Pfu or Q5, various suppliers). In the latter case, the target DNA is released by denaturation at > 95 °C in a suitable solvent, i.e., a Tris-HCl solution (no EDTA should be present, since this will inhibit the polymerase). The tube is immediately transferred to a magnetic tube rack to separate the DNA from the baits, and the solvent containing the target DNA is gently removed and transferred to clean tubes. A concentrated high-grade detergent such as TWEEN-20 is often added prior to denaturation to enhance the release of target DNA. PCR amplification is then done in-solution on the target DNA using universal PCR primers that bind to the adapter sequences.
Optimising the number of PCR cycles (e.g., via qPCR on a dilution of the captured product) is generally advised, since too many PCR cycles increase the chance of false positives in the form of random sequence errors, which cannot be corrected based on the sequencing data. Furthermore, excessive PCR cycles introduce unnecessary PCR clones or duplicates. Performing just enough PCR cycles to bring a pool into the desired concentration range for the sequencing platform (generally ≥ 3 nM) is therefore recommended.
As a rough qualitative indication of target capture success, the fragment size distribution can be determined using high-precision electrophoresis instruments such as an Agilent TapeStation. After a successful experiment, there will be a peak in the expected library fragment size range (insert size + adapter and index sequences). While exact target capture success can only be determined from sequencing results, this post-capture fragment distribution analysis acts as an extra quality assurance prior to sequencing.
Sequencing output of target capture experiments is in the same format as for other HTS experiments. Demultiplexing and quality filtering/trimming of the raw reads is required. Demultiplexing is often done automatically with Illumina sequencing data, using the BaseSpace software. For Oxford Nanopore and/or PacBio reads, there are freeware options such as Porechop (https://github.com/rrwick/Porechop). Read filtering/trimming is typically guided by quality reports from FastQC (
High-quality reads are assembled into consensus sequences for target loci to make good sequence alignments between samples. For target capture experiments, reads are aligned to the target reference used in the RNA-bait design, a process called mapping, to reduce the complexity of the assembly. Several tools are available to assemble mapped reads de novo, meaning without further use of the reference sequences. HybPhyloMaker (
HybPiper uses a combination of different mapping and assembly tools to retrieve target sequences from large target capture datasets. Reads are mapped to the reference sequence using a Burrows-Wheeler Aligner (
Target capture sequencing achieves reproducible high-quality sequencing results for hundreds of targeted SLCN genes or, in fact, any desired target gene. By reducing the complexity of genomic libraries, high-coverage sequencing results of single-copy genes can be obtained regardless of the organisms’ genome size and DNA degradation rate. These characteristics make target capture ideally suited for molecular identification studies (
The method is being refined as the underlying molecular techniques (
Bait – Short oligonucleotide (80–120 bp of RNA or DNA) that is used to capture target sequences in a genomic library. Baits are chemically modified (biotinylated) so they can be bound to magnetic streptavidin beads whilst hybridised to the target DNA for removal from the genomic library.
DNA fragment shearing – The controlled breaking of DNA strands into random smaller fragments by restriction enzymes or, more commonly, by ultrasound shearing (ultrasonication).
Exon – Coding part of a gene. Can be determined from mRNA in RNA-Seq experiments.
Genomic library – A DNA sample containing fragments representative of the different genomes in an organism, e.g., organellar and nuclear genomes, prepared for HTS by addition of platform-specific adapters and sample-specific unique identifiers (in the case of multiplex sequencing).
Infrageneric relationships – Relationships between taxa or taxonomic groups within the same genus.
Infraspecific variation – Variation found within the same species. Can refer to characteristic differences between subspecies or varieties of the same species. In phylogenetics this refers to DNA sequence variation between individuals, populations or subspecies and varieties.
Intron – Non-coding part of a gene, which is often more variable than the exon sequence due to relaxed selective pressures. It is transcribed into RNA in the nucleus but often spliced out of the mRNA that is exported to the ribosomes.
Orthologs – Sequences of the same gene (copy) in different organisms or species. Orthologous loci represent the true evolutionary relationships between organisms and species as their sequences evolved (virtually) independently in different taxa.
Paralogs – Derived copies of a (nuclear) gene in the same organism that arose through either gene or whole genome duplication. Sequences of paralogs can be highly similar, making it difficult to separate them (through phasing of reads) prior to phylogenetic analysis.
Single-to-low copy nuclear (SLCN) gene – A gene located in the nuclear genome that has only one or very few copies per haplotype (a copy of the genome). The limited presence of paralogs for these genes is an advantage, particularly in plant phylogenomics, since it reduces the problem of phasing (sorting out copies of genes from sequencing data).
Tiling (in bait panel design) – A bait panel design approach to ensure consistent coverage across all target loci by more than one bait, without duplicating baits. The principle is to shift the sequence of the second bait slightly downstream of the target sequence so that it overlaps for a large part with the first bait but also captures a new part of the target sequence. See Figure
Unrooted gene trees – A phylogenetic tree calculated from a DNA sequence alignment of a single gene without a rooting point. This is commonly done for quick exploratory analysis of relationships between the organisms studied for a particular gene. In phylogenomics, it is widely used as a tool to identify potential paralogs in a sequence alignment, which will be indicated by a strong bipolar split in the unrooted tree with the sequences of the respective paralogs at each end.
Transcriptomics is the study of the transcriptome, which is the complete set of all RNA molecules, including coding and noncoding RNA, that is expressed in a cell, tissue, or organism at a specific spatial, temporal, or developmental stage (
In plant research, transcriptomics is widely used for studying differential expression, identifying novel genes, and general expression patterns (
The first publication studying individual transcripts used Northern blotting for RNA detection, which is a hybridization-based method (
Microarrays were the first high-throughput method developed for transcriptomics to achieve widespread use due to their affordability and highly sensitive transcript detection (
Currently, RNA-seq data is often acquired using technologies that allow for long read data. Long read RNA-seq data allows reading full transcripts, finding new isoforms, identifying fusion transcripts, identifying long noncoding RNA, simplifying the computational analysis, and reducing PCR biases (
Experimental design considerations
Isolating a sufficient quantity of high-quality RNA is critical for conducting transcriptome sequencing experiments and their analyses. When designing a protocol, the number of biological replicates should be considered. Biological replicates are RNA samples harvested from different plants, or from different sets of independent samples treated under the same conditions. Biological replication is important for assessing variation between samples, and more biological replicates increase statistical power during analysis. In general, the minimum for transcriptomics studies is three biological replicates. Once the number of samples and replicates is decided, the following steps are sample treatment and handling, RNA isolation, and RNA quality and quantity testing.
RNA to be used in transcriptomic experiments is most commonly isolated from a maximum of 100 mg of fresh plant tissue. If not used immediately, harvested plant tissue should be snap-frozen in liquid nitrogen and stored at -80 °C. If it is not possible to homogenise the fresh material or to snap-freeze it in liquid nitrogen immediately (e.g., in the field), it should be kept in a preservation buffer that maintains a constant pH to preserve proteins and protect the RNA. RNA stabilisation and storage solutions available from manufacturers (e.g., RNAlater™; Ambion/Invitrogen, ThermoFisher Scientific, USA) or other preservatives such as a sulfate salt solution (e.g., ammonium sulfate) preserve tissue samples after harvesting in order to retain the quality and quantity of RNA for long periods (
A crucial step in tissue preparation is finding the most appropriate method to homogenise the tissue in order to maximise the yield and quality of the RNA. The most common method is snap-freezing in liquid nitrogen and subsequent homogenisation/disruption of the tissue by manual grinding with a mortar and pestle or with glass/metal beads and a tissue lyser. However, this is challenging for hard tissue like wood, roots, or plant tissues with thick cuticles such as succulent leaves. A combination of snap-freezing in liquid nitrogen, disruption of the tissue by manual grinding, and a second grinding with glass/metal beads and a tissue lyser can be a solution to optimise the homogenisation of hard tissues. Once the tissue samples are powdered, they can be stored at -80 °C or used immediately for RNA isolation. It is advised to thaw a frozen tissue sample only once and to add the lysis buffer immediately to obtain high-quality isolated RNA. It is important that the lytic agent or denaturant comes into contact with the cellular contents when the cells are disrupted. The RNA lysis buffer (e.g., Buffer RLT, Qiagen, USA) is usually composed of phenol and guanidine isothiocyanate. This buffer has two functions: it acts as a denaturing agent, and it stabilises nucleic acids by preventing RNase activity.
Compared to DNA, RNA is less stable due to its chemical structure: RNA is single-stranded and can easily be enzymatically degraded by the abundant ribonucleases (RNases) present in the environment. RNases are secreted through our skin and in the air we breathe out. RNA isolations therefore need to be conducted under RNase-free conditions. Gloves must be worn at all times and the RNA isolation should take place in a fume hood. Designated working spaces and equipment should be cleaned with RNase inhibitors. Common RNase inhibitors are strong denaturants such as guanidinium, sodium dodecyl sulfate (SDS), diethyl pyrocarbonate (DEPC), or phenol-based compounds. Commercially available products include DNase/RNase AWAY™ (Merck BV, The Netherlands) or bleach (sodium hypochlorite). RNase-free plastics and glassware should also be used. The main steps for RNA isolation are similar to the DNA isolation protocol (Chapter 1 DNA from Plant Tissues). RNA can be extracted and purified by following protocols described in the literature, such as an acidic phenol-chloroform RNA extraction (
Single-cell RNA-seq (scRNA-seq) is an advanced method to profile transcriptomes from individual cells. scRNA-seq can be used for cell type identification, transcriptome profiling, and inference of gene regulatory networks across the cell (
One method for tissue-specific isolation is laser microdissection (LMD), which is based on a histological identification that isolates specific cell types by laser capture and laser cutting (
Single-cell sequencing can further provide high-resolution functional information on an individual cell. In order to capture single cells for scRNA-seq experiments, fluorescence-activated cell sorting (FACS) with the use of protoplasts is commonly used. This is both a high-throughput and highly specific method (
The quality and quantity evaluation of RNA is essential to the success of sequencing experiments and the downstream analysis. RNA quality and quantity can be evaluated by measuring the UV absorption of a sample. The optical density (OD) ratios at A260/A280 and A260/A230 can be used to determine the RNA purity. Pure RNA has an A260/A280 ratio of 2.1 and an A260/A230 ratio in the range of 2.0–2.2 (
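A trivial helper along these lines can flag samples outside the quoted ranges. The thresholds are taken directly from the ratios above; this is an illustrative check, not a validated QC criterion, and real QC should also consider RNA integrity:

```python
# Illustrative RNA purity check using the OD ratio ranges quoted above.
def rna_purity_ok(a260_a280: float, a260_a230: float) -> bool:
    """True when A260/A280 is near 2.1 and A260/A230 is within 2.0-2.2."""
    return abs(a260_a280 - 2.1) <= 0.1 and 2.0 <= a260_a230 <= 2.2

print(rna_purity_ok(2.05, 2.10))  # True: acceptable sample
print(rna_purity_ok(1.80, 1.50))  # False: suggests protein/phenol carry-over
```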
Measuring the RNA integrity in order to determine its degradation level is also recommended. Traditionally, RNA integrity was determined by visualising total RNA using gel electrophoresis and ethidium bromide staining. Intact RNA gives sharp and clear 28S and 18S rRNA bands with an intensity ratio of 28S/18S at 2.0 or higher, in addition to a messenger RNA (mRNA) smear that should be visible between these two distinct bands. A more recent and standardised RNA integrity determination method is determining the RNA integrity number (RIN) with Agilent Bioanalyzer Systems instruments (Agilent Technologies, USA) (
The selection of library preparation methods depends on the fragment size, presence of structural features, and sequencing platform. In the Illumina short-read RNA-seq protocol, the library preparation entails four main steps: (1) RNA molecule selection (mRNA enrichment or rRNA depletion), (2) fragmenting the targeted sequence to the desired length and converting fragmented RNA into cDNA, (3) attaching the adapters and PCR amplification to create the cDNA library, and (4) quantifying the library product for sequencing. The library preparation for long-read sequencing is somewhat simpler than for short-read sequencing. The PacBio Iso-Seq protocol consists of three main steps: (1) cDNA synthesis, (2) cDNA amplification, and (3) library construction. With the Oxford Nanopore platform, the sequencing can be done directly from RNA or by using the amplified (or non-amplified) cDNA input.
A total RNA sample after extraction contains ribosomal RNA (rRNA), precursor mRNA (pre-mRNA), mRNA, small noncoding RNA (sRNA/sncRNA), and long ncRNA (transcripts longer than 200 nucleotides), where the majority of material is rRNA (
The conversion of RNA into cDNA is an essential step for RNA-seq. This conversion is necessary because DNA is biologically more stable than RNA, PCR amplification can only be done on DNA, and most sequencing protocols are designed for sequencing DNA. The first step in converting RNA to cDNA is fragmentation of the RNA to an appropriate size for sequencing (i.e., 100–600 bp). Several approaches are available for RNA fragmentation, including physical approaches (e.g., acoustic shearing and sonication), chemical approaches (e.g., heating with divalent metal cations), and enzymatic methods (e.g., non-specific endonuclease cocktails and transposase tagmentation reactions) (
Adapters are ligated to one or both ends of the cDNA fragment. Adapters consist of sequences that allow library fragments to bind to the flow cell, sequencing primer binding sites, and index sequences. Index/barcode sequences are sequence identifiers that enable the pooling of several samples (multiplexing) in a single sequencing run or flow cell lane. Products from the ligation reaction are purified using agarose gel electrophoresis prior to PCR amplification to create the cDNA library.
Several library preparation kits for the Illumina platform are available. The “TruSeq Stranded Total RNA with Ribo-Zero Plant” kit is useful for large tissue samples (0.1–1 μg total RNA). For low quantities of RNA, the “NEBNext® Ultra™ II Directional RNA Library Prep with Sample Purification Beads” kit (10 ng–1 µg total RNA for the poly(A) mRNA workflow and 5 ng–1 µg total RNA for the rRNA depletion workflow) (New England Biolabs Inc., UK) can be used. These kits incorporate the Illumina library preparation steps, including bead-based rRNA depletion or mRNA enrichment, cDNA synthesis, adapter ligation, indexing, and PCR. For tissue samples that yield smaller amounts of RNA, such as single cells (1–25 ng), the “Collibri Stranded RNA Library Prep kit” (ThermoFisher Scientific, USA) can be applied.
For the PacBio Iso-Seq platform for long-read RNA-seq, the “NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module” kit (New England Biolabs Inc., UK) can be used for cDNA synthesis and its amplification from a single cell or ultra-low input RNA (as low as 1 pg–200 ng). The “SMRTbell Express Template Prep Kit 2.0” (Pacific Bioscience, USA) can be used to detect full-length transcripts up to 10 kb.
The ONT platform provides a starter pack for direct RNA-seq, a PCR-cDNA sequencing kit, and a direct cDNA sequencing kit (Oxford Nanopore Technologies Ltd., UK). The required input of RNA or poly(A)+ RNA (selected on the presence of polyadenylated 3’-ends) is 500 ng for direct RNA-seq, 1 ng for PCR-cDNA sequencing, and 100 ng for direct cDNA sequencing.
A very sensitive method for checking the quantity of a library preparation is with fluorometric methods (e.g., Qubit™ Fluorometer, ThermoFisher Scientific, USA) or by qPCR. qPCR library quantification is based on the amplification of cDNA fragments carrying the adapters. The qPCR machine measures the intensity of fluorescence emitted by the probe at each cycle. In this approach, only templates that have adapter sequences on both ends will be measured, and these are the fragments that subsequently form clusters in a flow cell. Other methods include electrophoresis-based quantification using fragment analyzer systems with automated parallel capillary electrophoresis to assess the library size distribution (e.g., TapeStation, Agilent Technologies, USA). A critical aspect of the fragment analyzer quality check is that the library size distribution lies in the expected range. Peaks near the lower marker indicate contaminants, including primer and adapter dimers; in that case, an additional clean-up of the sample is recommended to increase the quality.
cDNA sequencing can be performed on several different platforms (see Chapter 9 Sequencing platforms and data types). Overall, RNA sequencing does not differ from the sequencing of genomic DNA. The sequencer reads cDNA fragments in one of two modes: single-end or paired-end. In single-end sequencing, the sequencer reads the cDNA from the 3’ or 5’ end of only one strand of the insert. This method can produce large volumes of high-quality data, especially for differential gene expression studies, where the important factor is determining which transcripts the reads come from (
The requirements for sequence coverage and depth vary depending on the scientific questions to be answered, with complex studies potentially needing greater sequencing depth and coverage. For example, a differential expression study using the Illumina platform requires 10–30 million reads per sample (
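For planning purposes, the required depth translates directly into how many samples fit on a flow cell. The sketch below uses an assumed output of 400 million reads per lane, a figure that varies widely between platforms and chemistries:

```python
# Back-of-the-envelope multiplexing estimate. The lane output below is an
# assumed figure; check the specifications of your sequencing platform.
def max_samples_per_lane(lane_output_reads: float, reads_per_sample: float) -> int:
    """Number of samples that can share one lane at the desired depth."""
    return int(lane_output_reads // reads_per_sample)

print(max_samples_per_lane(400e6, 20e6))  # 20 samples at 20 M reads each
```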
Prior to the development of high-throughput methods, individual transcriptome studies were performed using hybridization-based methods such as Northern blotting and microarrays (see above) or amplification-based methods including Sanger sequencing and RT-qPCR.
Hybridization-based methods require visual inspection or image processing analyses to interpret the output, while in qPCR, it is the amplification that must be monitored. In qPCR, the expression levels are represented by cycle threshold (Ct) values and further normalisation steps and statistical analyses need to be used for the estimation of relative or absolute abundances. Neither hybridization methods nor qPCR require labour-intensive post-processing.
On the other hand, EST/SAGE/MPSS and RNA-seq methods rely on sequence data and require several post-processing steps such as clustering, assembly, and functional annotation. As RNA-seq allows characterization of whole transcriptomes and is currently the most widely used method, we outline the bioinformatic analysis steps for high-throughput RNA-seq data. Long-read sequencing methods such as ONT and SMRT allow full-length characterization of transcripts and can be used to study complex transcriptomes. Although one common concern regarding these technologies is their high error rates, their accuracy has increased dramatically in recent years, and the development of long-read-specific error correction approaches is providing further improvements (
After obtaining raw RNA-seq data, the quality of the reads should be checked and sequencing errors should be corrected in order to improve the accuracy and efficiency of the assembly process. It is also recommended to mask low complexity regions and repetitive sequences that might generate hits that are artefacts. DUST and SEG modules of BLAST can be used for this purpose on nucleotide and amino acid sequences, respectively. Bacterial and viral contaminants can be removed by running similarity searches against public databases or using tools such as DeconSeq (
Most short-read assemblers first divide reads into subsequences of length k (i.e., k-mers) and generate a graph representing the overlap between them (
Depending on whether a reference genome/transcriptome is available or not, there are different strategies for transcriptome assembly.
De-novo assembly is solely based on RNA-seq data and uses the k-mer composition by subdividing the reads into shorter segments of a given length k. This composition and the overlaps between these k-mers are represented on a de Bruijn graph, which is finally resolved to reconstruct transcripts (
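The k-mer/de Bruijn idea can be illustrated in a few lines: nodes are (k−1)-mers, edges are k-mers, and a transcript is reconstructed by walking the graph. The reads below are toy data; real assemblers additionally handle sequencing errors, repeats, coverage cutoffs, and alternative isoforms, all of which this sketch ignores:

```python
from collections import defaultdict

# Toy de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge.
def de_bruijn(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # overlapping reads from one transcript
graph = de_bruijn(reads, k=4)

# Reconstruct the transcript with a greedy walk from the only start node
# (the node with no incoming edges); valid only for error-free toy data.
incoming = {node for targets in graph.values() for node in targets}
node = next(n for n in graph if n not in incoming)
contig = node
while node in graph:
    node = graph[node][0]
    contig += node[-1]
print(contig)  # ATGGCGTGCA
```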
Commonly used de-novo assemblers include Trans-ABySS (
There are also combined de-novo assembly approaches such as EvidentialGene (
Genome-guided assemblers map RNA-seq data to a reference genome and avoid constructing de Bruijn graphs by merging the reads based on their overlapping regions. The quality of the reference genome is critical here: a high-quality assembly can provide accurate transcript predictions and expression profiles, while a fragmented or incomplete reference can hamper the process. When mapping RNA-seq reads to a reference genome, introns must be accounted for; genome-guided assemblers therefore allow splitting reads during mapping. This is achieved with a splice-aware alignment strategy, in which the downstream region of a read can map to a downstream exon on the reference. Such splice-aware aligners include TopHat2 (
RNA-seq reads can also be mapped to a transcriptome, if a high-quality assembly is available for the target or a closely related species. This transcriptome-guided approach can improve the contiguity and completeness of the assembly (
High-sensitivity reference-guided assemblers can be combined with de-novo assemblers in order to detect novel and missing transcripts as well. If the reference genome is incomplete, fragmented, or from a distantly related species, the de-novo assembly should be performed first in order to avoid the potential errors in the reference. This approach can also be useful for extending incomplete transcripts to full-length by merging these based on a reference (
The average length of assembled contigs in an RNA-seq experiment will vary based on the actual mRNA fragments that are sequenced. Thus, metrics based on assembled contigs do not necessarily indicate the quality of a transcriptome assembly. Transcriptome-specific metrics have been suggested such as ExN50, which computes transcript lengths as expression-weighted means of isoform lengths. Another method to assess the assembly quality is by checking the read percentage that can concordantly align to the final assembly in order to understand if the full complement of paired-end reads are represented in the assembled transcripts. Tools such as bowtie2 or BWA can be used for this type of mapping. Other tools for evaluating the quality of an assembled transcriptome include DETONATE (
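As a concrete example of a length-based assembly metric, plain N50 (the contig length at which half of the total assembled bases lie in contigs of that length or longer) can be computed as below with invented contig lengths; ExN50 follows the same logic but weights isoforms by expression:

```python
# Plain N50 from a list of contig lengths (toy values).
def n50(contig_lengths):
    """Smallest length L such that contigs >= L contain half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= half_total:
            return length

print(n50([100, 200, 300, 400, 500]))  # 400
```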
Transcripts can also be translated into protein sequences and mapped against well annotated databases such as UniProt/Swiss-Prot, Pfam, or NCBI. If the sequenced organism is closely related to a model organism, a high proportion of the contigs should have potential homologs in these databases. Another tool, BUSCO, assesses the completeness of the assembly by comparing it with universal single-copy gene databases specific to different lineages such as bacteria, fungi, or plants.
Expression quantification is a critical step for most RNA-seq experiments. There are two main sources of systematic variability that might introduce errors into this process: (i) longer transcripts generate more reads than shorter transcripts at the same abundance, due to RNA fragmentation during library construction, and (ii) the number of fragments mapped differs across samples, due to the varying number of reads produced in each run. Therefore, read counts need to be normalised in order to obtain accurate gene expression estimates. Inter-sample normalisation methods have been developed for differential expression analysis, such as DESeq2 (
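The two sources of variability described above (transcript length and sequencing depth) are exactly what within-sample measures such as TPM correct for; a minimal sketch with invented counts and lengths is shown below. Inter-sample methods such as DESeq2 apply more sophisticated normalisation factors for differential expression testing:

```python
# TPM sketch: divide counts by transcript length first, then scale so that
# the values sum to one million in every sample. Inputs are invented.
def tpm(counts, lengths_kb):
    rpk = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]

counts = [100, 500, 400]        # raw read counts for three transcripts
lengths_kb = [1.0, 2.5, 4.0]    # transcript lengths in kilobases
print([round(x) for x in tpm(counts, lengths_kb)])  # [250000, 500000, 250000]
```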
Assembled transcripts from de-novo or reference-guided assemblies are expected to represent real biological differences such as expression levels, alternative splice forms, and paralogous or allelic transcripts (
Transcriptomes have been used for plant phylogenomic inference as they contain abundant information from the nuclear genome. Famously, the generation of > 1000 transcriptomes across the plant kingdom led to new evolutionary insights for land plants (One Thousand Plant Transcriptomes Initiative 2019). However, the application of RNA-seq is limited to fresh tissue with low levels of degradation, making it less applicable to studies with large taxonomic sampling.
An emerging phylogenomic approach that partly relies on transcriptomics uses targeted next-generation sequencing (see Chapter 14 Target capture) to obtain specific genes for high-coverage DNA sequencing in large numbers of samples with varying taxonomic breadth. Target capture is very efficient in recovering hundreds of genes, regardless of the degradation level in the source DNA (
Metatranscriptomics is the application of transcriptome sequencing to environmental samples such as water, soil, or sediments. It gives an overview of the actual metabolic activity and taxonomic diversity within a community. The protocol involves HTS of reverse-transcribed cDNA obtained from an environmental mRNA isolate. While reverse transcriptase PCRs can only detect a single gene at a time, metatranscriptomics gives a whole gene expression profile of a diverse community of organisms playing various functional roles in the ecosystem (
Chapter 15 Infographic: Overview scheme of transcriptomics in plants with emphasis on the RNA-seq method. (a) Sample preparation and RNA isolation. (b) Library preparation starts by selecting the RNA species from the total RNA, followed by fragmentation of the RNA, cDNA synthesis, adapter ligation, cDNA amplification, and RNA sequencing. (c) The first step in transcriptome analysis is assessing the quality and quantity of reads. The clean reads are assembled against the reference genome, through de-novo assembly, or by combining these two approaches. The assembled reads are then annotated, followed by quantification and normalisation of the annotated results. The final step is differential expression analysis to quantify the difference in expression level of genes between samples or treatments.
Some of the main challenges of metatranscriptomics are the presence of PCR inhibitors in environmental samples (e.g., humic acid, polysaccharides;
There are various applications of metatranscriptomics such as revealing the composition of freshwater bacterioplankton communities (
Plant transcriptomics studies have undergone huge advances over the past few years as the costs of second-generation sequencing, such as Illumina, have declined, third-generation sequencing has become more accurate, and a wider range of analysis tools and pipelines has become available (
Studies using comparative transcriptomics to understand interactions between different organisms (
Adapter – Chemically synthesised single-stranded or double-stranded oligonucleotides to capture a DNA sequence of interest.
Artefacts – Variations in sequences because of non-biological processes. For example, chemical reactions can cause changes in the nucleotides during the sequencing process.
Bait – An oligonucleotide designed for capturing a specific RNA or DNA species for sequencing.
BUSCO – Benchmarking Universal Single-Copy Orthologs.
Contig (plural contigs) – A contiguous sequence assembled from overlapping DNA segments.
de Bruijn graph – A graphical representation of overlapping sequences which is used to construct whole length contigs.
De novo assembly – A method for creating a transcriptome assembly without a reference genome.
DNase (deoxyribonuclease) – An enzyme that cleaves and degrades DNA.
Genome – A haploid set of chromosomes, including genes in microorganisms.
High-throughput sequencing (HTS) – A sequencing technology that enables large-scale, massively parallel DNA sequencing.
k-mer – A subsequence of length k contained in a nucleotide or amino acid sequence.
Locus (plural loci) – The specific position of a particular gene or marker on a chromosome.
Oligo (dT) primer – A single-stranded sequence of deoxythymine (dT) that is suitable to use as a primer with reverse transcriptase for first strand cDNA synthesis.
Random primer – A hexadeoxynucleotide of random sequence used for first-strand cDNA synthesis and cloning.
Reverse transcriptase (RT) – An RNA-dependent DNA polymerase that synthesises complementary DNA (cDNA) from an RNA template.
RNA transcripts – Single-stranded RNA products (e.g., mRNA, tRNA) synthesised by transcription of DNA.
RNase (ribonuclease) – A nuclease that catalyses the degradation of RNA into small fragments.
Sequencing depth – The number of unique reads that include a given nucleotide in the final reconstructed sequence; also referred to as coverage.
Splice junction – The site on the mature RNA indicating the position of a former intron which was spliced out after transcription.
Splice variants – Variant forms of an mRNA produced by alternative splicing, i.e., by differential use of splice sites at the boundaries of exons and introns.
Transcriptome assembly – The reconstruction of the RNA sequence composition of a biological sample by computationally processing the raw reads obtained by RNA-seq, with subsequent steps aligning and merging fragments of longer transcripts in order to reconstruct the original sequences.
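Two of the glossary entries above, k-mer and de Bruijn graph, can be made concrete with a short sketch (toy sequence; hypothetical helper names, not taken from any assembler):

```python
def kmers(seq, k):
    """Return all overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(seq, k):
    """Each k-mer contributes one edge in the de Bruijn graph,
    from its (k-1)-prefix to its (k-1)-suffix."""
    return [(km[:-1], km[1:]) for km in kmers(seq, k)]

print(kmers("ATGGCA", 3))           # ['ATG', 'TGG', 'GGC', 'GCA']
print(de_bruijn_edges("ATGGCA", 3)) # walking these edges rebuilds the contig
```

Assemblers based on this representation reconstruct contigs by finding paths through the graph of all reads' k-mers.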
Modern sequencing technologies (see Chapter 9 Sequencing platforms and data types) make it possible to generate large-scale genomic sequencing data for any plant species. This dramatic step-change in genomic data availability, along with improvements in bioinformatic tools, has led to the release of many high-quality plant genomes (
In this chapter, we consider best practices for whole genome sequencing as a tool for plant identification, and the relative strengths and weaknesses of different genome sequencing approaches. We start by discussing the overall workflow common to any project using whole genome sequencing, before moving to the specific requirements of three approaches that differ in their sequencing coverage: (1) Genome skimming, which uses low-coverage sequence data to assemble well-represented (high copy number) genomic regions, (2) Genome resequencing, which uses modest-coverage sequence data to investigate genomic diversity relative to an existing nuclear reference genome sequence, (3) De novo whole genome assembly, which uses high-coverage sequence data to generate a nuclear reference genome. We also consider assembly-free approaches for using the nuclear genome.
Genomic sequencing starts with sample collection and DNA extraction, and finishes with a set of sequences or sequence variants that are suitable for analysis. The major stages are as follows.
Genome sequencing usually uses high-quality DNA extracted from plant tissue (see Chapter 1 DNA from plant tissue), though some approaches can accommodate DNA from degraded specimens (see Chapter 2 DNA from museum collections). The exact requirements depend on the downstream processes, but as a guide:
Most plant identification projects use DNA extracted from individual plant samples. However, metagenomic studies may work on mixed samples such as environmental DNA (see Chapter 12 Metagenomics), while some population genomic studies may choose to pool multiple individuals per population and compare diversity between these sample pools (e.g., Pool-seq; (
The wet lab protocols used to generate sequence-ready DNA libraries (see Chapter 9 Sequencing platforms and data types, Chapter 12 Metagenomics, Chapter 15 Transcriptomics) are varied and depend on the starting DNA quality and the intended downstream sequencing approach. Low amounts of starting input DNA will require amplification via PCR-based library preparation, while higher amounts of input DNA can be used in PCR-free libraries, which reduces bioinformatic issues with PCR duplicates. Further, a range of more lab-intensive library preparation approaches are available for long-read sequencing or to allow users to partition and barcode HMW DNA (e.g., linked-read sequencing such as haplotagging; (
Most plant identification studies using whole genome sequencing rely on short-read data, such as that generated on Illumina platforms or with BGI technologies. Here, the benefits of low per base-pair sequencing costs, high accuracy and throughput, and potential for sample multiplexing make it extremely well-suited for a range of applications. However, recent innovations in long-read sequencing have reduced error profiles and made it more cost-effective (
The computational methods will vary considerably depending on sample type and number, sequence type, and downstream analysis approach. However, most projects will involve the following initial stages:
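As a minimal illustration of one such initial stage, filtering raw reads on mean base quality, the following sketch applies a Phred threshold to parsed FASTQ records (the record layout and the threshold of 20 are illustrative assumptions, not a prescribed pipeline):

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Sanger/Illumina 1.8+ encoding)."""
    scores = [ord(c) - offset for c in quality_string]
    return sum(scores) / len(scores)

def filter_reads(fastq_records, min_mean_q=20):
    """Keep (header, sequence, quality) records whose mean quality passes the threshold."""
    return [r for r in fastq_records if mean_phred(r[2]) >= min_mean_q]

reads = [("@r1", "ACGT", "IIII"),   # 'I' = Phred 40: high quality, kept
         ("@r2", "ACGT", "!!!!")]   # '!' = Phred 0: discarded
print(filter_reads(reads))          # [('@r1', 'ACGT', 'IIII')]
```

Dedicated tools perform this step far more efficiently, but the logic is the same: score each read, then discard those failing the threshold.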
Low-coverage sequencing of genomic DNA, ‘genome skimming’, is an efficient approach for comparative genomics of diverse species (
Genome skimming stands out from many other genomic approaches for its technical ease. It is straightforward at all stages of the workflow, from DNA extraction requirements to easy and comparatively cheap library building options (
The downside of genome skimming is that it fails to reliably sample the bulk of the genome. The regions which are represented at high coverage, particularly organelle genomes, show atypical inheritance and evolutionary patterns, which may yield phylogenetic results that are incongruent with phylogenies of the nuclear genome (e.g., in Orchidaceae; (
Plastids are organelles that are responsible for photosynthesis and the synthesis and storage of molecular products. Plastomes are mostly circular and nonrecombinant organellar genomes averaging 120–160 kb in size. Their high copy number per cell means high quality assembly is possible even from low depth nuclear genome sequencing (
Nuclear ribosomal DNA (nrDNA) primarily functions to code for ribosomal RNA. Plant nrDNA has an average size range of 10–15 kb and exists as hundreds to thousands of tandem repeats occurring in high copy numbers throughout cells (
Mitochondrial genomes have a primary function in respiration. Despite their conserved function and generally conserved gene complement, they show significant structural variation in plants, including in size (100 kb–2.7 Mb), sequence arrangement and repeat content (
Genome resequencing involves sequencing samples at a moderate depth (often 5–30X coverage) and analysing the data in the context of an existing reference genome. Most genome resequencing studies use short-read data and subsequently investigate SNPs and small indels; however, long-read sequencing is now becoming more accessible for such resequencing work, thus allowing researchers to investigate longer indels and structural variation (see ‘Sequencing’ above).
The key benefit of genome resequencing over genome skimming is that it provides reliable and repeatable access to the nuclear genome. This allows researchers to investigate genome-wide diversity and evolutionary relationships both genome-wide and in specific genomic regions of interest, such as those loci underlying species differences in young taxa (
The key downside to genome resequencing is that it requires a reference genome. As such, genome resequencing has traditionally been restricted to population genomic analyses of model species. However, decreasing sequencing costs and the increasing availability of reference genomes (discussed below) mean resequencing is now more widely applicable to a diversity of species. It is also becoming increasingly easy to perform resequencing studies on degraded DNA due to improved laboratory and bioinformatic methods that are able to capture and process short fragments (see Chapter 2 DNA from museum collections). In addition, the increasing use of genotype likelihoods instead of hard SNP calls means that sequence variation can be assessed at reduced coverage (and hence, cost). As such, genome resequencing is increasingly used to resolve plant identification issues that require population-level sampling or the investigation of closely related species.
The use of a reference genome brings limitations to the analysis of samples of varying quality (exacerbated with fragmentation in degraded ancient DNA; (
A range of bioinformatic tools are now available for the analysis of short sequence reads without a reference genome. These approaches often rely on the frequency distribution of k-mers (short sequences of length k) across all sequence reads of a given sample (
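As a minimal sketch of this idea (toy reads; real tools operate on billions of k-mers and add probabilistic modelling), the k-mer frequency distribution of a sample can be tabulated as follows:

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Histogram mapping multiplicity -> number of distinct k-mers
    observed that many times across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())

reads = ["ATGGC", "TGGCA", "GGCAT"]
print(kmer_spectrum(reads, 3))  # Counter({1: 2, 2: 2, 3: 1})
```

The shape of this spectrum (e.g., the position of its coverage peaks) is what assembly-free methods exploit to estimate genome size, heterozygosity, or sample similarity without a reference.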
De novo whole genome sequencing represents the ‘gold standard’ in genomics. Here, the aim is to produce a chromosomally contiguous set of sequences that document the complete nuclear genome. To achieve this aim, a complementary range of genomic sequencing technologies (with long-read sequencing coupled with long-range information now the standard) are applied to high-quality DNA extracts, producing high-coverage sequencing data.
The de novo assembly of plant genomes represents a complex analytical problem. Most assemblers rely on one of two approaches (
Published reference genome assemblies vary considerably in quality. Even when contamination and mis-assemblies have been minimised, contig and scaffold size and overall genome completeness can vary widely. These genome properties can be assessed with measures such as N50 (a length-weighted median measure of contig size), or BUSCO completeness (the percentage of fully assembled core plant genes) (
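As an illustration of the N50 measure mentioned above, the following sketch implements the standard definition (a length-weighted median; this is a generic computation, not taken from any particular assembly tool):

```python
def n50(contig_lengths):
    """Length-weighted median contig size: the length of the smallest contig
    in the set of largest contigs that together cover >= 50% of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Total assembly size 290; the two largest contigs (80 + 70 = 150) pass half of it:
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 indicates a more contiguous assembly, though it says nothing about correctness, which is why it is usually reported alongside completeness measures such as BUSCO.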
De novo genome sequencing is likely to play an important role in future studies of plant identification. One can imagine rapidly sequencing long DNA molecules that are assembled directly into chromosomes in real time and then compared to existing reference genomes to detect the presence of cryptic species. While this may seem like a fantasy, the dramatic and continued progress in genomic sequencing and bioinformatic algorithms makes this not so far-fetched, as seen with Oxford Nanopore adopting ‘adaptive’ small genome sequencing where reads are mapped and analysed in real time. In the meantime, de novo genome sequencing effort is likely to be focused on generating reference genomes for each plant family, and for specific research projects, either as stand-alone research investigating genome evolution, or to facilitate genome resequencing of infraspecific variation. Current barriers to the wider deployment of reference genome production are the cost and bioinformatic complexity of assembling large, repeat-rich, polyploid plant genomes. These challenges are particularly difficult in some evolutionary lineages, such as ferns, which mostly have large polyploid genomes.
Genomic sequencing can aid in numerous aspects of plant identification, discussed in Section 3. Here, we consider a representative set of examples where genome skimming, genome resequencing, and whole genome sequencing may be the preferred approaches.
Genome skimming is particularly suitable for studying a large number of diverse samples (
Genome resequencing is most appropriate for studying the population genetic structure and/or the relationship of closely related species. For example, a researcher may want to clarify species boundaries and improve species delimitation in a taxonomically complex species group (
The current use of reference genome sequencing is largely to understand the evolution and genome structure of plants, rather than directly being used for plant identification. For example, the production of a reference genome for a medicinal plant species may be a key resource for characterising the evolution of chemical diversity. Here, a reference genome may reveal the genes and genetic pathways involved in secondary compound production (e.g., for medicinal compounds in the orchid genus Dendrobium (
Whole genome sequencing is increasingly used in studies of plant identification. A diverse range of methods are available, from low coverage genome skimming used to recover organelle sequences for reconstructing plant phylogeny, through to high coverage sequencing and de novo nuclear genome assembly used to generate reference genomes for comparative analyses. Future developments in sequencing technologies and bioinformatic tools will make these methods increasingly accessible to the botanical community.
BUSCO – Benchmarking Universal Single Copy Orthologous genes used for assessing the completeness of a sequenced genome.
Contig – A single continuous sequence of DNA present in a genome assembly. Contigs in modern genome assemblies are hundreds of kilobases or multiple megabases in length.
DNA barcoding – The sequencing of a few standardised DNA regions to aid in plant identification.
Genome resequencing – Low to moderate coverage sequencing of samples that are compared to a reference genome.
Genome skimming – Low coverage sequencing of genomic DNA used to assemble multi-copy regions such as plastids and mitochondria.
HMW DNA – High-molecular-weight DNA (often over 100 kb), which is required for de novo genome sequencing.
k-mer – A subsequence of length k. For example, the 27-mers of a given set of sequences are all (overlapping) subsequences of length 27 base pairs in that set.
Reference genome – A high quality genome sequence from a single individual that is used as the foundation for genomic analysis.
Scaffold – An assembly of contigs separated by gaps of estimated length.
“Species” is often considered to be one of the living world’s fundamental categories, having its own ontological status, similar to a gene, cell, population, clade, or organism. Despite its importance, defining the species category is controversial, and many different species concepts have been proposed (
In this text, we will focus on concepts that view species as historical individuals. These are composed of the genetic material (i.e., an assembly of alleles), which is reproduced through time and expressed by ephemeral phenotypes. Historical individuals refer to different ontological kinds than individual organisms. Individuals lack defining properties; they exist restricted in time and space regardless of our ability to recognize them (
Pre-evolutionary taxonomy often divided organisms into binary groups based on absence or presence of certain properties, e.g., those with the ability to move versus those that are sedentary. However, we know from phylogenetics that attributes such as the ability to move have evolved many times. Defining animals as organisms having the ability to move results in a class. Phylogenetic taxonomy tries to identify and name monophyletic groups that have a certain spatio-temporal restriction. Thus, phylogenetic taxa are individuals, with no defining properties. In taxonomy, the concept of cryptic species, although not unambiguously defined (
The genetic material constitutes the replicators, i.e., the entity that evolves. The genes express themselves as interactors, the organisms, which are only vehicles for the genes, and everything beyond the information encoded in the DNA is an ephemeral expression of it (
The species concept debate nourishes a fundamental ontological problem: which categories in biological taxonomy are natural or even real? According to the rules of nomenclature (e.g., the International Code of Botanical Nomenclature;
Monophyletic groups, or clades, form a nested hierarchy. A phylogenetic tree is a convenient model to illustrate this. A named clade with family rank will normally include subclades that may be named as genera, and these will include named species. So, if the formal categorical ranks are applied to clades, there is no difference between for example genus and species, except that the latter can never include the former. The International Code of Phylogenetic Nomenclature (PhyloCode;
Population is a widely used concept among biologists, which usually refers to a geographically confined assemblage of individual organisms of the same species. Strict mathematical definitions enable the parameterization of certain aspects of population genetics if some simplifying assumptions are applied. For example, demographic history can be quantitatively studied using coalescent theory (
The implications of species concepts affect essential societal fields such as agronomy and plant breeding, agroforestry, pharmacology and medicine, horticulture, etc. This chapter has neither the ambition nor the goal to provide an exhaustive summary of all those implications. Here, we briefly discuss the implications of different species concepts for taxonomy and how species concepts have consequences for our perception of the current biodiversity crisis.
Some species concepts and their properties (e.g., the biological species concept, the ecological species concept, the phylogenetic species concept, etc.) can be incompatible and lead to the description and naming of differently delimited taxa. To understand the consequences of competing species delimitations, it is essential to acknowledge the central role of taxonomy in many biological studies and societal matters. Traditionally, taxonomy delimits species based on diagnostic morphological differences. However, taxonomists sometimes disagree, and there is a recognition of taxonomists as “lumpers” (favouring broad species delimitations) and “splitters” (favouring narrow). Moreover, morphologically delimited species may be different from those delimited according to other species concepts. The 21st century biodiversity crisis and the conservation efforts that arise from it are in need of a tool for quantitative biodiversity measurements. Species richness is often defined as the number of species per area and/or time, and is central in many biodiversity measurements, for example in Shannon entropy (
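To make concrete how the adopted species delimitation feeds into such measurements, here is a small sketch of the Shannon index computed from species abundance counts (the counts are purely illustrative; note how lumping two narrowly delimited species into one lowers the measured diversity):

```python
import math

def shannon_index(counts):
    """Shannon entropy H' = -sum(p_i * ln p_i) over species proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# The same 30 organisms, under a "splitter" vs a "lumper" delimitation:
print(round(shannon_index([10, 10, 10]), 3))  # three species: 1.099
print(round(shannon_index([20, 10]), 3))      # two species:   0.637
```

The measured biodiversity of an identical community thus changes with the species concept applied, which is exactly the problem raised above for conservation assessments.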
Species delimitation methods described in this chapter are DNA-based phylogenetic approaches investigating the evolutionary history of species. However, their goals and inherent properties should not be confused. While molecular phylogenetics aims to identify and infer the evolutionary relationships among clades, molecular species delimitation aims to estimate parameters identifying species.
There are several practical questions to address before designing and implementing a species delimitation study; here we briefly discuss a few of them.
The sampling strategy should capture sufficient intraspecific variation while also reflecting the greater interspecific divergence. In this light, the first advice is to sample from the entire known distribution range of the group under study. The second is to sample the different morphotypes of each taxonomic species. The underlying idea is that, as the phenotype is an expression of the genotype, sequencing a wide range of morphotypes per taxonomic species should facilitate a comprehensive study. Put another way, this allows taxonomic delimitations based on phenotypic data to be tested against genetic data.
Both single and multiple locus approaches have been developed and used in plant species delimitation. In this chapter, we deliberately address multi-locus approaches. Multi-locus approaches present several advantages over single locus methods
Identifying and quantifying all the parameters that influence a biological system is complex, and in stochastic modelling we make simplifying assumptions and approximations. A stochastic model enables quantification of differences between the input data and what the model predicts. Conclusions may therefore be drawn on which processes are responsible for those differences. For example, the linear regression model has two parameters, the slope and the intercept. Given any sampled two-dimensional data, we can estimate the best fitting values for the two parameters. However, the fit will depend on the model’s assumptions (e.g., linearity, random and independent sampling, homoscedasticity, etc.). These assumptions can be relaxed by introducing new parameters that will provide a better fit to the data. Although this increases the computational effort, it also reduces the explanatory power, because there will be less data per parameter.
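The linear regression example can be made concrete with a minimal least-squares fit, which estimates the two parameters from sampled data (pure-Python sketch under the usual assumptions of the model):

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # exactly linear: y = 2x + 1
print(fit_line(xs, ys))    # (2.0, 1.0)
```

With real, noisy data the residuals between observations and the fitted line are the quantified differences mentioned above; relaxing assumptions (e.g., allowing curvature) adds parameters and improves the fit, at the cost described in the text.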
We present two types of parametric multi-locus delimitation approaches: allelic-clustering and coalescent-based methods. Uni-locus approaches and concatenation methods do not take advantage of the information inherent to the discordance among gene trees in a multi-locus dataset. We also note that approaches other than parametric modelling are possible, for example by simply plotting data and analysing the pattern and classifying the data points according to their Euclidean distances to each other (
Species delimitation can be viewed as a process where sampled individuals (which can be alleles or organisms) are clustered. In population genetics, a class of methods, often referred to as STRUCTURE-like methods due to the original methodology proposed by
Alleles are in linkage equilibrium when they occur randomly and independently in a population; their frequencies are then those expected according to the Hardy-Weinberg principle. However, biological processes often violate linkage equilibrium. Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci in a population: they are in LD when they do not occur randomly and are not independent from each other. LD provides information about population genetic phenomena (i.e., migration, mutation, selection, genetic drift). In a population, LD is increased by selection, population structure, and genetic drift, and is eroded by recombination. STRUCTURE (
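The LD coefficient itself is straightforward to compute as D = p_AB − p_A·p_B, the difference between the observed frequency of a two-locus haplotype and the frequency expected under independence (toy haplotype sample; this is the textbook two-locus definition, not STRUCTURE's full model):

```python
def ld_coefficient(haplotypes):
    """D = freq(AB) - freq(A)*freq(B) for two biallelic loci.
    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples."""
    n = len(haplotypes)
    p_ab = sum(1 for h in haplotypes if h == ("A", "B")) / n
    p_a = sum(1 for h in haplotypes if h[0] == "A") / n
    p_b = sum(1 for h in haplotypes if h[1] == "B") / n
    return p_ab - p_a * p_b

# Perfect association between alleles A and B gives positive D:
sample = [("A", "B")] * 4 + [("a", "b")] * 4
print(ld_coefficient(sample))  # observed 0.5 vs expected 0.25 -> D = 0.25
```

D = 0 under linkage equilibrium; a non-zero D is the signal that LD-based methods interpret in terms of structure, selection, or drift.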
An interesting feature of these approaches is that they can directly cluster the genetic material, the alleles, rather than the phenotypic expressions (i.e., the organisms). Thus, they are directly clustering the replicators, and not the interactors, which may be heterogeneous assemblages of such clusters (i.e., hybrids). A shortcoming of allelic clustering methods is that unlike coalescent-based phylogenetic methods they do not assess the phylogenetic divergence of populations.
Ideally, species delimitation methods should parametrize gene flow (i.e., migration) and incomplete lineage sorting, which happen when the alleles of a certain gene coalesce deeper in the species tree than the species divergence. Indeed, these two phenomena are the primary causes of gene tree discordance when sampling unlinked genes. The eukaryotic nuclear genome usually consists of several chromosomes, and within each chromosome, recombination occurs between linkage groups. By contrast, organellar genomes, which are haploid, are usually considered as non-recombining (but see e.g.,
Population genetics aims to understand how and why allelic frequencies vary within and between present populations. Two approaches exist for investigating ancestor-descendant relationships that centre on genetic drift. One approach is prospective/forward where the probability of identity-by-descent for allele copies (i.e., the probability that allele copies are descendants from a single common ancestor) is evaluated. Ancestor-descendant relationships are traced forward in time in order to understand the present pattern of allele copies.
The other approach is a retrospective/backward probabilistic approach called coalescent theory (Figure
A. Genealogy of a sample of 4 genes (N = 4, in red) in a population of 8 genes at the present time, back to a common ancestor. B. Genealogy of the 4 sampled extant alleles (in red) back to the most recent common ancestor, with three coalescent events (in blue).
The most basic and simple coalescent model assumes that the population conforms to the idealised conditions set by the geneticists
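Under these idealised Wright-Fisher conditions, and measuring time in units of 2N generations, the waiting time until the next coalescent event among k lineages is approximately exponentially distributed with rate k(k−1)/2 (the number of lineage pairs). A minimal simulation under these assumptions:

```python
import random

def simulate_coalescent_times(n_samples, seed=None):
    """Waiting times (in units of 2N generations) between successive coalescent
    events, from n_samples lineages down to their common ancestor."""
    rng = random.Random(seed)
    times = []
    k = n_samples
    while k > 1:
        rate = k * (k - 1) / 2            # number of pairs that could coalesce
        times.append(rng.expovariate(rate))
        k -= 1                            # one pair merges per event
    return times

waits = simulate_coalescent_times(4, seed=1)
print(len(waits))  # 3 coalescent events for 4 sampled alleles, as in the figure
```

Note how the early waiting times (many lineages, high rate) tend to be short, while the final coalescence of the last two lineages dominates the depth of the genealogy.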
In the absence of migration between the tree branches, the gene tree splits will always be as old or older than the population branching. As the gene tree branching orders are random in Wright-Fisher populations, the MSC model efficiently handles incomplete lineage sorting (ILS), which is one reason why gene trees are different from species trees. Given the ambiguity of the term “species”, it was perhaps unfortunate that it was used for coining the name of the model.
As with parametric phylogenetic methods in general, parametric phylogenetic species delimitation methods can be based on the Maximum Likelihood (ML) criterion, or on Bayesian approaches. These can be further divided into implementations that use an exact likelihood function, which estimates all parameters of the model, and approximations, where some parameters are fixed. According to
Approximate likelihood ML phylogenetic methods typically work by dividing the gene tree and species tree estimation into two steps, such that gene trees obtained from phylogenetic analyses of each alignment become input data for the species tree estimation. Thus, the gene trees are point estimates for the genealogies at each locus. In addition, the effective population size is a crucial parameter for the MSC model and finding the maximum likelihood value for it is computationally intractable (
Method name | Approach | Statistical framework | Input data | Likelihood function | Example of studies using the method |
---|---|---|---|---|---|
BP&P ( | discovery/validation | Bayesian | Multiple sequence alignment (MSA) | Full likelihood | |
SpedeSTEM ( | validation | Maximum likelihood | Gene trees | Approximate likelihood | |
Heuristic method ( | discovery | Maximum likelihood | Gene trees | Approximate likelihood | |
STACEY ( | discovery | Bayesian | Multiple sequence alignment | Full likelihood | |
DISSECT ( | discovery | Bayesian | Multiple sequence alignment | Full likelihood | |
PHRAPL ( | validation | Model selection | Gene trees | Approximate likelihood | |
Bayesian full-likelihood implementations of the MSC model can theoretically accommodate unlimited numbers of sequences per locus, but are dependent on the approximations of the posterior densities that the Markov Chain Monte Carlo (MCMC) techniques provide. Extensive exploration of convergence and mixing are necessary to ensure that the results from MCMC are reliable (
As the MSC model is ultimately based on a phylogenetic tree, parametric implementations of species delimitation in essence identify extant species as the tip branches of the species tree. Software implementations such as *BEAST and StarBeast2 assume that sequences are assigned to the correct species, which are defined to be Wright-Fisher (WF) populations where the gene trees are distributed according to the coalescent model. DISSECT (
As alternative hierarchical species delimitation models differ with respect to the assignments of sequences to species, this leads to stochastic models having different sets of parameters. To evaluate the fit of the data to different delimitations, model selection criteria are relevant. In a maximum likelihood framework, hierarchical likelihood ratio tests can be applied when models are nested (e.g., when the split of A and B is compared to A and B as a single species). However, such methods cannot be applied when classifications are non-nested, e.g., when AB and C are compared to A and BC. In such cases, information-theoretical approaches must be applied (
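As a sketch of such an information-theoretical comparison, the Akaike Information Criterion (AIC = 2k − 2 ln L) can be computed for competing, possibly non-nested delimitations (the log-likelihoods and parameter counts below are purely illustrative, not from any real analysis):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical delimitations of three populations A, B, C:
models = {
    "AB|C":  aic(-1050.0, 5),   # A and B lumped
    "A|BC":  aic(-1048.0, 5),   # B and C lumped (non-nested with the above)
    "A|B|C": aic(-1046.5, 7),   # all three split; more parameters
}
best = min(models, key=models.get)
print(models, best)  # best: 'A|BC'
```

Unlike a likelihood ratio test, this comparison is valid even though AB|C and A|BC are not nested within each other, and it penalises the extra parameters of the fully split model.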
The methods cited above assume no migration (hybridization, horizontal gene transfer) between branches, and instantaneous “speciation”, i.e., divergence is completed in one generation and no migration is permitted after that. A more flexible, approximate likelihood approach to species delimitation is provided by PHRAPL (
All the species delimitation models that we have introduced so far are developed for diploid genomes. However, allopolyploidy is traditionally thought of as a speciation mechanism, where the allopolyploid hybrid instantaneously becomes reproductively isolated from its parents. Under this view, the problem of species delimitation becomes a problem of tracing the allopolyploidization event, and species delimitation of the descendants will follow the same logic as species delimitation of diploid genetic lineages. The models mentioned below are phylogenetic methods, which potentially can be extended in a similar fashion to the MSC-based methods described above. However, a special complication is the fact that it is usually difficult to assign sequences to subgenomes a priori.
The traditional way to model phylogenetics, and indeed also the MSC model, assumes reproductively isolated species (no migration after divergence) and bifurcating phylogenies. The genetic information is transmitted from ancestors to descendants without modelling gene flow between branches, and with bifurcations representing the speciation events.
However, hybridization and introgression are common natural processes which challenge these assumptions. Hybridization can be followed by whole genome duplication (WGD): this phenomenon is called allopolyploidization and is a significant factor in speciation due to the reproductive isolation of the newly formed polyploid from its diploid parents. Note that WGD may also occur within lineages and is then termed autopolyploidy. Here, we concentrate on the former type.
WGD is characteristic of all major land plant lineages (
Reticulation events challenge the biological species concept, which states that species are different entities that cannot interbreed to produce fertile offspring. This view leads back to the philosophical perception of species and the parameters describing these entities. A hypothetical species genome undergoing several rounds of allopolyploidization will end up with subgenomes carrying genetic information difficult to trace back. In other words, this hypothetical genome would be a mosaic or a melting pot of the parental genomes. With such a changing genome, how can we identify a reticulate entity according to set parameters?
Inferring species boundaries amongst extant allopolyploid plants requires identification of the two parental sub-genomes to allow accurate inference of allopolyploid ancestral events. The genes used to trace the evolution of a polyploid genome carry information from both parental genomes. Note however, that there may have been “normal” branching speciation events after the polyploidization events, and the parental species may have gone extinct. Therefore, a phylogenetic view is necessary.
Chloroplast and mitochondrial DNA usually carry information only from one parental genome, usually the maternal lineage. Nuclear ribosomal DNA (nrDNA), which in eukaryotes contains many tandem repeats (
In order to trace polyploid ancestry, genomes must be resolved at the haplotype level. To recover the full haplotype sequence, the DNA reads must overlap and the overlap should cover informative SNPs. Under these conditions, correct haplotype phasing can be achieved for diploid genomes (
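The read-overlap condition can be illustrated for the diploid case: reads spanning two heterozygous SNP sites reveal which alleles co-occur on the same molecule, allowing the two haplotypes to be separated (toy example assuming a single sequencing error; real phasing tools model errors and many sites jointly):

```python
from collections import Counter

def phase_two_sites(reads):
    """Infer the two haplotypes over two SNP sites from reads spanning both.
    reads: list of (allele_at_site1, allele_at_site2) observations.
    Returns the two pairings supported by the most reads."""
    support = Counter(reads)
    return [hap for hap, _ in support.most_common(2)]

# Reads from an individual heterozygous A/G at site 1 and C/T at site 2,
# with one erroneous observation ('A','T'):
reads = [("A", "C")] * 5 + [("G", "T")] * 4 + [("A", "T")]
print(phase_two_sites(reads))  # [('A', 'C'), ('G', 'T')]
```

If no reads spanned both sites, the pairing of alleles across sites would be unresolvable, which is exactly why overlapping reads covering informative SNPs are required, and why polyploid phasing (more than two haplotypes) is so much harder.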
Phylogenetic methods tracing allopolyploidy aim at assigning homoeologs (i.e., subgenomes) to parental genomes. However, the task is challenging for two reasons: first, biological phenomena such as recombination and gene loss result in the partial loss of parental genetic information; and second, modelling hybridization is computationally challenging. AlloppNET (
AlloppNET is implemented in the BEAST1 framework (
The MDC approach by
PADRE (Package for Analysing and Displaying Reticulate Evolution;
Conceptually, various species concepts attempt to accommodate genealogical, phenotypic, as well as future aspects, and these need not lead to identical delimitations. There is an emerging view of species as the branches of the phylogenetic tree, and we have focused on species as historical individuals composed of vertically transmitted genetic information. The MSC model allows scientists to view genetic data and rigorously test monophyly as well as branch content. However, most current implementations of the MSC model identify branches that most biologists would view as populations, and furthermore, they are not capable of including migrations of alleles. STRUCTURE-like methods have the capability to cluster alleles directly, but they depend on assumptions similar to those of the MSC model and lack a phylogenetic component. In principle, the MSC model can be extended to accommodate migrations, and a few recent attempts exist (e.g., DENIM; PHRAPL). A pluridisciplinary approach, involving genomic and evolutionary concepts implemented in a powerful statistical framework, is anticipated for future progress. Beyond its importance within biology, species delimitation has important societal implications. The current “sixth mass extinction” calls for implementing conservation programs that use appropriate species richness assessments and species definitions in order to accurately measure and limit biodiversity loss. The necessity to agree on a given species definition in a given context (i.e., biodiversity erosion) does not in itself solve the ontological question “what is a species?”. Fast-moving next-generation sequencing technologies disclose the genomic information necessary to study virtually any taxonomic group, but there is an urgent need for conceptual development, as well as for suitable models with sufficient biological realism in which to interpret the data.
AIC / BIC – AIC stands for Akaike Information Criterion and BIC for Bayesian Information Criterion. They are estimators of the quality of statistical models for a given set of data, providing a means for model selection.
Allopolyploidy – Inheritable condition of having more than two sets of chromosomes as a result of hybridization followed by whole genome duplication (WGD). Typically, allopolyploids have disomic inheritance, meaning that there is bivalent pairing of chromosomes during meiosis.
Autopolyploidy – Inheritable condition of having more than two sets of chromosomes received from a single ancestral taxon, as opposed to allopolyploidy. Autopolyploids have polysomic inheritance, meaning that there is multivalent pairing of chromosomes during meiosis.
Ancestral polymorphism – Genetic variation in a species that arose prior to speciation. Synonymous with “deep coalescence”.
Anomalous zone of the MSC – The region of species-tree parameter space, characterised by short internal branches relative to population size, in which the most probable gene tree topology differs from the species tree topology.
Bayes Factors – The ratio of the marginal likelihoods of one parametric model to another. It can be interpreted as a measure of the weight of evidence for one hypothesis compared to another.
Bayesian approaches – Based on the Bayes theorem, describing the probability of an event based on prior knowledge of conditions related to this event.
Bifurcating tree – A graph where branches (edges) give rise to daughter branches, and never merge.
Biodiversity – Association of the two words “biological” and “diversity”. It refers to the variety of life that is found on Earth.
Birth-death model – A continuous-time Markov process with two parameters, births and deaths. In a phylogenetic context, this translates to branching events being births, and extinctions being deaths. A birth-death model for a phylogenetic tree will in its most simple form have constant probabilities for branching events and extinctions.
Clade – A group of taxa that are monophyletic - composed of a common ancestor and all its descendants - on a phylogenetic tree.
Class – A grouping of entities based on defined criteria.
Coalescent theory – Models how alleles in a population have originated from a common ancestor. In its most simple form, it assumes no recombination, no selection, no gene flow, and no population structure. This implies that each allele is equally likely to have been passed on from one generation to the next. The model looks backwards in time, merging alleles into coalescence events according to a random process.
Cryptic species – Term referring to species that cannot readily be distinguished morphologically.
Deep coalescence – When two or more alleles of the same species have their most recent common ancestor in an ancestral species. Synonymous with “ancestral polymorphism”.
Discovery methods – Species delimitation methods that do not require pre-defined delimitations to assess, as opposed to verification/validation methods.
Epistemology (in the context of biology) – Concerns the theory of knowledge. For example, how can we know what a species is, is an epistemological question.
Gene flow – The transfer of genetic material from one population to another.
Gene tree – Phylogenetic tree of a gene.
Gene tree discordance – When gene trees from the same set of organisms are different in topology and/or branch lengths.
Genetic drift – Variation in allele frequency due to random sampling of alleles between generations; its effects are strongest in small populations.
Genotype – The complete set of genetic, inheritable information (DNA) of an organism.
Hardy-Weinberg principle – A population genetics principle, also known as the Hardy-Weinberg equilibrium/model/theorem/law, which states that allele frequencies in a large population will remain constant from generation to generation in the absence of evolutionary influences such as mutation, selection, migration, and genetic drift.
Hierarchical likelihood ratio tests – Statistical tests estimating which of two models best fits a dataset. The competing models must be hierarchically nested: the more complex model must differ from the simpler one by one or more additional parameters.
Historical individual – Refers to an assembly of alleles that are reproduced through time.
Hybridization – Interspecific breeding.
Identity-by-descent – Identical nucleotide sequences in two or more individuals inherited from a common ancestor without recombination; the identical segment has the same origin in these individuals.
Incomplete lineage sorting – A phenomenon in population genetics in which ancestral copies of alleles fail to coalesce into a common ancestral copy until deeper (older) than preceding speciation events. See also ancestral polymorphism and deep coalescence, which refer to the same phenomenon.
Interactor – Term defined in an evolutionary context by the philosopher of biology David Hull as a counterpart to the replicator; it refers to organisms as ephemeral vehicles for genes (the replicators, sensu Richard Dawkins).
Introgression – Also known as introgressive hybridization. It is the transfer of genetic material from one species into the gene pool of another one by repeated backcrossing of an interspecific hybrid with one of its parent species.
Linkage disequilibrium – Non-random association of alleles at two or more loci in a population.
Linear regression model – A linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
Markov Chain Monte Carlo – In statistics, methods that sample from a probability distribution by constructing a Markov chain.
Markov chain – A random process describing a sequence of possible states or events where each state/event depends only on the previous one, independently from older ones.
Maximum Likelihood approaches – Estimation of parameters of an expected probability distribution given observed data. The estimated parameters are those that make the observed data most probable.
Metapopulation – A group of spatially separated populations of the same species which interact at some level.
Migration – Gene flow, including phenomena such as hybridization, introgression, and horizontal gene transfer.
Monophyly – The condition in which a group of taxa composed only of a common ancestor and all its lineal descendants form a single clade.
Ontology (in the context of biology) – The field that divides living things into categories to better understand them and how they fit into the world. For example, the ontological nature of “species” answers the question “what is a species?”.
Phenotype – The expression of the genotype modified by environmental factors.
Plesiomorphy – In phylogenetics, a plesiomorphy is an ancestral character state.
Polyploidization – Event that creates more than two copies of the entire genome of a taxon.
Population – In general, individuals belonging to the same species that live in the same geographic area at the same time. The effective population, by contrast, consists of those individuals that contribute offspring to the next generation.
Reciprocal monophyly – When two sets of taxa form exclusive clades.
Replicator – Term defined in an evolutionary context by the biologist Richard Dawkins: genetic material that evolves and replicates. See also interactor.
Reticulation – In phylogenetics, a reticulation is when a lineage originates by the merging of two ancestral lineages.
Speciation completion rate – Parameter describing the rate of transition from an incipient to a fully independent species. Describes protracted speciation, in contrast to the instantaneous speciation of the birth-death model.
Species richness – The number of different species represented in an ecological community, landscape, or region.
Species tree – Phylogenetic tree representing the evolutionary relationships among species.
Stochastic modelling – A way of describing a certain set of random parameters with their associated probability distributions.
Synapomorphy – Derived traits shared by a group of taxa due to their inheritance from a common ancestor.
Taxon (plural ‘taxa’) – A set of genotypes and their associated expressed phenotypes that are formally recognized.
Taxonomy – The branch of science where biological taxa are described, named, and identified.
Validation/verification methods – Species delimitation methods that require a subset of pre-defined delimitations, in contrast to discovery methods, which consider all possibilities.
Plant DNA can be extracted for species identification from a wide variety of sample types, ranging from fresh, museum, or ancient plant tissue collections that represent a single taxon to highly processed samples that contain multiple individuals or taxa, including food and medicine (Chapter 6 DNA from food and medicine), water (Chapter 3 DNA from water), soil (Chapter 4 DNA from soil), pollen (Chapter 5 DNA from pollen), faeces (Chapter 7 DNA from faeces), or ancient sediments (Chapter 8 aDNA from sediments). Section 2 of this book explores how DNA can be used for plant identification through either targeted (where select regions of the genome are used) or non-targeted (capturing representations of the full genome) approaches. Targeted approaches include barcoding for single taxon samples (Chapter 10 DNA barcoding, Chapter 13 Barcoding - High Resolution Melting, and Chapter 14 Target capture) and metabarcoding for samples representing multiple taxa (Chapter 11 Amplicon metabarcoding). Non-targeted approaches form the field of genomics. For single taxon samples, genome resequencing and whole genome sequencing (Chapter 16 Whole genome sequencing) are used, while in samples containing multiple taxa, metagenomic methods are used (Chapter 12 Metagenomics).
Studies conducting species identification either use samples of known identity to explore relationships, or use reference identifications to assign identity to unknown samples. In the first category, labelled samples are used to explore evolutionary relationships and to assign species identity based on some measure of distance or clustering (Chapter 19 Systematics and evolution, Chapter 20 Museomics, and Chapter 21 Palaeobotany). In contrast, the second category of studies utilises databases with predefined species classifications to assign identity to unknown samples (Chapter 22 Healthcare, Chapter 23 Food safety, Chapter 24 Environmental and biodiversity assessments, Chapter 25 Wildlife trade, and Chapter 26 Forensic genetics, botany, and palynology).
The analytical methods used for species identification can be categorised into three groups: i) database alignment analyses, ii) alignment-free methods, and iii) sample alignment analyses (Box 1).
In this chapter, we outline common sequence pre-processing steps used in species identification projects, and then discuss how species identification from sequencing data can be accomplished using the three analytical categories mentioned here.
“Garbage in, garbage out” is a phrase that any experimentalist should keep in mind when setting up a species identification project. Obtaining robust and accurate species identities from sequencing data requires input reads that are high-quality and filtered for contamination and sequencing errors. This section outlines the steps necessary to check that data are of sufficient quality for species identification, as well as the processing steps necessary for sequence data analysis.
Sequencing reads generated on short and long read platforms contain artefacts that need to be filtered or corrected in order to isolate high-quality reads for use in downstream analyses. Sequencing artefacts include reduced base quality at read ends in short-read data (common in Illumina sequencing), and elevated rates of homopolymer errors in long-read data generated with Nanopore technologies (Chapter 9 Sequencing platforms and data types). Correcting these errors is a mandatory first step in most bioinformatics analyses, as poor quality control of raw sequence reads can result in inconclusive or incorrect species identification. Several quality control software packages including FastQC (Andrews et al. 2010), MultiQC (
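As a minimal illustration of this quality-filtering step (a sketch, not a substitute for dedicated tools such as FastQC), reads can be screened by their mean Phred score. The Q20 threshold and the toy records below are illustrative assumptions.

```python
# Illustrative sketch: filter FASTQ-style reads by mean Phred quality.
def mean_phred(quality_string):
    """Mean Phred score of a read, assuming Sanger/Illumina 1.8+ encoding (ASCII offset 33)."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

def filter_reads(records, min_quality=20):
    """Keep (header, sequence, quality) records whose mean quality meets the threshold."""
    return [rec for rec in records if mean_phred(rec[2]) >= min_quality]

reads = [
    ("@read1", "ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40: high quality
    ("@read2", "ACGTACGT", "!!!!!!!!"),  # '!' encodes Phred 0: discarded
]
kept = filter_reads(reads)
```

Production tools additionally trim low-quality read ends rather than discarding whole reads, which this sketch omits for brevity.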
Sequence library preparation methods append non-representative, non-biological sequences, such as adapters and tags for multiplexing, to the DNA fragments. These sequences should therefore be removed during sequence processing to avoid failure in species identification or even a false species identification. Tools such as AdapterRemoval (
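A minimal sketch of 3' adapter trimming follows; the adapter sequence is a hypothetical Illumina-style example, and real tools such as AdapterRemoval also handle sequencing-error mismatches and quality-aware trimming, which this sketch omits.

```python
# Minimal sketch of 3'-adapter trimming (exact matching only; adapter is illustrative).
ADAPTER = "AGATCGGAAGAGC"  # hypothetical example adapter, not a universal sequence

def trim_adapter(read, adapter=ADAPTER, min_match=5):
    """Remove the adapter, or a partial adapter hanging off the read's 3' end."""
    idx = read.find(adapter)          # full adapter occurrence anywhere in the read
    if idx != -1:
        return read[:idx]
    # Partial adapter at the 3' end: try the longest adapter prefix first
    for n in range(len(adapter) - 1, min_match - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read
```

The `min_match` cutoff guards against trimming biological sequence that matches a very short adapter prefix by chance.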
PCR (polymerase chain reaction) amplification of template DNA can introduce errors including artificial base differences, chimeras, and heteroduplex molecules (
Targeted sequencing experiments, where a specific region of the genome or plastome is sequenced, require a few additional quality control steps to remove sequencing artefacts. Tools such as obitools (
There are specific challenges to be considered when analysing ancient DNA samples, including archaeological and herbarium samples. DNA damage, primarily driven by chemical changes in the DNA post-mortem, is prevalent in aDNA samples (see Chapter 2 DNA from museum collections). Programs such as mapDamage (
Database alignment methods are the most intuitive class of search-based species identification from sequencing data and have been used for the better part of the last three decades to identify species that are the putative sources of sample DNA or protein sequences. These methods compare the sequencing reads, either directly in the form of short reads or in the form of assembled contigs, to a reference database of curated sequences. Widely used alignment tools include BLAST (
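The logic of database-alignment identification can be caricatured as a best-hit search: compare the query to every reference and report the closest match. Real tools such as BLAST use local alignment with statistical scoring; the ungapped percent-identity comparison over pre-aligned, equal-length fragments and the mini database below are purely illustrative assumptions.

```python
# Toy best-hit search against a reference database of equal-length barcode fragments.
def percent_identity(a, b):
    """Percent identity of two equal-length, pre-aligned sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def best_hit(query, reference_db):
    """Return (species, identity) of the best-matching reference sequence."""
    return max(((sp, percent_identity(query, seq)) for sp, seq in reference_db.items()),
               key=lambda hit: hit[1])

# Hypothetical mini reference database
db = {"Species A": "ACGTACGTAC", "Species B": "TTGTTCGGAA"}
species, ident = best_hit("ACGTACGTAA", db)
```

Note that, as discussed below for large databases, a best hit is only as meaningful as the database: a high-identity hit to the wrong species is returned whenever the true source species is absent.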
In theory, alignment-based approaches using databases can be used for species identification on sequences generated from the entire spectrum of molecular methods detailed previously. However, high computational requirements, coupled with logistical issues such as the unavailability of appropriate databases, make these methods best suited to targeted sequencing approaches, especially barcoding and metabarcoding. In these approaches, only a limited number of unique sequences are used as the initial data input, making these methods substantially less computationally expensive.
The database choice plays an integral role in the sensitivity and specificity of local alignment algorithms and whether the alignment approaches return a species identification. Accurate and positive species identifications are more likely with databases containing high numbers of closely related species. Global databases, such as the NCBI nucleotide database and NCBI non-redundant protein database (
Alternative options to consider are national or local sequence databases that have been assembled by genetic and genomic researchers to represent the species of a country or region. Prime examples include DNAmark (
Does the choice of using DNA or protein make a difference in the database alignment algorithm? Yes! DNA sequences provide more sensitivity while amino acid sequences are more robust. What do we mean by that? DNA sequences can provide a better resolution in terms of describing the evolutionary relationships between closely related species. Proteins on the other hand can illuminate much older evolutionary relationships, and tend to provide more robust identifications (
Alignment based methods can be used on both raw short reads directly from the sequencing machines and on assembled contigs, where multiple short reads are stitched together into longer stretches of DNA. These approaches come with their own pros and cons. Ease of use is the primary selling point in using short reads directly from the sequencing machine. Using assembled contigs requires additional steps, but the increased length can result in lower error rates and longer read regions, leading to better resolution. The use of assembled contigs additionally takes advantage of databases that allow for alignment of longer regions, including possibly the entire target region (
The rapid advance and adoption of second and third generation sequencing technologies has led to an exponential increase in the numbers of sequencing studies that employ either whole genome resequencing or genome skimming to characterise sample genomes. With these large genomic datasets, alignment based approaches can be computationally taxing (
Alignment-free approaches come in many flavours, including k-mer based methods, micro-alignments, Fourier transformation methods, and information theory methods (
Alignment-free methods are primarily restricted for use with non-targeted sequencing approaches. This is due to the short length of targeted regions leading to a limited number of k-mers, which restricts the ability of these approaches to result in meaningful inferences. Although k-mer based methods might look like ideal candidates for use in metagenomics, the fact that metagenomic samples are derived from multiple sources in varying proportions makes it difficult to successfully isolate individual taxa (
The depth to which samples are sequenced affects the accuracy of the dissimilarity metric estimates computed in alignment-free methods. As sequencing depth decreases, the variance of these estimates increases even if the estimates themselves remain unbiased. This variance is propagated into downstream analyses. Thus, the robustness of these methods should be verified when using very low coverage sequencing data (
Contamination can also be tricky to deal with in alignment-free methods using mixed bags of raw sequencing reads, and therefore filtering for contamination using tools such as BlobTools (
K-mer frequency profiles of sequences are used to compute dissimilarity scores between those sequences. There are many distance metric options that can be used to compute the dissimilarity score, e.g., Euclidean, inner product, Kullback-Leibler divergence (relative entropy), and mismatches (Jaccard). The most commonly used distance metric is the Jaccard distance, since it is easy to compute and corresponds to nucleotide changes. Specifically, the Jaccard distance ranges from 0.0 to 1.0, where 0.0 corresponds to identical k-mer profiles, and 1.0 implies no overlap in k-mers. By computing the pairwise Jaccard distances between sequences from an unknown sample and a set of reference sequences with known species identity, we can assign our unknown samples to the closest species among the set of reference sequences. Further, the dissimilarity measures can be used to build a phylogeny of the sequences (
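The Jaccard computation described above can be sketched as follows; k = 3 and the two hypothetical reference sequences are illustrative assumptions (real applications use much larger k and genome-scale k-mer sets).

```python
# Sketch: k-mer Jaccard distance between sequences, and assignment to the closest reference.
def kmers(seq, k=3):
    """Set of all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b, k=3):
    """1 - |A ∩ B| / |A ∪ B| over the two k-mer sets: 0.0 = identical, 1.0 = no shared k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)

# Assign an unknown sequence to the closest of two hypothetical references
refs = {"Species A": "ACGTACGTACGT", "Species B": "TTTTGGGGCCCC"}
query = "ACGTACGTACGA"
closest = min(refs, key=lambda sp: jaccard_distance(query, refs[sp]))
```

The same pairwise distances, computed between all references, can feed a distance-based tree-building method, as noted above.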
An advantage of k-mer based methods is their applicability to different sequencing data types, which allow combining sequence data from different experiment types. For example, one can compute the k-mer frequency profiles directly from the reads or from scaffold sequences. All subsequent steps to compute distances can be applied without regard to potentially different sequence sources.
Several alignment-free methods have been developed in the last few years, incorporating several of the k-mer algorithms (
Sample alignment methods are the foundation of molecular taxonomy, phylogenetic classification, and population genetics, and allow the exploration of evolutionary relationships and genetic distance between samples. These methods include de novo assemblies and the creation of reference genomes, as well as assembly or mapping using a reference. There is inherent bias in terms of reference availability, and inadequate reference mapping can result in skewed representations of genetic similarity in downstream analysis.
The use of references to inform the assembly of contigs to produce scaffolds and create sample-specific consensus sequences representing genes, gene regions, and genomes is inherently biased towards the available references. Popular mapping tools include Global BWA (
High quality, in-depth sequencing is required to produce a de novo assembly. A de novo assembly will however avoid any inherent biases introduced by using references for assembly, and in turn can be used as reference in future projects. For an outline of the processes involved in de novo assemblies, please see reviews by (
The foundation of any tree-building or comparative gene analysis is multiple sequence alignment (MSA). MSA matches up areas of the genome across samples and allows for comparison. MSA algorithms are based on maximising sum-of-pair scores through heuristic progressive (input-order dependent) alignments (
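The sum-of-pairs objective that these progressive heuristics maximise can be sketched as follows; the match, mismatch, and gap values are illustrative assumptions, not the scoring scheme of any particular tool.

```python
# Sketch of the sum-of-pairs (SP) score of a multiple sequence alignment: each column
# contributes the summed pairwise score over all pairs of sequences in that column.
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    if x == "-" and y == "-":
        return 0          # gap-gap pairs are conventionally ignored
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(alignment):
    """SP score of an alignment given as a list of equal-length rows."""
    return sum(pair_score(col[i], col[j])
               for col in zip(*alignment)
               for i, j in combinations(range(len(alignment)), 2))

aln = ["ACGT-", "ACGTT", "AC-TT"]
score = sum_of_pairs(aln)
```

Because the SP score is evaluated column by column, progressive aligners can only approximate its maximum: once two sequences are aligned, their relative gaps are fixed, which is why the result depends on input order.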
Widely used MSA tools include ClustalW (
There are several important factors that can determine or influence which species assignment method is ultimately chosen. The study design and experimental question, as well the DNA source and extraction methods are important factors. For example, genome skimming and metagenomic studies might be well suited to alignment-free methods (
It is thus important to be aware of the strengths and limitations of different species assignment methods and to choose the method best suited to the biological questions being posed and the experimental design used to generate the sequencing data. For alignment based methods, it is important to remember that the species identification results are only as good as the databases the sequences are being aligned to; this applies to both targeted sequencing and genomic studies. Further, results from alignment against large databases must be interpreted carefully, since the order of the results depends on both the sequence identity and the number of times a certain species is represented in the database. For alignment-free methods, such as k-mer based methods, sequencing depth and the quality of the k-mer profiles from target species (database) are important factors. Also note that the value of k in the k-mer profile generation is an important parameter to tune. Finally, metagenomic taxonomy assignment tools again depend, in varying degrees, on external databases for identification of taxa.
Contamination – DNA from a non-targeted taxon.
Contig – A single continuous sequence of DNA present in a genome assembly. Contigs in modern genome assemblies are typically hundreds of kilobases to multiple megabases in length.
K-mer – A sequence of length k. For example, the 27-mers of a given set of sequences are all (overlapping) subsequences of length 27 base pairs.
Multiplexing – Combining tagged DNA fragments from multiple samples before sequencing.
Non-targeted approaches (genomics) – Capturing representations of the full genome. See Chapter 16 Whole genome sequencing
Reference genome – A high-quality genome sequence from a single individual that is used as the foundation for genomic analysis.
Scaffold – An assembly of contigs separated by gaps of estimated length.
Sequencing depth (= coverage) – The number of unique reads that include a given nucleotide, i.e., the depth of coverage at that position.
Tags – DNA fragment labels for multiplexing.
Targeted approaches (genetics, including amplicon sequencing) – Where the breadth of coverage is defined and a smaller amount than the whole genome is used.
Systematics is the field of biology that studies biological diversity (or biodiversity) and its evolutionary history (
An integral part of systematics is taxonomy, which focuses on the identification, description, naming, classification and inventory of taxa (
A major challenge that comes from these different species concepts is that they may be incompatible and often lead to different conclusions on the boundaries of what should be considered the same or different species (de Queiroz 2007). Given the variety of species concepts and their definitions, de Queiroz (2007) suggested two solutions. The first solution identifies commonalities in the different species concepts, resulting in a unified concept where species are defined as separately evolving metapopulation lineages. The second solution emphasises the necessity of separating the problem of species concepts from that of species delimitation. In the present chapter, species delimitation refers to the practice of determining boundaries of species based on empirical data. For a comprehensive discussion of the various species concepts and delimitation approaches the reader is referred to Chapter 17 Species delimitation.
Commonly used methodological approaches in plant systematics include traditional comparative morphological/anatomical systematics, chemosystematics, and molecular systematics, which utilise different sources of data as input for inference. In traditional comparative morphological/anatomical systematics, the grouping of taxa is primarily based on morphological/phenotypic similarity (
The primary aim of systematics is to recognise evolutionary lineages where the genotypes are reproduced through time. The phenotypes are ephemeral manifestations of these genealogical lineages. Historically, the Aristotelian view of taxa (“natural kinds”) having essential features (i.e., to qualify as a vertebrate, the organism must develop vertebrae) has dominated biological systematics. Some philosophers argue that the essence of those natural kinds may exist regardless of humans’ abilities to recognise them, but there is no doubt that essentialism has played a great role in recognition of many taxonomic groups where certain phenotypic traits have been used for defining specific taxa. The development of evolutionary theory has provided systematists with the concept of monophyly, which ultimately is based on genealogical relationships. By using phylogenetic methodology, monophyletic groups (clades) sharing a common ancestry can be recognised. Both phenotypic and genotypic data can be useful for this, but the former is considered a proxy for the latter. Thus, while recognising the enormous importance of phenotypic data for the primary identification (i.e., classification and nomenclature) of taxa (and of course of general biology), we will in the following focus on the genetic data.
The use of internal or anatomical features in taxonomy began with the development of microscopes powerful enough to visualise the internal structures of organs and tissues (
Most chemotaxonomists recognise three broad categories of chemical compounds as taxonomically important: primary metabolites, secondary metabolites, and semantides (
The number of chromosomes in each cell of all individuals of a species is usually constant and more closely related species are likely to have similar haploid chromosome numbers (
Embryonic development and structure have historically been used at different levels of classification. For example, the basic division of the plant kingdom into two units, the Thallophyta and the Embryophyta, was based in part on zygotic behaviour. In the same way, embryonic characteristics were an important component in the division of the angiosperms into two major groups, the monocotyledons and the dicotyledons (
Studying plant pollen and spores is useful for determining species relationships in plants (
Morphological and genetic analyses of fossil material from pollen, leaves, stems, and other plant parts are used to trace evolutionary developments through stratigraphic sequences and to predict past ecological conditions (
Strictly, DNA constitutes the genotype, while RNA, proteins, and associated structures belong to the phenotype. Nevertheless, DNA, RNA, and proteins can all be used to detect basic genotype changes. Very often, nucleotide substitutions are neutral and either do not change the amino acid sequence of the protein that they encode, or result in minimal changes in the amino acid sequence (
Several methods have been developed to either generate sequence data for whole genomes (whole genome sequencing, WGS), or sample a subset of specific loci from across the genome (
Since the advent of molecular phylogenetics in the late 1980s, several theoretical approaches for reconstructing relationships in the Tree of Life have been developed (
A second set of methods, under the umbrella term “optimality approaches”, assesses the optimal tree in the full tree-space using predefined criteria. This includes minimum evolution, which optimises the tree that minimises the sum of pairwise distances as expressed on the tree (
Bayesian methods use Bayes’ theorem to estimate the probability of a tree (including topology, branch lengths, and parameters in the underlying model of sequence evolution) given the alignment data (
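In symbols, writing T for the tree and θ for the branch lengths and substitution-model parameters, Bayes’ theorem gives the posterior probability of a tree given the alignment data D as:

```latex
P(T, \theta \mid D) \;=\; \frac{P(D \mid T, \theta)\, P(T, \theta)}{P(D)}
```

Because the normalising constant P(D) requires summing over all tree topologies and integrating over all parameter values, it cannot be computed directly; this is why Bayesian phylogenetic software samples from the posterior with Markov Chain Monte Carlo instead.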
An important finding when sequencing multiple loci across different accessions was that a set of genes for the same group of taxa often supports different branching patterns in the gene trees. A number of phenomena are responsible for this discord among gene trees, including the incomplete sorting of ancestral polymorphisms (incomplete lineage sorting or ILS), gene duplication and loss, horizontal gene transfer, and branch length heterogeneity (
The MSC models ILS by assuming that the degree of incongruence among gene-trees is positively related to effective population size and negatively related to the times between lineage divergences (
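For the simplest case of three species, this relationship has a classic closed form: if the internal branch of the species tree spans t generations and the ancestral diploid effective population size is N_e, the probability that a sampled gene tree disagrees with the species tree topology is

```latex
P(\text{discordance}) \;=\; \frac{2}{3}\, e^{-t / (2N_e)}
```

so discordance increases with effective population size and decreases with the time between lineage divergences, as stated above.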
The second approach, commonly referred to as co-estimation methods, uses sequence alignments as input data such that gene and species trees can be simultaneously estimated (
While the MSC represents a major advance in modern phylogenetics, it accounts for only one source of gene tree discord, which has a number of alternate causes, collectively summarised under the concept of migration, meaning the transfer of alleles between otherwise discrete lineages of alleles. Thus, migration in this meaning will include processes such as hybridization, introgression, horizontal and lateral gene transfer, and admixture. The classic MSC model assumes that speciation is instantaneous, and that all gene flow ceases directly after two lineages diverge (
The larger number of parameters estimated in IM and MSCi models relative to classic MSC methods improves the biological realism with which the evolutionary process is modelled, but also necessitates a larger number of loci for reliable parameter estimation (
Allopolyploidy poses a special case of discrete modelling of migration; for a description of available methodologies, the reader is referred to Chapter 17 Species delimitation.
The introduction of the concept of monophyly (
The integrative taxonomy approach attempts to integrate and use information from several different sources (i.e., morphological, chemical, genomic, ecological, etc.) in order to rigorously delineate species and other taxa. However, this approach has received criticism due to the lack of a clear conceptual and methodological framework, particularly with reference to quantitative criteria. In a phylogenetic framework, by contrast, grouping (i.e., recognition of monophyletic groups) precedes ranking (i.e., choice of level for naming and formal ranking in the taxonomic hierarchy).
The taxon-tree contains clades (= monophyletic groups) of various inclusiveness that may be named and given a rank according to the rules of nomenclature. This ranking process is in principle arbitrary, but various auxiliary criteria, such as the different versions of the phylogenetic species concept (
Application of the MSC model enables rigorous scientific testing of monophyly hypotheses using multi-locus sequence data. Applying species rank to certain clades is valid but will need auxiliary criteria to reduce subjectivity (
Biodiversity – The variety of living organisms encompassed in all forms.
Branch (= edge) – A part of a phylogenetic tree that connects different nodes (= vertices) or terminals (= leaves).
Clade – A part of a phylogenetic tree made up of a common ancestor including all its descendants.
Effective population size – The size of an ideal Wright-Fisher population that would contain the same genetic diversity and/or experience the same genetic drift as the population surveyed, irrespective of its census population size.
Gene flow (= migration) – The transfer of alleles between populations due to various biological processes.
Gene-tree (= genealogy) – A tree representing the evolutionary history of a particular gene.
Gene-tree discordance (= phylogenetic incongruence) – A phenomenon where evolutionary trees from individual genes result in conflicting branching patterns.
Homology – The shared similarity due to descent from a common ancestor.
Horizontal gene transfer (= lateral gene transfer) – The transfer of genetic material through a biological process other than sexual reproduction.
Incomplete lineage sorting (= deep coalescence) – The failure of ancestral gene copies of two or more lineages in a population to coalesce within the population branch.
Lineage – In phylogenetics, a group of populations connected by a single line of descent from a common ancestor.
Metapopulation – A group of spatially separated populations which share the same evolutionary history.
Monophyly – A relationship where descendants of a common ancestor form a single clade.
Phenotypic variation – The variability in the observable expressed and environmentally affected features that exists in a population.
Polymorphism – A phenomenon where a trait has more than one expression.
Polyploidization – A biological process through which an organism comes to possess more than two sets of chromosomes.
Posterior probability – An estimation of the probability of a hypothesis given the data, a stochastic model and prior expectations.
Sequence alignment (= alignment) – The process of arranging DNA sequences in order to identify homologous positions.
Species delimitation (= delimitation) – The process of analytically identifying boundaries of species using empirical data.
Species-tree – A tree showing the evolutionary branching history of ancestral to descendant populations.
Species-tree inference – The process of estimating branching history of populations.
Substitution – The replacement of one nucleotide by another, resulting in a change in the DNA sequence.
Topology (= tree topology) – The branching pattern and order of nodes on an evolutionary tree.
Tree-space – A collection of all possible trees for a given set of input sequences.
Wright-Fisher model – A model where alleles are sampled from a population which is characterised by random reproduction, no selection, and no overlap between generations.
Have you ever wondered how museum collections can be used for answering fundamental questions about biodiversity and its evolution across space and time? Natural history museums harbour ~3 billion biological specimens that are often linked with a specific collection time and place (
Museomics is the study of biological material from museum collections using genomic techniques that allow the reconstruction of partial or complete genomes. In contrast to single-locus PCR-based approaches, these genomic techniques provide information on a genome-wide scale that can, for example, be used to assess evolutionary and ecological processes (
Although different in age and preservation state, historical DNA from museum collections (typically < 200 years old) can have similar characteristics to ancient DNA (aDNA, typically > 200 years old), such as post-mortem degradation patterns (
The plant material available in museum collections is an indispensable source of genetic information for species that are extinct (Van de Paer et al. 2016;
Herbarium collections include a variety of sample types: pressed specimens, seeds, wood or xylarium samples, flowers and fruits preserved in alcohol or desiccated, and biocultural or ethnobotanical collections gathered over hundreds of years. They may (1) originate from general collections seeking to represent the world’s biodiversity, (2) have been deposited as vouchers, or (3) have served as reference material for a specific study (
Herbarium specimens generally consist of a pressed plant mounted on acid-free paper. They ideally include leaves, stems, flowers and/or fruits, and roots when possible, providing the plant parts necessary for unambiguous identification. The metadata associated with a specimen should at minimum include the binomial scientific name, the collector, the collection date, the locality, and a unique number. Additional information may include a description of the habitat and associated plants, as well as any details that cannot be observed from the dried specimen, such as the specimen’s colours and smell at the time of collection and any observed visiting insects (
Xylaria, or wood collections, comprise samples from different wood parts of a tree, and such specimens can inform forensics, timber trade, and conservation efforts. A typical specimen is wood stripped of bark, when present, and has the shape of a book. Some collections also include cross-sections, which can provide more valuable ecological and anatomical information than the book-shaped wood (
Economic botany or biocultural specimens include economically useful plant parts such as fruits, bark, seeds, bark cloth, baskets, and papers, used for medicinal, religious, entertainment, and commercial purposes (
The next paragraphs give examples of the applications and impact of museomics research, as well as of the potentially negative implications that unethical use of collections can have for local communities.
DNA analysis of plant material in museum collections has allowed us to improve their taxonomic annotations and their corresponding scientific value. By analysing their genomes, it is possible to assign taxonomic information to samples that cannot be reliably identified morphologically or that no longer exist. For example, genomics was used for the identification of both endangered and extinct species of Hawaiian endemic mints and the now considered extinct Hesperelaea palmeri (Van de Paer et al. 2016;
Genetic analysis of museum samples has increasingly been used to describe evolutionary processes shaping the genetic diversity, population structure, phylogenetic history, and demography of plants. Both Sanger and high-throughput sequencing have been used to obtain partial and complete genomes of plants in museum collections. Combining the genetic analysis with the information contained in their associated metadata increases the scope of the evolutionary inferences that can be made. Information about the collection date and geographic location can be used to directly measure changes in genetic diversity across time and space, the effect of climate change, domestication, human environmental disturbance, and other natural phenomena (
One of the principal applications of museomics has been reconstructing plant species’ phylogenetic histories. Understanding plants’ evolutionary relationships can help refine their taxonomic classification, identify their potential geographic and evolutionary origins, and make predictions on their chemical properties and potential future applications (
Additionally, herbarium material has been used to identify and measure the extent of gene flow (i.e., the exchange of genetic material through interbreeding) among plant populations, such as that occurring between different species of ragweed (
An interesting aspect is the evolution of plants under domestication (i.e., the process through which wild plants became today’s crops). Most of the plants (by volume) that we consume today as food or use in the production of plant-based products are the result of domestication. Museomics has made important contributions to the study of the geographic origins, dispersal patterns, and selective evolution of domesticated species. Herbarium specimens have been used to trace the origin of European potatoes to the Andes (
Another area where museum specimens provide valuable insights is the study of genetic erosion, the decrease in genetic diversity over time. Samples collected at different points in time are an ideal and reliable way to directly measure changes in genetic diversity through time and in relation to historical, geographic, and climatic changes (
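One simple statistic for such comparisons is expected heterozygosity; a decline between sampling dates at many loci is a classic signature of genetic erosion. A minimal sketch (function name and frequencies are illustrative, not from the source):

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity at a single locus: He = 1 - sum(p_i^2),
    where p_i are the allele frequencies (which must sum to 1)."""
    return 1.0 - sum(p * p for p in allele_freqs)

# Hypothetical locus genotyped in specimens from two collection dates:
he_1900 = expected_heterozygosity([0.5, 0.3, 0.2])  # three alleles segregating
he_2000 = expected_heterozygosity([0.9, 0.1])       # diversity nearly lost
# he_2000 < he_1900 at many loci would be consistent with genetic erosion
```

Dated museum specimens make such before/after comparisons possible without having to infer past diversity from modern populations alone.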
Genetic analyses of museum collections can also be used for the study of ecological processes (i.e., the interactions between plants, animals, and abiotic components in an ecosystem). Herbarium collections can help in characterizing the distribution and abundance of plant species through time and in measuring changes in biodiversity. By combining their genetic data and metadata, we can measure the habitat ranges of species through time and identify possible associations between such changes and climatic or anthropogenic events. In one example, genetic analysis of the grass Alopecurus myosuroides from herbarium collections showed that the genetic variants associated with herbicide resistance pre-dated the use of herbicides, confirming that this resistance did not evolve in response to anthropogenic events (
Multiple studies have used herbarium and xylarium specimens to develop DNA barcode libraries of entire floras or for more specific applied uses within forensics, authentication, and conservation (see Chapter 26 Forensic genetics, botany, and palynology; Chapter 23 Food safety; Chapter 22 Healthcare; Chapter 13 Barcoding - High Resolution Melting).
Developing plant cultivars with desirable characteristics is essential to guarantee food security in the future. One of the initial stages in improving crops is identifying plants that already have certain beneficial traits that can be used in the breeding process (
Plant pathogens cause diseases and losses at different levels from hunger and famines to the extinction of entire species (
Previous studies have shown that museomics can answer questions about the evolution and origin of plant pathogens as in the case of the potato late blight, Phytophthora infestans (
Studies on endangered species often lack past diversity estimations or rely on inferring past genetic diversity from modern populations. Museum specimens can provide an important perspective for past population evolutionary events and eventually contribute to the conservation of the species through scientifically supported conservation and management recommendations (
The conservation of rare and endangered plants relies on a sound understanding of their genetic diversity to ensure the health of both wild and ex situ collections, and to avoid or overcome genetic bottlenecks (
Other studies have documented genetic changes of endangered species in response to human disturbance.
International conventions govern accessing, researching, and moving plant material between institutions and countries. Specifically, researchers need to understand and adhere to the Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilisation to the Convention on Biological Diversity.
Additionally, some museums have destructive sampling policies and committees that weigh the pros and cons of destroying precious and unique samples against the possibility of advancing scientific knowledge (
Ideally, researchers aiming to use biocultural or other culturally sensitive collections of high human interest should involve and consult the Indigenous peoples and Local Communities concerned from an early research stage, to ensure the fair and equitable use of the collections and their associated information, and to benefit from knowledge of the cultural value and uses of the specific plants and artefacts. This is becoming common practice for human paleogenomic studies, but unfortunately not yet for plant material. From the researcher’s perspective, knowing Indigenous communities’ practices and concerns minimises the potential for unintended cultural harm from paleogenomic studies and can also yield additional advice on relevant research questions to consider (
Finally, museums have experienced continuous reductions in funding and staff resulting in the lack of curatorial expertise and capacity leading to increasingly orphaned collections (
Biocultural – The combination of biological and cultural factors.
Crop improvement – Genetic improvement of crops in terms of quality and/or quantity to satisfy human needs.
Extant – Species, lineage, or specimen that still exists today.
Gene flow – Allele exchange between populations, one of the main forces that drives evolution.
Haplotype – DNA sequences that are closely located on the chromosomes and thus likely to be inherited together.
Host-pathogen coevolution – The constant competition between hosts and pathogens to infect and spread and to avoid death from infection, respectively. Results in genetic innovations from both sides.
Metadata – Description of a collection event including, among other details, the possible species identification, collection locality, and collector. Usually found on a label, but can also be tracked from associated databases or archives.
Museomics – The study of museum collections using genomic techniques that allow the reconstruction of partial or complete genomes.
Paleogenomics – A field in evolution that attempts to reconstruct and analyse the genetics of specimens that no longer exist.
Xylarium (xylotheque) – Museum collection consisting of authenticated wood samples.
Zebra chip disease – A disease affecting potatoes caused by Candidatus Liberibacter solanacearum.
The study of ancient plant remains was historically limited to morphological studies, with palaeontology being the primary field studying past organisms. However, since the 1980s, genetic analysis of biological matter within fossils has become increasingly informative thanks to the development of new molecular technologies such as the polymerase chain reaction (PCR) and high-throughput sequencing (HTS). Since the first identification of aDNA from extinct species in 1984 (
In the context of paleogenetics, ancient DNA (aDNA) is DNA from long-deceased tissues preserved under conditions allowing DNA survival. Even under appropriate preservation conditions, aDNA is usually degraded by biotic or abiotic processes. Though often damaged, it can carry valuable historical information (
The sequences used for most plant aDNA studies are derived from the nuclear and organellar genomes and are quite often the same markers typically used for plant identification or studies of evolutionary history. Markers from plastids are usually favoured for their high copy number and short length, despite reported problems resulting from their high propensity for genetic rearrangements. Furthermore, horizontal transfer from the plastome to the nuclear and mitochondrial genomes complicates analyses, as the mutation rate differs between the nucleus and the other organelles (
Several difficulties are inherent to working with aDNA from plant specimens: the complexity and variability of the genome, aDNA damages, and potential contamination increase downstream analytical difficulties. The combination of often very low aDNA concentrations with the amplification power of PCR dramatically increases the probability of amplifying contaminating modern DNA. Specialised methods and laboratory procedures have been established to reduce the risk of contamination. These include: the use of positively pressurised clean laboratory facilities dedicated to aDNA work, the replication of experimental works in different institutions, and the use of biomarkers for prediction of DNA survival such as mitochondrial DNA (mtDNA) detection, aDNA damage patterns, and detection of associated remains (
Another complication in the analysis of plant DNA is its variability. The presence of different organelle genomes (plastid and mitochondrial) as well as the interspecific differences in ploidy level and chromosome size can complicate the alignment of sequencing reads to a reference sequence (
Macrofossils are defined as fossils that are observable without magnification, and in the case of plant-based studies, they are ancient preserved tissues found in archaeological or sedimentological contexts. aDNA can be extracted from macroscopic plant remains such as leaves, needles, bud scales, wood, or seeds. However, individually-based approaches on plant macrofossils are scarce and most of the studies focusing on plant DNA are based on metabarcoding using sedimentary DNA (sedaDNA) material (
A very common archaeological plant material is charred remains. One example is superficially burnt seeds in hearth remains found in ancient settlements. Molecular identification of even lightly charred remains is however challenging since the DNA is often very fragmented and contaminated (
In contrast, desiccated samples are often suitable for molecular analysis. Desiccated samples are typically found in dry environments such as caves, shelters formed by rock features (well suited for long-term food storage), or deserts. Desiccation can limit DNA degradation, and plastid and mitochondrial DNA from sunflower seeds as old as 3,100 years old has been successfully recovered (
Biological remains preserved under waterlogged anaerobic conditions may also contain sufficient aDNA for molecular identification. Lakes and marine sediments can provide sedimentary DNA (sedaDNA) from plant remains and pollen grains found in different strata of core samples, which can be used to reconstruct past ecological diversity. Microorganism communities can also be a source of aDNA. For example, diatoms are commonly used bioindicators for assessing the biological composition (trophic state) of a lake, since their morphology is highly sensitive to the surrounding environment (
Waterlogged remains can also be found in archaeological contexts: wells, latrines, ditches, and pits can create anaerobic conditions. DNA from Iron Age grape seeds has been successfully sequenced with Hyb-Seq, showing that the grapes are related to present-day West European cultivars and providing evidence for 900 years of uninterrupted vegetative propagation of the crop (
Mineralized samples or those embedded in resin or fossilised in amber are both potential sources for aDNA, though the high probability of contamination, extreme fragmentation of the material, and non-reproducibility of the results have led some authors to strongly discourage aDNA analysis from amber-preserved fossils (
Microfossils can be found in any environment, including in humid conditions and tropical zones where macrofossil preservation is rare. These include pollen, starch grains, and phytoliths. Plastid aDNA obtained from pollen grains is very often endogenous, and its amplification has previously established the first genetic link between extant and fossilised Scots Pine specimens from post glacial lake sediments in Sweden (
Sediments found in lakes, temperate caves, permafrost, and ice cores can retain plant aDNA for thousands, and in some cases, millions of years (
sedaDNA provides a broad understanding of the past environment, climate, and ecology of the paleosol studied. It can also provide insights on the movement and cultivation of plants by Neolithic populations and their social network in absence of other archeological evidence (
sedaDNA from lake sediments has been used to reconstruct ancient plant vegetation and to assess the impact of anthropogenic activities on the paleoenvironment. For example, the impact of cattle grazing on deforestation dynamics during the Late Iron Age and Roman period has been demonstrated by using a metabarcoding approach on sediment samples from a subalpine lake (
sedaDNA can also be used to study the impact of climatic changes on plant biodiversity and help prioritise conservation management. A research project using metabarcoding of lake sediments was able to show that a heterogeneous mountain landscape served as a refugium for arctic-alpine plants in a warm climate (
Another study on Arctic Canada lake sediments gave clues about the effect of the rise in temperature during the Last Interglacial period (LIG) on plant population dynamics. Previous attempts to reconstruct the LIG paleoclimate with climate modelling based on the simulation of atmosphere, sea, and ice circulation have yielded inconsistent results (
We can improve modelling of future climate change effects on plant diversity based on these studies that inform how plant richness has evolved in reaction to previous episodes of climate warming. Several environmental changes that might have been overlooked such as arctic amplification or arctic greening can be studied with sedaDNA (
SedaDNA studies are, furthermore, more robust than pollen-based methods for detecting plant richness and resolve taxon diversity at a finer scale (
The same observations can be made using sedaDNA extracted from permafrost, as presented in a study encompassing 50,000 years of megafauna diet and arctic vegetation history from samples collected across the Arctic. While pollen-based reconstruction showed a majority of graminoids in the unglaciated Arctic during the Last Glacial Maximum, the metabarcoding approach revealed a forb-dominated vegetation (
Ancient faeces, though relatively uncommon, are a rich source of biomolecules and paleodietary information that can be related to demographic, ecological, and climatic changes in the locations in which they are found (
The analysis of an aDNA dataset is complicated by post-mortem DNA degradation that leads to short fragments, specific nucleotide substitution patterns, and overall low DNA yields (
The initial alignment step with a reference genome during bioinformatic analyses is already affected by aDNA chemical damage, which can increase the apparent error rate and lower the alignment accuracy. Subsequent steps in variant calling of genetic markers can be complicated by the high mapping error rate and low coverage (
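One routinely inspected damage signal is the elevated C→T mismatch rate near the 5' ends of reads, caused by post-mortem cytosine deamination. The sketch below shows how such a profile could be tallied from gap-free (reference, read) pairs; the function name and toy data are illustrative, and real pipelines use dedicated tools such as mapDamage:

```python
def ct_rate_by_position(aligned_pairs, max_pos=10):
    """C->T mismatch rate per position from the 5' end, over a list of
    (reference, read) strings aligned without gaps. Post-mortem
    deamination typically inflates this rate at the read termini."""
    hits = [0] * max_pos    # C->T mismatches observed at each position
    totals = [0] * max_pos  # reference C's observed at each position
    for ref, read in aligned_pairs:
        for i in range(min(max_pos, len(ref), len(read))):
            if ref[i] == "C":
                totals[i] += 1
                if read[i] == "T":
                    hits[i] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

# Toy alignment with damage only at the first base:
# ct_rate_by_position([("CCAG", "TCAG"), ("CGCA", "CGCA")], max_pos=4)
# -> [0.5, 0.0, 0.0, 0.0]
```

A profile that rises sharply towards the read ends is one line of evidence that sequences are endogenous ancient molecules rather than modern contamination.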
The evolutionary history of a species or a population can be established based on genomic inference from modern samples, providing clues about the evolutionary processes that form the basis for present genomic variation. However, allelic patterns in contemporary specimens are shaped by a range of demographic events, including changes in population size, gene flow, and hybridization events. These may be due to very recent events, and do not necessarily represent the lineage’s deeper evolutionary history. A time series of samples can provide greater resolution in a genomic analysis and resolve phylogenetic questions. It can also detect recent demographic events such as population bottlenecks and provide chronological estimates for these events without using a molecular clock. Allele frequencies can be directly estimated for each time point and used to estimate the strength of selection pressure during that period (
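As a toy illustration of that last point: under constant haploid selection, the log-odds of an allele's frequency changes by ln(1+s) each generation, so two dated samples yield a rough estimate of the selection coefficient. A sketch under those simplifying assumptions (deterministic model, no drift or migration; names are illustrative):

```python
import math

def estimate_selection(p_start, p_end, generations):
    """Per-generation estimate of ln(1+s) for a haploid locus under
    constant selection, from allele frequencies at two time points:
    logit(p) changes linearly with slope ln(1+s)."""
    logit = lambda p: math.log(p / (1.0 - p))
    return (logit(p_end) - logit(p_start)) / generations

# Simulate 50 generations of deterministic selection with s = 0.05 ...
p = 0.1
for _ in range(50):
    odds = 1.05 * p / (1 - p)   # odds are multiplied by (1 + s) each generation
    p = odds / (1 + odds)
# ... then recover ln(1.05) ~= 0.0488 from the two endpoints:
# estimate_selection(0.1, p, 50) ~= math.log(1.05)
```

With real time-series data, drift and sampling noise must of course be modelled rather than ignored, but the principle of estimating selection directly from dated allele frequencies is the same.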
The dramatic global warming and extinction events that occurred during the later Anthropocene coincided with the active collection of specimens for museums and herbaria (
Positive selection can also be detected in contemporary specimens using statistical tools such as coalescence, population differentiation (FST), and linkage disequilibrium. Selection pressure, however, can be conflated with demographic change or background selection. Specific methods have been developed to detect positive selection on a polygenic trait using an admixture graph to represent the admixture events relating different populations through time (
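For reference, the population-differentiation statistic mentioned here compares heterozygosity within and between populations. A minimal two-population, biallelic sketch (function name and example frequencies are illustrative):

```python
def fst_biallelic(p1, p2):
    """Wright's FST for one biallelic locus in two equally weighted
    populations: FST = (HT - HS) / HT, where HT is the expected
    heterozygosity of the pooled population and HS the mean
    within-population expected heterozygosity."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

# fst_biallelic(0.5, 0.5) -> 0.0  (no differentiation)
# fst_biallelic(1.0, 0.0) -> 1.0  (populations fixed for different alleles)
```

Outlier loci with unusually high FST are candidates for local positive selection, which is why, as the text notes, demographic effects must be ruled out before selection is inferred.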
Purifying selection or negative selection can be detected in present-day specimens as signals of reduced genetic diversity. However, similar signals can be caused by demographic events such as population bottlenecks or background selection (
Balancing selection is more difficult to detect since it affects narrow genomic regions on a short timescale. This can be mistaken for positive selection, demographic events, or introgression (
All current crops are the products of single or repeated domestication events starting less than 12,000 years ago from the ancestral wild species (
Archaeobotanical remains can be arranged in a time series to study the evolution of domestication over time and space. They can indicate the number of times that domestication events occurred and their location, the pace and stringency of anthropogenic selection, introgression with wild relatives and between different cultivars and be used to determine the date of these events (
Molecular methods have made an increasingly large contribution to the field of archaeobotany. Starting with simple genetic analysis for taxonomic identification to supplement morphological examination, the field has rapidly progressed following advances in high-throughput technologies in archaeogenomics. Methods such as shotgun sequencing have enabled genome-wide studies, exploring in detail the genome of domesticated plants and analysing the genome-wide rearrangements that occurred during this process (
As both a key crop and a genetic model organism studied in depth for over 100 years, maize has been the subject of a wealth of domestication studies, revealing a detailed picture of its evolution. Molecular analysis of palaeobotanical remains continues to provide new information on maize evolution, and PCR-based studies have identified the likely geographic region of its original domestication in Mexico and traced its dispersal across Central America and South America (
The target capture method, or Hyb-Seq (see Chapter 14 Target capture), has been used to confirm and refine models for maize domestication over time, mediated by progressive introgression from wild relatives (da Fonseca et al. 2015). A recent study on maize domestication and diversification in South America, based on the genomes of present-day and ancient American maize cobs, has shown that maize had a stratified mode of domestication that started with a large, partially domesticated Mesoamerican gene pool. This was followed by dispersal to different locations, in which the subpopulations became reproductively isolated under different selection pressures (
Wheat domestication has not been studied as extensively as that of maize, but modern genome-wide studies of emmer wheat chaff found shared haplotypes between 3,000-year-old Egyptian emmer wheat from museum collections and modern emmer wheat, including domestication loci such as two QTLs related to grain size and seed dormancy. Although several haplotypes present in the historical specimens are absent from modern emmer, similarities between the museum specimens and Arabian and Indian emmer landraces suggest an early south-eastern dispersal of ancient Egyptian emmer (
Bottlenecks are a common feature in the domestication process and have also been revealed from ancient plant material in beans. One of the symptoms of a bottleneck event in the demographic history of a lineage is genetic erosion, the loss of allele diversity in a population due to genetic drift and inbreeding caused by the bottleneck event. This effect was found in the case of the Andean bean domestication, which was likely triggered by stringent varietal selection (
Climatic and environmental changes can be responsible for major shifts in species’ geographic distributions. For example, the glaciation cycles of the past 2.4 million years restricted some species to separate refugia, often resulting in a loss of allelic variation that persists after the species’ expansion out of the refugium. Phylogeography studies the history of the geographic distribution of genealogical lineages, using population genetic tools to detect the changes in genetic variation caused by historical events such as migration and dispersal (
Early plant phylogeography studies were based on plastid DNA (pDNA) sequencing methods, such as a study of the distribution and circumpolar migration of saxifrage that suggested plant refugia were located in the Arctic (
Ancient DNA studies can unravel the ecological past and temporally explore the adaptation mechanism and interactions between organisms. This can include processes such as convergent evolution of different species in a similar environment, present plant adaptations due to standing or de novo mutation in the evolutionary history of a species, or metagenomics of an aDNA specimen to reveal the dynamics of plant pathogens (
Innovations in shotgun metagenomics have increased the possibilities for using sedaDNA analysis to reconstruct past vegetation with higher taxonomic resolution than with pollen DNA barcoding (
Some limitations do however remain. SedaDNA is preserved in lake environments because the stable temperature conditions conserve DNA, but sampling can be challenging in these areas. There are also major challenges in detecting species that are rare or have a low biomass. Additionally, the taxonomic resolution provided by sedaDNA varies depending on the method used. While metabarcoding of sedaDNA almost always provides higher resolution than direct pollen analysis (
The Anthropocene presents major global challenges, including climate change, loss of biodiversity through extinction, and emerging zoonotic infectious diseases. An understanding of previous human interactions with the environment can guide conservation management during this era of massive environmental change and rapid loss of biodiversity. The field of conservation archaeogenomics involves analysing aDNA with the goal of guiding present-day biological conservation (
Genomic archaeological data can also reveal details about the time and potential reasons for local or global extinction events, and help to understand the resulting consequences on ecosystems and human societies. Studies that use these data may also contribute to better understanding how human activities and behaviours may have contributed to past extinction events. Studying the distribution of species and how they colonise new areas can also help us to anticipate how ecosystems may respond to future climate change (
A theoretical application of the recent progress in molecular biology and sequencing techniques follows from the concept of “de-extinction” or “species revivalism”. The possibility of de-extinction is controversial and still debated on both technical and ethical levels, as it is difficult to justify the ecological need for reviving extinct species rather than supporting current conservation efforts for endangered species (
Over the last several decades, paleogenetics has made substantial contributions towards our understanding of ancient plant science, ecology, and archaeology. In contrast, paleogenomics is still in its infancy, and its sequencing and analysis techniques are constantly improving. The study of full genome datasets has made it possible to accurately characterise taxonomic diversity (
The race to understand biological diversity before it is lost is, to some degree, eased by the presence of valuable genomic information in archaeological and natural history collections that include extinct and endangered species. Because this field of research provides information about common species and their ecological background, it offers a framework in which to study and understand how the past 200 years of human activity have impacted patterns of genetic diversity in the natural world. It is essential that we use insights from the study of ancient plant genomics to help us reduce biodiversity loss over the next 200 years.
Amber – Fossilised tree resin; may contain animal or plant material as inclusions.
Paleogenetics – The study of the past using genetic material from ancient specimens.
Palaeogenomics – Genome-scale sequencing studies of genetic material from ancient specimens.
Balancing selection – Different selective processes which maintain genetic diversity at a frequency superior to that expected under neutral genetic drift.
Coprolite (or coprolith) – Fossilised human or animal faeces. Unlike palaeofaeces, most of their original composition has been replaced by mineral deposits.
cpDNA – Chloroplast DNA, or plastome.
De-extinction – The theoretical possibility of rebuilding extinct species using aDNA sequences.
Ice core – Long cylinder of ice recovered by drilling through ice sheets or glaciers.
mtDNA – Mitochondrial DNA.
Palaeoecology (or paleoecology) – The study of interactions between organisms and their environment across geologic timescales.
Palaeofaeces (or paleofeces) – Ancient animal or human faeces. Unlike coprolites, they retain some of their original biological composition, although in practice the terms are used interchangeably.
Permafrost – Ground continuously frozen (below 0 °C) for two or more years.
Phytoliths – Silica microstructures found in some plant tissues.
Plant domestication – Human selection of desirable traits in plants that has taken place in the last 12,000 years.
Positive selection (or directional selection) – Process by which one phenotype is selected preferentially to others, causing allele frequency to shift over time towards this phenotype.
Purifying selection (or negative selection) – The removal of deleterious alleles from a population genome.
SedDNA – Sedimentary DNA; typically younger and better preserved than sedaDNA.
SedaDNA – Sedimentary ancient DNA; typically older and more degraded than sedDNA.
Subfossil – A partially fossilised organism that still contains biological matter such as bone, skin, or faecal deposits, whereas a fossil is completely mineralised.
Taphonomy – Study of how organic remains pass from the biosphere to the lithosphere, including processes affecting remains from the time of death of an organism through decomposition, burial, and preservation as mineralized fossils or other stable biomaterials.
Plants have been used as medicines for millennia in diverse geographical and cultural contexts. They continue to play an essential role as therapeutic and prophylactic agents in traditional and complementary medicine (
Accurate medicinal plant identification is very often a challenging task. Products can be derived from single species or from mixtures; can be in dried, fragmented, or powdered form; and can originate from plant leaves, flowers, stems, bark, roots, fruits, or seeds. They may also come in the form of phytopharmaceutical products, including oral, topical, parenteral, ophthalmic, or inhaled forms. Most regulatory guidelines and pharmacopoeias for conventional plant authentication are based on diagnostic morphological or chemical features.
Macroscopic botanical analysis for herbal drug authentication is complicated in many cases, since it requires an experienced taxonomist and plant samples that include flowers or fruits along with a segment of stem bearing enough leaves to observe branching patterns (
Chemical authentication of herbal medicines includes metabolite detection, quantification, profiling, and elucidation through analytical methods such as thin layer chromatography (TLC), high-performance thin layer chromatography (HPTLC), high-performance liquid chromatography (HPLC), liquid chromatography coupled to mass spectrometry (LC-MS), gas chromatography coupled to mass spectrometry (GC-MS), or nuclear magnetic resonance (NMR) (
Molecular plant identification techniques such as DNA barcoding have proven to be cost-effective procedures useful in pharmacovigilance to authenticate herbal medicines at species level and to detect adulterants (de Boer et al. 2015a). Molecular identification of medicinal plants has been suggested for routine market surveillance and for screening the quality of raw materials in early stages of the herbal supply chain (see Chapter 23 Food safety). DNA-based methods come with several advantages, including that DNA is present in all plant organs, its presence is less sensitive to external factors than metabolites, and it can be used for the identification of dried and powdered products where morphological characteristics are absent (
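As a toy illustration of the matching step behind barcode-based authentication, the sketch below assigns a query sequence to its closest reference by simple pairwise identity. The sequences and species names are invented placeholders, and real pipelines use proper alignment tools (e.g., BLAST) that handle insertions and deletions rather than position-by-position comparison.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence length."""
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a, b) if x == y) / n

def identify(query: str, references: dict) -> tuple:
    """Return (species, identity) for the best-matching reference barcode."""
    best = max(references, key=lambda sp: pairwise_identity(query, references[sp]))
    return best, pairwise_identity(query, references[best])

# Invented toy reference barcodes (placeholders, not real sequences)
references = {
    "Species A": "ATGGCTTACCGTTGA",
    "Species B": "ATGGCATACCGATGA",
}
print(identify("ATGGCTTACCGTTGA", references))  # ('Species A', 1.0)
```

In practice, an identity threshold (often around 97–99%, depending on marker and taxon) is applied before accepting a species-level call.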
The increase in herbal medicine adulteration is of growing concern due to the expansion of the global market for natural products (
Chapter 22 Infographic: Visual representation of methodological steps for molecular identification of medicinal plants.
A number of methods for identifying medicinal plants using DNA-based methods have been previously described. Several barcoding regions have been shown to be effective for identifying adulterants and physiologically difficult-to-discriminate plant species from a variety of sample forms (
Ethnopharmacology studies the traditional use of drugs by humans, and integrates anthropological, pharmacological, toxicological, and chemical approaches (
DNA-based identification of plant ingredients in herbal medicines is important for resolving taxonomic controversies, assessing the genetic variability and evolutionary traits of medicinal plants, and enabling the detection, and thereby the conservation, of endangered, illegally traded species (de Boer et al. 2017). Molecular identification of medicinal plants is an important force driving taxonomic research on medicinal species and guiding forensic DNA and toxicological research. Some examples of applications within the field include nrITS2 barcoding of 90 Fabaceae species from China, including 24 species approved in the Chinese Pharmacopoeia (
The vast majority of commercialised medicinal plants are collected from the wild; many are overexploited and some are becoming increasingly scarce. These factors threaten the conservation of endangered plants, endemic species, and species with limited distributions. DNA analysis is also important for the detection of endangered species through the screening of marketed natural products. For instance, DNA barcoding using rbcL, matK, psbA-trnH, and nrITS allowed the identification of species of the cycad genus Encephalartos, which are catalogued as threatened and are illegally traded in South African herbal markets (
According to the WHO, accurate identification of medicinal plants is an essential measure for the assurance of the quality, safety, and effectiveness of natural medicines (
Authentication of medicinal plants and the detection of adulterants is a crucial concern for regulatory agencies and phytopharmaceutical industries in order to guarantee optimal quality, safety, and efficacy of herbal products for consumers. Challenges associated with conventional pharmacognostic procedures to authenticate processed or multi-ingredient herbal products can be mitigated with DNA technologies, enabling the accurate identification of medicinal species and substitutes in complex samples. It is recommended to combine molecular, chemical, and morphological plant identification methods to increase the discriminatory capacity of authentication approaches (
Herbal medicines – Plant(s) or plant part(s) or extract(s) used to improve health, and to prevent and treat disease.
Phytopharmaceutical – Pharmaceutical agents derived from plants or plant parts, for which the active compounds are known.
Traditional medicines – Knowledge, skills, and practices based on traditional cultures, aimed to promote health and to diagnose, prevent, and treat disease.
Trituration – Any of several methods used to reduce the particle size of a substance and to produce homogeneous material from various components.
Complementary medicines – Health practices not belonging to the tradition of a country or to conventional medicine.
Ethnopharmacology – Scientific study of drugs traditionally used by people.
Pharmacovigilance – A pharmacological science relating to the collection, detection, assessment, monitoring, and prevention of adverse effects of pharmaceutical products.
Phylogeny – History of the evolution of a species or group, especially in reference to lines of descent and relationships among broad groups of organisms.
Bioprospecting – The exploration of natural sources for small molecules, macromolecules, and biochemical and genetic information that could be developed into commercially valuable products.
Food safety is defined as the routines used in food handling, preparation, and storage to reduce the risk of individuals becoming sick from foodborne illnesses. Food safety draws from expertise in a wide range of academic fields, including chemistry, microbiology, molecular biology, and engineering. Although advances in science and technology have led to a substantial improvement in food quality, food can still be a source of public health issues (
Numerous botanical products with considerable differences in their classification can be bought throughout the world. These include foodstuffs, herbal medicinal products, and cosmetics. Foodstuffs include dietary supplements, food ingredients, functional foods, and foods for particular nutritional uses, including various botanical extracts. Herbal medicinal products can only be sold in pharmacies, under the supervision of a pharmacist, and are marketed after registration procedures according to their classification (see Chapter 22 Healthcare). Dietary supplements and herbal medicines are usually considered as two different regulatory categories, but for both, consensus on regulation across countries is lacking (
Food hazards refer to any agents with the potential to cause adverse health consequences for consumers. Food safety hazards occur when food is exposed to and contaminated by hazardous agents. Food hazards may be biological, chemical, physical, allergenic, nutritional, and/or biotechnology-related (
A final category of food hazards are biotechnology-related hazards such as genetically modified organisms (GMOs). GMOs are the products of genetic engineering where new genes are transferred from one species into another. The resulting properties may lead to better optimised agricultural performance or the new or increased production of valuable pharmaceutical substances (
It is also important to point out that the toxicity of any substance, including plant-based food and medicinal plants, is largely dependent on the dose or amount used. A normally harmless plant may be toxic at high doses, and a highly toxic plant could be considered safe at low doses (
Food fraud is a collective term used to encompass the intentional substitution, addition, tampering, or misrepresentation of food, food ingredients, or food packaging with the aim of increased economic gain (
Adulteration is the failure of a product to meet legal quality standards. According to the US Federal Food, Drug and Cosmetic Act (FFDCA), food can be declared adulterated if (1) a substance is added which is injurious to health, (2) a cheaper or inferior quality item is added to the food, (3) any valuable constituent is extracted from the main food article, (4) the quality of the food is below the standards, (5) a substance is added to increase bulk or weight, and (6) a substance is added to make it appear more valuable. Adulterated food can be dangerous since it may be toxic to humans or animals, may deprive consumers of nutrients required for health, and may cause intoxication or allergic reactions in sensitised individuals. Adulterants in food can be categorised as follows (
Intentional adulteration is the inclusion of inferior substances having properties similar to the foods to which they are added. The adulterant can be physical, chemical, or biological. An example of intentional adulteration is the addition of wheat or other grains as an inexpensive filler to increase profit margins (
Unintentional adulteration is the inclusion of unwanted substances due to ignorance, carelessness, or a lack of proper facilities and hygiene during food processing. This includes contamination of foods by bacteria and fungi, harmful residues from packing material, or even inherent adulteration, i.e., the presence of certain chemicals, organic compounds, or radicals that naturally occur in foods such as toxic varieties of plants and mushrooms.
Metallic contamination is the intentional or unintentional inclusion of different types of metals and metal compounds in food. Arsenic, cadmium, lead and mercury are amongst the most toxic ones.
Microbial contamination is the spoilage of food due to the introduction of different microbes through various sources.
Agrobioterrorism can be defined as the use of pathogens or toxins against agricultural products or facilities usually with the purpose of causing casualties or fatalities from contaminated agricultural resources or food (
Toxic plants are occasionally eaten due to their misidentification. The fruits of toxic plants such as the very poisonous deadly nightshade (Atropa belladonna) appear similar to edible fruits such as blueberries (Vaccinium sp.) or black nightshade (Solanum nigrum) (
Food authentication is the process by which food is verified as complying with its label description. According to
Plant-based oils and fats dominate food applications. A balanced intake of oils and fatty acids is essential for human health (van Duijn 2014). Extra virgin olive oil is a high-priced product with high nutritional value. Due to high market prices and increasing demand, olive oil is one of the most adulterated products on the global food market. Usually olive oil is substituted with less expensive edible vegetable oils (
The supply chains for herbs and spices tend to be long and complex and pass through many countries. These complexities and the increase in crushed and ground herbs and spices render those products more prone to intentional adulteration (
The adulteration of dietary supplements has been reported fairly frequently, as a result of their rising popularity. Fraudulent practices may result in reduced therapeutic potential of the original drug, posing a serious risk to the health of the consumers (
Food hazards may enter the food chain in various ways and have large impacts on human health. It is therefore important that products “moving” along the food supply chain (FSC) are both tracked and traced. Traceability, under EU law, means the ability to track any food, feed, food-producing animal, or substance that will be used for consumption, through all stages of production, processing, and distribution. Traceability applies to both upstream (where the product comes from) and downstream (where the product is delivered to) tracking (
This issue was debated jointly by the UN’s Food and Agriculture Organization (FAO) and World Health Organization (WHO), leading to the Codex Alimentarius or “Food Code”, a collection of standards, guidelines, and codes of practice adopted by the Codex Alimentarius Commission (
In Europe, risks along the supply chain are assessed by the European Food Safety Authority (EFSA). EFSA monitors and analyses information and data on biological hazards, chemical contaminants, food consumption, and emerging risks (European Food Safety Authority (EFSA), 2012). It is important to note that the principles of the universal HACCP method depend on the origin and nature of the food products as well as the type of end-product. Hazards and their subsequent risk assessment will therefore differ between the olive oil supply chain and the herbal tea supply chain. Olive oil production involves specific processing steps and uses industrial settings such as extraction mills (for pressing or centrifugation), which are absent in the processing phase of dry plant material such as herbal teas and spices (
Fraudulent practices can happen at any step of the supply chain. The most effective way to eliminate illegal practices in the food sector is food chain transparency and full raw material traceability. For example, food companies that implement a digital traceability system using unique product identifiers increase their transparency since they have supply chain visibility in real-time (
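To illustrate the idea behind digital traceability with unique product identifiers, the sketch below chains supply-chain records together with hashes, so that any retroactive edit to an earlier record becomes detectable. The field names and events are invented examples; real systems additionally use signatures, timestamps, and standardised identifiers such as GS1 codes.

```python
import hashlib
import json

def add_record(chain: list, event: str) -> None:
    """Append a record whose hash covers the event and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})

def verify(chain: list) -> bool:
    """Recompute every hash; any tampered record breaks the chain."""
    for i, rec in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else "0" * 64
        body = {"event": rec["event"], "prev": rec["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != expected_prev or rec["hash"] != digest:
            return False
    return True

chain = []
add_record(chain, "harvested: lot 42, farm X")       # invented example events
add_record(chain, "dried and packed: lot 42")
add_record(chain, "shipped to distributor")
print(verify(chain))                                  # True
chain[1]["event"] = "repacked with filler"            # simulated tampering
print(verify(chain))                                  # False
```

The design choice here is the same one that gives blockchain-style logs their transparency: each link commits to the entire history before it, so supply-chain visibility cannot be quietly rewritten.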
The following example discusses risk assessment in a supply chain for the commercialisation of plant-based products based on dry plant material (e.g., herbal tea, spices, medicinal mixtures).
Plant cultivation is the first step in the supply chain for a herbal product, from seed(ling) to adult plant. During growing periods in agricultural fields or greenhouses, different hazardous sources may affect downstream processing and production. These hazards can include: faeces, contaminated soil, irrigation water, water used to apply pesticides, foliar treatments, growth hormones, dust, wild and domestic animals, insects, and human handling. Automated and regular monitoring as well as personal hygiene are therefore essential.
Harvesting can be performed by hand or mechanically, and involves several important commercial steps including pre-sorting and removal of foliage and other non-edible parts. Personal hygiene is particularly important during manual harvesting. Contamination of the herbal product with other plants such as weeds can result from insufficient quality control during harvesting (
Authentication or verification of raw plant material can be done by traditional morphological analysis or by DNA-based methodologies (see paragraph “Methodologies for identification of plant food hazards”). In the case of products with a protected designation of origin (PDO), the label certifies that the product originates from a certain region or area and that the product quality and/or characteristics are due to the particular geographical environment, e.g., Greek extra virgin olive oil or PDO saffron (
During transportation, the raw plant material might be damaged due to poor handling, cross-contaminated with other materials in the vehicle, or contaminated with vehicle exhaust from petrol and diesel (
The plant material might be stored for an extended period of time before packing. Storage requirements depend on the state of the plant-based product (i.e., fresh, raw, processed). Raw plant material needs to be stored in a cool and dry place since fungi can grow if the humidity is too high (
The purpose of packing is to protect against food pathogens, spoilage-causing organisms, pests, damage, etc. Good hygiene practices should be followed in handling containers and improved packing materials to prevent product contamination (
Analytical methods for detecting adulterated food are traditionally seen as a first line of defence against food fraud (
Several techniques are used in the food industry to detect food-microbial contamination. Omics-based techniques (i.e., genomics, transcriptomics, proteomics, and metabolomics; see Chapter 12 Metagenomics and Chapter 15 Transcriptomics) are robust tools to gain insight into microbial communities along the food chain and can detect pathogens, the origin of a foodborne illness, microbial source tracking investigations, and antimicrobial resistance (
Several common molecular techniques for plant-based food authentication are available.
PCR-based techniques are useful for the detection and identification of animal and plant species in foods because of their high sensitivity and specificity, in addition to being relatively fast and inexpensive. Multiplex PCR assays simultaneously identify several species by using species-specific primers, and they are being extensively applied to the detection and differentiation of species present in food products (
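The species specificity that multiplex assays rely on can be sketched in silico: a primer pair should “amplify” only templates that contain the forward primer and, downstream of it, the reverse complement of the reverse primer. The templates and primers below are invented toy sequences; real primer design also checks melting temperatures, dimers, and off-target genomes.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplifies(template: str, fwd: str, rev: str) -> bool:
    """True if fwd occurs in the template and the reverse-complemented
    reverse primer occurs downstream of it (amplifying orientation)."""
    i = template.find(fwd)
    j = template.find(revcomp(rev))
    return i != -1 and j != -1 and j >= i + len(fwd)

# Invented toy sequences: the primer pair should hit only its target
templates = {
    "target":     "GGAACCTTGGACGTACGTAGGTTCCAA",
    "non-target": "GGAACCTAGGACGTACGTAGCTTCCAA",
}
fwd, rev = "GGAACCTT", "TTGGAACC"
for name, t in templates.items():
    print(name, amplifies(t, fwd, rev))  # target True, non-target False
```

A multiplex assay repeats this check for every primer pair against every species in the panel, requiring each pair to amplify exactly one target.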
Like PCR and qPCR, Loop-Mediated Isothermal Amplification (LAMP) detects specific DNA sequences, but uses four to six primers that recognise up to eight distinct regions of the target. The LAMP method uses self-recurring strand-displacement DNA synthesis to replicate a target DNA at a constant temperature, avoiding thermal cycling, which saves time and avoids PCR bias. LAMP has been applied for the detection of foodborne pathogens, the screening of pesticide residues, the assessment of adulteration in meat, the detection of various food allergens, and the authentication of GM crops (
High resolution melting (HRM) is a post-PCR analysis method that monitors the rate of double stranded DNA dissociation to single stranded DNA with increasing temperature and is used to identify variations in nucleic acid sequences. HRM, especially in combination with DNA barcoding, has proven successful for species discrimination, adulterant and allergen detection and product authentication on a wide range of complex food materials of plant as well as animal origin (
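The physical basis of HRM discrimination can be illustrated with a toy calculation: even a single base substitution changes GC content and therefore the duplex melting temperature, shifting the melting curve. The Wallace rule used below is a rough approximation valid only for short oligonucleotides, and the sequences are invented placeholders; real HRM instruments fit full fluorescence melt curves rather than single Tm values.

```python
def wallace_tm(seq: str) -> int:
    """Wallace rule: Tm = 2(A+T) + 4(G+C) in degrees C, a classic
    short-oligo approximation."""
    at = sum(seq.count(b) for b in "AT")
    gc = sum(seq.count(b) for b in "GC")
    return 2 * at + 4 * gc

ref     = "ATCGGCTAGCTAAT"   # hypothetical barcode fragment
variant = "ATCGGCTAGCTAGT"   # single A->G substitution raises GC content

print(wallace_tm(ref), wallace_tm(variant))  # 40 42: variant melts 2 C higher
```

This 2 °C shift is exactly the kind of difference that appears as a displaced or reshaped melting curve when HRM is combined with barcode amplicons.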
Next generation sequencing (NGS) combined with powerful bioinformatics tools is advancing food microbiology and the authentication of products of botanical origin (
More advanced molecular methods such as shotgun metagenomic and whole genome sequencing are becoming more widely adopted in the food industry. These approaches provide deeper, more complete sequence information in a single analysis (
There is an urgent need to combat food safety issues in plant-based products. Further methodological improvements in food hazard detection and digitalization of food safety protocols are necessary for quality assurance of food products. Mislabelling and fraudulent practices such as adulteration require special attention as they are the most common issues in the global food supply chain. Most food hazard detection techniques are chemistry-based and used for detecting chemical food hazards or focus on microbial contamination, issues which have important repercussions on human health. DNA-based methodological advances for plant-based foods should focus more on the creation and curation of reference databases and the use of innovative bioinformatics tools for fast and accurate food authentication. Standardisation of DNA-based methodologies is a prerequisite for the successful implementation of hazard risk assessment protocols at the national and international level.
Chapter 23 Infographic: From farm to fork: risks along the herbal product supply chain. An example of potential risks and their sources in the supply chain for commercial dried plant products (e.g., herbal tea, spices, medicinal mixtures). The most common food safety risks associated with each step in the supply chain are highlighted in boxes, and the critical risk assessment steps are highlighted in the process (indicated by a magnifying glass and an exclamation mark).
Adulteration – A food product that fails to meet the legal standards set by the government is said to have been adulterated. Food adulteration is a legal offence and occurs when substances that lower the quality of food are present, either intentionally or unintentionally.
Bioterrorism – Bioterrorism is defined as a release of biological agents or toxins that affect humans, animals, or plants with the intent to harm or intimidate.
Codex Alimentarius – Also known as “Food Code”, this is a collection of standards, guidelines and codes of practice adopted by the Codex Alimentarius Commission. The Commission, also known as CAC, is the central part of the Joint FAO/WHO Food Standards Programme and was established by the FAO and WHO to protect consumer health and promote fair practices in food trade.
EFSA – European Food Safety Authority. EFSA provides independent scientific advice on food-related risks. This advice informs European laws, rules, and policymaking, and thus helps protect consumers from risks in the food chain.
FAO – Food and Agriculture Organisation. The FAO is a neutral intergovernmental organisation established by the United Nations. It strives to provide information and supports sustainable agriculture through legislation and national strategies, with the goal of alleviating hunger.
FDA – The United States Food and Drug Administration (also known as USFDA) is a federal agency of the Department of Health and Human Services. The FDA is responsible for protecting and promoting public health through the control and supervision of food safety.
FFDCA – The Food, Drug, and Cosmetic Act is the primary food safety law in the US. The FFDCA authorises the FDA to monitor and regulate the safety of food, drugs, and cosmetics.
Food authentication – The process of irrefutably proving that a food or food ingredient is in its original, genuine, verifiable, and intended form as declared and represented.
Food fraud – A collective term used to encompass the intentional substitution, addition, tampering, or misrepresentation of food, food ingredients, or food packaging with the aim of increased economic gain.
Food hazard – Food safety hazards occur when food is exposed to hazardous agents which result in contamination of that food. Food hazards may be biological, chemical, physical, allergenic, nutritional, and/or biotechnology-related.
General Food Law – The General Food Law Regulation is the foundation of food and feed law of the European Union. It sets out an overarching and coherent framework for the development of food and feed legislation both at Union and national levels. To this end, it lays down general principles, requirements and procedures that underpin decision making in matters of food and feed safety, covering all stages of food and feed production and distribution.
GMO – Genetically modified organism. An organism whose genome has been engineered in the laboratory in order to favour the expression of desired physiological traits or the generation of desired biological products.
HACCP – Hazard Analysis Critical Control Point. A management system in which food safety is addressed through the analysis and control of biological, chemical, and physical hazards from raw material production, procurement and handling, to manufacturing, distribution and consumption of the finished product.
Mycotoxins – Naturally occurring toxins produced by certain moulds (fungi) that are chemical food hazards. The moulds can grow on a variety of different crops and foodstuffs including cereals, nuts, spices, dried fruits, apples, and coffee beans, often in warm and humid conditions.
PDO – Protected Designation of Origin. Registered designation of products that have the strongest links to their area of production and protected by intellectual property rights.
Phytotoxins – Plant toxins naturally produced as secondary metabolites, which play a central role in defending the organism from natural threats. The main groups of plant toxins are alkaloids, terpenes, glycosides, proteinaceous compounds, organic acids, and resinoid compounds.
Plant allergen – A plant derived substance that causes an allergic reaction in humans.
Supply chain – The network of all individuals, organisations, resources, activities, and technologies involved in the creation and sale of a product. A supply chain encompasses everything from the delivery of source materials from the supplier to the manufacturer through to its eventual delivery to the end user.
Traceability – The ability to track any food, feed, food-producing animal, or substance that will be used for consumption, through all stages of production, processing, and distribution.
WHO – World Health Organisation, a part of the United Nations that deals with major health issues around the world. The World Health Organization sets standards for disease control, health care, and medicines, conducts education and research programs, and publishes scientific papers and reports.
As the world’s most abundant kingdom of life, plants are virtually everywhere: in terrestrial, freshwater, and marine ecosystems, and even in the air in the form of pollen and spores (
However, plant biodiversity assessments are impeded by problems associated with species detection, taxonomic assignment, abundance quantification, and sample bias given the unknown spatial and temporal distribution of target species (
Improving plant biodiversity assessments is one of the century’s greatest challenges, as less than 10% of the world’s plant diversity is currently known, and its loss outpaces the rate at which it is discovered, inventoried, and protected (
While many plants are sessile and their biomass is mainly located below or above anchoring surfaces, some vegetative and reproductive plant parts (i.e., flowers, leaf debris, pollen, seeds) detach and are transported over short or great distances from the main organismal body until they are finally deposited onto substrates (i.e., ground, water, and more). Hence, plant DNA can be found in environmental substrates as organismal and extra-organismal DNA in various proportions, with each substrate potentially tracking different spatial and temporal signatures of biodiversity (
As no single marker provides resolution for all taxa, eDNA-based assessments often employ metabarcoding of different nuclear and chloroplast regions such as ITS, rbcL, and matK to harvest their complementary resolution power (see Chapter 11 Amplicon metabarcoding for information about these regions and their suitable applicability; CBOL Plant Working Group 2009;
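One way to harvest the complementary resolution of several markers is a simple consensus rule: accept a species-level call only when all markers agree, otherwise fall back to the genus. The sketch below illustrates this logic; the marker names are real loci, but the assignments are invented examples, and production pipelines use more sophisticated lowest-common-ancestor algorithms over full taxonomies.

```python
def consensus(assignments: dict) -> str:
    """assignments maps marker -> (genus, species); return the most
    specific name supported by every marker."""
    genera = {g for g, _ in assignments.values()}
    species = set(assignments.values())
    if len(species) == 1:                 # all markers agree on species
        g, s = species.pop()
        return f"{g} {s}"
    if len(genera) == 1:                  # agreement only at genus level
        return f"{genera.pop()} sp."
    return "unresolved"

# Invented example assignments per marker
print(consensus({"ITS": ("Vaccinium", "myrtillus"),
                 "rbcL": ("Vaccinium", "myrtillus"),
                 "matK": ("Vaccinium", "myrtillus")}))   # Vaccinium myrtillus
print(consensus({"ITS": ("Vaccinium", "myrtillus"),
                 "rbcL": ("Vaccinium", "uliginosum")}))  # Vaccinium sp.
```

This conservative behaviour mirrors why multi-marker metabarcoding yields genus-level calls for taxa where plastid markers lack variation but ITS does not, or vice versa.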
Despite major recent advances in detection, eDNA-based assessments remain limited in their ability to reliably quantify abundance, which in turn makes it hard to assess population status and take management actions (
Furthermore, presence/absence estimations provided by eDNA-based assessments can be misleading as DNA may remain in the environment after the organism is no longer present (
About a decade after the term eDNA was introduced, the eDNA scientific community has adopted different terminology in reference to the state, source, or substrate from which eDNA is isolated (
Pollen DNA is usually the main source of plant eDNA in airborne samples, although single-cell algae and leaf and flower fragments may also be present (
Faeces, mucus, and saliva contain DNA from the host and from the organisms that were ingested or that have been in contact with the host (
Soil and sediments, from both terrestrial and aquatic environments, are presumably the substrates where most plant DNA is present, as extra-organismal and organismal DNA from both active and dormant tissues, including roots, debris, fallen vegetative parts, seeds, and pollen, is gathered or ultimately deposited in these substrates. Because of the major presence of plant eDNA and the ubiquity of these substrates in both aquatic and terrestrial ecosystems, soil and sedimentary eDNA samples are particularly appropriate for plant assessments. Differences between soil and sediments can be ambiguous, as both are products of the weathering of the Earth’s crust (
Soil eDNA plant assessments have successfully characterised diversity in tropical (
As sediments are deposited throughout time and form distinguishable layers, the eDNA present in these layers (namely sedaDNA) can signal organisms that were likely locally present in ancient environments (
eDNA-based biodiversity assessments have proliferated in marine and freshwater environments in recent years, and our knowledge of the persistence, decay rates, and states of eDNA in water samples, and of its resolution compared to traditional assessments, has increased in parallel (
The assessment of aquatic plant eDNA in freshwater ecosystems has simultaneously enabled the early detection of invasive species (
Plant DNA can also be isolated from water samples in the form of snow, firn, and ice (
Bulk samples from plants are distinctly different from pitfall or Malaise traps filled with insects. Among plant bulk samples, one can distinguish natural bulk samples, such as pollen from pollen samplers or pollen scraped or washed from pollinating vectors, from artificially assembled samples, such as collected roots, leaves, or flowers. Nevertheless, all bulk samples contain organismal DNA from plant communities that can be used to assess plant or other diversity (
Flower bulk samples have been assembled to assess arthropod communities that leave DNA traces after either visitation or pollination (
A recent development in eDNA metabarcoding is the use of organisms as natural samplers of DNA (coined nsDNA;
In aquatic ecosystems, macroinvertebrates (Chironomidae, Coleoptera, Hemiptera, Ephemeroptera) that feed both on aquatic vascular plants and plant fragments leached to the environment hold great potential to signal overall vegetation implicated in freshwater trophic relationships. Likewise, filtering organisms or animals that use specialised structures to filter fine particles from the water in lakes and rivers harbour the same potential, i.e., sponges (Ephydatia), Simuliidae, Ephemeroptera, Chironomidae, and Trichoptera.
For the assessment of terrestrial vegetation in tropical areas, bats hold great potential as biotic samplers of plant DNA since omnivorous and frugivorous communities are abundant and thus easy to collect (
Finally, amplifying hypervariable markers from biotic DNA samplers, e.g., COI for animals, has recently gained attention, as it can assess diversity below the species level and thus signal ecosystem population assemblages in space and time (metaphylogeography; X.
While the exploration of eDNA samples and methods for plant assessments is still in its infancy, eDNA has already revolutionised how, and how quickly, biodiversity can be inventoried. Plant detection via eDNA has enabled the discovery of plants living in extreme and/or ancient environments and yielded myriad applications with societal relevance. A decade after the rise of eDNA-based assessments, the limitations of this method across different eDNA samples are still being recognised, while in parallel different strategies are being developed to overcome and mitigate them. In this rapidly developing field, it is essential to combine the basics of eDNA metabarcoding with the most recent insights and developments to devise the most robust study design to answer your research questions.
Organismal DNA – The DNA that is isolated from bulk-extracted mixtures of organisms that are separated from the environmental sample. Also named community DNA.
Extra-organismal DNA – DNA originating (i) from biological material shed from an organism as part of tissue replacement or metabolic waste; (ii) from biologically active propagules such as gametes, pollen, seeds, or spores; or (iii) from cell lysis or cell extrusion (
Environmental DNA – DNA captured from modern environments, i.e., seawater, freshwater, soil, or air; or ancient environments, i.e., cores from sediment, ice or permafrost (
Intracellular DNA – DNA that is located within cell membranes.
Extracellular DNA – DNA that is located free in the environment after cell lysis or cell extrusion.
Anemophily – Plant pollination where pollen is distributed by wind, i.e. wind pollination.
Firn – Crystalline or granular snow, especially on the upper part of a glacier, where it has not yet been compressed into ice.
Melissopalynology – The study of pollen contained in honey and, in particular, the pollen’s source.
Wildlife trade is the trading of living or dead wild plants, fungi, or animals, either as whole organisms or as parts and the products derived from them. This varies from rare animal and plant species for collectors, to ingredients made of wild organisms for medicinal or cosmetic purposes, to wood for timber, paper, craftwork, and construction, and various animals, plants, and mushrooms for nutritional purposes. Although conservation concerns about the unsustainable use of wildlife became more prominent from the 1960s onward, evidence shows that large-scale wildlife trade is older than the Roman Empire and ancient Greek civilisations (‘t Sas-Rolfes et al. 2019). International wildlife trade is a billion-dollar industry, and together with illegal wildlife trafficking, it has become a substantial threat to global biodiversity and the preservation of endangered species (
The impacts of wildlife trade are substantial, with both conservation and socio-economic consequences. Unsustainable trade could lead to (local) extinction of populations or even entire species. For plants that occupy a specialised niche, it can destabilise interactions with other species, with potential consequences for the entire ecosystem. Therefore, after habitat loss, wildlife trade is the second-biggest threat to species survival (WWF, 2020). Not only does illegal wildlife trade threaten biodiversity through consistent overexploitation, it also competes with the legal use of natural resources and results in a substantial loss of income for both local communities and governments (
In order to regulate the trade in vulnerable wildlife, the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) was established in 1975. Species at risk of overexploitation due to international trade are listed on one of three appendices depending on how much they are threatened by unrestricted trade. Appendix I lists the most endangered species, for which commercial trade is not permitted - except for pre-convention material - and for which non-commercial trade is strictly regulated. Appendix II lists species that may become threatened with extinction if trade is not carefully controlled, and for which a proper permit is therefore required. Finally, Appendix III lists species that are protected in at least one country and for which the assistance of other CITES Parties is required to control trade. Listing a species on Appendix III helps to establish international cooperation to control trade in that species according to the laws and regulations of the listing country. Species can be added to Appendices I and II, removed from them, or transferred between them only by a vote at a Conference of the Parties (CoP), a meeting of the CITES Parties to review the implementation of the Convention. Species can be added to or removed from Appendix III at any time by any Party unilaterally (CITES, n.d.).
At the moment, roughly 39,000 species, including ca. 6000 species of animals and ca. 33,000 species of plants (395 species in Appendix I, 32,364 species in Appendix II, and 9 species in Appendix III) are protected by CITES (CITES, n.d.). In countries that are signatories to the convention, import and export permits must be issued for international trade of plants and animals listed in these appendices. Some countries set annual export quotas for certain species to ensure that they will not be traded beyond the sustainable limits for species survival. Non-compliance with CITES regulations can lead to confiscation of the material as well as fines and prison sentences, and in some cases trade sanctions against a country (CITES, n.d.). Since 2017, CITES has also facilitated the Wildlife Cybercrime Working Group that has coordinated national responses to the threat posed by online trade (
Other international and national regulations have been put into place to support the implementation of and in some cases expand on CITES regulations. Examples are the EU Action Plan Against Wildlife Trafficking (European Commission 2016), the EU Wildlife Trade Regulations (European Commission 2010), European Union Timber Regulation (EUTR), United States LEMIS wildlife trade data (
Despite the fact that plant species far outnumber animal species on the CITES appendices, in the public discourse on wildlife trade and conservation, charismatic mammals such as elephants, rhinos, tigers, and lions usually take centre stage. Smaller animals (e.g., insects, molluscs), but also most plant groups, receive less attention and generate less funding in discussions regarding wildlife trade and conservation. And although plants appear frequently in national and international regulations, regulatory enforcement and additional conservation measures still primarily target iconic megafauna (
Plant blindness is a psychological bias that leads us to notice (large) animals, and take plants largely for granted, reducing them to background vegetation for other organisms. The term was coined by
Plant blindness has been institutionalised throughout society, from (higher) education to governance and wildlife management (
Apart from the limited attention that plants receive in research, education, and conservation, effective control of trade in plant species is hampered because some of the traded goods are difficult to recognise, either because they are processed or because they contain only parts of the organism, which lack the morphological characters needed for identification (
Other challenges are posed by the growing use of the internet for transactions, which makes wildlife material more readily accessible and at lower costs, while preserving anonymity. The internet is not only increasingly used to sell and obtain specimens, but even to organise poaching events (
Lastly, since international wildlife trade by definition transcends borders, enforcement of legal trade requires coordinated action between multiple countries to address the whole supply chain. While many institutional collaborations already work across international borders to help track and catch illegal wildlife trafficking syndicates - including financial institutions, NGOs, customs and police forces, and online tech platforms - one of the main bottlenecks in combating wildlife trade will be sustaining sufficient international attention to allow the detection and prevention not just of single illegal transactions, but of organised trade networks operating at larger scales.
The importance of wildlife and the impacts of unsustainable trade on biodiversity are undeniable, which highlights the urgency of developing high-throughput methods that are widely applicable. The next section presents some of the most commonly used methods in illegal trade identification today. In the final section, we provide recommendations on which techniques to use for the identification and tracking of illegally traded plants, and discuss future developments that could improve global wildlife trade monitoring and control.
Traded plant materials come in all shapes and sizes and in different stages of processing, ranging from complete living plants to raw timber logs and engineered wood products. There is a wide variety of molecular and non-molecular methods for illegal wildlife trade monitoring, from DNA (meta)barcoding and genetic methods, to chemical identification, to computer vision and pattern recognition tools. Each of these methods is applicable to certain types of materials and requires knowledge about different aspects of the traded product that determine its legality, including species identity, geographic origin, source population (wild or cultivated), and sample age. Here we describe the most commonly used methods to identify each of these aspects, and why they are important.
Methods for species identification are used to ascertain whether the organism being traded is CITES-listed or not. Depending on the taxonomic rank that is listed, it may be necessary to identify the exact species (e.g., Panax ginseng), genus (e.g., Aloe spp.), or family (e.g., Orchidaceae) to which an organism belongs. Species identification methods include genetic methods (based on DNA sequence information), chemical methods (based on molecular mass spectra), and computational methods (based on image recognition). Each of these methods requires suitable reference data against which to query an unknown sample. The availability of reference data and the nature of the sample will dictate which method is most suitable for species identification.
The main chemical method used to identify species is Direct Analysis in Real Time (DART) coupled with time-of-flight (TOF) mass spectrometry (DART-TOF MS). DART-TOF MS consists of two parts: DART is an ionisation source that ionises ambient atmospheric molecules by using electronically excited-state helium which reacts with the molecules in the investigated sample to produce analyte ions (
Thanks to machine learning and computer vision, expert systems are playing an increasingly important role in identification of a wide variety of wildlife related objects, such as medicinal leaves (
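As a toy illustration of how such expert systems classify objects, the sketch below assigns a sample to a species by nearest-neighbour comparison of feature vectors. The feature values and species labels are invented for demonstration; production systems typically use deep convolutional networks trained on large labelled image sets rather than hand-crafted features.

```python
import math

# Toy reference set: (feature vector, species). The two features could be,
# say, leaf length/width ratio and margin serration density -- illustrative only.
REFERENCE = [
    ((2.1, 0.8), "Species A"),
    ((2.0, 0.9), "Species A"),
    ((4.5, 0.1), "Species B"),
    ((4.8, 0.2), "Species B"),
]

def classify(sample, k=3):
    """Assign a sample to the majority species among its k nearest
    reference feature vectors (Euclidean distance)."""
    neighbours = sorted(REFERENCE, key=lambda r: math.dist(sample, r[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

print(classify((2.2, 0.7)))  # Species A
```

The same pattern, a distance function plus a labelled reference set, underlies many simple identification baselines.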
DNA-based identification methods can use different genomic markers that offer different levels of identification, from universal loci such as conserved genes or intergenic spacers, to neutrally evolving markers with sufficient variation to resolve specific taxa, such as microsatellites and genome-wide Single Nucleotide Polymorphisms (SNPs). In addition to these markers, which require information about genomic context, it is also possible to identify species and populations using alignment-free shotgun data (see Chapter 17 Species delimitation).
For species identification, DNA barcoding (see Chapter 10 DNA barcoding) is often the method of choice. It can effectively identify traded plant species in a number of cases, including the identification of rosewood (Dalbergia spp.), species used in Ayurvedic medicine (Decalepis spp.), and cycads (Encephalartos spp.) (
An advantage of DNA barcoding is that the sequence data are universally comparable among labs and across large numbers of species. But since DNA barcoding was originally meant to distinguish between species rather than within species, the method often falls short when higher resolution is needed. Identification below the species level may be useful if the legality of trade is determined by the source population. In some cases, the country of origin determines the legal status of traded plants, which requires population-level data for a collection of reference samples spanning the species range. Cost-effective traditional population genetic methods use a number of species-specific variable markers, typically simple sequence repeats (SSRs) or inter simple sequence repeats (ISSRs), which can be highly variable and reveal fine-grained population structure. More recently developed high-throughput sequencing methods cover larger sections of the genome, such as reduced representation sequencing (RAD-seq, target capture) or low-coverage whole genome shotgun sequencing (also known as genome skimming; see Chapter 16 Whole genome sequencing).
These methods can generate large numbers of SNPs that allow inference of geographic origins at various scales. Although the increased costs for library preparation and sequencing mean that these methods are not economically feasible in all cases, they offer the added advantage that functional analyses of genes, or of markers linked to genes with adaptive significance, are possible.
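The logic of assigning a sample to a source population from marker data can be sketched with a toy likelihood calculation: pick the population whose allele frequencies make the observed multilocus genotype most probable. The populations, loci, and frequencies below are invented; real assignment tools handle diploid genotypes, missing data, and many more loci.

```python
import math

# Illustrative allele frequencies at two SSR loci for two hypothetical
# source populations (all values invented for demonstration)
POPULATIONS = {
    "Population West": {"locus1": {"A": 0.7, "B": 0.3},
                        "locus2": {"C": 0.9, "D": 0.1}},
    "Population East": {"locus1": {"A": 0.2, "B": 0.8},
                        "locus2": {"C": 0.4, "D": 0.6}},
}

def assign(genotype):
    """Return the population with the highest log-likelihood of producing
    the observed genotype, assuming independent loci; unseen alleles get
    a tiny pseudo-frequency to avoid log(0)."""
    def loglik(freqs):
        return sum(math.log(freqs[locus].get(allele, 1e-6))
                   for locus, allele in genotype.items())
    return max(POPULATIONS, key=lambda pop: loglik(POPULATIONS[pop]))

print(assign({"locus1": "A", "locus2": "C"}))  # Population West
```

The same maximum-likelihood idea scales to the SNP panels mentioned above, where thousands of loci sharpen the assignment considerably.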
Geographic origins have even been identified at the level of continents using genome skimming (
While population genetic markers can offer unmatched resolution of spatial variation, a general disadvantage is that many of them (with the exception of those used in RAD-seq and shotgun sequencing) need to be tested or developed specifically for each species, and reference data must be generated for populations across the distribution range to be tested. Stable isotope analysis can also infer the geographic origin of samples, and does not depend on species-specific reference data to the same extent as genetic methods do. It is based on the principle that the abundance of stable isotopes in the environment depends on both climate and geography, which creates a correlation between a sample's stable isotope profile and its geographic origin (
There are two methods to measure radiocarbon abundance: radiometric dating and accelerator mass spectrometry (AMS). Both can be used to date samples based on the decay of the radioactive carbon isotope 14C. The estimated age gives an indication of whether or not the traded sample is pre-convention material, meaning that it predates the Convention or the listing of the species (e.g.,
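As a worked example of the underlying arithmetic, the conventional (uncalibrated) radiocarbon age can be computed from the measured fraction of modern carbon using the Libby mean life, 5568 years / ln 2 ≈ 8033 years. Calendar ages additionally require calibration against a curve such as IntCal, and material grown after ~1950 is instead dated against the atmospheric bomb pulse, so treat this purely as an illustration of the decay formula.

```python
import math

# Conventional radiocarbon age uses the Libby half-life of 5568 years,
# giving a mean life of 5568 / ln 2 ~= 8033 years
LIBBY_MEAN_LIFE = 8033  # years

def conventional_age(f14c):
    """Conventional radiocarbon age (years BP) from the measured fraction
    of modern carbon (F14C); valid for F14C <= 1."""
    return -LIBBY_MEAN_LIFE * math.log(f14c)

# A sample retaining ~99% modern carbon is only decades old, so it could
# well post-date the 1975 convention
print(round(conventional_age(0.99)))  # 81
```

Conversely, a sample with F14C around 0.98 already dates back roughly 160 years, comfortably pre-convention.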
Currently, no genetic methods for inferring sample age can compete with radiocarbon dating, and while DNA fragment sizes tend to be shorter for older and more degraded plant tissues, this alone cannot be used to determine the plant age (see Chapter 2 DNA from museum collections). For other purposes, genetic markers are the method of choice to infer species identity and geographic origin, whenever DNA extraction is a realistic option. Any genetic method will however be limited by the quality and quantity of DNA that can be extracted, which can be notoriously difficult for some materials, especially timber and processed products (
Despite significant progress in methods and computational analyses, applications for most methods are still limited by the lack or incompleteness of suitable reference data. As shown in Table
A comparison of the methods used for identifying plants in trade with an indication of their applications and limitations.
| | DNA (meta)barcoding | Population genetic markers | Computer vision and pattern recognition | DART-TOF MS | AMS/14C dating | Stable isotopes |
|---|---|---|---|---|---|---|
| Material input | Whole plants, organs, tissues, powder | Whole plants, organs, tissues, powder | Timber, leaves, flowers, pollen | All | Anything containing organic matter | Anything containing organic matter |
| Purpose of application | Determine taxonomic identity, from genus to species level | Determine population or region of origin | Determine taxonomic identity, from genus to (sometimes) species level | Determine taxonomic identity at species level | Determine age of material | Determine region of origin |
| Availability of reference data | Well developed for temperate species, less so for tropical species and regions | Needs to be developed and referenced for each species separately | Being developed for CITES-protected timber and plants | Being developed for CITES-protected timber | Calibration might be required depending on the sample | Needs to be developed for each region separately |
When one method lacks sufficient reference data or is not sensitive enough to infer species identity or population of origin, multiple identification techniques (e.g., DNA barcoding, machine learning, and DART-TOF MS) can be combined to improve identification accuracy. Developing an integrated identification framework, which links reference databases and connects multiple sources of data for taxa of interest, is expected to play a major role in the future of regulating wildlife trade, though this would rely on standardisation and equitable distribution to enforcement agencies around the world. Coupled with new technologies that ensure quality control and compliance across the supply chain of wildlife products, the tools available for wildlife trade monitoring can aid not just the detection and confiscation of illegally traded goods, but also the transparency and traceability of legally traded commodities.
With blockchain for example, it may eventually be possible to develop a secure and robust infrastructure to register and track wildlife-related products from source to destination (
The technology has already proven its relevance in agriculture and fisheries, where the WWF Blockchain Tuna Project demonstrates it is possible to track the history of a fishing product from ocean to plate with just a QR Code (WWF, 2018). The customisable and scalable features of blockchain make it a promising technology for application to traded timber and other wildlife-related products (MoonX, 2019). Once it is possible to keep track of all steps taken throughout the commercialisation of wild harvested plants, the checkpoints for identification will no longer be restricted to points of entry or sales, enabling monitoring of wildlife trade from the source.
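The core idea of such a ledger, namely that each custody record commits cryptographically to the one before it so that past records cannot be silently altered, can be sketched in a few lines. The events and permit number below are hypothetical, and a production blockchain would add digital signatures, consensus among parties, and distributed storage; this only demonstrates the tamper-evidence property.

```python
import hashlib
import json

def add_block(chain, record):
    """Append a record (e.g., a custody event for a timber shipment) to a
    toy hash-chained ledger; each block commits to the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"record": record, "prev": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    chain.append(block)
    return chain

def verify(chain):
    """Recompute every hash; any tampered record breaks the chain."""
    prev = "0" * 64
    for block in chain:
        body = {"record": block["record"], "prev": block["prev"]}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True

chain = []
add_block(chain, {"event": "harvest", "permit": "CITES-II-12345"})
add_block(chain, {"event": "export", "port": "Rotterdam"})
print(verify(chain))          # True
chain[0]["record"]["event"] = "relabelled"
print(verify(chain))          # False -- tampering is detected
```

Because each block's hash covers the previous hash, rewriting any earlier custody event invalidates every later block, which is what makes the supply-chain history auditable.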
Accelerator Mass Spectrometry (AMS) – A form of mass spectrometry that accelerates ions to extraordinarily high kinetic energies before mass analysis.
Ayurvedic medicine – A medical system from India that aims to cleanse the body and to restore balance to the body, mind, and spirit by using diet, herbal medicines, exercise, meditation, breathing, physical therapy, and other methods.
Blockchain – A decentralised and distributed network that is used to record transactions across many computers.
Computer vision – An interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos.
Expert systems – In artificial intelligence, an expert system is a computer system emulating the decision-making ability of a human expert.
Inter-simple sequence repeats (ISSRs) – ISSRs are regions in the genome flanked by microsatellite sequences. PCR amplification of these regions using a single primer yields multiple amplification products that can be used as a dominant multilocus marker system for the study of genetic variation in various organisms.
Near infrared spectroscopy – A spectroscopic method that uses a certain range of the electromagnetic spectrum from 780 nm to 2500 nm which is called the near infrared region.
Pattern recognition – The automated recognition of patterns and regularities in data.
Restriction site Associated DNA Sequencing (RAD-Seq) – A fractional genome sequencing strategy, designed to interrogate anywhere from 0.1% to 10% of a selected genome.
Simple sequence repeats (SSRs) – SSRs are DNA tracts in which a short base-pair motif is repeated several to many times in tandem. These sequences experience frequent mutations that alter the number of repeats.
Spectroscopy – The study of the interaction between matter and electromagnetic radiation as a function of the wavelength or frequency of the radiation.
X-ray microtomography – A 3D modelling method that uses X-rays to create cross-sections of a physical object, which can be used to recreate a virtual model without destroying the original object.
Forensic science is the use of science in criminal cases. Many scientific disciplines can be involved, among others chemistry, botany, entomology, and physics. In many trials, the presence and identification of physical evidence can be a critical factor in determining the final verdict. Physical evidence may include, among other things, plant material such as leaves, flowers, fruits, or pollen. In this sense, forensic botany is the study of plants during criminal investigations, as botanical samples can be critical evidence in crimes (
Plant material that is usually found at crime scenes may include leaves, stems, seeds, pollen, flowers, or any other plant parts (
Plant seeds can be caught and carried in a pant cuff or on a shoe, and plant leaves and stems can be found in a victim's and/or suspect's car. Plant parts can also be identified in the victim's stomach, nose, or lungs, under fingernails, or on skin, clothes, or hair. However, data generated from recovered botanical material are often not fully exploited, since forensic agents may lack the appropriate know-how. The role of the forensic botanist within the investigative process is to compare samples recovered from crime scenes and to examine the biological material macroscopically and microscopically. Although botanists might be able to identify a species phenotypically, this is often not possible, and DNA analysis with molecular techniques must be used. New tools, especially molecular ones, are being developed in forensic botany to aid in both criminal and civil cases. Although chemical analysis of plant material can serve as evidence when a relevant reference database is available, DNA is much more stable than many macromolecules and metabolites and can persist for long periods, even when broken into smaller fragments. It is therefore very often the preferred method for identifying plants in forensics (
Examination of DNA is a powerful technique allowing the identification of an individual. A suspect’s DNA and a crime scene sample are matched to reference databases containing the profiles of large numbers of individuals generated over time (
Since then, the development and application of DNA-based methods and genetics have revolutionised forensic science. Nowadays the use of DNA as forensic evidence is routine, with a major impact on the criminal justice system and society (
Forensic genetics is progressing rapidly, as highly sensitive methods for DNA recovery and new sequencing technologies are being developed (
DNA barcoding is the most commonly used method for genetic identification in forensic genetics (
In plants, DNA barcodes are mainly derived from the chloroplast (
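The matching step at the heart of barcode-based identification can be illustrated with a minimal similarity search against a reference database. The sequences and taxon names below are invented, the fixed 97% threshold is a common convention rather than a universal rule, and real tools such as BLAST use scored local alignments instead of this naive positional identity.

```python
def identity(query, reference):
    """Percent identity between two aligned sequences of equal length
    (a stand-in for proper alignment-based comparison)."""
    matches = sum(q == r for q, r in zip(query, reference))
    return 100 * matches / len(query)

def best_match(query, reference_db, threshold=97.0):
    """Return the reference taxon with the highest identity to the query,
    or None if no reference exceeds the threshold."""
    taxon, score = max(((t, identity(query, s)) for t, s in reference_db.items()),
                       key=lambda x: x[1])
    return taxon if score >= threshold else None

# Hypothetical mini-database of aligned chloroplast barcode fragments
db = {"Taxon A": "ATGGCGTTACCG", "Taxon B": "ATGGCTTTACGG"}
print(best_match("ATGGCGTTACCG", db))  # Taxon A
```

A query that matches nothing above the threshold returns None, mirroring the "no reliable identification" outcome that incomplete reference databases often produce in practice.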
Recent developments in DNA analysis now allow for the wider use of biological materials, for example, mixtures of samples such as soil or stomach contents (Figure
Palynology is the study of palynomorphs, including pollen, spores, dinocysts, etc. Pollen grains are the most studied palynomorphs, and in forensics especially, pollen can be an important piece of evidence if it can be associated with a crime scene or retrieved from the suspect or from equipment used at the crime scene. Pollen is of microscopic dimensions and is easily retained on clothes and household objects and in soil. Crime scenes limited to a few square meters, like a rape scene or the entry point of a burglary, are very often the best candidates for the use of forensic palynology (
In Auckland (New Zealand), a prostitute claimed she was attacked in a passageway by a suspect, around seven meters away from the suspect’s car (
In Taipei (Taiwan), the body of a young woman was found lying by a drain in an urban area. It was unknown whether she was a homicide or suicide victim. Her body showed no obvious bone fractures and it was suspected that she was involved in a hit and run by a car. By the time investigators arrived at the scene, the body had already been transferred to a hospital, where a tiny berry and stem was found in the victim’s hair. This berry was however not commonly found in the area where the victim lived or where the body was found. The investigators discovered the same plant on the edge of a railing above a drain attached to a building directly next to where the body was found, suggesting that the woman fell from the building, and the plant piece became tangled in her hair during the fall (
A murder case in 1992 in Arizona (USA) revealed the power of forensic botany. Seed pods of a Palo Verde tree (Cercidium sp.) were retrieved from a suspect’s pickup truck (
In a Finnish study, RAPD and SSR molecular markers were used on mosses to connect three suspects to a murder scene (
Forensic botany also helped to resolve a case of theft that occurred at a Catholic church in Florence (Italy). In this case, the thief made a mistake, leaving faecal material at the crime scene, as, unfortunately for him, he suffered from diarrhoea. Although a priest at the church had previously cleaned the crime scene of faecal matter, there was still enough material left to be collected by the police. The police suspected a local man with a police record who suffered from Crohn's disease. The suspect denied the accusations and presented an alibi. The police, who had retrieved his blue jeans from the time of the robbery, found them stained with faeces, yet the suspect still denied being guilty and challenged the police to "prove it". The comparison of the two samples revealed 14 dietary items of botanic origin that matched and none that did not, forcing the suspect to confess to the crime (
In the early 1980s, the body of a young girl was found; her last known meal had been with her boyfriend at a local fast-food restaurant. An autopsy, however, revealed the presence of vegetables in her stomach that were not on the fast-food restaurant's menu. A botanical investigation confirmed the autopsy results, suggesting she had had another meal before her death, which helped clear her boyfriend of any charges. The case was solved a few years later when a serial killer confessed to the murder (
In the Black Widow case, in 1993, a domestic homicide was solved with the help of forensic botany. The victim, Gerry, was married to Jill, who had seven previous marriages. When Gerry found out that Jill had not actually divorced her seventh husband before marrying him, he went to court to annul the marriage and freeze his assets. On the day of his death, Gerry had a breakfast of coffee, hash browns, eggs, and toast, and Jill and her then-boyfriend were spotted near his house. Forensic botanists examined the contents of his stomach and found starch and onion, concluding that the only meal he had eaten was his breakfast and that he had not gone out for another meal. This coincided with the time that Jill was seen at his house and allowed the police to issue a search warrant for her property. The police found a gun and other evidence, which led the court to find her and her boyfriend guilty (
In a homicide case, a body was found in a stream near a roadside covered with the knotgrass Polygonum aviculare. Seeds of knotgrass were recovered from the wheels of the suspect’s car. Additional knotgrass samples were collected from different sites and locations. The investigators used AFLP molecular markers to demonstrate that the origin of the seeds found in the suspect’s car came from the crime site (
Metagenomic analysis of human DNA was used in a sexual assault case that took place in the Netherlands in 2015 and involved a 28-year-old woman. The woman preserved her clothes after the assault and also took intimate samples from herself. Initially, the samples were analysed using capillary electrophoresis (CE). A year later, these CE results produced a hit in the Dutch convicted-criminal database. However, the analysis was challenged, and the ambiguous results allowed the suspect to go free. Only after the use of massively parallel sequencing was it possible to match the suspect's environmental DNA with the assault evidence, which finally led to his conviction in 2018 (de Knijf 2020).
AFLP – Amplified fragment length polymorphism is a PCR-based technique that uses selective amplification of a subset of digested DNA fragments to generate and compare unique fingerprints for genomes of interest.
RAPD – Random amplification of polymorphic DNA is a PCR-based technique in which DNA fragments are amplified at random using primers with arbitrary nucleotide sequences.
RFLP – Restriction fragment length polymorphism is a technique that utilises variations in DNA, i.e. polymorphisms, to differentiate between individuals.
RFLP analysis – A DNA sample is fragmented with restriction enzymes, which selectively cleave the DNA. The produced fragments are separated with agarose gel electrophoresis and since different individuals have fragments of different length, it is possible to distinguish between them.
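The fragment-pattern logic behind RFLP analysis can be illustrated with a short sketch that cuts a linear sequence at every occurrence of a restriction site and reports the resulting fragment lengths; a point mutation that destroys a site changes the pattern, which is exactly the polymorphism the gel reveals. The sequences are invented, the EcoRI site (GAATTC) is used only as a familiar example, and for simplicity the cut is made at the start of the site rather than at the enzyme's true cleavage position.

```python
def fragment_lengths(sequence, site):
    """Lengths of the fragments produced when a linear DNA sequence is
    cut at every occurrence of a restriction site (cut placed at the
    start of the site for simplicity)."""
    cuts, start = [], 0
    while (pos := sequence.find(site, start)) != -1:
        cuts.append(pos)
        start = pos + 1
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

# Two individuals differing by a point mutation that destroys the
# second EcoRI site yield different fragment patterns on a gel
allele1 = "AAAAGAATTCCCCCGAATTCTTTT"
allele2 = "AAAAGAATTCCCCCGACTTCTTTT"  # second site lost by mutation
print(fragment_lengths(allele1, "GAATTC"))  # [4, 10, 10]
print(fragment_lengths(allele2, "GAATTC"))  # [4, 20]
```

Separated by electrophoresis, the two alleles produce visibly different banding patterns, which is how RFLP distinguishes individuals.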