Corresponding author: Hugo de Boer (
Alexandre Antonelli1
1 Director of Science, Royal Botanic Gardens, Kew, United Kingdom
Names are the carriers of knowledge. Without names, much of science would be meaningless. Names give us insight into the diseases that affect our health; the objects that sustain our economies; the celestial bodies that travel in the Universe. Names solve ambiguity.
In botany, the name of a plant may provide the first clues as to its characteristics, also called traits. Is it edible, or poisonous? Beautiful, or ugly? While some traits are relative (edible by whom, ugly to whom?), others are absolute: thorny, succulent, epiphytic. Some are obvious, others elusive. From morphological descriptions and DNA sequences to historical accounts and traditional uses, they are all linked by the name.
Until recently, the reliable identification of plants was the task of a select few: the taxonomists. Today, this is less so. The molecular identification of plants through DNA barcodes has been shown to perform just as well, and in fact often better, than taxonomists for many taxa, particularly when specimens lack reproductive structures. Other techniques, such as image recognition through machine learning and the spectrophotometric signature of leaves, can yield similar results. Does this mean the demise of taxonomists is on the horizon?
Not at all. I believe it is very much the opposite: in the current environmental crisis, the need to document and protect the world’s biodiversity has never been more acute. At the same time, some 20% of all plant species have not yet been scientifically described, and many of them may disappear even before we have identified and characterized them. The work of taxonomists therefore remains critical, but as molecular identification of species is underway and set to become routine across the private and public sectors, expert time can now be reallocated from bulk identifications to the training of students, the build-up of physical and digital reference collections, and the further development of identification methods. Technologies are here to help – not replace – taxonomy, by complementing human strengths and compensating for some of our human weaknesses: an insufficient memory, a biased brain, and lack of time.
This book is for you who are curious about how plants can be identified using DNA: the most powerful source of information for linking a plant to a name. This may sound trivial, but it is not. Don’t despair in advance, though: it is doable, mostly fun, and always rewarding. You just need to learn how.
Here, you will not only learn how various types of materials containing plant fragments can be identified to species in the lab and how to execute sophisticated computer analyses, but also gain a deeper understanding of the complexities and challenges faced by taxonomy in general, and plant identification in particular, including the lack of comprehensive reference databases. Enforcing strict species concepts onto nature’s inherent fluidity doesn’t always work, and despite all recent advances in this field it still happens that some plant samples cannot be confidently named. Yet, if this ever happens to you, this initially frustrating insight can also be scientifically revealing, and help you design further experiments.
The applications of molecular identification are far more numerous and trans-disciplinary than most people would imagine. Several chapters take a deep dive into applications in fields as seemingly disparate as palaeobotany and healthcare, but as I argued at the start of this text, they are all unified by a common denominator: the name, the information-carrier.
I hope you will find this book as inspiring, informative, and revelatory as I have, and that you will choose to carry out your own projects using the molecular identification of plants. And if you do so, just don’t forget to cite the chapters that inspired you!
Hugo de Boer1
1 Natural History Museum, University of Oslo, Norway
An estimated 340,000–390,000 vascular plant species are known to science (
Organismal diversity is the foundation of all biological research, but species discovery and delimitation require taxonomic skills. Even the most experienced taxonomists can rarely critically identify more than 0.01% of the estimated 10–15 million species (
The global scientific community lacks the expertise and continuity to identify all species diversity, and biodiversity is being lost faster than we can discover and describe new taxa (
DNA-based species identification, i.e., molecular identification, makes it possible to identify species precisely from trace fragments such as pollen (
These innovations in molecular identification enable us to detect and identify species in places and settings that were unimaginable only a few decades ago, or even in 2020 (
Natalia Angela Saoirse Przelomska1, Steven Dodsworth1,2, Ana Rita Giraldes Simões1, Panagiota Malakasi1, Imalka Kahandawala1, Roseina Woods1, Timothy Fulcher1, Yannick Woudstra1,3,4,5, Olwen M. Grace1
1 Royal Botanic Gardens, Kew, Surrey, United Kingdom
2 University of Portsmouth, Portsmouth, United Kingdom
3 Natural History Museum Denmark, University of Copenhagen, Denmark
4 Gothenburg Global Biodiversity Center, Department of Biological and Environmental Sciences, University of Gothenburg, Sweden
5 Department of Plant Sciences, University of Oxford, United Kingdom
Deoxyribonucleic acid (DNA) is the blueprint of life. DNA encodes genes, which carry instructions for the production of proteins, the fundamental components of a cell’s machinery. DNA was confirmed as the genetic material in cells, and thereby the basis of heredity, in the 1940s (
A fundamental tenet of molecular biology is that DNA is transcribed into ribonucleic acid (RNA), and subsequently translated into amino acids that form a protein sequence. We now have a much more detailed understanding of this framework, including the varied roles of RNA in gene expression and regulation, and the role of epigenetics—heritable changes in DNA that do not alter the base sequence (e.g., methylation). Since the discovery of DNA, there has been a steady increase in the use of DNA sequences as molecular markers in varied biological contexts, including medical and forensic applications, elucidation of genes encoding adaptive traits, understanding population genomic processes, as well as systematics of prokaryotic and eukaryotic organisms.
Most DNA extraction protocols extract total cellular DNA. In certain experimental cases, it can also be preferable to target either DNA contained in the nucleus or DNA comprising organellar genomes (in plants: mitochondria and plastids). Organellar genomes are much smaller than any plant nuclear genome.
As with virtually all eukaryotes, plants have endosymbiotically derived mitochondria for cellular respiration and energy production. However, compared to other eukaryotic kingdoms (animals in particular), the mitochondrial genome of plants is quite large, ranging between 200 and 750 kbp in size (
In contrast, plastid genomes (e.g., found in chloroplasts of leaves or amyloplasts of cereal grains) have a very stable genomic structure and a size of around 150 kbp in most cases (
Nuclear genomes, particularly in angiosperms, are highly variable in size, with the angiosperm mean and modal 1C value (the amount of DNA in an unreplicated gametic nucleus) both at around 5 pg (~5 Gbp) (
The first isolation of DNA, by the Swiss physician Friedrich Miescher in 1869, happened accidentally while studying proteins from leukocyte nuclei (
Plants possess a tough cell wall made up of cellulose and other compounds such as lignin, in addition to a cell membrane. This necessitates a robust first step for plant DNA extraction that disintegrates the structure of the plant tissue and breaks down cell walls. In a low-throughput scenario (or for samples that are tougher to disrupt), this could involve flash freezing the tissue with liquid nitrogen followed by grinding with a pestle and mortar. For higher throughput of samples, tissue-disrupting machinery can be applied. The ground material should then be taken forward immediately to the chemical steps of the process, which involve breakdown of the cellular membrane to release the lysate containing the soluble DNA. This is then separated from cell debris and other insoluble material. Various methods are subsequently used to separate DNA molecules from the remaining material, which can contain soluble proteins, nucleic acids, and small molecular metabolites (
Numerous protocols and procedures have been developed to extract DNA from plant material of varying origins (
A major innovation in DNA extraction protocols from plant material was developed by Doyle and Doyle (
Plant material for any research project must be collected ethically and legally, and the preparation of DNA extracts is no exception. Permission, prior informed consent, and mutually agreed terms of use must be obtained before using plant tissue for DNA extraction, in accordance with the Convention on Biological Diversity. This includes the fair and equitable sharing of benefits arising from the utilisation of genetic resources (as outlined in the Nagoya Protocol). National and international law and conventions apply to derivatives of biological materials, including DNA extracts and their transportation. The same principles apply to botanical collections such as seeds, silica-dried specimens stored in a tissue bank, herbarium specimens, or plants in living collections. The terms under which they are stored in a collection may restrict the use of specimens for research and require additional permissions (for instance, from the regulatory authority in the country of origin) before they can be used. The storage and future use of DNA extracts, likewise, must comply with the terms of the permissions granted, which could include being stored indefinitely for future research, returned to the country or institute of origin, or discarded. See Chapter 2 DNA from museum collections for guidance about your responsibilities as a researcher.
DNA can be extracted from healthy plant tissues including leaves, flowers, buds, seeds, roots, bark, and even spines. Young leaf tissue is the preferred starting material (
Successful extraction of high-quality DNA from any plant material depends on the material being prepared correctly, dried rapidly (without excessive heat treatment), and stored in a dark, dry place to minimise degradation of its DNA. DNA degradation prior to extraction is caused by the release of endogenous nucleases during cellular lysis, which may be accelerated by environmental factors such as heat and humidity (
The extraction method is determined by the plant material available. For most kit- and CTAB-based protocols, a 1 cm² section of herbaceous leaf tissue will suffice for a single extraction. Careful laboratory notes of the material used, including provenance data, sample weight, and extraction date, are vital for checking the quality of sequencing results against the specifics of the extraction process in the lab and for pinpointing reasons for variation between samples. For some protocols, weighed tissue can be placed straight into a 1.5 ml tube labelled with a unique number or laboratory code and other information, ready for the DNA extraction process.
Plant material dried and stored in silica gel – including as specimens stored in tissue banks specifically for the purpose of DNA extraction – tends to be a good source of high-quality DNA. Silica gel (silicon dioxide xerogel) is a desiccant that removes moisture from the atmosphere, drying out the plant tissue. Indicator silica gel crystals change colour when the silica is saturated, signalling when the silica gel should be regenerated or replaced. These crystals can be used in a mixture with non-indicating silica gel.
The use of silica gel is a popular approach to dry fresh plant material for DNA extraction because it is low cost and convenient compared to liquid nitrogen or lyophilization, especially when preparing tissue in the field. To effectively preserve the DNA in plant tissue, the recommended minimum ratio between plant material and silica is 1:10 (
One approach is to freeze plant tissue until needed for DNA extraction, preferably at –80 °C, and otherwise in a standard laboratory freezer at –20 °C, if the sample is properly sealed. Alternatively, material can be flash frozen in liquid nitrogen. The resulting rapidly frozen material can yield high-quality DNA extractions, but liquid nitrogen is impractical for some settings due to handling considerations and cost (
High-quality DNA can be extracted from lyophilized (or freeze-dried) tissue, such as leaves and roots (
After the plant material has been prepared by drying and/or freezing using one of the above-mentioned techniques, a DNA extraction protocol can be implemented. Although there are a multitude of available protocols, the general methodology involves the following steps, discussed in more detail below:
Weighing of plant tissue
Mechanical disruption (grinding)
(Optional) pre-treatment
Extraction of nucleic acids from the cell
DNA isolation and precipitation
DNA purification
We place emphasis on the CTAB protocol due to its popularity, but also introduce other protocols that may be of interest to the reader.
The starting amount of plant tissue is important: too little will result in an unsatisfactory yield and too much may lead to poor grinding, saturation of the reaction and/or excessive debris which can also be detrimental to final yield. A useful starting ratio is a buffer quantity that is fivefold that of the weight of the leaf tissue (e.g., 0.2 g leaf tissue for 1 ml of buffer) (
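Expressed as arithmetic, this ratio gives the buffer volume directly. A minimal sketch in Python; the 5 ml per g default simply encodes the fivefold ratio quoted above:

```python
def buffer_volume_ml(tissue_mass_g: float, ml_per_g: float = 5.0) -> float:
    """Extraction-buffer volume (ml) for a given tissue mass (g), using the
    ~fivefold buffer:tissue starting ratio described in the text
    (e.g., 0.2 g leaf tissue -> 1 ml buffer)."""
    return tissue_mass_g * ml_per_g

print(buffer_volume_ml(0.2))  # 1.0 ml, matching the example above
```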
Plant tissue must be finely ground to a powder such that the cell walls are disrupted and the cell membranes are more accessible for the chemical reagents in subsequent steps to act successfully. It is advisable to scrape hairs or wax from the surface of the plant tissue before weighing and grinding. For herbarium specimens, special care should be taken that any glue that may be present is removed since this can interfere with the reagents used during the DNA extraction. Sterilised sand can also be used to increase the friction and enhance the disruption of the tissue; it will be separated later in the DNA extraction protocol. Fleshy tissue can be flash frozen in a mortar with a little liquid nitrogen before grinding. The dewar for transporting the liquid nitrogen should be clean and free of potential contaminants.
Manual grinding is inexpensive, yet time consuming and requires a sterilised mortar, pestle, and spatula for each sample. Use of a mechanical homogenizer, also called a tissue lyser, is more efficient. A steel ball bearing is added to each tube with a sample and shaken at high frequency within the instrument. This allows multiple samples to be disrupted simultaneously with minimal degradation of the nucleic acids. It also minimises loss of material and the chances of contamination, as each sample is processed in the same tube in which it remains for subsequent extraction steps. Metallic, ceramic, or silica beads of different sizes can be added to the sample tubes to increase the disruption of particularly tough or woody material. Metallic and ceramic beads must be removed before proceeding with the protocol, but silica beads can be separated later in the protocol.
This step can be included as an optimisation strategy for increased yield, quality, or purity of the extracted DNA. For example, when high amounts of polysaccharides and/or polyphenols in the plant material are a concern (as is the case for succulent plants and plants in high stress environments, respectively), the modified STE-CTAB protocol can be used (
In this stage, the goal is to release nucleic acids from the cell, whilst minimising the risk of nucleic acid degradation and commencing the segregation of unwanted cellular compounds from the DNA molecules.
The hallmark of the most widely adopted method for DNA extraction from plants, originally developed by Doyle and Doyle (
2% w/v CTAB: a cationic detergent which, during DNA extraction, binds to the lipids in cell membranes, enhancing cell lysis, thus releasing intact nucleic acids from the nucleus and organelles
1.4 M NaCl: a salt which increases the ionic strength of the solution, which simultaneously induces plasmolysis, promotes separation of proteins from DNA, and aids in polysaccharide precipitation
100 mM Tris-HCl: a buffer (at pH ~8.0) which maintains the pH of the solution and stabilises the DNA by impeding degradation
20 mM EDTA (ethylenediaminetetraacetic acid): which protects the DNA by inhibiting the enzymatic activity of DNase and RNase (i.e., by chelating divalent cations, such as Mg²⁺ and Ca²⁺, which are cofactors for these enzymes)
0.2% β-mercaptoethanol: which denatures polyphenols and tannins (abundant in plants), making it possible to separate them from the DNA in subsequent steps
CTAB buffer is added to each sample tube containing ground plant tissue and the mixture is incubated at 60–65 °C for 15–60 minutes. This can be done in an automatic shaking incubator. Alternatively, the sample tubes can be periodically shaken manually.
Alternatively, methods involving an SDS buffer can be applied (
The goal of this stage is the separation of DNA from other molecules in the lysate, by making use of the differing polarity of these molecules. This is followed by DNA precipitation from the solution.
In the CTAB protocol, the methodology is phase separation using organic solvent(s), from which hydrophilic molecules, including DNA, can be isolated. A 24:1 solution of chloroform-isoamyl alcohol (SEVAG buffer) is added to the incubated CTAB/leaf tissue mixture. This solution is hazardous and must be prepared and added to the sample tubes in a fume hood to avoid inhalation. It is also highly volatile, so work should proceed quickly to minimise evaporation. The mixture is then centrifuged at room temperature, which results in the DNA becoming concentrated in the clear upper (aqueous) phase. The supernatant is very carefully drawn off with a pipette, without disturbing or touching the organic phase (containing the chloroform with lipids, proteins, and other cellular debris), and transferred to a new tube. The supernatant is purified by adding RNase A and chilled isopropanol, the latter inducing precipitation of the DNA. Samples are then transferred to a freezer at –20 °C, either overnight or for several days if sample input is low and maximum precipitation is desirable (at the cost of potential co-precipitation of salts).
In the SDS protocol, proteins and polysaccharides precipitate with the SDS itself. Sodium acetate in turn is used to precipitate the DNA; in solution this compound dissociates and the sodium ions (Na⁺) neutralise the negative charges on the sugar-phosphate backbone of DNA molecules, thus making it less hydrophilic and amenable to precipitation (
As a final step to both methodologies, the samples are centrifuged to encourage the formation of a DNA pellet, optionally washed with 70% ethanol at least once and re-suspended, preferably in 10 mM Tris-EDTA buffer (which serves to protect the DNA from damage, as explained in the CTAB buffer recipe above).
The DNA isolation stage is not perfect. Since the extraction process involves steps that segregate compounds by binding properties and molecular weight, co-extraction of molecularly similar polysaccharides is common. Furthermore, the eluate can contain certain contaminants, including traces of chemicals added during the extraction process and precipitated salts, as well as endogenous proteins, tannins, polysaccharides, and other molecules. The presence of such compounds can negatively impact the downstream experimental use of the DNA (e.g., by acting as PCR inhibitors), and further purification of DNA using various clean-up steps may be necessary.
One strategy is a silica-column, centrifugation-based method in which a chaotropic agent (commonly guanidine hydrochloride) is added; this disrupts the hydrogen bonds between water molecules, creating a more hydrophobic environment. This increases the solubility of non-polar compounds (often contaminants), breaks up the hydration shell that forms around the negatively charged DNA phosphate backbone, and thereby promotes efficient adsorption of DNA to the column surface under high-salt and moderately acidic conditions (
An alternative involves the use of Solid Phase Reversible Immobilisation (SPRI) beads (
When a DNA extraction protocol does not yield satisfactory results, in terms of quality or quantity of extracted DNA, modifications can be applied. A valuable strategy for this is conducting a search of the scientific literature for protocols that have been used for similar experimental purposes or have targeted the same taxonomic groups.
If using the CTAB protocol, understanding the biochemical actions and interactions of its components is a useful starting point for identifying what might need adjustment to improve the outcome. CTAB acts according to the ionic strength of the solution; the concentration of NaCl must be at least 0.5 M so that CTAB does not bind to nucleic acids, but does bind to proteins and neutrally charged polysaccharides as desired. NaCl is most commonly used at a concentration of 1.4 M. When working with a plant group that has a high content of polysaccharides, experimenting with higher concentrations of NaCl may improve the purity of the final DNA. Sometimes, other reagents such as N-lauroylsarcosine (sarkosyl) buffer can be added, to enhance lysis (rupturing of the cell membrane) and to reduce the activity of DNase or RNase enzymes. Proteinase K can also be added to enhance the denaturation of proteins. The volume of 24:1 chloroform-isoamyl alcohol solution can also be adjusted. Phenol can be added as an additional non-polar, organic solvent that is highly effective in denaturing proteins and can aid in increasing the final DNA yield, as opposed to solely applying chloroform (
Tris-HCl and EDTA are present in nearly all protocols. β-mercaptoethanol is toxic and should thus be handled with care, always in a fume hood with an extractor fan. One may consider omitting this reagent for plant tissues low in phenolic compounds. However, it is important to note that phenolic compounds co-precipitate with DNA and thus can be problematic in downstream steps of DNA laboratory work. β-mercaptoethanol can be replaced with less toxic alternatives such as PVP (polyvinylpyrrolidone). PVP attaches to phenolic compounds via hydrogen bonding and can be removed together with them after centrifugation (
Most commercial kit-based protocols use a combination of buffers that perform similar functions to the components of the CTAB protocol, with a final step of elution through silica columns, which tends to yield relatively clean DNA extracts. An added benefit of column-based kits is the use of filter columns at an earlier stage for the separation of crude plant material. Silica-based columns bind DNA so that it can be washed multiple times with alcohol-containing solutions to remove contaminants before DNA elution. This speeds up DNA extraction significantly, reducing the total time from multiple days – as is common in regular protocols – to around 6 hours. Drawbacks of these approaches, however, include reduced yields of purified DNA in comparison to CTAB + chloroform extractions, as well as significantly higher (~3–4 fold greater) cost.
Commercial kits that use magnetic beads are also becoming increasingly popular. Magnetic bead extraction kits are highly versatile and provide high yields of DNA that are also highly pure, without the hazardous solvents chloroform and phenol. After plant tissue grinding and lysis with an appropriate buffer, DNA is bound to the surface of the magnetic particles. The magnetic particle-DNA system is then washed several times with alcohol-containing solutions before a final elution step with a low-salt buffer or nuclease-free water. In contrast to the column-based extraction method, binding of DNA to the magnetic particles occurs in solution, thus enhancing the efficiency and kinetics of binding and simultaneously increasing the contact of the bead-DNA complexes with the wash buffer, which improves the purity of the DNA. Magnetic particle kits have also been applied in combination with steps from the CTAB extraction method to extract high quality DNA from sorghum leaves and seeds, cotton leaves, and pine needles (
Finally, a less common commercial method involves the use of Whatman FTA® PlantSaver cards and custom reagents. This method is very practical in terms of collection of samples in the field and their transportation. Furthermore, immediate mechanical disruption of the plant tissue can eliminate the need for obtaining permits. While this method has been predominantly applied to agricultural plant taxa, its performance in 15 phylogenetically diverse non-agricultural taxa has been demonstrated, where DNA from these samples was found to be less fragmented than that from replicate samples extracted in parallel with the CTAB method (
Assessment of the properties of each genomic DNA (gDNA) sample post-extraction – its integrity, quantity, and purity – is imperative for making decisions regarding downstream molecular work. The methods described below have some overlapping uses in terms of assessing these different properties, but we highlight which is most appropriate for each DNA quality-related aspect.
Agarose gel electrophoresis is an appropriate method for estimating DNA integrity, as well as for crudely estimating DNA concentration. This method requires a horizontal gel electrophoresis tank with an external power supply, agarose, a running buffer such as Tris-acetate-EDTA (TAE) or sodium borate (SB), a fluorescent intercalating DNA dye, a loading dye, and a DNA standard (‘ladder’). The intercalating dye is added to the buffer (or sometimes to the loading dye) and serves to visualise the DNA in the agarose gel at the end point of electrophoresis. Historically, ethidium bromide was the standard intercalating agent, but it has now mostly been superseded by safer dyes that are less carcinogenic and do not require complex disposal procedures. Nonetheless, it is recommended that any compound that intercalates DNA be handled with care. The DNA standard is referred to as a ladder, since it is a set of DNA fragments of known sizes and concentrations which provide benchmarks of size and concentration for comparison.
Each DNA sample and the DNA standard (ladder) are combined with loading dye and then pipetted into a well of the agarose gel, to then be subjected to an electric field. Due to the negatively charged phosphate backbone, DNA molecules will migrate towards the positively charged anode. The DNA migration rate depends on the fragment size, where smaller DNA fragments migrate faster, leading to a size-associated separation of DNA molecules. Additionally, the percentage of agarose in the gel will determine the size range of DNA that will be resolved with the greatest clarity. A range of 0.5% to 3% encompasses most applications, where < 1% is best for examining the genomic DNA of plants and 3% would be suitable for examining fragments with small (e.g., ~20 bp) differences in length. Once the fragments have migrated sufficiently to ensure resolution of the DNA and ladder, the gel is transferred to a cabinet with a UV light and the DNA fragments are visualised due to the excitation of the intercalating dye when UV is applied. The approximate yield and concentration of genomic DNA in a gel are indicated by comparison of the sample’s intensity of fluorescence to that of a standard.
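The guidance on agarose concentration can be captured as a simple rule of thumb. In the sketch below, only the two endpoints (< 1% for plant genomic DNA, 3% for resolving ~20 bp differences) come from the text; the intermediate breakpoints are illustrative assumptions, not values from a published protocol:

```python
def agarose_percent(fragment_bp: int) -> float:
    """Rough agarose concentration (% w/v) for resolving fragments of a
    given size, within the 0.5-3% range discussed in the text."""
    if fragment_bp > 10_000:   # intact genomic DNA: large fragments
        return 0.8             # < 1% resolves large molecules best
    if fragment_bp > 1_000:    # assumed breakpoint
        return 1.0
    if fragment_bp > 300:      # assumed breakpoint
        return 1.5
    return 3.0                 # small (~20 bp) differences need dense gels

print(agarose_percent(50_000))  # 0.8, suitable for plant gDNA
```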
Where a more precise estimation of the size of the DNA fragments is required, automated capillary electrophoresis can be used. Such systems (e.g., the Agilent Bioanalyzer or TapeStation) are more expensive to use, but – aside from precision – offer faster preparation and analysis times.
Fluorescent measurements are considered the most accurate quantification method for measuring DNA concentration. These involve the addition of fluorescent dyes (in an accompanying buffer), which selectively intercalate into the DNA. Fluorescence measurements use excitation and emission values that vary depending on the dye used. The concentration of unknown samples is calculated by the fluorometer (e.g., Quantus™ or Qubit™) based on a comparison to a standard measurement from DNA of a known concentration (usually lambda bacteriophage DNA). Since the dyes are sensitive to light and degrade rapidly in its presence, sample tubes must be stored in the dark if readings are not taken promptly after their preparation in the buffer.
A rough estimate of DNA yield and a more useful estimate of DNA purity can be measured via absorbance with a spectrophotometer that emits UV light through a UV-transparent cuvette containing the sample. Absorbance readings are conducted at 260 nm (A260), the wavelength of maximum absorption for DNA. The A260 measurement is then adjusted for turbidity (measured by absorbance at 320 nm), multiplied by the dilution factor, and calibrated using the following conversion factor: A260 of 1.0 = 50 µg/ml pure dsDNA. This useful relationship between light absorption and DNA concentration can be defined according to the Beer-Lambert law. Total yield is obtained by multiplying the DNA concentration by the final total purified sample volume. However, it is key to note that RNA also has maximum absorbance at 260 nm and aromatic amino acids have a maximum absorbance at 280 nm. Both molecules can contribute to the total measured absorbance at 260 nm and thus provide a misleading overestimate of DNA yield.
DNA purity is evaluated by measuring absorbance in the 230–320 nm range. Since proteins are the contaminant of primary concern, absorbance at 260 nm divided by absorbance at 280 nm is the standard metric. DNA can be considered of high quality and suitable for most genomic applications when it has an A260/A280 ratio of 1.7–2.0. As a further step, the A260/A230 ratio can help evaluate the level of salt carryover in the purified DNA, where an A260/A230 ratio > 1.5 is considered to indicate good quality. Strong absorbance at around 230 nm, which would lower this ratio, suggests the presence of organic compounds or chaotropic salts.
Instruments such as the NanoDrop® 2000 spectrophotometer are well suited to evaluating the A260/A280 and A260/A230 ratios. Spectrophotometry is not as accurate as fluorescence quantitation for measuring concentration, but it is most suitable where information on DNA purity is sought and is also time efficient (the sample is loaded directly into the machine and requires no preparation of buffers).
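The spectrophotometric calculations above are straightforward to script. This sketch applies the turbidity correction (A320), the dilution factor, and the 50 µg/ml conversion factor, together with the two purity thresholds quoted in the text; subtracting A320 from the A280 and A230 readings as well is a common convention but an assumption here, and the example readings are invented:

```python
def dsdna_conc_ug_per_ml(a260: float, a320: float = 0.0, dilution: float = 1.0) -> float:
    """Beer-Lambert-based estimate: turbidity-corrected A260, times the
    dilution factor, times 50 ug/ml per 1.0 A260 for pure dsDNA."""
    return (a260 - a320) * dilution * 50.0

def purity_ratios(a260: float, a280: float, a230: float, a320: float = 0.0) -> dict:
    """A260/A280 and A260/A230 ratios with pass flags (1.7-2.0 and > 1.5)."""
    r280 = (a260 - a320) / (a280 - a320)
    r230 = (a260 - a320) / (a230 - a320)
    return {"A260/A280": round(r280, 2), "protein_ok": 1.7 <= r280 <= 2.0,
            "A260/A230": round(r230, 2), "salt_ok": r230 > 1.5}

# Invented example: a 1:10 dilution reading A260 = 0.11, A280 = 0.06, A230 = 0.05
print(dsdna_conc_ug_per_ml(0.11, dilution=10))  # 55.0 ug/ml
print(purity_ratios(0.11, 0.06, 0.05))          # both flags pass
# Total yield = concentration x final purified sample volume
```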
Particularly challenging types of plant tissue, as well as degraded plant material, can still yield high-quality DNA if suitably optimised protocols are followed.
For instance, seeds can be a good source of DNA if specialised protocols are used (
Advances in the sensitivity of genomic sequencing and optimised DNA extraction methods make it possible to study herbarium and other dried botanical specimens (
Physical and chemical degradation is to be expected in herbarium and museum specimens; DNA in dead tissue breaks down over time. The rate of physical fragmentation is related to temperature and other environmental variables, as well as the composition of the plant tissue itself. In a study of herbarium specimens, fragment length was shown to decline significantly with specimen age across 300 years of collections (
The CTAB extraction protocol is generally preferable for extracting fragmented DNA, as it typically gives higher yields than kit-based methods. Where fragment sizes are predicted to be very short, a high-volume chaotropic salt used as a binding buffer in the latter stage of extraction can improve the recovery of DNA molecules (
A wide variety of DNA extraction protocols are available in the literature. The structural, biochemical, and genomic characteristics of plants present a particular set of challenges; isolating high purity, undamaged DNA from plant tissue is non-trivial and requires a careful and patient approach in the laboratory. Therefore, researchers must often optimise a chosen protocol for their specific experiment. Success in the primary step of a molecular workflow is crucial, unlocking the downstream steps of plant molecular identification and characterisation, and hence possibilities for addressing many exciting questions in molecular and evolutionary biology.
For each of the DNA-containing compartments in a plant cell, which of its characteristics deserve most consideration during DNA extraction and analysis, and why?
Describe the main compound classes from plant extracts that need to be removed from DNA extracts for downstream analysis. How can they be removed?
Describe the main difference between DNA extraction using the CTAB protocol and using a column-based extraction kit. What are the advantages and disadvantages of both?
The nuclear genome of plants is hugely variable in size. To maximise retrieval of intact DNA for species with larger genomes, a higher DNA yield should be aimed for. This could affect decisions regarding input material and the number of total DNA extractions carried out per sample. The plastid genome is present in high copy numbers in plant cells, as well as being a useful unit for addressing a variety of biological questions. Therefore, it is ideal for genome skimming experiments and a valuable target in degraded material, where the (single copy) nuclear genome might be highly fragmented. The mitochondrial genome of plants is characterised by high plasticity in its genomic structure and therefore is not recommended for plant identification.
Problematic biomolecules in plant extracts include polyphenols, tannins, and polysaccharides. These interfere with DNA extraction buffers (such as CTAB) as well as with other buffers and enzymes used in downstream DNA analysis. They are removed from the solution either by SEVAG cleaning (in the CTAB protocol) or by column-based or magnetic-particle clean-up (commercial kits). Polysaccharides can also be removed from the crude plant tissue prior to extraction using STE buffer. Phenolic compounds can often be removed using β-mercaptoethanol and/or PVP. Further impurities, such as secondary metabolites that may interfere with enzymes in downstream protocols, can often be removed using a SPRI bead clean-up protocol.
The CTAB protocol uses specific buffers (such as SEVAG) and DNA precipitation (involving isopropanol) to separate non-DNA and DNA biomolecules, whereas extraction kits rely on DNA-binding columns or magnetic particles. Although the kits are much more expensive on a per-sample basis, they generally yield clean DNA with a short turnaround time (up to 6 hours). CTAB extractions are very cheap and highly scalable as they do not rely on specifically manufactured columns or magnetic particles. However, the protocol takes at least two full days to progress from plant tissue to DNA extract, and co-precipitation of non-DNA biomolecules is often observed, affecting the purity of the final extract. Sometimes, substantial yield losses are observed using extraction kits, which can be a key consideration when dealing with precious samples.
Nataly Allasi Canales1,2, Andrew C. Clarke3, Mark Nesbitt2, Rafal Gutaker2
1 Natural History Museum of Denmark, University of Copenhagen, Denmark
2 Royal Botanic Gardens, Kew, United Kingdom
3 Future Food Beacon of Excellence & School of Biosciences, University of Nottingham, United Kingdom
Museum collections of plant origin include herbaria (pressed plants), xylaria (woods), and economic botany (useful plant) specimens. They are not only places of history and display, but also of research, and contain rich repositories of molecules, including DNA. Such DNA, retrieved from historical or ancient tissue, carries unique degradation characteristics and regardless of its age is known as ancient DNA (aDNA). Research into aDNA has developed rapidly in the last decade as a result of an improved understanding of its biochemical properties, the development of specific laboratory protocols for its isolation, and better bioinformatic tools. Why are museum collections useful sources of aDNA? We identify three main reasons: 1) specimens can play a key role in taxonomic and macroevolutionary inference when it is difficult to sample living material, for example, by giving us snapshots of extinct taxa (Van de Paer et al. 2016); 2) accurate identification of specimens that were objects of debate or scientific mystery, as exemplified by misidentified type specimens of the watermelon’s progenitor (
However, extracting DNA does mean the destruction of a part of the specimen. Museum curators therefore face challenges in balancing the conservation of specimens for future research with the rising demand for aDNA analysis. Increasingly, curators are also considering legal and ethical issues in sampling (
With few exceptions, plant material found in museums originally grew on lands tended or owned by people for many millennia (
A first consideration is whether the plant species or artefacts (such as baskets or wooden objects) are of special significance (e.g., sacred) to the source community. Examples of sacred material include
There are international conventions that usually apply when accessing, researching, and moving plant material between institutions and countries. Researchers must also be aware of country-specific laws that may require further permits and inspections, e.g., for plants that produce controlled substances, require phytosanitary checks, or are considered invasive species. Legal elements of the Convention on Biological Diversity (CBD), Nagoya Protocol, and Convention on Trade in Endangered Species (CITES) are covered in Chapter 27 Legislation and policy as well as in other published works (e.g.
Botanical gardens hold living specimens and distribute seeds of these via seed lists (Index Seminum). Their global collections can be searched via
There are a number of pitfalls when searching online catalogues. It may be necessary to search for accepted names and common synonyms: the same species may appear under different botanical names in a single collection, and the accuracy of specimen identification varies. In general, herbarium specimens are the most reliable, as they bear diagnostic features such as flowers on which taxonomists rely. Garden material and seeds are often misidentified, become confused in labelling, or are hybridised during repeated cultivation. Their identifications should be confirmed, for example by growing on the seeds or by using morphological criteria (
Research projects will benefit enormously from a close collaboration between researcher and curator. Museums should be approached early during a project, with the researcher providing sufficient detail about its background, aims, methodology, and timetable. Museums are often under-staffed and persistence may be required in making contact. Curators’ expertise will be crucial in identifying the most appropriate specimens for analysis, not only in their institutions, but in others with which they are familiar. The curator will also play a key role in assessing the provenance of specimens, using museum archives, and the implications for any of the ethical and legal issues addressed above. Curators often have good links to source communities and can advise on appropriate procedures.
After preliminary discussions, the researcher will usually need to fill in a ‘destructive sampling’ form. This acts as a permanent record of the justification for sampling, and allows the museum to make a detailed check on the aims and methodology of the project (see for example,
Sampling may be carried out by the researcher or the curator. If feasible, it is worthwhile for the researcher to carry out the sampling, as it allows for the investigation of the context of the specimen and for flexibility in choosing the samples. It may also speed up the process of obtaining samples, especially if a large number is required. It also allows samples to be safely hand-carried to the researcher’s laboratory. Where materials must be sent, it is safest to use a courier service, with specimens marked “Scientific specimens of no commercial value”.
It should be agreed with the museum whether, after sampling, surplus material should be returned or securely retained. Museums can require that they are informed about results and that they check manuscripts before publication. This is in any case good practice to ensure accurate reporting of sample details. Museum policies on co-authorship vary, and this topic should be discussed early. Significant contribution by the curator on the choice of appropriate samples, provenance research, or in technically complex sampling, merits co-authorship. Unless agreed otherwise, DNA sequencing data should be submitted to NCBI GenBank or other public repositories, taking care to give the correct specimen identifier. At a minimum, the museum’s unique catalogue number (if one exists), and the name of the museum should be cited. This allows the DNA sequence data to be linked directly with the specimen or object. Other museum and laboratory information may be included with the DNA sequence data or in publications (e.g., the collector name, collection number, dates, locations, and laboratory extraction numbers). Additionally, most museum collections will require that vouchers are annotated in a way that links them to DNA sequencing data (see below). Some museums have also started to permanently store DNA isolates, and we encourage researchers to share their stocks on request. Integrated data management and accessibility of the raw data and results will ultimately bolster curatorial practices, develop a more ethical science, and safeguard collections for future generations (
Sampling decisions will be determined both by the research design and the nature of the specimens, in addition to the legal and ethical factors mentioned above. Changes to agreed sampling lists are often necessary once specimens have been examined, for example when they are lost, in poor condition, inadequately annotated or georeferenced, present in small quantities, or of rare taxa. Bulk raw material is usually easy to sample, while objects are usually not subjected to destructive sampling unless the results will inform the history and significance of the object. For herbarium specimens, preserving the morphological features, especially those that are diagnostic, for future research, is critical. Sampling should be targeted towards tissue types or organs at a given developmental state that are most numerous. For example, if there are many flowers and few leaves, it may be preferable to sample a petal. Or if there are few cauline and many rosette leaves, it may be preferable to sample a rosette leaf.
Different parts of a specimen may yield varying amounts, quality, and types of DNA. Wood, husks, and other tissues that were undergoing senescence at the time of preservation may yield less DNA. Young, immature leaves will have higher cell densities, and therefore are expected to yield more DNA. Seeds are often excellent sources of nuclear DNA, although the genotype of the seed will differ from the parent plant and might be of inconsistent ploidy. It may be necessary to extract DNA from individual seeds or to remove maternal tissue such as the testa. Some herbarium sheets will contain multiple individuals and, in most cases, it is better to sample individuals rather than mixed material. If individuals are pooled for DNA extraction, it may complicate downstream analyses that depend on individual genotypes.
The method of specimen preservation is another consideration for DNA isolation. Desiccation has been shown to preserve plant DNA remarkably well, while charring or ethanol preservation destroys plant DNA almost completely (
Before sampling begins, the specimen’s identifying data, such as its herbarium ID, should be recorded with great care, and double-checked on both the sample label and typed list of specimens. Additionally, the museum may require that vouchers are annotated with the sampling date, tissue type, sample identifier, and information about the researchers. The voucher, including any labels, should be photographed, ideally before and after sampling. Digital links between herbarium vouchers, imaging, and DNA sequences are very useful; they can be included in herbarium and nucleotide databases.
For desiccated leaves, the most commonly sampled tissue, the process is usually straightforward. Using forceps and a scalpel or scissors, one can make a precise cut and remove 1 cm² or less of tissue. Generally, between 2 and 10 mg of dry leaf tissue is sufficient for the isolation of complex mixtures of genomic DNA fragments. It is preferable that leaves of lesser value are targeted, for example damaged, folded, or hidden, avoiding possible contamination by mould, lichen, or
Potential contamination of the sample, specimen, or wider collection with exogenous DNA is an important consideration. For most museum collections, there will inevitably already be surface DNA contamination of specimens. Ask the curator about adhesives (e.g., wheat starch) and preservatives that were used with the specimen of interest. Curatorial staff and other users of the collections may not routinely wear gloves or, if they do, may not change them between specimens. In most cases, there is unlikely to be any benefit from the person undertaking sampling wearing protective equipment (e.g., face masks, hair nets) that is beyond that normally used by users of the collection. Contamination control is only as good as the weakest link.
Extra precautions may be taken for equipment that is used directly in the sampling process, for example, disposable scalpels that are changed between samples, or wiping of scalpel blades with bleach and ethanol. This will reduce the risk of cross-contamination between specimens. Further precautions may be beneficial if internal tissue is being sampled (e.g., inside a seed). In these cases, surface decontamination (see section below on pre-processing), followed by sampling with DNA-free equipment while wearing personal protective equipment, may be appropriate. In some cases where specialised equipment such as a microdrill is required, it may be beneficial for sampling to be undertaken within an ancient DNA laboratory, where contamination controls can be better implemented. However, bringing large amounts of plant material into such a laboratory should be limited, as it is an additional contamination source.
Contamination of specimens and collections by ‘modern’ DNA and especially amplified DNA is perhaps the greatest risk, potentially compromising future research. Researchers are likely to have been using molecular laboratories, and steps should be taken to prevent the inadvertent transfer of modern DNA to museum collections. These precautions can include not visiting a collection directly from a modern laboratory, cleaning items that must move between modern laboratories and collections (e.g., clothes, phones, cameras), and using sampling equipment (scalpels, tubes, pens) that has not been taken from a modern laboratory.
Before starting any experiments with historical and ancient plant samples, it is important to recognize challenges arising from the degraded nature of aDNA. Unlike DNA isolated from fresh samples, DNA from preserved specimens is fragmented, damaged, and contaminated post mortem (
aDNA is also affected by “damage”: post mortem substitutions that convert cytosine to uracil residues through deamination (uracils are read by insensitive DNA polymerases as thymine, hence the commonly used term “C-to-T substitutions”) (
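As a toy illustration of how the 5′ C-to-T damage rate is quantified, the sketch below counts reference-C positions that are read as T at a given distance from the 5′ end. It assumes reads are supplied as plain, equal-length (read, reference) string pairs already aligned 5′ to 3′; real pipelines (e.g., mapDamage) compute this per position from BAM alignments:

```python
def ct_rate_at(aligned_pairs, position: int = 0) -> float:
    """Fraction of reference 'C' sites read as 'T' at `position` from the
    5' end; an estimate of deamination-driven damage at that position."""
    c_sites = ct_hits = 0
    for read, ref in aligned_pairs:
        if position >= len(ref):
            continue
        if ref[position].upper() == "C":
            c_sites += 1
            if read[position].upper() == "T":
                ct_hits += 1
    return ct_hits / c_sites if c_sites else float("nan")

# Toy data: one of two reads starting on a reference C shows C->T
pairs = [("TACG", "CACG"), ("CACG", "CACG"), ("GGGA", "GGGA")]
print(ct_rate_at(pairs))  # 0.5
```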
Finally, it is important to recognize that aDNA from plants is in fact a mixture of bona fide endogenous DNA, exogenous DNA introduced pre mortem, (e.g., from endophytic microbes), and exogenous DNA introduced post mortem (e.g., from microbes involved in decomposition, human-associated collection and museum practices; see above) (
Selected examples of successfully isolated and sequenced DNA from plant material. *BP: before present.
Species | Tissue | Age (BP*) | Endogenous DNA | Fragment length (bp) | Damage at 5′ end
---|---|---|---|---|---
Thale cress | Leaf | 184 | 83% | ~62 | 0.026
Potato | Leaf | 361 | 87% | ~45 | 0.047
Maize | Cobs | 1863 | 80% | ~52 | 0.052
Wheat | Chaff | 3150 | 40% | ~53 | 0.095
Barley | Seeds | 4988 | 86% | ~49 | 0.138
Given the characteristics of aDNA (
The isolation and pre-amplification manipulation of aDNA should be carried out in a dedicated laboratory that is physically separated from labs where post-amplification steps are carried out. Ideally the aDNA laboratory should be supplied with HEPA-filtered air under positive pressure. Users should not move from a ‘modern’ laboratory (where amplified DNA is handled) to the aDNA laboratory on the same day. Reagents and materials in an aDNA lab should be DNA-free, disposable where possible, and never taken out of the clean lab. Surfaces should be cleaned before and after every experiment with 3–10% bleach, 70% ethanol, and overnight UV-C irradiation. To minimise contamination and ensure a DNA-free laboratory environment, users should wear full body suits, foot protectors, slippers, facemasks, sleeves, and double gloves (
Material preparation is an essential step before DNA can be isolated. Optional pre-processing of dirty samples can be done by gently cleaning the surface with a very low concentration (~3%) of bleach, and rinsing twice with ddH₂O (
DNA extraction methods for different tissues should be considered. While plant materials tend to contain inhibitory substances like polyphenols, proteins, and polysaccharides, ancient plant materials can additionally be rich in humic acids and salts. This set of macromolecules might prevent successful DNA amplification (
Here we will cover the basics of recovering the highest quality of DNA from ancient plant tissues. Using a two-day extraction protocol will greatly increase the recovery of endogenous DNA. The first day consists of grinding the plant material. Tissue can be disrupted by: grinding dry, grinding flash-frozen, or grinding material soaked in lysis buffer. In all cases, grinding to finer particles increases the recovery of aDNA. Ground tissue is incubated in a fresh lysis buffer. Three commonly used buffers include CTAB (
In contrast to primed amplification approaches, even low amounts of isolated DNA can be used for genomic library preparation (
To help decide which sampled material is most promising for further DNA analyses, it is necessary to obtain good estimates of fragmentation, damage, and contamination. This can be achieved by sequencing genomic libraries in low-throughput mode (about 10,000 DNA reads per sample), commonly referred to as “screening”, followed by bioinformatic analyses that produce the relevant summary statistics. Promising samples will contain aDNA with a median fragment length over 50 bp and endogenous content over 0.2. For samples of particular interest, mapping accuracy for short aDNA reads can be improved with specialised procedures (de Filippo et al. 2018), and endogenous content can be increased by targeted enrichment on hybridization arrays (
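Both screening statistics are simple to compute once reads have been mapped to a reference. A minimal sketch with invented numbers, applying the thresholds from the text (median fragment length > 50 bp, endogenous content > 0.2):

```python
import statistics

def screening_summary(mapped_lengths, n_mapped: int, n_total: int) -> dict:
    """Summarise a low-throughput screening run: median mapped-fragment
    length and endogenous content (mapped reads / total reads)."""
    median_len = statistics.median(mapped_lengths)
    endogenous = n_mapped / n_total
    return {"median_fragment_bp": median_len,
            "endogenous_content": round(endogenous, 3),
            "promising": median_len > 50 and endogenous > 0.2}

# Invented example: 10,000 screened reads, 3,100 of which map
lengths = [38, 44, 47, 49, 51, 53, 55, 58, 62, 66]
print(screening_summary(lengths, n_mapped=3_100, n_total=10_000))
# {'median_fragment_bp': 52.0, 'endogenous_content': 0.31, 'promising': True}
```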
Characterising DNA fragmentation and damage is very useful for authentication and establishing historical provenance of degraded plant samples. DNA degradation advances with time (
Library-based methods assist with the responsible use of collections, as they preserve the total (non-selective) DNA and ‘immortalise’ it for future use. Immortalisation only has value if the DNA that has been amplified is truly historical/ancient and devoid of contemporary contamination and hence all the aforementioned precautions are necessary when working with aDNA. We recommend that extracts or library builds are precisely annotated with the methods used and are properly archived.
Name three legal considerations, and their related ethical issues, that should be taken into account for aDNA research using museum material.
Why is it important to process herbarium samples in a dedicated clean lab?
Name three benefits of getting curators involved in the early stages of research using collections.
Legal: CITES (restrictions on international trade in endangered species), the Nagoya Protocol (access and benefit-sharing; ownership and other significance to indigenous peoples), and national drug legislation (controlled substances).
The decay of DNA from historical plant material makes it very susceptible to contamination with exogenous modern DNA.
Curators can contribute (1) high-quality metadata such as collection dates and provenance, (2) knowledge of collections in-house and elsewhere, (3) knowledge of source communities and ethical and legal issues, (4) advice on choice of specimens most suitable for sampling.
Ozan Çiftçi1, Sevgi Kaynar2, Physilla Chua3
1 Institute of Environmental Sciences, Leiden University, The Netherlands
2 Institute for Biochemistry & Biology, University of Potsdam, Germany
3 Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Denmark
The first studies conducted on DNA obtained from water samples were published in the 1990s. Cloning techniques were commonly used to investigate novel genes and functions of environmental communities at that time. Stein et al. (
Conventionally, biomonitoring of freshwater and marine environments is based on direct observation of indicator taxa to compute biotic metrics/indices. This can be time and labour intensive (
The main advantage of water is the ease of sample collection compared to other aquatic sample types such as sediments or biofilms, as these substrates usually require more sophisticated tools and longer sampling times (
Detecting DNA in water samples obtained from aquatic environments can be challenging because it is usually present at low concentrations with an uneven spatial distribution (
Natural processes influencing the composition and quantity of detectable DNA in a water sample can be categorised into 1) shedding of biological material from source organisms, 2) degradation, 3) transport across the water column, and 4) retention and resuspension (
Senescence in aquatic plants releases free cells into the water column that will eventually break down into organic compounds, including DNA. However, degradation in many cells begins via apoptosis before shedding. Apoptosis involves the shrinkage of the cell and its nucleus in a programmed way, in contrast to necrosis, which is uncontrolled cell death due to loss of osmotic control typically by swelling and bursting (
DNA is a highly stable molecule at neutral pH and moderate temperatures. However, there are several abiotic factors that directly and indirectly influence its stability in aquatic environments (
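In quantitative terms, eDNA persistence is commonly approximated as first-order exponential decay, C(t) = C0 · e^(-kt), where the rate constant k lumps together the abiotic and biotic factors discussed here and must be estimated for each system. A minimal sketch with an invented rate constant:

```python
import math

def edna_remaining(c0: float, k_per_day: float, days: float) -> float:
    """First-order decay C(t) = C0 * exp(-k * t); k aggregates temperature,
    UV exposure, pH, microbial activity, etc. (system-specific)."""
    return c0 * math.exp(-k_per_day * days)

k = 0.7  # hypothetical decay rate (1/day)
print(math.log(2) / k)                # half-life: ~1 day
print(edna_remaining(100.0, k, 3.0))  # ~12 copies/ml left after 3 days (toy units)
```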
Hydrological characteristics of the water body are also critical to consider when inferring species presence and distribution. DNA can bind to particles of varying size in aquatic environments (less than 0.2 µm to greater than 180 µm) and this particle association is one of many parameters that affect DNA transport and diffusion (
Considering the higher dilution and the effects of currents and waves in marine waters, DNA is generally less concentrated and more quickly dispersed compared to freshwater ecosystems (
In rivers and streaming waters, the probability of DNA detection is strongly correlated with downstream transportation rates. Retention, rather than degradation, appears to be a more important factor that limits the transport of DNA in streaming waters (
Conventional sampling techniques often require a lot of time and effort for detecting indicator, rare, or invasive species. Keeping the target organism alive or intact might also be an important consideration in such cases. Detection of species via nucleic acids collected from environmental samples (eDNA/eRNA) is a relatively new approach that emerged in the last five years (Anglès d’Auriac et al. 2019). These methods offer a non-destructive and efficient complementary approach for the detection of aquatic organisms. They rely on reference sequences, and the amount of available data varies among taxonomic groups and countries (Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding). For example, aquatic vascular plants used in biomonitoring are well represented in public databases (BOLD, GenBank), while this is hard to achieve for diatoms owing to the large proportion of undescribed species and the difficulty of maintaining monoclonal cultures (
Although DNA from plant communities has been detected in environmental samples as part of larger surveys (e.g., within coral reefs), biodiversity studies targeting a large number of plant species are still rare, possibly owing to issues with universal amplification and the discriminatory power of single- or multiple-gene surveys in plants (
An important application for DNA-based methods is the quantification of species abundance and biomass since there are several environmental applications that rely on this information. Depending on the specific aim of the study, this information can be obtained at varying degrees of efficiency and reliability. Approaches employing species-specific methods are more suitable for abundance or biomass estimations (e.g., qPCR, ddPCR). However, they require a priori knowledge of the target group and are limited to already described species. On the other hand, high-throughput approaches can identify species that are rare or have low biomass (e.g., metabarcoding, metagenomics), but they suffer from biases introduced by downstream steps such as PCR amplification, sequencing (Chapter 9 Sequencing platforms and data types), availability of reference sequences, and even the bioinformatics analyses (Chapter 18 Sequence to species) (
Although molecular methods for species detection have been used as a tool for biodiversity management for more than a decade, only 2% of the available studies have focused on plants (
Recent studies that detect plant species in aquatic ecosystems via eDNA are mainly about methodological adjustments (
There are three main steps in a field study for the collection of aqueous eDNA: water collection, transportation, and filtering. In designing sampling strategies for species identification from water samples, there are many factors to consider. These include, but are not limited to, the field conditions, the distance between sampling point and laboratory, the amount of water that is required, and the morphology and life cycle of the target organism (
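One recurring design question is how many replicate samples to take. If each replicate detects the target DNA with some probability p, the chance of at least one detection across n independent replicates is 1 − (1 − p)^n. The sketch below turns this into a simple replicate calculator; the per-replicate detection probability is an assumed input that would, in practice, come from a pilot study.

```python
import math

def detection_prob(p: float, n: int) -> float:
    """Probability of >= 1 detection in n independent replicates,
    each with per-replicate detection probability p."""
    return 1.0 - (1.0 - p) ** n

def replicates_needed(p: float, target: float = 0.95) -> int:
    """Smallest n such that detection_prob(p, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Assume a 30% chance of detecting the target DNA per 1 L replicate:
p = 0.30  # assumed per-replicate detection probability
print(replicates_needed(p))   # -> 9 replicates for 95% overall detection
print(detection_prob(p, 9))   # ~0.96
```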
After the selection of the sampling location, the next step is to decide on the transportation strategy. Water samples can either be transported directly to the laboratory or filtered in the field. If direct transportation is chosen, the samples are usually collected in sterilised glass or plastic bottles or disposable plastic tubes. The DNA in the water samples can then be captured by filtration or ethanol precipitation in the laboratory. This approach reduces the effort and time spent in the field, and it also allows researchers to perform additional analyses on the water samples or to store subsamples for further processing (
Precipitation using ethanol or isopropanol can be used for capturing DNA after water collection, but filtration is the more widely used method (
Pore sizes of filters used in eDNA studies range from 0.22 µm to 60 µm (
The type of filter is one of the most important decisions to be made when designing the sampling strategy. Filters can be classified as open or encapsulated/cartridge filters (
Contamination of samples and the degradation of DNA are two critical processes that should be avoided as much as possible from water collection in the field to DNA isolation in the lab (
Choosing the correct DNA extraction protocol can be crucial in ensuring that the effect of PCR inhibitors in water samples will be minimised. The chemical and physical characteristics of samples can vary considerably, and therefore the quantity and purity of isolated DNA also vary (
DNA isolated from water samples can be used for several downstream applications, depending on the specific aim of the study or survey. Currently, qPCR is the most commonly used method for detecting specific target taxa in water samples, while metabarcoding is used for community analyses (Chapter 11 Metabarcoding). Studies comparing the efficiency of these DNA-based methods with more conventional methods show varying results. For some species or taxa, DNA-based detection methods appear to outperform more conventional methods (
Are water samples collected at a single point representative of the habitat diversity as a whole? Motivate your answer.
Describe three biotic and three abiotic factors that can affect DNA detection rates in aquatic environments. Explain in a few sentences how these factors can result in the detection of false positives and false negatives in streaming waters.
List five factors that should be taken into account while designing a sampling strategy for detection of DNA from water samples.
No. The probability of species detection depends on the presence and concentration of DNA collected in a water sample. Therefore, sampling at multiple sites with replicates is highly encouraged to obtain a broader overview of the local species diversity.
The life history, age, and size of an organism are some of the biotic factors that can affect DNA detection rates. After release from the cell, abiotic factors such as UV, temperature, and pH influence these rates. In streaming waters, false positives might be detected in downstream regions because DNA is transported by the current. Conversely, if the DNA of the target organism degrades too quickly, it cannot be detected, resulting in false negatives.
The scientific question, environmental conditions (physical and chemical), distance between sampling point and laboratory, and morphology and life cycle of the target organism should be considered when designing sampling strategies.
Maria Ariza Salazar1, Sandra Garcés-Pastor2, Hugo de Boer1
1 Natural History Museum, University of Oslo, Norway
2 The Arctic University Museum of Norway, UiT - The Arctic University of Norway, Tromsø, Norway
Maria Ariza Salazar
Sandra Garcés-Pastor
Hugo de Boer
The natural presence of any plant entails the existence of a substrate where it can anchor itself and absorb nutrients for its development and survival (
Since the first isolation of DNA from soil bacteria, soil eDNA has gained attention for the assessment of terrestrial environments for several reasons: soil is virtually everywhere; it is easy to collect and transport; it harbors signals from above- and below-ground biota, including both active and dormant cells; and its collection is non-invasive (
Further, studies may also refer to bulk soil DNA when using soil samples to identify unknown communities, especially in forensic contexts (
Molecular (plant) identification using soil or sediment eDNA relies on isolating DNA traces from roots, debris, seeds, and pollen (
The state of DNA in the soil is subject to intrinsic and extrinsic DNA properties related to the origins of the DNA as well as factors influencing its decay (
iDNA persists because the cell wall and membranes protect it against abiotic processes. Cells are more likely to remain intact in the ground where enzymatic activity is reduced by rapid soil desiccation, low temperatures, or extreme pH values (
Plant eDNA bound to soil particles can originate from multiple taxa and multiple vegetative parts, each with its own mechanisms of binding, persistence, and degradation in soil substrates. Plant DNA persistence within soil allows us to harvest its botanical memory for identifying vegetation through time. Indeed, comparisons of plant identifications through both visual vegetation surveys and soil eDNA assessments have shed light on the temporal signals stored in topsoils. In boreal areas, the soil eDNA signal mostly registered contemporary vegetation (
Knowledge of the flora and the study area is key in any study to ensure sound conclusions. Below you will find considerations that can help you answer common questions when designing field and wet lab experiments.
Soil sampling can be done by scooping out the soil, drilling down a tube (e.g., a 50 ml falcon tube), or using a soil core sampler. We recommend using sampling protocols specifically validated in an environment similar to your study site, e.g., woodlands, grasslands, meadows, boreal, temperate, and tropical forest (
Obtaining clean DNA samples as well as avoiding cross contamination is challenging when sampling soil eDNA. Collection instruments should therefore be decontaminated between each sample (e.g., flaming, chlorine cleaning), gloves and masks should be worn and changed regularly to avoid introduction of DNA, and samples should be stored in separate plastic bags. In order to stop (or greatly reduce) enzymatic activity, samples should be stored cold or frozen, preferably at -20 °C, if immediate sample processing is not possible (
DNA extraction is a key bottleneck when capturing molecular data, and protocols need to be tailored to both the study area and the question(s). At a minimum, you need to decide which fraction of the total soil eDNA (iDNA or exDNA) you want to isolate to answer your research question. In general, isolating exDNA is preferred when targeting non-microorganisms and avoiding diversity patterns across short temporal scales (
If (meta)barcoding is used for identification, there are three desired features for a barcode in any study: sufficient polymorphism for identification at the desired taxonomic resolution, conserved primer binding sites for universal amplification, and available reference sequences for the target organism. In many cases, not all features can be met. You may therefore need to decide on which features are most important for your research question. For more general information about choosing suitable markers and available reference databases, see Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding. Soil eDNA studies targeting plants have used markers found in chloroplast DNA (
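When no single marker satisfies all three features, making the trade-off explicit can help. The sketch below is a hypothetical weighted scoring of candidate markers; the feature scores and weights are illustrative assumptions that you would set from your own study's priorities, not published values.

```python
# Hypothetical scoring of candidate barcodes against the three desired
# features named above; all numbers are illustrative assumptions.
candidates = {
    # marker: (polymorphism, primer universality, reference coverage), each 0-1
    "trnL (P6 loop)": (0.4, 0.9, 0.8),
    "nrITS2":         (0.8, 0.6, 0.7),
    "rbcL":           (0.3, 0.9, 0.9),
}

weights = (0.5, 0.2, 0.3)  # e.g., prioritise taxonomic resolution

def score(features, weights):
    """Weighted sum of the three barcode criteria."""
    return sum(f * w for f, w in zip(features, weights))

# Rank candidate markers from best to worst under the chosen weights:
for marker, feats in sorted(candidates.items(),
                            key=lambda kv: -score(kv[1], weights)):
    print(f"{marker}: {score(feats, weights):.2f}")
```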
The laboratory technician hands you an extraction protocol that has been used previously to extract DNA from soil and sediments. How do you know if this protocol will extract both iDNA and exDNA? Motivate your answer.
You are designing your soil eDNA study for a plant taxon that is distributed heterogeneously across plots. Describe the soil sampling strategy that will take into account the target taxon distribution.
You want to reconstruct vegetation types based on soil eDNA targeting the
By checking whether the protocol includes a step that lyses cells, which is required to extract iDNA. This step can be grinding, sonication, thermal shocks, or chemical treatments such as chloroform.
To account for heterogeneity, take many subsamples across the plot and pool them.
Soil eDNA using
Marcel Polling 1,2
1 Naturalis Biodiversity Center, Leiden, The Netherlands
2 Natural History Museum, University of Oslo, Oslo, Norway
Marcel Polling
To identify pollen, spores, and other plant-related microremains, the field of palynology has traditionally relied on microscope-based analyses. This is a time-consuming process that requires highly trained specialists. Additionally, pollen grains from many plant families are morphologically indistinguishable using light microscopy (
These challenges highlight the necessity for innovative methods within the field of palynology, to increase both the speed and accuracy of pollen identifications. DNA-based methods for the molecular identification of pollen grains have the potential to be of complementary value. However, the extraction of DNA from pollen is non-trivial. This chapter therefore focuses on how DNA can be extracted from pollen, the common problems encountered, and the qualitative and quantitative molecular possibilities for analyses.
Using pollen grain DNA for identification has shown promising results in a number of applications, including the study of provenance and authentication of honey (
Collecting pollen for DNA analysis is mostly similar to collecting pollen for microscopic analysis, though more care should be taken to avoid contamination from other potential sources of DNA. This is because pollen generally contains low quantities of DNA and is therefore prone to contamination. Pollen grains can either be collected directly from the environment (air, water, soil, etc.) or from pollinators (pollen baskets, honey). Pollen collected from the environment will most often (though not always) be derived from anemophilous (wind pollinated) plants, while pollinators collect the majority of pollen from so-called entomophilous (insect pollinated) plants. Pollinators may, however, also have anemophilous pollen accidentally sticking to their bodies. For studies looking at pollen from pollinators, either all pollen grains on the animal’s body are collected by washing off the pollen or, when present, only the corbicular pollen baskets are collected (
Pollen grains can be referred to as “natural plastic”: they have a very hard outer cell wall called an exine, which is made of sporopollenin (
Overview of selected studies since 2017 that have used molecular techniques to identify pollen, including the aim, strategy for pollen lysis, extraction method, number of PCR cycles, sequencing method, and marker choice.
| Study | Aim | Pollen lysis step | Extraction method | PCR cycles | Sequencing method | Markers |
|---|---|---|---|---|---|---|
| … | Airborne pollen identification | Bead beating (one 5 mm stainless steel bead), two 1-min cycles at 30 Hz | DNeasy Plant Mini Kit (Qiagen) and Nucleomag kit (Macherey-Nagel) | 30 | Sanger sequencing | … |
| … | Pollen quantification | Bead beating (mix of 0.5 and 1 mm silica beads), 2 min | Wizard (Promega) | N/A | Genome skimming | N/A |
| … | Pollen quantification | Bead beating (mini-bead beater), 3 min | FastDNA SPIN Kit for Soil (MP Biomedicals) | 30 | Metabarcoding | nrITS2, … |
| … | Pollen quantification | Bead beating (five 1 mm stainless steel beads), 2 min at 22.5 Hz | Adapted CTAB | N/A | Genome skimming | N/A |
| … | Plant pollinator interactions over time | Bead beating (one 3 mm stainless steel bead + lysis buffer), 2 min at 25 Hz | QIAamp DNA Micro Kit and DNeasy Plant Mini Kit (Qiagen), Nucleospin DNA Trace Kit (Macherey-Nagel) | 30 | Metabarcoding | nrITS1, nrITS2, … |
| … | Airborne pollen identification | Bead beating (3 mm tungsten beads), 4 min at 30 Hz | DNeasy Plant Mini Kit (Qiagen) | 35 | Metabarcoding | nrITS2, … |
| … | Bee pollen diet | Bead beating (3.355 mg 0.7 mm zirconia beads), 5 min | DNeasy Plant Mini Kit (Qiagen) | Three steps (55 cycles in total) | Metabarcoding | nrITS2, … |
| … | Insect migration analysis | Bead beating (five zirconium beads), 1 min at 30 Hz | No extraction; Phire Plant Direct Polymerase | Two steps (32 cycles in total) | Metabarcoding | nrITS2 |
| … | Pollen quantification | CF lysis buffer (Nucleospin Food Kit) | DNeasy Plant Mini Kit (Qiagen) | 25, 30, 35 | Metabarcoding | nrITS1, … |
| … | Airborne pollen identification | Bead beating (0.2 g 425–600 μm glass beads + lysis buffer), two 1-min cycles (3,450 oscillations/min) | Adapted CTAB | 40 | Metabarcoding | … |
| … | Bee pollen diet | Bead beating (150 g mix of 1.4 mm ceramic and 3 mm tungsten beads + lysis buffer), two 45-s cycles at 6.5 m/s | DNeasy Plant Mini Kit (Qiagen) | 37 | Metabarcoding | nrITS2 |
It should be noted that other methods for DNA extraction from pollen exist in which the pollen grains are not destroyed, and in some specific cases, excluding the bead-beating step has even given better results (
Several commercially available DNA extraction protocols have been used for DNA extraction from pollen grains after the lysis step. Table
The quality of DNA that can be extracted from pollen samples is critical for any molecular identification method, particularly when working with very small amounts of DNA. Avoiding contamination is therefore essential: work in a clean lab, keep windows closed, use sterilised tools in a laminar flow cabinet, and keep the DNA extraction lab separate from the post-PCR environment.
Molecular methods can contribute to the analysis of pollen both by identifying which species are present (qualitative) as well as by giving a measure of the abundance of different pollen species (quantification). While DNA metabarcoding methods are currently most often used (Table
Species-resolution in pollen grain identifications is critical for studies that try to answer specific research questions including: what particular species of flower does a common carder bee prefer? What grass species is responsible for most of the pollen in the ambient air in early May? Species-specific markers and qPCR techniques can be used for the identification of specific species within a mixture of different pollen types (see Chapter 10 DNA barcoding). One study used custom-made primers for the nuclear Internal Transcribed Spacer (nrITS) to differentiate between mugwort (
DNA barcoding can be used to target specific species, yet it is rare that a pollen sample contains only a single pollen species. DNA metabarcoding is therefore the most-often used method for the molecular identification of the different species of pollen grains from mixed samples (see Chapter 11 Amplicon metabarcoding). Both nuclear and chloroplast DNA can be amplified in pollen DNA (
While research into targeting different barcoding regions and primers is ongoing (
It is important to use positive controls with known concentrations of different pollen species in any DNA metabarcoding study. This is because the amount of DNA that can be extracted from different pollen types has been shown to vary. For example, it can be easier to extract DNA from pollen with a thinner exine and from plant species that are richer in chloroplast DNA than from those having a more ‘sturdy’ exine (
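One common way to use such positive controls is to derive a per-taxon correction factor from a mock community of known composition and apply it to the sample read counts. The sketch below assumes each taxon has a constant multiplicative bias, which is a simplification; all numbers are illustrative.

```python
# Minimal sketch: correcting metabarcoding read proportions with a mock
# community, assuming each taxon has a constant multiplicative bias.
mock_true  = {"Betula": 0.25, "Poa": 0.25, "Quercus": 0.50}  # known input proportions
mock_reads = {"Betula": 5000, "Poa": 1000, "Quercus": 4000}  # observed reads

total = sum(mock_reads.values())
# bias factor = observed proportion / true proportion
bias = {t: (mock_reads[t] / total) / mock_true[t] for t in mock_true}

def correct(sample_reads):
    """Divide observed reads by the taxon bias, then renormalise."""
    adjusted = {t: r / bias[t] for t, r in sample_reads.items()}
    s = sum(adjusted.values())
    return {t: v / s for t, v in adjusted.items()}

# Apply the mock-derived factors to an environmental sample:
print(correct({"Betula": 3000, "Poa": 600, "Quercus": 2400}))
# -> {'Betula': 0.25, 'Poa': 0.25, 'Quercus': 0.5}
```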
Beyond identifying which pollen species are present in a particular sample, pollen grain quantification is equally important. For example, for hay fever forecasts, it is not just important to know
In a recent study on the use of DNA to quantify pollen grains, Bell and colleagues found a very weak correlation between pollen counts recorded by palynologists and the proportion of metabarcoding reads (
Other research groups, however, have found more promising results using DNA metabarcoding to quantify pollen grains.
Since using DNA metabarcoding approaches for pollen abundance may not give quantitative results with complex, multi-species samples, other molecular methods such as genome skimming and shotgun sequencing are being used to circumvent some of the drawbacks. The major advantage of these two methods is that they do not include a PCR-step and therefore do not introduce amplification bias (see Chapter 12 Metagenomics). Genome skimming has already been used to show that quantification is feasible, even for pollen from species that are very rare in mock mixtures (
What are the main advantages of molecular pollen identification over traditional (microscopic) methods? Justify your answer.
Pollen is dispersed by various vectors. There are two main types of pollination strategies in land plants; name them and explain the importance of the difference between the two in terms of DNA yield.
Which four factors make the quantification of pollen grains using metabarcoding problematic?
A higher taxonomic resolution can be achieved using molecular methods such as metabarcoding. Furthermore, pollen analysis requires highly trained experts that have to spend considerable time to analyse a single sample and therefore molecular techniques are faster, especially with a large number of samples.
Entomophilous (insect-pollinated) and anemophilous (wind-pollinated) pollen. The presence of pollenkitt on entomophilous pollen grains influences the amount of DNA that can be obtained per pollen grain.
Copy number, DNA preservation, DNA isolation technique, and amplification bias.
Felicitas Mück1, Carlos A. Vásquez-Londoño2
1 Department of Pharmacy, Faculty of Mathematics and Natural Sciences, University of Oslo, Norway
2 Department of Pharmacy, Faculty of Sciences, National University of Colombia, Colombia
Felicitas Mück
Carlos A. Vásquez-Londoño
DNA-based methods for the molecular identification of plant products can help us to address food and medicine authenticity issues at each stage of the supply chain (Di Bernardo et al. 2007). Documentation and requirements for DNA-based detection methods for food authentication are defined in collaborative activities by the European Committee for Standardization (CEN) and the International Organization for Standardization (ISO). Both rapid and accurate identification of plant products is crucial for the herbal drug industry (
The majority of standardised DNA-based authentication methods for the inspection and regulation of food and plant-medicines use well-established PCR-based techniques for DNA amplification as these are sensitive, specific, and simple (
DNA barcoding methods are also established for the identification of unique medicinal and edible plant species (
High-throughput sequencing (HTS) methods such as amplicon metabarcoding are also powerful tools for the authentication of herbal end products, post-marketing control, pharmacovigilance, and the assessment of species composition in botanical medicines, such as in traditional Chinese medicines (TCMs) (
In addition to PCR-based techniques, the detection of single nucleotide polymorphisms (SNPs) is frequently used for the molecular identification and authentication of various food commodities using small DNA fragments (Di Bernardo et al. 2007;
DNA-based methods for molecular plant identification depend on well-curated nucleotide sequence repositories. In addition to GenBank (
Successful DNA extraction is the foundation for any further downstream analysis (
Four main factors that affect the efficacy of DNA isolation from food and medicine samples are the sample source and processing, collection and storage, homogenisation, and the presence of contaminants. Generally it is easier to extract high-quality DNA from fresh samples (
Removal of frequent contaminants that can reduce the yield of extracted DNA from edible and medicinal plants.
What compounds define the chemical composition of your samples?

| Contaminant | Problems during DNA extraction | Removal strategies |
|---|---|---|
| Proteins and RNA | Can co-purify with DNA, depending on the age of the samples and how they were conserved. Inhibit enzymes used for molecular techniques (restriction endonucleases, polymerases, and ligases). | RNA is removable with DNase-free RNase A or by ethanol precipitation using lithium chloride. Proteins can be removed by i) inclusion of detergents (cetyltrimethylammonium bromide (CTAB), SDS) in the extraction buffer; ii) protein denaturants, e.g., β-mercaptoethanol (BME), dithiothreitol (DTT); iii) enzymatic proteases, e.g., proteinase K. |
| Polysaccharides (starch, sugars) | Can co-precipitate with DNA, giving the DNA pellet a sticky, viscous consistency after centrifugation; the contaminated pellets are unusable for many downstream analyses. Adhere to wells in agarose gels, resulting in long smears of bands detected in the gel. | Removal via highly concentrated sodium chloride (NaCl) in extraction buffers, leading to increased solubility in ethanol; a combination of NaCl and the cationic detergent CTAB; CTAB with differential precipitation. |
| Polyphenolics (plant secondary metabolites such as tannins, flavonoids, terpenoids, etc.) | When bound to DNA, very hard to remove in extraction. | Binder compounds polyvinylpyrrolidone (PVP) or polyvinylpolypyrrolidone (PVPP) can be used in extraction buffers to absorb polyphenols before they polymerise with DNA; antioxidant compounds (BME, DTT, ascorbic acid, iso-ascorbate) in the buffer prevent polymerisation. |
Although CTAB-based methods usually succeed in extracting DNA from plants and from processed food and medicine products, the yield is often quite low and the protocols are time-consuming (
Several studies exist that compare commercial and in-house DNA isolation techniques for food and medicine (
Overview of different DNA extraction methods recommended for use with food by the European Union Reference Laboratory for GM Food and Feed (EU-RL GMFF).
| Plant source | Method of choice | Reference |
|---|---|---|
| Maize seeds and grains | For isolation of genomic DNA from a wide variety of maize tissues and derived matrices; yields high-quality genomic DNA from processed plant tissue (e.g., leaf, grain, or seed). Lysis step (thermal lysis in the presence of Tris HCl, EDTA, CTAB, and β-mercaptoethanol). Tissues are processed prior to the extraction procedure, e.g., with a mortar and pestle and liquid nitrogen (leaf) or a commercial blender (grain or seed). | CRLVL16/05XP corrected version 2, 01/03/2018 |
| Soybean seeds | “Dellaporta-derived” method starting with a lysis step (thermal lysis in the presence of Tris HCl, EDTA, NaCl, and β-mercaptoethanol). Isopropanol precipitation and removal of contaminants such as lipophilic molecules and proteins by extraction with phenol:chloroform:isoamyl alcohol. | CRLVL13/05XP, 14/05/2007 |
| Freeze-dried potato tubers | Lysis step (thermal lysis in the presence of CTAB, EDTA, and proteinase K). Removal of RNA by digestion with RNase A, and removal of contaminants such as lipophilic molecules and proteins by extraction with chloroform. Remaining inhibitors are removed by a gel filtration step using the commercially available S-300 HR Microspin Columns (Amersham Pharmacia). | CRLVL09/05XP Corrected Version 1, 20/01/2009 |
| … | Lysis step (thermal lysis in the presence of Tris HCl, EDTA, SDS, and β-mercaptoethanol). Removal of contaminants such as lipophilic molecules and proteins by extraction with phenol and chloroform. DNA is precipitated with isopropanol and the pellet dissolved in TE buffer. Remaining inhibitors are removed by an anion exchange chromatography step using the DNA Clean & Concentrator 25 kit (Zymo Research). | CRLVL14/04XP Corrected Version 1, 15/01/2007 |
| … | Technique ideal for the rapid isolation of small amounts of DNA from many different species; also useful for large-scale isolations. Lysis step (thermal lysis in the presence of Tris HCl, EDTA, CTAB, and β-mercaptoethanol). Removal of contaminants such as lipophilic molecules and proteins by extraction with phenol and chloroform. Samples are processed prior to the extraction procedure (mortar and pestle, liquid nitrogen, or commercial blender). | … |
After DNA extraction, measuring both the DNA concentration and purity is important before continuing with further downstream analysis. Isolated DNA can be tested for quality using absorbance methods, agarose gel electrophoresis, and fluorescent DNA-intercalating dyes (
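For absorbance-based checks, the usual rules of thumb are that an A260 of 1.0 corresponds to roughly 50 ng/µL of double-stranded DNA (1 cm path length), an A260/A280 ratio near 1.8 indicates DNA largely free of protein, and a low A260/A230 ratio flags carryover of phenol, polysaccharides, or other contaminants. A minimal sketch applying these rules; the cut-off values below are common conventions rather than fixed standards.

```python
def dsdna_conc_ng_per_ul(a260: float, dilution: float = 1.0) -> float:
    """Rule of thumb: A260 of 1.0 ~ 50 ng/uL dsDNA at 1 cm path length,
    scaled by any dilution factor."""
    return a260 * 50.0 * dilution

def purity_flags(a260: float, a280: float, a230: float):
    """Flag common contamination patterns from absorbance ratios."""
    flags = []
    if a260 / a280 < 1.7:   # ~1.8 expected for pure dsDNA
        flags.append("possible protein contamination (A260/A280 low)")
    if a260 / a230 < 1.8:   # ~2.0-2.2 expected
        flags.append("possible phenol/polysaccharide carryover (A260/A230 low)")
    return flags or ["no obvious purity problems"]

# Example spectrophotometer readings (illustrative values):
print(dsdna_conc_ng_per_ul(0.25, dilution=10))          # -> 125.0 ng/uL
print(purity_flags(a260=0.25, a280=0.16, a230=0.20))    # both flags raised
```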
It is in the interest of both biodiversity conservation and public safety that DNA-based techniques are further developed to screen food and medicine sourced from the global market (
The quality of DNA from food and medicinal sources is a critical factor for DNA-based analyses. Which factors can influence the quality of nucleic acids extracted from foods and plant-based medicines?
What is the first step when choosing a DNA isolation technique for your samples?
What methods can be used for measuring DNA quality after isolation?
Sample source and processing, collection and storage, homogenisation, and the presence of contaminants.
One should first consider whether the sample is a complex mixture or a pure product, and its degree of processing (form and degree of homogeneity).
Absorbance methods, agarose gel electrophoresis, and fluorescent DNA-intercalating dyes.
Physilia Chua 1,2, Christina Lynggaard1, Kristine Bohmann1
1 Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Denmark
2 Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
Physilia Chua
Christina Lynggaard
Kristine Bohmann
Did you know that faeces are windows to the natural world? Faeces, although not the most glamorous of materials, are worth their weight in gold when it comes to providing information about the host(s) they derive from. Faeces, also commonly known as scat, poop, droppings, excreta, or stools, are the solid remains of ingested food that was not digested in the intestine. They are composed of water, protein, polysaccharides, fats, solids (e.g., fibres from plants), and bacteria (
Different types of information can be obtained from faeces. Chemical analyses provide information on hormonal changes that can occur from stress (
Non-molecular methods have traditionally been used for the analysis of contents from faecal samples. An example is microhistology, where small amounts of faecal samples are mounted on a microscope slide, and digested remains of plant cuticle fragments are identified based on morphology (
In plant molecular applications, a common use of faecal samples is in herbivore/omnivore diet studies. The goal of most plant-focused diet studies is to characterise the diet profile of the host, which can be used to answer research questions concerning for example, resource competition and partitioning (
The main advantage of using faecal samples for molecular plant identification as compared to other types of samples such as whole animals/insects (
One limitation when using faecal samples for molecular plant identification is that it can be difficult to obtain fresh faecal samples collected immediately after defecation, especially when working with wild animals. Age of samples can have an impact on the amount and quality of DNA that can be extracted due to DNA degradation caused by exposure to environmental conditions (
Advantages and limitations of using DNA from faeces to reconstruct plant communities.
| Advantages | Limitations |
|---|---|
| Non-invasive | Fresh samples may be challenging to obtain from wild animals |
| Easy to detect and collect | Presence of PCR inhibitors |
| Not considered rare | DNA degradation |
| Does not require capturing or locating the animal of interest | Faeces of closely related species can be hard to distinguish morphologically |
| | Additional molecular work needed |
| | Increased cost and time |
Before designing any sampling strategies for the collection of faecal samples, there are at least six factors that researchers must take into consideration:
The research question(s) and the required data to achieve the research objectives
The ecology of the species to be studied
The feasibility of sampling in the study area (is accessing the terrain a safety risk?)
The duration and spatial extent of the project (long term or short term? Does it span across different seasons?)
Budget constraints
Ethical considerations
Based on the research question(s) and objectives (i.e., quantitative, presence/absence, composition), researchers must decide how many samples and replicates are needed from each individual and/or population to sufficiently meet their research objectives. The choice of downstream molecular methods used for reconstructing herbivore/omnivore diet will also have an impact on how many samples are required. In quantitative studies where the objective is to quantify the ingested biomass, the number of different individuals sampled is not as important as in composition studies, where more individuals are required to obtain a better overview of the dietary range of the studied species. This is due to the effect of individual food preference, which can lead to biases in retrieving the whole range of a dietary profile for a given species if only a few individuals are studied (
Generally, the more ecological information gathered and incorporated into sampling strategies, the higher the chance of successful faecal collection. For wild species, prior ecological information regarding the species of interest is essential for designing sound sampling strategies, to optimise and streamline sample collection. Researchers can use the following questions as a guide in planning their sample collection strategy:
Is the target species localised to a certain area?
What is the extent of its daily range (does it differ between seasons)?
Is it a generalist or a specialist?
What is its foraging behaviour (does it differ between seasons)?
Is the habitat easily accessible for sample collection?
What is the density of the population in the study sites?
Does its habitat overlap with closely-related species and will this lead to possible collection of faeces from non-target species?
Without this information, it is challenging to narrow down specific study sites for field collection. Additionally, such information can reduce the manpower, resources, and time spent in the field while increasing the probability of finding sufficient numbers of faecal samples. Knowledge of habitat range and population density can prevent excessive numbers of samples being collected from a single individual when the research question requires samples from multiple individuals. Differences in home range and diet between seasons can also impact the sample collection strategy (
Sampling strategies are also heavily dependent on budget constraints, which may reduce the time spent on sample collection, the number of samples processed, and the choice of molecular techniques used to analyse the faecal samples. It is therefore prudent to ensure that the budget fits the research objectives, or that the research objectives are tailored to fit the budget. While there are many different approaches to sampling, two commonly used approaches are systematic sampling and opportunistic sampling. In systematic sampling, the study area is divided into grids or transects, and samples are taken at each grid point or at fixed intervals (
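For systematic sampling, the grid points can be generated directly from the extent of the study area and the chosen spacing, which also makes the required number of samples explicit before fieldwork begins. A minimal sketch; the area dimensions and spacing below are placeholder values.

```python
def grid_points(x_min, y_min, x_max, y_max, spacing):
    """Regular sampling grid over a rectangular study area; coordinates
    are in the same units as the inputs (e.g., metres)."""
    points = []
    y = y_min
    while y <= y_max:
        x = x_min
        while x <= x_max:
            points.append((x, y))
            x += spacing
        y += spacing
    return points

# A 500 m x 300 m study area sampled every 100 m -> 6 x 4 = 24 points
pts = grid_points(0, 0, 500, 300, 100)
print(len(pts), pts[:3])
```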
Finally, ethical consideration of minimising distress to studied animals is one of the main concerns in animal studies and there are legal restrictions as implemented in the EU Directive 2010/63/EU on the protection of animals used for scientific purposes (
Once the sampling strategy has been determined, the sampling in the field can start. The first step is to locate the faecal samples in the field. Once faecal samples have been located, collection can begin. When collecting faecal samples, there are a few materials that will be needed no matter what animal and habitat the faecal samples are derived from; sterile tubes filled with e.g. RNAlater™, silica beads or 90% ethanol, gloves, and a device to collect the samples. Sterile tubes will be necessary for sample storage. Tubes can have either removable screw-lids or hinged lids. Removable screw-lids have the advantage that the lids will not come off during transport. However, there is an increased risk of environmental contamination with these lids since they are separate from the tube and must be placed somewhere before collection. Tubes with hinged lids are easier to work with in that sense, though they can open during transport if not sealed (e.g., with parafilm™). Proper use of gloves and a collection device are also important to limit the risk of a collector becoming sick from directly handling faeces, as well as reducing the risk of sample contamination. The size and type of the sampling device can differ depending on the size of the faecal dropping and can range from a toothpick to a large spoon.
DNA-based diet analyses are very sensitive to contamination, and the trace amounts of digested plant material that can be extracted from faecal samples is easily contaminated. Contamination can occur between samples, by plant DNA from the surrounding environment, or even from the collector’s (plant-based) lunch (
To avoid DNA degradation, faecal samples should be preserved as soon as possible upon collection and stored under the same conditions (
To avoid contamination, extractions should be carried out in a room free of PCR amplified DNA. Due to the risk of zoonotic disease transmission, extraction should ideally be carried out in a flow-hood to avoid inhaling dust from dry faeces (
Faecal samples from plant-eating animals usually contain high levels of PCR inhibitors such as humic acid, which can lead to amplification failure during downstream analysis (
Depending on the research question(s), several different HTS methods can be used for analysing DNA extracted from faecal samples including metabarcoding (
Name one sampling limitation of working with faecal samples as compared to other types of samples (e.g., gut contents) and give suggestions on how to overcome this limitation.
How does prior information of studied species ecology aid in sampling design?
Contamination can occur during sample collection, sample preprocessing, and DNA extraction. Describe the main type of contamination during each phase and how it can be prevented.
Possible challenges and solutions: It is difficult to obtain fresh faecal samples → one can use pointing dogs; Problem of using relatively long DNA barcoding fragments → use primers that can amplify shorter regions; Overlapping habitats of closely related species → use additional molecular markers to identify species (though this increases the cost and time necessary); Faecal samples only provide a snapshot of the entire diet → take multiple samples from the same individual and/or sample a larger number of individuals over a longer period and larger geographical area.
It helps to narrow down the study areas for field collection. This reduces the manpower, resources, and time needed, increasing the chance of finding samples. When applying for permits, you can point out how to keep the disturbance of animals in the field to a minimum with this knowledge, which will increase the chances of obtaining permission.
During sample collection → wear gloves, ensure that samples are not collected from wet soil, practice good collection hygiene; During sample preprocessing → remove outer layers that were in close contact with the environment, work in flow-hood and a PCR-free lab; During DNA extraction → include extraction controls, avoid using extraction kits with plant-based or other types of contaminants.
Anneke T.M. ter Schure1, Yi Wang2, Anna L.J. Chagas2, Laura S. Epp2
1 Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway
2 University of Konstanz, Konstanz, Germany
Anneke T.M. ter Schure
Yi Wang
Anna Luiza Jaime Chagas
Laura S. Epp
Sedimentary ancient DNA studies aim to reconstruct the biology and ecology of past environments using the DNA present in the sediment record. Compared to modern soil and sedimentary DNA (see Chapter 4 DNA from soil), these analyses can be more challenging due to the prolonged exposure of the DNA to degradation processes. This has major implications for the scope of the study and the appropriate study design, which will be discussed in this chapter.
In order to use sedimentary ancient DNA for paleoecological studies (
Ancient DNA is the hereditary genetic content of cells from organisms that died long ago. There is no consensus on how old DNA must be to count as ancient; age generally matters less than the degradation processes to which the DNA has been exposed, which leave it more damaged than modern DNA.
How does DNA end up in the sediment? Sediment is a result of erosion, weathering and biological processes and consists of organic and inorganic particles (e.g., sand and silt) that are transported by wind, water, or people (
Once exposed to the sedimentary environment, exDNA can undergo different post-depositional taphonomic processes that determine the quality of the DNA on longer timescales. ExDNA can be internalised by microbial cells (
Schematic overview of DNA degradation processes (hydrolysis, oxidation, alkylation and Maillard reaction) that can cause DNA damage in the form of cleavage, base modifications or cross-links. The major mechanism leading to miscoding lesions in aDNA is the hydrolysis of cytosine to uracil, which leads to G to A and C to T substitutions by DNA polymerases, whereas blocking lesions can obstruct the movement of DNA polymerases during PCR (
By analysing the ancient DNA present in the sediment (
In general, palaeovegetation data are the result of the attributes of the original vegetation, combined with depositional factors and preservation, as well as the experimental procedures to produce the data. For
A combination of
Comparison of pollen, plant macrofossils, and sedaDNA.
| Category | Pollen | Plant macrofossils | sedaDNA |
|---|---|---|---|
| Scale | Regional | Local | Local |
| Taxonomic groups | Pollen-producers | All plants | All organisms |
| Labour-intensive | Yes | Yes | No |
| Need for taxonomic knowledge | Yes | Yes | No |
| Taxonomic resolution | Limited to identifiable pollen types, generally to genus level | Generally to species level | Depends on the marker; possible to species level |
| Qualitative | Yes | Yes | Yes |
| Quantitative | Partial | Limited | Debated |
The first study using
Environmental reconstructions can range from polar to temperate and tropical regions, although they are limited to sampling sites that allow preservation of
Due to its low concentration, retrieving ancient DNA from sediment samples requires strict protocols to avoid contamination by modern DNA or further degradation (
The previous section described some
Simplified overview of the
Choices for the different options at each step depend on the aims of the study. For example, when performing a reconstruction of overall plant community dynamics with universal plant metabarcoding primers, the most common taxa and major trends in community change will be reliably retrieved in the first PCR performed (
What is my study aim?
What spatial and temporal scale do I need to cover?
What contextual information and metadata do I need?
What taxa should I target and at what taxonomic resolution?
What laboratory and analytical methods should I use?
How will I minimise / control for contamination, biases, and false positives?
The aims of the study define the temporal and spatial scale needed to achieve them, thereby steering the selection of relevant sampling sites. Lake sediments provide a record of the plants that occurred in the lake catchment, i.e., the area of land from which water and surface runoff drain into the lake (
General conditions under which
Dating is important in any study that involves ancient samples. Only with accurate dating can the timing of events be compared and their rates of change estimated. Commonly applied sediment dating methods are radioisotopic dating (in particular 210Pb, 14C, and luminescence dating) and dating based on chemostratigraphy or marker minerals (in particular tephrochronology), and the choice for a method depends on the type and age of the sediments (see Table
Summary of sediment dating methods, their applicability and limitations. Sources:
| Dating method | Suitable sample types | Age limit | Sources of error and uncertainty |
|---|---|---|---|
| 210Pb dating | Materials from aquatic environments such as lacustrine and marine deposits | ~100 to 150 years | Complex sedimentation processes that break the dating model assumptions, such as compaction, local mixing, erosion, etc. |
| 14C (radiocarbon) dating | Organic remains (charcoal, wood, animal tissue), carbonates (corals, sediments, stalagmites and stalactites), water, air, and organic matter from various sediments, soil, paleosol, and peat deposits | Up to 50,000 years | Atmospheric 14C content fluctuation due to changes in cosmogenic production rate and exchange between the atmosphere and ocean |
| Luminescence dating: thermoluminescence (TL) and optically stimulated luminescence (OSL) | TL: materials containing crystalline minerals, such as sediments, lava, clay, and ceramics. OSL: materials containing quartz or potassium feldspar sand-sized grains, or fine-grained mineral deposits | TL: a few years to over 1,000,000 years. OSL: a few decades to ~150,000 years for quartz | Variations in environmental radiation dose; saturation of electron traps in sample minerals |
| Tephrochronology | Terrestrial and lake sediments, marine deposits, and ice cores that contain tephra | Up to 35,000 years, extendable under good conditions | Can only obtain indirect dates within the 14C age range |
Radioisotopic dating is based on the principle of radioactive decay. When a nucleus breaks down, it emits energy and forms a daughter product. The time this takes is expressed as the half-life, i.e., the time that it takes for 50% of a parent element to transmute into the daughter product. The relative quantity of a radioactive parent element in a sample can be used to infer its age. Relatively young aquatic sediments, with ages up to 150 years are commonly dated with 210Pb (half-life: 22.27 years;
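The same decay principle gives the age equation t = (t1/2 / ln 2) × ln(N0/N), where N/N0 is the fraction of the parent isotope remaining. The sketch below computes an uncalibrated radiocarbon age from the remaining 14C fraction, using the 5,730-year half-life of 14C; the measured fraction is an illustrative input, and real radiocarbon ages additionally require calibration against atmospheric 14C fluctuations.

```python
import math

HALF_LIFE_14C = 5730.0  # years

def radiocarbon_age(fraction_remaining: float) -> float:
    """Uncalibrated age from the remaining fraction of 14C:
    t = (t_half / ln 2) * ln(1 / fraction_remaining)."""
    return HALF_LIFE_14C / math.log(2) * math.log(1.0 / fraction_remaining)

# A sample retaining 25% of its original 14C is two half-lives old:
print(radiocarbon_age(0.25))  # -> ~11460 years
```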
Luminescence dating is based on the phenomenon that mineral crystals absorb electrons from the ionising radiation of surrounding sediments over time, and when stimulated in a laboratory by heat or light, they release the accumulated radiation as luminescence. The intensity of the measured luminescence indicates the length of time between this in-lab stimulation and the last natural event of similar stimulation. Heat-stimulated or thermoluminescence (TL) dating is used to date baked pottery from archaeological sites or sediments once in contact with molten lava; optically stimulated luminescence (OSL) dating is used to date sediments once exposed to sunlight. The time range for luminescence dating extends from a few decades to over 1 Ma, depending on the ability of a mineral to absorb radiation over time. For studies concerning relatively young samples, OSL dating of quartz grains is generally used, covering from a few decades to ~150 ka.
Tephrochronology uses the chemical signature of tephra (volcanic ash) to pinpoint the age of that specific layer in a sediment sequence by reference to known or unknown dated volcanic eruptions. Terrestrial sediments (
DNA is everywhere - including in the air - and contamination can come from many different sources. When collecting and working with
The precautions you can take include: work cleanly, use equipment that is free of DNA and nucleases, and try to keep the samples in a stable and cold environment. In practice this is not so easy, which is why dedicated ancient DNA facilities are set up to avoid any form of contamination. These facilities should be physically isolated - ideally in a separate building - from any location where PCRs are performed (
You should assume that everything that you bring into the lab is contaminated with DNA. Therefore, before entering the lab, you should have showered and changed into clean clothes and everything you bring into the lab should be decontaminated. Inside the lab, you should wear a hairnet, face mask, full body suit with hood, shoe covers, and gloves at all times. Wearing two layers of gloves will allow you to change the outer gloves while still covering your hands, and you should change your outer gloves regularly while working. All tools and equipment should be decontaminated before use, and regular cleaning of the aDNA workspace is needed. Decontamination can be achieved by using a DNA decontamination product (e.g., 3-10% bleach or DNA-ExitusPlusTM) for surfaces, ideally supplemented with UV irradiation of the workspace. To prevent cross-contamination, tools should be cleaned between working with each sample or sample-extract. Tools should be left in a DNA decontamination product for at least 10 minutes, rinsed with UV irradiated milliQ water, and ideally also UV irradiated using a UV crosslinker with irradiation at the shortest distance possible to the UV source (
Choices for sampling and personal protective equipment will depend on the setting, as the sampling of sediments at an archaeological site can be very different from the sub-sampling of a lake sediment core in a lab facility. It is important to try to limit the amount of potential contamination, but practical considerations and the target DNA can also be leading. For example, a study aiming to recover human aDNA will require stricter use of personal protective equipment than a study focussing on plant aDNA. Sampling of sediments can be done directly in the field or by subsampling of sediment cores in a clean, sheltered environment. When collecting sediment cores for
A general
The sampling itself follows aDNA lab procedures where possible, even if it takes place elsewhere: clean the workspace, use personal protective equipment, do not hover over the sediment you are sampling and change outer gloves and tools between each individual sample. In order to avoid contamination, sampling should start at the oldest part of the sediment, working your way up to the youngest parts and subsamples from sediment cores should be taken from inside the undisturbed centre (
The choice for a specific DNA extraction protocol depends on a range of factors, including the aim of your study, sample characteristics, available laboratory facilities and equipment, and costs of the reagents or extraction kits. The latter can be a consideration of investing either time or finances as it can be cheaper to make the buffers needed for extraction yourself, but this also increases the preparation time and could introduce additional contamination to your samples. There are several protocols that can be used for
Overview of the advantages and limitations of several commonly used extraction protocols and some example publications using these protocols.
| Extraction protocol | Sample size | Advantages | Limitations | Used by |
|---|---|---|---|---|
| DNeasy PowerMax kit (Qiagen) | ≤ 10 g | Large initial sample volume; few inhibitors in the resulting extract | Expensive; DNA can be lost with the inhibitor removal solution | … |
| DNeasy PowerSoil kit (Qiagen) | ≤ 250 mg | Few amplification and sequencing inhibitors in the resulting extract; easy processing of large sets of samples | DNA can be lost with the inhibitor removal solution; smaller initial sample volume compared to the PowerMax kit | … |
| Rohland protocol | ≤ 50 mg | Developed to recover small DNA fragments; easy processing of large sets of samples | Small starting amount of sediment; potential coextraction of inhibitors; homemade buffers can increase contamination risk | … |
| Phosphate buffer + NucleoSpin® Soil kit | ≤ 15 g | Large initial sample volume; processes a 2 ml subsample of the phosphate buffer and sample mixture | Extracts only extracellular DNA | … |
| Murchie protocol | ≤ 250 mg | High DNA yields; uses a high-volume binding buffer to improve the recovery of small DNA fragments | Optimised for permafrost samples and may not perform as well in lake sediment | … |
All extraction protocols include similar steps for the isolation of sedimentary DNA (Figure
Common DNA extraction steps: (1) samples are first homogenized using a sterile scalpel and then go through a step in which either (2a) extracellular DNA is washed off the sedimentary matrix (
Be aware that the presence of certain substances may inhibit further amplification or sequencing steps. These can be derived from humic substances (important components of humus), which are commonly present in sediments and might inhibit downstream analysis. Moreover, the amount of humic substances is site-specific, and it might be necessary to repurify the samples or use inhibitor removal columns. During DNA extraction, contamination may be introduced from the laboratory facilities, tools, reagents and other consumables. It is essential to track this contamination by including a negative control. It is suggested to add one such extraction control for each batch of 11 samples, and include it in all subsequent steps (e.g., metabarcoding, library preparation, sequencing;
After extracting the DNA, the
Most
Sequencing data can be processed using bioinformatic tools, where strict quality filtering of the sequence data is followed by taxonomic assignment. Further filtering allows removal of sequences with low identity scores, contaminants (i.e., sequences present in the controls), and false-positives (see Chapter 18 Sequence to species for details). False identifications can be caused by the quality of the reference library, but also by technical errors, contamination, or errors in the DNA sequences, especially as
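The filtering logic described here can be made concrete: drop assignments whose best identity to a reference falls below a threshold, and remove taxa that also occur in the extraction or PCR controls. The sketch below shows this on a toy table; the 95% identity cut-off and the simple presence-based removal rule are common choices, not fixed standards.

```python
# Minimal sketch of post-assignment filtering for a metabarcoding table.
records = [
    # (sample, taxon, reads, best identity to reference)
    ("core_10cm", "Betula",  840, 1.00),
    ("core_10cm", "Homo",     12, 1.00),  # also in the control -> contaminant
    ("core_10cm", "Poaceae",  55, 0.93),  # low identity -> dropped
    ("EXT_CTRL",  "Homo",     30, 1.00),
]

# Taxa detected in any negative (extraction) control:
control_taxa = {taxon for s, taxon, *_ in records if s.startswith("EXT_CTRL")}

filtered = [
    (s, taxon, reads)
    for s, taxon, reads, ident in records
    if not s.startswith("EXT_CTRL")    # keep only real samples
    and ident >= 0.95                  # remove low-identity assignments
    and taxon not in control_taxa      # remove taxa seen in controls
]
print(filtered)  # [('core_10cm', 'Betula', 840)]
```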
Name and explain two main advantages of using
Imagine you have a long lake sediment core that is thought to be between 50 000 and 10 000 years old. What dating methods could be used to date this core and why?
What are the main sources of bias when working with
Possible advantages of
For mineral-rich sediments, luminescence dating can be used, as this method applies to sediments from a few decades to over a million years old and is based on the phenomenon that mineral crystals absorb electrons from the ionising radiation of surrounding sediments over time. For sediment rich in organic materials, AMS radiocarbon dating of identified macroscopic remains (with calibration) is a good option. Radiocarbon dating is based on the concentration of 14C in organismal remains; the half-life of 14C (5,730 years) makes it an appropriate method for samples under 50,000 years old. To increase confidence in the dating results, multiple dating techniques could be combined when creating an age model for the core.
Biases when working with
Marcella Orwick Rydmark1, Yannick Woudstra2,3,4,5, Hugo de Boer1
1 Natural History Museum, University of Oslo, Norway
2 Royal Botanic Gardens, Kew, United Kingdom
3 Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
4 Gothenburg Global Biodiversity Center, Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
5 Department of Plant Sciences, University of Oxford, Oxford, United Kingdom
Marcella Orwick Rydmark
Yannick Woudstra
Hugo de Boer
The revolution in genome-wide screening has vastly reduced the price of sequencing, with enormous implications for the biomedical field, industry, biodiversity monitoring, and plant identification. The first plant genome (
Sanger sequencing was introduced in 1977 by Sanger and colleagues, and for over 40 years it was the most commonly used form of sequencing (
In the second step of Sanger sequencing, the oligonucleotides are separated by size using capillary gel electrophoresis. A laser excites the terminal fluorescent nucleotide in each oligonucleotide, resulting in fluorescence emission that is detected and read by a computer. By reading the gel bands from smallest to largest, the 5’ to 3’ sequence of the target DNA can be determined at single base pair resolution. The data output for Sanger sequencing is a chromatogram, which is automatically read by a computer to generate the DNA sequence. Primer sequences should be trimmed off the reads as these are not part of the target DNA, and the quality of the chromatogram should be assessed to determine the reliability of the generated DNA sequence. There are a number of online tutorials from both industrial and academic sources that we refer the reader to for assessing chromatogram quality (University of Michigan, Biomedical Research Core Facilities, n.d.). Base calling accuracy can also be measured using Phred quality scores (
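Phred scores are logarithmically related to the base-calling error probability via Q = −10 · log10(P), so a Q20 base has a 1% chance of being wrong and a Q30 base a 0.1% chance. A short sketch of the conversion:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Q20 = 1% error, Q30 = 0.1% error -- the usual quality benchmarks.
for q in (20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4%}")
```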
Sanger sequencing is not used today for large-scale genomic projects due to its low throughput. The requirement for primers specific to a region of interest also limits its application across divergent plant taxa. Additionally, the amplification of multicopy genes, such as the commonly used DNA barcode ITS (see Chapter 10 DNA barcoding), as well as of markers in taxa of allopolyploid hybrid origin, results in difficult-to-interpret chromatograms. This is because nucleotide polymorphisms between different copies produce double peaks in the chromatogram (
Illumina was the second HTS technique that became commercially available in the early 2000s (
In Illumina sequencing, like in other high throughput sequencing approaches, the target DNA is initially broken into shorter fragments that match the optimal fragment sequencing length of the platform, if not already present as shorter segments. These fragments are then PCR-amplified with adaptors that can be individually chemically tethered to the flow cell surface. Using bridge amplification (
Dyed dNTPs are added in a controlled fashion through the use of reversible blocking group chemistry, so that the emission of each added fluorescent dNTP is read before the addition of the next fluorescently-labelled dNTP. This process is done on millions of fragments simultaneously, making it a far more efficient method than Sanger sequencing for large-scale genomic projects (
Two limitations to consider with Illumina sequencing, however, are that the produced reads are relatively short (50 to 300 bp) and that, as with Sanger sequencing, most applications require a PCR amplification step. That said, PCR-free library kits and protocols provide increasingly good results and have the important advantage of reducing typical PCR-induced biases. Assembling whole genomes, especially highly repetitive ones, from short Illumina reads can be challenging (
Current examples of Illumina sequencing platforms, specifications, and suitability for different applications in plant identification.
Illumina sequencing platform | MiSeq | HiSeq 2500* | HiSeq 3000* | HiSeq 4000* | NextSeq 1000 and 2000 | NovaSeq 6000 |
---|---|---|---|---|---|---|
Maximum read length (paired end) | 2 x 300 | 2 x 250 | 2 x 150 | 2 x 150 | 2 x 150 | 2 x 250 |
Maximum reads per run (single reads) | 25 million | 600 million | 2.5 billion | 5 billion | 1.1 billion | 20 billion |
Flow cell output | 15 Gb | 300 Gb | 750 Gb | 1.5 Tb | 330 Gb | 6 Tb |
Metabarcoding | +++ | +++ | + | + | + | ++ |
Target capture | + | + | + | +++ | + | +++ |
Shotgun sequencing | + | ++ | +++ | +++ | ++ | +++ |
Genome skimming | + | ++ | +++ | +++ | ++ | +++ |
Organellar sequencing (plastids) | + | ++ | +++ | +++ | ++ | ++ |
Transcriptomics: gene targeted | +++ | +++ | + | + | + | ++ |
Transcriptomics: total RNA/mRNA seq | + | + | ++ | ++ | ++ | +++ |
*The HiSeq series has been discontinued but is still widely available. We thus include the general specifications and which applications they are best suited for.
Pacific Biosciences (PacBio) sequencing is based on single molecule real time (SMRT) technologies for reading DNA and RNA sequences. No PCR amplification is required, which can be advantageous for certain applications: for example, when PCR inhibitors are or may be present, when the sequence is GC-rich, or when PCR bias should be avoided. Additionally, PacBio reads are considerably longer than those from either Sanger or Illumina sequencing (up to 25 kb) (Pacific Biosciences, n.d.). This reduces computational challenges related to assembling contigs into full sequences. PacBio is considered a third generation sequencing technology, as it reads the nucleotide sequence both in real time and at the single molecule level (
Similarly to Illumina and Sanger sequencing, PacBio also uses fluorescently-labelled dNTPs for determining a target DNA sequence. PacBio, however, employs a technology called zero mode waveguides (ZMWs) to read nucleotide sequences at the single molecule level. ZMWs are nanosized wells with attoliter (10⁻²¹ L) volumes that can be etched into different materials. ZMW technology differentiates a fluorescent molecule that is floating in solution from a fluorescently-labelled nucleotide located at the bottom of the well. A single DNA polymerase is tethered to the bottom of each well, and when a fluorescently-labelled dNTP is incorporated into the growing DNA strand, the fluorescent label is cleaved off. There is a unique fluorescent marker for each of the four nucleotides, and each cleavage event is read and directly linked to a specific nucleotide (van Dijk et al. 2018). Additionally, the rate of addition can be used to infer whether the target DNA is chemically modified (e.g., methylated), since a modified DNA strand moves more slowly through the DNA polymerase, resulting in a reduced incorporation rate for a fluorescent nucleotide. This information is extremely powerful for predicting epigenetic modifications that are critical for a variety of biological functions. In addition, chemical modifications that are often present in aDNA can also be detected, making PacBio a particularly useful technique for assessing aDNA damage (
While previously PacBio suffered from a high error rate in comparison to Illumina sequencing, this has been dramatically reduced by the introduction of circular consensus sequencing (CCS), also known as long high-fidelity (HiFi) reads (
Oxford Nanopore (or simply Nanopore) sequencing is also a third generation SMRT technology that is single-molecule based and measured in real time. Nanopore differs from the other sequencing technologies discussed here in that no DNA polymerase is required and no expensive chemically modified dNTPs are necessary for reading the target sequence. The system consists of an electrolytic solution and a nanosized, biologically-derived pore in an insulating solid (a material that does not conduct electricity). The biological nanopores used in this technology are derived from proteins that form pores in biological membranes, where they naturally allow the passage of ions and biomolecules across the membrane. When an electric field is applied, ions in the electrolytic solution pass through the pore, resulting in a stable current that can be detected. When larger molecules pass through the pore, such as a DNA strand, detectable disruptions in the current occur. With a DNA strand, stretches of 6–7 nucleotides move through the pore, and the movement of these bases yields a changing, detectable disruption. This disruption has a unique signature, with a specific current change for a specific length of time, that can be linked to each of the four individual nucleotides. From the current disruption pattern it is possible to deduce the sequence. Moreover, since it is the change in current through the pore that is detected, no other chemical markers are necessary (
Nanopore technologies, with a read length up to 4 Mb, are rapidly becoming important due to their scalability and portability. The MinION sequencing platform (theoretical output up to 50 Gb/flow cell) is a portable and cost-effective option (87 g, available from $1000) that can be used in the field. Already, a number of excellent examples of biodiversity studies (and plant-based studies in particular) are available in the literature (
Library preparations are essential for all experiments involving HTS. General points to consider are discussed here and we also refer to Chapter 12 Metagenomics and Chapter 15 Transcriptomics for more details.
Unlike in other forms of sequencing, Ion Torrent technologies are not based upon optical outputs, but rather on changes in pH. When a DNA polymerase adds a nucleotide to a growing DNA strand, a proton is released upon each addition. It is this release of protons into solution, and the resulting change in the pH of the solution, that is detected in Ion Torrent technologies (
Similarly to Illumina sequencing, the target DNA is initially fragmented (200–600 bp) and PCR-amplified with adaptors that can be tethered to micro-machined wells on a semiconductor chip. The chip is then flooded with one of the four nucleotides. If a nucleotide is added across from the complementary base in the single-stranded DNA by the DNA polymerase, it results in the release of a proton and a subsequent change in solution pH. This shift in solution pH is detected by an ion-sensitive field-effect transistor (ISFET), which can detect changes in proton concentration. This is done in a massively parallel fashion, with thousands of microwells being read simultaneously. The pH change that results from the addition of multiple nucleotides in a repetitive sequence is also detectable using this technology, as the addition of two identical nucleotides in a row will result in double the voltage change of the addition of a single nucleotide. The data output of Ion Torrent technologies is approximately 10 Mb in a single run with conventional machines, and up to 10 Gb with the newest models. The platform, however, struggles with base calling of homopolymers, and for these sequences it can be a challenge to obtain accurate reads.
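To make the homopolymer issue concrete, the toy Python sketch below decodes an idealised Ion Torrent "flowgram", in which the signal of each nucleotide flow is proportional to the homopolymer length; the flow order and signal values are invented for illustration only.

```python
# Minimal sketch: decoding an idealised Ion Torrent "flowgram".
# Each flow floods the chip with one nucleotide (here in the repeating
# order T-A-C-G); the measured signal is proportional to how many
# identical bases were incorporated in a row (0, 1, 2, ...).

FLOW_ORDER = "TACG"

def decode_flowgram(signals):
    """Convert per-flow incorporation signals into a base sequence."""
    sequence = []
    for i, signal in enumerate(signals):
        count = round(signal)          # ideal signal ~ homopolymer length
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        sequence.append(base * count)  # e.g. a signal of ~2.0 means 'CC'
    return "".join(sequence)

# Noise around signals of 2.0, 3.0, ... is why homopolymer base calling
# is error-prone on this platform.
print(decode_flowgram([1.0, 0.1, 2.1, 1.0]))  # -> "TCCG"
```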
The Ion Torrent machine and sequencing chips are relatively inexpensive compared to Illumina and PacBio, which made the platform popular in smaller labs without access to a high-throughput sequencing core facility, though its use is no longer as common.
The sequencing platform that is ultimately chosen by a scientist depends on a number of factors. These can include (but are not limited to) the scientific question being considered, the quality of the target DNA (see Chapter 1 DNA from plant tissue), costs, as well as in-house expertise and/or the availability of existing platforms. In all cases, however, the quality and sequencing depth of the target DNA should be considered. For DNA that is primarily expected to exist as shorter sequences (e.g., samples expected to be degraded, such as those from herbarium or ancient sources), long-read technologies are often not necessary, and Illumina sequencing or Ion Torrent technologies may be sufficient. If, however, one wishes to avoid any PCR bias or to acquire long reads, then using PacBio or Nanopore is advisable. Finally, it may even be useful to use two different types of sequencing to overcome each technology’s respective limitations. For example, in whole genome sequencing, hybrid methods combining Illumina with PacBio are commonly used to ensure long reads and high accuracy.
Sequencing platform choices for different experimental questions and sample types.
Experiment or sample considerations | Recommended method(s) | Comments |
---|---|---|
Whole genome or organellar sequencing project (genome skimming, genome resequencing, de novo genome assembly) | Illumina, PacBio, or a combination of both | Illumina is the method of choice for resequencing for high throughput short read projects due to its high read accuracy |
Barcoding | Sanger sequencing or PacBio CCS | Larger projects are moving to PacBio CCS to reduce costs. Multiplexing very large numbers of samples is necessary to optimise costs |
Metabarcoding/Target capture | Illumina, MGI DNBSEQ, or Ion Torrent | PacBio and/or Nanopore may also be considered if the sequence is expected to be highly repetitive |
Heavily degraded samples (e.g., herbarium or ancient DNA samples) | Illumina (or Ion Torrent) | PacBio may also be relevant for the study of chemical modifications often found in ancient DNA samples, or if dealing with hard-to-phase sequences |
On-site sequencing | Nanopore (MinION) | |
Hi-C/3C-Seq/Capture-C | Illumina | |
In the last decades, developments in sequencing platforms have primarily focused on increasing the throughput and accuracy of sequencing output, increasing the length of reads, and reducing costs. We can expect the field to continue developing in this direction, with a particular focus on the miniaturisation of platforms for more on-site work, as well as better automation and integration of analytical software and data analysis pipelines. In particular, miniaturisation and automation of data analysis can be expected to have major impacts in regulatory fields related to both food safety and trade, where the ability of non-specialists to rapidly test on-site for the presence/absence of species will be extremely useful (see Chapter 22 Healthcare and Chapter 23 Food safety). Further development of HTS technologies for use at the single-cell level and in functional studies can also be expected.
What method(s) are most commonly used for whole genome projects of plants, and why? Which sequencing method(s) are currently most commonly used for library preparations, and why?
In a scenario where you may want to include amplicons and primers of different length when creating a library for sequencing, how would you adapt your library setup? How would it affect your sequencing costs?
Why do Nanopore and PacBio-based technologies provide longer reads than Illumina or Ion torrent-based technologies? Why are these longer reads especially useful for projects in plant identification?
For whole genome projects, PacBio and Nanopore are the most commonly used technologies. This is because both produce long reads, which reduces the bioinformatic challenges of assembling thousands of short contigs into a whole genome. For library preparations, Illumina platforms are still the most commonly used due to their relatively competitive costs, high accuracy, and the available support from a range of analysis tools and pipelines.
It is important to create equimolar pools so that the total number of DNA molecules is normalised across a library and no single amplicon dominates the others. However, even after normalisation of concentrations, amplicons of very different lengths may not be amplified with the same efficiency. Additionally, using primers of roughly the same length, so that their annealing temperatures are approximately the same, is important in order to avoid PCR bias within the same library. Thus, in a scenario with amplicons and/or primers of very different lengths, it is often best to put those amplicons in separate libraries. However, when amplicons and primers are of reasonably similar size, pooling the library samples can be an effective method to reduce sequencing costs.
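For readers who want to see the equimolar arithmetic spelled out, here is a minimal Python sketch with invented concentrations and amplicon lengths; it converts dsDNA concentrations to molarity using the conventional approximation of 660 g/mol per base pair and computes the volume of each amplicon to pool.

```python
# Minimal sketch (illustrative values): volumes for an equimolar
# amplicon pool. dsDNA molarity in nM (equivalently fmol/uL) is
# approximately concentration / (660 g/mol per bp * length in bp).

def molarity_nM(conc_ng_per_ul: float, length_bp: int) -> float:
    """Approximate molarity (nM) of a dsDNA amplicon."""
    return conc_ng_per_ul * 1e6 / (660.0 * length_bp)

def pool_volumes(samples, target_fmol: float = 50.0):
    """Volume (uL) of each amplicon that contributes target_fmol."""
    return {name: target_fmol / molarity_nM(conc, length)
            for name, (conc, length) in samples.items()}

# Hypothetical amplicons: name -> (concentration in ng/uL, length in bp)
samples = {"ITS2": (12.0, 350), "trnL_P6": (8.0, 60)}
for name, vol in pool_volumes(samples).items():
    print(f"{name}: {vol:.2f} uL")
```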
Nanopore- and PacBio-based technologies provide longer reads than Illumina- or Ion Torrent-based technologies because both have platforms available that do not require the sample to be fragmented. These long reads can be especially useful for projects in plant identification when working with highly repetitive sequences or very large genomes. Additionally, PacBio technologies can also be used when longer amplicons are required, for instance for the phasing of haplotypes or when tracing polyploid ancestry.
Hugo de Boer 1
1 Natural History Museum, University of Oslo, Oslo, Norway
Hugo de Boer
The method of identifying living organisms to species level using DNA sequences has been coined DNA barcoding (
DNA-based typing for species identification focused first on microbial organisms (
In recent years, the barcoding movement has grown substantially, and worldwide efforts coordinated by the Consortium for the Barcode of Life (CBOL) are now being focused on barcoding all organisms (
Species delimitation is a central tenet of taxonomy (see Chapter 17 Species delimitation). Traditionally, species were identified, described and classified based mainly on their morphological characters. This is more difficult when it comes to cryptic, hybridising or highly convergent species (
Most species concepts agree on species being evolving metapopulations (de Queiroz 2007), and this implies that genetic variation exists both within and between species. Advanced approaches using many accessions as well as many loci, such as species delimitation based on multispecies coalescent theory, can enhance species identification resolution. However, more data also bring new challenges, and structure arising from population-level processes is hard to distinguish from structure reflecting species boundaries (
To identify an unknown DNA barcode using a reference library, one can use several approaches to look at the interrelatedness of the samples (see Chapter 18 Sequence to species). Many databases including GenBank and BOLD (
The mitochondrial genome in plants evolves far too slowly to allow discrimination between species (
The plastid marker
The nuclear ribosomal marker ITS, and specifically nrITS2, is used commonly in barcoding and metabarcoding studies (China Plant BOL Group et al. 2011;
The strict requirements for both universality and high variability in potential universal barcodes have led some to label DNA barcoding a “search for the Holy Grail” (
The core plant DNA barcoding markers are
The most commonly used primers for plant DNA barcoding.*
Barcode | Primer | Sequence (5’-3’) | Dir. | Reference |
---|---|---|---|---|
rbcL | rbcLa_f | ATGTCACCACAAACAGAGACTAAAGC | F | Levin et al. (2003) |
rbcL | rbcLa_rev | GTAAAATCAAGTCCACCRCG | R | Kress & Erickson (2007) |
matK | matK-3F | CGTACAGTACTTTTGTGTTTACGAG | F | CBOL Plant Working Group (2009) |
matK | matK-1R | ACCCAGTCCATCTGGAAATCTTGGTTC | R | CBOL Plant Working Group (2009) |
nrITS | ITS5a | CCTTATCATTTAGAGGAAGGAG | F | Wurdack in Kress et al. (2005) |
nrITS | ITS4 | TCCTCCGCTTATTGATATGC | R | White et al. (1990) |
nrITS2 | S2F | ATGCGATACTTGGTGTGAAT | F | Chen et al. (2010) |
nrITS2 | S3R | GACGCTTCTCCAGACTACAAT | R | Chen et al. (2010) |
trnL (P6 loop) | trnL-g | GGGCAATCCTGAGCCAA | F | Taberlet et al. (2007) |
trnL (P6 loop) | trnL-h | CCATTGAGTCTCTGCACCTATC | R | Taberlet et al. (2007) |
trnH-psbA | psbA | GTTATGCATGAACGTAATGCTC | F | Sang et al. (1997) |
trnH-psbA | trnH | CGCGCATGGTGGATTCACAATCC | R | Tate & Simpson (2003) |
*These are some of the most commonly used primers, but there are many more primers and markers that have been used for specific applications. Never use these primers blindly, but always check for appropriate markers and primers for your target group.
When choosing appropriate markers for a plant DNA barcoding study it is important to consider the following questions:
Infographic: DNA barcoding of plants encompasses two streams of data from organism to DNA: one for the query sequence that should be identified, and one for the reference sequence that is part of the reference library used for identification. DNA source, marker choice, primer choice, sequencing approach, and identification strategy all influence the ability and resolution of identification.
After choosing one or several markers, it is important to consider the following:
Once a suitable combination of markers has been found and suitable primers or primer panels have been selected, it is important to test the primers on a sufficient number of your samples. Template DNA quality, DNA concentration, and the effects of inhibiting secondary metabolites can all influence the efficacy of the PCR and might require optimization to obtain the best possible results for the largest number of samples. This is beyond the scope of this book, but sufficient online resources are available to help you with optimization. In addition, there are many online discussion forums to troubleshoot PCR optimization.
The subsequent chapters in section 2 describe different sequencing platforms and approaches to obtain DNA sequences for downstream analysis, and section 3 provides an overview of applications of molecular identification of plants. Depending on whether one chooses standard DNA barcoding using Sanger sequencing, DNA metabarcoding using Ion Torrent, Illumina, or other platforms, or a variety of whole or reduced library representation genome sequencing approaches, one will need to choose different wet lab steps to create the relevant sequencing libraries. Check out the relevant chapter for your application to find out more.
An author writes that she used DNA barcoding to identify
You are planning to use DNA barcoding to distinguish herbal medicines based on
You want to investigate if DNA barcoding can outperform morphology-based biodiversity assessments in terms of species identification. For what material do you expect DNA barcoding to be more useful than morphology-based identification?
Some things that might have been overlooked: (1) Does NCBI GenBank list more than one species of
Answer for 2A) The study can ascertain that only these five species are present if it includes a sequence reference database of all other
Answers could include: vegetative material such as roots, leaves, and seedlings, DNA extracts from bulk samples, soil DNA, faecal DNA, pollen DNA, or air-captured eDNA.
Physilia Chua1,2*, Marcel Polling3,4*, Christina Lynggaard1, Maria Ariza Salazar4, Kristine Bohmann1
1 Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
2 Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
3 Naturalis Biodiversity Center, Leiden, The Netherlands
4 Natural History Museum, University of Oslo, Oslo, Norway
Physilia Chua
Marcel Polling
Christina Lynggaard
Maria Ariza Salazar
Kristine Bohmann
* These authors contributed equally.
DNA metabarcoding is an approach where taxonomically informative regions in the DNA are amplified from mixed-template samples containing DNA from different taxa for identification (Pompanon et al. 2012;
Metabarcoding is based on the DNA barcoding concept (see Chapter 10 DNA barcoding). However, for metabarcoding, samples containing DNA from a mix of different taxa are typically used. One of the first studies that used metabarcoding on a parallel sequencing system (herein referred to as metabarcoding) to identify plants was by Valentini and colleagues (
Plant metabarcoding is widely used to study the taxonomic composition of mixed template samples such as water (
Plant metabarcoding has been used in various types of applications including species delimitation (see Chapter 17 Species delimitation), archaeo- and palaeo-botany (
DNA metabarcoding is a cost-effective method as compared to metagenomics (
However, DNA metabarcoding also has its limitations, and the PCR amplification step has previously proven to be particularly problematic (
Another drawback of DNA metabarcoding is primer binding bias due to mismatches between the primer and the template DNA. This can result in discrepancies between the proportion of the original taxa in the DNA extract and the amplified DNA sequences (
Finally, the taxonomic assignment of sequences to species is heavily dependent on the DNA reference database used for sequence matching. When the reference database to which the resulting sequences are compared is incomplete and/or contains inaccurately identified species, the result is erroneously identified species and/or false negatives (
At the start of any (plant) metabarcoding study lies a clearly defined research question. A study design should furthermore encompass a clear sampling strategy, and identification of suitable DNA extraction techniques for the sample type used before carrying out downstream analysis (
Barcode choice is one of the most important aspects of metabarcoding studies as it will determine which taxa are identified and to what resolution. Considerable efforts have gone into constructing libraries for these plant barcodes and in assessing their limitations (CBOL Plant Working Group 2009;
The first step is to check whether or not reference libraries exist for the sequences of the targeted organism(s). This is because barcodes are only useful if the sequences for the targeted organism(s) are available in sequence repositories or reference libraries (
Discriminatory power refers to how effectively the barcodes can discriminate between closely related species and is linked to the variability of the locus. Typically, barcodes can only identify plants up to a certain taxonomic level (resolution) depending on the barcode used and the group of plants targeted. Moreover, because reference libraries are incomplete for all DNA barcodes, some species may only be detected using one DNA barcode while others may only be detected by another. Therefore, using a single primer set will most often not result in the recovery of all species present in a sample. We recommend adopting a multilocus approach to gain highly resolved taxonomic coverage for complex samples (see e.g.
DNA is relatively unstable in the environment and can degrade quickly depending on factors such as age, transport, and abiotic conditions (
In the metabarcoding laboratory workflow, unique nucleotide tags are added to amplicons, and these tags are used to assign sequences to the sample they originate from (
In the ‘one-step PCR’ approach, the metabarcoding barcode is amplified and built into libraries during one PCR. This is achieved through the use of metabarcoding primers that carry both adapters and library indexes (
In the ‘two-step PCR’ approach, sample extracts are PCR-amplified with metabarcoding primers that only carry 5’ tails. These are added to act as templates for the following second PCR and do not include any labelling. The second PCR is carried out on each PCR product with primers that carry adapters and indexes (
In the ‘tagged PCR’ approach, DNA extracts are PCR amplified with metabarcoding primers that carry 5’ unique nucleotide tags. Next, the individually 5’ tagged PCR products are pooled and library preparation is carried out on the pools (first demonstrated by (
With the cost of sequencing decreasing exponentially, more effort can be put into applying technical PCR replicates to circumvent sequencing errors and other PCR-related issues. When using PCR replicates, they should be placed in separate locations on the same 96-well plate or, ideally, on separate plates.

Taxa identification lies at the core of any ecological research question. Thus, it is crucial to perform a reliable and reproducible identification workflow to ensure correct identification. In general, care should be taken to avoid cross-contamination between samples by working in clean laboratories with filter-tipped pipettes and separate pre- and post-PCR labs. Normalisation of the amplicons prior to library construction is crucial to avoid overamplification of the most represented taxa in the sample. Since some often-used plant-specific marker regions are very short (e.g.,
The preferred platforms for sequencing are currently Ion Torrent and Illumina. Both platforms require an additional post-ligation PCR step or PCR-free ligation of platform-specific adapters to the amplicons before sequencing. However, due to the different technologies behind the two platforms, their error rates and error types differ. For Illumina (optical sequencing), a substitution error rate of 0.1% has been identified, while Ion Torrent (based on detection of hydrogen ions) can show up to 1% indel errors (
Several different bioinformatic tools can be used to analyse the sequence output. Some commonly used packages are OBITools (
Currently, metabarcoding is the dominant technique used in the identification of plants from mixed samples. Developments and improvements in addressing methodological challenges such as PCR bias may one day allow for unbiased quantitative inferences from metabarcoding datasets. This would be a huge step forward for the metabarcoding community since it is still controversial to use read counts as an indication for biomass (
Metabarcoding could potentially be used to determine plant composition in a landscape from bulk arthropod samples. Bulk arthropod samples have been used for biodiversity monitoring of vertebrates (
How can overamplification of the most represented taxa in a single sequencing run of multiple complex mixtures be avoided?
Which DNA barcode region is most suitable for dealing with plant DNA from samples where DNA is expected to be degraded?
The nuclear ribosomal ITS region is shared between plants and fungi. How can undesirable fungal DNA amplification be avoided?
By using equimolar pooling of individual samples.
The highly stable P6 loop of the trnL intron can best be targeted in this case, using the trnL-g and trnL-h primers.
By using plant-specific ITS primers that minimise the amplification of fungal DNA.
Physilia Chua1,2, Youri Lammers3, Ozan Çiftçi4
1 Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
2 Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
3 The Arctic University Museum of Norway, UiT - The Arctic University of Norway, Tromsø, Norway
4 Institute of Environmental Sciences, Leiden University, Leiden, The Netherlands
Physilia Chua
Youri Lammers
Ozan Çiftçi
Metagenomics is the study of genetic material recovered directly from environmental samples such as air, water, soil, or sediments (
The term ‘metagenome’ was first coined in 1998 by Handelsman et al. (
With the development of high-throughput sequencing (HTS) technologies, the need for cloning to increase the amount of starting material was eliminated. An early study recovered the first near-complete genomes of five dominant members of a natural acidophilic biofilm using an insert plasmid library and shotgun sequencing (
The immense amount of data collected by these methods introduced challenges in data analysis, resulting in several innovations in comparative metagenomics such as clustering orthologs (
Similar to metabarcoding, substrates that can be used for metagenomics in plant identification include environmental samples and fragmented template materials (e.g., dental calculus and faeces) (
Several promising applications exist for plant-related metagenomics as compared to conventional targeted genomic approaches. Dietary studies are one such application. While dietary studies have been revolutionised by conventional metabarcoding (see Chapter 11 Amplicon metabarcoding;
Similar to metabarcoding (see Chapter 11 Amplicon metabarcoding), metagenomics can potentially be used to reconstruct plant compositions from bulk arthropods samples, and to solve crimes in forensic genetics (see Chapter 26 Forensic genetics, botany, and palynology), especially by uncovering taxa that are not normally amplified in metabarcoding studies. It can also potentially be applied to plant resources for the retrieval of plant population genetic information from mixed templates (which has already been shown in mammals;
Metagenomics is an untargeted method that captures all genetic material in a sample, which is advantageous over targeted methods as no prior knowledge of the taxa and their genes is required (
Metagenomics does, however, come with some disadvantages that need to be considered. The main downside is the taxonomic inefficiency of the method. Sequenced material can originate from any part of the genome, but full nuclear genome references for most species are currently lacking. Thus, only a small proportion of species can currently be identified (
DNA fragmentation is an essential step in the metagenomic workflow, and the size of the DNA fragments required depends on the sequencing platform used. Broadly speaking, there are two methods for DNA fragmentation to obtain size-controlled DNA fragments: enzyme-based and mechanical. Each method has its associated advantages and disadvantages (
Library preparation is another important step in the metagenomics workflow, as it can affect the results of the sequencing output. The addition of adapters to the ends of DNA fragments lets them bind to the sequencing flow cell, which allows for the identification of the reads (
DNA sequencing has gradually shifted from Sanger to HTS technologies in the last decades. These new sequencing technologies can provide much higher yields of reads at a much lower cost (see Chapter 9 Sequencing platforms and data types). Initially, 454/Roche pyrosequencing (discontinued) was the most widely used platform (
Short reads are bioinformatically challenging for metagenomic assembly because genes and chromosomal regions can be difficult to span, especially if they are long or composed of repetitive elements. Certain protocols have been developed to overcome such challenges (e.g., assembly after binning and taxonomic assignment), but long-read sequencing technologies offer excellent alternatives for metagenomics. PacBio and Oxford Nanopore technologies offer longer read lengths but can be accompanied by higher error rates and higher costs. In contrast to the other platforms which introduce inherent systematic errors (e.g., homopolymer regions, index hopping), errors in these platforms are mostly random, which might be overcome with technological improvements (
The exact number of reads required to effectively characterise a sample using metagenomics is highly variable, and as such, no universal number of reads can be given. In principle, the total number of species in the sample, their genome sizes, and the relative abundance of each species should be known to make such an estimation. As a rule of thumb, it is suggested to maximise the output to capture as many reads as possible from the rare members of the community (
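A back-of-envelope calculation illustrates why rare community members drive the read requirement. The Python sketch below assumes perfectly even sampling of the DNA extract (coverage = total reads × relative abundance × read length / genome size); all numbers are invented for illustration.

```python
# Expected per-taxon sequencing depth under ideal, even sampling:
#   coverage = total_reads * relative_abundance * read_length / genome_size

def expected_coverage(total_reads: float, read_len_bp: float,
                      abundance: float, genome_size_bp: float) -> float:
    """Mean depth for one community member (idealised model)."""
    return total_reads * abundance * read_len_bp / genome_size_bp

# Hypothetical case: a rare plant taxon contributing 1% of the DNA,
# with a 1 Gb genome, in a run of 100 million 150 bp reads.
depth = expected_coverage(100e6, 150, 0.01, 1e9)
print(f"{depth:.2f}x")  # ~0.15x: rare taxa may need far deeper runs
```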
There are currently two main strategies to identify the contents of a metagenomic sample: identification of individual reads by alignment to a reference, or by assembling the reads into longer contigs prior to identification.
The most straightforward method for identification is by aligning the reads to a known reference dataset. BLAST and related tools such as MegaBLAST (
Two alternative approaches aim to speed up the identification of metagenomic datasets. These either use more compressed reference databases in combination with more efficient aligners or rely on exact alignments of k-mers between the reads and the reference (
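To illustrate the k-mer idea in miniature, the Python sketch below classifies a read by counting exact k-mer matches against a toy reference index; the sequences, taxa, and value of k are all invented, and real tools of this kind (Kraken is a well-known example) add lowest-common-ancestor logic and far more compact index structures.

```python
# Toy sketch of k-mer-based read classification: every k-mer in a read
# is looked up in a reference index, and the read is assigned to the
# taxon with the most exact matches.
from collections import Counter

K = 5

def kmers(seq, k=K):
    """All k-mers of a sequence (as a set; repeats counted once here)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references):
    """Map each k-mer to the set of taxa whose references contain it."""
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq):
            index.setdefault(km, set()).add(taxon)
    return index

def classify(read, index):
    """Assign the read to the taxon sharing the most exact k-mers."""
    votes = Counter()
    for km in kmers(read):
        for taxon in index.get(km, ()):
            votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

refs = {"taxonA": "ATCGGCTAAGCTT", "taxonB": "TTGCAACGGATCC"}
index = build_index(refs)
print(classify("CGGCTAAG", index))  # -> taxonA
```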
Assembly methods attempt to generate longer contigs before downstream analysis. These longer contigs can be used for gene identifications (
Each bioinformatic strategy has its pros and cons, and the decision about which strategy to use depends on the starting material available as well as the research questions to be answered. The alignment method works well when there is ample reference material available for the taxa of interest, when working with older and more fragmented material, or when the target taxa are sparse in a sample. The assembly method, on the other hand, performs best when there is abundant material available, which is often not the case for environmental datasets.
As sequencing costs continue to significantly decrease, bioinformatics pipelines are optimised, and more comprehensive DNA reference libraries are available (
What are the two main steps in the metagenomics laboratory workflow and why are they necessary?
What are the challenges of using short-read sequencing for metagenomics applications and how do you overcome these challenges?
What are problems caused by using environmental samples with unequal abundances in metagenomics applications?
DNA fragmentation and library building. Current sequencing technologies are unable to sequence full genomes of most organisms in a single run, so fragmentation is required for downstream procedures. Library preparation prepares the DNA fragments for sequencing, and the addition of adapters allows DNA fragments to be identified.
Short reads are bioinformatically more challenging for the assembly process due to the formation of chimeras. Utilising bioinformatic procedures such as assembly after binning and taxonomic assignment, or using long-read sequencing, can overcome these challenges.
Samples with unequal abundance can complicate assembly as reads from different taxa do not overlap, reducing the probability of accurate taxonomic or functional assignment.
Bastien Anthoons1,2, Andreas D. Drouzas1, Panagiotis Madesis3
1 School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
2 Institute for Applied Biosciences, Centre for Research and Technology, Thessaloniki, Greece
3 Lab of Molecular Biology, Department of Agriculture Crop Production and Rural Environment, University of Thessaly, Volos, Greece
Bastien Anthoons
Andreas D. Drouzas
Panagiotis Madesis
Accurate species identification is fundamental for the correct assessment of species diversity and for studying the functioning of communities and ecosystems. Additional applications include food product authentication and disease diagnosis. Species identification can be carried out using morphological, (bio)chemical, or molecular traits. Among the molecular approaches, both PCR and post-PCR analyses have been used extensively for species identification and genotyping. The most widely used PCR-based method is DNA barcoding, which is able to provide species-level identifications using the sequences of short standard DNA regions (
HRM data analysis is straightforward and does not require advanced bioinformatics skills, in contrast to other genetic analyses used for species identification. Furthermore, HRM is a cost-effective and high-throughput methodology. Due to its ability to discriminate between samples at the resolution of a single nucleotide (allowing for single nucleotide polymorphism (SNP) identification), HRM is commonly used for genotyping, mutation scanning, and DNA methylation analyses. The HRM analysis of DNA barcoding regions, e.g., ITS2,
PCR amplification of the genetic region of interest is a prerequisite for HRM analysis and is done in the presence of a fluorescent dye that binds dsDNA. Such dyes intercalate into the dsDNA that is produced during a PCR reaction, without affecting PCR efficiency. Asymmetric cyanine dyes fluoresce strongly in the presence of dsDNA and are characterised by low intensity fluorescence in the unbound state (
The rate of dsDNA PCR product dissociation, and thus the shape of the HRM curve, depends on (1) the sequence itself and its length, (2) the GC content, (3) the complementarity, and (4) the nearest-neighbour thermodynamics of the amplicon (
An example of standard output curves of an HRM analysis. In both A and B, the three phases of the DNA melting profile are shown. (1) The pre-melting phase is characterised by an initial fluorescence given in relative fluorescence units (RFUs); here all PCR products are double-stranded and the maximum amount of dye is bound. (2) In the active melting phase, the inflection point (i.e., the Tm) is where 50% of the PCR product in the samples is denatured. (3) In the post-melting phase, as the temperature increases further, the PCR products denature completely, dye is released, and the fluorescent signal drops and plateaus.
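As a computational aside, the melting temperature can be estimated from raw melt data as the peak of the negative first derivative, −dF/dT. The short Python sketch below does this on a simulated sigmoidal melt curve; all values are invented and real HRM software applies additional normalisation steps.

```python
# Minimal sketch: locating Tm from HRM-style fluorescence data as the
# peak of -dF/dT (equivalently, the minimum of dF/dT).
import numpy as np

temps = np.linspace(70, 95, 251)                       # degrees C
true_tm = 84.0
fluor = 1.0 / (1.0 + np.exp((temps - true_tm) / 0.8))  # simulated melt

dF_dT = np.gradient(fluor, temps)       # first derivative of the curve
tm_estimate = temps[np.argmin(dF_dT)]   # steepest drop = inflection point
print(f"Estimated Tm: {tm_estimate:.1f} C")  # ~84.0
```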
Identification of species by Bar-HRM relies on small genetic differences between DNA sequences, which will result in different melting curves. However, small differences between melting curve profiles may also arise from sources other than the DNA sequences, thus assay optimisation is a prerequisite for a successful Bar-HRM analysis. An example of a Bar-HRM workflow can be found in the infographic. Factors that could influence the outcome of an HRM analysis include genomic DNA quality, DNA extraction impurities, amplicon length, primer design, dye selection, PCR reagent choice (see Chapter 1 DNA from plant tissue), and the choice of Bar-HRM instruments and software (
The major factor associated with DNA quality is salt carryover, as this will change the thermodynamics of the DNA melting process. This could lead to lower reproducibility and higher error rates in the Bar-HRM results. A solution is to precipitate and resuspend the DNA extract in a buffer with a low salt concentration such as TE (10 mM Tris, 1 mM EDTA) prior to the PCR (
Amplicons up to 300 bp are generally preferable for Bar-HRM analysis since they are more suitable for the detection of DNA mutations such as SNPs, inversions, insertions, and deletions. The larger the fragment, the more likely it is to contain additional mutation sites that may complicate the discrimination between different sequence variants. On the other hand, amplicons that are too small (< 50 bp) may produce lower fluorescence signals, due to lower amounts of dye being incorporated into the PCR product (
The intercalating dyes used in the Bar-HRM analysis bind generically to any double-stranded DNA product. It is therefore important to design robust PCR primers that are specific to the region of interest, to ensure that this is the only region amplified in the PCR product. Each developed primer pair should be tested for specificity to the region of interest and should not produce any primer-dimers or non-specific products. PCR products from the developed primer pair should be assessed by gel electrophoresis, as the HRM software may not be able to detect all non-specific reaction products if their melting curves are similar (
HRM uses dsDNA-binding fluorescent dyes that do not interfere with the PCR reaction. The so-called “release-on-demand” dyes are preferred for HRM as they do not inhibit DNA polymerases or alter the Tm of the product (
Reagents for HRM analysis and reaction conditions should be optimised to reduce amplification biases as much as possible. Primer dimers and other non-specific products can significantly decrease the performance of the HRM analysis. So, in addition to optimising reactions, one must ensure that variation is not introduced by poor assay design or optimization decisions (see Chapter 1 DNA from plant tissue;
HRM analysis requires a PCR thermal cycler and an instrument with optics capable of detecting fluorescence. This can either be a rotary design in which samples spin past an optical detector or a block-based instrument in which samples are read by a scanning head or stationary camera. This instrument should be coupled with a computer with appropriate HRM analysis software capable of handling the large amounts of data generated during the analysis. A good HRM software package should provide a view of the raw fluorescence data points and a process to both align the data and view melting curve differences between samples (
The chemical improvement of “release-on-demand” DNA dyes and the increased instrumentation precision has widely expanded the use of Bar-HRM for genotyping (
Since the first description of the HRM methodology in 2003, it has been increasingly used as a research tool (
Medicinal plants and plant-based food products are often processed and lack the essential parts necessary for morphological identification when sold on the herbal market. In addition, the herbal market is highly competitive and lacks standardised methods for quality assessment. This has contributed to increasing problems with product adulteration and substitution. Numerous studies reported the substitution of costly ingredients in herbal products with plant material of inferior quality or unlabelled plant fillers (
Bar-HRM has been used for the identification and quantification of the ingredients in plant and animal food products, including Protected Designation of Origin (PDO) products. Olive oil for instance, which is one of the most adulterated vegetable oils on the market, has been successfully authenticated with Bar-HRM (
Apart from species identification, Bar-HRM can also be used for species quantification, which is also important for quality control, especially for quantifying adulterants in food or other processed products. Serial dilutions of a DNA sample mixed with adulterant DNA are made, corresponding to different known adulterant content percentages. These artificial serial admixtures are then used to create reference curves that can be used to quantify samples of unknown content (Lagiotis et al. 2020;
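The reference-curve idea can be sketched in a few lines of Python: known adulterant percentages are fitted against an instrument-derived curve metric, and unknowns are then interpolated. The metric values below are invented placeholders, not real HRM measurements.

```python
# Hedged sketch of Bar-HRM quantification via a standard curve:
# fit known adulterant percentages against a curve-derived metric,
# then interpolate the percentage for an unknown sample.
import numpy as np

known_pct = np.array([0, 5, 10, 20, 40])           # % adulterant
metric = np.array([0.02, 0.11, 0.22, 0.41, 0.83])  # hypothetical metric

slope, intercept = np.polyfit(known_pct, metric, 1)  # linear fit

def estimate_pct(m: float) -> float:
    """Interpolate the adulterant percentage for an unknown sample."""
    return (m - intercept) / slope

print(f"Unknown with metric 0.30 ~ {estimate_pct(0.30):.1f}% adulterant")
```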
Bar-HRM technology can provide taxonomic identification of plant taxa, tracking of a wide range of raw and processed herbal products, and detection of adulterants and poisonous contaminants in food products. As the precision of the “release-on-demand” dyes and HRM instruments further increases, and as melting curve reference databases are developed, we can expect Bar-HRM to be implemented as a routine analytical tool for species identification and authentication. Finally, the successful application of Bar-HRM as a tool for quality control in the food industry renders it suitable for use in a regulatory framework by the corresponding authorities.
A. Which amplicon length is generally recommended for Bar-HRM analysis? Explain your answer. B. What is the risk for Bar-HRM reactions if the amplicon length is too short?
A. In case the HRM analysis detects non-specific products, what could the underlying reasons be? B. How would this issue be verified and resolved?
A. Describe the different melting phases of the dsDNA sequence depicted in Figure
A. The suitable amplicon length for Bar-HRM analysis varies from 50 to 300 bp. The shorter the amplicon length, the more accurate the result.
B. Amplicons that are too short (< 50 bp) produce too little fluorescence signal, due to limited dye incorporation in a shorter sequence.
A. The underlying reasons for detection of non-specific products by HRM could be (i) low quality DNA, (ii) an increased salt concentration (MgCl2) in the PCR reaction, (iii) insufficient primer specificity, or (iv) possible contaminations.
B. This could be verified by (i) checking the DNA quality (in some cases further purifying the sample with a DNA kit or performing the DNA extraction again is recommended), (ii) adjusting the MgCl2 amount by performing a titration or by using a master mix with a standard (known) MgCl2 concentration (diluting the DNA sample in TE buffer can also be attempted), (iii) ensuring primer specificity with a BLAST search, by running the PCR product on an electrophoresis gel prior to Bar-HRM to check for a single band pattern, or by sequencing the amplicon, and (iv) replacing the working materials.
A. The pre-melting phase is the stage of initial fluorescence when all products are double-stranded and the maximum amount of dye is bound. The active melting phase includes the inflection point where 50% of the PCR products in the samples are denatured and the post melting phase is characterised by a drop in fluorescence signal (when the PCR products denature) as the temperature increases. In Figure
B. The melting temperature (Tm) is the temperature at which 50% of the dsDNA has been denatured. It is unique for each sample and therefore of use for species discrimination.
Yannick Woudstra1,2,3,4, Anne-Sophie Quatela3,5, Catherine Kidner6,7, Juan Viruel1, Alexandre Zuntini1, Michael D. Martin8, Thibauld Michel6,7, Olwen M. Grace1
1 Royal Botanic Gardens, Kew, United Kingdom
2 Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
3 Gothenburg Global Biodiversity Center, Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
4 Department of Plant Sciences, University of Oxford, Oxford, United Kingdom
5 Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
6 Institute of Molecular Plant Sciences, University of Edinburgh, Edinburgh, United Kingdom
7 Royal Botanic Garden Edinburgh, Scotland, United Kingdom
8 Department of Natural History, Norwegian University of Science and Technology, Trondheim, Norway
Yannick Woudstra
Anne-Sophie Quatela
Catherine Kidner
Juan Viruel
Alexandre Zuntini
Michael D. Martin
Thibauld Michel
Olwen M. Grace
Efforts to resolve the plant tree of life have led to the replacement of traditional DNA sequencing markers (
Unique challenges exist when trying to obtain sequences from plant nuclear genomes. Plant genomes are often large (
Thankfully, many nuclear genes have been discovered through an abundance of annotated transcriptomes (
Analysing SLCN genes requires multiple reads to cover the same genomic region (high coverage) to obtain high-quality assemblies. The goal of target capture, also called bait capture or hybrid capture, is to achieve high coverage on (nuclear) target loci by proportionally increasing (enriching) the target DNA fragments in a genomic library. The workflow is straightforward (see Infographic): DNA is extracted using tissue-specific protocols (see Section 1 Design, sampling, and substrates in this book), sheared to the desirable fragment length (e.g., 300–700 bp for Illumina sequencing, depending on the quality of the extract), processed into genomic libraries using indexing techniques for multiplex sequencing, enriched for the target genomic regions using specific baits (see below), and sequenced on a platform with high sequencing accuracy (e.g., Illumina or PacBio Sequel).
Target capture uses custom-designed short RNA- or DNA-baits (usually between 80 and 120 bp long), also called probes, that hybridise in solution to target loci with complementary sequences (
Besides target capture, there are other techniques for reducing genomic complexity in DNA samples (reduced representation sequencing) depending on the target loci, taxon, and/or application (
Target capture is a robust alternative for these applications as it works independently of specific PCR primers. The small size of the RNA-baits makes this method ideal for degraded DNA and baits do not need to be an exact match for the target to be captured. The hybridisation conditions can be modified for more or less permissive binding between bait and target, and locus capture can still be successful with up to 30% bait mismatch (
Nuclear sequence data from a single clade member, either a whole-genome or transcriptome, are enough to design efficient target capture baits for application across the clade. As the number of complete plant genomes (e.g., 128 species, Phytozome v.13;
Annotated whole-genome sequences are preferred for RNA-bait panel design, since: 1) target loci can be more carefully selected with the detailed gene copy number information; 2) annotated whole-genomes provide intronic and intergenic regions (sequences immediately 5’ or 3’ of the gene), which can be included in the panel design.
The advantage of using introns and intergenic regions is the inclusion of highly variable sequences that are useful for phylogenetic inference of recent diversification events (
Target capture is a cost-efficient, high-throughput, and customizable solution for plant phylogenomics and systematics (
Examples of available universal and clade-specific target capture panels. Number of exons and introns as well as total number of bases targeted are as reported in the original publications.
Taxonomic level | Number of loci | Exons/introns | Total target size (bp) | Reference |
---|---|---|---|---|
Flagellate plants and gymnosperms | 248 | 451 exons | 150,369 | |
Ferns | 25 | Exons only | 39,134 | |
Angiosperms | 353 | Exons only | 260,802 | |
| 301 | Not reported | Not reported | |
| Not reported | 469 exons | 364,653 | |
| 853 | Exons only | 1,545,593 | |
| 4,184 | Exons only | 4,287,662 | |
| 2,473 | 3,709 exons | Not reported | |
| 1,061 | Exons only | Not reported | |
| 1,776 | Exons only | ± 2,300,000 | |
| 120 (+ A353) | 469 exons | 136,495 | |
| 507 (423 SLCN) | Exons only | 737,309 (SLCN only) | |
| 289 | 1,021 exons | 359,269 | |
| 964 | Exons only | 1,134,513 | |
| 830 | Exons only | 776,754 | |
| 384 (266 from A353) | Exons only | Not reported | |
| 275 | Exons only | 660,000 | |
| 963 | 6,005 exons | Not reported | |
| 972 (593 SLCN) | Exons only | Not reported | |
| 1,241 | Exons only | Not reported | |
| 189 | 1,029 exons | 353,794 | |
| 872 | Not reported | Not reported | |
| 1,239 | Exons + introns | Not reported | |
| 745 | Exons only | Not reported | |
| 570 | Exons only | 180,784 | |
| 260 | Exons only | 441,626 | |
| 278 | Including introns | 378,553 | |
| 87 | Exons only | Not reported | |
| 283 | 676 exons | 171,365 | |
| 276 | 907 exons | Not reported | |
| Full exome | Exons only | 85 Mbp | |
| 50 | Exons (131) + introns | 104,374 | Cangren et al., unpublished |
| 926 | 8,963 exons | Not reported | |
| 431 | Exons only | 709 kbp | |
| 9,748 | Including introns | 150 bp per (sub)gene | |
Applications for target capture and in combination with other methods.
Technique | Principle | Application | Reference |
---|---|---|---|
Target capture | Target enrichment using in-solution hybridisation with specifically designed baits: short oligonucleotides complementary to target loci. | Phylogenomics, population genomics | |
RAD-Seq + target capture | Using custom baits to capture selected restriction-site associated DNA (RAD) tags. | Population genomics with museum specimens | |
Target capture + genome skimming (Hyb-Seq) | Adding an unenriched library to the enriched sample before sequencing to obtain low-coverage sequencing results from non-target nuclear regions and organellar genomes. | Phylogenomics, population genomics, comparing chloroplast and nuclear phylogenies | |
Target capture + allele phasing | Estimation of ploidy level based on allelic frequency and allelic ratio from the number of reads for each allele. | Estimation of ploidy level from museum specimens | |
Target capture + molecular identification | Using target capture to obtain high-coverage sequence data for SLCN genes to identify unknown samples of traded plants. | Trade monitoring, authentication of medicinal plants | |
Target capture + repetitive sequence analysis | Using off-target reads to investigate levels of DNA repetition across a taxonomic clade. | Structural evolution of genomes, repetitive DNA analysis | |
The effective enrichment of degraded DNA samples, wide taxonomic range, and increased availability of custom bait panels make target capture ideally suited for molecular identification of plants (
The success of molecular identification depends on a curated database of taxonomically verified reference sequences with a corresponding comprehensive phylogeny (
Once sequenced, unknown material can be identified using either genomic distance (
Analysing target capture data from mixed samples, where material from different species is combined into one sample, is complex and requires long-read sequencing and rigorous phasing. Short sequence reads require assembly into longer fragments (contigs), increasing the risk of erroneous assemblies in mixed samples where reads belonging to different species might end up in the same contig. Long sequence reads can be sorted based on variable sites (phasing) and assigned to species directly, circumventing the assembly problem for mixed samples. If traditional markers give sufficient resolution, metabarcoding experiments (see Chapter 11 Amplicon metabarcoding) can be designed for a more cost-effective approach.
The research question will determine whether a customised bait panel is needed for a study, and the choice is a trade-off between cost and detail (Figure
Flow-chart describing the workflow of a target capture experiment.
Universal bait panels are commonly designed for resolving deeper phylogenetic nodes (e.g., angiosperms;
Customised bait panels offer greater recovery and detail but are sensitive to the taxonomic distance from the reference used in the design. For example, target recovery with the
Designing a custom bait panel requires transcriptome or whole-genome sequences from at least one, but preferably more, taxon in the clade of interest or from a closely related clade (
SLCN genes can be retrieved from a set of transcriptomes using software such as Markerminer (
It is important to determine, as far as possible, the copy number of the genes identified at this stage, to avoid including paralogs in the target design. Annotated whole-genome sequences have an advantage here. When using transcriptomes, Markerminer can indicate the copy status of identified loci based on the curated dataset from De Smet et al. (2013). Additionally, a reciprocal BLAST of putative SLCN loci against the transcriptome can be used to identify near-identical matches, providing an indication of the presence of paralogs.
It is usually unnecessary to include all detected loci, as they may vary in their phylogenomic value and there is a limit to what a bait panel can efficiently cover. The smallest RNA-bait panels from MYBaits (Arbor Biosciences), for example, include up to 20,000 baits between 80 and 120 bp in length. Larger bait panels are considerably more expensive. It is advisable to use 2–3x tiling in the bait design, so that the whole set covers each base of the target loci with 2–3 baits or more (Figure
Schematic representation of tiling. The reference sequence at the top represents a hypothetical 800 bp exon with dotted blue lines indicating the intron-exon boundaries. Hypothetical baits are 80 bp long. This example is based on a 3x tiling strategy where each nucleotide is (on average) covered by three unique baits. The bait coverage decreases towards the ends of the exon as the target design of this hypothetical example did not include introns or intergenic regions.
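The tiling arithmetic in the schematic can be expressed in a few lines. The Python sketch below places 80 bp baits at an offset of roughly one third of the bait length, giving approximately 3x coverage of each base; the exon length matches the hypothetical 800 bp example from the figure.

```python
# Toy sketch of 3x bait tiling: 80 bp baits offset by ~bait_len / 3 so
# that each base of a target is covered by roughly three baits.
BAIT_LEN = 80
TILING = 3

def tile_baits(target_len: int, bait_len: int = BAIT_LEN,
               tiling: int = TILING):
    """Return (start, end) coordinates of baits across one target."""
    step = max(1, bait_len // tiling)  # ~27 bp offset for 3x tiling
    starts = range(0, max(1, target_len - bait_len + 1), step)
    return [(s, s + bait_len) for s in starts]

baits = tile_baits(800)  # the hypothetical 800 bp exon from the figure
print(len(baits), "baits; first three:", baits[:3])
```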
Firstly, prioritising target loci recovered across all taxa in the reference dataset is recommended to ensure consistent recovery and to include as much variation as possible in the bait panel design. If a locus is represented by only one taxon in the design, the resulting capture will be skewed towards samples more closely related to that taxon. This may result in uneven recovery among samples in a pooled capture experiment (see Multiplexing and pooling below).
Secondly, for phylogenomic and molecular identification purposes, loci with low variability should be excluded. The target loci should be variable enough to resolve challenging phylogenetic clades. When designing a bait panel for broader taxonomic applications (e.g., above genus-level), limiting the inclusion of highly variable genes can be considered to keep recovery levels consistent across taxonomic scales. The variability of loci can be assessed based on pairwise sequence identity (ideally < 95% between the reference sequences), phylogenetic resolution on gene trees, and the number of parsimony-informative sites (e.g., ≥ 20 per 1000 bp). Additionally, metrics such as the amount and proportion of missing data can be obtained in a useful summary diagram with the AMAS tool (
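To pin down one of these metrics, the toy Python function below counts parsimony-informative sites in a small invented alignment; a site is informative when at least two character states each occur in at least two sequences.

```python
# Toy example: counting parsimony-informative sites in an alignment.
# A column is informative when at least two character states each
# occur in at least two sequences; the alignment here is invented.
from collections import Counter

def parsimony_informative_sites(alignment):
    count = 0
    for column in zip(*alignment):
        freqs = Counter(base for base in column if base not in "-N")
        if sum(1 for n in freqs.values() if n >= 2) >= 2:
            count += 1
    return count

aln = ["ATCGT", "ATGGT", "TTCGA", "TTGGA"]
print(parsimony_informative_sites(aln))  # -> 3 (columns 1, 3, and 5)
```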
Thirdly, it is important to make sure that target loci are long enough to be covered efficiently by the RNA-baits (> 400–500 bp is recommended). This is especially relevant when targeting exons. Exons shorter than the bait length will not be captured efficiently, since baits spanning intron-exon boundaries bind only partly to the exon and therefore have reduced recovery.
Finally, target genes should be checked for repetitive regions such as microsatellites or transposons. These can be detected by the presence of short, repetitive sequences of low complexity (e.g., (ATC)n) and should not be included in the bait panel design. Baits spanning these repetitive elements will likely hybridise in many places in the genome, reducing the hybridisation efficiency and recovery for the target gene. Similarly, including highly conserved and high-copy regions common to plants, such as the MYB-domain (
Selected SLCN genes should first be aligned to a reference sequence from an annotated genome to indicate putative intron-exon boundaries. These alignments will form the basis of the RNA-bait design. An example pipeline can be found in the publication for the Angiosperms353 panel (
Firstly, baits need to be checked for potential overlap with high-copy sequences (plastomes, mitochondrial genomes, and nuclear ribosomal DNA) by mapping them to published sequences, which are typically available in NCBI databases. Secondly, GC-content in baits should be normalised across the panel. The hybridisation temperature governs the specificity of capture, and baits with the same melting temperature (Tm) should hybridise evenly. Additionally, a high GC-content in baits will lead to more off-target hybridisation, as these baits are more likely to bind efficiently to GC-rich areas in the genome. Baits with a GC-content > 75% should therefore be removed, though one might lower the threshold to 60%. Thirdly, identical, or nearly identical, baits should be removed to reduce redundancy in the dataset as well as bias towards regions covered by identical baits. This should be done carefully, however, so as not to reduce the desired tiling of the bait panel. For example, removing baits with > 90% identical sequence over 85% of the total bait sequence generally works for 3x tiling. The digital panel design provided by the company should then be checked for accuracy by mapping the designed baits against all reference target sequences (e.g., selected transcripts for the panel design). This is to make sure that the baits align with the target sequence, are not too divergent from it, and are tiled uniformly across the length of all genes.
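As a toy illustration of the GC and redundancy filters described above, the sketch below applies the > 75% GC cut-off and a greedy near-duplicate removal. The ungapped window comparison in near_identical is a crude stand-in for the alignment-based (e.g., BLAST) comparison a real design pipeline would use, and all function names are hypothetical:

# Toy bait-QC sketch: GC-content cut-off plus greedy near-duplicate removal.
def gc_content(seq):
    return (seq.count('G') + seq.count('C')) / len(seq)

def near_identical(a, b, min_identity=0.90, min_span=0.85):
    """Crude ungapped check: > 90% identity over at least 85% of the bait length."""
    span = int(len(a) * min_span)
    best = max(sum(x == y for x, y in zip(a[i:i + span], b[i:i + span]))
               for i in range(len(a) - span + 1))
    return best / span > min_identity

def filter_baits(baits, max_gc=0.75):
    kept = []
    for bait in baits:
        if gc_content(bait) > max_gc:
            continue                     # GC-rich baits risk off-target hybridisation
        if any(near_identical(bait, k) for k in kept):
            continue                     # drop redundant near-duplicates, keeping one copy
        kept.append(bait)
    return kept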
A custom-designed bait panel needs to be tested to ensure it efficiently captures the target sequence prior to a large-scale study. A cost- and time-efficient pilot study can include up to 24 samples using an Illumina MiSeq platform (
Information on introns and flanking regions may be elucidated from the ‘splash zone’ in a pilot study and subsequently added to the bait panel design (Cangren et al., unpublished). In such cases where the bait panel is expanded, a second pilot study may be required. The results of a pilot study are generally published along with the design of a custom bait panel, to the benefit of other researchers who may use the same panel (e.g.,
Target capture sequencing uses genomic DNA libraries prepared for sequencing on HTS platforms. These libraries consist of DNA fragments, usually of a controlled size, obtained from source plant material. The source DNA fragment is flanked by standardised identifier sequences (indexes, sometimes also called barcodes) to help identify the sample origin of a sequence read, and a standardised adapter sequence to allow binding of the DNA fragment to the flow cell of the sequencer. The number of bp of DNA from the source genome in a library fragment (the insert size) is therefore smaller than the fragment itself:
insert size = average library fragment size − 2 × (length of adapter + index sequence)
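A worked example of this relationship, with illustrative numbers:

# Worked example of the insert-size relationship above (values are illustrative).
fragment_size = 320          # average library fragment size in bp
adapter_plus_index = 60      # combined adapter + index length per end, in bp
insert_size = fragment_size - 2 * adapter_plus_index
print(insert_size)           # 200 bp of source-genome DNA per library fragment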
The library preparation procedure is not discussed here, but details can be found in Chapter 9 Sequencing platforms and data types.
Sequences from different samples can be distinguished by labelling (indexing) each library with its own unique identifier. Combining differently labelled libraries into one sequencing run (multiplexed sequencing or multiplexing) is a common strategy to reduce per-sample costs. To further reduce the per-sample cost of target capture experiments, libraries from different samples can be combined in one tube for simultaneous enrichment (pooling). This reduces the number of RNA-baits necessary to enrich the same number of samples, and significantly reduces costs. Efficient target enrichment is routinely achieved with up to 48 samples per RNA-bait reaction, and even 96-plexing strategies have been successful (
When deploying a universal bait panel, libraries from different taxonomic groups, particularly at the family rank and above, must be separated. The closer the sample is to the reference taxa, the higher the similarity between the bait sequences and the target sequences (
In all target capture experiments, libraries in the same pool should contain similar fragment sizes. Short fragments move around much more easily in solution and will thus encounter the RNA-baits more often, increasing their chances of capture. Mixing short and long fragments in the same target capture reaction can therefore skew the enrichment towards the shorter libraries (
The number of DNA fragments from each sample in the same pool should be equal, i.e., be present in equimolar quantities. A library with a higher number of DNA fragments than the others in the pool will be overrepresented and potentially bias the DNA sequences that are enriched. Diluting libraries to the same molarity (usually in nM) before pooling is therefore generally advised. An example of how to calculate pooling parameters is shown in Figure
Example describing the strategy of pre-target capture pooling.
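The molarity arithmetic behind equimolar pooling is straightforward to script. A minimal sketch follows, using the standard dsDNA conversion of 660 g/mol per bp; the sample names, concentrations, and the 50 fmol target are hypothetical:

# Equimolar pooling sketch: convert library concentrations to molarity and
# compute the volume of each library to add to the pool (values illustrative).
def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """dsDNA molarity in nM: (ng/ul x 1e6) / (660 g/mol per bp x fragment length)."""
    return conc_ng_per_ul * 1e6 / (660 * mean_fragment_bp)

libraries = {"sample_A": (12.0, 350), "sample_B": (8.5, 330), "sample_C": (20.1, 360)}
target_fmol = 50.0                       # equal amount of each library in the pool

for name, (conc, size) in libraries.items():
    nM = molarity_nM(conc, size)
    volume_ul = target_fmol / nM         # 1 nM equals 1 fmol/ul, so fmol / nM gives ul
    print(f"{name}: {nM:.1f} nM, add {volume_ul:.2f} ul")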
A target capture wet-lab protocol has three steps: denaturing the DNA libraries, hybridising with target-specific baits, and post-capture washing to remove unwanted DNA fragments. An example protocol using the Daicel Arbor MYBaits kit is detailed here.
In the first step, the genomic libraries are denatured at > 95 °C and ‘blocker’ oligonucleotides are added that bind to the adapter sequences. This is to keep the single-stranded fragments from hybridising back to their complementary strands. The blockers also reduce any interference of the adapter sequences during hybridisation, in case the baits themselves contain complementary sequences to the adapters and/or index primers used.
In the hybridisation reaction, the target-specific baits are added to each pool and hybridisation will occur at a constant temperature of 60–65 °C (depending on the specifics of the bait panel) for a minimum of 16 hours. These parameters should always be optimised when setting up a target capture protocol. Longer hybridisation times (≥ 24 hours) are needed for enrichment of more complex genomic libraries, such as those from larger genomes and from universal kits. In these reactions, the baits take longer to encounter the target DNA fragments.
For samples that are expected to underperform (short libraries, herbarium samples, or samples taxonomically distant from the target reference), the hybridisation temperature can be dropped to < 60 °C and the hybridisation time extended to 48 hours. To prevent evaporation and any potential loss of target DNA, a small amount of hydrophobic wax can be added on top of the hybridisation reaction. If using a thermocycler, the heated lid should also be on, at approximately 105 °C, to prevent evaporation.
Finally, the magnetic streptavidin beads are added to the reaction mixture to bind the target-bait hybrids. These streptavidin beads need to be washed to remove any storage buffer before they are added to the target capture pools. Once ready, the tube with magnetic beads and bead-bound target DNA can be placed on a magnetic tube rack to concentrate and anchor the beads, allowing the non-bound DNA fragments to be washed away.
The DNA in enriched pools often needs to be PCR-amplified to generate sufficient detectable fragments for sequencing on HTS platforms. This is especially important when capturing loci from large genomes (e.g.,
Post-capture amplification can be done either with the DNA still bound to the beads, using a specific hot-start polymerase, or after removing it from the beads, using a standard high-fidelity polymerase (e.g., Pfu or Q5, various suppliers). In the latter case, the target DNA is released by denaturation at > 95 °C in a suitable solvent (e.g., a Tris-HCl solution; no EDTA should be present since this will inhibit the polymerase). The tube is immediately transferred to a magnetic rack to separate the DNA from the baits, and the solvent containing the target DNA is gently removed and transferred to clean tubes. A concentrated high-grade detergent such as TWEEN-20 is often added prior to denaturation to enhance the release of target DNA. PCR amplification is then done in-solution on the target DNA using universal PCR primers that bind to the adapter sequences.
Optimising the number of PCR cycles (e.g., via qPCR with a dilution of the captured product) is generally advised, since too many PCR cycles can increase the chance of false positives in the form of random sequence errors, which cannot be corrected based on the sequencing data. Furthermore, excessive PCR cycles introduce unnecessary PCR clones or duplicates. Performing just enough PCR cycles to bring a pool into the desired concentration range for the sequencing platform (generally ≥ 3 nM) is therefore recommended.
As a rough qualitative indication of target capture success, the fragment size distribution can be determined using high-precision electrophoresis instruments such as an Agilent TapeStation. After a successful experiment, there will be a peak in the expected library fragment size range (insert size + adapter and index sequences). While exact target capture success can only be determined from sequencing results, this post-capture fragment distribution analysis acts as an extra quality assurance prior to sequencing.
Sequencing output of target capture sequencing experiments is in the same format as for other HTS experiments. Demultiplexing and quality filtering/trimming of the raw reads is required. Demultiplexing is often done automatically with Illumina sequencing data, using the BaseSpace software. For Oxford Nanopore and/or PacBio reads, there are freeware options such as Porechop (
High-quality reads are assembled into consensus sequences for target loci to make good sequence alignments between samples. For target capture experiments, reads are aligned to the target reference used in the RNA-bait design, a process called mapping, to reduce the complexity of the assembly. Several tools are available to assemble mapped reads de novo, meaning without further use of the reference sequences. HybPhyloMaker (
HybPiper uses a combination of different mapping and assembly tools to retrieve target sequences from large target capture datasets. Reads are mapped to the reference sequence using a Burrows-Wheeler Aligner (
Target capture sequencing achieves reproducible high-quality sequencing results for hundreds of targeted SLCN genes or, in fact, any desired target gene. By reducing the complexity of genomic libraries, high-coverage sequencing results of single-copy genes can be obtained regardless of the organisms’ genome size and DNA degradation rate. These characteristics make target capture ideally suited for molecular identification studies (
The method is being refined as the underlying molecular techniques (
Explain the difference between universal and customised bait panels. What are the advantages (and drawbacks) of a customised approach?
Explain why target capture sequencing is potentially very suitable for obtaining low-copy nuclear genes from herbarium samples.
How does in-solution hybridisation-based target capture ensure enrichment of a HTS library?
Explain the difference between target capture sequencing and other genomic sequencing protocols, such as genome skimming and whole genome sequencing. What are the potential benefits (and drawbacks) of this technique?
Universal bait panels are designed for broad taxonomic application, such as all angiosperms. They are cheaper than customised bait panels due to the high consumer demand, but are usually less powerful in phylogenetics of recently diversified clades due to the use of conserved nuclear loci. Customised bait panels can require some investment, especially when no reference transcriptome is available, but offer a high return-on-investment by allowing for the selection of highly informative loci for the clade of interest.
Due to the small size of baits used in target enrichment (80–120 bp), the technique is less sensitive to DNA fragmentation in (older) historical specimens. Even fragments only partially containing a target sequence can be captured by the baits. Compared to traditional Sanger sequencing or PCR-based target enrichment, target capture does not depend on having complete genic fragments present in the DNA samples.
Short RNA- (or DNA-)baits containing sequences complementary to target sequences can hybridise in-solution with DNA fragments in any DNA sample. They can effectively hybridise with complementary DNA strands carrying up to 30% nucleotide mismatches, allowing the same bait panel to capture target sequences even in taxa distantly related to the original clade used in the design. The baits are chemically modified (biotinylated) so that they, with target DNA fragments attached, can be bound to magnetic streptavidin beads. This allows the user to precipitate the target DNA on a magnetic tube rack and wash away all unwanted non-target DNA fragments.
Target capture sequencing is a reduced representation HTS technique that effectively reduces the complexity of a HTS library by increasing the proportion of target DNA fragments in the sample. Compared to genome skimming and other low-coverage HTS techniques, it is much more powerful in obtaining low-copy nuclear loci, which are highly popular in plant phylogenomics. It is, however, much more expensive than genome skimming. On the other hand, it is much more cost-effective than whole genome sequencing if the user knows which genes the study aims to obtain.
Dewi Pramanik1,2, Ozan Çiftçi3, Yannick Woudstra4,5,6,7
1 Evolutionary Ecology Group, Naturalis Biodiversity Center, Leiden, The Netherlands
2 National Research and Innovation Agency Republic of Indonesia (BRIN), Indonesia
3 Institute of Environmental Sciences, Leiden University, Leiden, The Netherlands
4 Royal Botanic Gardens, Kew, United Kingdom
5 Natural History Museum Denmark, University of Copenhagen, Copenhagen, Denmark
6 Gothenburg Global Biodiversity Center, Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
7 Department of Plant Sciences, University of Oxford, Oxford, United Kingdom
Dewi Pramanik
Ozan Çiftçi
Yannick Woudstra
Transcriptomics is the study of the transcriptome, which is the complete set of all RNA molecules, including coding and noncoding RNA, that is expressed in a cell, tissue, or organism at a specific spatial, temporal, or developmental stage (
In plant research, transcriptomics is widely used for studying differential expression, identifying novel genes, and general expression patterns (
The first publication studying individual transcripts used Northern blotting for RNA detection, which is a hybridization-based method (
Microarrays were the first high-throughput method developed for transcriptomics to achieve widespread use due to their affordability and highly sensitive transcript detection (
Currently, RNA-seq data are often acquired using technologies that produce long reads. Long-read RNA-seq allows reading full-length transcripts, finding new isoforms, identifying fusion transcripts, identifying long noncoding RNA, simplifying the computational analysis, and reducing PCR biases (
Isolating a sufficient quantity of high-quality RNA is critical for conducting transcriptome sequencing experiments and their analyses. When designing a protocol, the number of biological replicates should be considered. Biological replication represents RNA harvested from different plants or different sets of independent samples treated under the same conditions. Biological replication is important for assessing variation between samples, and more biological replicates increase statistical power during analysis. In general, the minimum for transcriptomics studies is three biological replicates. Once the minimum number of samples and replicates is decided, the following steps are sample treatment and handling, RNA isolation, and RNA quality and quantity testing.
RNA to be used in transcriptomic experiments is most commonly isolated from a maximum of 100 mg of fresh plant tissue. If not used immediately, harvested plant tissue should be snap-frozen in liquid nitrogen and stored at -80 °C. If it is not possible to homogenise the fresh material or to snap-freeze it in liquid nitrogen immediately (e.g., in the field), it should be kept in a preservation buffer that maintains a constant pH to preserve proteins and protect the RNA. RNA stabilisation and storage solutions available from manufacturers (e.g., RNAlater™; Ambion/Invitrogen, ThermoFisher Scientific, USA) or other preservatives such as a sulfate salt solution (e.g., ammonium sulfate) preserve tissue samples after harvesting in order to retain the quality and quantity of RNA for long periods (
A crucial step in tissue preparation is finding the most appropriate method to homogenise the tissue in order to maximise the yield and quality of the RNA. The most common method is snap-freezing in liquid nitrogen and subsequent homogenisation/disruption of the tissue by manual grinding with a mortar and pestle, or with glass/metal beads and a tissue lyser. However, this is challenging for hard tissue like wood, roots, or plant tissues with thick cuticles such as succulent leaves. The combination of snap-freezing in liquid nitrogen, disruption of the tissue by manual grinding, and a second grinding with glass/metal beads and a tissue lyser can be a solution to optimise the homogenisation of hard tissues. Once the tissue samples are powdered, they can be stored at -80 °C or used immediately for RNA isolation. It is advised to thaw a frozen tissue sample only once and to add the lysis buffer immediately, to obtain high-quality isolated RNA. It is important that the lytic agent or denaturant comes into contact with the cellular contents when the cells are disrupted. The RNA lysis buffer (e.g., Buffer RLT, Qiagen, USA) is typically based on guanidine isothiocyanate and/or phenol. This buffer has two functions: it acts as a denaturing agent, and it stabilises nucleic acids by preventing RNase activity.
Compared to DNA, RNA is less stable due to its chemical structure: RNA is single-stranded and can easily be enzymatically degraded by the abundant ribonucleases (RNases) present in the environment. RNases are secreted through our skin and in the air we breathe out. RNA isolations therefore need to be conducted in RNase-free conditions. Gloves must be worn at all times and the RNA isolation should take place in a fume hood. Designated working spaces and equipment should be cleaned with RNase inhibitors. Common RNase inhibitors are strong denaturants such as guanidinium, sodium dodecyl sulfate (SDS), diethyl pyrocarbonate (DEPC), or phenol-based compounds. Additionally, commercially available products include DNase/RNase AWAY™ (Merck BV, The Netherlands) or bleach (sodium hypochlorite). RNase-free plastics and glassware should also be used. The main steps for RNA isolation are similar to the DNA isolation protocol (Chapter 1 DNA from Plant Tissues). RNA can be extracted and purified by following protocols described in the literature such as an acidic phenol-chloroform RNA extraction (
Single-cell RNA-seq (scRNA-seq) is an advanced method to profile transcriptomes from individual cells. scRNA-seq can be used for cell type identification, transcriptome profiling, and inference of gene regulatory networks across the cell (
One method for tissue-specific isolation is laser microdissection (LMD), which is based on a histological identification that isolates specific cell types by laser capture and laser cutting (
Single-cell sequencing can further provide high-resolution functional information on an individual cell. In order to capture single cells for scRNA-seq experiments, fluorescence-activated cell sorting (FACS) with the use of protoplasts is commonly used. This is both a high-throughput and highly specific method (
The quality and quantity evaluation of RNA is essential to the success of sequencing experiments and the downstream analysis. RNA quality and quantity can be evaluated by measuring the UV absorption of a sample. The optical density (OD) ratios at A260/A280 and A260/A230 can be used to determine the RNA purity. Pure RNA has an A260/A280 ratio of 2.1 and an A260/A230 ratio in the range of 2.0–2.2 (
Measuring the RNA integrity in order to determine its degradation level is also recommended. Traditionally, RNA integrity was determined by visualising total RNA using gel electrophoresis and ethidium bromide staining. Intact RNA gives sharp and clear 28S and 18S rRNA bands with an intensity ratio of 28S/18S at 2.0 or higher, in addition to a messenger RNA (mRNA) smear that should be visible between these two distinct bands. A more recent and standardised RNA integrity determination method is determining the RNA integrity number (RIN) with Agilent Bioanalyzer Systems instruments (Agilent Technologies, USA) (
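These purity and integrity thresholds can be combined into a simple acceptance check. A minimal sketch follows; the exact windows, and in particular the RIN cut-off of 7, are common rules of thumb rather than universal requirements:

# Simple acceptance gate for RNA samples (thresholds are rules of thumb).
def rna_passes_qc(a260_280, a260_230, rin):
    pure = 1.9 <= a260_280 <= 2.2 and 2.0 <= a260_230 <= 2.2   # purity windows from the ratios above
    intact = rin >= 7.0                                        # assumed RIN cut-off for RNA-seq input
    return pure and intact

print(rna_passes_qc(2.1, 2.05, 8.2))   # True: pure and intact
print(rna_passes_qc(1.6, 1.2, 9.0))    # False: ratios suggest protein/phenol or salt carry-over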
The selection of library preparation methods depends on the fragment size, presence of structural features, and sequencing platform. In the Illumina short-read RNA-seq protocol, the library preparation entails four main steps: (1) RNA molecule selection (mRNA enrichment or rRNA depletion), (2) fragmenting the targeted sequence to the desired length and converting fragmented RNA into cDNA, (3) attaching the adapters and PCR amplification to create the cDNA library, and (4) quantifying the library product for sequencing. The library preparation for long-read sequencing is somewhat simpler than for short-read sequencing. The PacBio Iso-Seq protocol consists of three main steps: (1) cDNA synthesis, (2) cDNA amplification, and (3) library construction. With the Oxford Nanopore platform, the sequencing can be done directly from RNA or by using the amplified (or non-amplified) cDNA input.
A total RNA sample after extraction contains ribosomal RNA (rRNA), precursor mRNA (pre-mRNA), mRNA, small noncoding RNA (sRNA/sncRNA), and long ncRNA (transcripts longer than 200 nucleotides), where the majority of material is rRNA (
The conversion of RNA into cDNA is an essential step for RNA-seq. This conversion is necessary because DNA is more stable than RNA, PCR amplification can only be done on DNA, and most sequencing protocols are designed for sequencing DNA. The first step in converting RNA to cDNA is the fragmentation of the RNA into an appropriate size for sequencing (i.e., 100–600 bp). Several approaches are available for RNA fragmentation, including physical approaches (e.g., acoustic shearing and sonication), chemical approaches (e.g., heating and divalent metal cation addition), and enzymatic methods (e.g., non-specific endonuclease cocktails and transposase tagmentation reactions) (
Adapters are ligated to one or both ends of the cDNA fragment. Adapters consist of sequences that allow library fragments to bind to the flow cell, sequencing primer binding sites, and index sequences. Index/barcode sequences are sequence identifiers that enable the pooling of several samples (multiplexing) in a single sequencing run or flow cell lane. Products from the ligation reaction are purified using agarose gel electrophoresis prior to PCR amplification to create the cDNA library.
Several library preparation kits for the Illumina platform are available. The “TruSeq Stranded Total RNA with Ribo-Zero Plant” kit is useful for large tissue samples (0.1–1 μg total RNA). For low quantities of RNA, the “NEBNext® Ultra™ II Directional RNA Library Prep with Sample Purification Beads” kit (10 ng–1 µg total RNA for the poly(A) mRNA workflow and 5 ng–1 µg total RNA for the rRNA depletion workflow) (New England Biolabs Inc., UK) can be used. These kits incorporate the Illumina library preparation steps, including bead-based rRNA depletion or mRNA enrichment, cDNA synthesis, adapter ligation, indexing, and PCR. For a tissue sample that yields smaller amounts of RNA, like a single cell (1–25 ng), the “Collibri Stranded RNA Library Prep kit” (ThermoFisher Scientific, USA) can be applied.
For the PacBio Iso-Seq platform for long-read RNA-seq, the “NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module” kit (New England Biolabs Inc., UK) can be used for cDNA synthesis and its amplification from a single cell or ultra-low input RNA (as low as 1 pg–200 ng). The “SMRTbell Express Template Prep Kit 2.0” (Pacific Bioscience, USA) can be used to detect full-length transcripts up to 10 kb.
The ONT platform provides a starter pack for direct RNA-seq, a PCR-cDNA sequencing kit, and a direct cDNA sequencing kit (Oxford Nanopore Technologies Ltd., UK). The required input of RNA or poly(A)+ RNA (selected on the presence of polyadenylated 3’-ends) is 500 ng for direct RNA-seq, 1 ng for PCR-cDNA sequencing, and 100 ng for direct cDNA sequencing.
A very sensitive method for checking the quantity of a library preparation is a fluorometric method (e.g., Qubit™ Fluorometer, ThermoFisher Scientific, USA) or qPCR. qPCR library quantification is based on the amplification of cDNA fragments with the adapters. The qPCR machine measures the intensity of fluorescence emitted by the probe at each cycle. In this approach, only templates that have both adapter sequences on either end will be measured and subsequently form clusters in a flow cell. Other methods include electrophoresis-based quantification with fragment analyzer systems that use automated parallel capillary electrophoresis to assess the library size distribution (e.g., TapeStation, Agilent Technologies, USA). A critical aspect of the quality check from the fragment analyzer is that the library size distribution lies in the expected range. Peaks near the lower marker on the library electropherogram indicate contaminants, including primer and adapter dimers. An additional clean-up of the sample is then recommended to increase the quality.
cDNA sequencing can be performed on several different platforms (see Chapter 9 Sequencing platforms and data types). Overall, RNA sequencing does not differ from the sequencing of genomic DNA. The sequencer reads cDNA fragments in one of two ways: single-end or paired-end. In single-end reading, the sequencer reads the cDNA from the 3’ or 5’ end of only one strand of the insert. This method can produce large volumes of high-quality data, especially for differential gene expression studies, where the important factor is determining which transcripts the reads come from (
The requirements for sequence coverage and depth vary depending on the scientific questions to be answered, with more complex studies needing greater sequencing depth and coverage. For example, a differential expression study using the Illumina platform requires 10–30 million reads per sample (
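Such depth requirements translate directly into a multiplexing budget. A minimal sketch, with an assumed run output (consult current platform specifications for real figures):

# Rough multiplexing budget at a chosen per-sample depth (figures are assumed).
run_output_reads = 400e6           # assumed total read output of one sequencing run
reads_per_sample = 20e6            # within the 10-30 million range cited above
max_samples = int(run_output_reads // reads_per_sample)
print(max_samples)                 # 20 samples per run at this depth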
Prior to the development of high-throughput methods, individual transcriptome studies were performed using hybridization-based methods such as Northern blotting and microarrays (see above) or amplification-based methods including Sanger sequencing and RT-qPCR.
Hybridization-based methods require visual inspection or image processing analyses to interpret the output, while in qPCR, it is the amplification that must be monitored. In qPCR, the expression levels are represented by cycle threshold (Ct) values and further normalisation steps and statistical analyses need to be used for the estimation of relative or absolute abundances. Neither hybridization methods nor qPCR require labour-intensive post-processing.
On the other hand, EST/SAGE/MPSS or RNA-seq methods rely on sequence data and require several post-processing steps such as clustering, assembly, and functional annotation. As RNA-seq allows characterization of whole transcriptomes and is currently the most widely used method, we outline the bioinformatic analysis steps for high-throughput RNA-seq data. Long-read sequencing methods such as ONT and SMRT allow full-length characterization of transcripts and can be used to study complex transcriptomes. Although one common concern regarding these technologies is high error rates, their accuracy has dramatically increased recently, and the development of long-read-specific error correction approaches is providing further improvements (
After obtaining raw RNA-seq data, the quality of the reads should be checked and sequencing errors should be corrected in order to improve the accuracy and efficiency of the assembly process. It is also recommended to mask low complexity regions and repetitive sequences that might generate hits that are artefacts. DUST and SEG modules of BLAST can be used for this purpose on nucleotide and amino acid sequences, respectively. Bacterial and viral contaminants can be removed by running similarity searches against public databases or using tools such as DeconSeq (
Most short-read assemblers first divide reads into subsequences of length k (i.e., k-mers) and generate a graph representing the overlap between them (
Depending on whether a reference genome/transcriptome is available or not, there are different strategies for transcriptome assembly.
De-novo assembly is solely based on RNA-seq data and uses the k-mer composition by subdividing the reads into shorter segments of a given length k. This composition and the overlaps between these k-mers are represented on a de Bruijn graph, which is finally resolved to reconstruct transcripts (
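The core idea is easy to illustrate. In the toy Python sketch below, nodes are (k−1)-mers and each observed k-mer contributes one edge; real assemblers add error correction, graph simplification, and bubble resolution on top of this:

# Toy de Bruijn graph: nodes are (k-1)-mers, each observed k-mer adds one edge.
from collections import defaultdict

def de_bruijn(reads, k=5):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge from prefix (k-1)-mer to suffix (k-1)-mer
    return graph

reads = ["ATGGCGTGCA", "GGCGTGCAAT"]            # two overlapping toy reads
for node, successors in de_bruijn(reads).items():
    print(node, "->", successors)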
Commonly used de-novo assemblers include Trans-ABySS (
There are also combined de-novo assembly approaches such as EvidentialGene (
Genome-guided assemblers map RNA-seq data to a reference genome and avoid constructing de Bruijn graphs by merging the reads based on their overlapping regions. The quality of the reference genome is critical here, as a high-quality assembly can provide accurate transcript predictions and expression profiles, while using a fragmented or incomplete assembly as reference might aggravate this process. When mapping RNA-seq reads to a reference genome, introns should be accounted for. Genome-guided assemblers therefore allow splitting the reads during mapping. This is achieved by using a splice-aware alignment strategy, where the downstream regions of a read can map to a downstream exon on the reference. Such splice-aware aligners include TopHat2 (
RNA-seq reads can also be mapped to a transcriptome, if a high-quality assembly is available for the target or a closely related species. This transcriptome-guided approach can improve the contiguity and completeness of the assembly (
High-sensitivity reference-guided assemblers can be combined with de-novo assemblers in order to detect novel and missing transcripts as well. If the reference genome is incomplete, fragmented, or from a distantly related species, the de-novo assembly should be performed first in order to avoid the potential errors in the reference. This approach can also be useful for extending incomplete transcripts to full-length by merging these based on a reference (
The average length of assembled contigs in an RNA-seq experiment will vary based on the actual mRNA fragments that are sequenced. Thus, metrics based on assembled contigs do not necessarily indicate the quality of a transcriptome assembly. Transcriptome-specific metrics have been suggested, such as ExN50, which computes transcript lengths as expression-weighted means of isoform lengths. Another method to assess assembly quality is checking the percentage of reads that align concordantly to the final assembly, in order to understand whether the full complement of paired-end reads is represented in the assembled transcripts. Tools such as bowtie2 or BWA can be used for this type of mapping. Other tools for evaluating the quality of an assembled transcriptome include DETONATE (
Transcripts can also be translated into protein sequences and mapped against well annotated databases such as UniProt/Swiss-Prot, Pfam, or NCBI. If the sequenced organism is closely related to a model organism, a high proportion of the contigs should have potential homologs in these databases. Another tool, BUSCO, assesses the completeness of the assembly by comparing it with universal single-copy gene databases specific to different lineages such as bacteria,
Expression quantification is a critical step for most RNA-seq experiments. There are two main sources of systematic variability which might introduce errors to this process: (i) longer transcripts generate more reads than shorter transcripts at the same abundance, due to RNA fragmentation during library construction, and (ii) the number of fragments mapped differs across samples, due to the varying number of reads produced for each run. Therefore, read counts need to be normalised in order to obtain accurate gene expression estimates. Inter-sample normalisation methods have been developed for differential expression analysis, such as DESeq2 (
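To make the two biases concrete, the sketch below computes TPM, one common within-sample normalisation that first corrects for transcript length and then for sequencing depth. This is an illustration only; DESeq2 and similar tools use their own normalisation models:

# Within-sample TPM normalisation: correct for transcript length, then depth.
def tpm(counts, lengths_bp):
    rates = [c / l for c, l in zip(counts, lengths_bp)]   # reads per base of transcript
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

counts = [500, 500, 100]           # raw read counts for three transcripts
lengths = [1000, 2000, 500]        # transcript lengths in bp
print(tpm(counts, lengths))        # equal counts, but the 2 kb transcript gets half the 1 kb one's TPM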
Assembled transcripts from de-novo or reference-guided assemblies are expected to represent real biological differences such as expression levels, alternative splice forms, and paralogous or allelic transcripts (
Transcriptomes have been used for plant phylogenomic inference as they contain abundant information from the nuclear genome. Famously, the generation of > 1000 transcriptomes across the plant kingdom led to new evolutionary insights for land plants (One Thousand Plant Transcriptomes Initiative 2019). However, the application of RNA-seq is limited to fresh tissue with low levels of degradation, making it less applicable to studies with large taxonomic sampling.
An emerging phylogenomic approach that partly relies on transcriptomics uses targeted next-generation sequencing (see Chapter 14 Target capture) to obtain specific genes for high-coverage DNA sequencing in large numbers of samples with varying taxonomic breadth. Target capture is very efficient in recovering hundreds of genes, regardless of the degradation level in the source DNA (
Metatranscriptomics is the application of transcriptome sequencing to environmental samples such as water, soil, or sediments. It gives an overview of the actual metabolic activity and taxonomic diversity within a community. The protocol involves HTS of reverse-transcribed cDNA obtained from an environmental mRNA isolate. While reverse transcriptase PCRs can only detect a single gene at a time, metatranscriptomics gives a whole gene expression profile of a diverse community of organisms playing various functional roles in the ecosystem (
Some of the main challenges of metatranscriptomics are the presence of PCR inhibitors in environmental samples (e.g., humic acid, polysaccharides;
There are various applications of metatranscriptomics such as revealing the composition of freshwater bacterioplankton communities (
Plant transcriptomics studies have undergone huge advances over the past few years as the costs of second-generation sequencing, such as Illumina, have declined, third-generation sequencing has become more accurate, and a wider range of analysis tools and pipelines has become available (
Studies using comparative transcriptomics to understand interactions between different organisms (
What is the difference between genomics, proteomics, and transcriptomics?
What is a more suitable library preparation approach for comparative gene expression study: poly(A) enrichment or rRNA depletion? Motivate your answer.
Describe three criteria that are critical for the choice of reference-based vs. de-novo assembly approaches.
Genomics is the study of the entire genome from the complete set of DNA in the nucleus, chloroplasts and mitochondria of somatic cells. Proteomics is the study of proteins, protein complexes, their localization, their interactions, and posttranslational modifications. Transcriptomics is the study of genome expression products of the mRNAs that are actively expressed at any given moment in a cell or organism.
Poly(A) enrichment is suitable for comparative gene expression studies. It produces sufficient mRNA and separates mRNA from rRNA contaminants, resulting in higher exonic coverage for protein-coding genes of a transcriptome. rRNA depletion is mainly applied in comprehensive transcriptome studies. It can capture a wider diversity of unique transcriptome features, such as mRNAs lacking the poly(A) tail, long ncRNAs, and newly transcribed or unprocessed transcripts. If rRNA depletion is used for comparative gene expression studies, the results will be biased: most reads map to intronic regions, reducing the number of reads mapping to exonic regions, and the expression levels of genes that overlap with the intronic regions of other genes are overestimated.
(i) Availability, completeness, and fragmentation of reference genomes of target species. (ii) Availability of reference genomes of closely related species. (iii) Detecting novel transcripts.
Modern sequencing technologies (see Chapter 9 Sequencing platforms and data types) make it possible to generate large-scale genomic sequencing data for any plant species. This dramatic step-change in genomic data availability, along with improvements in bioinformatic tools, has led to the release of many high-quality plant genomes (
In this chapter, we consider best practices for whole genome sequencing as a tool for plant identification, and the relative strengths and weaknesses of different genome sequencing approaches. We start by discussing the overall workflow common to any project using whole genome sequencing, before moving to the specific requirements of three approaches that differ in their sequencing coverage: (1) Genome skimming, which uses low-coverage sequence data to assemble well-represented (high copy number) genomic regions, (2) Genome resequencing, which uses modest-coverage sequence data to investigate genomic diversity relative to an existing nuclear reference genome sequence, (3) De novo whole genome assembly, which uses high-coverage sequence data to generate a nuclear reference genome. We also consider assembly-free approaches for using the nuclear genome.
Genomic sequencing starts with sample collection and DNA extraction, and finishes with a set of sequences or sequence variants that are suitable for analysis. The major stages are as follows.
Genome sequencing usually uses high-quality DNA extracted from plant tissue (see Chapter 1 DNA from plant tissue), though some approaches can accommodate DNA from degraded specimens (see Chapter 2 DNA from museum collections). The exact requirements depend on the downstream processes, but as a guide:
Low initial concentrations (500 pg+) of degraded or intact DNA (fragment molecules 100 bp+) can be used for genome skimming (
Modest initial concentrations (100 ng+) of intact DNA without extensive degradation (fragment molecules 400 bp+) are typically used in genome resequencing.
High initial concentrations (1 µg+) of high molecular weight DNA (fragment molecules 20 kb+) are typically used for de novo genome sequencing.
Most plant identification projects use DNA extracted from individual plant samples. However, metagenomic studies may work on mixed samples such as environmental DNA (see Chapter 12 Metagenomics), while some population genomic studies may choose to pool multiple individuals per population and compare diversity between these sample pools (e.g., Pool-seq; (
The wet lab protocols used to generate sequence-ready DNA libraries (see Chapter 9 Sequencing platforms and data types, Chapter 12 Metagenomics, Chapter 15 Transcriptomics) are varied and depend on the starting DNA quality and the intended downstream sequencing approach. Low amounts of starting input DNA will require amplification via PCR-based library preparation, while higher amounts of input DNA can be used in PCR-free libraries, which reduces bioinformatic issues with PCR duplicates. Further, a range of more lab-intensive library preparation approaches are available for long-read sequencing or to allow users to partition and barcode HMW DNA (e.g., linked read sequencing such as haplotagging; (
Most plant identification studies using whole genome sequencing rely on short-read data, such as that generated on Illumina platforms or with BGI technologies. Here, the benefits of low per base-pair sequencing costs, high accuracy and throughput, and potential for sample multiplexing make it extremely well-suited for a range of applications. However, recent innovations in long-read sequencing have reduced error profiles and made it more cost-effective (
The computational methods will vary considerably depending on sample type and number, sequence type, and downstream analysis approach. However, most projects will involve the following initial stages:
De-multiplex samples to give separate sequence files per individual.
Data quality control (QC), to check the sequence quality, read number, and other sequence quality metrics.
Sequence data post-processing. This may involve trimming or quality filtering raw reads to remove low-quality sequences.
Genome assembly. For genome skimming, organellar and ribosomal genomes can readily be assembled de novo (see below), while for nuclear genomes the data then goes through a multi-stage genome assembly pipeline (
Alignment. Studies of small genomes, such as plastids, will usually involve whole genome alignment. In resequencing studies, sequence reads are directly mapped to the reference genome (e.g., short-read mapping with bwa-mem). Various additional stages such as marking of duplicates and realignment around indels may be required to produce high quality alignments.
Variant calling to produce Single Nucleotide Polymorphism (SNP) datasets for downstream analyses. Many SNP callers are available, with the Genome Analysis Toolkit (GATK; (
Quality filtering to remove low frequency SNPs and sites/individuals with lots of missing data. This may either be by applying ‘hard’ quality thresholds, or more sophisticated machine-learning approaches for removing sequence artefacts (e.g., variant quality score recalibration in model species with a high quality training set).
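As an illustration of such ‘hard’ thresholds, the sketch below filters a parsed, VCF-like site record; the threshold values and field names are illustrative defaults, not recommendations from any particular pipeline:

# 'Hard threshold' filter over a parsed, VCF-like site record (all values illustrative).
def passes_hard_filter(site, min_qual=30.0, min_maf=0.05, max_missing=0.2):
    qual_ok = site["QUAL"] >= min_qual                       # site quality score
    maf = min(site["allele_freq"], 1 - site["allele_freq"])  # minor allele frequency
    freq_ok = maf >= min_maf                                 # drop low-frequency SNPs
    missing_ok = site["missing_fraction"] <= max_missing     # drop sites with much missing data
    return qual_ok and freq_ok and missing_ok

site = {"QUAL": 48.2, "allele_freq": 0.12, "missing_fraction": 0.08}
print(passes_hard_filter(site))                              # True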
Low-coverage sequencing of genomic DNA, ‘genome skimming’, is an efficient approach for comparative genomics of diverse species (
Genome skimming stands out from many other genomic approaches for its technical ease. It is straightforward at all stages of the workflow, from DNA extraction requirements to easy and comparatively cheap library building options (
The downside of genome skimming is that it fails to reliably sample the bulk of the genome. The regions which are represented at high coverage, particularly organelle genomes, show atypical inheritance and evolutionary patterns, which may yield phylogenetic results that are incongruent with phylogenies of the nuclear genome (e.g., in
Plastids are organelles that are responsible for photosynthesis and the synthesis and storage of molecular products. Plastomes are mostly circular and nonrecombinant organellar genomes averaging 120–160 kb in size. Their high copy number per cell means high quality assembly is possible even from low depth nuclear genome sequencing (
Nuclear ribosomal DNA (nrDNA) primarily functions to code for ribosomal RNA. Plant nrDNA has an average size range of 10–15 kb and exists as hundreds to thousands of tandem repeats occurring in high copy numbers throughout cells (
Mitochondrial genomes have a primary function in respiration. Despite their conserved function and generally conserved gene complement, they show significant structural variation in plants, including in size (100 kb–2.7 Mb), sequence arrangement and repeat content (
Genome resequencing involves sequencing samples at a moderate depth (often 5–30X coverage) and analysing the data in the context of an existing reference genome. Most genome resequencing studies use short-read data and subsequently investigate SNPs and small indels; however, long-read sequencing is now becoming more accessible for such resequencing work, allowing researchers to investigate longer indels and structural variation (see ‘Sequencing’ above).
The key benefit of genome resequencing over genome skimming is that it provides reliable and repeatable access to the nuclear genome. This allows researchers to investigate diversity and evolutionary relationships both genome-wide and in specific genomic regions of interest, such as those loci underlying species differences in young taxa (
The key downside to genome resequencing is that it requires a reference genome. As such, genome resequencing has traditionally been restricted to population genomic analyses of model species. However, decreasing sequencing costs and the increasing availability of reference genomes (discussed below) mean resequencing is now more widely applicable to a diversity of species. It is also becoming increasingly easy to perform resequencing studies on degraded DNA due to improved laboratory and bioinformatic methods that are able to capture and process short fragments (see Chapter 2 DNA from museum collections). In addition, the increasing usability of genotype likelihoods instead of hard SNP calls means that sequence variation can be assessed at reduced coverage (and hence, cost). As such, genome resequencing is increasingly used to resolve plant identification issues that require population-level sampling or the investigation of closely related species.
The use of a reference genome brings limitations to the analysis of samples of varying quality (exacerbated with fragmentation in degraded ancient DNA; (
A range of bioinformatic tools are now available for the analysis of short sequence reads without a reference genome. These approaches often rely on the frequency distribution of k-mers (short sequences of length k) across all sequence reads of a given sample (
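A k-mer frequency spectrum is simple to compute, and a naive genome-size estimate (total k-mer count divided by the coverage peak) illustrates why the spectrum is informative. The sketch below is a toy version; dedicated tools additionally model sequencing errors, heterozygosity, and repeats:

# Toy k-mer spectrum and naive genome-size estimate (illustrative only).
from collections import Counter

def kmer_spectrum(reads, k=21):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())            # histogram: multiplicity -> number of k-mers

def naive_genome_size(spectrum):
    peak = max(spectrum, key=spectrum.get)     # assume the modal multiplicity is the coverage peak
    total = sum(mult * n for mult, n in spectrum.items())
    return total / peak

genome = "ACGTTGCATGGA"                        # 12 bp toy 'genome' with no repeats
reads = [genome] * 10                          # 10x coverage, error-free
spec = kmer_spectrum(reads, k=5)
print(spec)                                    # Counter({10: 8}): 8 distinct 5-mers, each seen 10 times
print(naive_genome_size(spec))                 # 8.0, the number of k-mer positions in the toy genome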
De novo whole genome sequencing represents the ‘gold standard’ in genomics. Here, the aim is to produce a chromosomally contiguous set of sequences that document the complete nuclear genome. To achieve this aim, a complementary range of genomic sequencing technologies (with long-read sequencing coupled with long-range information now the standard) are applied to high-quality DNA extracts, producing high-coverage sequencing data.
The de novo assembly of plant genomes represents a complex analytical problem. Most assemblers rely on one of two approaches (
Published reference genome assemblies vary considerably in quality. Even when contamination and mis-assemblies have been minimised, contig and scaffold size and overall genome completeness can vary widely. These genome properties can be assessed with measures such as N50 (a length-weighted median measure of contig size), or BUSCO completeness (the percentage of fully assembled core plant genes) (
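N50 in particular is easy to compute from a list of contig lengths, as the minimal sketch below shows:

# N50: the contig length at which the running total (largest first) reaches
# half of the assembly size, i.e., a length-weighted median of contig size.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([100, 80, 60, 40, 20]))   # 80: the two largest contigs already hold half of the 300 bp total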
De novo genome sequencing is likely to play an important role in future studies of plant identification. One can imagine rapidly sequencing long DNA molecules that are directly assembled into chromosomes in real time and then compared to existing reference genomes to detect the presence of cryptic species. While this may seem like a fantasy, the dramatic and continued progress in genomic sequencing and bioinformatic algorithms makes this not so far-fetched, as seen with Oxford Nanopore adopting ‘adaptive’ small-genome sequencing where reads are mapped and analysed in real time. In the meantime, de novo genome sequencing effort is likely to be focused on generating reference genomes for each plant family and for specific research projects, either as stand-alone research investigating genome evolution, or to facilitate genome resequencing of infraspecific variation. Current barriers to the wider deployment of reference genome production are the cost and bioinformatic complexity of assembling large, repeat-rich, polyploid plant genomes. These challenges are particularly difficult in some evolutionary lineages, such as ferns, which mostly have large polyploid genomes.
Genomic sequencing can aid in numerous aspects of plant identification, discussed in Section 3. Here, we consider a representative set of examples where genome skimming, genome resequencing, and whole genome sequencing may be the preferred approaches.
Genome skimming is particularly suitable for studying a large number of diverse samples (
Genome resequencing is most appropriate for studying the population genetic structure and/or the relationship of closely related species. For example, a researcher may want to clarify species boundaries and improve species delimitation in a taxonomically complex species group (
The current use of reference genome sequencing is largely to understand the evolution and genome structure of plants, rather than directly being used for plant identification. For example, the production of a reference genome for a medicinal plant species may be a key resource for characterising the evolution of chemical diversity. Here, a reference genome may reveal the genes and genetic pathways involved in secondary compound production (e.g., for medicinal compounds in the orchid genus
Whole genome sequencing is increasingly used in studies of plant identification. A diverse range of methods are available, from low coverage genome skimming used to recover organelle sequences for reconstructing plant phylogeny, through to high coverage sequencing and de novo nuclear genome assembly used to generate reference genomes for comparative analyses. Future developments in sequencing technologies and bioinformatic tools will make these methods increasingly accessible to the botanical community.
How does sequencing coverage differ between genome skimming, genome resequencing, and reference genome production?
Which sequencing approach (mentioned in other chapters) would be a good alternative to genome skimming for sequencing degraded museum specimens?
How can genome sequencing approaches be used to facilitate plant identification? Give two examples.
Genome skimming typically generates low coverage (~0.1–5X), genome resequencing modest coverage (~5–30X), and de novo genome sequencing high sequencing coverage (often > 100X).
Target capture would be the obvious alternative (see Chapter 14 Target capture). This enriches for specific target regions which are then sequenced at high coverage.
There are many possible uses, two examples: (1) genome skimming could be used as an ‘extended barcode’ to genetically characterise a degraded, fragmented or processed sample (e.g., a museum specimen, an illegally traded processed sample, or juvenile material) relative to existing DNA barcoding databases, (2) genome resequencing could be used to characterise species limits and species relationships and identify cryptic species.
Anne-Sophie Quatela1,2, Bengt Oxelman1,2
1 University of Gothenburg, Gothenburg, Sweden
2 Global Biodiversity Center, Gothenburg, Sweden
Anne-Sophie Quatela
Bengt Oxelman
“Species” is often considered to be one of the living world’s fundamental categories, having its own ontological status, similar to a gene, cell, population, clade, or organism. Despite its importance, defining the species category is controversial, and many different species concepts have been proposed (
In this text, we will focus on concepts that view species as historical individuals. These are composed of the genetic material (i.e., an assembly of alleles), which is reproduced through time and expressed by ephemeral phenotypes. Historical individuals refer to different ontological kinds than individual organisms. Individuals lack defining properties; they exist restricted in time and space regardless of our ability to recognize them (
Pre-evolutionary taxonomy often divided organisms into binary groups based on absence or presence of certain properties, e.g., those with the ability to move versus those that are sedentary. However, we know from phylogenetics that attributes such as the ability to move have evolved many times. Defining animals as organisms having the ability to move results in a class. Phylogenetic taxonomy tries to identify and name monophyletic groups that have a certain spatio-temporal restriction. Thus, phylogenetic taxa are individuals, with no defining properties. In taxonomy, the concept of cryptic species, although not unambiguously defined (
The genetic material constitutes the replicators, i.e., the entity that evolves. The genes express themselves as interactors, the organisms, which are only vehicles for the genes, and everything beyond the information encoded in the DNA is an ephemeral expression of it (
The species concept debate nourishes a fundamental ontological problem: which categories in biological taxonomy are natural or even real? According to the rules of nomenclature (e.g., the International Code of Botanical Nomenclature;
Monophyletic groups, or clades, form a nested hierarchy. A phylogenetic tree is a convenient model to illustrate this. A named clade with family rank will normally include subclades that may be named as genera, and these will include named species. So, if the formal categorical ranks are applied to clades, there is no difference between, for example, genus and species, except that the latter can never include the former. The International Code of Phylogenetic Nomenclature (PhyloCode;
Population is a widely used concept among biologists, which usually refers to a geographically confined assemblage of individual organisms of the same species. Strict mathematical definitions enable the parameterization of certain aspects of population genetics if some simplifying assumptions are applied. For example, demographic history can be quantitatively studied using coalescent theory (
The implications of species concepts affect essential societal fields such as agronomy and plant breeding, agroforestry, pharmacology and medicine, horticulture, etc. This chapter has neither the ambition nor the goal to provide an exhaustive summary of all those implications. Here, we briefly discuss the implications of different species concepts for taxonomy, and how species concepts shape our perception of the current biodiversity crisis.
Some species concepts and their properties (e.g., the biological species concept, the ecological species concept, the phylogenetic species concept, etc.) can be incompatible and lead to the description and naming of differently delimited taxa. To understand the consequences of competing species delimitations, it is essential to acknowledge the central role of taxonomy in many biological studies and societal matters. Traditionally, taxonomy delimits species based on diagnostic morphological differences. However, taxonomists sometimes disagree, and there is a recognition of taxonomists as “lumpers” (favouring broad species delimitations) and “splitters” (favouring narrow). Moreover, morphologically delimited species may be different from those delimited according to other species concepts. The 21st century biodiversity crisis and the conservation efforts that arise from it are in need of a tool for quantitative biodiversity measurements. Species richness is often defined as the number of species per area and/or time, and is central in many biodiversity measurements, for example in Shannon entropy (
Species delimitation methods described in this chapter are DNA-based phylogenetic approaches investigating the evolutionary history of species. However, their goals and inherent properties should not be confused. While molecular phylogenetics aims to identify and infer the evolutionary relationships among clades, molecular species delimitation aims to estimate parameters identifying species.
There are several practical questions to address before designing and implementing a species delimitation study; we briefly discuss a few of them here.
The sampling strategy should capture sufficient intraspecific variation while also representing interspecific divergence. In this light, the first advice is to sample from the entire known distribution range of the group under study. The second step is to sample the different morphotypes of each taxonomic species. The underlying idea is that, as the phenotype is an expression of the genotype, sequencing a wide range of morphotypes per taxonomic species should facilitate a comprehensive study. Another way of putting it is that this allows taxonomic delimitations based on phenotypic data to be tested with genetic data.
Both single- and multi-locus approaches have been developed and used in plant species delimitation. In this chapter, we deliberately address multi-locus approaches, which present several advantages over single-locus methods
Identifying and quantifying all the parameters that influence a biological system is complex, and in stochastic modelling we make simplifying assumptions and approximations. A stochastic model enables quantification of differences between the input data and what the model predicts. Conclusions may therefore be drawn on which processes are responsible for those differences. For example, the linear regression model has two parameters, the slope and the intercept. Given any sampled two-dimensional data, we can estimate the best fitting values for the two parameters. However, the fit will depend on the model’s assumptions (e.g., linearity, random and independent sampling, homoscedasticity, etc.). These assumptions can be relaxed by introducing new parameters that will provide a better fit to the data. Although this increases the computational effort, it also reduces the explanatory power, because there will be less data per parameter.
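To make this concrete, here is a minimal Python sketch (assuming only NumPy; the data are simulated purely for illustration) that fits the two parameters of a linear regression by least squares and inspects the residuals, i.e., the differences between the data and what the model predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate two-dimensional data from a known linear model plus noise.
true_slope, true_intercept = 2.0, 1.0
x = rng.uniform(0, 10, size=100)
y = true_intercept + true_slope * x + rng.normal(0, 1.5, size=100)

# Estimate the two parameters (slope and intercept) by least squares.
design = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)

# Residuals quantify the difference between the data and the model's
# predictions; systematic structure in them hints at violated assumptions
# (non-linearity, non-independent sampling, heteroscedasticity, ...).
residuals = y - (intercept + slope * x)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, "
      f"residual sd={residuals.std(ddof=2):.3f}")
```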
We present two types of parametric multi-locus delimitation approaches, allelic-clustering and coalescent-based methods. Uni-locus approaches and concatenation methods do not take advantage of the information inherent to the discordance among gene trees in a multi-locus dataset. We also note that other approaches than parametric modelling are possible, for example by simply plotting data and analysing the pattern and classifying the data points according to their Euclidean distances to each other (
Species delimitation can be viewed as a process where sampled individuals (i.e., alleles or organisms) are clustered. In population genetics, a class of methods, often referred to as STRUCTURE-like methods due to the original methodology proposed by
Alleles are in linkage equilibrium when they occur randomly and independently in a population; their frequencies are then those expected under the Hardy-Weinberg principle. However, biological processes often violate linkage equilibrium. Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci in a population: alleles are in LD when they do not occur randomly and are not independent of each other. LD provides information about population genetic phenomena (i.e., migration, mutation, selection, genetic drift). In a population, LD is increased by selection, population structure, and genetic drift, and is eroded by recombination. STRUCTURE (
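As an illustration, the following Python sketch (with made-up haplotype counts) computes the classic LD coefficient D = p_AB − p_A·p_B, which is zero under linkage equilibrium, together with the normalised statistic r²:

```python
from collections import Counter

# Toy sample of two-locus haplotypes (alleles A/a and B/b).
haplotypes = ["AB"] * 40 + ["Ab"] * 10 + ["aB"] * 10 + ["ab"] * 40
n = len(haplotypes)
counts = Counter(haplotypes)

p_A = (counts["AB"] + counts["Ab"]) / n   # frequency of allele A
p_B = (counts["AB"] + counts["aB"]) / n   # frequency of allele B
p_AB = counts["AB"] / n                   # observed haplotype frequency

# Under linkage equilibrium p_AB = p_A * p_B; D measures the departure.
D = p_AB - p_A * p_B
r2 = D**2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
print(f"D = {D:.3f}, r^2 = {r2:.3f}")  # D = 0.150, r^2 = 0.360
```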
An interesting feature of these approaches is that they can directly cluster the genetic material, the alleles, rather than the phenotypic expressions (i.e., the organisms). Thus, they are directly clustering the replicators, and not the interactors, which may be heterogeneous assemblages of such clusters (i.e., hybrids). A shortcoming of allelic clustering methods is that unlike coalescent-based phylogenetic methods they do not assess the phylogenetic divergence of populations.
Ideally, species delimitation methods should parametrize gene flow (i.e., migration) and incomplete lineage sorting, which happen when the alleles of a certain gene coalesce deeper in the species tree than the species divergence. Indeed, these two phenomena are the primary causes of gene tree discordance when sampling unlinked genes. The eukaryotic nuclear genome usually consists of several chromosomes, and within each chromosome, recombination occurs between linkage groups. By contrast, organellar genomes, which are haploid, are usually considered as non-recombining (but see e.g.,
Population genetics aims to understand how and why allelic frequencies vary within and between present-day populations. Two approaches exist for investigating ancestor-descendant relationships that centre on genetic drift. One is a prospective (forward-in-time) approach, where the probability of identity-by-descent for allele copies (i.e., the probability that allele copies descend from a single common ancestor) is evaluated. Ancestor-descendant relationships are traced forward in time in order to understand the present pattern of allele copies.
The other approach is a retrospective/backward probabilistic approach called coalescent theory (Figure
The basic, and most simple coalescent model, assumes that the population conforms to the idealised conditions set by the geneticists
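The following Python sketch simulates this basic coalescent: in an idealised diploid Wright-Fisher population of effective size N_e, the waiting time while k lineages remain is exponentially distributed with rate k(k − 1)/(4N_e) per generation. The sample size and population size are arbitrary illustrative values:

```python
import random

def tmrca(n_lineages: int, N_e: int) -> float:
    """Time (in generations) to the most recent common ancestor of
    n sampled allele copies in an idealised diploid Wright-Fisher
    population of effective size N_e."""
    t, k = 0.0, n_lineages
    while k > 1:
        # Waiting time to the next coalescence while k lineages remain.
        t += random.expovariate(k * (k - 1) / (4 * N_e))
        k -= 1  # two lineages merge into their common ancestor
    return t

random.seed(1)
sims = [tmrca(10, 10_000) for _ in range(2_000)]
print(f"mean T_MRCA ~ {sum(sims) / len(sims):,.0f} generations")
# Theory predicts E[T_MRCA] = 4 * N_e * (1 - 1/n) = 36,000 generations here.
```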
In the absence of migration between the tree branches, the gene tree splits will always be as old or older than the population branching. As the gene tree branching orders are random in Wright-Fisher populations, the MSC model efficiently handles incomplete lineage sorting (ILS), which is one reason why gene trees are different from species trees. Given the ambiguity of the term “species”, it was perhaps unfortunate that it was used for coining the name of the model.
As with parametric phylogenetic methods in general, parametric phylogenetic species delimitation methods can be based on the Maximum Likelihood (ML) criterion, or on Bayesian approaches. These can be further divided into implementations that use an exact likelihood function, which estimates all parameters of the model, and approximations, where some parameters are fixed. According to
Approximate likelihood ML phylogenetic methods typically work by dividing the gene tree and species tree estimation into two steps, such that gene trees obtained from phylogenetic analyses of each alignment become input data for the species tree estimation. Thus, the gene trees are point estimates for the genealogies at each locus. In addition, the effective population size is a crucial parameter for the MSC model and finding the maximum likelihood value for it is computationally intractable (
Summary of different species delimitation methods.

| Method name | Approach | Statistical framework | Input data | Likelihood function | Example studies using the method |
|---|---|---|---|---|---|
| BP&P ( | discovery/validation | Bayesian | Multiple sequence alignment (MSA) | Full likelihood | |
| SpedeSTEM ( | validation | Maximum likelihood | Gene trees | Approximate likelihood | |
| Heuristic method ( | discovery | Maximum likelihood | Gene trees | Approximate likelihood | |
| STACEY ( | discovery | Bayesian | Multiple sequence alignment | Full likelihood | |
| DISSECT ( | discovery | Bayesian | Multiple sequence alignment | Full likelihood | |
| PHRAPL ( | validation | Model selection | Gene trees | Approximate likelihood | |
Bayesian full-likelihood implementations of the MSC model can theoretically accommodate unlimited numbers of sequences per locus, but are dependent on the approximations of the posterior densities that the Markov Chain Monte Carlo (MCMC) techniques provide. Extensive exploration of convergence and mixing is necessary to ensure that the results from MCMC are reliable (
As the MSC model is ultimately based on a phylogenetic tree, parametric implementations of species delimitation in essence identify extant species as the tip branches of the species tree. Software implementations such as *BEAST and StarBeast2 assume that sequences are assigned to the correct species, which are defined to be Wright-Fisher (WF) populations where the gene trees are distributed according to the coalescent model. DISSECT (
As alternative hierarchical species delimitation models differ with respect to the assignments of sequences to species, the resulting stochastic models have different sets of parameters. To evaluate the fit of the data to different delimitations, model selection criteria are relevant. In a maximum likelihood framework, hierarchical likelihood ratio tests can be applied when models are nested (e.g., when the split of A and B is compared with A and B treated as a single species). However, such methods cannot be applied when classifications are non-nested, e.g., when AB and C is compared with A and BC. In such cases, information-theoretical approaches must be applied (
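The following Python sketch (using SciPy, with hypothetical log-likelihood values) illustrates both situations: a likelihood ratio test for nested delimitations, and the Akaike information criterion (AIC), one common information-theoretical approach, for non-nested ones:

```python
from scipy.stats import chi2

def lrt_p_value(lnL_null: float, lnL_alt: float, df: int) -> float:
    """Likelihood ratio test for nested models, e.g., 'A and B are one
    species' (null) versus 'A and B are distinct species' (alternative)."""
    return chi2.sf(2 * (lnL_alt - lnL_null), df)

def aic(lnL: float, k: int) -> float:
    """Akaike information criterion, usable for non-nested delimitations
    such as AB|C versus A|BC; lower values indicate a better trade-off
    between fit and number of parameters."""
    return 2 * k - 2 * lnL

# Hypothetical log-likelihoods from two non-nested delimitation models:
print(aic(lnL=-1204.2, k=5))  # delimitation AB|C
print(aic(lnL=-1201.8, k=7))  # delimitation A|BC
```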
The methods cited above assume no migration (hybridization, horizontal gene transfer) between branches, and instantaneous “speciation”, i.e., divergence is completed in one generation and no migration is permitted after that. A more flexible, approximate likelihood approach to species delimitation is provided by PHRAPL (
All the species delimitation models that we have introduced so far were developed for diploid genomes. However, allopolyploidy is traditionally thought of as a speciation mechanism, where the allopolyploid hybrid instantaneously becomes reproductively isolated from its parents. Under this view, the problem of species delimitation becomes a problem of tracing the allopolyploidization event, and species delimitation of the descendants will follow the same logic as species delimitation of diploid genetic lineages. The models mentioned below are phylogenetic methods, which can potentially be extended in a similar fashion to the MSC-based methods described above. However, a special complication is the fact that it is usually difficult to assign sequences to subgenomes a priori.
The traditional way to model phylogenetics, and indeed also the MSC model, assumes reproductively isolated species (no migration after divergence) and bifurcating phylogenies. The genetic information is transmitted from ancestors to descendants without modelling gene flow between branches, and with bifurcations representing the speciation events.
However, hybridization and introgression are common natural processes which challenge these assumptions. Hybridization can be followed by whole genome duplication (WGD): this phenomenon is called allopolyploidization and is a significant factor in speciation due to the reproductive isolation of the newly formed polyploid from its diploid parents. Note that WGD may also occur within lineages and is then termed autopolyploidy. Here, we concentrate on the former type.
WGD is characteristic of all major land plant lineages (
Reticulation events challenge the biological species concept, which states that species are different entities that cannot interbreed to produce fertile offspring. This view leads back to the philosophical perception of species and the parameters describing these entities. A hypothetical species genome undergoing several rounds of allopolyploidization will end up with subgenomes carrying genetic information difficult to trace back. In other words, this hypothetical genome would be a mosaic or a melting pot of the parental genomes. With such a changing genome, how can we identify a reticulate entity according to set parameters?
Inferring species boundaries amongst extant allopolyploid plants requires identification of the two parental sub-genomes to allow accurate inference of allopolyploid ancestral events. The genes used to trace the evolution of a polyploid genome carry information from both parental genomes. Note however, that there may have been “normal” branching speciation events after the polyploidization events, and the parental species may have gone extinct. Therefore, a phylogenetic view is necessary.
Chloroplast and mitochondrial DNA usually carry information only from one parental genome, usually the maternal lineage. Nuclear ribosomal DNA (nrDNA), which in eukaryotes contains many tandem repeats (
In order to trace polyploid ancestry, genomes must be resolved at the haplotype level. To recover the full haplotype sequence, the DNA reads must overlap and the overlap should cover informative SNPs. Under these conditions, correct haplotype phasing can be achieved for diploid genomes (
Phylogenetic methods tracing allopolyploidy aim at assigning homoeologs (i.e., subgenomes) to parental genomes. However, the task is challenging for two reasons: biological phenomena such as recombination and gene loss result in the partial loss of parental genetic information, and secondly, modelling hybridization is computationally challenging. AlloppNET (
AlloppNET is implemented in the BEAST1 framework (
The MDC approach by
PADRE (Package for Analysing and Displaying Reticulate Evolution;
Conceptually, various species concepts attempt to accommodate genealogical, phenotypic, as well as future aspects, and these need not lead to identical delimitations. There is an emerging view of species as the branches of the phylogenetic tree, and we have focused on species as historical individuals composed of vertically transmitted genetic information. The MSC model allows scientists to view genetic data and rigorously test monophyly as well as branch content. However, most current implementations of the MSC model identify the branches as what most biologists would view as populations, and furthermore, they are not capable of including migrations of alleles. STRUCTURE-like methods have the capability to cluster alleles directly, but depend on assumptions similar to those of the MSC model, and lack a phylogenetic component. In principle, the MSC model can be extended to accommodate migrations, and a few recent attempts exist (e.g., DENIM; PHRAPL). A pluridisciplinary approach, involving genomic and evolutionary concepts implemented in a powerful statistical framework, is anticipated for future progress. Beyond its importance within biology, species delimitation has important societal implications. The current “sixth mass extinction” calls for implementing conservation programs that use appropriate species richness assessments and species definitions in order to accurately measure and limit biodiversity loss. The necessity to agree on a given species definition in a given context (e.g., biodiversity erosion) does not in itself solve the ontological question “what is a species?”. Fast-moving next-generation sequencing technologies disclose the genomic information needed to study virtually any taxonomic group, but there is an urgent need for conceptual development, as well as for suitable models with sufficient biological realism in which to view the data.
According to the latest taxonomic revision (based on morphology) a certain taxon includes 13 species, four species are tetraploid, and the others are diploid. You can afford to sample and generate DNA sequences from multiple loci of 96 individual plants. You want to test whether the taxonomy corresponds to a coalescent-based species delimitation. Given that the assumptions of the MSC model are fairly well met by your data, what properties will the species delimited by STACEY have? Which species concept would be the most relevant, if any, and why?
The 13 taxonomic species in the previous example cover a very large area of the Northern Hemisphere. Your MSC-based analysis identified most of your 96 sampled individuals as separate species. In most cases, these are only separated based on the DNA information you have sampled. On the other hand, all four tetraploid species are resolved as having separate allopolyploid origins, with some divergence after these. The diploid taxonomic species form moderately to well supported clades.
Which assumptions would these results be based on? What could have violated those, in terms of sampling and biological processes not accounted for in your model?
Given that you trust your results, which taxonomic decisions should be made, if any? Give arguments for different scenarios.
You receive a dataset with DNA sequences covering 10 different populations with 50 individuals each. Presumably, they all belong to the same taxonomic species. However, different phenotypes are observed in some populations. You want to know if they belong to the same species based on genetic data but you are also interested in the genetic structure of the population. However, the phylogenetic divergent time is not your first interest. Which class of methods would you use?
You could use a phylogenetic method allowing for allopolyploidization, AlloppNET. However, that method assumes that you know the correct species assignments (according to the MSC model). You could therefore first assign homeologs to subgenomes of the polyploids using an MDC approach, then run a Bayesian full-likelihood model (e.g., STACEY) treating the subgenomes as separate diploids, and finally run AlloppNET using the resulting delimitations. Alternatively, you could run MDC with PADRE network transformation.
a. The Wright-Fisher assumptions for the coalescent process, and the sequence evolution model for the gene trees (strict/relaxed clock, Jukes-Cantor, GTR, etc.). Note that if there is substantial migration among branches, the trees can be grossly misleading.
b. Clades can be viewed as historical individuals, so the obtained results would not reject the current taxonomy. With a concept of taxonomic species as being branches of the evolutionary tree, you would have to split into many species.
Allelic clustering (STRUCTURE-like) methods (population genomics). They are traditionally used in population genetics because alleles are directly clustered into bins, enabling tests of fit to Hardy-Weinberg equilibrium. However, these methods are not coalescent-based and do not provide information about the divergence times of populations/species.
In single-locus approaches, one gene is used to build a gene tree that serves as an estimate of the species tree. In other words, a single gene genealogy is assumed to accurately represent the species phylogeny. Single-locus phylogenetic methods impose a strict threshold of reciprocal monophyly for delimiting species and aim to detect discontinuities in sequence variation, under the assumption that interspecific divergence exceeds intraspecific variation. Phylogenetic single-locus approaches are rooted in the phylogenetic species concept: they aggregate predefined populations with unique nucleotide differences into a single species. These methods rely on the assumption that species are monophyletic (i.e., there is no ancestral polymorphism and the sorting of alleles is complete) for the gene studied. It is assumed that discrete differences in sequence variation are observable within and between species, as a result of allele fixation in species lineages. In other words, reciprocal monophyly of alleles is assumed for the gene under study. However, reciprocal monophyly among lineages is rather improbable, particularly in recent speciation (Rosenberg 2003; Knowles and Carstens 2007). Some statistical methods are specifically designed for single-locus data. The GMYC method (Generalised Mixed Yule Coalescent; Pons et al. 2006) identifies the position in a gene tree where the branching process switches from a birth/death process with constant branching probability to a coalescent process. The Poisson Tree Processes (PTP) model identifies Yule-to-coalescent transition points based on the change in substitution rates on the phylogenetic input tree (Zhang et al. 2013). The ABGD method (Puillandre et al. 2012) detects significant differences in intra- and interspecific pairwise distances (i.e., barcoding gaps). Although single-locus approaches have some utility in large-scale datasets, serious concerns about their accuracy in delimiting species boundaries have been raised.
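As a minimal illustration of the barcoding-gap logic exploited by methods such as ABGD, the following Python sketch (with toy aligned sequences and species labels) computes pairwise p-distances and compares the largest intraspecific distance with the smallest interspecific one:

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Toy aligned barcodes, labelled with their (morphological) species.
samples = {
    "sp1_a": "ACGTACGTAC", "sp1_b": "ACGTACGTAT",
    "sp2_a": "ACGAACCTAC", "sp2_b": "ACGAACCTGC",
}
intra, inter = [], []
for (n1, s1), (n2, s2) in combinations(samples.items(), 2):
    same_species = n1.split("_")[0] == n2.split("_")[0]
    (intra if same_species else inter).append(p_distance(s1, s2))

# A 'barcoding gap' exists when the largest intraspecific distance is
# clearly smaller than the smallest interspecific distance.
print(max(intra), min(inter))  # 0.1 < 0.2 here
```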
Concatenation methods are multi-locus approaches where unlinked loci are concatenated into a supermatrix of genes. The assumption is that all genes have evolved according to the same tree, which is used as an estimate of the species tree. However, this approach oversimplifies the biological processes involved in speciation events, which violate the assumption of gene tree congruence across multiple loci. Incomplete lineage sorting (ILS) and gene flow are the two main phenomena responsible for gene tree discordance. Simulations provide mathematical evidence that concatenation methods can yield topologically inconsistent genealogies in some regions of tree space (Kubatko and Degnan 2007). This is supported by empirical data (Jiang et al. 2020) showing that gene tree congruence is rare in multilocus datasets.
Phen Garrett1, Shyam Gopalakrishnan2
1 University of Copenhagen, Copenhagen, Denmark
2 GLOBE Institute, SUND Department, University of Copenhagen, Copenhagen, Denmark
Phen Garrett
Shyam Gopalakrishnan
Plant DNA can be extracted for species identification from a wide variety of sample types, ranging from fresh, museum, or ancient plant tissue collections that represent a single taxon, to highly processed samples that contain multiple individuals or taxa, including food and medicine (Chapter 6 DNA from food and medicine), water (Chapter 3 DNA from water), soil (Chapter 4 DNA from soil), pollen (Chapter 5 DNA from pollen), faeces (Chapter 7 DNA from faeces), or ancient sediments (Chapter 8 aDNA from sediments). Section 2 of this book explores how DNA can be used for plant identification through either targeted (where select regions of the genome are used) or non-targeted (producing representations of the full genome) approaches. Targeted approaches include barcoding for single taxon samples (Chapter 10 DNA barcoding, Chapter 13 Barcoding - High Resolution Melting, and Chapter 14 Target capture) and metabarcoding for samples representing multiple taxa (Chapter 11 Amplicon metabarcoding). Non-targeted approaches form the field of genomics. For single taxon samples, genome resequencing and whole genome sequencing (Chapter 16 Whole genome sequencing) are used, while in samples containing multiple taxa, metagenomic methods are used (Chapter 12 Metagenomics).
Studies conducting species identification either start from samples of known identity to explore unknown relationships, or use known identifications to assign identity to unknown samples. In the first category, labelled samples are used to explore evolutionary relationships and assign species identity based on some measure of distance clustering (Chapter 19 Systematics and evolution, Chapter 20 Museomics, and Chapter 21 Palaeobotany). In contrast, the second category of studies utilises databases with predefined species classifications to assign identity to unknown samples (Chapter 22 Healthcare, Chapter 23 Food safety, Chapter 24 Environmental and biodiversity assessments, Chapter 25 Wildlife trade, and Chapter 26 Forensic genetics, botany, and palynology).
The analytical methods used for species identification can be categorised into three groups: i) database alignment analyses, ii) alignment-free methods, and iii) sample alignment analyses (Box 1).
In this chapter, we outline common sequence pre-processing steps used in species identification projects, and then discuss how species identification from sequencing data can be accomplished using the three analytical categories mentioned here.
Database alignment analyses deal with single- or mixed-taxon samples (mostly the latter) generated with targeted or non-targeted molecular methods. These analyses start with unknown samples and attempt to assign known identifications using a pre-existing database of sequences linked to known identities.
Alignment-free analytical methods utilise single-taxon sample data, mostly from non-targeted molecular methods; they start with known samples and apply distance-based clustering to explore genetic similarity and infer relationships.
Sample alignment analyses also target single-taxon samples, can be applied to sequence data from targeted or non-targeted molecular methods, and again start with known samples to explore evolutionary relationships and genetic distance or similarity between samples, in order to establish or refine identifications. This category includes de novo assemblies and the creation of reference genomes.
“Garbage in garbage out” is a phrase that any experimentalist should keep in mind when setting up a species identification project. Obtaining robust and accurate species identities from sequencing data requires that input reads are high-quality and filtered for contamination and sequencing errors. This section outlines the steps necessary for firstly checking that data is of sufficient quality for species identification, as well as the sorts of processing steps that are necessary for sequence data analysis.
Sequencing reads generated on short and long read platforms contain artefacts that need to be filtered or corrected in order to isolate high-quality reads for use in downstream analyses. Sequencing artefacts include reduction in read end base quality in short read data (common in Illumina sequencing), and amplified rates of homopolymer errors in longer read data generated with Nanopore technologies (Chapter 9 Sequencing platforms and data types). Correcting these errors is a mandatory first step in most bioinformatics analyses, as poor quality control of raw sequence reads can result in inconclusive or incorrect species identification. Several quality control software packages including FastQC (Andrews et al. 2010), MultiQC (
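Dedicated tools should be preferred in practice, but the core of a per-base quality summary is simple; the following Python sketch (the file name is hypothetical) averages Phred+33-encoded base qualities at each read position of a gzipped FASTQ file:

```python
import gzip
from statistics import mean

def mean_quality_per_position(fastq_path: str, n_reads: int = 10_000):
    """Rough, FastQC-like summary: mean Phred quality per read position
    over the first n_reads reads of a gzipped FASTQ file."""
    per_position = []
    with gzip.open(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i // 4 >= n_reads:
                break
            if i % 4 == 3:  # every fourth FASTQ line is the quality string
                for pos, char in enumerate(line.rstrip("\n")):
                    if pos == len(per_position):
                        per_position.append([])
                    per_position[pos].append(ord(char) - 33)  # Phred+33
    return [mean(qs) for qs in per_position]

# Positions where the mean quality drops below ~20 are trimming candidates:
# print(mean_quality_per_position("sample.fastq.gz")[-10:])
```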
Sequence library preparation methods append non-representative, non-biological sequences, such as adapters and tags for multiplexing, to the DNA fragments. These sequences should therefore be removed during sequence processing to avoid failure in species identification or even a false species identification. Tools such as AdapterRemoval (
PCR (polymerase chain reaction) amplification of template DNA can introduce errors including artificial base differences, chimeras, and heteroduplex molecules (
Targeted sequencing experiments, where a specific region of the genome or plastome is sequenced, require a few additional quality control steps to remove sequencing artefacts. Tools such as obitools (
There are specific challenges to be considered when analysing ancient DNA samples, including archaeological and herbarium samples. DNA damage, primarily driven by chemical changes in the DNA post-mortem, is prevalent in aDNA samples (see Chapter 2 DNA from museum collections). Programs such as mapDamage (
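Tools such as mapDamage model these damage patterns in full, but the underlying signal is easy to illustrate. The following Python sketch (with toy, gap-free reference/read pairs) estimates the frequency of C→T mismatches as a function of distance from the 5′ read end, the hallmark of post-mortem cytosine deamination:

```python
from collections import defaultdict

def c_to_t_by_position(aligned_pairs, window: int = 10):
    """Frequency of C->T mismatches per position from the 5' read end.
    `aligned_pairs` is an iterable of (reference, read) strings of equal
    length, assumed gap-free for simplicity."""
    ct = defaultdict(int)        # C->T mismatches per position
    c_total = defaultdict(int)   # reference Cs per position
    for ref, read in aligned_pairs:
        for pos in range(min(window, len(ref))):
            if ref[pos] == "C":
                c_total[pos] += 1
                if read[pos] == "T":
                    ct[pos] += 1
    return {pos: ct[pos] / c_total[pos] for pos in c_total}

# An elevated C->T rate at position 0 that decays inward is typical of aDNA.
print(c_to_t_by_position([("CCGA", "TCGA"), ("CAGT", "CAGT")]))
```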
Database alignment methods are the most intuitive class of search-based species identification from sequencing data and have been used for the better part of the last three decades to identify species that are the putative sources of sample DNA or protein sequences. These methods compare the sequencing reads, either directly in the form of short reads or in the form of assembled contigs, to a reference database of curated sequences. Widely used alignment tools include BLAST (
In theory, alignment-based approaches using databases can be used for species identification on sequences generated from the entire spectrum of molecular methods detailed previously. However, high computational requirements coupled with logistical issues such as the unavailability of appropriate databases make these methods best suited to targeted sequencing approaches, especially barcoding and metabarcoding. In these approaches, only a limited number of unique sequences are used in the initial data input, making these methods substantially less computationally expensive.
The database choice plays an integral role in the sensitivity and specificity of local alignment algorithms and whether the alignment approaches return a species identification. Accurate and positive species identifications are more likely with databases containing high numbers of closely related species. Global databases, such as the NCBI nucleotide database and NCBI non-redundant protein database (
Alternative options to consider are national or local sequence databases that have been assembled by genetic and genomic researchers to represent the species of a country or region. Prime examples include DNAmark (
Does the choice of using DNA or protein make a difference in the database alignment algorithm? Yes! DNA sequences provide more sensitivity while amino acid sequences are more robust. What do we mean by that? DNA sequences can provide a better resolution in terms of describing the evolutionary relationships between closely related species. Proteins on the other hand can illuminate much older evolutionary relationships, and tend to provide more robust identifications (
Alignment based methods can be used on both raw short reads directly from the sequencing machines and on assembled contigs, where multiple short reads are stitched together into longer stretches of DNA. These approaches come with their own pros and cons. Ease of use is the primary selling point in using short reads directly from the sequencing machine. Using assembled contigs requires additional steps, but the increased length can result in lower error rates and longer read regions, leading to better resolution. The use of assembled contigs additionally takes advantage of databases that allow for alignment of longer regions, including possibly the entire target region (
The rapid advance and adoption of second and third generation sequencing technologies has led to an exponential increase in the numbers of sequencing studies that employ either whole genome resequencing or genome skimming to characterise sample genomes. With these large genomic datasets, alignment based approaches can be computationally taxing (
Alignment-free approaches come in many flavours, including k-mer based methods, micro-alignments, fourier transformation methods, and information theory methods (
Alignment-free methods are primarily restricted to non-targeted sequencing approaches. This is because the short length of targeted regions yields a limited number of k-mers, which restricts the ability of these approaches to produce meaningful inferences. Although k-mer based methods might look like ideal candidates for use in metagenomics, the fact that metagenomic samples are derived from multiple sources in varying proportions makes it difficult to successfully isolate individual taxa (
The depth to which samples are sequenced affects the accuracy of the dissimilarity metric estimates computed in alignment-free methods. As sequencing depth decreases, the variance of these estimates increases even if the estimates themselves remain unbiased. This variance is propagated into downstream analyses. Thus, the robustness of these methods should be verified when using very low coverage sequencing data (
Contamination can also be tricky to deal with in alignment-free methods using mixed bags of raw sequencing reads, and therefore filtering for contamination using tools such as BlobTools (
K-mer frequency profiles of sequences are used to compute dissimilarity scores between those sequences. There are many distance metric options that can be used to compute the dissimilarity score, e.g., Euclidean, inner product, Kullback-Leibler divergence (relative entropy), and mismatches (Jaccard). The most commonly used distance metric is the Jaccard distance, since it is easy to compute and corresponds to nucleotide changes. Specifically, the Jaccard distance ranges from 0.0 to 1.0, where 0.0 corresponds to identical k-mer profiles, and 1.0 implies no overlap in k-mers. By computing the pairwise Jaccard distances between sequences from an unknown sample and a set of reference sequences with known species identity, we can assign our unknown samples to the closest species among the set of reference sequences. Further, the dissimilarity measures can be used to build a phylogeny of the sequences (
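A minimal Python sketch of this assignment procedure follows (with toy sequences; real applications use longer k-mers, e.g., k = 21 or 31, and sketching data structures such as MinHash for scale):

```python
def kmer_set(seq: str, k: int) -> set:
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A intersect B| / |A union B|: 0.0 for identical k-mer
    profiles, 1.0 when the two profiles share no k-mers."""
    return 1.0 - len(a & b) / len(a | b)

query = kmer_set("ACGTACGTGGCTTAGCATCGGATCGATCGTAGCTAGCT", k=11)
references = {
    "species_1": kmer_set("ACGTACGTGGCTTAGCATCGGATCGATCGTAGCAAGCT", k=11),
    "species_2": kmer_set("TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGG", k=11),
}
# Assign the unknown sample to the reference with the smallest distance.
best = min(references, key=lambda sp: jaccard_distance(query, references[sp]))
print(best)  # species_1
```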
An advantage of k-mer based methods is their applicability to different sequencing data types, which allow combining sequence data from different experiment types. For example, one can compute the k-mer frequency profiles directly from the reads or from scaffold sequences. All subsequent steps to compute distances can be applied without regard to potentially different sequence sources.
Several alignment-free methods have been developed in the last few years, incorporating several of the k-mer algorithms (
Sample alignment methods are the foundation of molecular taxonomy, phylogenetic classification, and population genetics, and allow the exploration of evolutionary relationships and genetic distance between samples. These methods include de novo assemblies and the creation of reference genomes, as well as assembly or mapping using a reference. There is inherent bias in terms of reference availability, and inadequate reference mapping can result in skewed representations of genetic similarity in downstream analysis.
The use of references to inform the assembly of contigs to produce scaffolds and create sample-specific consensus sequences representing genes, gene regions, and genomes is inherently biased towards the available references. Popular mapping tools include BWA (
High quality, in-depth sequencing is required to produce a de novo assembly. A de novo assembly will however avoid any inherent biases introduced by using references for assembly, and can in turn be used as a reference in future projects. For an outline of the processes involved in de novo assemblies, please see reviews by (
The foundation of any tree-building or comparative gene analysis is multiple sequence alignment (MSA). MSA matches up homologous regions of the genome across samples and allows for comparison. MSA algorithms are based on maximising sum-of-pairs scores through heuristic progressive (input-order dependent) alignments (
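The following Python sketch (with illustrative match/mismatch/gap scores) computes the sum-of-pairs score of a small alignment, the objective that progressive MSA heuristics approximate, since exact optimisation is intractable for many sequences:

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum of pairwise column scores over all columns of an MSA."""
    score = 0
    for column in zip(*alignment):          # iterate over columns
        for a, b in combinations(column, 2):
            if a == "-" or b == "-":
                score += gap if a != b else 0   # gap-gap pairs score 0
            else:
                score += match if a == b else mismatch
    return score

print(sum_of_pairs(["ACG-T", "AC-GT", "ACGGT"]))  # 3
```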
Widely used MSA tools include ClustalW (
There are several important factors that can determine or influence which species assignment method is ultimately chosen. The study design and experimental question, as well the DNA source and extraction methods are important factors. For example, genome skimming and metagenomic studies might be well suited to alignment-free methods (
It is thus important to be aware of the strengths and limitations of different species assignment methods and to choose the method best suited to the biological questions being posed and the experimental design used to generate the sequencing data. For alignment based methods, it is important to remember that the species identification results are only as good as the databases the sequences are being aligned to; this applies to both targeted sequencing and genomic studies. Further, results from alignment against large databases must be interpreted carefully, since the order of the results is dependent on both the sequence identity and the number of times a certain species is represented in the database. For alignment-free methods, such as k-mer based approaches, sequencing depth and the quality of the k-mer profiles from target species (database) are important factors. Also note that the value of k in the k-mer profile generation is an important parameter to tune. Finally, metagenomic taxonomy assignment tools again depend, in varying degrees, on external databases for identification of taxa.
What are several sequencing artefacts which need consideration and removal?
What difference does the choice of using DNA or protein in the database alignment algorithm make?
What types of sequencing approaches are alignment-free analytical methods primarily used for?
Sequencing artefacts include: low base quality, over-representation of short k-mers, presence of adapter sequences, GC biases, reduction in read end base quality in short read data, and amplified rates of homopolymer errors in longer read data.
DNA sequences provide more sensitivity and can provide a better resolution in terms of describing the evolutionary relationships between closely related species. Proteins on the other hand can illuminate much older evolutionary relationships, and tend to provide more robust identifications.
Alignment-free methods are primarily restricted to non-targeted sequencing approaches.
Ntwai A. Moiloa1,2, Meshack N. Dludlu3, Abubakar Bello4, Zaynab Shaik1,2,5, A. Muthama Muasya6, Bengt Oxelman1,2
1 Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden
2 Gothenburg Global Biodiversity Centre, Gothenburg, Sweden
3 Eswatini Institute for Research in Traditional Medicine, Medicinal and Indigenous Food Plants (EIRMIP), University of Eswatini, Kwaluseni, Eswatini
4 Department of Biology, Faculty of Natural and Applied Sciences, Umaru Musa Yar’adua University, Katsina State, Nigeria
5 Department of Botany and Zoology, Natural Sciences Building, Stellenbosch University, Stellenbosch, South Africa
6 Department of Biological Sciences, University of Cape Town, Cape Town, South Africa
Ntwai A. Moiloa
Meshack N. Dludlu
Abubakar Bello
Zaynab Shaik
A. Muthama Muasya
Bengt Oxelman
Systematics is the field of biology that studies biological diversity (or biodiversity) and its evolutionary history (
An integral part of systematics is taxonomy, which focuses on the identification, description, naming, classification and inventory of taxa (
A major challenge that comes from these different species concepts is that they may be incompatible and often lead to different conclusions on the boundaries of what should be considered the same or different species (de Queiroz 2007). Given the variety of species concepts and their definitions, de Queiroz (2007) suggested two solutions. The first solution identifies commonalities in the different species concepts, resulting in a unified concept where species are defined as separately evolving metapopulation lineages. The second solution emphasises the necessity of separating the problem of species concepts from that of species delimitation. In the present chapter, species delimitation refers to the practice of determining boundaries of species based on empirical data. For a comprehensive discussion of the various species concepts and delimitation approaches the reader is referred to Chapter 17 Species delimitation.
Commonly used methodological approaches in plant systematics include traditional comparative morphological/anatomical systematics, chemosystematics, and molecular systematics, which utilise different sources of data as input for inference. In traditional comparative morphological/anatomical systematics, the grouping of taxa is primarily based on morphological/phenotypic similarity (
The primary aim of systematics is to recognise evolutionary lineages where the genotypes are reproduced through time. The phenotypes are ephemeral manifestations of these genealogical lineages. Historically, the Aristotelian view of taxa (“natural kinds”) having essential features (i.e., to qualify as a vertebrate, the organism must develop vertebrae) has dominated biological systematics. Some philosophers argue that the essence of those natural kinds may exist regardless of humans’ abilities to recognise them, but there is no doubt that essentialism has played a great role in recognition of many taxonomic groups where certain phenotypic traits have been used for defining specific taxa. The development of evolutionary theory has provided systematists with the concept of monophyly, which ultimately is based on genealogical relationships. By using phylogenetic methodology, monophyletic groups (clades) sharing a common ancestry can be recognised. Both phenotypic and genotypic data can be useful for this, but the former is considered a proxy for the latter. Thus, while recognising the enormous importance of phenotypic data for the primary identification (i.e., classification and nomenclature) of taxa (and of course of general biology), we will in the following focus on the genetic data.
The use of internal or anatomical features in taxonomy began with the development of microscopes powerful enough to visualise the internal structures of organs and tissues (
Most chemotaxonomists recognise three broad categories of chemical compounds as taxonomically important: primary metabolites, secondary metabolites, and semantides (
The number of chromosomes in each cell of all individuals of a species is usually constant and more closely related species are likely to have similar haploid chromosome numbers (
Embryonic development and structure have historically been used at different levels of classification. For example, the basic division of the plant kingdom into two units, the
Studying plant pollen and spores is useful for determining species relationships in plants (
Morphological and genetic analysis of fossil material from pollen, leaves, stems, and other plant parts are used to trace evolutionary developments through stratigraphic sequences and also predicting past ecological conditions (
Strictly, DNA constitutes the genotype, while RNA, proteins, and associated structures belong to the phenotype. Nevertheless, DNA, RNA, and proteins can all be used to detect basic genotype changes. Very often, nucleotide substitutions are neutral and either do not change the amino acid sequence of the protein they encode, or result in minimal changes in the amino acid sequence (
Several methods have been developed to either generate sequence data for whole genomes (whole genome sequencing, WGS), or sample a subset of specific loci from across the genome (
Since the advent of molecular phylogenetics in the late 1980s, several theoretical approaches for reconstructing relationships in the Tree of Life have been developed (
A second set of methods, under the umbrella term “optimality approaches”, searches for the optimal tree in the full tree space using predefined criteria. This includes minimum evolution, which optimises the tree that minimises the sum of pairwise distances as expressed on the tree (
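Maximum parsimony is another classic optimality criterion. As an illustration, the following Python sketch implements Fitch's algorithm, which counts the minimum number of character changes that a given rooted binary tree requires to explain one aligned site:

```python
def fitch(tree):
    """Fitch small-parsimony count for one character on a rooted binary
    tree given as nested tuples, with observed states at the leaves.
    Returns (candidate state set, minimum number of changes)."""
    if isinstance(tree, str):            # leaf: observed character state
        return {tree}, 0
    left, right = tree
    l_set, l_cost = fitch(left)
    r_set, r_cost = fitch(right)
    if l_set & r_set:                    # intersection: no change needed
        return l_set & r_set, l_cost + r_cost
    return l_set | r_set, l_cost + r_cost + 1   # union: one extra change

# Parsimony score of one site on the tree ((A,C),(A,T)):
states, changes = fitch((("A", "C"), ("A", "T")))
print(changes)  # 2 changes are required under this topology
```

Scoring every candidate topology this way, summed over all sites, and keeping the lowest total is the essence of a parsimony tree search.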
Bayesian methods use Bayes’ theorem to estimate the probability of a tree (including topology, branch lengths, and parameters in the underlying model of sequence evolution) given the alignment data (
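As a toy illustration of Bayes' theorem in this setting, the following Python sketch (with entirely hypothetical likelihood values, and exhaustive enumeration in place of the MCMC sampling used for real tree spaces) computes posterior probabilities for the three possible rooted topologies of three taxa under a uniform prior:

```python
# Hypothetical likelihoods P(data | tree) for the three rooted topologies.
likelihoods = {"((A,B),C)": 2.0e-12, "((A,C),B)": 5.0e-13, "((B,C),A)": 1.0e-13}
prior = 1 / len(likelihoods)  # uniform prior over topologies

# Bayes' theorem: P(tree | data) = P(data | tree) * P(tree) / P(data),
# where P(data) is obtained by summing over all trees.
evidence = sum(lik * prior for lik in likelihoods.values())
posterior = {t: lik * prior / evidence for t, lik in likelihoods.items()}
for tree, p in posterior.items():
    print(f"{tree}: {p:.3f}")   # ((A,B),C): 0.769, etc.
```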
An important finding when sequencing multiple loci across different accessions was that a set of genes for the same group of taxa often supports different branching patterns in the gene trees. A number of phenomena are responsible for this discord among gene trees, including the incomplete sorting of ancestral polymorphisms (incomplete lineage sorting or ILS), gene duplication and loss, horizontal gene transfer, and branch length heterogeneity (
The MSC models ILS by assuming that the degree of incongruence among gene-trees is positively related to effective population size and negatively related to the times between lineage divergences (
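For the simplest case of three species with species tree ((A,B),C), this relationship is captured by a classic closed-form result: the probability that a sampled gene tree is discordant with the species tree is (2/3)·e^(−t), where t is the internal branch length in coalescent units (t = T/(2N_e) for T generations in a diploid population of effective size N_e). A short Python sketch with illustrative parameter values:

```python
import math

def p_discordant(T_gen: float, N_e: float) -> float:
    """Probability that a gene tree conflicts with the species tree
    ((A,B),C) under the multispecies coalescent."""
    t = T_gen / (2 * N_e)          # internal branch in coalescent units
    return (2 / 3) * math.exp(-t)

# Short internodes and large populations inflate ILS:
print(f"{p_discordant(T_gen=100_000, N_e=10_000):.4f}")   # ~0.0045
print(f"{p_discordant(T_gen=10_000, N_e=100_000):.4f}")   # ~0.6341
```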
The second approach, commonly referred to as co-estimation methods, uses sequence alignments as input data such that gene and species trees can be simultaneously estimated (
While the MSC represents a major advance in modern phylogenetics, it accounts for only one source of gene tree discord; the remaining causes are collectively summarised under the concept of migration, meaning the transfer of alleles between otherwise discrete lineages. Thus, migration in this sense includes processes such as hybridization, introgression, horizontal and lateral gene transfer, and admixture. The classic MSC model assumes that speciation is instantaneous, and that all gene flow ceases directly after two lineages diverge (
The larger number of parameters estimated in IM and MSCi models relative to classic MSC methods improves the biological realism with which the evolutionary process is modelled, but also necessitates a larger number of loci for reliable parameter estimation (
A special form of discrete modelling of migration is posed by allopolyploidy, for which the reader is referred to Chapter 17 Species delimitation for a description of available methodologies.
The introduction of the concept of monophyly (
The integrative taxonomy approach attempts to integrate and use information from several different sources (i.e., morphological, chemical, genomic, ecological, etc.) in order to rigorously delineate species and other taxa. However, this approach has received criticism due to the lack of a clear conceptual and methodological framework, particularly with reference to quantitative criteria. Thus, grouping (i.e., recognition of monophyletic groups) precedes ranking (i.e., choice of level for naming and formal ranking in the taxonomic hierarchy).
The taxon-tree contains clades (= monophyletic groups) of various inclusiveness that may be named and given a rank according to the rules of nomenclature. This ranking process is in principle arbitrary, but various auxiliary criteria, such as the different versions of the phylogenetic species concept (
Application of the MSC model enables rigorous scientific testing of monophyly hypotheses using multi-locus sequence data. Applying species rank to certain clades is valid but will need auxiliary criteria to reduce subjectivity (
Are the terms “systematics” and “taxonomy” synonymous? If not, how do they differ?
Discuss the differences between a taxonomy based on hypothetical evolutionary relationships and one based on the possession of certain traits.
What is the major difference between discrete and continuous phylogenetic models allowing for migration?
Although some use the terms interchangeably, they have different meanings. Systematics is the study of diversification of organisms and their relationships through time, whereas taxonomy refers to the theory and practice of identifying, describing, naming, and classifying organisms, which is an integral part of systematics.
In the first case, real entities, which form parts of the evolutionary history of the lineages are considered. These hypotheses may be wrong or correct, because there is only one history. In the second approach, taxa can be classified under a multitude of different traits, and each of them may be more or less useful for certain purposes, but there is no unique correct classification.
In discrete models, migration only occurs during specific periods of time, resulting in extra, merging branches of the network. In continuous models, migration is a continuous process treated as a parameter in the branching model.
Nataly Allasi Canales1,2*, Nina Rønsted3,2, Jazmin Ramos-Madrigal4,5*
1 Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
2 Royal Botanic Gardens, Kew, United Kingdom
3 National Tropical Botanical Garden, Kalaheo, Hawaii, USA
4 Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
5 Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
Nataly O. Allasi Canales
Nina Rønsted
Jazmin Ramos-Madrigal
* These authors contributed equally.
Have you ever wondered how museum collections can be used for answering fundamental questions about biodiversity and its evolution across space and time? Natural history museums harbour ~3 billion biological specimens that are often linked with a specific collection time and place (
Museomics is the study of biological material from museum collections using genomic techniques that allow the reconstruction of partial or complete genomes. In contrast to single-locus PCR-based approaches, these genomic techniques provide information on a genome-wide scale that can, for example, be used to assess evolutionary and ecological processes (
Although different in age and preservation state, historical DNA from museum collections (typically < 200 years old) can have similar characteristics as ancient DNA (aDNA, typically > 200 years old), such as post-mortem degradation patterns (
The plant material available in museum collections is an indispensable source of genetic information for species that are extinct (Van de Paer et al. 2016;
Herbarium collections include a variety of sample types including herbarium specimens, seeds, wood or xylarium samples, flowers and fruits in alcohol or desiccated, and biocultural or ethnobotanical collections gathered over hundreds of years. They may (1) originate from general collections seeking to represent the world’s biodiversity, (2) have been deposited as vouchers, or (3) have served as reference material for a specific study (
Herbarium specimens generally consist of a pressed plant mounted on acid-free paper. They ideally include leaves, stems, flowers and/or fruits, and roots when possible, and have the necessary plant parts for unambiguous identification. The metadata associated with a specimen should at minimum include the binomial scientific name, who collected it, the collection date, locality, and a unique number. Additional information may include a description of the habitat, and associated plants as well as any other details that cannot be observed from the dried specimen, including specimen’s colours and smell at the time of collection, and any observed visiting insects (
Xylarium, or wood collections, comprise different wood parts of a tree, and such specimens can inform forensics, timber trade, and conservation efforts. A typical specimen is wood stripped of its bark, when present, and cut into the shape of a book. Some collections also include cross-sections, which can provide more ecological and anatomical information than the book-shaped specimens (
Economic botany or biocultural specimens include economically useful plant parts and products such as fruits, bark, seeds, bark cloth, baskets, and papers, used for medicinal, religious, entertainment, and commercial purposes (
The next paragraphs present examples of the applications and impact of museomics research, as well as of the potentially negative implications of unethical use of collections for local communities.
DNA analysis of plant material in museum collections has allowed us to improve their taxonomic annotations and their corresponding scientific value. By analysing their genomes, it is possible to assign taxonomic information to samples that cannot be reliably identified morphologically or that no longer exist. For example, genomics was used for the identification of both endangered and extinct species of Hawaiian endemic mints and the now considered extinct
Genetic analysis of museum samples has increasingly been used to describe evolutionary processes shaping the genetic diversity, population structure, phylogenetic history, and demography of plants. Both Sanger and high-throughput sequencing have been used to obtain partial and complete genomes of plants in museum collections. Combining the genetic analysis with the information contained in their associated metadata increases the scope of the evolutionary inferences that can be made. Information about the collection date and geographic location can be used to directly measure changes in genetic diversity across time and space, the effect of climate change, domestication, human environmental disturbance, and other natural phenomena (
One of the principal applications of museomics has been reconstructing plant species’ phylogenetic histories. Understanding plants’ evolutionary relationships can help refine their taxonomic classification, identify their potential geographic and evolutionary origins, and make predictions on their chemical properties and potential future applications (
Additionally, herbarium material has been used to identify and measure the extent of gene flow (i.e., the exchange of genetic material through interbreeding) among plant populations, such as that occurring between different species of ragweed (
An interesting aspect is the evolution of plants under domestication (i.e., the process through which wild plants became today’s crops). Most of the plants (in volume) that we consume today as food or use in the production of plant-based products are the result of domestication. Museomics has made important contributions to the study of the geographic origins, dispersal patterns, and selective evolution of domesticated species. Herbarium specimens have been used to trace the origin of the European potatoes in the Andes (
Another aspect where museum specimens can be used to provide valuable insights is in the study of genetic erosion, which is the decrease in genetic diversity over time. Samples collected at different points in time are an ideal and reliable way to directly measure changes in genetic diversity through time and in relation with historical, geographic, and climatic changes (
Genetic analyses of museum collections can also be used for the study of ecological processes (i.e., the interactions between plants, animals, and abiotic components in an ecosystem). Herbarium collections can help in the characterization of the distribution and abundance of plant species through time and in measuring changes in biodiversity. By combining their genetic data and metadata, we can measure the habitat ranges of species through time and identify possible associations between such changes and climatic or anthropogenic events. In one example, the genetic analysis of the grass
Multiple studies have used herbarium and xylarium specimens to develop DNA barcode libraries of entire floras or for more specific applied uses within forensics, authentication, and conservation (see Chapter 26 Forensic genetics, botany, and palynology; Chapter 23 Food safety; Chapter 22 Healthcare; Chapter 13 Barcoding - High Resolution Melting).
Developing plant cultivars with desirable characteristics is essential to guarantee food security in the future. One of the initial stages in improving crops is identifying plants that already have certain beneficial traits that can be used in the breeding process (
Plant pathogens cause diseases and losses at different levels from hunger and famines to the extinction of entire species (
Previous studies have shown that museomics can answer questions about the evolution and origin of plant pathogens as in the case of the potato late blight,
Studies on endangered species often lack past diversity estimations or rely on inferring past genetic diversity from modern populations. Museum specimens can provide an important perspective for past population evolutionary events and eventually contribute to the conservation of the species through scientifically supported conservation and management recommendations (
The conservation of rare and endangered plants relies on a sound understanding of their genetic diversity to ensure the health of both wild and ex situ collections, and to avoid or overcome genetic bottlenecks (
Other studies have documented genetic changes of endangered species in response to human disturbance.
International conventions govern accessing, researching, and moving plant material between institutions and countries. Specifically, researchers need to understand and adhere to the Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilisation to the Convention on Biological Diversity.
Additionally, some museums have destructive sampling policies and committees that weigh the pros and cons of destroying precious and unique samples against the possibility of advancing scientific knowledge (
Ideally, researchers aiming to use biocultural or other culturally sensitive collections of high human interest should involve and consult with the Indigenous Peoples and Local Communities from an early research stage, to ensure the fair and equitable use of the collections and their associated information, as well as to take advantage of the knowledge related to the cultural value and uses of the specific plants and artefacts. This is becoming common practice for human paleogenomic studies, but unfortunately not yet for plant material. From the researcher’s perspective, knowing Indigenous communities’ practices and concerns minimises the potential for future unintended cultural harm from paleogenomic studies, and can also yield additional advice on relevant research questions to consider (
Finally, museums have experienced continuous reductions in funding and staff resulting in the lack of curatorial expertise and capacity leading to increasingly orphaned collections (
What types of plant materials can be found in museum collections?
What are the advantages of using museum collections for genomics compared to fresh specimens?
What challenges are faced when using museum collections for genomic studies?
Herbarium, xylarium and other wood specimens, seed collections, and economic botany collections are examples of museum collections that may be useful sources of DNA.
Inference using fossils or other ancient materials adds a reliable temporal dimension to the analyses, which can, for example, be used to measure the evolutionary processes plants have gone through in response to environmental and/or evolutionary changes.
Challenges include the physical availability of sufficient amounts of material, and permits, for destructive sampling from the outset. Museum specimens may not always contain sufficient characters for unambiguous identification, or information about their origin. In the wet lab and during analysis, challenges are posed by the characteristic traits of aDNA: high fragmentation and low content of endogenous DNA.
Thibauld Michel1, Michael D. Martin2, Catherine Kidner1
1 Royal Botanic Garden of Edinburgh, University of Edinburgh, Edinburgh, United Kingdom
2 NTNU University Museum, Norwegian University of Science and Technology, Trondheim, Norway
Thibauld Michel
Michael D. Martin
Catherine Kidner
The study of ancient plant remains was historically limited to morphological studies, palaeontology being the primary field of study of past organisms. However, since the 1980s, genetic analysis of biological matter within fossils has become increasingly informative, thanks to the development of new molecular technologies such as the polymerase chain reaction (PCR) and high-throughput sequencing (HTS). Since the first identification of aDNA from extinct species in 1984 (
In the context of paleogenetics, ancient DNA (aDNA) is DNA from long-deceased tissues preserved by conditions allowing DNA survival. Even under appropriate preservation conditions, aDNA is usually degraded by biotic or abiotic processes. Though often damaged, it can carry valuable historical information (
The sequences used for most plant aDNA studies are derived from the nuclear and organellar genomes and are quite often the same markers typically used for plant identification or studies of evolutionary history. Markers from plastids are usually favoured for their high copy number and short length, despite reported problems resulting from their high propensity for genetic rearrangements. Furthermore, horizontal transfer from the plastome to the nuclear and mitochondrial genomes complicates the analysis, as the mutation rate differs between the nucleus and the other organelles (
Several difficulties are inherent to working with aDNA from plant specimens: the complexity and variability of the genome, aDNA damage, and potential contamination all increase downstream analytical difficulties. The combination of often very low aDNA concentrations with the amplification power of PCR dramatically increases the probability of amplifying contaminating modern DNA. Specialised methods and laboratory procedures have been established to reduce the risk of contamination. These include: the use of positively pressurised clean laboratory facilities dedicated to aDNA work, the replication of experiments in different institutions, and the use of biomarkers for prediction of DNA survival, such as mitochondrial DNA (mtDNA) detection, aDNA damage patterns, and detection of associated remains (
Another complication in the analysis of plant DNA is its variability. The presence of different organelle genomes (plastid and mitochondrial), as well as interspecific differences in ploidy level and chromosome size, can complicate the alignment of sequencing reads to a reference sequence (
A very common archaeological plant material is charred remains, for example superficially burnt seeds in hearth remains found in ancient settlements. Molecular identification of even lightly charred remains is, however, challenging, since the DNA is often very fragmented and contaminated (
In contrast, desiccated samples are often suitable for molecular analysis. Desiccated samples are typically found in dry environments such as caves, shelters formed by rock features (well suited for long-term food storage), or deserts. Desiccation can limit DNA degradation, and plastid and mitochondrial DNA from sunflower seeds as old as 3,100 years has been successfully recovered (
Biological remains preserved under waterlogged anaerobic conditions may also contain sufficient aDNA for molecular identification. Lakes and marine sediments can provide sedimentary DNA (sedaDNA) from plant remains and pollen grains found in different strata of core samples, which can be used to reconstruct past ecological diversity. Microorganism communities can also be a source of aDNA. For example, diatoms are commonly used bioindicators for assessing the biological composition (trophic state) of a lake, since their morphology is highly sensitive to the surrounding environment (
Waterlogged remains can also be found in the context of archaeological studies: wells, latrines, ditches, and pits can create anaerobic conditions. DNA from Iron Age grape seeds has been sequenced successfully with Hyb-Seq, showing that the grapes are related to present-day West European cultivars and providing evidence of 900 years of uninterrupted vegetative propagation of the crop (
Mineralized samples and those embedded in resin or fossilised in amber are potential sources of aDNA, though the high probability of contamination, extreme fragmentation of the material, and non-reproducibility of the results have led some authors to strongly discourage aDNA analysis of amber-preserved fossils (
Microfossils, including pollen, starch grains, and phytoliths, can be found in any environment, including humid conditions and tropical zones where macrofossil preservation is rare. Plastid aDNA obtained from pollen grains is very often endogenous, and its amplification established the first genetic link between extant and fossilised Scots pine specimens from postglacial lake sediments in Sweden (
Sediments found in lakes, temperate caves, permafrost, and ice cores can retain plant aDNA for thousands, and in some cases millions, of years (
SedaDNA provides a broad understanding of the past environment, climate, and ecology of the studied paleosol. It can also provide insights into the movement and cultivation of plants by Neolithic populations and their social networks in the absence of other archaeological evidence (
sedaDNA from lake sediments has been used to reconstruct ancient plant vegetation and to assess the impact of anthropogenic activities on the paleoenvironment. For example, the impact of cattle grazing on deforestation dynamics during the Late Iron Age and Roman period has been demonstrated by using a metabarcoding approach on sediment samples from a subalpine lake (
sedaDNA can also be used to study the impact of climatic changes on plant biodiversity and help prioritise conservation management. A research project using metabarcoding of lake sediments was able to show that a heterogeneous mountain landscape served as a refugium for arctic-alpine plants in a warm climate (
Another study on Arctic Canada lake sediments gave clues about the effect of the rise in temperature during the Last Interglacial period (LIG) on plant population dynamics. Previous attempts to reconstruct the LIG paleoclimate with climate modelling based on the simulation of atmosphere, sea, and ice circulation have yielded inconsistent results (
These studies, which show how plant richness evolved in response to previous episodes of climate warming, can improve the modelling of future climate change effects on plant diversity. Several environmental changes that might otherwise have been overlooked, such as arctic amplification or arctic greening, can be studied with sedaDNA (
SedaDNA studies are furthermore more robust than pollen-based methods for detecting plant richness and resolve taxon diversity at higher resolution (
The same observations can be made using sedaDNA extracted from permafrost, as presented in a study encompassing 50,000 years of megafauna diet and arctic vegetation history from samples collected across the Arctic. While pollen-based reconstruction suggested a majority of graminoids in the unglaciated Arctic during the Last Glacial Maximum, the metabarcoding approach revealed forb-dominated vegetation (
Ancient faeces, though relatively uncommon, are a rich source of biomolecules and paleodietary information that can be related to demographic, ecological, and climatic changes in the locations in which they are found (
The analysis of an aDNA dataset is complicated by post-mortem DNA degradation that leads to short fragments, specific nucleotide substitution patterns, and overall low DNA yields (
The initial alignment step with a reference genome during bioinformatic analyses is already affected by aDNA chemical damage, which can increase the apparent error rate and lower the alignment accuracy. Subsequent steps in variant calling of genetic markers can be complicated by the high mapping error rate and low coverage (
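To make this damage signal concrete, the following minimal Python sketch profiles C→T mismatch frequencies near the 5′ ends of aligned reads, the classic deamination signature used to authenticate aDNA. It assumes the pysam library and an illustrative BAM file (ancient_sample.bam) whose reads carry MD tags (e.g., added with samtools calmd); it is a sketch of the general approach, not a substitute for dedicated damage-profiling tools such as mapDamage.

```python
# Minimal sketch: 5' C->T deamination profile from an aligned BAM.
# Assumes pysam and a BAM with MD tags; the file name is illustrative.
import pysam

WINDOW = 25             # positions from the 5' end to profile
ct = [0] * WINDOW       # C->T mismatches per position
c_total = [0] * WINDOW  # reference C sites per position

with pysam.AlignmentFile("ancient_sample.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_reverse or read.query_sequence is None:
            continue  # forward-strand reads only, for simplicity
        seq = read.query_sequence
        # (query_pos, ref_pos, ref_base); ref_base requires the MD tag
        for qpos, rpos, ref in read.get_aligned_pairs(with_seq=True):
            if qpos is None or rpos is None or qpos >= WINDOW:
                continue
            if ref.upper() == "C":
                c_total[qpos] += 1
                if seq[qpos] == "T":
                    ct[qpos] += 1

for i in range(WINDOW):
    freq = ct[i] / c_total[i] if c_total[i] else 0.0
    print(f"position {i + 1}: C->T frequency {freq:.3f}")
```

A genuine ancient sample typically shows an elevated C→T frequency at the first few read positions that decays towards the read interior, with a mirrored G→A pattern at the 3′ ends of double-stranded libraries.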
The evolutionary history of a species or a population can be established based on genomic inference from modern samples, providing clues about the evolutionary processes that form the basis for present genomic variation. However, allelic patterns in contemporary specimens are shaped by a range of demographic events, including changes in population size, gene flow, and hybridization events. These may be due to very recent events, and do not necessarily represent the lineage’s deeper evolutionary history. A time series of samples can provide greater resolution in a genomic analysis and resolve phylogenetic questions. It can also detect recent demographic events such as population bottlenecks and provide chronological estimates for these events without using a molecular clock. Allele frequencies can be directly estimated for each time point and used to estimate the strength of selection pressure during that period (
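As a simple illustration of the last point, the sketch below estimates a per-generation selection coefficient from allele frequencies observed at two time points, under a deterministic genic-selection model in which the log-odds of the allele frequency changes linearly with time. All numbers are illustrative, and real analyses must also account for genetic drift and sampling noise.

```python
# Minimal sketch: point estimate of a selection coefficient s from two
# allele-frequency observations separated by t generations, assuming
# deterministic genic selection: logit(p_t) - logit(p_0) = s * t.
import math

def logit(p):
    return math.log(p / (1.0 - p))

def selection_coefficient(p0, pt, generations):
    """Estimate s per generation from frequencies p0 and pt."""
    return (logit(pt) - logit(p0)) / generations

# e.g., an allele at 10% in a sample 500 generations old that is at 35%
# in modern material (illustrative numbers):
print(selection_coefficient(0.10, 0.35, 500))  # ~0.0032 per generation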
The dramatic global warming and extinction events that have occurred during the Anthropocene coincided with the active collection of specimens for museums and herbaria (
Positive selection can also be detected in contemporary specimens using statistical tools such as coalescent-based methods, population differentiation (FST), and linkage disequilibrium. Selection pressure, however, can be conflated with demographic change or background selection. Specific methods have been developed to detect positive selection on a polygenic trait using an admixture graph to represent the admixture events relating different populations through time (
Purifying selection or negative selection can be detected in present-day specimens as signals of reduced genetic diversity. However, similar signals can be caused by demographic events such as population bottlenecks or background selection (
Balancing selection is more difficult to detect since it affects narrow genomic regions on a short timescale. This can be mistaken for positive selection, demographic events, or introgression (
All current crops are the products of single or repeated domestication events, starting less than 12,000 years ago, from their ancestral wild species (
Archaeobotanical remains can be arranged in a time series to study the evolution of domestication over time and space. They can indicate the number and location of domestication events, the pace and stringency of anthropogenic selection, and introgression from wild relatives and between different cultivars, and they can be used to date these events (
Molecular methods have made an increasingly large contribution to the field of archaeobotany. Starting with simple genetic analysis for taxonomic identification to supplement morphological examination, the field has rapidly progressed following advances in high-throughput technologies in archaeogenomics. Methods such as shotgun sequencing have enabled genome-wide studies, exploring in detail the genome of domesticated plants and analysing the genome-wide rearrangements that occurred during this process (
As both a key crop and a genetic model organism studied intensively for over 100 years, maize has been the subject of a wealth of domestication studies, which have revealed a detailed picture of its evolution. Molecular analysis of palaeobotanical remains continues to provide new information on maize evolution, and PCR-based studies have identified the likely geographic region of its original domestication in Mexico and traced its dispersal across Central America and South America (
The target capture method, or Hyb-Seq (see Chapter 14 Target capture), has been used to confirm and refine models for maize domestication over time, mediated by progressive introgression from wild relatives (da Fonseca et al. 2015). A recent study of maize domestication and diversification in South America, based on the genomes of present-day and ancient American maize cobs, has shown that maize had a stratified mode of domestication that started with a large, partially domesticated Mesoamerican gene pool. This was followed by dispersal to different locations, in which the subpopulations became reproductively isolated under different selection pressures (
Wheat domestication has not been studied as extensively as maize, but modern genome-wide studies on emmer wheat chaff found shared haplotypes between 3,000-year-old Egyptian emmer wheat from a museum collection and modern emmer wheat, including domestication loci such as two QTLs related to grain size and seed dormancy. Although several haplotypes present in historical specimens are absent from modern emmer, similarities between museum specimens and Arabian and Indian emmer landraces suggest an early south-eastern dispersal of ancient Egyptian emmer (
Bottlenecks are a common feature of the domestication process and have also been revealed from ancient plant material, as in the case of beans. One of the symptoms of a bottleneck event in the demographic history of a lineage is genetic erosion: the loss of allele diversity in a population due to genetic drift and inbreeding caused by the bottleneck. This effect was found in the case of Andean bean domestication, which was likely triggered by stringent varietal selection (
Climatic and environmental changes can be responsible for major shifts in species' geographic distributions. For example, the glaciation cycles of the past 2.4 million years have restricted some species to separate refugia, often resulting in a loss of allelic variation that persists after the species' expansion out of the refugium. Phylogeography enables the study of the history of the geographic distribution of genealogical lineages, using population genetic tools to detect the changes in genetic variation caused by historical events such as migration and dispersal (
Early plant phylogeography studies were based on plastid DNA (pDNA) sequencing methods, such as a study of the distribution and circumpolar migration of saxifrage that suggested plant refugia may have been located in the Arctic (
Ancient DNA studies can unravel the ecological past and temporally explore the adaptation mechanism and interactions between organisms. This can include processes such as convergent evolution of different species in a similar environment, present plant adaptations due to standing or de novo mutation in the evolutionary history of a species, or metagenomics of an aDNA specimen to reveal the dynamics of plant pathogens (
Innovations in shotgun metagenomics have increased the possibilities for using sedaDNA analysis for reconstruction of past vegetation with higher taxonomic resolution than with pollen DNA barcoding (
Some limitations do, however, remain. SedaDNA is well preserved in lake environments, where stable temperature conditions conserve DNA, but sampling in these areas can be challenging. There are also major challenges in detecting species that are rare or have low biomass. Additionally, the taxonomic resolution provided by sedaDNA varies depending on the method used. While metabarcoding sedaDNA almost always provides higher resolution than direct pollen analysis (
The Anthropocene presents major global challenges, including climate change, loss of biodiversity through extinction, and emerging zoonotic infectious diseases. An understanding of previous human interactions with the environment can guide conservation management during this era of massive environmental change and rapid loss of biodiversity. The field of conservation archaeogenomics involves analysing aDNA with the goal of guiding present-day biological conservation (
Genomic archaeological data can also reveal details about the time and potential reasons for local or global extinction events, and help to understand the resulting consequences on ecosystems and human societies. Studies that use these data may also contribute to better understanding how human activities and behaviours may have contributed to past extinction events. Studying the distribution of species and how they colonise new areas can also help us to anticipate how ecosystems may respond to future climate change (
A theoretical application of the recent progress in molecular biology and sequencing techniques follows from the concept of “de-extinction” or “species revivalism”. The possibility of de-extinction is controversial and still debated on both technical and ethical levels, as it is difficult to justify the ecological need for reviving extinct species rather than supporting current conservation efforts for endangered species (
Over the last several decades, paleogenetics has made substantial contributions towards our understanding of ancient plant science, ecology, and archaeology. In contrast, paleogenomics is still in its infancy, and sequencing and analysis techniques are constantly improving. The study of full genome datasets has made it possible to accurately characterise taxonomic diversity (
The race to understand biological diversity before it is lost is, to some degree, eased by the presence of valuable genomic information in archaeological and natural history collections that include extinct and endangered species. As this field of research provides information about common species and their ecological background, it provides a framework in which to study and understand how the past 200 years of human activity have impacted patterns of genetic diversity in the natural world. It is essential that we use insights from the study of ancient plant genomics to help us reduce biodiversity loss over the next 200 years.
Human faecal material recovered from the latrines of an ancient settlement was analysed with a shotgun sequencing approach, yielding puzzling results. The plants identified from this archaeological site were not domesticated at the time of its occupation and were not thought to occur at this location. How can you explain this discrepancy? What protocols can be used to verify this result?
A study of the Holocene glacial retreat is to be designed to assess the timing and zones affected by deglaciation, using plant aDNA as a proxy. What aDNA specimens can be used to assess the changes in plant diversity over time at each sampling point, and to identify the species involved?
Several biases specific to aDNA analysis can lead to incorrect species identification of a specimen. The low quantity of aDNA in historical specimens can increase the effect of cross-contamination between samples and of differential amplification of DNA fragments during the PCR steps used to prepare genomic libraries. Replicates of the samples can be analysed in separate facilities to test the reproducibility of the results, and a negative control devoid of DNA can be used to check for contamination. Ancient DNA damage, such as substitutions or deletions, can affect the DNA sequence itself and lead to incorrect identification. Software that assesses aDNA damage and recalibrates the alignment file can be used to minimise this bias. Another source of error can be the incompleteness of the plastid reference database used to match the sequencing reads: if many species are missing from the reference database, detection might occur at the genus level instead of the species level. For more information, this question is based on a study that characterised the diet and intestinal parasites of ancient communities in Northern Europe and the Middle East from latrine-remain aDNA (
To study the evolution of plant richness over time, we can use a time series of samples to reconstruct the evolution of vegetation diversity at the sampling point. A range of datasets from several sampling points can be used to model the Holocene glacial retreat over time. Lake sedaDNA is an adequate source of aDNA for studying climatic change via the detection of taxonomic plant diversity. SedaDNA is extracted from lake sediment cores, in which each sediment layer corresponds to a different period. This kind of sampling might provide a measure of plant vegetation richness before and after deglaciation, and might be used to confirm models of the Holocene glacial retreat. For more information about lake sedaDNA cores used to reconstruct changes in plant diversity over time and geographically, have a look at a study using sedaDNA to characterise the emergence of vascular plants after glaciation in Greenland (
Felicitas Mück1, Carlos A. Vásquez-Londoño2
1 Department of Pharmacy, Faculty of Mathematics and Natural Sciences, University of Oslo, Oslo, Norway
2 Department of Pharmacy, Faculty of Sciences, National University of Colombia, Bogotá, Colombia
Felicitas Mück
Carlos A. Vásquez-Londoño
Plants have been used as medicines for millennia in diverse geographical and cultural contexts. They continue to play an essential role as therapeutic and prophylactic agents in traditional and complementary medicine (
Accurate medicinal plant identification is very often a challenging task. Products can either be from single species or from mixtures, can be in a dried, fragmented, or powdered form, and can originate from plant leaves, flowers, stems, barks, roots, fruit, and seeds. They may also come in the form of phytopharmaceutical products, including oral, topical, parenteral, ophthalmic, or inhaled forms. Most regulatory guidelines and pharmacopoeias for conventional plant authentication are based on diagnostic morphological or chemical features.
Macroscopic botanical analysis for herbal drug authentication is complicated in many cases, since it requires an experienced taxonomist and the plant samples should include flowers or fruits, along with a segment of the stem carrying enough leaves to observe branching patterns (
Chemical authentication of herbal medicines includes metabolite detection, quantification, profiling, and elucidation through analytical methods such as thin layer chromatography (TLC), high-performance thin layer chromatography (HPTLC), high-performance liquid chromatography (HPLC), liquid chromatography coupled to mass spectrometry (LC-MS), gas chromatography coupled to mass spectrometry (GC-MS), or nuclear magnetic resonance (NMR) (
Molecular plant identification techniques such as DNA barcoding have proven to be cost-effective procedures useful in pharmacovigilance to authenticate herbal medicines at the species level and to detect adulterants (de Boer et al. 2015a). Molecular identification of medicinal plants has been suggested for routine market surveillance and for screening the quality of raw materials in early stages of the herbal supply chain (see Chapter 23 Food safety). DNA-based methods come with several advantages: DNA is present in all plant organs, it is less sensitive to external factors than metabolites are, and it can be used for the identification of dried and powdered products in which morphological characteristics are absent (
The increase in herbal medicine adulteration is of growing concern due to the expansion of the global market for natural products (
A number of methods for identifying medicinal plants using DNA-based methods have been previously described. Several barcoding regions have been shown to be effective for identifying adulterants and physiologically difficult-to-discriminate plant species from a variety of sample forms (
Ethnopharmacology studies the traditional use of drugs by humans, integrating anthropological, pharmacological, toxicological, and chemical approaches (
DNA-based identification of plant ingredients in herbal medicines is important for resolving taxonomic controversies, assessing the genetic variability and evolutionary traits of medicinal plants, as well as enabling the detection and further conservation of endangered, illegally traded species (de Boer et al. 2017). Molecular identification of medicinal plants is an important force in driving taxonomic research on medicinal species, guiding forensic DNA and toxicological research. Some examples of applications within the field include nrITS2 barcoding of 90
The vast majority of commercialised medicinal plants are collected from the wild, and in many cases they are overexploited, with some becoming increasingly scarce. These factors threaten the conservation of endangered plants, endemic species, and species with limited distributions. DNA analysis is also important for the detection of endangered species through the screening of marketed natural products. For instance, DNA barcoding using
According to the WHO, accurate identification of medicinal plants is an essential measure for the assurance of the quality, safety, and effectiveness of natural medicines (
Authentication of medicinal plants and the detection of adulterants are crucial concerns for regulatory agencies and phytopharmaceutical industries in order to guarantee optimal quality, safety, and efficacy of herbal products for consumers. Challenges associated with conventional pharmacognostic procedures for authenticating processed or multi-ingredient herbal products can be mitigated with DNA technologies, enabling the accurate identification of medicinal species and substitutes in complex samples. It is recommended to combine molecular, chemical, and morphological plant identification methods to increase the discriminatory capacity of authentication approaches (
List three categories of herbal medicine adulteration.
Why are genetic methods an important tool for plant identification in ethnopharmacology in addition to analytical chemical methods?
Molecular identification has various applications in healthcare. How could molecular identification aid bioprospecting?
Categories that could be mentioned include: unconscious misidentification by collectors, intentional fraudulent substitution, discrepancies between vernacular names and scientific species names, the high market value of medicines that incentivises adulteration, and a lack of regulation in some countries.
Genetic methods in plant identification can be used in combination with chemical analytical methods to identify plants used as medicines in traditional health systems during ethnobotanical or ethnopharmacological research since chemically based methods alone often cannot correctly identify a plant species or its origin.
Molecular plant identification, potentially in combination with chemical analytical methods, can be used to systematically identify plants with potential medicinal properties.
Bastien Anthoons1,2, Panagiotis Madesis3
1 School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
2 Institute for Applied Biosciences, Centre for Research and Technology, Thessaloniki, Greece
3 Lab of Molecular Biology, Department of Agriculture Crop Production and Rural Environment, University of Thessaly, Volos, Greece
Bastien Anthoons
Panagiotis Madesis
Food safety is defined as the routines used in food handling, preparation, and storage to reduce the risk of individuals becoming sick from foodborne illnesses. Food safety draws from the expertise in a wide range of academic fields, including chemistry, microbiology, molecular biology, and engineering. Although advances in science and technology have led to a substantial improvement in food quality, food can still be a source for public health issues (
Numerous botanical products with considerable differences in their classification can be bought throughout the world. These include foodstuffs, herbal medicinal products, and cosmetics. Foodstuffs include dietary supplements, food ingredients, functional foods, and foods for particular nutritional use, including various botanical extracts. Herbal medicinal products can only be sold in pharmacies, under the supervision of a pharmacist, and are marketed after registration procedures according to their classification (see Chapter 22 Healthcare). Dietary supplements and herbal medicines are usually considered two different regulatory categories, but for both, regulatory consensus across countries is lacking (
Food hazards refer to any agents with the potential to cause adverse health consequences for consumers. Food safety hazards occur when food is exposed to and contaminated by hazardous agents. Food hazards may be biological, chemical, physical, allergenic, nutritional, and/or biotechnology-related (
A final category of food hazards is biotechnology-related hazards such as genetically modified organisms (GMOs). GMOs are the products of genetic engineering, in which new genes are transferred from one species into another. The resulting properties may lead to better optimised agricultural performance or to the new or increased production of valuable pharmaceutical substances (
It is also important to point out that the toxicity of any substance, including plant-based food and medicinal plants, is largely dependent on the dose or amount used. An otherwise harmless plant may be toxic at high doses, and a highly toxic plant could be considered safe at a low dose (
Food fraud is a collective term used to encompass the intentional substitution, addition, tampering, or misrepresentation of food, food ingredients, or food packaging with the aim of increased economic gain (
Adulteration is the failure of a product to meet legal quality standards. According to the US Federal Food, Drug and Cosmetic Act (FFDCA), food can be declared adulterated if (1) a substance is added which is injurious to health, (2) a cheaper or inferior quality item is added to the food, (3) any valuable constituent is extracted from the main food article, (4) the quality of food is below the standards, (5) a substance is added to increase bulk or weight, and (6) a substance is added to make it appear more valuable. Adulterated food can be dangerous since it may be toxic to human or animal health, it may lead to the deprivation of nutrients required for health, and it may cause intoxication or allergic reactions in sensitised individuals. Adulterants in food can be categorised as follows (
Intentional adulteration is the inclusion of inferior substances having properties similar to the foods to which they are added. The adulterant can be physical, chemical, or biological. An example of intentional adulteration is the addition of wheat or other grains as an inexpensive filler to increase profit margins (
Unintentional adulteration is the inclusion of unwanted substances due to ignorance, carelessness, or lack of proper facilities and hygiene during food processing. This includes contamination of foods by bacteria and
Metallic contamination is the intentional or unintentional inclusion of different types of metals and metal compounds in food. Arsenic, cadmium, lead, and mercury are among the most toxic.
Microbial contamination is the spoilage of food due to infusion of different microbes through various sources.
Agrobioterrorism can be defined as the use of pathogens or toxins against agricultural products or facilities usually with the purpose of causing casualties or fatalities from contaminated agricultural resources or food (
Toxic plants are occasionally eaten due to their misidentification. The fruits of toxic plants such as the very poisonous deadly nightshade (
Food authentication is the process by which food is verified as complying with its label description. According to
Plant-based oils and fats dominate food applications, and a balanced intake of oils and fatty acids is essential for human health (van Duijn 2014). Extra virgin olive oil is a high-priced product with high nutritional value. Due to the high market prices and its increasing demand, olive oil is one of the most adulterated products on the global food market. Usually olive oil is substituted with less expensive edible vegetable oils (
The supply chains for herbs and spices tend to be long and complex and pass through many countries. These complexities and the increase in crushed and ground herbs and spices render those products more prone to intentional adulteration (
The adulteration of dietary supplements has been reported fairly frequently, as a result of their rising popularity. Fraudulent practices may result in reduced therapeutic potential of the original drug, posing a serious risk to the health of the consumers (
Food hazards may enter the food chain in various ways and have large impacts on human health. It is therefore important that products “moving” along the food supply chain (FSC) are both tracked and traced. Traceability, under EU law, means the ability to track any food, feed, food-producing animal, or substance that will be used for consumption, through all stages of production, processing, and distribution. Traceability applies to both upstream (where the product comes from) and downstream (where the product is delivered to) tracking (
This issue was debated jointly by the UN's Food and Agriculture Organization (FAO) and World Health Organization (WHO), leading to the Codex Alimentarius or “Food Code”, a collection of standards, guidelines, and codes of practice adopted by the Codex Alimentarius Commission (
In Europe, risks along the supply chain are assessed by the European Food Safety Authority (EFSA). EFSA monitors and analyses information and data on biological hazards, chemical contaminants, food consumption, and emerging risks (European Food Safety Authority (EFSA), 2012). It is important to note that the principles of the universal HACCP (Hazard Analysis and Critical Control Points) method depend on the origin and nature of the food products as well as the type of end-product. Hazards and their subsequent risk assessment will therefore differ between the olive oil supply chain and the herbal tea supply chain. Olive oil production involves specific processing steps and uses industrial settings such as extraction mills (for pressing or centrifugation), which are absent in the processing phase of dry plant material such as herbal teas and spices (
Fraudulent practices can happen at any step of the supply chain. The most effective way to eliminate illegal practices in the food sector is food chain transparency and full raw material traceability. For example, food companies that implement a digital traceability system using unique product identifiers increase their transparency since they have supply chain visibility in real-time (
The following example discusses risk assessment in a commercialization chain for plant-based products based on dry plant material (e.g., herbal tea, spices, medicinal mixtures).
Plant cultivation is the first step in the supply chain for a herbal product, from seed(ling) to adult plant. During growing periods in agricultural fields or greenhouses, different hazardous sources may affect downstream processing and production. These hazards can include: faeces, contaminated soil, irrigation water, water used to apply pesticides, foliar treatments, growth hormones, dust, wild and domestic animals, insects, and human handling. Automated and regular monitoring as well as personal hygiene are therefore essential.
Harvesting can be performed by hand or mechanically, and involves several important commercial steps including pre-sorting and removal of foliage and other non-edible parts. Personal hygiene is particularly important during manual harvesting. Contamination of the herbal product with other plants such as weeds can result from insufficient quality control during harvesting (
Authentication or verification of raw plant material can be done by traditional morphological analysis or by DNA-based methodologies (see paragraph “Methodologies for identification of plant food hazards”). In the case of products with a protected designation of origin (PDO), the label certifies that the product originates from a certain region or area and that its quality and/or characteristics are due to the particular geographical environment, e.g., Greek extra virgin olive oil or PDO saffron (
During transportation, the raw plant material might be damaged by poor handling, cross-contaminated by other materials in the vehicle, or contaminated by vehicle exhaust from petrol and diesel (
The plant material might be stored for an extended period of time before packing. Storage requirements depend on the state of the plant-based product (i.e., fresh, raw, processed). Raw plant material needs to be stored in a cool and dry place since
The purpose of packing is to protect against food pathogens, spoilage-causing organisms, pests, damage, etc. Good hygiene practices should be followed in handling containers and improved packing materials to prevent product contamination (
Analytical methods for detecting adulterated food are traditionally seen as a first line of defence against food fraud (
Several techniques are used in the food industry to detect microbial contamination of food. Omics-based techniques (i.e., genomics, transcriptomics, proteomics, and metabolomics; see Chapter 12 Metagenomics and Chapter 15 Transcriptomics) are robust tools to gain insight into microbial communities along the food chain, and can detect pathogens, trace the origin of a foodborne illness, support microbial source tracking investigations, and identify antimicrobial resistance (
Several common molecular techniques for plant-based food authentication are available.
PCR-based techniques are useful for the detection and identification of animal and plant species in foods because of their high sensitivity and specificity, in addition to being relatively fast and inexpensive. Multiplex PCR assays simultaneously identify several species by using species-specific primers, and they are being extensively applied to the detection and differentiation of species present in food products (
Like PCR and qPCR, Loop-Mediated Isothermal Amplification (LAMP) detects specific DNA sequences, but its primers can recognise up to eight distinct regions of the target. The LAMP method uses self-recurring strand-displacement DNA synthesis to replicate a target DNA at a constant temperature, avoiding thermal cycling and thereby saving time and avoiding PCR bias. LAMP has been applied for the detection of foodborne pathogens, the screening of pesticide residues, the assessment of adulteration in meat and of various food allergens, as well as the authentication of GM crops (
High resolution melting (HRM) is a post-PCR analysis method that monitors the rate of double-stranded DNA dissociation to single-stranded DNA with increasing temperature, and is used to identify variations in nucleic acid sequences. HRM, especially in combination with DNA barcoding, has proven successful for species discrimination, adulterant and allergen detection, and product authentication on a wide range of complex food materials of plant as well as animal origin (
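To illustrate the underlying signal processing, the sketch below converts a raw melt curve (fluorescence versus temperature) into the negative first derivative, −dF/dT, whose peak approximates the amplicon melting temperature (Tm). The data are synthetic; in practice, the qPCR instrument's software performs this step and clusters normalised curves to separate genotypes or species.

```python
# Minimal sketch: derivative melt-curve analysis on synthetic HRM data.
import numpy as np

temps = np.arange(75.0, 95.0, 0.1)                 # temperature (deg C)
# Synthetic sigmoidal melt curve with a Tm around 84 deg C:
fluorescence = 1.0 / (1.0 + np.exp((temps - 84.0) / 0.5))

neg_dfdt = -np.gradient(fluorescence, temps)       # -dF/dT
tm_estimate = temps[np.argmax(neg_dfdt)]
print(f"Estimated Tm: {tm_estimate:.1f} deg C")    # ~84.0
```

Sequence variants, including single-nucleotide differences between a genuine ingredient and an adulterant, shift the Tm or change the curve shape, which is exactly the signal that HRM-based authentication exploits.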
Next generation sequencing (NGS) combined with powerful bioinformatics tools are advancing food microbiology and authentication of products of botanical origin (
More advanced molecular methods such as shotgun metagenomics and whole genome sequencing are becoming more widely adopted in the food industry. These approaches provide deeper information in a single analysis and more complete sequence information (
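To convey the core idea behind such sequence-based classification, here is a deliberately naive k-mer matching sketch; the reference sequences and the read are toy examples. Real shotgun profilers (e.g., Kraken-style tools) index entire genome databases and handle sequencing errors, shared k-mers, and ambiguous assignments.

```python
# Minimal sketch: assign a sequencing read to the reference species
# sharing the most k-mers. Purely illustrative toy sequences.
K = 8

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

references = {
    "Olea europaea":     "ATGGCTTCAGGATCCTTAGGCATCGATTGCA",
    "Helianthus annuus": "ATGGCAAGTTCCGGATTACGGCATTAGCTAA",
}
index = {name: kmers(seq) for name, seq in references.items()}

def classify(read):
    """Return the reference with the most shared k-mers, or None."""
    read_kmers = kmers(read)
    best, best_hits = None, 0
    for name, ref_kmers in index.items():
        hits = len(read_kmers & ref_kmers)
        if hits > best_hits:
            best, best_hits = name, hits
    return best

print(classify("GGATCCTTAGGCATCG"))  # -> Olea europaea
```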
There is an urgent need to combat food safety issues in plant-based products. Further methodological improvements in food hazard detection and the digitalization of food safety protocols are necessary for quality assurance of food products. Mislabelling and fraudulent practices such as adulteration require special attention, as they are the most common issues in the global food supply chain. Most food hazard detection techniques are chemistry-based and are used for detecting chemical food hazards or focus on microbial contamination, issues with important repercussions for human health. DNA-based methodological advances for plant-based foods should focus more on the creation and curation of reference databases and on the use of innovative bioinformatics tools for fast and accurate food authentication. Standardisation of DNA-based methodologies is a prerequisite for the successful implementation of hazard risk assessment protocols at the national and international level.
A. Food analysts found a considerable amount of wheat (
What is the food safety risk of undeclared GMOs in food products?
Extra virgin olive oil is one of the most adulterated plant foods, either intentionally or unintentionally. State three reasons why this product is more prone to end up in an adulterated state compared to other products.
A. Since a considerable amount of wheat was detected, intentional adulteration is the most likely food safety issue. Unfortunately, this is common practice in herbal products and spices: wheat and other grain-based components are used as fillers to increase profits due to the cost differential. B. Intentional adulteration will most likely happen during processing or packaging. In the case of unintentional adulteration, the harvesting step in the supply chain is the most probable cause of this issue. Wheat is a very common plant in agricultural fields and is very likely to be harvested accidentally with the cultivated herb or vegetable, either by machines or manually. In addition, like many other agricultural crops, wheat is wind-pollinated; wheat pollen can therefore contaminate the product at various steps in the supply chain, from cultivation to packaging. C. Adding wheat, a plant allergen, as an adulterant in food products could create serious health issues, especially for those with gluten intolerance. Adding cheap substitutes to expensive spices also has economic consequences, since consumers pay a high price for low-quality products.
Although the benefits of GMOs are vast, it is important to note that some health risks are associated with them that are not always fully understood. Genetically modified plants may cause hazards related to increased allergenicity, transfer of genes from GM food to cells of the body, or to bacteria in the gastrointestinal tract.
The significant financial rewards, the low availability of high-quality extra virgin olive oil as a result of increasing demand, and inadequate screening by regulatory agencies are the three main reasons for extra virgin olive oil adulteration.
Maria Ariza1, Sandra Garcés-Pastor2, Hugo J. de Boer1
1 Natural History Museum, University of Oslo, Norway
2 The Arctic University Museum of Norway, UiT - The Arctic University of Norway, Tromsø, Norway
Maria Ariza
Sandra Garcés-Pastor
Hugo J. de Boer
As the world's most abundant kingdom of life, plants are virtually everywhere: in terrestrial, freshwater, and marine ecosystems, and even in the air in the form of pollen and spores (
However, plant biodiversity assessments are impeded by problems associated with species detection, taxonomic assignment, abundance quantification, and sample bias given the unknown spatial and temporal distribution of target species (
Improving plant biodiversity assessments is one of the century's greatest challenges, as less than 10% of the world's plant diversity is currently known, and its loss outpaces the rate at which it is discovered, inventoried, and protected (
While many plants are sessile and their biomass is mainly located below or above anchoring surfaces, some vegetative and reproductive plant parts (i.e., flowers, leaf debris, pollen, seeds) detach and are transported over short or long distances from the main organismal body until they are finally deposited onto substrates (i.e., ground, water, and more). Hence, plant DNA can be found in environmental substrates as organismal and extra-organismal DNA in various proportions, with each substrate potentially tracking different spatial and temporal signatures of biodiversity (
As no single marker provides resolution for all taxa, eDNA-based assessments often employ metabarcoding of different nuclear and chloroplast regions such as ITS,
Despite major recent advances in detection, eDNA-based assessments remain limited in their ability to reliably quantify abundance, which in turn makes it hard to assess population status and take management actions (
Furthermore, presence/absence estimations provided by eDNA-based assessments can be misleading as DNA may remain in the environment after the organism is no longer present (
About a decade after the term eDNA was introduced, the eDNA scientific community has adopted different terminology in reference to the state, source, or substrate from which eDNA is isolated (
Pollen DNA is most commonly the main source of plant eDNA in airborne samples, although single-cell algae and leaf and flower fragments may also be present (
Faeces, mucus, and saliva contain DNA from the host and from the organisms that were ingested or that have been in contact with the host (
Soil and sediments, from both terrestrial and aquatic environments, are presumably the substrates where most plant DNA is present, as extra-organismal and organismal DNA from both active and dormant tissues, including roots, debris, fallen vegetative parts, seeds, and pollen, is gathered or ultimately deposited in these substrates. Because of the major presence of plant eDNA and the ubiquity of these substrates in both aquatic and terrestrial ecosystems, soil and sedimentary eDNA samples are particularly well suited for plant assessments. The difference between soil and sediments can be ambiguous, as both are products of the weathering of the Earth's crust (
Soil eDNA plant assessments have successfully characterised diversity in tropical (
As sediments are deposited throughout time and form distinguishable layers, the eDNA present in these layers (namely sedaDNA) can signal organisms that were likely locally present in ancient environments (
eDNA-based biodiversity assessments have proliferated in marine and freshwater environments in recent years, and our knowledge on the persistence, decay rates, and states of eDNA in water samples and its resolution compared to traditional assessments has in parallel increased (
The assessment of aquatic plant eDNA in freshwater ecosystems has simultaneously enabled the early detection of invasive species (
Plant DNA can also be isolated from water samples in the form of snow, firn, and ice (
Bulk samples from plants are distinctly different from pitfall or Malaise traps filled with insects. Among bulk samples of plants, one can distinguish natural bulk samples, such as pollen collected by pollen samplers or scraped or washed from pollinating vectors, from artificially assembled ones, such as collected roots, leaves, or flowers. Nevertheless, all bulk samples consist of organismal DNA from plant communities that can be used to assess plant or other diversity (
Flower bulk samples have been assembled to assess arthropod communities that leave DNA traces after either visitation or pollination (
A recent development in eDNA metabarcoding is the use of organisms as natural samplers of DNA (coined nsDNA;
In aquatic ecosystems, macroinvertebrates (
For the assessment of terrestrial vegetation in tropical areas, bats hold great potential as biotic samplers of plant DNA, since omnivorous and frugivorous bat communities are abundant and thus easy to sample (
Finally, amplifying hypervariable markers from biotic DNA samplers, e.g., COI for animals, has recently gained attention as it can assess diversity below the species level, and thus signals ecosystem population assemblages in space and time (metaphylogeography; X.
While the exploration of eDNA samples and methods for plant assessments is still in its infancy, eDNA has already revolutionised the way in which, and the speed at which, biodiversity can be inventoried. Plant detection via eDNA has enabled the discovery of plants living in extreme and/or ancient environments and yielded myriad applications with societal relevance. A decade after the rise of eDNA-based assessments, the limitations of this method across different eDNA samples are still being recognised, while in parallel different strategies are being developed to overcome or mitigate them. In this rapidly developing field, it is essential to combine the basics of eDNA metabarcoding with the most recent insights and developments in the field to devise the most robust study design to answer your research questions.
You want to assess the floral resources available in summer for a butterfly species and identify potential food competitors. Describe your experimental design and the eDNA substrate(s) that you would use and why.
You are hired to conduct a vegetation assessment of a landscape mosaic composed of several small lakes and grasslands; however, you only have the time and budget to collect samples from a single eDNA substrate. Which eDNA substrate would you choose and why?
You use soil eDNA to detect the spread of an invasive alien gymnosperm tree species (Sitka spruce,
The analysis of eDNA from gut contents, faeces, or eDNA traces from the butterfly's body (vegetation fragments or pollen grains) would reveal the floral resources available and visited. To reveal other organisms that are using the same floral resources (competing pollinators), one could target insect eDNA present in flowers that have been visited.
eDNA water samples from near-shore sites would optimise the vegetation assessment, as they are both easy to collect and signal terrestrial as well as aquatic diversity. Though airborne DNA could also be considered for this purpose, it may miss dormant DNA or non-wind-dispersed plants. In addition, sedimentary eDNA may also signal nearby diversity.
Spruce and pine pollen grains are tiny, light, and spread by wind. They tend to show up anywhere and are not a good indication of local presence. Invasive species monitoring needs approaches that provide a clear link between detected species and specific environments.
Mehrdad Jahanbanifard1,2, Margaretha A. Veltman3, Sarina Veldman1, Ida Hartvig4, Carly Cowell5, Frederic Lens1, Steven Janssens6, Erik Smets1
1 Naturalis Biodiversity Center, Leiden, The Netherlands
2 Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, The Netherlands
3 Natural History Museum, University of Oslo, Oslo, Norway
4 University of Copenhagen, Copenhagen, Denmark
5 Royal Botanical Garden, Kew, United Kingdom
6 Meise Botanic Garden, Meise, Belgium
Mehrdad Jahanbanifard
Margaretha A. Veltman
Sarina Veldman
Ida Hartvig
Carly Cowell
Frederic Lens
Steven Janssens
Erik Smets
Wildlife trade is the trading of living or dead wild plants,
The impacts of wildlife trade are substantial with both conservation and socio-economic importance. Unsustainable trade could lead to (local) extinction of populations or even entire species. For plants that occupy a specialised niche, it can destabilise interactions with other species, with potential consequences for the entire ecosystem. Therefore, after habitat loss, wildlife trade is the second-biggest threat to species survival (WWF, 2020). Not only does illegal wildlife trade threaten biodiversity due to consistent overexploitation, it also competes with legal use of natural resources and results in a substantial loss of income for both local communities and governments (
In order to regulate the trade in vulnerable wildlife, the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) was established in 1975. Species at risk of overexploitation due to international trade are listed on one of three appendices, depending on how much they are threatened by unrestricted trade. Appendix I lists the most endangered species, for which commercial trade is not permitted - except for pre-convention material - and for which non-commercial trade is strictly regulated. Appendix II lists species that may become threatened with extinction if trade is not carefully controlled, and trade in them therefore requires a proper permit. Finally, Appendix III lists species that are protected in at least one country, where the assistance of other CITES Parties is required to control the trade. Listing species on Appendix III helps to establish international cooperation in order to control trade in the species according to the laws and regulations of that country. Species can be added to Appendices I and II, removed from them, or shifted between Appendix I and II only by voting at a Conference of the Parties (CoP), a meeting of the CITES Parties to review the implementation of the Convention. Species can be added to Appendix III or removed from it at any time, by any Party, unilaterally (CITES, n.d.).
At the moment, roughly 39,000 species, including ca. 6,000 species of animals and ca. 33,000 species of plants (395 species in Appendix I, 32,364 species in Appendix II, and 9 species in Appendix III), are protected by CITES (CITES, n.d.). In countries that are signatories to the convention, import and export permits must be issued for international trade of plants and animals listed in these appendices. Some countries set annual export quotas for certain species to ensure that they will not be traded beyond the sustainable limits for species survival. Non-compliance with CITES regulations can lead to confiscation of the material as well as fines and prison sentences, and in some cases trade sanctions against a country (CITES, n.d.). Since 2017, CITES has also facilitated the Wildlife Cybercrime Working Group, which has coordinated national responses to the threat posed by online trade (
Other international and national regulations have been put into place to support the implementation of and in some cases expand on CITES regulations. Examples are the EU Action Plan Against Wildlife Trafficking (European Commission 2016), the EU Wildlife Trade Regulations (European Commission 2010), European Union Timber Regulation (EUTR), United States LEMIS wildlife trade data (
Despite the fact that plant species far outnumber animal species on the CITES appendices, in the public discourse on wildlife trade and conservation, charismatic mammals such as elephants, rhinos, tigers, and lions usually take centre stage. Smaller animals (e.g., insects, molluscs), but also most plant groups, receive less attention and generate less funding in discussions regarding wildlife trade and conservation. And although plants appear frequently in national and international regulations, regulatory enforcement and additional conservation measures still primarily target iconic megafauna (
Plant blindness is a psychological bias that leads us to notice (large) animals, and take plants largely for granted, reducing them to background vegetation for other organisms. The term was coined by
Plant blindness has been institutionalised throughout society, from (higher) education to governance and wildlife management (
Apart from the limited attention that plants receive in research, education, and conservation, effective control of trade in plant species is hampered because some of the traded goods are difficult to recognise, either because they are processed or because they contain only parts of the organism, which lack the morphological characters needed for identification (
Other challenges are posed by the growing use of the internet for transactions, which makes wildlife material more readily accessible and at lower costs, while preserving anonymity. The internet is not only increasingly used to sell and obtain specimens, but even to organise poaching events (
Lastly, since international wildlife trade by definition transcends borders, enforcement of legal trade requires coordinated action between multiple countries to address the whole supply chain. While there are already many institutional collaborations that work across international borders to help track and catch illegal wildlife trafficking syndicates - including financial institutions, NGOs, customs and police forces, and online tech platforms - one of the main bottlenecks in combating wildlife trade will be sustaining sufficient international attention to allow the detection and prevention, not just of single illegal transactions, but of organised trade networks operating at larger scales.
The importance of wildlife and the impacts of unsustainable trade on biodiversity are undeniable, which highlights the urgency of developing high-throughput methods that are widely applicable. The next section presents some of the most commonly used methods in illegal trade identification today. In the final section, we provide recommendations on which techniques to use for the identification and tracking of illegally traded plants, and discuss future developments that could improve global wildlife trade monitoring and control.
Traded plant materials come in all shapes and sizes and in different stages of processing, ranging from complete living plants to raw timber logs and engineered wood products. There is a wide variety of molecular and non-molecular methods for illegal wildlife trade monitoring, from DNA (meta)barcoding and genetic methods, to chemical identification, and computer vision and pattern recognition tools. Each of these methods is applicable to certain types of materials and requires knowledge about the different aspects of the traded product that determine its legality, including species identity, geographic origin, source population (wild or cultivated), and sample age. Here we describe the most commonly used methods to identify each of these aspects, and why they are important.
Methods for species identification are used to ascertain whether the organism being traded is CITES-listed or not. Depending on the taxonomic rank that is listed, it may be necessary to identify the exact species (e.g.,
The main chemical method used to identify species is Direct Analysis in Real Time (DART) coupled with time-of-flight (TOF) mass spectrometry (DART-TOF MS). DART-TOF MS consists of two parts: DART is an ionisation source that uses electronically excited-state helium to ionise ambient atmospheric molecules, which in turn react with molecules in the investigated sample to produce analyte ions (
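The TOF mass spectrometer then separates the resulting ions by mass-to-charge ratio, and the measured spectrum is matched against reference spectra of known species. As a purely illustrative sketch of that matching step (the binned spectra, species names, and intensity values below are hypothetical, not real reference data), a query spectrum can be compared to references by cosine similarity:

```python
# Minimal sketch: matching a DART-TOF mass spectrum against reference
# spectra by cosine similarity. Spectra are hypothetical toy vectors,
# binned onto a shared m/z grid.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two intensity vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

references = {
    "Dalbergia nigra":     np.array([0.1, 0.8, 0.3, 0.0, 0.5]),
    "Dalbergia latifolia": np.array([0.2, 0.7, 0.1, 0.4, 0.2]),
}
query = np.array([0.1, 0.9, 0.2, 0.1, 0.4])  # spectrum of a seized sample

best = max(references, key=lambda sp: cosine_similarity(query, references[sp]))
print(f"Closest reference spectrum: {best}")
```

In practice, dedicated chemometric software with curated spectral libraries is used rather than ad hoc scripts, but the underlying logic of comparing a query profile against a reference library is the same.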
Thanks to machine learning and computer vision, expert systems are playing an increasingly important role in the identification of a wide variety of wildlife-related objects, such as medicinal leaves (
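As a rough sketch of how such an image classifier is typically built (the directory layout, model choice, and hyperparameters below are assumptions for illustration, not the systems cited above), a pretrained convolutional network can be fine-tuned on labelled photographs of the taxa of interest:

```python
# Minimal sketch of fine-tuning a pretrained CNN for plant identification.
# The folder layout ("data/train/<species>/*.jpg"), model choice, and
# hyperparameters are illustrative assumptions only.
import torch
from torch import nn
from torchvision import datasets, models

weights = models.ResNet18_Weights.DEFAULT
dataset = datasets.ImageFolder("data/train", transform=weights.transforms())
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

model = models.resnet18(weights=weights)
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))  # new head

optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one illustrative training pass
    optimiser.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimiser.step()
```

The quality of such a classifier depends entirely on the breadth and labelling accuracy of the training images, which is why reference image collections are as critical here as sequence databases are for DNA-based methods.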
DNA-based identification methods can use different genomic markers that offer different levels of identification, from universal loci such as conserved genes or intergenic spacers, to neutrally evolving markers with sufficient variation to resolve specific taxa, such as microsatellites and genome-wide Single Nucleotide Polymorphisms (SNPs). In addition to these markers, which require information about genomic context, it is also possible to identify species and populations using alignment-free shotgun data (see Chapter 17 Species delimitation).
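To make the alignment-free idea concrete, a minimal sketch (the sequences and k-mer length below are toy values, not a validated workflow) reduces each sequence to a set of k-mers and compares sets with the Jaccard index:

```python
# Minimal sketch of alignment-free sequence comparison via k-mer sets.
# Toy sequences; real analyses operate on millions of shotgun reads.

def kmer_set(seq: str, k: int = 8) -> set:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Proportion of shared k-mers between two sets."""
    return len(a & b) / len(a | b)

query = "ATGGCGTACGTTAGCCGTAGGCTA"
refs = {
    "Species A": "ATGGCGTACGTTAGCCGTAGGCTT",
    "Species B": "TTACGGATCCGATCGATTACGGAT",
}
scores = {name: jaccard(kmer_set(query), kmer_set(s)) for name, s in refs.items()}
print(max(scores, key=scores.get), scores)
```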
For species identification, DNA barcoding (see Chapter 10 DNA barcoding) is often the method of choice. It can effectively identify traded plant species in a number of cases, including the identification of rosewood (
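The matching step itself is conceptually simple: a query barcode is compared against reference sequences and assigned to the closest match. As a sketch of this (the sequences and species names below are toy placeholders, not real barcode references), Biopython's pairwise aligner can score a query against each reference:

```python
# Minimal sketch: assigning a query barcode to its closest reference
# sequence by pairwise alignment score. Toy sequences only.
from Bio import Align

aligner = Align.PairwiseAligner()  # defaults: match=1, mismatch/gap=0
aligner.mode = "global"

references = {
    "Species A": "ATGGCGTACGTTAGCCGTAGGCTAACGT",
    "Species B": "ATGGCGTTCGTTAGCAGTAGGCTTACGA",
}
query = "ATGGCGTACGTTAGCCGTAGGCTTACGT"

scores = {name: aligner.score(query, seq) for name, seq in references.items()}
best = max(scores, key=scores.get)
print(f"Best match: {best} (score {scores[best]:.0f})")
```

In practice, queries are compared against large curated reference databases (e.g., BLAST searches against GenBank or BOLD) rather than a handful of hard-coded sequences, and match thresholds are validated per marker and taxon.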
An advantage of DNA barcoding is that the sequence data are universally comparable among labs and across large numbers of species. But since DNA barcoding was originally designed to distinguish between species rather than within them, the method often falls short when higher resolution is needed. Identification below the species level may be useful if the legality of trade is determined by the source population. In some cases, the country of origin determines the legal status of traded plants, which requires population-level data for a collection of reference samples spanning the species range. Cost-effective traditional population genetic methods use a number of species-specific variable markers, typically simple sequence repeats (SSRs) or inter simple sequence repeats (ISSRs), which can be highly variable and reveal fine-grained population structure. More recently developed high-throughput sequencing approaches cover larger sections of the genome, such as reduced representation sequencing (e.g., RAD-seq or target capture) and low-coverage whole genome shotgun sequencing (also known as genome skimming; see Chapter 16 Whole genome sequencing).
These methods can generate large numbers of SNPs that allow inference of geographic origin at various scales. Although the increased costs of library preparation and sequencing mean that these methods are not economically feasible in all cases, they offer the added advantage that functional analyses of genes, or of markers linked to genes of adaptive significance, become possible.
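As a schematic illustration of SNP-based assignment (the allele frequencies and genotype below are invented, and real analyses use thousands of markers and dedicated software), a sample can be assigned to the candidate population under which its genotype is most likely, assuming Hardy-Weinberg proportions:

```python
# Minimal sketch of population assignment from SNP genotypes using
# per-population allele frequencies (hypothetical values).
import math

# Reference-allele frequencies per population at three SNPs
pop_freqs = {
    "Population West": [0.9, 0.2, 0.7],
    "Population East": [0.3, 0.8, 0.4],
}
genotype = [2, 0, 1]  # copies of the reference allele at each SNP

def log_likelihood(freqs, geno):
    """Log-likelihood of a genotype under Hardy-Weinberg proportions."""
    ll = 0.0
    for p, g in zip(freqs, geno):
        if g == 2:
            ll += math.log(p * p)
        elif g == 1:
            ll += math.log(2 * p * (1 - p))
        else:
            ll += math.log((1 - p) * (1 - p))
    return ll

best = max(pop_freqs, key=lambda pop: log_likelihood(pop_freqs[pop], genotype))
print(f"Most likely origin: {best}")
```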
Geographic origins have even been identified at the level of continents using genome skimming (
While population genetic markers can offer unmatched resolution of spatial variation, a general disadvantage is that many of them (with the exception of those used in RAD-seq and shotgun sequencing) need to be tested or developed specifically for each species, and reference data must be generated for populations across the distribution range. Stable isotope analysis can also infer the geographic origin of samples, and does not depend on species-specific reference data to the same extent as genetic methods do. It is based on the principle that the abundance of stable isotopes in the environment depends on both climate and geography, which creates a correlation between a sample's stable isotope profile and its geographic origin (
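As a toy illustration of the assignment logic (the isotope ratios and regional means below are invented), a sample can be assigned to the region whose reference isotope profile it most closely resembles:

```python
# Minimal sketch: nearest-centroid assignment of a sample to a region
# based on stable isotope ratios (hypothetical values for three isotopes).
import math

regional_means = {
    "Region A": (-27.1, 24.5, 3.2),
    "Region B": (-24.8, 21.0, 5.9),
}
sample = (-26.7, 23.9, 3.6)

def distance(a, b):
    """Euclidean distance between two isotope profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

origin = min(regional_means, key=lambda r: distance(sample, regional_means[r]))
print(f"Closest regional isotope profile: {origin}")
```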
There are two methods to measure radiocarbon abundance: radiometric dating and accelerator mass spectrometry (AMS). Both can be used to date samples based on the decay of carbon isotopes. The estimated age gives an indication of whether or not the traded sample is pre-convention material, meaning that it predates the convention or the listing of the species (e.g.,
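As a minimal numerical sketch, the conventional radiocarbon age follows directly from the measured fraction of modern carbon (F14C) via the Libby mean life of 8033 years; converting this to calendar years then requires calibration against a curve such as IntCal:

```python
# Minimal sketch: converting a measured fraction modern carbon (F14C)
# into a conventional radiocarbon age. Calibration to calendar years
# is a separate step not shown here.
import math

LIBBY_MEAN_LIFE = 8033  # years

def radiocarbon_age(f14c: float) -> float:
    """Conventional radiocarbon age in years BP: t = -8033 * ln(F14C)."""
    return -LIBBY_MEAN_LIFE * math.log(f14c)

# F14C close to 1 indicates recent material (F14C > 1 reflects bomb carbon)
print(f"{radiocarbon_age(0.995):.0f} years BP")  # ~40 years BP
```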
Currently, no genetic methods for inferring sample age can compete with radiocarbon dating, and while DNA fragment sizes tend to be shorter for older and more degraded plant tissues, this alone cannot be used to determine the plant age (see Chapter 2 DNA from museum collections). For other purposes, genetic markers are the method of choice to infer species identity and geographic origin, whenever DNA extraction is a realistic option. Any genetic method will however be limited by the quality and quantity of DNA that can be extracted, which can be notoriously difficult for some materials, especially timber and processed products (
Despite significant progress in methods and computational analyses, applications for most methods are still limited by the lack or incompleteness of suitable reference data. As shown in Table
A comparison of the methods used for identifying plants in trade with an indication of their applications and limitations.
| | DNA (meta)barcoding | Population genetic markers | Computer vision and pattern recognition | DART-TOF MS | AMS/14C dating | Stable isotope analysis |
|---|---|---|---|---|---|---|
| Suitable materials | Whole plants, organs, tissues, powder | Whole plants, organs, tissues, powder | Timber, leaves, flowers, pollen | All | Anything containing organic matter | Anything containing organic matter |
| Application | Determine taxonomic identity from genus to species level | Determine population or region of origin | Determine taxonomic identity, from genus to (sometimes) species level | Determine taxonomic identity at species level | Determine age of material | Determine region of origin |
| Reference data status | Well developed for temperate species, less so for tropical species and regions | Needs to be developed and referenced for each species separately | Being developed for CITES-protected timber and plants | Being developed for CITES-protected timber | Calibration might be required depending on the sample | Needs to be developed for each region separately |
When one method lacks sufficient reference data or is not sensitive enough to infer species identity or population of origin, multiple identification tools (e.g., DNA barcoding, machine learning, and DART-TOF MS) can be combined to improve identification accuracy. Developing an integrated identification framework, which links reference databases and connects multiple sources of data for taxa of interest, is expected to play a major role in the future of regulating wildlife trade, though this would rely on standardisation and equitable distribution to enforcement agencies around the world. Coupled with new technologies that ensure quality control and compliance across the supply chain of wildlife products, the tools available for wildlife trade monitoring can aid not just the detection and confiscation of illegally traded goods, but also the transparency and traceability of legally traded commodities.
With blockchain, for example, it may eventually be possible to develop a secure and robust infrastructure to register and track wildlife-related products from source to destination (
The technology has already proven its relevance in agriculture and fisheries: the WWF Blockchain Tuna Project demonstrated that it is possible to track the history of a fish product from ocean to plate with just a QR code (WWF, 2018). The customisable and scalable features of blockchain make it a promising technology for traded timber and other wildlife-related products (MoonX, 2019). Once it is possible to keep track of all steps throughout the commercialisation of wild-harvested plants, checkpoints for identification will no longer be restricted to points of entry or sale, enabling monitoring of wildlife trade from the source.
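To convey the core idea in a few lines, here is a deliberately simplified sketch (real blockchains add digital signatures, consensus, and distributed storage, and the custody steps below are hypothetical): each record commits to the hash of the previous record, so tampering with any step invalidates everything downstream:

```python
# Minimal sketch of the hash-chain idea behind blockchain traceability.
# Custody steps are hypothetical examples.
import hashlib
import json

def record_hash(payload: dict) -> str:
    """Deterministic SHA-256 hash of a record's content."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

chain = []
for step in ["harvest permit issued", "sawmill intake", "export inspection"]:
    record = {"step": step, "prev": chain[-1]["hash"] if chain else "genesis"}
    record["hash"] = record_hash({"step": record["step"], "prev": record["prev"]})
    chain.append(record)

# Verification: recompute each hash and check the links between records
for i, rec in enumerate(chain):
    assert rec["hash"] == record_hash({"step": rec["step"], "prev": rec["prev"]})
    assert rec["prev"] == (chain[i - 1]["hash"] if i else "genesis")
print("Chain of custody verified")
```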
Customs officers often come across cultural heritage objects such as sculptures made from valuable, legally protected wood (such as Brazilian rosewood). Which method could they use to find out whether the sculpture is made from a CITES-listed species? Motivate your answer.
What is “plant blindness” and why is it hampering the battle against illegal plant trade?
Provide two advantages of AMS over radiometric dating when investigating illegal wildlife trade. Motivate your answer.
Any non-destructive method, such as near-infrared spectroscopy or X-ray micro-CT, could potentially be used, as these preserve the sample in its original form.
Plant blindness is the bias towards noticing animals while taking plants for granted, so that plants are not recognised as anything but background. The downside of plant blindness is that illegal plant trade is considered relatively harmless compared with illegal animal trade.
AMS requires a much smaller sample size (20–500 mg) than radiometric methods (10–100 g). It is also faster and usually yields higher-precision results: samples can be analysed in a few hours with AMS, whereas radiometric methods can take one or two days. If confiscated organisms are still alive, a fast verdict increases their chances of survival, as rescued animals or plants can quickly be transferred back to the wild.
Panagiotis Madesis 1
1 Lab of Molecular Biology, Department of Agriculture Crop Production and Rural Environment, University of Thessaly, Volos, Greece
Forensic science is the application of scientific methods to criminal investigations. Many scientific disciplines can be involved, among them chemistry, botany, entomology, and physics. In many trials, the presence and identification of physical evidence can be a critical factor in determining the final verdict. Physical evidence may include, among other things, plant material such as leaves, flowers, fruits, or pollen. In this sense, forensic botany is the study of plants in criminal investigations, as botanical samples can be critical evidence in crimes (
Plant material that is usually found at crime scenes may include leaves, stems, seeds, pollen, flowers, or any other plant parts (
Plant seeds can be caught and carried in a trouser cuff or on a shoe, and plant leaves and stems can be found in a victim's and/or suspect's car. Plant parts can also be identified in a victim's stomach, nose, or lungs, under fingernails, or on skin, clothes, or hair. However, data generated from recovered botanical material are often not fully exploited, since forensic agents may lack the appropriate know-how. The role of the forensic botanist within the investigative process is to examine the biological material macroscopically and microscopically and to compare samples recovered from crime scenes with reference material. Although botanists might be able to identify a species phenotypically, this is often not possible, and DNA analysis with molecular techniques must be used. New tools, especially molecular ones, are being developed in forensic botany to aid in both criminal and civil cases. Although chemical analysis of plant material can serve as evidence when a relevant reference database is available, DNA is much more stable than most macromolecules and metabolites and can persist for long periods, even when broken into smaller fragments. It is therefore very often the preferred means of identifying plants in forensics (
Examination of DNA is a powerful technique allowing the identification of an individual. A suspect’s DNA and a crime scene sample are matched to reference databases containing the profiles of large numbers of individuals generated over time (
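To illustrate how the evidential weight of such a match is commonly quantified (the genotype frequencies below are invented for the example), the so-called product rule multiplies genotype frequencies across independent loci into a random match probability:

```python
# Minimal sketch of the "product rule": assuming independent loci, the
# random match probability is the product of the genotype frequencies
# at each locus. Frequencies here are hypothetical toy values.
genotype_freqs = [0.08, 0.11, 0.05, 0.09]  # one per STR locus

rmp = 1.0
for f in genotype_freqs:
    rmp *= f

print(f"Random match probability: {rmp:.2e}")  # ~4e-05 for these values
```

Real profiles use many more loci, driving the match probability low enough to be highly discriminating between individuals.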
Since then, the development and application of DNA-based methods and genetics have revolutionised forensic science. Nowadays the use of DNA as forensic evidence is routine, with a major impact on the criminal justice system and society (
Forensic genetics is progressing rapidly, as highly sensitive methods for DNA recovery and new sequencing technologies are being developed (
DNA barcoding is the most commonly used method for genetic identification in forensic genetics (
In plants, DNA barcodes are mainly derived from the chloroplast (
Recent developments in DNA analysis now allow for the wider use of biological materials, for example, mixtures of samples such as soil or stomach contents (Figure
Palynology is the study of palynomorphs, including pollen, spores, and dinocysts. Pollen grains are, however, the most studied palynomorph, and in forensics especially they can be an important piece of evidence if they can be associated with a crime scene or retrieved from a suspect or from equipment used at the crime scene. Pollen is of microscopic dimensions and is very easily retained on clothes, on household objects, and in soil. Crime scenes limited to a few square metres, such as a rape scene or the entry point of a burglary, are very often the best candidates for forensic palynology (
In Auckland (New Zealand), a prostitute claimed she was attacked in a passageway by a suspect, around seven metres away from the suspect's car (
In Taipei (Taiwan), the body of a young woman was found lying by a drain in an urban area. It was unknown whether she was a homicide or suicide victim. Her body showed no obvious bone fractures, and it was suspected that she had been involved in a hit-and-run. By the time investigators arrived at the scene, the body had already been transferred to a hospital, where a tiny berry and stem were found in the victim's hair. The berry, however, was not commonly found in the area where the victim lived or where the body was found. The investigators discovered the same plant on the edge of a railing above a drain attached to a building directly next to where the body was found, suggesting that the woman fell from the building and the plant piece became tangled in her hair during the fall (
A murder case in 1992 in Arizona (USA) revealed the power of forensic botany. Seed pods of a Palo Verde tree (
In a Finnish study, RAPD and SSR molecular markers were used on mosses to connect three suspects to a murder scene (
Forensic botany also helped to resolve a case of theft that occurred at a Catholic church in Florence (Italy). In this case, the thief made a mistake, leaving faecal material at the crime scene, as, unfortunately for him, he suffered from diarrhoea. Although a priest at the church had previously cleaned the crime scene of faecal matter, there was still enough material left to be collected by the police. The police suspected a local man with a police record who suffered from Crohn's disease. The suspect denied the accusations and presented an alibi. The police, who had retrieved his blue jeans from the time of the robbery, found them stained with faeces, yet the suspect still denied being guilty and challenged the police to "prove it". The comparison of the two samples revealed 14 dietary items of botanical origin that matched, and none that did not, forcing the suspect to confess to the crime (
In the early 1980s, a young girl's body was found; her last known meal had been with her boyfriend at a local fast-food restaurant. An autopsy, however, revealed vegetables in her stomach that were not on the restaurant's menu. A botanical investigation confirmed the autopsy results, suggesting she had eaten another meal before her death, which helped clear her boyfriend of any charges. The case was solved a few years later when a serial killer confessed to the murder (
In the Black Widow case of 1993, a domestic homicide was solved with the help of forensic botany. The victim, Gerry, was married to Jill, who had had seven previous marriages. When Gerry found out that Jill had not actually divorced her seventh husband before marrying him, he went to court to annul the marriage and freeze his assets. On the day of his death, Gerry had a breakfast of coffee, hash browns, eggs, and toast, and Jill and her then-boyfriend were spotted near his house. Forensic botanists examined the contents of his stomach and found starch and onion, concluding that the only meal he had eaten was his breakfast and that he had not gone out for another meal. This coincided with the time Jill was seen at his house and allowed the police to issue a search warrant for her property. The police found a gun and other evidence, which led the court to find her and her boyfriend guilty (
In a homicide case, a body was found in a stream near a roadside covered with the knotgrass
Metagenomic analysis of human DNA was used in a sexual assault case that took place in the Netherlands in 2015 and involved a 28-year-old woman. The woman preserved her clothes after the assault and also took intimate samples from herself. Initially, the samples were analysed using capillary electrophoresis (CE). A year later, the CE results produced a hit in the Dutch convicted-offender database. However, the analysis was challenged, and the ambiguous results allowed the suspect to go free. Only after the use of massively parallel sequencing was it possible to match the suspect's environmental DNA with the assault evidence, which finally led to his conviction in 2018 (de Knijf 2020).
What is the advantage of plant DNA over other plant metabolites as forensic evidence?
Is DNA barcoding a suitable approach for plant forensics? Motivate your answer.
Can Bar-HRM be used in plant forensics? Motivate your answer.
Why is palynology a suitable method for plant forensics? Motivate your answer.
DNA is more stable over time and persists for longer periods, so it is more useful for identifying unknown plant material than plant metabolites are.
DNA barcoding allows the identification of plant species and the development of suitable reference databases, so it is an appropriate approach for forensic use when we want to identify plant species and match plant material found on a suspect with species in a given area. If matches are sought at the population level, however, fingerprinting methods such as microsatellites might be more appropriate. This requires, though, that all compared samples belong to the same species.
Yes. Bar-HRM combines DNA barcoding with High Resolution Melting analysis. It can serve as an alternative to DNA sequencing when rapid results are necessary; however, sequencing probably remains indispensable for forensic use.
Palynology uses pollen for the identification of species and becomes a powerful tool when combined with DNA barcoding. Because pollen is microscopic and easily transferred, it can link suspects and objects to a crime scene; it can also stay intact for thousands of years, protecting the DNA it contains.