Monograph
Corresponding author: Hugo de Boer ( h.de.boer@nhm.uio.no ) © 2022 Hugo de Boer, Marcella Orwick Rydmark, Brecht Verstraete, Barbara Gravendeel.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
de Boer H, Rydmark MO, Verstraete B, Gravendeel B (2022) Molecular identification of plants: from sequence to species. Advanced Books. https://doi.org/10.3897/ab.e98875
Names are the carriers of knowledge. Without names, much of science would be meaningless. Names give us insight into the diseases that affect our health; the objects that sustain our economies; the celestial bodies that travel in the Universe. Names solve ambiguity.
In botany, the name of a plant may provide the first clues as to its characteristics, also called traits. Is it edible, or poisonous? Beautiful, or ugly? While some traits are relative (edible by whom, ugly to whom?), others are absolute: thorny, succulent, epiphytic. Some are obvious, others elusive. From morphological descriptions and DNA sequences to historical accounts and traditional uses, they are all linked by the name.
Until recently, the reliable identification of plants was the task of a select few: the taxonomists. Today, this is less so. The molecular identification of plants through DNA barcodes has been shown to perform just as well as, and in fact often better than, taxonomists for many taxa, particularly when specimens lack reproductive structures. Other techniques, such as image recognition through machine learning and the spectrophotometric signature of leaves, can yield similar results. Does this mean the demise of taxonomists is on the horizon?
Not at all. I believe it is very much the opposite: in the current environmental crisis, the need to document and protect the world’s biodiversity has never been more acute. At the same time, some 20% of all plant species have not yet been scientifically described, and many of them may disappear even before we have identified and characterized them. The work of taxonomists therefore remains critical, but as molecular identification of species is underway and set to become routine across the private and public sectors, expert time can now be reallocated from bulk identifications to the training of students, the build-up of physical and digital reference collections, and the further development of identification methods. Technologies are here to help – not replace – taxonomy, by complementing human strengths and compensating for some of our human weaknesses: an insufficient memory, a biased brain, and lack of time.
This book is for you who are curious about how plants can be identified using DNA: the most powerful source of information to link a plant to a name. This may sound trivial, but it is not. Don’t despair in advance, though: it is doable, mostly fun, and always rewarding. You just need to learn how.
Here, you will not only learn how various types of materials containing plant fragments can be identified to species in the lab and how to execute sophisticated computer analyses, but also gain a deeper understanding of the complexities and challenges faced by taxonomy in general, and plant identification in particular, including the lack of comprehensive reference databases. Enforcing strict species concepts onto nature’s inherent fluidity doesn’t always work, and despite all recent advances in this field it still happens that some plant samples cannot be confidently named. Yet, if this ever happens to you, this initially frustrating insight can also be scientifically revealing, and help you design further experiments.
The applications of molecular identification are far more numerous and trans-disciplinary than most people would imagine. Several chapters take a deep dive into applications in fields as seemingly disparate as palaeobotany and healthcare, but as I argued at the start of this text, they are all unified by a common denominator: the name, the information-carrier.
I hope you will find this book as inspiring, informative, and revelatory as I have, and that you will choose to carry out your own projects using the molecular identification of plants. And if you do so, just don’t forget to cite the chapters that inspired you!
An estimated 340,000–390,000 vascular plant species are known to science (
Organismal diversity is the foundation of all biological research, but species discovery and delimitation require taxonomic skills. Even the most experienced taxonomists can rarely critically identify more than 0.01% of the estimated 10–15 million species (
The global scientific community lacks the expertise and continuity to identify all species diversity, and biodiversity is lost at a greater speed than we can discover and describe new taxa (
DNA-based species identification, i.e., molecular identification, makes it possible to identify species precisely from trace fragments such as pollen (
These innovations in molecular identification enable us to detect and identify species in places and settings that were unimaginable only a few decades ago, or even in 2020 (
Deoxyribonucleic acid (DNA) is the blueprint of life. DNA encodes genes which carry instructions for the production of proteins, the fundamental components of a cell’s machinery. DNA was first isolated and confirmed as the genetic material in cells, and thereby the basis of heredity, in the 1940s (
A fundamental tenet of molecular biology is that DNA is transcribed into ribonucleic acid (RNA), and subsequently translated into amino acids that form a protein sequence. We now have a much more detailed understanding of this framework, including the varied roles of RNA in gene expression and regulation, and the role of epigenetics—heritable changes in DNA that do not alter the base sequence (e.g., methylation). Since the discovery of DNA, there has been a steady increase in the use of DNA sequences as molecular markers in varied biological contexts, including medical and forensic applications, elucidation of genes encoding adaptive traits, understanding population genomic processes, as well as systematics of prokaryotic and eukaryotic organisms.
Most DNA extraction protocols extract total cellular DNA. In certain experimental cases, it can also be preferable to target either DNA contained in the nucleus or DNA comprising organellar genomes (in plants: mitochondria and plastids). Organellar genomes are much smaller than any plant nuclear genome.
As with virtually all eukaryotes, plants have endosymbiotically derived mitochondria for cellular respiration and energy production. However, compared to other eukaryotic kingdoms (animals in particular), the mitochondrial genome of plants is quite large, ranging between 200 and 750 Kbp in size (
In contrast, plastid genomes (e.g., those found in chloroplasts of leaves or amyloplasts of cereal grains) have a very stable genomic structure and a size of around 150 Kbp in most cases (
Nuclear genomes, particularly in angiosperms, are highly variable in size, with the angiosperm mean and modal 1C (the amount of DNA in an unreplicated gametic nucleus) both at around 5 pg/Gbp (
The first isolation of DNA, by the Swiss physician Friedrich Miescher in 1869, happened accidentally while studying proteins from leukocyte nuclei (
Plants possess a tough cell wall made up of cellulose and other compounds such as lignin, in addition to a cell membrane. This necessitates a robust first step for plant DNA extraction that disintegrates the structure of the plant tissue and breaks down cell walls. In a low-throughput scenario (or for samples that are tougher to disrupt), this could involve flash freezing the tissue with liquid nitrogen followed by grinding with a pestle and mortar. For higher throughput of samples, tissue-disrupting machinery can be applied. The ground material should then be taken forward immediately to the chemical steps of the process, which involve breakdown of the cellular membrane to release the lysate containing the soluble DNA. This is then separated from cell debris and other insoluble material. Various methods are subsequently used to separate DNA molecules from the remaining material, which can contain soluble proteins, nucleic acids, and small molecular metabolites (
Numerous protocols and procedures have been developed to extract DNA from plant material of varying origins (
A major innovation in DNA extraction protocols from plant material was developed by Doyle and Doyle (
Plant material for any research project must be collected ethically and legally, and the preparation of DNA extracts is no exception. Permission, prior informed consent and mutually agreeable terms of use must be obtained before using plant tissue for DNA extraction according to the Convention on Biological Diversity. This includes the fair and equitable sharing of benefits arising from the utilisation of genetic resources (as outlined in the Nagoya Protocol). National and international law and conventions apply to derivatives of biological materials, including DNA extracts and their transportation. The same principles apply to botanical collections such as seeds, silica dried specimens stored in a tissue bank, herbarium specimens, or plants in living collections. The terms under which they are stored in a collection may restrict the use of specimens for research and require additional permissions (for instance, from the regulatory authority in the country of origin) before they can be used. The storage and future use of DNA extracts, likewise, must comply with the terms of the permissions granted, which could include being stored indefinitely for future research, returned to the country or institute of origin, or discarded. See Chapter 2 DNA from museum collections for guidance about your responsibilities as a researcher.
DNA can be extracted from healthy plant tissues including leaves, flowers, buds, seeds, roots, bark, and even spines. Young leaf tissue is the preferred starting material (
Successful extraction of high-quality DNA from any plant material depends on the material being prepared correctly, dried rapidly (without excessive heat treatment), and stored in a dark, dry place to minimise degradation of its DNA. DNA degradation prior to extraction is caused by the release of endogenous nucleases during cellular lysis, which may be accelerated by environmental factors such as heat and humidity (
The extraction method is determined by the plant material available. For most kit- and CTAB-based protocols, a 1 cm² section of herbaceous leaf tissue will suffice for a single extraction. Careful laboratory notes of the material used, including provenance data, sample weight, and extraction date, are vital for checking the quality of sequencing results against the specifics of the extraction process in the lab and for pinpointing reasons for variation between samples. For some protocols, weighed tissue can be placed straight into a 1.5 ml tube labelled with a unique number or laboratory code and other information, ready for the DNA extraction process.
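The laboratory notes described above can be kept as a small structured record per sample. The following is a minimal sketch only: the class name, field names, and example values are hypothetical and are not part of any protocol cited in this chapter.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExtractionRecord:
    """One row of laboratory notes for a single DNA extraction (illustrative fields)."""
    lab_code: str            # unique number or laboratory code written on the tube
    provenance: str          # where the material came from (collection, voucher, etc.)
    tissue_type: str         # e.g., leaf, seed, root
    sample_weight_mg: float  # weighed tissue placed in the tube
    extraction_date: date
    protocol: str = "CTAB"   # assumed default; change per experiment

    def tube_label(self) -> str:
        """Short text suitable for labelling a 1.5 ml tube."""
        return (f"{self.lab_code} | {self.extraction_date.isoformat()} | "
                f"{self.sample_weight_mg:g} mg")


# Hypothetical example values, for illustration only.
record = ExtractionRecord(
    lab_code="EX-0001",
    provenance="hypothetical silica-dried field collection",
    tissue_type="leaf",
    sample_weight_mg=20.0,
    extraction_date=date(2022, 5, 3),
)
print(record.tube_label())
```

Keeping the record immutable (`frozen=True`) is a design choice that makes it harder to alter notes accidentally after the extraction date has been logged.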
Plant material dried and stored in silica gel – including as specimens stored in tissue banks specifically for the purpose of DNA extraction – tends to be a good source of high-quality DNA. Silica gel (silicon dioxide xerogel) is a desiccant that removes moisture from the atmosphere, drying out the plant tissue. Indicator silica gel crystals change colour when the silica is saturated, signalling when the silica gel should be regenerated or replaced. These crystals can be used in a mixture with non-indicating silica gel.
The use of silica gel is a popular approach to dry fresh plant material for DNA extraction because it is low cost and convenient compared to liquid nitrogen or lyophilization, especially when preparing tissue in the field. To effectively preserve the DNA in plant tissue, the recommended minimum ratio between plant material and silica is 1:10 (
One approach is to freeze plant tissue until needed for DNA extraction, preferably at –80 °C, and otherwise in a standard laboratory freezer at –20 °C, if the sample is properly sealed. Alternatively, material can be flash frozen in liquid nitrogen. The resulting rapidly frozen material can yield high-quality DNA extractions, but liquid nitrogen is impractical for some settings due to handling considerations and cost (
High-quality DNA can be extracted from lyophilized (or freeze-dried) tissue, such as leaves and roots (
After the plant material has been prepared by drying and/or freezing using one of the above-mentioned techniques, a DNA extraction protocol can be implemented. Although there are a multitude of available protocols, the general methodology involves the following steps, discussed in more detail below:
We place emphasis on the CTAB protocol due to its popularity, but also introduce other protocols that may be of interest to the reader.
The starting amount of plant tissue is important: too little will result in an unsatisfactory yield and too much may lead to poor grinding, saturation of the reaction and/or excessive debris which can also be detrimental to final yield. A useful starting ratio is a buffer quantity that is fivefold that of the weight of the leaf tissue (e.g., 0.2 g leaf tissue for 1 ml of buffer) (
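The fivefold starting ratio can be written as a one-line calculation. This sketch assumes, as the worked example in the text implies, that 1 ml of buffer is treated as roughly 1 g; the function name and the `fold` parameter are illustrative.

```python
def buffer_volume_ml(tissue_mass_g: float, fold: float = 5.0) -> float:
    """Return the extraction buffer volume (ml) for a given tissue mass (g).

    Assumption: buffer density is about 1 g/ml, so a fivefold weight ratio
    corresponds to 5 ml of buffer per gram of tissue.
    """
    if tissue_mass_g <= 0:
        raise ValueError("tissue mass must be positive")
    return tissue_mass_g * fold


# The example from the text: 0.2 g of leaf tissue.
print(f"{buffer_volume_ml(0.2):.2f} ml of buffer")
```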
Plant tissue must be finely ground to a powder such that the cell walls are disrupted and the cell membranes are more accessible for the chemical reagents in subsequent steps to act successfully. It is advisable to scrape hairs or wax from the surface of the plant tissue before weighing and grinding. For herbarium specimens, special care should be taken that any glue that may be present is removed since this can interfere with the reagents used during the DNA extraction. Sterilised sand can also be used to increase the friction and enhance the disruption of the tissue; it will be separated later in the DNA extraction protocol. Fleshy tissue can be flash frozen in a mortar with a little liquid nitrogen before grinding. The dewar for transporting the liquid nitrogen should be clean and free of potential contaminants.
Manual grinding is inexpensive, yet time consuming and requires a sterilised mortar, pestle, and spatula for each sample. Use of a mechanical homogenizer, also called a tissue lyser, is more efficient. A steel ball bearing is added to each tube with a sample and shaken at high frequency within the instrument. This allows multiple samples to be disrupted simultaneously with minimal degradation of the nucleic acids. It also minimises loss of material and the chances of contamination, as each sample is processed in the tube that it remains in for subsequent extraction steps. Metallic, ceramic, or silica beads of different sizes can be added to the sample tubes to increase the disruption of particularly tough or woody material. Metallic and ceramic beads must be removed before proceeding with the protocol, but silica beads can be separated later in the protocol.
This step can be included as an optimisation strategy for increased yield, quality, or purity of the extracted DNA. For example, when high amounts of polysaccharides and/or polyphenols in the plant material are a concern (as is the case for succulent plants and plants in high stress environments, respectively), the modified STE-CTAB protocol can be used (
In this stage, the goal is to release nucleic acids from the cell while minimising the risk of their degradation, and to begin separating unwanted cellular compounds from the DNA molecules.
The hallmark of the most widely adopted method for DNA extraction from plants, originally developed by Doyle and Doyle (
CTAB buffer is added to each sample tube containing ground plant tissue and the mixture is incubated at 60–65 °C for 15–60 minutes. This can be done in an automatic shaking incubator. Alternatively, the sample tubes can be periodically shaken manually.
Alternatively, methods involving an SDS buffer can be applied (
The goal of this stage is the separation of DNA from other molecules in the lysate, by making use of the differing polarity of these molecules. This is followed by DNA precipitation from the solution.
In the CTAB protocol, the methodology is phase separation using organic solvent(s), where hydrophilic molecules, including DNA, can be isolated. A 24:1 solution of chloroform-isoamyl alcohol (SEVAG buffer) is added to the incubated CTAB/leaf tissue mixture. This solution is hazardous and must be prepared and added to the sample tubes in a fume hood to avoid inhalation. It is also highly volatile, so it should be dispensed promptly to limit evaporation during the work. The mixture is then centrifuged at room temperature, which results in the DNA becoming concentrated in the clear upper phase (i.e., the aqueous phase). The supernatant is very carefully drawn off with a pipette without disturbing or touching the organic phase (containing the chloroform with lipids, proteins, and other cellular debris) and transferred to a new tube. The supernatant is purified by adding RNase A and chilled isopropanol, where the latter induces precipitation of DNA. Samples are then transferred to a freezer at –20 °C, either overnight or for several days if sample input is low and maximum precipitation is desirable (at the cost of potential co-precipitation of salts).
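The 24:1 ratio translates into component volumes as follows. This is a sketch of the arithmetic only, assuming the ratio is by volume; the function name is illustrative and the total volume is an arbitrary example.

```python
def sevag_volumes_ml(total_ml: float,
                     chloroform_parts: int = 24,
                     isoamyl_parts: int = 1) -> tuple[float, float]:
    """Split a total volume into chloroform and isoamyl alcohol volumes.

    Assumption: the 24:1 ratio is volume to volume.
    """
    if total_ml <= 0:
        raise ValueError("total volume must be positive")
    parts = chloroform_parts + isoamyl_parts
    chloroform_ml = total_ml * chloroform_parts / parts
    isoamyl_ml = total_ml * isoamyl_parts / parts
    return chloroform_ml, isoamyl_ml


# Example: preparing 50 ml of the mixture (in a fume hood).
chloroform, isoamyl = sevag_volumes_ml(50.0)
print(f"chloroform: {chloroform:.1f} ml, isoamyl alcohol: {isoamyl:.1f} ml")
```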
In the SDS protocol, proteins and polysaccharides precipitate with the SDS itself. Sodium acetate in turn is used to precipitate the DNA; in solution this compound dissociates and the sodium ions (Na⁺) neutralise the negative charges on the sugar-phosphate backbone of DNA molecules, thus making the DNA less hydrophilic and amenable to precipitation (
As a final step to both methodologies, the samples are centrifuged to encourage the formation of a DNA pellet, optionally washed with 70% ethanol at least once and re-suspended, preferably in 10 mM Tris-EDTA buffer (which serves to protect the DNA from damage, as explained in the CTAB buffer recipe above).
The DNA isolation stage is not perfect. Since the extraction process involves steps that segregate compounds by binding properties and molecular weight, co-extraction of molecularly similar polysaccharides is common. Furthermore, the eluent can contain certain contaminants, including traces of chemicals added during the extraction process and precipitated salts, as well as endogenous proteins, tannins, polysaccharides, and other molecules. The presence of such compounds can negatively impact the downstream experimental use of the DNA (i.e., act as PCR inhibitors), and further purification of DNA using various clean-up steps may be necessary.
One strategy is using a silica column and centrifugation-based method, by adding a chaotropic agent (commonly guanidine hydrochloride), which disrupts the hydrogen bonds between water molecules, creating a more hydrophobic environment. This increases the solubility of non-polar compounds (often contaminants) and additionally breaks up the hydration shell that forms around the negatively charged DNA phosphate backbone and further promotes efficient adsorption to the column surface under high salt and moderately acidic conditions (
An alternative involves the use of Solid Phase Reversible Immobilisation (SPRI) beads (
When a DNA extraction protocol does not yield satisfactory results, in terms of quality or quantity of extracted DNA, modifications can be applied. A valuable strategy for this is conducting a search of the scientific literature for protocols that have been used for similar experimental purposes or have targeted the same taxonomic groups.
If using the CTAB protocol, understanding the biochemical actions and interactions of its components is a useful starting point for identifying what might need adjustment to help improve the outcome. CTAB acts according to the ionic strength of the solution; the concentration of NaCl must be at least 0.5 M so that CTAB does not bind to nucleic acids, but does bind to proteins and neutrally charged polysaccharides as desired. NaCl is most commonly used at a concentration of 1.4 M. When working with a plant group that has a high content of polysaccharides, experimenting with higher concentrations of NaCl may improve the purity of the final DNA. Sometimes, other reagents such as N-Lauroylsarcosine (sarkosyl) buffer can be added, to enhance lysis (rupturing of the cell membrane) and to reduce the activity of DNase or RNase enzymes. Proteinase K can also be added to enhance the denaturation of proteins. The volume of 24:1 chloroform-isoamyl alcohol solution can also be adjusted. Phenol can be added as an additional non-polar, organic solvent that is highly effective in denaturing proteins and can aid in increasing the final DNA yield, as opposed to solely applying chloroform (
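When adjusting the NaCl concentration, the mass to weigh out follows from the molarity and buffer volume. The sketch below uses the molar mass of NaCl (58.44 g/mol) and treats the 0.5 M lower bound from the text as a hard check; the function and constant names are illustrative.

```python
NACL_MOLAR_MASS_G_PER_MOL = 58.44  # molar mass of sodium chloride
MIN_NACL_MOLARITY = 0.5            # lower bound stated in the text for CTAB buffers


def nacl_mass_g(molarity: float, volume_ml: float) -> float:
    """Mass of NaCl (g) needed for a buffer of the given molarity and volume."""
    if molarity < MIN_NACL_MOLARITY:
        raise ValueError("NaCl below 0.5 M: CTAB may bind nucleic acids")
    if volume_ml <= 0:
        raise ValueError("volume must be positive")
    return molarity * (volume_ml / 1000.0) * NACL_MOLAR_MASS_G_PER_MOL


# Example: the commonly used 1.4 M NaCl in 100 ml of buffer.
print(f"{nacl_mass_g(1.4, 100.0):.2f} g NaCl")
```

Raising an error below 0.5 M is a deliberate choice for this sketch; a real workflow might prefer a warning instead.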
Tris-HCl and EDTA are present in nearly all protocols. β-mercaptoethanol is toxic and should thus be handled with care, and always in a fume hood with an extractor fan. One may consider simply not adding this reagent to the solution for plant tissues low in phenolic compounds. However, it is important to note that phenolic compounds co-precipitate with DNA and thus can be problematic in downstream steps of DNA laboratory work. β-mercaptoethanol can be replaced with less toxic alternatives such as PVP (polyvinylpyrrolidone). PVP attaches to phenolic compounds via hydrogen bonding and can be removed together with them after centrifugation (
Most commercial kit-based protocols use a combination of buffers that perform similar functions to the components of the CTAB protocol, with a final step of elution through silica columns, which tends to yield relatively clean DNA extracts. An added benefit of column-based kits is the use of filter columns at an earlier stage for the separation of crude plant material. Silica-based columns bind DNA so that it can be washed multiple times with alcohol-containing solutions to remove contaminants before DNA elution. This speeds up DNA extraction significantly, reducing the total time from multiple days – as is common in regular protocols – to 6 hours. Drawbacks of these approaches, however, include reduced yields of purified DNA in comparison to CTAB + chloroform extractions, as well as a significantly higher (~3–4 fold greater) cost.
Commercial kits that use magnetic beads are also becoming increasingly popular. Magnetic bead extraction kits are highly versatile and provide high yields of DNA that are also highly pure, in the absence of the hazardous solvents chloroform and phenol. After plant tissue grinding and lysis with an appropriate buffer, DNA is bound to the surface of the magnetic particles. The magnetic particle-DNA system is then washed several times with alcohol-containing solutions before a final elution step with a low salt buffer or nuclease-free water. In contrast to the column-based extraction method, binding of DNA to the magnetic particles occurs in solution, thus enhancing the efficiency and kinetics of binding and simultaneously increasing the contact of the bead-DNA compounds with the wash buffer, which improves the purity of the DNA. Magnetic particle kits have also been applied in combination with steps from the CTAB extraction method to extract high quality DNA from sorghum leaves and seeds, cotton leaves and pine needles (
Finally, a less common commercial method involves the use of Whatman FTA® PlantSaver cards and custom reagents. This method is very practical in terms of collection of samples in the field and their transportation. Furthermore, immediate mechanical disruption of the plant tissue can eliminate the need for obtaining permits. While this method has been predominantly applied to agricultural plant taxa, its performance in 15 phylogenetically diverse non-agricultural taxa has been demonstrated, where DNA from these samples was found to be less fragmented than that from replicate samples extracted alongside with the CTAB method (
Assessment of the properties of each genomic DNA (gDNA) sample post-extraction – its integrity, quantity, and purity – is imperative for making decisions regarding downstream molecular work. The methods described below have some overlapping uses in terms of assessing these different properties, but we highlight which is most appropriate for each DNA quality-related aspect.
Agarose gel electrophoresis is an appropriate method for estimating DNA integrity, as well as for crudely estimating DNA concentration. This method requires a horizontal gel electrophoresis tank with an external power supply, agarose, a running buffer such as Tris-acetate-EDTA (TAE) or sodium borate (SB), a fluorescent intercalating DNA dye, a loading dye, and a DNA standard (‘ladder’). The intercalating dye is added to the buffer (or sometimes to the loading dye) and serves to visualise the DNA in the agarose gel at the end point of electrophoresis. Historically, ethidium bromide was the standard intercalating agent, but it has now mostly been superseded by safer dyes that are less carcinogenic and do not require complex disposal procedures. Nonetheless, it is recommended that any compound that intercalates DNA be handled with care. The DNA standard is referred to as a ladder, since it is a complex of appropriately sized DNA standards of known concentrations which provide different benchmarks of size and concentration for comparison.
Each DNA sample and the DNA standard (ladder) are combined with loading dye and then pipetted into a well of the agarose gel, to then be subjected to an electric field. Due to the negatively charged phosphate backbone, DNA molecules will migrate towards the positively charged anode. The DNA migration rate depends on the fragment size, where smaller DNA fragments migrate faster, leading to a size-associated separation of DNA molecules. Additionally, the percentage of agarose in the gel will determine the size range of DNA that will be resolved with the greatest clarity. A range of 0.5% to 3% encompasses most applications, where < 1% is best for examining the genomic DNA of plants and 3% would be suitable for examining fragments with small (e.g., ~20 bp) differences in length. Once the fragments have migrated sufficiently to ensure resolution of the DNA and ladder, the gel is transferred to a cabinet with a UV light and the DNA fragments are visualised due to the excitation of the intercalating dye when UV is applied. The approximate yield and concentration of genomic DNA in a gel are indicated by comparison of the sample’s intensity of fluorescence to that of a standard.
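Preparing a gel of a chosen percentage is a simple weight-per-volume calculation. The sketch below assumes the conventional reading of gel percentage as grams of agarose per 100 ml of running buffer, and restricts input to the 0.5–3% range mentioned above; the function name is illustrative.

```python
def agarose_mass_g(percent_w_v: float, gel_volume_ml: float) -> float:
    """Mass of agarose (g) for a gel of the given percentage and volume.

    Assumption: percentage is weight per volume (g per 100 ml of buffer).
    """
    if not 0.5 <= percent_w_v <= 3.0:
        raise ValueError("expected a gel percentage between 0.5 and 3")
    if gel_volume_ml <= 0:
        raise ValueError("gel volume must be positive")
    return percent_w_v * gel_volume_ml / 100.0


# Example: a 0.8% gel for genomic DNA, cast in 50 ml of buffer.
print(f"{agarose_mass_g(0.8, 50.0):.2f} g agarose")
```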
Where a more precise estimation of the size of the DNA fragments is required, automated capillary electrophoresis can be used. Such systems (e.g., Agilent Bioanalyzer, Agilent TapeStation) are more expensive to use, but – aside from precision – offer faster preparation and analysis time.
Fluorescence measurements are considered the most accurate method for quantifying DNA concentration. These involve the addition of fluorescent dyes (in an accompanying buffer), which selectively intercalate into the DNA. Fluorescence measurements use excitation and emission values that vary depending on the dye used. The concentration of unknown samples is calculated by the fluorometer (e.g., Quantus™ or Qubit™) based on a comparison to a standard measurement from DNA of a known concentration (usually lambda bacteriophage DNA). Since the dyes are sensitive to light and degrade rapidly in its presence, sample tubes must be stored in the dark if readings are not taken immediately after their preparation in the buffer.
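The comparison against a standard can be illustrated with a two-point calibration. This is a simplified sketch, not the algorithm of any particular instrument: it assumes a linear response between a blank and a single standard of known concentration, and it ignores the dilution of the sample in the assay buffer. All names and readings are hypothetical.

```python
def concentration_from_fluorescence(sample_rfu: float,
                                    blank_rfu: float,
                                    standard_rfu: float,
                                    standard_ng_per_ul: float) -> float:
    """Estimate DNA concentration (ng/µl) from relative fluorescence units.

    Assumption: fluorescence rises linearly with concentration between the
    blank and the standard.
    """
    if standard_rfu <= blank_rfu:
        raise ValueError("standard must fluoresce more strongly than the blank")
    slope = standard_ng_per_ul / (standard_rfu - blank_rfu)
    # Readings below the blank are reported as zero rather than negative.
    return max(0.0, (sample_rfu - blank_rfu) * slope)


# Hypothetical readings, for illustration only.
estimate = concentration_from_fluorescence(
    sample_rfu=2550.0, blank_rfu=50.0,
    standard_rfu=10050.0, standard_ng_per_ul=100.0,
)
print(f"{estimate:.1f} ng/µl")
```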
A rough estimate of DNA yield and a more useful estimate of DNA purity can be measured via absorbance with a spectrophotometer that emits UV light through a UV-transparent cuvette containing the sample. Absorbance readings are conducted at 260 nm (A260), the wavelength of maximum absorption for DNA. The A260 measurement is then adjusted for turbidity (measured by absorbance at 320 nm), multiplied by the dilution factor, and calibrated using the following conversion factor: A260 of 1.0 = 50 µg/ml pure dsDNA. This useful relationship between light absorption and DNA concentration can be defined according to the Beer-Lambert law. Total yield is obtained by multiplying the DNA concentration by the final total purified sample volume. However, it is key to note that RNA also has maximum absorbance at 260 nm and aromatic amino acids have a maximum absorbance at 280 nm. Both molecules can contribute to the total measured absorbance at 260 nm and thus provide a misleading overestimate of DNA yield.
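The calculation described above can be sketched in a few lines. The conversion factor (A260 of 1.0 = 50 µg/ml dsDNA), the turbidity correction at 320 nm, and the dilution factor come from the text; the function names and example readings are illustrative.

```python
DSDNA_UG_PER_ML_PER_A260 = 50.0  # conversion factor for pure dsDNA, from the text


def dsdna_concentration_ug_per_ml(a260: float,
                                  a320: float = 0.0,
                                  dilution_factor: float = 1.0) -> float:
    """Concentration of dsDNA from absorbance, corrected for turbidity."""
    return (a260 - a320) * dilution_factor * DSDNA_UG_PER_ML_PER_A260


def total_yield_ug(concentration_ug_per_ml: float, sample_volume_ml: float) -> float:
    """Total DNA yield: concentration times final purified sample volume."""
    return concentration_ug_per_ml * sample_volume_ml


# Hypothetical readings: a tenfold-diluted sample in a final volume of 0.1 ml.
conc = dsdna_concentration_ug_per_ml(a260=0.55, a320=0.05, dilution_factor=10.0)
print(f"concentration: {conc:.1f} µg/ml, yield: {total_yield_ug(conc, 0.1):.1f} µg")
```

As the text notes, RNA and protein also absorb near 260 nm, so a value computed this way is an upper estimate unless the extract is known to be clean.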
DNA purity is evaluated by measuring absorbance in the 230–320 nm range. Since proteins are the contaminant of primary concern, absorbance at 260 nm divided by absorbance at 280 nm is the standard metric. DNA can be considered of high quality, and suitable for most genomic applications, when it has an A260/A280 ratio of 1.7–2.0. As a further step, the ratio of absorbance at 260 nm to that at 230 nm can help evaluate the level of salt carryover in the purified DNA, where an A260/A230 of > 1.5 is considered to be of good quality. Strong absorbance at around 230 nm, which would lower this ratio, suggests the presence of organic compounds or chaotropic salts.
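The two purity ratios and their thresholds can be checked programmatically. This sketch applies only the ranges given above (A260/A280 of 1.7–2.0 and A260/A230 above 1.5); the function name, dictionary keys, and example readings are illustrative.

```python
def purity_report(a230: float, a260: float, a280: float) -> dict:
    """Compute A260/A280 and A260/A230 and compare them with the stated ranges."""
    if a230 <= 0 or a280 <= 0:
        raise ValueError("absorbance at 230 and 280 nm must be positive")
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    return {
        "A260/A280": ratio_280,
        "A260/A230": ratio_230,
        # Thresholds taken from the text; adjust for a specific application.
        "a260_a280_in_range": 1.7 <= ratio_280 <= 2.0,
        "a260_a230_in_range": ratio_230 > 1.5,
    }


# Hypothetical readings, for illustration only.
report = purity_report(a230=0.25, a260=0.50, a280=0.27)
for key, value in report.items():
    print(key, value)
```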
Instruments such as the NanoDrop® 2000 spectrophotometer are highly accurate for evaluating the A260/A280 and A260/A230 ratios. This method is not as accurate as fluorescence quantitation, but is most suitable where information on DNA purity is sought and is also time efficient (the sample is loaded directly into the machine and requires no preparation of buffers).
Particularly challenging types of plant tissue, as well as degraded plant material, can still yield high-quality DNA if suitably optimised protocols are followed.
For instance, seeds can be a good source of DNA if specialised protocols are used (
Advances in the sensitivity of genomic sequencing and optimised DNA extraction methods make it possible to study herbarium and other dried botanical specimens (
Physical and chemical degradation is to be expected in herbarium and museum specimens; DNA in deceased tissue breaks down over time. The rate of physical fragmentation is related to temperature and other environmental variables, as well as the composition of the plant tissue itself. In a study of herbarium specimens, it was shown that fragment length significantly regressed against sample age going back 300 years (
The CTAB extraction protocol is generally preferable for extracting fragmented DNA, as it tends to give higher yields of DNA than kit-based methods. Where fragments are predicted to be very short, a high-volume chaotropic salt used as a binding buffer in the latter stage of extraction can improve the recovery of DNA molecules (
A wide variety of DNA extraction protocols are available in the literature. The structural, biochemical, and genomic characteristics of plants present a particular set of challenges; isolating high purity, undamaged DNA from plant tissue is non-trivial and requires a careful and patient approach in the laboratory. Therefore, researchers must often optimise a chosen protocol for their specific experiment. Success in the primary step of a molecular workflow is crucial, unlocking the downstream steps of plant molecular identification and characterisation, and hence possibilities for addressing many exciting questions in molecular and evolutionary biology.
Absorbance – A measure of the quantity of light absorbed by a sample, also referred to as optical density, measured using an absorbance spectrophotometer.
Beer-Lambert law – For a material through which light is travelling, the absorbance of the light is directly proportional to both the path length of the light and the concentration of the sample.
Chaotropic agent – A chemical substance which in an aqueous solution destroys the hydrogen bonds between water molecules (e.g., guanidine hydrochloride).
Cryopreservation – A preservation treatment for biological material, which involves cooling to very low temperatures (–80 °C or below, e.g., –196 °C using liquid nitrogen).
Desiccant – A substance with a high affinity for water, such that it attracts moisture from surrounding materials, resulting in a state of dryness in its vicinity (e.g., silica gel).
DNA integrity – The level of fragmentation of extracted DNA, where minimal fragmentation of the original chromosomes equates to high DNA integrity.
Intercalating dye – A dye whose molecules stack between adjacent bases of DNA. It is invaluable for DNA visualisation, but it is also a hazard to human health and demands laboratory safety precautions.
Lysate – A usually fluid mixture of cellular contents that results from the disruption of cell walls and membranes via cell lysis.
Molecular marker (in a genetic context) – A sequence of DNA, which can be a single base pair, a gene, or repetitive sequence, with a known location in the genome, which tends to exhibit variation amongst individuals or taxa, such that it has useful research applications.
Organellar genome – The genetic material present in a plastid or mitochondrion, typically in the form of a small and circular genome and often in multiple copies within each organelle. These are thought to be present in eukaryotic cells as a result of endosymbiosis.
Plastome – The total genetic information contained by the plastid (e.g., chloroplast) of a plant cell.
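The Beer-Lambert law entry above can be made concrete with a short calculation. The following is a minimal sketch rather than part of any protocol described here: it assumes the widely used conversion that an absorbance of 1.0 at 260 nm over a 1 cm path corresponds to roughly 50 ng/µL of double-stranded DNA, and the function name and example readings are hypothetical.

```python
# Assumption (not from the text): A260 = 1.0 over a 1 cm path length
# corresponds to roughly 50 ng/uL of double-stranded DNA.
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dsdna_concentration(a260, path_length_cm=1.0, dilution_factor=1.0):
    """Estimate dsDNA concentration (ng/uL) from absorbance at 260 nm.

    Beer-Lambert: absorbance is proportional to path length and
    concentration, so concentration scales as absorbance / path length.
    """
    if path_length_cm <= 0:
        raise ValueError("path length must be positive")
    if a260 < 0:
        raise ValueError("absorbance cannot be negative")
    # Normalise the reading to a 1 cm path, convert, then undo any dilution.
    return a260 / path_length_cm * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


# Hypothetical reading: A260 of 0.5 in a standard 1 cm cuvette.
print(dsdna_concentration(0.5))
```

Because the relationship is linear, halving the path length at the same concentration halves the absorbance, which is why short-path instruments normalise readings to 1 cm before converting.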
Museum collections of plant origin include herbaria (pressed plants), xylaria (woods), and economic botany (useful plant) specimens. They are not only places of history and display, but also of research, and contain rich repositories of molecules, including DNA. Such DNA, retrieved from historical or ancient tissue, carries unique degradation characteristics and regardless of its age is known as ancient DNA (aDNA). Research into aDNA has developed rapidly in the last decade as a result of an improved understanding of its biochemical properties, the development of specific laboratory protocols for its isolation, and better bioinformatic tools. Why are museum collections useful sources of aDNA? We identify three main reasons: 1) specimens can play a key role in taxonomic and macroevolutionary inference when it is difficult to sample living material, for example, by giving us snapshots of extinct taxa (Van de Paer et al. 2016); 2) accurate identification of specimens that were objects of debate or scientific mystery, as exemplified by misidentified type specimens of the watermelon’s progenitor (
However, extracting DNA does mean the destruction of a part of the specimen. Museum curators therefore face challenges in balancing the conservation of specimens for future research with the rising demand for aDNA analysis. Increasingly, curators are also considering legal and ethical issues in sampling (
With few exceptions, plant material found in museums originally grew on lands tended or owned by people for many millennia (
A first consideration is whether the plant species or artefacts (such as baskets or wooden objects) are of special significance (e.g., sacred) to the source community. Examples of sacred material include Banisteriopsis caapi, used to make ayahuasca in South America (
There are international conventions that usually apply when accessing, researching, and moving plant material between institutions and countries. Researchers must also be aware of country-specific laws that may require further permits and inspections, e.g., for plants that produce controlled substances, require phytosanitary checks, or are considered invasive species. Legal elements of the Convention on Biological Diversity (CBD), Nagoya Protocol, and Convention on International Trade in Endangered Species (CITES) are covered in Chapter 27 Legislation and policy as well as in other published works (e.g.
Botanical gardens hold living specimens and distribute seeds of these via seed lists (Index Seminum). Their global collections can be searched via PlantSearch, hosted by Botanic Gardens Conservation International. Gene banks hold seeds, and sometimes also tissue and living plants. While they originally focused on crop plants and their wild relatives, many have now broadened in scope to include wild plants, such as Royal Botanic Gardens Kew’s Millennium Seed Bank. Many gene bank collections can be searched via Genesys. Herbaria hold dried plant specimens and can be located via Index Herbariorum. Although many herbaria are incompletely recorded in databases, substantial data can already be found in the Global Biodiversity Information Facility (GBIF) (
There are a number of pitfalls when searching online catalogues. It may be necessary to search for accepted names and common synonyms: the same species may appear under different botanical names in a single collection, and accuracy of specimen identification varies. In general, herbarium specimens are the most reliable, as they bear diagnostic criteria such as flowers on which taxonomists rely. Garden material and seeds are often misidentified, become confused in labelling, or are hybridised during repeated cultivations. Their identifications should be confirmed, for example by growing on the seeds or by using morphological criteria (
Research projects will benefit enormously from a close collaboration between researcher and curator. Museums should be approached early during a project, with the researcher providing sufficient detail about its background, aims, methodology, and timetable. Museums are often under-staffed and persistence may be required in making contact. Curators’ expertise will be crucial in identifying the most appropriate specimens for analysis, not only in their institutions, but in others with which they are familiar. The curator will also play a key role in assessing the provenance of specimens, using museum archives, and the implications for any of the ethical and legal issues addressed above. Curators often have good links to source communities and can advise on appropriate procedures.
After preliminary discussions, the researcher will usually need to fill in a ‘destructive sampling’ form. This acts as a permanent record of the justification for sampling, and allows the museum to make a detailed check on the aims and methodology of the project (see for example, British Museum form and policies). Requests that have unclear research aims or which employ inappropriate methodologies are unlikely to be approved. Researchers will likely need to sign a Material Transfer Agreement (MTA) or Material Supply Agreement (MSA) with the museum which sets out their legal responsibilities.
Sampling may be carried out by the researcher or the curator. If feasible, it is worthwhile for the researcher to carry out the sampling, as it allows for the investigation of the context of the specimen and for flexibility in choosing the samples. It may also speed up the process of obtaining samples, especially if a large number is required. It also allows samples to be safely hand-carried to the researcher’s laboratory. Where materials must be sent, it is safest to use a courier service, with specimens marked “Scientific specimens of no commercial value”.
It should be agreed with the museum whether, after sampling, surplus material should be returned or securely retained. Museums can require that they are informed about results and that they check manuscripts before publication. This is in any case good practice to ensure accurate reporting of sample details. Museum policies on co-authorship vary, and this topic should be discussed early. Significant contribution by the curator on the choice of appropriate samples, provenance research, or in technically complex sampling, merits co-authorship. Unless agreed otherwise, DNA sequencing data should be submitted to NCBI GenBank or other public repositories, taking care to give the correct specimen identifier. At a minimum, the museum’s unique catalogue number (if one exists), and the name of the museum should be cited. This allows the DNA sequence data to be linked directly with the specimen or object. Other museum and laboratory information may be included with the DNA sequence data or in publications (e.g., the collector name, collection number, dates, locations, and laboratory extraction numbers). Additionally, most museum collections will require that vouchers are annotated in a way that links them to DNA sequencing data (see below). Some museums have also started to permanently store DNA isolates, and we encourage researchers to share their stocks on request. Integrated data management and accessibility of the raw data and results will ultimately bolster curatorial practices, develop a more ethical science, and safeguard collections for future generations (
Sampling decisions will be determined both by the research design and the nature of the specimens, in addition to the legal and ethical factors mentioned above. Changes to agreed sampling lists are often necessary once specimens have been examined, for example when they are lost, in poor condition, inadequately annotated or georeferenced, present in small quantities, or of rare taxa. Bulk raw material is usually easy to sample, while objects are usually not subjected to destructive sampling unless the results will inform the history and significance of the object. For herbarium specimens, preserving the morphological features, especially those that are diagnostic, for future research, is critical. Sampling should be targeted towards tissue types or organs at a given developmental state that are most numerous. For example, if there are many flowers and few leaves, it may be preferable to sample a petal. Or if there are few cauline and many rosette leaves, it may be preferable to sample a rosette leaf.
Different parts of a specimen may yield varying amounts, quality, and types of DNA. Wood, husks, and other tissues that were undergoing senescence at the time of preservation may yield less DNA. Young, immature leaves will have higher cell densities, and therefore are expected to yield more DNA. Seeds are often excellent sources of nuclear DNA, although the genotype of the seed will differ from the parent plant and might be of inconsistent ploidy. It may be necessary to extract DNA from individual seeds or to remove maternal tissue such as the testa. Some herbarium sheets will contain multiple individuals and, in most cases, it is better to sample individuals rather than mixed material. If individuals are pooled for DNA extraction, it may complicate downstream analyses that depend on individual genotypes.
The method of specimen preservation is another consideration for DNA isolation. Desiccation has been shown to preserve plant DNA remarkably well, while charring or ethanol preservation destroys plant DNA almost completely (
Before sampling begins, the specimen’s identifying data, such as its herbarium ID, should be recorded with great care, and double-checked on both the sample label and typed list of specimens. Additionally, the museum may require that vouchers are annotated with the sampling date, tissue type, sample identifier, and information about the researchers. The voucher, including any labels, should be photographed, ideally before and after sampling. Digital links between herbarium vouchers, imaging, and DNA sequences are very useful; they can be included in herbarium and nucleotide databases.
For desiccated leaves, the most commonly sampled tissue, the process is usually straightforward. Using forceps and a scalpel or scissors, one can make a precise cut and remove 1 cm² or less of tissue. Generally, between 2 and 10 mg of dry leaf tissue is sufficient for the isolation of complex mixtures of genomic DNA fragments. It is preferable to target leaves of lesser value, for example those that are damaged, folded, or hidden, while avoiding tissue possibly contaminated by mould, lichen, or fungi. The sampling of detached “pocket” material should be conducted with caution, and only if the researcher and curator are confident that the detached material truly belongs to the voucher. For other tissue types, such as wood, researchers may need to develop tailored sampling methods on contemporary material first. After sampling, material should immediately be sealed in a labelled tube or envelope and packaged for transport.
Potential contamination of the sample, specimen, or wider collection with exogenous DNA is an important consideration. For most museum collections, there will inevitably already be surface DNA contamination of specimens. Ask the curator about adhesives (e.g., wheat starch) and preservatives that were used with the specimen of interest. Curatorial staff and other users of the collections may not routinely wear gloves or, if they do, may not change them between specimens. In most cases, there is unlikely to be any benefit from the person undertaking sampling wearing protective equipment (e.g., face masks, hair nets) that is beyond that normally used by users of the collection. Contamination control is only as good as the weakest link.
Extra precautions may be taken for equipment that is used directly in the sampling process, for example, disposable scalpels that are changed between samples, or wiping of scalpel blades with bleach and ethanol. This will reduce the risk of cross-contamination between specimens. Further precautions may be beneficial if internal tissue is being sampled (e.g., inside a seed). In these cases, surface decontamination (see section below on pre-processing) followed by sampling with DNA-free equipment and while wearing personal protective equipment may be appropriate. In some cases where specialised equipment such as a microdrill is required, it may be beneficial for sampling to be undertaken within an ancient DNA laboratory, where contamination controls can be better implemented. However, bringing large amounts of plant material into the laboratory should be limited, as it is an additional contamination source.
Contamination of specimens and collections by ‘modern’ DNA and especially amplified DNA is perhaps the greatest risk, potentially compromising future research. Researchers are likely to have been using molecular laboratories, and steps should be taken to prevent the inadvertent transfer of modern DNA to museum collections. These precautions can include not visiting a collection directly from a modern laboratory, cleaning items that must move between modern laboratories and collections (e.g., clothes, phones, cameras), and using sampling equipment (scalpels, tubes, pens) that has not been taken from a modern laboratory.
Before starting any experiments with historical and ancient plant samples, it is important to recognize challenges arising from the degraded nature of aDNA. Unlike DNA isolated from fresh samples, DNA from preserved specimens is fragmented, damaged, and contaminated post mortem (
aDNA is also affected by “damage”, post mortem substitutions that convert cytosine to uracil residues through deamination (uracils are read by insensitive DNA polymerases as thymine, hence the commonly used term “C-to-T substitutions”) (
Finally, it is important to recognize that aDNA from plants is in fact a mixture of bona fide endogenous DNA, exogenous DNA introduced pre mortem (e.g., from endophytic microbes), and exogenous DNA introduced post mortem (e.g., from microbes involved in decomposition, human-associated collection and museum practices; see above) (
Examples of selected successfully isolated and sequenced DNA from plant material. *BP: before present.
Species | Tissue | Age (years BP*) | Endogenous DNA | Fragment length (bp) | Damage at 5’ end | Source |
---|---|---|---|---|---|---|
Thale cress (Arabidopsis thaliana) | Leaf | 184 | 83% | ~62 | 0.026 | |
Potato (Solanum tuberosum) | Leaf | 361 | 87% | ~45 | 0.047 | |
Maize (Zea mays) | Cobs | 1863 | 80% | ~52 | 0.052 | |
Wheat (Triticum durum) | Chaff | 3150 | 40% | ~53 | 0.095 | |
Barley (Hordeum vulgare) | Seeds | 4988 | 86% | ~49 | 0.138 | |
Given the characteristics of aDNA (
The isolation and pre-amplification manipulation of aDNA should be carried out in a dedicated laboratory that is physically separated from labs where post-amplification steps are carried out. Ideally the aDNA laboratory should be supplied with HEPA-filtered air under positive pressure. Users should not move from a ‘modern’ laboratory (where amplified DNA is handled) to the aDNA laboratory on the same day. Reagents and materials in an aDNA lab should be DNA-free, disposable where possible, and never taken out of the clean lab. Surfaces should be cleaned before and after every experiment with 3–10% bleach, 70% ethanol, and overnight UV-C irradiation. To minimise contamination and ensure a DNA-free laboratory environment, users should wear full body suits, foot protectors, slippers, facemasks, sleeves, and double gloves (
Material preparation is an essential step before DNA can be isolated. Optional pre-processing of dirty samples can be done by gently cleaning the surface with a very low concentration (~3%) of bleach, and rinsing twice with ddH2O (
DNA extraction methods for different tissues should be considered. While plant materials tend to contain inhibitory substances like polyphenols, proteins, and polysaccharides, ancient plant materials can additionally be rich in humic acids and salts. This set of macromolecules might prevent successful DNA amplification (
Here we will cover the basics of recovering the highest quality of DNA from ancient plant tissues. Using a two-day extraction protocol will greatly increase the recovery of endogenous DNA. The first day consists of grinding the plant material. Tissue can be disrupted by: grinding dry, grinding flash-frozen, or grinding material soaked in lysis buffer. In all cases, grinding to finer particles increases the recovery of aDNA. Ground tissue is incubated in a fresh lysis buffer. Three commonly used buffers include CTAB (
In contrast to primed amplification approaches, even low amounts of isolated DNA can be used for genomic library preparation (
To help decide which sampled material is most promising for further DNA analyses, it is necessary to obtain good estimates of fragmentation, damage, and contamination. This can be achieved by sequencing genomic libraries in low-throughput mode (about 10,000 DNA reads per sample), commonly referred to as “screening”, followed by bioinformatic analyses that produce relevant summary statistics. Promising samples will contain aDNA with a median fragment length over 50 bp and endogenous content over 0.2. For samples of particular interest, the mapping accuracy for short aDNA reads can be improved with specialised procedures (de Filippo et al. 2018), and endogenous content can be increased by targeted enrichment on hybridization arrays (
Characterising DNA fragmentation and damage is very useful for authentication and establishing historical provenance of degraded plant samples. DNA degradation advances with time (
Chapter 2 Infographic: Overview of sampling and obtaining DNA from museum collections. A team effort of communities, curators and researchers. (1) Collection of botanical material should involve detailed consideration of its ethical and legal aspects and consultation of source communities in advance, in accordance with the CITES, CBD and Nagoya legal and ethical frameworks. (2) Curated botanical samples can be found in different types of museums that include botanic gardens, ethnobotany and anthropological collections. The next step is to find relevant specimens, preferably with rich metadata, e.g. species identification, collection place and date. (3) Once the specimens have been identified, they should undergo molecular analyses in clean facilities, where they will be pre-processed according to their traits, avoiding contamination with other samples, “modern” specimens, and amplicons. Then, it is crucial to identify which samples passed and which failed quality controls for endogenous DNA. Finally, the data produced should be linked to their respective vouchers and made available in public repositories like NCBI and BOLD.
Library-based methods assist with the responsible use of collections, as they preserve the total (non-selective) DNA and ‘immortalise’ it for future use. Immortalisation only has value if the DNA that has been amplified is truly historical/ancient and devoid of contemporary contamination and hence all the aforementioned precautions are necessary when working with aDNA. We recommend that extracts or library builds are precisely annotated with the methods used and are properly archived.
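The screening criteria given earlier in this chapter (a median fragment length over 50 bp and endogenous content over 0.2) can be expressed as a simple summary routine. The sketch below is illustrative only. It assumes that endogenous content is estimated as the fraction of screened reads that map to the target reference genome and that the median length is taken over those mapped reads; the function name, input layout, and example reads are hypothetical.

```python
from statistics import median

# Thresholds quoted in the text for a "promising" sample.
MIN_MEDIAN_FRAGMENT_BP = 50
MIN_ENDOGENOUS_FRACTION = 0.2


def screen_sample(read_lengths, mapped_to_target):
    """Summarise one screened library.

    read_lengths     -- lengths (bp) of all sequenced reads
    mapped_to_target -- parallel list of booleans, True where the read
                        mapped to the target reference genome
    """
    if not read_lengths or len(read_lengths) != len(mapped_to_target):
        raise ValueError("need two non-empty lists of equal length")

    # Assumption: mapped reads stand in for endogenous DNA.
    endogenous = [n for n, hit in zip(read_lengths, mapped_to_target) if hit]
    endogenous_fraction = len(endogenous) / len(read_lengths)
    median_length = median(endogenous) if endogenous else 0

    promising = (median_length > MIN_MEDIAN_FRAGMENT_BP
                 and endogenous_fraction > MIN_ENDOGENOUS_FRACTION)
    return {
        "endogenous_fraction": endogenous_fraction,
        "median_fragment_bp": median_length,
        "promising": promising,
    }


# Hypothetical screening run with five reads, three of which map.
print(screen_sample([40, 60, 70, 55, 30], [True, True, True, False, False]))
```

In practice the inputs would come from an alignment file rather than hand-written lists, and the thresholds would be adjusted to the study's aims.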
aDNA – Ancient DNA, DNA that exhibits biochemical characteristics typical for DNA from old degraded material, i.e., damage and fragmentation, regardless of age.
Artefact – An object made by humans that is of historical or cultural importance, examples include: clothing, ornaments, utensils.
Authentication – Bioinformatic analyses that quantify damage and fragmentation of sequenced DNA to help rule out that DNA is derived from contemporary contamination.
Collection – Repository of curated biological material arranged in a systematic fashion.
Contamination – Introduction of alien tissue or DNA to a specimen or DNA isolate, examples include: microbial colonisation, human epithelium, plant-based foods, etc.
Curator – Custodian of a collection with expert knowledge about specimens, their organisation, and preservation.
Destructive sampling – Permanent removal of a fragment of a specimen of any size that will be irretrievable after biochemical characterization.
DNA damage – Typically conversion of cytosine to uracil in DNA through deamination, which accumulates with time. During sequencing, uracil is replaced with thymine, hence the common synonym, C-to-T substitutions.
Endogenous DNA – Authentic DNA from targeted individuals of a species, in contrast to exogenous DNA from associated microbes and contemporary plant and human DNA contamination.
Fragmentation – Breaks in the DNA backbone, most frequently caused by depurination, leading to shorter DNA fragments with time.
Immortalization – Molecular manipulation of DNA, for example the attachment of DNA adapters, that allows infinite re-amplification of the original DNA from a biological specimen.
Type specimen – Preserved individual plant that bears the defining features of a taxon and is used for the first taxonomic description of the species. This permanent feature-specimen link is recognized in a publication.
Voucher – Preserved botanical specimen kept in a permanent collection and cited by a research project. Vouchers will have been expertly identified and are usually annotated with collection time, place, and collector details.
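The DNA damage and Authentication entries above describe C-to-T substitutions as a signal that can be quantified. As a rough illustration, the sketch below counts, for each position from the 5’ end of a read, how often a reference cytosine is read as thymine. It assumes gap-free read-to-reference alignments given as plain strings in the read's 5’ to 3’ orientation; the function name and toy alignments are hypothetical, and dedicated authentication tools do considerably more.

```python
def c_to_t_frequency(alignments, n_positions=5):
    """Per-position C-to-T frequency from the 5' end of reads.

    alignments  -- iterable of (reference, read) string pairs of equal
                   length, gap-free, in the read's 5'->3' orientation
    n_positions -- number of leading positions to summarise

    Returns a list in which entry i is (C->T mismatches) / (reference C
    sites) at position i, or None when no reference C was observed.
    """
    c_sites = [0] * n_positions
    c_to_t = [0] * n_positions
    for reference, read in alignments:
        if len(reference) != len(read):
            raise ValueError("reference and read must have equal length")
        for i in range(min(n_positions, len(read))):
            if reference[i] == "C":
                c_sites[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [t / c if c else None for t, c in zip(c_to_t, c_sites)]


# Toy alignments (hypothetical): (reference, read)
toy = [
    ("CACGT", "TACGT"),
    ("CCGTA", "CTGTA"),
    ("CGGAT", "CGGAT"),
    ("ACCTG", "ACCTG"),
]
print(c_to_t_frequency(toy, n_positions=3))
```

An elevated frequency at the first positions that declines further into the read is the pattern typically expected of authentic degraded DNA, whereas contemporary contamination would show little such excess.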
The first studies conducted on DNA obtained from water samples were published in the 1990s. Cloning techniques were commonly used to investigate novel genes and functions of environmental communities at that time. Stein et al. (
Conventionally, biomonitoring of freshwater and marine environments is based on direct observation of indicator taxa to compute biotic metrics/indices. This can be time and labour intensive (
The main advantage of water is the ease of sample collection compared to other aquatic sample types such as sediments or biofilms, as these substrates usually require more sophisticated tools and longer sampling times (
Detecting DNA in water samples obtained from aquatic environments can be challenging because it is usually present at low concentrations with an uneven spatial distribution (
Natural processes influencing the composition and quantity of detectable DNA in a water sample can be categorised into 1) shedding of biological material from source organisms, 2) degradation, 3) transport across the water column, and 4) retention and resuspension (
Senescence in aquatic plants releases free cells into the water column that will eventually break down into organic compounds, including DNA. However, degradation in many cells begins via apoptosis before shedding. Apoptosis involves the shrinkage of the cell and its nucleus in a programmed way, in contrast to necrosis, which is uncontrolled cell death due to loss of osmotic control typically by swelling and bursting (
DNA is a highly stable molecule at neutral pH and moderate temperatures. However, there are several abiotic factors that directly and indirectly influence its stability in aquatic environments (
Hydrological characteristics of the water body are also critical to consider when inferring species presence and distribution. DNA can bind to particles of varying size in aquatic environments (less than 0.2 µm to greater than 180 µm) and this particle association is one of many parameters that affect DNA transport and diffusion (
Considering the higher dilution and the effects of currents and waves in marine waters, DNA is generally less concentrated and more quickly dispersed compared to freshwater ecosystems (
In rivers and streaming waters, the probability of DNA detection is strongly correlated with downstream transportation rates. Retention, rather than degradation, appears to be a more important factor that limits the transport of DNA in streaming waters (
Conventional sampling techniques often require a lot of time and effort for detecting indicator, rare, or invasive species. Keeping the target organism alive or intact might also be an important consideration in such cases. Detection of species via nucleic acids collected from environmental samples (eDNA/eRNA) is a relatively new approach that emerged in the last five years (Anglès d’Auriac et al. 2019). These methods offer a non-destructive and efficient complementary approach for the detection of aquatic organisms. They rely on reference sequences and the amount of available data varies among taxonomic groups and countries (Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding). For example, aquatic vascular plants used in biomonitoring are well represented in public databases (BOLD, GenBank), while this is hard to achieve for diatoms due to large proportions of undescribed species and the problems with cultivation of monoclonal cultures (
Although DNA from plant communities has been detected in environmental samples as part of larger surveys (e.g., within coral reefs), biodiversity studies targeting a large number of plant species are still rare, possibly owing to issues with universal amplification and discriminatory power of single or multiple gene surveys in plants (
An important application for DNA-based methods is the quantification of species abundance and biomass since there are several environmental applications that rely on this information. Depending on the specific aim of the study, this information can be obtained at varying degrees of efficiency and reliability. Approaches employing species-specific methods are more suitable for abundance or biomass estimations (e.g., qPCR, ddPCR). However, they require a priori knowledge of the target group and are limited to already described species. On the other hand, high-throughput approaches can identify species that are rare or have low biomass (e.g., metabarcoding, metagenomics), but they suffer from biases introduced by downstream steps such as PCR amplification, sequencing (Chapter 9 Sequencing platforms and data types), availability of reference sequences, and even the bioinformatics analyses (Chapter 18 Sequence to species) (
Although molecular methods for species detection have been used as a tool for biodiversity management for more than a decade, only 2% of the available studies have focused on plants (
Recent studies that detect plant species in aquatic ecosystems via eDNA are mainly about methodological adjustments (
There are three main steps in a field study for the collection of aqueous eDNA: water collection, transportation, and filtering. In designing sampling strategies for species identification from water samples, there are many factors to consider. These include, but are not limited to, the field conditions, the distance between sampling point and laboratory, the amount of water that is required, and the morphology and life cycle of the target organism (
After the selection of the sampling location, the next step is to decide on the transportation strategy. Water samples can either be directly transported to the laboratory or filtered in the field. If direct transportation is the chosen method, the samples are usually collected with sterilised glass or plastic bottles or disposable plastic tubes. After that, DNA in the water samples can be captured by filtration or ethanol precipitation in the laboratory. This method reduces the effort and time spent in the field, and it allows researchers to perform additional analyses on water samples or store subsamples for further processing (
Precipitation using ethanol or isopropanol can be used for capturing DNA after water collection, but filtration is the more widely used method (
Pore sizes of filters used in eDNA studies range from 0.22 µm to 60 µm (
The type of filter is one of the most important decisions to be made when designing the sampling strategy. Filters can be classified as open or encapsulated/cartridge filters (
Contamination of samples and the degradation of DNA are two critical processes that should be avoided as much as possible from water collection in the field to DNA isolation in the lab (
Choosing the correct DNA extraction protocol can be crucial in ensuring that the effect of PCR inhibitors in water samples will be minimised. The chemical and physical characteristics of samples can vary considerably, and therefore the quantity and purity of isolated DNA also vary (
Chapter 3 Infographic: Summary of steps from field collection of water samples to DNA extraction in the laboratory. (1) Open or closed (encapsulated/cartridge) filters can be used for filtering water samples on-site. Large filters (e.g., plankton net with 60 μm pore size) are preferred for filtering larger volumes of water, while small pore size filters can usually process a few litres. Closed filters offer the advantage of preventing contamination; they are therefore more commonly used for on-site filtration. (2) Degradation is another important issue that should be prevented until DNA extraction. Water or filter samples can either be preserved in a chemical buffer or transported in cold and dark conditions to the laboratory for further processing. (3) Plant DNA in water samples can be captured by filtration or precipitation. When using filtration, samples are usually incubated in a lysis solution to extract DNA, while in precipitation samples are mixed with ethanol and DNA is collected in the pellet. Commercial DNA isolation kits specifically designed for environmental sample types are commonly used with some small modifications.
DNA isolated from water samples can be used for several downstream applications based on the specific aim of the study or survey. Currently, qPCR is the most commonly used method for detecting specific target taxa in water samples, while metabarcoding is used for community analyses (Chapter 11 Amplicon metabarcoding). Studies comparing the efficiency of these DNA methods with more conventional methods show varying results. For some species or taxa, DNA-based detection methods appear to outperform more conventional methods (
Apoptosis – Controlled cell death which involves cell shrinkage, nuclear fragmentation, chromatin condensation, and chromosomal DNA fragmentation.
Biofilm – A consortium of microorganisms where cells stick to each other and often also to a surface.
Dimictic lake – A body of freshwater whose difference in temperature between surface and bottom layers becomes negligible twice per year.
Extracellular nucleases – Enzymes that can work outside of the cell and are capable of cleaving the phosphodiester bonds between nucleotides of nucleic acids.
Mesocosm – Any outdoor experimental system that simulates the natural environment under controlled conditions.
Necrosis – Uncontrolled cell death due to the loss of osmotic control typically by swelling and bursting.
PCR inhibitors – Any factor which prevents the amplification of nucleic acids through the polymerase chain reaction.
Primer – A short single-stranded nucleic acid sequence used by all living organisms in the initiation of DNA synthesis.
qPCR (Quantitative PCR) – An extension of the PCR technique which allows estimation of the initial quantity of nucleic acids in a biological sample.
Senescence – The gradual deterioration of functional characteristics with ageing (can be used both for organismal or cellular ageing).
Thermal stratification – The phenomenon in which lakes develop two discrete layers of water of different temperatures; warm on top (epilimnion) and cold below (hypolimnion).
Vector (i.e., cloning vectors) – A small piece of DNA that can be stably maintained in an organism that a foreign DNA fragment can be inserted into for cloning purposes.
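As an illustration of the qPCR entry above, initial quantities are commonly estimated against a standard curve: a dilution series with known copy numbers is amplified, the quantification cycle (Cq) is regressed on the log10 copy number, and unknown samples are read off the fitted line. The sketch below is a minimal example under stated assumptions; the function names, the dilution series, and the Cq values are hypothetical and are not taken from any study cited in this chapter.

```python
import math


def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Efficiency of 1.0 corresponds to a perfect doubling in every cycle."""
    return 10 ** (-1.0 / slope) - 1.0


def estimate_copies(cq, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10 ** ((cq - intercept) / slope)


# Hypothetical ten-fold dilution series with idealised Cq values
standards = [1e5, 1e4, 1e3, 1e2]
ideal_slope = -1.0 / math.log10(2)  # about -3.32 cycles per ten-fold dilution
cqs = [35.0 + ideal_slope * math.log10(c) for c in standards]

slope, intercept = fit_standard_curve(standards, cqs)
efficiency = amplification_efficiency(slope)
unknown = estimate_copies(cqs[2], slope, intercept)
```

With real data the fitted slope deviates from the ideal value, and the efficiency derived from it is one way to judge whether inhibitors are affecting the reaction.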
The natural presence of any plant entails the existence of a substrate where it can anchor itself and absorb nutrients for its development and survival (
Since the first isolation of DNA from soil bacteria, soil eDNA has gained attention for the assessment of terrestrial environments for several reasons: soil is virtually everywhere, it is easy to collect and transport, it harbors signals from above- and belowground biota including both active and dormant cells, and its collection is non-invasive (
Further, studies may also refer to bulk soil DNA when using soil samples to identify unknown communities, especially in forensic contexts (
Molecular (plant) identification using soil or sediment eDNA relies on isolating DNA traces from roots, debris, seeds, and pollen (
The state of DNA in the soil is subject to intrinsic and extrinsic DNA properties related to the origins of the DNA as well as factors influencing its decay (
iDNA persists due to protection from the cell wall and membranes against abiotic processes. Cells are more likely to remain intact in the ground if there is decreased enzymatic activity as a result of rapid soil desiccation, low temperatures, or extreme pH values (
Plant eDNA bound to soil particles can originate from multiple taxa and multiple vegetative parts, each one with particular mechanisms to bind, persist and degrade in soil substrates. Plant DNA persistence within soil allows us to harvest its botanical memory for identifying vegetation through time. Indeed, comparisons of plant identifications through both visual vegetation surveys and soil eDNA assessments have shed light on the temporal signals stored in top soils. In boreal areas, plant identification through soil eDNA signal mostly registered contemporary vegetation (
The flora and study area are key in any study to ensure sound conclusions. Below you will find considerations that can help you to answer common questions when designing field and wet lab experiments.
Soil sampling can be done by scooping out the soil, by pushing a tube (e.g., a 50 ml Falcon tube) into the ground, or with a soil core sampler. We recommend using sampling protocols specifically validated in an environment similar to your study site, e.g., woodlands, grasslands, meadows, or boreal, temperate, and tropical forests (
Obtaining clean DNA samples as well as avoiding cross contamination is challenging when sampling soil eDNA. Collection instruments should therefore be decontaminated between each sample (e.g., flaming, chlorine cleaning), gloves and masks should be worn and changed regularly to avoid introduction of DNA, and samples should be stored in separate plastic bags. In order to stop (or greatly reduce) enzymatic activity, samples should be stored cold or frozen, preferably at -20 °C, if immediate sample processing is not possible (
DNA extraction is a key bottleneck when capturing molecular data, and protocols need to be tailored to both the study area and the question(s). At a minimum, you need to decide which fraction of the total soil eDNA (iDNA or exDNA) you want to isolate to answer your research question. In general, isolating exDNA is preferred when targeting non-microorganisms and avoiding diversity patterns across short temporal scales (
If (meta)barcoding is used for identification, there are three desired features for a barcode in any study: sufficient polymorphism for identification at the desired taxonomic resolution, conserved primer binding sites for universal amplification, and available reference sequences for the target organism. In many cases, not all features can be met. You may therefore need to decide on which features are most important for your research question. For more general information about choosing suitable markers and available reference databases, see Chapter 10 DNA barcoding and Chapter 11 Amplicon metabarcoding. Soil eDNA studies targeting plants have used markers found in chloroplast DNA (trnL P6 loop, matK, rbcL) and in ribosomal DNA (ITS2;
Bioturbation – Biological processes involved in the dissemination of genetic material through terrestrial media.
DNA degradation – Refers to the physical changes of the DNA molecule.
DNA decay – Refers to the reduction in detectable quantity of eDNA.
DNA persistence – Refers to the amount of DNA that remains detectable across time.
DNA polymorphism – Presence of two or more variants of a particular DNA sequence.
Horizon – A layer parallel to the soil surface whose physical, chemical and biological characteristics differ from the layers above and beneath.
Power analysis – Estimation of statistical power, i.e., the probability of detecting an effect given that the effect is really there. Power can also be seen as the probability of rejecting the null hypothesis when it is in fact false.
Pedogenesis – The process of soil formation as regulated by the effects of place, environment, and history.
Rarefaction curves (in ecology) – A technique to assess species richness given the number of samples collected.
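To make the rarefaction entry above concrete, the following sketch computes a sample-based rarefaction curve analytically: for each number of samples n, it sums over taxa the probability that a taxon is detected in at least one of n samples drawn without replacement. This is one common formulation and is offered only as an illustration; the taxon names and incidence values are hypothetical.

```python
from math import comb


def expected_richness(incidence, total_samples, n):
    """Expected number of taxa detected in n of total_samples samples.

    incidence maps each taxon to the number of samples in which it was detected.
    """
    all_draws = comb(total_samples, n)
    richness = 0.0
    for k in incidence.values():
        # Probability that none of the n drawn samples contains this taxon
        p_absent = comb(total_samples - k, n) / all_draws
        richness += 1.0 - p_absent
    return richness


def rarefaction_curve(incidence, total_samples):
    """Expected richness for 1, 2, ..., total_samples samples."""
    return [
        expected_richness(incidence, total_samples, n)
        for n in range(1, total_samples + 1)
    ]


# Hypothetical soil eDNA detections across four samples
incidence = {"Betula": 4, "Salix": 2, "Vaccinium": 1}
curve = rarefaction_curve(incidence, total_samples=4)
```

A curve that is still rising steeply at the full sample size suggests that additional samples would probably reveal more taxa.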
To identify pollen, spores, and other plant-related microremains, the field of palynology has traditionally relied on microscope-based analyses. This is a time-consuming process that requires highly trained specialists. Additionally, pollen grains from many plant families are morphologically indistinguishable using light microscopy (
These challenges highlight the necessity for innovative methods within the field of palynology, to increase both the speed and accuracy of pollen identifications. DNA-based methods for the molecular identification of pollen grains have the potential to be of complementary value. However, the extraction of DNA from pollen is non-trivial. This chapter therefore focuses on how DNA can be extracted from pollen, the common problems encountered, and the qualitative and quantitative molecular possibilities for analyses.
Using pollen grain DNA for identification has shown promising results in a number of applications, including the study of provenance and authentication of honey (
Collecting pollen for DNA analysis is mostly similar to collecting pollen for microscopic analysis, though more care should be taken to avoid contamination from other potential sources of DNA. This is because pollen generally contains low quantities of DNA and is therefore prone to contamination. Pollen grains can either be collected directly from the environment (air, water, soil, etc.) or from pollinators (pollen baskets, honey). Pollen collected from the environment will most often (though not always) be derived from anemophilous (wind pollinated) plants, while pollinators collect the majority of pollen from so-called entomophilous (insect pollinated) plants. Pollinators may, however, also have anemophilous pollen accidentally sticking to their bodies. For studies looking at pollen from pollinators, either all pollen grains on the animal’s body are collected by washing off the pollen or, when present, only the corbicular pollen baskets are collected (
Pollen grains can be referred to as “natural plastic”: they have a very hard outer cell wall called an exine, which is made of sporopollenin (
Overview of selected studies since 2017 that have used molecular techniques to identify pollen, including the aim, strategy for pollen lysis, extraction method, number of PCR cycles, sequencing method, and marker choice.
| Study | Aim | Pollen lysis step | Extraction method | PCR cycles | Sequencing method | Markers |
|---|---|---|---|---|---|---|
| | Airborne pollen identification | Bead beating (one 5 mm stainless steel bead), two 1-min cycles at 30 Hz | DNeasy Plant Mini Kit (Qiagen) and Nucleomag kit (Macherey–Nagel) | 30 | Sanger sequencing | trnL |
| | Pollen quantification | Bead beating (mix of 0.5 and 1 mm silica beads), 2 min | Wizard (Promega) | N/A | Genome skimming | N/A |
| | Pollen quantification | Bead beating (mini-bead beater), 3 min | FastDNA SPIN Kit for Soil (MP Biomedicals) | 30 | Metabarcoding | nrITS2, rbcL |
| | Pollen quantification | Bead beating (five 1 mm stainless steel beads), 2 min at 22.5 Hz | Adapted CTAB | N/A | Genome skimming | N/A |
| | Plant pollinator interactions over time | Bead beating (one 3 mm stainless steel bead + lysis buffer), 2 min at 25 Hz | QIAamp DNA Micro Kit and DNeasy Plant Mini Kit (Qiagen), Nucleospin DNA Trace Kit (Macherey–Nagel) | 30 | Metabarcoding | nrITS1, nrITS2, rbcL |
| | Airborne pollen identification | Bead beating (3 mm tungsten beads), 4 min at 30 Hz | DNeasy Plant Mini Kit (Qiagen) | 35 | Metabarcoding | nrITS2, rbcL |
| | Bee pollen diet | Bead beating (3.355 mg 0.7 mm zirconia beads), 5 min | DNeasy Plant Mini Kit (Qiagen) | Three steps (55 cycles in total) | Metabarcoding | nrITS2, rbcL, trnL, trnH |
| | Insect migration analysis | Bead beating (five zirconium beads), 1 min at 30 Hz | No extraction, using Phire Plant Direct Polymerase | Two steps (32 cycles in total) | Metabarcoding | nrITS2 |
| | Pollen quantification | CF lysis buffer (Nucleospin Food Kit) | DNeasy Plant Mini Kit (Qiagen) | 25, 30, 35 | Metabarcoding | nrITS1, trnL |
| | Airborne pollen identification | Bead beating (0.2 g 425–600 μm glass beads + lysis buffer), two 1-min cycles (3450 oscillations/min) | Adapted CTAB | 40 | Metabarcoding | rbcL |
| | Bee pollen diet | Bead beating (150 g mix of 1.4 mm ceramic and 3 mm tungsten beads + lysis buffer), two 45-second cycles at 6.5 m/s | DNeasy Plant Mini Kit (Qiagen) | 37 | Metabarcoding | nrITS2 |
It should be noted that other methods for DNA extraction from pollen exist in which the pollen grains are not destroyed, and in some specific cases, excluding the bead-beating step has even given better results (
Several commercially available DNA extraction protocols have been used for DNA extraction from pollen grains after the lysis step. Table
The quality of DNA that can be extracted from pollen samples is critical for any molecularly-based identification method, and particularly when working with very small amounts of DNA. Therefore, avoiding contamination is critical and it is essential to work in a clean lab, to keep windows closed, use sterilised tools in a laminar flow cabinet, and to keep the DNA extraction lab separate from the post-PCR environment.
Molecular methods can contribute to the analysis of pollen both by identifying which species are present (qualitative) as well as by giving a measure of the abundance of different pollen species (quantification). While DNA metabarcoding methods are currently most often used (Table
Species-resolution in pollen grain identifications is critical for studies that try to answer specific research questions including: what particular species of flower does a common carder bee prefer? What grass species is responsible for most of the pollen in the ambient air in early May? Species-specific markers and qPCR techniques can be used for the identification of specific species within a mixture of different pollen types (see Chapter 10 DNA barcoding). One study used custom-made primers for the nuclear Internal Transcribed Spacer (nrITS) to differentiate between mugwort (Artemisia vulgaris) and ragweed (Ambrosia artemisiifolia), two notoriously allergenic species from the Asteraceae family (
DNA barcoding can be used to target specific species, yet it is rare that a pollen sample contains only a single pollen species. DNA metabarcoding is therefore the most-often used method for the molecular identification of the different species of pollen grains from mixed samples (see Chapter 11 Amplicon metabarcoding). Both nuclear and chloroplast DNA can be amplified in pollen DNA (
Chapter 5 Infographic: Overview of pollen sources, DNA extraction, and downstream analytical methods for the molecular identification of plants from pollen DNA.
While research into targeting different barcoding regions and primers is ongoing (trnT-F;
It is important to use positive controls with known concentrations of different pollen species in any DNA metabarcoding study. This is because the amount of DNA that can be extracted from different pollen types has been shown to vary. For example, it can be easier to extract DNA from pollen with a thinner exine and from plant species that are richer in chloroplast DNA than from those having a more ‘sturdy’ exine (
Beyond identifying which pollen species are present in a particular sample, pollen grain quantification is equally important. For example, for hay fever forecasts, it is important to know not only whether certain allergenic pollen types are in the air, but also how many pollen grains there are at a given point in time. The gold standard in palynology has been to count a certain number of pollen grains under the microscope (e.g., 200 to 500) to obtain a semi-quantitative measure of the pollen types in a sample. While DNA-based methods for pollen quantification are less developed than DNA-based methods for identification, DNA-based pollen quantification using metagenomics (reviewed in Chapter 12 Metagenomics) seems feasible, while there is still strong debate about using DNA metabarcoding reads for this purpose.
In a recent study on the use of DNA to quantify pollen grains, Bell and colleagues found a very weak correlation between pollen counts recorded by palynologists and the proportion of metabarcoding reads (
Another group of scholars, however, are finding more promising results in using DNA metabarcoding to quantify pollen grains.
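The kind of comparison at the heart of this debate can be sketched in a few lines: microscope counts and metabarcoding reads are each converted to proportions per taxon, and the agreement between the two is summarised with a correlation coefficient. The example below uses a Spearman rank correlation purely for illustration; the statistics used in the studies referred to above are not specified here, and all taxa and numbers are hypothetical.

```python
def proportions(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data for one mixed pollen sample
grain_counts = {"Poaceae": 120, "Betula": 60, "Artemisia": 15, "Pinus": 5}
read_counts = {"Poaceae": 4000, "Betula": 5000, "Artemisia": 900, "Pinus": 100}

taxa = sorted(grain_counts)
count_props = proportions(grain_counts)
read_props = proportions(read_counts)
rho = spearman([count_props[t] for t in taxa], [read_props[t] for t in taxa])
```

In this invented sample the rank order of Poaceae and Betula is swapped between the two methods, which lowers the correlation even though both methods agree on which taxa are rare.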
Since using DNA metabarcoding approaches for pollen abundance may not give quantitative results with complex, multi-species samples, other molecular methods such as genome skimming and shotgun sequencing are being used to circumvent some of the drawbacks. The major advantage of these two methods is that they do not include a PCR-step and therefore do not introduce amplification bias (see Chapter 12 Metagenomics). Genome skimming has already been used to show that quantification is feasible, even for pollen from species that are very rare in mock mixtures (
Anemophilous – Wind-pollinated.
Bead beating – The application of beads to break open the outer cell wall of pollen grains.
Hirst-type pollen trap – Volumetric air sampler that is one of the standard devices for monitoring airborne pollen and spores.
cpDNA – Chloroplast DNA.
Entomophilous – Insect-pollinated.
Exine – Outer wall of pollen grains. Composed mainly of sporopollenin that is extremely resistant to degradation. The exine of pollen grains has to be broken to release the DNA from the organic material within the grains.
Palynology – The science that studies both living and fossil spores, pollen grains, and other microscopic structures (e.g., chironomids, dinocysts, acritarchs, chitinozoans, scolecodonts).
Pollen grains – The male gametophyte of seed plants; source and carrier for the male gametes (spermatozoids or sperm cells).
Pollenkitt – The outermost hydrophobic lipid layer mostly present on entomophilous pollen grains.
Sporopollenin – A chemically inert biological polymer that is a component of the outer wall (see Exine) of a pollen grain.
Super-resolution microscopy – Technique in optical microscopy that allows visualisation of images with resolutions up to 140 nm, much higher than those imposed by the diffraction limit. This technique also allows visualisation of internal structures.
DNA-based methods for the molecular identification of plant products can help us to address food and medicine authenticity issues at each stage in the supply chain (Di Bernardo et al. 2007). Documentation and requirements for DNA-based detection methods for food authentication are defined in collaborative activities by the European Committee for Standardization (CEN) and the International Organization for Standardization (ISO). Both rapid and accurate identification of plant products are crucial for the herbal drug industry (
The majority of standardised DNA-based authentication methods for the inspection and regulation of food and plant-medicines use well-established PCR-based techniques for DNA amplification as these are sensitive, specific, and simple (
DNA barcoding methods are also established for the identification of unique medicinal and edible plant species (
High-throughput sequencing (HTS) methods such as amplicon metabarcoding are also powerful tools for the authentication of herbal end products, post-marketing control, pharmacovigilance, and the assessment of species composition in botanical medicines, such as in traditional Chinese medicines (TCMs) (
In addition to PCR-based techniques, the detection of single nucleotide polymorphisms (SNPs) is frequently used for the molecular identification and authentication of various food commodities using small DNA fragments (Di Bernardo et al. 2007;
DNA-based methods for molecular plant identification depend on well-curated nucleotide sequence repositories. In addition to GenBank (
Successful DNA extraction is the foundation for any further downstream analysis (
Four main factors that affect the efficacy of DNA isolation from food and medicine samples are the sample source and processing, collection and storage, homogenisation, and the presence of contaminants. Generally it is easier to extract high-quality DNA from fresh samples (
Removal of frequent contaminants that can reduce the yield of extracted DNA from edible and medicinal plants.
Proteins and RNA | |||
---|---|---|---|
What compounds define the chemical composition of your samples? | Polysaccharides (starch, sugars) | Polyphenolics | |
RNA | (plant secondary metabolites like: tannins, flavonoids, terpenoids, etc.) | ||
Understand the specific properties of your samples for DNA extraction | Can co-purify with DNA | Can co-precipitate with DNA | When bound to DNA very hard to remove in extraction |
depending on the age of the samples and how they were conserved | Results in a sticky, viscous DNA pellet after centrifugation | ||
Inhibition of enzymes used for molecular techniques (restriction endonucleases, polymerases, and ligases ( |
Results in contaminated pellets not usable for many downstream analyses ( |
||
Adherence to wells in agarose gel resulting in long smears of bands detected in gel ( |
|||
Consider applying mitigation strategies to overcome difficulties in extracting DNA from your samples | RNA removable with DNase-free RNase A or ethanol precipitation using lithium chloride | Removal via highly concentrated sodium chloride (NaCl) in extraction buffers leading to increased solubility in ethanol | Binder compounds polyvinylpyrrolidone (PVP) or polyvinylpolypyrrolidone (PVPP) can be used in extraction buffers to absorb polyphenols before polymerization with DNA |
Proteins can be removed by i) inclusion of detergents (cetyltrimethylammonium bromide (CTAB), SDS) in extraction buffer | |||
Combination of NaCl and cationic detergent CTAB | |||
CTAB with differential precipitation ( |
|||
Use of antioxidant compounds (BME, DTT, ascorbic acid, iso-ascorbate) in buffer to prevent polymerization ( |
|||
ii) protein denaturants e.g., β-mercaptoethanol (BME), dithiothreitol (DTT) | |||
iii) enzymatic proteases e.g., proteinase K |
Although CTAB-based methods usually result in DNA extraction from plants and processed food and medicine products, the quantity is often quite low and the protocols are time consuming (
Several studies exist that compare commercial and in-house DNA isolation techniques for food and medicine (
Overview of different DNA extraction methods recommended for use with food by the European Union Reference Laboratory for GM Food and Feed (EU-RL GMFF).
Plant source | Method of choice | Reference |
---|---|---|
Maize | CTAB precipitate (in-house) (Rogers and Bendich 1985) | CRLVL16/05XP corrected version 2 01/03/2018 |
Maize seeds and grains | For isolation of genomic DNA from a wide variety of maize tissues and derived matrices for high-quality genomic DNA from processed plant tissue (e.g., leaf, grain, or seed). | |
Lysis step (thermal lysis in the presence of Tris HCl, EDTA, CTAB, and β-mercaptoethanol). | ||
Tissues processed prior to extraction procedure. Possible methods of processing include a mortar and pestle with liquid nitrogen (leaf) or commercial blender (grain or seed). | ||
Soybean | CTAB precipitate (in-house) (Dellaporta et al. 1983) | CRLVL13/05XP 14/05/2007 |
Soybean seeds | “Dellaporta-derived” method starts with a lysis step (thermal lysis in the presence of Tris HCl, EDTA, NaCl, and β-mercaptoethanol). | |
Isopropanol precipitation and removal of contaminants such as lipophilic molecules and proteins by extraction with phenol:chloroform:isoamyl alcohol. | ||
Potato | “CTAB/Microspin” method | CRLVL09/05XP Corrected Version 1 20/01/2009 |
Freeze-dried potato tubers | Lysis step (thermal lysis in the presence of CTAB, EDTA, and proteinase K). | |
Removal of RNA by digestion with RNase A and removal of contaminants such as lipophilic molecules and proteins by extraction with chloroform. | ||
Remaining inhibitors are removed by a gel filtration step using the commercially available product S-300 HR Microspin Columns (Amersham Pharmacia). | ||
Rapeseed | CTAB precipitate (in-house) (Dellaporta et al. 1983) | CRLVL14/04XP Corrected Version 1 15/01/2007 |
Lysis step (thermal lysis in the presence of Tris HCl, EDTA, SDS, and β-mercaptoethanol). | ||
Removal of contaminants such as lipophilic molecules and proteins by extraction with phenol and chloroform. | ||
DNA precipitate is generated by using isopropanol. The pellet is dissolved in TE buffer. | ||
Rapeseed | Inhibitors are removed by an anion exchange chromatography step using the DNA Clean & Concentrator 25 kit (Zymo Research). | CRLVL14/04XP Corrected Version 1 15/01/2007 |
Multi-herbal products | CTAB precipitate (in-house) (Murray and Thompson 1980) | Arulandhu et al. 2017 |
Technique is ideal for the rapid isolation of small amounts of DNA from many different species and is also useful for large scale isolations. | ||
Lysis step (thermal lysis in the presence of Tris HCl, EDTA, CTAB, and β-mercaptoethanol). | ||
Removal of contaminants such as lipophilic molecules and proteins by extraction with phenol and chloroform. | ||
Samples processed prior to extraction procedure (mortar and pestle, liquid nitrogen, or commercial blender). |
After DNA extraction, measuring both the DNA concentration and purity is important before continuing with further downstream analysis. Isolated DNA can be tested for quality using absorbance methods, agarose gel electrophoresis, and fluorescent DNA-intercalating dyes (
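For the absorbance-based checks mentioned above, concentration and purity are usually derived from readings at 230, 260, and 280 nm. The sketch below assumes the widely used conversion of one A260 unit to roughly 50 ng/µl of double-stranded DNA at a 1 cm path length; the threshold values used for flagging are rule-of-thumb assumptions rather than values taken from this chapter, and the function names and readings are hypothetical.

```python
# Assumed conversion: 1 A260 unit ~ 50 ng/ul double-stranded DNA (1 cm path length)
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dna_concentration(a260, dilution_factor=1.0):
    """Estimated dsDNA concentration in ng/ul from an A260 reading."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_ratios(a230, a260, a280):
    """Absorbance ratios commonly used as purity indicators."""
    return {"A260/A280": a260 / a280, "A260/A230": a260 / a230}


def flag_purity(ratios, min_260_280=1.7, min_260_230=1.8):
    """Return warnings for ratios below the assumed rule-of-thumb thresholds."""
    flags = []
    if ratios["A260/A280"] < min_260_280:
        flags.append("possible protein or phenol carry-over")
    if ratios["A260/A230"] < min_260_230:
        flags.append("possible polysaccharide, polyphenol, or salt carry-over")
    return flags


# Hypothetical readings for a ten-fold diluted herbal product extract
concentration = dna_concentration(0.5, dilution_factor=10)
ratios = purity_ratios(a230=0.4, a260=0.5, a280=0.27)
warnings = flag_purity(ratios)
```

Absorbance cannot distinguish DNA from RNA or free nucleotides, which is one reason it is typically combined with gel electrophoresis or dye-based quantification.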
It is in the interest of both biodiversity conservation and public safety that DNA-based techniques are further developed to screen food and medicine sourced from the global market (
Bioprospecting – The exploration of biodiversity for new resources of social and commercial value.
Pharmacophylogenomics – Plant pharmacophylogenomics is a field established by combining the fields of ethnopharmacology, plant systematics, phytochemistry, pharmacology, and bioinformatics. It is the application of phylogenomics to the study of pharmaceuticals.
Pharmacopoeia – From the obsolete typography pharmacopœia, literally, “drug-making”. In its modern technical sense, it is a book containing directions for the identification of compound medicines, and is published by the authority of a government or a medical or pharmaceutical society.
Pharmaphylogenetics – Field of research focusing on the phylogenetic correlation between phylogeny, chemical constituents, and pharmaceutical effects of medicinal plants.
Did you know that faeces are windows to the natural world? Faeces, although not the most glamorous thing in the world, are worth their weight in gold when it comes to providing information about the host(s) they are derived from. Faeces, also commonly known as scat, poop, droppings, excreta, or stools, are the solid remains of ingested food that was not digested in the intestine. They are composed of water, protein, polysaccharides, fats, solids (e.g., fibres from plants), and bacteria (
Different types of information can be obtained from faeces. Chemical analyses provide information on hormonal changes that can occur from stress (
Non-molecular methods have traditionally been used for the analysis of contents from faecal samples. An example is microhistology, where small amounts of faecal samples are mounted on a microscope slide, and digested remains of plant cuticle fragments are identified based on morphology (
In plant molecular applications, a common use of faecal samples is in herbivore/omnivore diet studies. The goal of most plant-focused diet studies is to characterise the diet profile of the host, which can be used to answer research questions concerning, for example, resource competition and partitioning (
The main advantage of using faecal samples for molecular plant identification as compared to other types of samples such as whole animals/insects (
One limitation when using faecal samples for molecular plant identification is that it can be difficult to obtain fresh faecal samples collected immediately after defecation, especially when working with wild animals. Age of samples can have an impact on the amount and quality of DNA that can be extracted due to DNA degradation caused by exposure to environmental conditions (
Advantages and limitations of using DNA from faeces to reconstruct plant communities.
| Advantages | Limitations |
|---|---|
| Non-invasive | Fresh samples may be challenging to obtain from wild animals |
| Easy to detect and collect | Presence of PCR inhibitors |
| Not considered rare | DNA degradation |
| Does not require capturing or locating animal of interest | Hard to distinguish morphologically from those of closely related species |
| | Additional molecular work needed |
| | Increased cost and time |
Before designing any sampling strategies for the collection of faecal samples, there are at least six factors that researchers must take into consideration:
Based on the research question(s) and objectives (i.e., quantitative, presence/absence, composition), researchers must decide how many samples and replicates are needed from each individual and/or population to sufficiently meet their research objectives. The choice of downstream molecular methods used for reconstructing herbivore/omnivore diet will also have an impact on how many samples are required. In quantitative studies where the objective is to quantify the ingested biomass, the number of different individuals sampled is not as important as in composition studies, where more individuals are required to obtain a better overview of the dietary range of the studied species. This is due to the effect of individual food preference, which can lead to biases in retrieving the whole range of a dietary profile for a given species if only a few individuals are studied (
Generally, the more ecological information gathered and incorporated into sampling strategies, the higher the chance of successful faecal collection. For wild species, prior ecological information regarding the species of interest is essential for designing sound sampling strategies, to optimise and streamline sample collection. Researchers can use the following questions as a guide in planning their sample collection strategy:
Without this information, it is challenging to narrow down specific study sites for field collection. Additionally, such information can reduce the necessary man-power, resources, and time spent in the field while increasing the probability of finding sufficient numbers of faecal samples. Knowledge of habitat range and population density can prevent excessive amounts of samples collected from a single individual when the research question requires samples from multiple individuals. Differences in home-range and diet between seasons can also impact sample collection strategy (
Sampling strategies are also heavily dependent on budget constraints, which may reduce the time spent on sample collection, the number of samples processed, and also the molecular techniques used in analysing the faecal samples. Therefore, it is prudent to ensure that the budget fits the research objectives, or that the research objectives are tailored to fit the budget. While there are many different approaches to sampling, two commonly used approaches are systematic sampling and opportunistic sampling. In systematic sampling, the study area is divided into grids or transects, and samples are taken at each grid point or at fixed intervals (
Finally, ethical consideration of minimising distress to studied animals is one of the main concerns in animal studies and there are legal restrictions as implemented in the EU Directive 2010/63/EU on the protection of animals used for scientific purposes (
Once the sampling strategy has been determined, sampling in the field can start. The first step is to locate the faecal samples; once they have been located, collection can begin. A few materials are needed no matter what animal and habitat the faecal samples are derived from: sterile tubes filled with, e.g., RNAlater™, silica beads, or 90% ethanol; gloves; and a device to collect the samples. Sterile tubes are necessary for sample storage. Tubes can have either removable screw-lids or hinged lids. Removable screw-lids have the advantage that they will not come off during transport. However, there is an increased risk of environmental contamination with these lids, since they are separate from the tube and must be placed somewhere during collection. Tubes with hinged lids are easier to work with in that sense, though they can open during transport if not sealed (e.g., with parafilm™). Proper use of gloves and a collection device is also important to limit the risk of the collector becoming sick from directly handling faeces, as well as to reduce the risk of sample contamination. The size and type of the sampling device depend on the size of the faecal dropping and can range from a toothpick to a large spoon.
DNA-based diet analyses are very sensitive to contamination, and the trace amounts of digested plant material that can be extracted from faecal samples are easily contaminated. Contamination can occur between samples, by plant DNA from the surrounding environment, or even from the collector’s (plant-based) lunch (
To avoid DNA degradation, faecal samples should be preserved as soon as possible upon collection and stored under the same conditions (
To avoid contamination, extractions should be carried out in a room free of PCR amplified DNA. Due to the risk of zoonotic disease transmission, extraction should ideally be carried out in a flow-hood to avoid inhaling dust from dry faeces (
Faecal samples from plant-eating animals usually contain high levels of PCR inhibitors such as humic acid, which can lead to amplification failure during downstream analysis (
Depending on the research question(s), several different HTS methods can be used for analysing DNA extracted from faecal samples including metabarcoding (
Coprolites – Fossilised faeces.
Near-infrared spectroscopy (NIRS) – A non-destructive and fast technique utilising the near-infrared region of the electromagnetic spectrum.
RNAlater – Non-toxic aqueous reagent for storage purposes, preserving RNA and DNA.
Stable isotopes – Non-radioactive forms of elements.
Zoonotic disease – Infectious disease caused by pathogens jumping from non-human hosts to humans.
Sedimentary ancient DNA studies aim to reconstruct the biology and ecology of past environments using the DNA present in the sediment record. Compared to modern soil and sedimentary DNA (see Chapter 4 DNA from soil), these analyses can be more challenging due to the prolonged exposure of the DNA to degradation processes. This has major implications for the scope of the study and the appropriate study design, which will be discussed in this chapter.
In order to use sedimentary ancient DNA for paleoecological studies (sedaDNA;
Ancient DNA is the hereditary genetic content of cells from organisms that died a long time ago. There is no consensus on how old DNA should be in order to be called ancient, as the age is generally less important than the exposure to degradation processes that make it more degraded than modern DNA. SedaDNA degradation processes are primarily related to environmental and sedimentary properties, such as temperature, pH, water content, oxygen levels, and minerals present in the sediment (
How does DNA end up in the sediment? Sediment is a result of erosion, weathering and biological processes and consists of organic and inorganic particles (e.g., sand and silt) that are transported by wind, water, or people (
Once exposed to the sedimentary environment, exDNA can undergo different post-depositional taphonomic processes that determine the quality of the DNA on longer timescales. ExDNA can be internalised by microbial cells (
Schematic overview of DNA degradation processes (hydrolysis, oxidation, alkylation and Maillard reaction) that can cause DNA damage in the form of cleavage, base modifications or cross-links. The major mechanism leading to miscoding lesions in aDNA is the hydrolysis of cytosine to uracil, which leads to G to A and C to T substitutions by DNA polymerases, whereas blocking lesions can obstruct the movement of DNA polymerases during PCR (
By analysing the ancient DNA present in the sediment (
Macrofossils and plant sedaDNA originate close to the sample location and give a similar local signal (
In general, palaeovegetation data are the result of the attributes of the original vegetation, combined with depositional factors and preservation, as well as the experimental procedures to produce the data. For sedaDNA analyses, this includes every step of the data generation itself: sampling, transport, storage, processing of the DNA in the laboratory, and finally, the bioinformatic pipelines used. In terms of the data generation, pollen analyses and macrofossil analyses rely on taxonomic identification by microscopy, which is labour-intensive and requires a high level of taxonomic knowledge. Although some training is needed to work in an ancient DNA laboratory, in principle, taxonomic identification by DNA can be carried out without prior taxonomic knowledge. However, familiarity with plant taxonomy, phylogenetic placement, and biology of different groups is invaluable in the interpretation of the automated identifications. For example, it is important to check if the automated DNA identifications make sense for the sample location, because contamination, DNA degradation, and the quality of the reference library can cause false DNA identifications (see Chapter 18 Sequence to species for details).
A combination of sedaDNA, macrofossils, and pollen proxies gives the most complete overview of plant diversity and community composition through time. The choice for these proxies is dependent on the aims of the study. Table
Comparison of pollen, plant macrofossils, and sedaDNA as proxies for palaeoecological reconstructions on the levels of: source and sediment, data generation, and data interpretation. Sources:
Category | Pollen | Plant macrofossils | SedaDNA |
---|---|---|---|
Source and sediment | |||
- Scale | Regional | Local | Local |
- Taxonomic groups | Pollen-producers | All plants | All organisms |
- Potential sources of bias | High pollen-producing plants; vegetation cover close to sampling area; differential preservation | Differential preservation of tissue-types and species | Differential DNA degradation and decay |
Data generation | |||
- Labour-intensive | Yes | Yes | No |
- Need for taxonomic knowledge | Yes | Yes | No |
- Taxonomic resolution | Limited to identifiable pollen types, generally to genus level | Generally to species-level | Depends on the marker, possible to species-level |
- Potential sources of bias | Identifiability of the remains | Identifiability of the remains; random occurrence | DNA contamination; choice of lab techniques; completeness of reference library |
Data interpretation | |||
- Qualitative | Yes | Yes | Yes |
- Quantitative | Partial | Limited | Debated |
The first study using sedaDNA of macroorganisms was published in 2003, demonstrating the possibility to detect plant and animal DNA in both permafrost sediments and temperate cave sediments (
Environmental reconstructions can range from polar, to temperate and tropical regions, although they are limited to sampling sites that allow preservation of sedaDNA, such as permafrost, lake sediments, and dry cave sediments. Permafrost sediment can be used to assess vegetational development in polar regions under climate change (e.g.,
Due to its low concentration, retrieving ancient DNA from sediment samples requires strict protocols to avoid contamination by modern DNA or further degradation (
The previous section described some sedaDNA studies focusing on palaeoecological and archaeological questions. In both cases, choices of location and methods are very much steered by the research focus and what is already known about the area, such as past changes in climate, geology, ecology, or human impacts. Although details in the study design can differ, all sedaDNA studies follow the same steps: site selection, collection of samples and metadata, DNA extraction, further processing of the DNA in the lab, sequencing, and finally, bioinformatic sequence quality filtering and data analyses (Figure
Simplified overview of the sedaDNA research process, including some of the major challenges and potential solutions indicated at each step.
Choices for the different options at each step depend on the aims of the study. For example, when performing a reconstruction of overall plant community dynamics with universal plant metabarcoding primers, the most common taxa and major trends in community change will be reliably retrieved in the first PCR performed (
The aims of the study define the temporal and spatial scale needed to achieve them, thereby steering the selection of relevant sampling sites. Lake sediments provide a record of the plants that occurred in the lake catchment, being the area of land from which water and surface runoff drains into the lake (
General conditions under which sedaDNA preserves well are: cold and stable temperatures, neutral pH, dry or anoxic sediments with a high mineral content. Sediments from rockshelters, dry caves, and lake sediments are generally preferred as they are protected and provide stable conditions: rockshelter and dry cave sediments are sheltered from rain and have stable temperatures and there is some evidence that calcite has a high adsorption capacity for DNA (
Dating is important in any study that involves ancient samples. Only with accurate dating can the timing of events be compared and their rates of change estimated. Commonly applied sediment dating methods are radioisotopic dating (in particular 210Pb, 14C, and luminescence dating) and dating based on chemostratigraphy or marker minerals (in particular tephrochronology), and the choice for a method depends on the type and age of the sediments (see Table
Summary of sediment dating methods, their applicability and limitations. Sources:
Dating method | Suitable sample types | Age limit | Sources of error and uncertainty |
---|---|---|---|
210Pb dating | Materials from aquatic environments such as lacustrine and marine deposits | ~100 to 150 years | Complex sedimentation processes that break the dating model assumptions, such as compaction, local mixing, erosion etc. |
14C (radiocarbon) dating | Organic remains (charcoal, wood, animal tissue), carbonates (corals, sediments, stalagmites and stalactites), water, air and organic matter from various sediments, soil, paleosol and peat deposits | Up to 50,000 years | Atmospheric 14C content fluctuation due to changes in cosmogenic production rate and exchange between the atmosphere and ocean |
Luminescence dating: thermoluminescence (TL) | Materials containing crystalline minerals, such as sediments, lava, clay, and ceramics | A few years to over 1,000,000 years | Variations in environmental radiation dose; saturation of electron traps in sample minerals |
Luminescence dating: optically stimulated luminescence (OSL) | Materials containing quartz or potassium feldspar sand-sized grains, or fine-grained mineral deposits | A few decades to ~150,000 years for quartz | Variations in environmental radiation dose; saturation of electron traps in sample minerals |
Tephrochronology | Terrestrial and lake sediments, marine deposits and ice cores that contain tephra | Up to 35,000 years, extendable under good conditions | Can only obtain indirect dates within the 14C age range |
Radioisotopic dating is based on the principle of radioactive decay. When a nucleus breaks down, it emits energy and forms a daughter product. The time this takes is expressed as the half-life, i.e., the time that it takes for 50% of a parent element to transmute into the daughter product. The relative quantity of a radioactive parent element in a sample can be used to infer its age. Relatively young aquatic sediments, with ages up to 150 years are commonly dated with 210Pb (half-life: 22.27 years;
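The half-life relationship described above can be turned into a simple age estimate. The sketch below assumes a closed system in which the surviving fraction of the parent isotope is known, which real 210Pb and 14C dating models do not take for granted (they rely on sedimentation models and calibration). The function name and the example fractions are illustrative assumptions, not part of any dating software.

```python
import math

def age_from_remaining_fraction(remaining_fraction, half_life_years):
    """Return the elapsed time for a given surviving fraction of parent isotope.

    Uses N(t) = N0 * (1/2) ** (t / half_life), solved for t.
    """
    if not 0 < remaining_fraction <= 1:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return half_life_years * math.log(1 / remaining_fraction) / math.log(2)

PB210_HALF_LIFE = 22.27  # years, the value quoted in the text

# One half-life leaves 50% of the parent, two half-lives leave 25%.
print(age_from_remaining_fraction(0.5, PB210_HALF_LIFE))
print(age_from_remaining_fraction(0.25, PB210_HALF_LIFE))
```

After about seven half-lives less than 1% of the parent remains, which is consistent with the practical upper limit of roughly 150 years given for 210Pb.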
Luminescence dating is based on the phenomenon that mineral crystals absorb electrons from the ionising radiation of surrounding sediments over time, and when stimulated in a laboratory by heat or light, they release the accumulated radiation as luminescence. The intensity of measured luminescence indicates the length of time between this in-lab stimulation and the last natural event of similar stimulation. Heat stimulated or thermoluminescence (TL) dating is used to date baked pottery from archeological sites or sediments once in contact with molten lava; optically stimulated luminescence (OSL) dating is used to date sediments once exposed to sunlight. The time range for luminescence dating can be from a few decades to over 1 Ma, depending on the ability of a mineral to absorb radiation over time. For studies concerning relatively young samples, OSL dating of quartz grains is generally used, covering from a few decades to ~150 ka.
Tephrochronology uses the chemical signature of tephra (volcanic ash) to pinpoint the age of that specific layer in a sediment sequence by reference to known or unknown dated volcanic eruptions. Terrestrial sediments (
DNA is everywhere - including in the air - and contamination can come from many different sources. When collecting and working with sedaDNA samples, it is important to keep in mind that the DNA you are interested in will probably be present in very low concentrations. Contamination with modern DNA can easily overpower the sedaDNA signal in which you are interested. Therefore it is important to absolutely minimise the amount of modern DNA coming into your samples and limit further degradation of the sedaDNA.
The precautions you can take include: work cleanly, use equipment that is free of DNA and nucleases, and try to keep the samples in a stable and cold environment. In practice this is not so easy, which is why dedicated ancient DNA facilities are set up to avoid any form of contamination. These facilities should be physically isolated - ideally in a separate building - from any location where PCRs are performed (
You should assume that everything that you bring into the lab is contaminated with DNA. Therefore, before entering the lab, you should have showered and changed into clean clothes and everything you bring into the lab should be decontaminated. Inside the lab, you should wear a hairnet, face mask, full body suit with hood, shoe covers, and gloves at all times. Wearing two layers of gloves will allow you to change the outer gloves while still covering your hands, and you should change your outer gloves regularly while working. All tools and equipment should be decontaminated before use, and regular cleaning of the aDNA workspace is needed. Decontamination can be achieved by using a DNA decontamination product (e.g., 3-10% bleach or DNA-ExitusPlus™) for surfaces, ideally supplemented with UV irradiation of the workspace. To prevent cross-contamination, tools should be cleaned between working with each sample or sample-extract. Tools should be left in a DNA decontamination product for at least 10 minutes, rinsed with UV irradiated milliQ water, and ideally also UV irradiated using a UV crosslinker with irradiation at the shortest distance possible to the UV source (
Choices for sampling and personal protective equipment will depend on the setting, as the sampling of sediments at an archaeological site can be very different from the sub-sampling of a lake sediment core in a lab facility. It is important to try to limit the amount of potential contamination, but practical considerations and the target DNA can also be leading. For example, a study aiming to recover human aDNA will require stricter use of personal protective equipment than a study focussing on plant aDNA. Sampling of sediments can be done directly in the field or by subsampling of sediment cores in a clean, sheltered environment. When collecting sediment cores for sedaDNA, closed-chamber piston-type corers are preferred (
A general sedaDNA sampling kit contains personal protective equipment, sampling equipment, and cleaning products, including: full bodysuits, face masks, hairnets, nitrile gloves, sterile scalpels, sample tubes, clean ziplock bags, DNA decontamination products, distilled water, 70% ethanol, trays or beakers for cleaning the tools, paper towels, trash bags and pens for labelling. To limit potential contamination, much of the preparation for the sampling kit takes place in the ancient DNA lab facility: making sure the sampling tools and collection tubes are prepared and DNA-free. Aluminium foil can be helpful for covering your workspace and provides a clean surface for all of the sampling materials at a sampling site. Sterile syringes with the tip cut off can be useful mini-corers, speeding up the sample-taking (
The sampling itself follows aDNA lab procedures where possible, even if it takes place elsewhere: clean the workspace, use personal protective equipment, do not hover over the sediment you are sampling and change outer gloves and tools between each individual sample. In order to avoid contamination, sampling should start at the oldest part of the sediment, working your way up to the youngest parts and subsamples from sediment cores should be taken from inside the undisturbed centre (
The choice for a specific DNA extraction protocol depends on a range of factors, including the aim of your study, sample characteristics, available laboratory facilities and equipment, and costs of the reagents or extraction kits. The latter can be a consideration of investing either time or finances as it can be cheaper to make the buffers needed for extraction yourself, but this also increases the preparation time and could introduce additional contamination to your samples. There are several protocols that can be used for sedaDNA extraction (see
Overview of the advantages and limitations of several commonly used extraction protocols and some example publications using these protocols.
Extraction protocol | Sample size | Advantages | Limitations | Used by |
---|---|---|---|---|
DNeasy PowerMax kit (Qiagen) | ≤ 10 g | Large initial sample volume; few inhibitors in the resulting extract | Expensive; DNA can be lost with inhibitor removal solution | |
DNeasy PowerSoil kit (Qiagen) | ≤ 250 mg | Few amplification and sequencing inhibitors in the resulting extract; easy processing of large sets of samples | DNA can be lost with inhibitor removal solution; smaller initial sample volume compared to the PowerMax kit | |
Rohland protocol ( | ≤ 50 mg | Developed to recover small DNA fragments; easy processing of large sets of samples | Small starting amount of sediment; potential coextraction of inhibitors; homemade buffers can increase contamination risk | |
Phosphate buffer + NucleoSpin® Soil kit ( | ≤ 15 g | Large initial sample volume | Extracts only extracellular DNA; processes a 2 ml subsample of the phosphate buffer and sample mixture | |
Murchie protocol ( | ≤ 250 mg | High DNA yields; uses a high volume binding buffer to improve the recovery of small DNA fragments | Optimised for permafrost samples and may not perform as well in lake sediment | |
All extraction protocols include similar steps for the isolation of sedimentary DNA (Figure
Common DNA extraction steps: (1) samples are first homogenized using a sterile scalpel and later on go through a step, in which either (2a) extracellular DNA is washed off the sedimentary matrix (
SedaDNA studies employing protocols developed for the extraction of modern environmental DNA from soils and sediments generally add additional steps to increase the yield of DNA from low concentration ancient sediment samples. A lysis step can be added to extract iDNA from intact cells present in the samples through chemical lysis, and/or mechanical shearing of cell membranes using beads. Adding certain chemicals to the lysis buffer can also increase yield: N-phenacylthiazolium bromide (PTB) breaks down cross-links between DNA and proteins (
Chapter 8 Infographic: Visual representation of the content of this chapter. Top left image based on Pederson et al. (2015).
Be aware that the presence of certain substances may inhibit further amplification or sequencing steps. These can be derived from humic substances (important components of humus), which are commonly present in sediments and might inhibit downstream analysis. Moreover, the amount of humic substances is site-specific, and it might be necessary to repurify the samples or use inhibitor removal columns. During DNA extraction, contamination may be introduced from the laboratory facilities, tools, reagents and other consumables. It is essential to track this contamination by including a negative control. It is suggested to add one such extraction control for each batch of 11 samples, and include it in all subsequent steps (e.g., metabarcoding, library preparation, sequencing;
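The rule of one extraction control per batch of 11 samples can be planned ahead of the lab work. The following is a minimal sketch of such a plan; the function name, the sample identifiers, and the `extraction_blank_N` labels are hypothetical and only serve to illustrate the batching.

```python
def plan_extraction_batches(sample_ids, batch_size=11):
    """Split samples into batches and append one negative control to each."""
    batches = []
    for start in range(0, len(sample_ids), batch_size):
        batch = list(sample_ids[start:start + batch_size])
        # Hypothetical label for the extraction blank of this batch.
        control_id = f"extraction_blank_{len(batches) + 1}"
        batches.append(batch + [control_id])
    return batches

samples = [f"S{i:02d}" for i in range(1, 26)]  # 25 hypothetical samples
for batch in plan_extraction_batches(samples):
    # Print the batch size (samples plus control) and the control label.
    print(len(batch), batch[-1])
```

Keeping the control label attached to its batch makes it easier to trace a contaminant found in a blank back to the samples that were extracted alongside it.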
After extracting the DNA, the sedaDNA needs to be further processed before sequencing and several approaches are continuously being improved and new ones developed.
Most sedaDNA studies apply a DNA metabarcoding approach, using PCR amplification primers to target short DNA sequences (< 300 bp, preferentially around or below 100 bp) from taxonomic marker genes to identify specific taxonomic groups (see Chapter 11 Amplicon metabarcoding). It is relatively low cost and some of the metabarcoding primers give high taxonomic resolution. However, this method can introduce amplification bias (
Sequencing data can be processed using bioinformatic tools, where strict quality filtering of the sequence data is followed by taxonomic assignment. Further filtering allows removal of sequences with low identity scores, contaminants (i.e., sequences present in the controls), and false-positives (see Chapter 18 Sequence to species for details). False identifications can be caused by the quality of the reference library, but also by technical errors, contamination, or errors in the DNA sequences, especially as sedaDNA is generally highly degraded and of low concentration. It is therefore important to check if the identifications make sense for the sampling location and age before further analyses of the sedaDNA data.
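The filtering logic described above can be sketched in a few lines. This example assumes that taxonomic assignment has already produced one record per taxon with an identity score and a read count; the record layout, the taxa, the read-count criterion, and both thresholds are illustrative assumptions rather than recommended settings or the output format of any particular pipeline.

```python
def filter_assignments(records, control_taxa, min_identity=0.95, min_reads=10):
    """Keep assignments that pass identity, read-count and control checks.

    records: list of dicts with keys 'taxon', 'identity' (0-1) and 'reads'.
    control_taxa: set of taxa detected in the negative controls.
    The thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for rec in records:
        if rec["identity"] < min_identity:
            continue  # low identity to the reference library
        if rec["reads"] < min_reads:
            continue  # too few reads to trust (possible false positive)
        if rec["taxon"] in control_taxa:
            continue  # also present in the controls: likely contaminant
        kept.append(rec)
    return kept

records = [
    {"taxon": "Betula", "identity": 1.00, "reads": 540},
    {"taxon": "Salix", "identity": 0.98, "reads": 120},
    {"taxon": "Solanum", "identity": 1.00, "reads": 300},  # seen in a blank
    {"taxon": "Musa", "identity": 0.90, "reads": 45},      # poor match
    {"taxon": "Dryas", "identity": 0.99, "reads": 3},      # very few reads
]
kept = filter_assignments(records, control_taxa={"Solanum"})
print([rec["taxon"] for rec in kept])
```

Automated filters of this kind do not replace the plausibility check against sampling location and age, which still requires taxonomic and ecological judgement.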
Alkylation – Addition or substitution of an alkyl group (CₙH₂ₙ₊₁) to an organic molecule.
Accelerator Mass-Spectrometry (AMS) dating – A dating method that determines the age of an organic material (i.e., macroscopic remains of plants or animals) by measuring their radiocarbon concentration.
Cell lysis – The process whereby the membrane(s) of a cell breaks down, thereby releasing the cell contents.
exDNA – Extracellular DNA; all DNA located outside cell membranes.
Geochemical fingerprinting – A method using chemical signals to infer the origin, the formation and/or the environment of a geological sample.
Half-life – The time necessary for half of the nuclei of a radioactive isotope in a sample to decay by emission of matter and energy, forming a new daughter product. The half-life is specific to each radioactive isotope, and can be used for dating purposes.
iDNA – Intracellular DNA; all DNA present within cell membranes.
Lake catchment – Area of land from which water and surface runoff drains into a lake.
Luminescence dating – A group of methods to determine how long ago mineral grains were last exposed to sunlight or sufficient heating by measuring the luminescence emitted by the mineral grain upon stimulation.
Metabarcoding – Method for the simultaneous identification of many taxa within the same complex DNA extract. This is achieved by high throughput sequencing (HTS) of amplicons from taxonomic marker genes (barcodes).
Next Generation Sequencing (NGS) – Massively parallel sequencing technology allowing high throughput of DNA.
Nucleases – Diverse group of enzymes able to hydrolyze the phosphodiester bonds of DNA and RNA thereby cleaving them into smaller fragments.
Optically stimulated luminescence (OSL) dating – Dating method that determines the age of a sample by measuring the luminescence it emits in response to visible or infrared light.
Palaeoecology – The study of the relationship between past organisms and their ancient environments.
Permafrost – Soil, sediment, or rock that is continuously exposed to temperatures of < 0 °C for at least two consecutive years.
Radioactive isotope – An atom with excess nuclear energy and prone to undergo radioactive decay.
Reference library – A database of known DNA sequences with their taxonomic identifications, used in bioinformatics as a reference to identify the DNA sequences obtained in a sedaDNA study.
sedaDNA – Sedimentary ancient DNA; this is the aged and degraded DNA from dead organisms now incorporated in the sediment record, either as iDNA in dead tissues, or as exDNA free in the sediment matrix or adsorbed to sediment particles.
Shotgun sequencing – A method for the random sequencing of all of the DNA within a DNA extract.
Taphonomic processes – The processes involved in the transfer, deposition and preservation of organismal remains, including DNA.
Target capture – A technique that allows the capture of the DNA of interest by hybridization to target-specific probes (baits).
Tephrochronology – A geochronological technique that uses layers of tephra (volcanic ash from a single volcanic eruption) to create a chronological framework for the sedimentary record.
Thermoluminescence (TL) dating – Dating method that determines the age of a sample by measuring the luminescence it emits in response to heat.
Total DNA – The intracellular and extracellular DNA combined.
Tree-ring dating – Also called dendrochronology; a method of dating tree rings to the exact year they were formed.
The revolution in genome-wide screening has vastly reduced the price for sequencing, with enormous implications in the biomedical field, industry, biodiversity monitoring, as well as in plant identification. The first plant genome (Arabidopsis thaliana L.) was sequenced using Sanger sequencing. This took 10 years to complete with an associated cost of approximately $100,000,000 (Arabidopsis Genome Initiative 2000). With current high-throughput sequencing (HTS) methods, this same genome now takes 1 week to sequence and assemble, and costs $1000 (
Sanger sequencing was introduced in 1977 by Sanger and colleagues, and for over 40 years, it was the most commonly-used form of sequencing (
In the second step of Sanger sequencing, the oligonucleotides are separated by size using capillary gel electrophoresis. A laser excites the terminal fluorescent nucleotide in each oligonucleotide, resulting in fluorescence emission that is detected and read by a computer. By reading the gel bands from smallest to largest, the 5’ to 3’ sequence of the target DNA can be determined at single base pair resolution. The data output for Sanger sequencing is a chromatogram which is automatically read by a computer to generate the DNA sequence. Primer sequences should be trimmed off the reads as these are not part of the target DNA, and the quality of the chromatogram should be assessed to determine the reliability of the generated DNA sequence. There are a number of online tutorials from both industrial and academic sources that we refer the reader to for assessing chromatogram quality (University of Michigan, Biomedical Research Core Facilities, n.d.). Base calling accuracy can also be measured using Phred quality scores (
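A Phred quality score Q relates to the estimated probability P that a base call is wrong through Q = -10 log10(P), so that Q20 corresponds to a 1% and Q30 to a 0.1% error probability. The sketch below converts between the two and trims low-quality read ends; the cut-off of Q20 and the per-base scores are made-up values for illustration, and the helper names are not taken from any base-calling software.

```python
import math

def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the base-call error probability."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)

def trim_low_quality_ends(qualities, min_q=20):
    """Return (start, end) slice indices after trimming ends below min_q."""
    start = 0
    end = len(qualities)
    while start < end and qualities[start] < min_q:
        start += 1
    while end > start and qualities[end - 1] < min_q:
        end -= 1
    return start, end

for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))

quals = [8, 12, 25, 40, 38, 35, 15, 9]  # made-up per-base scores
print(trim_low_quality_ends(quals))
```

Only the read ends are trimmed here; low-quality positions in the middle of a read would need separate inspection of the chromatogram.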
Sanger sequencing is not used today for large-scale genomic projects due to its low throughput. The requirement for specific primers for a region of interest limits its application across divergent plant taxa. Additionally, the amplification of multicopy genes, such as the commonly used DNA barcode ITS (see Chapter 10 DNA barcoding), as well as markers in taxa of allopolyploid hybrid origin, results in difficult-to-interpret chromatograms. This is because nucleotide polymorphisms between different copies result in double peaks in the resulting chromatogram (
Illumina was the second HTS technique that became commercially available in the early 2000s (
In Illumina sequencing, like in other high throughput sequencing approaches, the target DNA is initially broken into shorter fragments that match the optimal fragment sequencing length of the platform, if not already present as shorter segments. These fragments are then PCR-amplified with adaptors that can be individually chemically tethered to the flow cell surface. Using bridge amplification (
Dyed dNTPs are added in a controlled fashion through the use of reversible blocking group chemistry, so that the emission of each added fluorescent dNTP is read before the addition of the next fluorescently-labelled dNTP. This process is done on millions of fragments simultaneously, making it a far more efficient method than Sanger sequencing for large-scale genomic projects (
Two limitations to consider with Illumina sequencing however are that the produced reads are relatively short (50 to 300 bp), and similarly to Sanger sequencing, most applications require a PCR amplification step. However, PCR free library kits and protocols provide increasingly good results, and have the important advantage of reducing typical PCR-induced biases. Assembling whole genomes using short read Illumina methods, especially if they are highly repetitive, can be challenging (
Current examples of Illumina sequencing platforms, specifications, and suitability for different applications in plant identification.
Illumina sequencing platform | MiSeq | HiSeq 2500* | HiSeq 3000* | HiSeq 4000* | NextSeq 1000 and 2000 | NovaSeq 6000 |
---|---|---|---|---|---|---|
Specifications | ||||||
Maximum read length (paired-end, bp) | 2 x 300 | 2 x 250 | 2 x 150 | 2 x 150 | 2 x 150 | 2 x 250 |
Maximum reads per run (single reads) | 25 million | 600 million | 2.5 billion | 5 billion | 1.1 billion | 20 billion |
Flow Cell output | 15 Gb | 300 Gb | 750 Gb | 1.5 Tb | 330 Gb | 6 Tb |
Method suitability | ||||||
Metabarcoding | +++ | +++ | + | + | + | ++ |
Target Capture | + | + | + | +++ | + | +++ |
Shotgun sequencing | + | ++ | +++ | +++ | ++ | +++ |
Genome skimming | + | ++ | +++ | +++ | ++ | +++ |
Organellar sequencing (plastids) | + | ++ | +++ | +++ | ++ | ++ |
Transcriptomics: gene targeted | +++ | +++ | + | + | + | ++ |
Transcriptomics: total RNA/mRNA seq | + | + | ++ | ++ | ++ | +++ |
Pacific Biosciences (PacBio) sequencing is based on single molecule real time (SMRT) technologies for reading DNA and RNA sequences. No PCR amplification is required, which for certain applications can be advantageous. This is the case when PCR inhibitors may be present, when the sequence is GC-rich, or when PCR bias should be avoided. Additionally, PacBio reads are considerably longer than in either Sanger or Illumina sequencing (up to 25 kb; Pacific Biosciences, n.d.). This reduces computational challenges related to assembling contigs into full sequences. PacBio is considered a third generation sequencing technology, as it reads the nucleotide sequence both in real-time and at the single molecule level (
Similarly to Illumina and Sanger sequencing, PacBio also uses fluorescently-labelled dNTPs for determining a target DNA sequence. PacBio however employs a technology called zero mode waveguides (ZMW) to read nucleotide sequences at the single molecule level. ZMWs are nanosized wells that can be etched into different materials, with zeptoliter (10⁻²¹ L) volumes. ZMW technology differentiates a fluorescent molecule that is floating in solution from a fluorescently-labelled nucleotide that is located at the bottom of the well. A single DNA polymerase is tethered to the bottom of each well, and when a fluorescently-labelled dNTP is incorporated into the growing DNA strand, the fluorescent label is cleaved off. There is a unique fluorescent marker for each of the 4 nucleotides, and each cleavage event is read and directly linked to a specific nucleotide (van Dijk et al. 2018). Additionally, the rate of addition can be used to infer whether the target DNA carries base modifications (e.g., methylation), since a modified DNA strand moves more slowly through the DNA polymerase, resulting in a reduced incorporation rate for a fluorescent nucleotide. This information is extremely powerful for predicting epigenetic modifications that are critical for a variety of biological functions. In addition, chemical modifications that are often present in aDNA can also be detected, making PacBio a particularly useful technique for assessing aDNA damage (
While previously PacBio suffered from a high error rate in comparison to Illumina sequencing, this has been dramatically reduced by the introduction of circular consensus sequencing (CCS), also known as long high-fidelity (HiFi) reads (
Oxford Nanopore (or simply Nanopore) sequencing is also a third generation technology that is single-molecule based and measured in real time. Nanopore differs from the other sequencing technologies discussed here in that no DNA polymerase is required, and no expensive chemically modified dNTPs are necessary for reading the target sequence. The system consists of an electrolytic solution and a nanosized, biologically-derived pore in an insulating solid (a material that does not conduct electricity). The biological nanopores used in this technology are derived from proteins that form pores in biological membranes and naturally function to allow the passage of ions and biomolecules across the membrane. When an electric field is applied, ions in the electrolytic solution pass through the pore, resulting in a stable current that can be detected. When larger molecules, such as a DNA strand, pass through the pore, detectable disruptions in the current occur. With a DNA strand, sequences of 6–7 nucleotides move through the pore and the movement of these bases yields a changing detectable disruption. This disruption has a unique signature with a specific current change for a specific length of time that can be linked to each of the four individual nucleotides. From the current disruption pattern it is possible to deduce the sequence. Moreover, since it is the change in current through the pore that is detected, no other chemical markers are necessary (
Nanopore technologies, with a read length up to 4 Mb, are rapidly becoming important due to their scalability and portability. The MinION sequencing platform (theoretical output up to 50 Gb/flow cell) is a portable and cost-effective option (87 g, available from $1000) that can be used in the field. Already, a number of excellent examples of biodiversity studies (and plant-based studies in particular) are available in the literature (
Library preparations are essential for all experiments involving HTS. General points to consider are discussed here and we also refer to Chapter 12 Metagenomics and Chapter 15 Transcriptomics for more details.
Unlike in other forms of sequencing, Ion Torrent technologies are not based upon optical outputs, but rather on changes in pH. When a DNA polymerase adds a nucleotide to a growing DNA strand, a proton is released upon each addition. It is this release of protons into solution, and the resulting change in the pH of the solution, that is detected in Ion Torrent technologies (
Similarly to Illumina sequencing, the target DNA is initially fragmented (200–600 bp) and PCR-amplified with adaptors that can be tethered to micro-machined wells on a semiconductor chip. The plates are then flooded with one of the 4 nucleotides at a time. If a nucleotide is added across from the complementary base in the single-stranded DNA by the DNA polymerase, it results in the release of a proton and a subsequent change in solution pH. This shift in solution pH is detected by an ion-sensitive field-effect transistor (ISFET), which can detect changes in proton concentration. This is done in a massively parallel fashion, with 1000s of microwell plates being used simultaneously. The pH change that results from the addition of multiple nucleotides in a repetitive sequence is also detectable using this technology, as the addition of two nucleotides will result in double the voltage change produced by the addition of a single nucleotide. The data output with Ion Torrent technologies can provide an approximate readout of 10 Mb in a single run with conventional machines, and up to 10 Gb with the newest models. The platform, however, struggles with base calling of homopolymers, and for these sequences it can be a challenge to obtain accurate reads.
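The flow-by-flow logic described above can be illustrated with a minimal sketch. It is an idealised toy model, not the vendor's base caller: the flow order `TACG`, the function names, and the noise-free integer signals are all assumptions made for illustration. Each flow offers one nucleotide, and the signal is proportional to the number of identical bases incorporated in that flow, which is why long homopolymers are hard to call once real measurement noise is added.

```python
def flow_signals(synthesised, flow_order="TACG", n_flows=16):
    """Return one (nucleotide, signal) pair per flow.

    `synthesised` is the strand being built by the polymerase.
    The signal is the number of bases incorporated in that flow
    (0 = no incorporation, 2 = a homopolymer of length two, ...).
    The flow order is a hypothetical choice for this sketch.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # All consecutive identical bases are incorporated in one flow.
        while pos < len(synthesised) and synthesised[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        if pos == len(synthesised):
            break
    return signals


def call_bases(signals):
    """Convert (nucleotide, signal) pairs back into a sequence.

    Signals may be noisy floats; they are rounded to the nearest
    integer, so a small error on a long homopolymer changes its length.
    """
    return "".join(base * round(signal) for base, signal in signals)


signals = flow_signals("TTAGC")
print(signals)              # [('T', 2), ('A', 1), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]
print(call_bases(signals))  # TTAGC
```

In this model a measured signal of 4.4 for a true run of five identical bases would be rounded to four, which mirrors the homopolymer indel errors mentioned in the text.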
The Ion Torrent machine and sequencing chips are relatively inexpensive compared to Illumina and PacBio, which made the platform popular in smaller labs without access to a high-throughput sequencing core facility, though its use is no longer as common.
The sequencing platform that is ultimately chosen by a scientist depends on a number of factors. This can include (but is not limited to) the scientific question being considered, the quality of target DNA (see Chapter 1 DNA from plant tissue), costs, as well as in-house expertise and/or availability of existing platforms. In all cases, however, the quality and sequencing depth of target DNA should be considered. For DNA that is primarily expected to exist in shorter sequences (i.e., samples that are expected to be degraded from herbarium or ancient sources), then technologies requiring long reads are often not necessary, and Illumina sequencing or Ion Torrent technologies may be sufficient. If however one wishes to avoid any PCR bias or acquire long reads, then using PacBio or Nanopore is advisable. Finally, it may even be useful to use two different types of sequencing to overcome each technology’s respective limitations. For example, in whole genome sequencing, hybrid methods combining Illumina with PacBio are commonly used to ensure long reads and high accuracy.
Sequencing platform choices for different experimental questions and sample types.
Experiment or sample considerations | Recommended method(s) | Comments |
---|---|---|
Whole genome or organellar sequencing project (genome skimming, genome resequencing, de novo genome assembly) | Illumina, PacBio, or a combination of both | Illumina is the method of choice for resequencing for high throughput short read projects due to its high read accuracy |
Barcoding | Sanger sequencing or PacBio CCS | Larger projects are moving to PacBio CCS to reduce costs. Multiplexing very large numbers of samples is necessary to optimise costs |
Metabarcoding/Target capture | Illumina, MGI, DNBSEQ, or Ion Torrent | PacBio and/or Nanopore may also be considered if the sequence is expected to be highly repetitive |
Heavily degraded samples (i.e., herbarium or ancient DNA samples) | Illumina (or Ion Torrent) | PacBio may also be relevant for the study of post-genetic modifications often found in ancient DNA samples, or if dealing with hard-to-phase sequences |
On-site sequencing | Nanopore (MinION) | |
Hi-C/3C-Seq/Capture-C | Illumina | |
In the last decades, developments in sequencing platforms have primarily focused on increasing the throughput and accuracy of sequencing output, increasing the length of reads, and reducing costs. We can expect the field to continue developing further in this direction, with a focus in particular on the miniaturisation of these platforms for more on-site work, as well as better automation and integration of analytical software and data analysis pipelines. In particular, miniaturisation and automation of data analysis can be expected to have major impacts in regulatory fields related to both food safety and trade, where the ability for non-specialists to rapidly test on-site for the presence/absence of species will be extremely useful (see Chapter 22 Healthcare and Chapter 23 Food safety). Further development of HTS technologies to be used at the single-cell level and in functional studies can also be expected.
Allopolyploid hybrids – A polyploid species with multiple sets of chromosomes that originate from different species. If the hybrid is derived from two diploid species, the resulting tetraploid is fertile. These allopolyploid hybrids may be at least partially reproductively isolated from the parent species from which they are derived, and allopolyploid speciation is the best known route to hybrid speciation in plants.
Bridge amplification – A method used in Illumina sequencing to create DNA clusters with 1000s of double-stranded copies of the target DNA in flow cells. After amplification and generation of these clusters is complete, the reverse strand is washed away and sequencing by synthesis takes place.
Capillary gel electrophoresis (CGE) – An analytical method for the separation of charged molecules. DNA is separated according to size with this technique, with only nanogram quantities necessary for the input. Single-base pair resolution can be achieved on fragments up to several hundred base pairs in length.
Circular consensus sequencing (CCS) – Developed by PacBio and also known as HiFi reads, involves the circularisation of a target DNA strand by ligating adapters to the ends of the strand (the resulting construct is called a SMRTbell). This SMRTbell can be read multiple times by a DNA polymerase, dramatically reducing the error rate in the generated sequence.
Electrolytic solution – An electrically conductive solution. This conductivity is often due to the presence of ions in solution (for example dissociated Na+ and Cl- ions), though non-ionic solutions can also be conductive.
Epigenetic modifications – Alterations in gene expression and cellular function without changes to the original DNA sequence. Three mechanisms for epigenetic modifications so far identified include DNA methylation, histone modification, and non-coding RNA (ncRNA)-associated gene silencing.
Insulating solid – A solid material that an electric current cannot pass through.
Ion-sensitive field-effect transistor (ISFET) – A field-effect transistor that can measure ion concentrations in solution. Changes in the H+ concentration result in a pH change in solution that results in changes in the current that is detected. This technology is used in Ion Torrent sequencing platforms to identify when a nucleotide is added to a growing DNA strand and is the basis for identifying the target DNA sequence.
Phred quality scores – Scores to measure the confidence of the nucleobase identifications generated from DNA sequencing methods. They are widely accepted for assessing the quality of reads.
Rolling circle amplification (RCA) – Where a linear single-stranded DNA molecule is firstly circularised and then copied multiple times as a single sequence (
Single-molecule real-time sequencing (SMRT) – A term coined by PacBio to describe their sequencing technologies. In contrast to second generation sequencing methods, SMRT technologies possess single-molecule sensitivity and provide the sequence readout in real time, dramatically increasing the sensitivity and turnaround times for DNA sequencing.
Zero mode waveguide (ZMW) – Nanosized wells that can be etched into different materials, with zeptoliter (10⁻²¹ L) volumes. ZMW technology differentiates a fluorescent molecule that is floating in solution from a fluorescently-labelled nucleotide that is located at the bottom of the well. This technology is used by PacBio for the single-molecule detection of fluorescently-labelled nucleotides that are added to immobilised DNA at the bottom of these wells so that nucleotide incorporation can be detected in real time.
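As a small worked illustration of the Phred quality scores defined in this glossary, the sketch below converts between a base-call error probability P and a quality score Q using the standard relation Q = −10 log₁₀ P, and decodes a FASTQ quality string. The function names are hypothetical, and the ASCII offset of 33 is an assumption that holds for the commonly used Sanger/Illumina 1.8+ encoding but should be checked for older data.

```python
import math


def phred_from_error(p_error):
    """Phred score for a given probability that a base call is wrong."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Probability that a base call is wrong for a given Phred score."""
    return 10 ** (-q / 10)


def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality line into a list of Phred scores.

    Assumes the Phred+33 encoding; adjust `offset` for other encodings.
    """
    return [ord(char) - offset for char in quality_string]


print(round(phred_from_error(0.001)))  # 30 (one error expected per 1000 bases)
print(error_from_phred(20))            # 0.01 (one error expected per 100 bases)
print(decode_fastq_quality("I5!"))     # [40, 20, 0]
```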
The method of identifying living organisms to species level using DNA sequences has been coined DNA barcoding (
DNA-based typing for species identification focused first on microbial organisms (
In recent years, the barcoding movement has grown substantially, and worldwide efforts coordinated by the Consortium for the Barcode of Life (CBOL) are now being focused on barcoding all organisms (
Species delimitation is a central tenet of taxonomy (see Chapter 17 Species delimitation). Traditionally, species were identified, described and classified based mainly on their morphological characters. This is more difficult when it comes to cryptic, hybridising or highly convergent species (
Most species concepts agree on species being evolving metapopulations (de Queiroz 2007), and this implies that genetic variation exists both within and between species. Advanced approaches using many accessions as well as many loci, such as species delimitation based on multispecies coalescent theory, can enhance species identification resolution. However, more data also adds new challenges, and inferred structure due to population-level processes and that due to species boundaries are hard to distinguish (
To identify an unknown DNA barcode using a reference library, one can use several approaches to look at the interrelatedness of the samples (see Chapter 18 Sequence to species). Many databases including GenBank and BOLD (
The mitochondrial genome in plants evolves far too slowly to allow it to distinguish between species (
The plastid marker rbcL, for example, proved useful for inferring relationships between angiosperm families (
The nuclear ribosomal marker ITS, and specifically nrITS2, is used commonly in barcoding and metabarcoding studies (China Plant BOL Group et al. 2011;
The strict requirements for both universality and high variability for potential universal barcodes have led some to label DNA barcoding a “search for the Holy Grail” (
The core plant DNA barcoding markers are rbcL and matK (CBOL Plant Working Group 2009). nrITS (or nrITS2 only) is the third most commonly used barcode (China Plant BOL Group et al. 2011;
Barcode | Primer | Sequence (5’-3’) | Dir. | Reference |
---|---|---|---|---|
rbcLa | rbcLa_f | ATGTCACCACAAACAGAGACTAAAGC | F | |
rbcLa | rbcLa_rev | GTAAAATCAAGTCCACCRCG | R | |
matK | matk-3F | CGTACAGTACTTTTGTGTTTACGAG | F | CBOL Plant Working Group (2009) |
matK | matk-1R | ACCCAGTCCATCTGGAAATCTTGGTTC | R | CBOL Plant Working Group (2009) |
nrITS | ITS5a | CCTTATCATTTAGAGGAAGGAG | F | Wurdack in |
nrITS | ITS4 | TCCTCCGCTTATTGATATGC | R | |
nrITS2 | S2F | ATGCGATACTTGGTGTGAAT | F | |
nrITS2 | S3R | GACGCTTCTCCAGACTACAAT | R | |
trnL P6 | trnL-g | GGGCAATCCTGAGCCAA | F | |
trnL P6 | trnL-h | CCATTGAGTCTCTGCACCTATC | R | |
psbA-trnH | psbA | GTTATGCATGAACGTAATGCTC | F | |
psbA-trnH | trnH | CGCGCATGGTGGATTCACAATCC | R | |
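Some of the primers listed above contain IUPAC ambiguity codes (for example the `R` in rbcLa_rev, which stands for A or G), and reverse primers are written 5’-3’ on the opposite strand. The following sketch, with hypothetical function names, expands a degenerate primer into the concrete oligonucleotides it represents and computes a reverse complement; it is meant as an illustration of how to read the table rather than a primer design tool.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of every code, including the ambiguous ones.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def expand_degenerate(primer):
    """List every unambiguous sequence represented by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Reverse complement of a sequence that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(expand_degenerate("GTAAAATCAAGTCCACCRCG"))
# ['GTAAAATCAAGTCCACCACG', 'GTAAAATCAAGTCCACCGCG']
print(reverse_complement("TCCTCCGCTTATTGATATGC"))
# GCATATCAATAAGCGGAGGA
```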
When choosing appropriate markers for a plant DNA barcoding study it is important to consider the following questions:
What is the necessary taxonomic level of identification? For composition studies of a flora or vegetation, genus-level identifications are often sufficient. Species-level identification can however be important for other questions. Identifying all angiosperms in Greenland is more straightforward than in a Neotropical rainforest. Also, although family-level identifications in Greenland provide useful insights into the local flora, this information most often does not have meaningful applications in rainforests. After deciding on the appropriate level of identification, the researcher then needs to determine whether multiple markers are necessary to ensure that all species can be distinguished.
What kind of a reference library will you use to identify the target barcodes? Query identification in a database that contains all plants is more challenging than with a tailored reference library. For example, identifying a sequence of Oxalis (Oxalidaceae) is easy in a database of Scandinavian plant sequences because there is only a single native Oxalis species. Any queried Oxalis sequence would match the Scandinavian Oxalis acetosella because it would be the only reference Oxalis sequence in a local database. In contrast, a database with South American Oxalis species has hundreds of taxa, and identification requires a marker with sufficient variation to discriminate between these species. Thus, for Scandinavia, one could use a marker with limited variation but universal primers, whereas for South America a specific marker or markers should be sought that can distinguish all Oxalis species present in a global database. It is therefore critical to pick your marker(s) based on the expected diversity in your reference library.
What is your source of reference sequences? If you want to identify species, which is common in studies aiming to authenticate herbal drugs and supplements, you need to include all putative species in your reference library. For example, if your goal is to identify a European wild collected Hypericum, your reference library should ideally include all European Hypericum species that could be confused or substituted for Hypericum perforatum. A reference library can be compiled from de novo sequenced amplicons from voucher accessions or from reference sequences mined from public repositories.
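The effect of reference library composition on identification can be sketched with a toy example. The sequences below are invented placeholders, not real barcodes, the taxa marked "hypothetical" do not refer to real species, and the identity threshold and function names are assumptions. The sketch simply reports every reference that ties for the best match, so that an identification that is unambiguous against a small local library becomes ambiguous against a larger one when the marker lacks variation.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def identify(query, library, min_identity=0.9):
    """Return all taxa tied for the best match, or [] if nothing is close enough."""
    if not library:
        return []
    scores = {taxon: identity(query, seq) for taxon, seq in library.items()}
    best = max(scores.values())
    if best < min_identity:
        return []
    return sorted(taxon for taxon, score in scores.items() if score == best)


# Invented 20-bp placeholder sequences, for illustration only.
local_library = {"Oxalis acetosella": "ACGTTGCAAGCTTAGGCTAA"}
larger_library = {
    "Oxalis acetosella": "ACGTTGCAAGCTTAGGCTAA",
    "Oxalis sp. A (hypothetical)": "ACGTTGCAAGCTTAGGCTAA",
    "Oxalis sp. B (hypothetical)": "ACGTTGCTAGCTTAGGCTAA",
}
query = "ACGTTGCAAGCTTAGGCTAA"

print(identify(query, local_library))   # ['Oxalis acetosella']
print(identify(query, larger_library))  # ['Oxalis acetosella', 'Oxalis sp. A (hypothetical)']
```

A result with more than one taxon indicates that the chosen marker cannot resolve the query to species level within that library, and that a more variable marker or additional markers would be needed.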
DNA barcoding of plants encompasses two streams of data from organism to DNA, one for the query sequence that should be identified and one for the reference sequence that is part of the reference library for identification. DNA source, marker choice, primer choice, sequencing approach and identification strategy all influence the ability and resolution of identification.
After choosing one or several markers, it is important to consider the following:
Are universal primers available? If yes, this facilitates your project. However, are these primers really universal? Check this by seeing whether the study publishing the primers gets cited by relevant studies and look for larger studies and reviews that might provide more information about (1) amplification success with these primers; (2) ability to amplify from degraded or poor DNA extracts, a common challenge when working with older herbarium vouchers or processed herbal products; and (3) the need to tweak amplification protocols to make these primers work. If no universal primers are available, try to find studies using this marker and see which primers were used, to find suitable primers that you can then test. If possible, use studies targeting the same target order, family or genus. If there are no previously published primers for your marker, then it is necessary to design your own. If your primers target a widely used marker, then the primer performance that is assessed based on matching these novel primers to multiple sequence alignments of published data (in-silico testing) is generally reliable. If only genomic data is available, however, the accuracy of in-silico testing will be highly dependent on the relatedness of the reference genomes.
Do the primers amplify the right part of the marker? Primers can target fragments of longer loci, i.e., parts of rbcL, matK, nrITS. It is thus important that the segment the primers amplify is useful for your study. It should generate sequences that are identifiable in your reference library and variable enough for your intended level of identification. For example, targeting trnL intron with the universal g-h primers will yield short amplicons, and these have less variation than the entire trnL-F region. Make sure you reassess your marker choice after selecting suitable primers.
How many primers per marker will you use? Long markers can be hard to amplify from degraded templates and can be split up into multiple primer pairs. Degraded DNA is a common challenge when working with older herbarium vouchers or processed herbal products. Different combinations of forward and reverse primers can also increase the chance of successful amplification, as having multiple different primers increases the chance that one of them has a good fit to the organism being tested. However, the primer pair with the best fit and targeting the shorter marker will amplify more effectively than other pairs or longer fragments, and this can lead to amplification bias.
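The in-silico testing mentioned in the questions above can be sketched as follows. This is a deliberately simple illustration under stated assumptions: the function names are hypothetical, binding sites are scored only by the number of mismatches (ignoring mismatch position, melting temperature, and secondary structure), and the template is an invented sequence built around the trnL g-h primers from the primer table. Dedicated tools should be used for real primer evaluation.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_site(template, primer, max_mismatches=2):
    """Return (start, mismatches) of the best binding site, or None."""
    best = None
    for start in range(len(template) - len(primer) + 1):
        window = template[start:start + len(primer)]
        # A template base mismatches if the (possibly degenerate) primer base does not allow it.
        mismatches = sum(t not in IUPAC[p] for p, t in zip(primer, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


def in_silico_pcr(template, forward, reverse, max_mismatches=2):
    """Return the predicted amplicon (primers included), or None if a primer fails to bind."""
    fwd = find_site(template, forward, max_mismatches)
    # The reverse primer binds the opposite strand, so search for its reverse complement.
    rev = find_site(template, reverse_complement(reverse), max_mismatches)
    if fwd is None or rev is None or rev[0] < fwd[0] + len(forward):
        return None
    return template[fwd[0]:rev[0] + len(reverse)]


forward = "GGGCAATCCTGAGCCAA"        # trnL-g
reverse = "CCATTGAGTCTCTGCACCTATC"   # trnL-h
# Invented template: flanks and insert are placeholders, not a real trnL sequence.
template = "AAAC" + forward + "TTTTACGT" + reverse_complement(reverse) + "CC"
print(in_silico_pcr(template, forward, reverse))
```

Running the same function over an alignment of reference sequences and recording the mismatch counts per taxon gives a first impression of how universal a primer pair is likely to be, and of which taxa may be under-amplified.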
Once a suitable combination of markers has been found and suitable primers or primer panels have been selected, it is important to test the primers on a sufficient number of your samples. Template DNA quality, DNA concentration, and the effects of inhibiting secondary metabolites can all influence the efficacy of the PCR and might require optimization to obtain the best possible results for the largest number of samples. This is beyond the scope of this book, but sufficient online resources are available to help you with optimization. In addition, there are many online discussion forums to troubleshoot PCR optimization.
The subsequent chapters in section 2 describe different sequencing platforms and approaches to obtain DNA sequences for downstream analysis, and section 3 provides an overview of applications of molecular identification of plants. Depending on whether one chooses standard DNA barcoding using Sanger sequencing, DNA metabarcoding using Ion Torrent, Illumina, or other platforms, or a variety of whole or reduced library representation genome sequencing approaches, one will need to choose different wet lab steps to create the relevant sequencing libraries. Check out the relevant chapter for your application to find out more.
matK – Plastid gene coding for maturase K. matK is one of the core plant DNA barcodes.
nrITS – Internal transcribed spacer (ITS) is a spacer situated between the small-subunit rDNA and large-subunit rDNA genes. In plants, it is flanked by the 18S and 26S rDNA genes. nrITS is split into two spacers, nrITS1 and nrITS2, with the 5.8S rDNA gene in between. nrITS is highly variable, and primers are designed in the conserved 18S, 5.8S, and 26S rDNA genes.
psbA-trnH – Plastid intergenic spacer region between the coding genes psbA and trnH. psbA-trnH has been advocated as a plant DNA barcoding marker.
Primer – Short DNA sequence used to amplify a marker.
rbcL – Plastid gene coding for ribulose-1,5-bisphosphate carboxylase-oxygenase. Most barcoding studies target the rbcLa region, but will refer to rbcL. rbcL is one of the core plant DNA barcodes. Plastids in plants are often incorrectly referred to as chloroplasts.
DNA metabarcoding is an approach where taxonomically informative regions in the DNA are amplified from mixed-template samples containing DNA from different taxa for identification (Pompanon et al. 2012;
Metabarcoding is based on the DNA barcoding concept (see Chapter 10 DNA barcoding). However, for metabarcoding, samples containing DNA from a mix of different taxa are typically used. One of the first studies that used metabarcoding on a parallel sequencing system (therein referred to as DNA barcoding) to identify plants was by Valentini and colleagues (
Plant metabarcoding is widely used to study the taxonomic composition of mixed template samples such as water (
Plant metabarcoding has been used in various types of applications including species delimitation (see Chapter 17 Species delimitation), archaeo- and palaeo-botany (
DNA metabarcoding is a cost-effective method as compared to metagenomics (
However, DNA metabarcoding also has its limitations, and the PCR amplification step has previously proven to be particularly problematic (
Another drawback of DNA metabarcoding is primer binding bias due to mismatches between the primer and the template DNA. This can result in discrepancies between the proportion of the original taxa in the DNA extract and the amplified DNA sequences (
Finally, the taxonomic assignment of sequences to species is heavily dependent on the DNA reference database used for sequence matching. When the reference database to which the resulting sequences are compared is incomplete and/or consists of inaccurately identified species, this results in erroneously identified species and/or false negatives (
At the start of any (plant) metabarcoding study lies a clearly defined research question. A study design should furthermore encompass a clear sampling strategy, and identification of suitable DNA extraction techniques for the sample type used before carrying out downstream analysis (
Barcode choice is one of the most important aspects of metabarcoding studies as it will determine which taxa are identified and to what resolution. Considerable efforts have gone into constructing libraries for these plant barcodes and in assessing their limitations (CBOL Plant Working Group 2009;
In the metabarcoding laboratory workflow, unique nucleotide tags are added to amplicons, and these tags are used to assign sequences to the sample they originate from (
In the ‘one-step PCR’ approach, the metabarcoding barcode is amplified and built into libraries during one PCR. This is achieved through the use of metabarcoding primers that carry both adapters and library indexes (
In the ‘two-step PCR’ approach, sample extracts are PCR-amplified with metabarcoding primers that only carry 5’ tails. These are added to act as templates for the following second PCR and do not include any labelling. The second PCR is carried out on each PCR product with primers that carry adapters and indexes (
In the ‘tagged PCR’ approach, DNA extracts are PCR amplified with metabarcoding primers that carry 5’ unique nucleotide tags. Next, the individually 5’ tagged PCR products are pooled and library preparation is carried out on the pools (first demonstrated by (
With the cost of sequencing decreasing exponentially, more effort can be put into applying technical PCR replicates to circumvent sequencing errors and other PCR related issues. When using PCR replicates they should be sequenced in separate locations on the same 96-well plate or, ideally, with replicates in separate plates.
Taxa identification lies at the core of any ecological research question. Thus, it is crucial to perform a reliable and reproducible identification workflow to ensure correct identification. In general, care should be taken to avoid cross-contamination between samples by working in clean laboratories with filter-tipped pipettes and separate pre- and post-PCR labs. Normalisation of the amplicons prior to library construction is crucial to avoid overamplification of the most represented taxa in the sample. Since some often-used plant-specific marker regions are very short (e.g., trnL P6 loop, 8 to 152 bp), they are prone to picking up the slightest contaminants from the environment. It is therefore recommended to work in a clean environment, e.g., an ancient DNA laboratory with protective clothing.
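One common way to use technical PCR replicates is to retain only those sequences that are detected in several replicates of the same sample. The sketch below illustrates this idea; the function name and the thresholds (`min_replicates`, `min_reads`) are hypothetical and would need to be tuned for each study, and the three-letter sequences are placeholders.

```python
from collections import Counter


def filter_by_replicates(replicates, min_replicates=2, min_reads=1):
    """Keep sequences detected in at least `min_replicates` PCR replicates.

    `replicates` is a list with one dict per replicate, mapping each
    sequence to its read count. Returns the summed read count of every
    retained sequence.
    """
    presence = Counter()  # number of replicates in which a sequence occurs
    totals = Counter()    # summed read count across replicates
    for counts in replicates:
        for sequence, n_reads in counts.items():
            if n_reads >= min_reads:
                presence[sequence] += 1
                totals[sequence] += n_reads
    return {seq: totals[seq] for seq in presence if presence[seq] >= min_replicates}


# Placeholder sequences and counts for three replicates of one sample.
sample_replicates = [
    {"AAA": 10, "CCC": 1},
    {"AAA": 8},
    {"AAA": 12, "GGG": 2},
]
print(filter_by_replicates(sample_replicates))  # {'AAA': 30}
```

Sequences seen in a single replicate only, such as `CCC` and `GGG` here, are treated as possible PCR or sequencing artefacts. Stricter settings reduce false positives at the cost of losing rare taxa.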
The preferred platforms for sequencing are currently Ion Torrent and Illumina. Both platforms require an additional post-ligation PCR step or PCR-free ligation of platform-specific adapters to the amplicons before sequencing. However, due to the different technologies behind the two platforms, both the error rates and error types can differ. For Illumina (optical sequencing), a substitution error rate of 0.1% has been identified, while Ion Torrent (based on detection of hydrogen ions) can show up to 1% indel errors (
Several different bioinformatic tools can be used to analyse the sequence output. Some commonly used packages are OBITools (
Currently, metabarcoding is the dominant technique used in the identification of plants from mixed samples. Developments and improvements in addressing methodological challenges such as PCR bias may one day allow for unbiased quantitative inferences from metabarcoding datasets. This would be a huge step forward for the metabarcoding community since it is still controversial to use read counts as an indication for biomass (
Metabarcoding could potentially be used to determine plant composition in a landscape from bulk arthropod samples. Bulk arthropod samples have been used for biodiversity monitoring of vertebrates (
Adapters – Specific nucleotide sequences unique to different types of sequencing platforms that are added to amplicon libraries to allow for the attachment of library fragments to the flow cell for sequencing.
Amplicons – Products of PCR amplification.
ASVs – Amplicon sequence variants are also known as exact sequence variants or zero-radius OTUs. Although sometimes considered synonymous to OTUs, they correspond to all the unique reads in a dataset and do not require clustering used in creating OTUs.
Barcode – Targeted gene region, see Locus.
Demultiplexing – Bioinformatics step of assigning sequences to samples based on assigned nucleotide tags and/or library indexes.
Epilithic – Plant growing on surfaces of rocks, e.g., seaweeds.
Homopolymers – Stretches of a single nucleotide repeated in tandem, usually more than 7 nucleotides long.
Indel errors – Insertions or deletions in sequences resulting from mutations.
ITS – The internal transcribed spacer is a nuclear ribosomal region found between the small subunit ribosomal RNA (rRNA) and large-subunit rRNA genes.
Library indexes – Nucleotide index added to amplicon libraries to allow for the parallel sequencing of multiple libraries, which can be used bioinformatically to assign reads to the correct amplicon libraries.
Locus – Section and position in a chromosome where a particular DNA sequence is located. It can also be referred to as a barcode.
Macrofossils – Preserved plant remains large enough to be seen without a microscope.
matK – Maturase K is a gene found in the chloroplast genome.
Meta-phylogeography – Study of phylogeographic features and intraspecies variation.
Multiplexing – Parallel amplification of barcodes in one PCR reaction.
OTU – Operational taxonomic unit. The term is used to categorise clusters of similar sequences.
Overhangs – Stretch of unpaired nucleotides at the end of DNA fragments.
PCR – Polymerase chain reaction.
PCR stochasticity – Uneven amplification of molecules during PCR that can be a result of some sequences being present in lower copy numbers than others.
Phylogeography – The study of the origin of genetic variation within closely related species across a landscape.
Primers – A short single-stranded nucleic acid sequence that serves as a starting point for the DNA replication in the PCR.
Primer set – A pair of primers (see Primers) complementary to the 5’ and 3’ flanking regions of a locus.
Primer bias – Differences in DNA amplification due to a primer inefficiently binding to the target template. This can result from sequence divergence in the primer binding sites.
qPCR – Polymerase chain reaction used for quantifying DNA.
rbcL – The ribulose-1,5-bisphosphate carboxylase large subunit gene is found in the chloroplast genome.
Singletons – A sequence only present in one copy.
Nucleotide tags – Short nucleotide sequences added at the 5’ end of the primer in metabarcoding studies.
Tag jumps – Generation of amplicons with different tags than originally used, resulting in false positives in the data. For more detail see (
Taxa – Plural of taxon. A taxon is a group of organisms that form a taxonomic group.
Taxonomic assignment – Matching the obtained sequences to taxa names.
trnH-psbA – An intergenic spacer region found in the chloroplast genome.
trnL – The trnL gene is part of the trnL-F region of the chloroplast genome.
Metagenomics is the study of genetic material recovered directly from environmental samples such as air, water, soil, or sediments (
The term ‘metagenome’ was first coined in 1998 by Handelsman et al. (
With the development of high-throughput sequencing (HTS) technologies, the need for cloning to increase the amount of starting material was eliminated. An early study recovered the first near-complete genomes of five dominant members of a natural acidophilic biofilm using an insert plasmid library and shotgun sequencing (
The immense amount of data collected by these methods introduced challenges in data analysis, resulting in several innovations in comparative metagenomics such as clustering orthologs (
Similar to metabarcoding, substrates that can be used for metagenomics in plant identification include environmental samples, fragmented template materials (e.g., dental calculus and faeces) (
Several promising applications exist for plant-related metagenomics as compared to conventional targeted genomic approaches. Dietary studies are one such application. While dietary studies have been revolutionised by conventional metabarcoding (see Chapter 11 Amplicon metabarcoding;
Similar to metabarcoding (see Chapter 11 Amplicon metabarcoding), metagenomics can potentially be used to reconstruct plant compositions from bulk arthropods samples, and to solve crimes in forensic genetics (see Chapter 26 Forensic genetics, botany, and palynology), especially by uncovering taxa that are not normally amplified in metabarcoding studies. It can also potentially be applied to plant resources for the retrieval of plant population genetic information from mixed templates (which has already been shown in mammals;
Metagenomics is an untargeted method that captures all genetic material in a sample, which is advantageous over targeted methods as no prior knowledge of the taxa and their genes is required (
Metagenomics does, however, come with some disadvantages that need to be considered. The main downside is the taxonomic inefficiency of the method. Sequenced material can originate from any part of the genome, but full nuclear genome references for most species are currently lacking. Thus, only a small proportion of species can currently be identified (
DNA fragmentation is an essential step in the metagenomic workflow, and the size of the DNA fragments required depends on the sequencing platform used. Broadly speaking, there are two methods for DNA fragmentation to obtain size-controlled DNA fragments: enzyme-based and mechanical. Each method has its associated advantages and disadvantages (
Library preparation is another important step in the metagenomics workflow as it can affect the results of the sequencing output. The addition of adapters to the ends of DNA fragments lets them bind to the sequencing flow cell, which allows for the identification of the reads (
DNA sequencing has gradually shifted from Sanger to HTS technologies in the last decades. These new sequencing technologies can provide much higher yields of reads at a much lower cost (see Chapter 9 Sequencing platforms and data types). Initially, 454/Roche pyrosequencing (discontinued) was the most widely used platform (
Short reads are bioinformatically challenging for metagenomic assembly because genes and chromosomal regions can be difficult to span, especially if they are long or composed of repetitive elements. Certain protocols have been developed to overcome such challenges (e.g., assembly after binning and taxonomic assignment), but long-read sequencing technologies offer excellent alternatives for metagenomics. PacBio and Oxford Nanopore technologies offer longer read lengths but can be accompanied by higher error rates and higher costs. In contrast to the other platforms which introduce inherent systematic errors (e.g., homopolymer regions, index hopping), errors in these platforms are mostly random, which might be overcome with technological improvements (
The number of reads required to characterise a sample effectively using metagenomics is highly variable, so no single figure applies universally. In principle, making such an estimate requires knowing the total number of species in the sample, their genome sizes, and the relative abundance of each species. As a rule of thumb, it is suggested to maximise the output to capture as many reads as possible from the rare members of the community (
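The dependence on genome size and relative abundance can be illustrated with a back-of-the-envelope calculation. The Python sketch below is a simplification, not a published method: it assumes that reads are drawn at random from the pooled DNA, so that each species' share of reads is proportional to its number of genome copies times its genome size, and it ignores extraction, library, and sequencing biases. The function names and the two-species community are hypothetical.

```python
def expected_coverage(total_reads, read_length, genome_sizes, genome_copies):
    """Estimate the mean coverage per species in a mixed sample.

    Assumption: every read is sampled at random from the pooled DNA, so a
    species' share of reads equals its share of the total DNA
    (genome copies x genome size).
    """
    pooled = {sp: genome_copies[sp] * genome_sizes[sp] for sp in genome_sizes}
    total_dna = sum(pooled.values())
    coverage = {}
    for sp, dna in pooled.items():
        reads_for_species = total_reads * dna / total_dna
        # Coverage = sequenced bases for this species / its genome size.
        coverage[sp] = reads_for_species * read_length / genome_sizes[sp]
    return coverage


def reads_needed(target_coverage, species, read_length, genome_sizes, genome_copies):
    """Total reads needed for one species to reach a target mean coverage."""
    total_dna = sum(genome_copies[sp] * genome_sizes[sp] for sp in genome_sizes)
    return target_coverage * total_dna / (genome_copies[species] * read_length)


# Hypothetical community: a dominant species and a rare one (1 genome copy in 100).
genome_sizes = {"common_grass": 1_000_000_000, "rare_herb": 500_000_000}
genome_copies = {"common_grass": 99, "rare_herb": 1}

coverage = expected_coverage(100_000_000, 150, genome_sizes, genome_copies)
for species, depth in coverage.items():
    print(f"{species}: {depth:.2f}x")
```

In this toy community, 100 million reads of 150 bp give the rare species roughly 0.15x coverage against roughly 15x for the dominant one, which illustrates why sequencing output is usually maximised when rare members of the community matter.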
There are currently two main strategies to identify the contents of a metagenomic sample: identification of individual reads by alignment to a reference, or by assembling the reads into longer contigs prior to identification.
The most straightforward method for identification is by aligning the reads to a known reference dataset. BLAST and related tools such as MegaBLAST (
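A read often aligns equally well to reference sequences from several taxa. One common way to handle such ambiguity is to assign the read to the last common ancestor (LCA) of all taxa it matches. The Python sketch below illustrates the idea under simplifying assumptions: each hit is represented by a root-to-tip lineage with the same ranks in the same order, and the lineages are typed in by hand rather than looked up in a taxonomy database. The function name is hypothetical and does not correspond to any particular tool.

```python
def lca_lineage(lineages):
    """Return the shared root-to-tip prefix of a list of lineages.

    The last element of the result is the last common ancestor;
    an empty list means the lineages share no rank at all.
    """
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the lineages diverge at this rank
        shared.append(ranks[0])
    return shared


# Hypothetical hits for one read (kingdom, order, family, genus).
hits = [
    ["Plantae", "Rosales", "Rosaceae", "Rosa"],
    ["Plantae", "Rosales", "Rosaceae", "Prunus"],
]
print(lca_lineage(hits)[-1])  # the read is assigned at family level

# Adding a hit from another order pushes the assignment further up the tree.
hits.append(["Plantae", "Fabales", "Fabaceae", "Trifolium"])
print(lca_lineage(hits)[-1])
```

The trade-off is visible in the example: the more distantly related the matching references, the higher and less informative the rank at which the read is assigned.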
Two alternative approaches aim to speed up the identification of metagenomic datasets. These either use more compressed reference databases in combination with more efficient aligners or rely on exact alignments of k-mers between the reads and the reference (
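The exact k-mer matching idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any existing classifier: it stores k-mers in an ordinary in-memory dictionary, ignores reverse complements and sequencing errors, and treats a tie between taxa as unclassified. The function names, the toy reference sequences, and the choice of k = 5 are assumptions made for the example.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map every reference k-mer to the set of taxa that contain it."""
    index = defaultdict(set)
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon sharing the most exact k-mers with it.

    Returns None when no k-mer matches or when the best two taxa tie.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: equal support for two taxa
    return ranked[0][0]


# Toy reference set; real references would be whole genomes or marker regions.
references = {
    "taxon_A": "ACGTACGTTTGACCA",
    "taxon_B": "GGCTTAACCGGATTC",
}
index = build_index(references, k=5)
print(classify("CGTTTGAC", index, k=5))
print(classify("TTTTTTTT", index, k=5))
```

Because every lookup is an exact dictionary match, no alignment is computed, which is where the speed gain over alignment-based searches comes from. The cost is sensitivity: a single mismatch removes up to k matching k-mers from a read.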
Assembly methods attempt to generate longer contigs before downstream analysis. These longer contigs can be used for gene identifications (
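A common basis for de novo assembly is the De Bruijn graph. The Python sketch below is a teaching example only, far simpler than a production assembler. It assumes error-free reads from a single strand, builds the graph with (k-1)-mers as nodes, and extends a contig greedily until it reaches a branch or a dead end. The function names and the three toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph: each k-mer links its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_starts(graph):
    """Nodes without incoming edges are natural starting points for contigs."""
    targets = {node for successors in graph.values() for node in successors}
    return sorted(node for node in graph if node not in targets)


def walk_contig(graph, start):
    """Extend a contig while the current node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break  # stop instead of looping around a cycle
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn(reads, k=4)
contigs = [walk_contig(graph, start) for start in find_starts(graph)]
print(contigs)
```

With real metagenomic data, repeats, sequencing errors, and uneven abundance create branches in the graph, and walks like this one terminate early. This is one reason why assembly performs best when abundant material is available.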
Each bioinformatic strategy has its pros and cons, and the choice of strategy depends on the starting material available as well as the research questions to be answered. The alignment method works well when there is ample reference material available for the taxa of interest, when working with older and more fragmented material, or when the target taxa are sparse in a sample. The assembly method, on the other hand, performs best when there is abundant material available, which is often not the case for environmental datasets.
As sequencing costs continue to significantly decrease, bioinformatics pipelines are optimised, and more comprehensive DNA reference libraries are available (
Basic Local Alignment Search Tool (BLAST) – An alignment tool commonly used in conjunction with the NCBI nucleotide reference database for sequence identifications. Different BLAST versions exist for nucleotide or protein alignments.
Binning – Clustering sequences based on their nucleotide composition or similarity to a reference database.
Burrows-Wheeler transform – A reversible data transformation algorithm that rearranges data to make it more compressible.
Community genetics – Study of genetic interactions between species and their environment in complex communities.
Contigs – Longer contiguous DNA sequences assembled from overlapping reads.
Coverage – The mean number of times a nucleotide is sequenced in a genome.
De Bruijn graphs – A popular method for the de novo assembly of contigs. The graph is built up out of k-mers that overlap, which can be solved to construct contigs.
De novo assembly – The assembly of contigs or genomes from sequenced data without the aid of a reference.
DNA fragmentation – Separating or breaking DNA molecules into smaller fragments.
DNA libraries – Collections of DNA fragments with sequencing-platform-specific adapters ligated to both ends.
Ecogenomics – Study of the influence of environmental factors on the genome.
Environmental genomics – Prediction of organism responses at the genetic level.
FM-index – A compressed data structure for full-text pattern searching based on the Burrows-Wheeler transform.
Functional metagenomics – Study of gene functions from DNA extracted from mixed communities.
Hydrodynamic shearing – Fragmentation of DNA molecules by forcing them through a small tube or small gauge needle at high velocity.
K-mer – A short subsequence of length k that is generated from longer sequencing reads. The shorter k-mers allow for faster alignments and assemblies.
Last Common Ancestor (LCA) – The most recent point on the tree of life from which all members of a set of taxa are descended.
MegaBLAST – A faster, though less sensitive, version of the BLAST tool, optimised for aligning highly similar sequences.
Metagenome – All genetic material found in an environmental sample. It contains the genomes of many different organisms.
Nebulisation – Process of breaking DNA molecules into small fragments by passing DNA solution into a nebuliser unit, resulting in a fine mist that is collected.
Orthologs – Genes in different species that evolved from a common ancestral gene.
Paired-end sequencing – Sequencing of a DNA fragment from both ends. Both sequences can either be merged into a single larger fragment, if overlap is present, or kept separate.
Read – A DNA sequence generated by a sequencer.
Shotgun sequencing – A technique that randomly fragments DNA, sequences the fragments, and then reassembles the resulting reads by searching for overlapping regions.
Sonication – Application of sound energy to break up DNA strands into smaller fragments.