Green plants, known as Viridiplantae, comprise around 500,000 species and represent an abundance of diversity. The clade originated at least 750 million years ago and displays great heterogeneity, so understanding its phylogenetic relationships has proven to be a significant challenge, one compounded by the extinction of several major lineages. In an attempt to resolve some of the major questions in this field, researchers have looked to nuclear, mitochondrial and plastid DNA. In recent years, next-generation sequencing has rapidly increased the number of complete plastid genomes. With this wealth of data now available, Brad Ruhfel from Eastern Kentucky University, USA, and colleagues sought to deduce a comprehensive green plant phylogeny based on the plastid genome and to explore some of the major relationships across this clade, as published in their recent study in BMC Evolutionary Biology. Here Ruhfel explains the benefits and limitations of using plastid sequence data and what new insights their study uncovered.
What are the benefits of using plastid sequences as opposed to nuclear or mitochondrial sequences?
Plastid sequences have been the mainstay of plant phylogenetics since the mid-1980s for several reasons. First, the plastid genome is present in high copy numbers, making the DNA easy to obtain, particularly when compared to single-copy nuclear genes. Second, genes in the plastid genome are relatively easy to align across all green plants. Third, the plastid genome is highly conserved in terms of structure and gene evolution. Finally, horizontal gene transfer and gene duplication can cause problems when using data from the mitochondrial or nuclear genomes; these issues are rare when using plastid genome data. In short, plastid data are easier to obtain and analyze.
How is your study different from previous plastid phylogenomic analyses of green plants?
Previous studies using plastid genome data to examine the relationships of all green plants have had much poorer taxon sampling (i.e. they sampled fewer species). These studies typically used somewhere between 40 and 90 species, while here we used 360. This expanded taxon sampling allowed us to address major relationships across all green plants simultaneously. Our study is also unique regarding the depth of data exploration and analysis that we conducted.
We thoroughly explored our data by: i) using several character coding protocols (nucleotides, amino acids, and RY-coding), ii) analyzing various subsets of the data (e.g. first and second codon positions, third positions only), iii) examining base composition bias, and finally, iv) exploring several different partitioning strategies and statistically determining which was most appropriate for these data sets. This rigorous approach allowed us to determine which relationships in the phylogeny were robust to various analytical approaches and which were not.
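To make two of these protocols concrete, here is a minimal sketch of RY-coding and codon-position subsetting. It is an illustration only, not the pipeline used in the study: the function names `ry_code` and `codon_positions` and the toy sequence are hypothetical, and the sequence is assumed to be an in-frame coding region.

```python
from typing import Iterable

# RY-coding collapses the four nucleotides into two classes:
# purines (A, G) become R and pyrimidines (C, T) become Y.
# Any other character (gaps, ambiguity codes) is left unchanged.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def ry_code(seq: str) -> str:
    """Return the RY-coded version of a nucleotide sequence."""
    return seq.upper().translate(RY_MAP)


def codon_positions(seq: str, positions: Iterable[int]) -> str:
    """Keep only the requested codon positions (1, 2 or 3).

    Assumes `seq` starts at the first base of a codon.
    """
    wanted = set(positions)
    return "".join(base for i, base in enumerate(seq) if (i % 3) + 1 in wanted)


toy = "ATGGCCTTA"  # hypothetical three-codon sequence: ATG GCC TTA
ry_version = ry_code(toy)                    # "RYRRYYYYR"
first_second = codon_positions(toy, (1, 2))  # "ATGCTT"
third_only = codon_positions(toy, (3,))      # "GCA"
```

RY-coding discards the difference between A and G and between C and T, which is one way to reduce the influence of base composition bias at the cost of some signal.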
Did you find any unexpected results and/or results that conflicted with earlier studies?
The finding that Zygnematophyceae, a clade of green algae, is sister to land plants is very interesting. Most textbooks state that another group of green algae, Chara and its relatives, occupies that position. However, to me the most interesting results were those areas of the green plant phylogeny that were in strong conflict across our various analyses. For instance, the main view of early land plant relationships taught in textbooks is that liverworts are sister to all other land plants, followed by mosses, and hornworts are sister to all vascular plants. This set of relationships is well supported in two of our four analyses, but mosses and liverworts are sister groups in the other two analyses. These various bryophyte relationships have been reported before, but it is surprising that with whole plastid genomes the problem still persists.
This is not an isolated incident; several examples of this type of problem occur throughout our various trees. Without analyzing the data in several different ways, we may not have realized that these various areas of the topology are not robust. Which relationships are correct? Why is the same data set coded in different ways giving us such different results? These types of questions are very exciting and point the way for future research.
What systematic errors did you identify in previous analyses, and what can we learn from this?
We did not examine the analyses of previously published works. However, in our analyses it seems that base composition bias is present and that highly complicated partitioning schemes are needed to account for the heterogeneous patterns of molecular evolution in the plastid genome data sets.
Base composition bias may have affected the phylogenetic placement of some taxa. For example, in the analysis that included all nucleotide positions, the monilophyte clade (ferns and relatives) was placed as sister to lycophytes plus seed plants, a placement at odds with several other lines of evidence. When steps were taken to account for base composition bias, monilophytes were placed as sister to seed plants as expected. Phylogenetic studies using plastid genome data often do not examine base composition homogeneity, though it is a basic assumption in our analyses. Hopefully such checks will become common practice in future studies.
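A first look at base composition can be as simple as comparing nucleotide frequencies across taxa. The sketch below is an assumption-laden illustration rather than the test used in the study: the helper names and the three toy sequences are hypothetical, and a real analysis would normally follow up with a formal homogeneity test.

```python
from collections import Counter
from typing import Dict


def base_frequencies(seq: str) -> Dict[str, float]:
    """Frequency of A, C, G and T, ignoring gaps and ambiguity codes."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases that are G or C."""
    freqs = base_frequencies(seq)
    return freqs["G"] + freqs["C"]


# Hypothetical aligned sequences, one per taxon.
alignment = {
    "taxon_a": "ATGCATGC",
    "taxon_b": "GGCCGGCA",
    "taxon_c": "ATATATGC",
}

gc_by_taxon = {name: gc_content(seq) for name, seq in alignment.items()}
# A large spread between the lowest and highest values suggests that
# composition is not homogeneous across taxa.
gc_spread = max(gc_by_taxon.values()) - min(gc_by_taxon.values())
```

In this toy example the GC content ranges from 0.25 to 0.875, the kind of spread that would call the homogeneity assumption into question.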
We also see evidence of highly heterogeneous patterns of sequence evolution in the data set as our model-fitting experiments always favor the most parameter-rich models. Identifying partitioning schemes and models that can accurately reflect the true processes of molecular evolution while avoiding over-parameterization may be important when analyzing similar data sets. Some previous studies have not statistically explored the best partitioning strategy for their data and have used standard strategies such as partitioning by gene or codon position. This and other studies are now showing that these standard strategies may not be the best choice when analyzing large plastid genome data sets.
In addressing the systematic errors highlighted in your study, do you think plastid sequences will be able to resolve all of the intricacies of plant phylogeny?
There are very likely relationships that plastid genome sequence data will not be able to resolve. However, this can also be said for any type of data: nuclear, mitochondrial or morphological. It is important that we pursue all lines of available evidence. Future analyses of plastid data should increase taxon sampling. Remember that we have reconstructed what is perhaps over one billion years of evolution, using only 360 representatives of a clade that contains about 500,000 species. Then of course there is the problem of the countless lineages that have gone extinct that we will never be able to sample. Analyses of mitochondrial and nuclear genomes at this same scale are needed and are currently being investigated. It is very likely that phylogenies from these three data sources will not agree in some areas. But those areas of disagreement will either point us towards some very interesting biological phenomena or allow us to develop better analytical methods.
Where should we direct future efforts to get a more complete picture of plant phylogeny?
First, we need more data from more taxa. Several groups of organisms are very poorly represented in our plastid genome data set. For example, only two moss species were included here, while there are thought to be around 10,000 species. As I mentioned above, for this study we analyzed plastid genome sequence data from 360 taxa from a clade that may contain over 500,000 species. This is only about 0.07 percent of green plant diversity! Similarly sized data sets should also be assembled and explored using mitochondrial and nuclear data. Integrating fossils is also extremely important. We may never be able to get DNA from fossil taxa, but combined analyses of morphological and molecular data may allow us to better place taxa with no molecular data available. Efforts should also be focused on developing better models of evolution.
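The sampling figures quoted above can be checked with a few lines of arithmetic. This sketch uses only the numbers given in the interview (360 of roughly 500,000 green plant species, and 2 of roughly 10,000 mosses); the function name is hypothetical.

```python
def percent_sampled(sampled: int, total: int) -> float:
    """Percentage of a clade's species represented in a data set."""
    return 100 * sampled / total


green_plants = percent_sampled(360, 500_000)  # about 0.07 percent
mosses = percent_sampled(2, 10_000)           # about 0.02 percent
```

By these estimates, mosses are sampled even more sparsely than green plants as a whole.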
As more plant genome sequences become available, do you think the overall picture of plant phylogeny is set to change or is it now a matter of filling in the details?
One of my favorite ideas is that science is a permanent revolution. We will likely never know the real truth in its entirety, but we must keep striving to find it so as to better understand our world. As we gain a better understanding of the molecular evolution of plant genomes, there may be some real surprises around the corner. However, many relationships in the green plant tree of life agree across multiple analyses of both molecular and morphological data as we state in the article. We have made great progress in determining the evolutionary history of green plants but there is still much to do.