Phylogenetics and Evolution of Orobanchaceae using publicly available data

Background:

The most species included in a phylogeny of Orobanchaceae is ~145. However, there are ~2000 descibed taxa in Orobanchaceae. Several studies focusing on smaller clades with Orobanchaceae have been published in recent years adding to what we know about the family. No one has combined the data from these studies to make a comprehensive phylogeny of Orobanchaceae. The plan is to build a new phylogeny in order to determine divergence times and reconstruct biogeographic history as well as the evolution of parasitism.

Here are several studies that have used molecular phylogenetics to add our knowledge of Orobanchaceae:

Recent studies using a super-matrix approach to ask questions about evolution in plants:

PyPHLAWD:

For this project I am using phlawd_db_maker to download sequences from genbank.

I am using pyphlawd to cluster sequences and alignments:

Double-checking the analysis:

I used phyutility to clean each of the alignments at 50% and then to concatenate the alignments. I built a quick tree with RAxML.

Then I used the PyPHLAWD script to change the NCBI taxonomy ids on the tree to genus species names:

python ../../../../src/change_ncbi_to_name_tre.py ../../Orobanchaceae_91896/Orobanchaceae_91896.table RAxML_bipartitions.12_gene_initial_NewFilF_2.0.tre RAxML_bipartitions.12_gene_initial_NewFilF_2.0.tre.cn.tre

There are several problems with the tree:

  • Extra long branches: Rhinanthus major, Castilleja elegans, Castilleja unalaschcensis, etc.
  • A lot of subspecies are included as tips in this tree. I want a single tip for every species.

In order to correct these issues among other things I want to edit the alignments gene by gene I converted the fasta headers of NCBI ids to the taxon names using this script:

change_names_in_fasta.py. This script uses parts of the PyPHLAWD change_ncbi_to_name_tre.py script but changes names in fasta files. This way the fasta files are readable by humans.

I used several criteria for remove sequences from the alignments:

  1. Erroneous sequences that caused extra long branches. Often these are caused by short sequences that are missing a lot of info.
  2. I removed or renamed sequences from subspecific taxa of species that were already represented in gene.

    For example:

    Cistanche_phelypaea is represented in rps2, matK, and ITS. Cistanche phelypaea subsp. phelypaea is only represented in rps2 and ITS. For the species Cistanche_phelypaea I deleted sequences identified as Cistanche phelypaea subsp. phelypaea for rps2 and ITS so that only Cistanche phelypaea remain.

    Castilleja rubicundula is represented by two tips in the initial tree: C.rubicundula and C. rubicundula subsp. rubicundula.

    C. rubicundula subsp. rubicundula is represented in ETS, rps16, trnL, and ITS. C. rubicundula is represented in GBSSI, phyA, rps2, matK, and ITS. In this case I renamed C. rubicundula subsp. rubicundula to C. rubicundula in ETS, rps16, trnL. I deleted C. rubicundula subsp. rubicundula in the ITS alignment. Now C. rubicundula will only be represented by one tip in the final concatenated tree.

    Another issue is that there are duplicate taxa with different names: Orobanche multicaulis = Aphyllon cooperi

  3. I removed taxa that were showing up in the wrong place due to data distribution issues. Ex: the taxon is only represented in ITS and thus pulled into the wrong clade in the concatenated tree.

After I completed the phylogeny I wanted to summarize what we know about Orobanchaceae generally and what we know based on this new tree.

Summary of the phylogeny:

  1. Heat-map of data distribution:
Heat map of some of the species (y axes) and 12 loci (x axes)

Jupyter notebook: TBD

Non-monophyletic genera and old names assessed with MonoPhy.

How well are we representing Orobanchaceae with this tree?

Web-scraper to gather data about what is known about Orobanchaceae:

Dating the tree:

Option 1:

1.1

Using secondary calibrations from another tree we can estimate divergence times on the Orobanchaceae tree.

Secondary Calibrations

TreePL: GitHub

Congruify: Link Tutorial

  1. I used the the Schneider and Moore MAP tree to make a point estimate of divergence times.

Example R code for congruify in geiger:

  • Multiple calibrations from posterior with 500 trees

cal_trees <- read.nexus('allgenes.trees.NEXUS.trees')
class(cal_trees)
cal_trees

cal_list<-list(0)
for( i in 1:500){
  calibrations<-congruify.phylo(reference = cal_trees[[i]], target = in_oro_tre, scale = NA)
  cal_list <- c(calibrations[2], cal_list)
  #print(cal_list)
}

length(cal_list)

for( i in 1:500){
  write.treePL(in_oro_tre, cal_list[i]$calibrations, base= paste("Fu_oro_multiCalib", i,sep=""), nsites=11028, opts=list(smooth=1000, nthreads=2, opt = 5, optad = 5, optcvad = 3, thorough=TRUE))
}

  1. I used the parameters from the point estimate and 50 CV runs to inform the files for 500 random replicates from the Schneider and Moore posterior distribution of trees.

CV loop example:

for i in {1..5}; do treePL Fu_Oro_point_CV_test_file_Fu$i.infile; done

Date estimation loop example:

for i in {1..500}; do treePL Fu_oro_multiCalib$i.infile; done

Prepping output trees for summarization:

for i in {1..500}; do cat fu_multiCalib$i.dated.tre >> fu_multiCalib_sum.dated.trees; done

Extracting annotation info from summarized tree file using python script: Requirements: dendropy and pandas beast_nexus_Annot_Extractor.py note: this script still needs work to make output prettier.

Additional datasets to complement the phylogeny:

  1. Locality/Geographic info:

Retrieving data from GBIF using rgbif:

specList1 <- c('Aeginetia indica', 'Agalinis acuta', 'Agalinis aphylla', 'Agalinis aspera', 'Agalinis auriculata', 'Agalinis calycina', 'Agalinis decemloba', 'Agalinis divaricata', 'Agalinis edwardsiana', 'Agalinis fasciculata', 'Agalinis filicaulis', 'Agalinis gattingeri', 'Agalinis georgiana', 'Agalinis harperi', 'Agalinis heterophylla', 'Agalinis homalantha', 'Agalinis laxa', 'Agalinis linifolia', 'Agalinis maritima', 'Agalinis navasotensis', 'Agalinis neoscotica', 'Agalinis obtusifolia', 'Agalinis oligophylla', 'Agalinis paupercula', 'Agalinis plukenetii', 'Agalinis pulchella', 'Agalinis purpurea', 'Agalinis setacea', 'Agalinis skinneriana', 'Agalinis strictifolia', 'Agalinis tenella', 'Agalinis tenuifolia', 'Agalinis viridis')
keys <- sapply(specList1, function(x) name_suggest(x)$key[1], USE.NAMES=FALSE) #Not sure how this work but it does. Something similar to a loop through our species. 
res <- occ_search(taxonKey=keys, limit=400, fields=c('name','key', 'decimalLatitude','decimalLongitude','country','basisOfRecord','coordinateAccuracy','elevation','elevationAccuracy','year','month','day'), basisOfRecord = 'PRESERVED_SPECIMEN')
Oro_localDF1 <- rbind.fill(lapply(res, "[[", "data"))

note: This isn’t working as well as I want it to. Second strategy (with data cleaning steps) using python coming soon

  1. Phenotypes

TBD