Exploring the choppy water of coding: Bio (6): Phylogeny.............

Phylogeny, the tracing of lineages back through evolutionary time, is vital for interpreting genomes. Tracing evolutionary links matters in clinical, historical, and conservation biology, so it is important to be well versed in the robust tools and data formats.

A high bootstrap value means the node is well supported. A bootstrap value of 95% (0.95) means the node was recovered in 95 out of 100 bootstrap replicates. In a maximum likelihood tree, nodes with bootstrap values of about 70% and above are generally considered acceptable.
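The arithmetic behind a support value can be sketched in a few lines of shell. The function name `support_percent` is made up for illustration, and the 70% cut-off is only the rule of thumb mentioned above, not a fixed standard.

```sh
#!/bin/sh
# support_percent CLADE_COUNT TOTAL_REPLICATES
# Hypothetical helper: prints the bootstrap support as a percentage,
# followed by a rough label based on the ~70% rule of thumb.
support_percent() {
    awk -v k="$1" -v n="$2" 'BEGIN {
        if (n <= 0) { print "error: no replicates"; exit 1 }
        pct = 100 * k / n                       # share of replicates that recover the node
        label = (pct >= 70) ? "supported" : "weak"
        printf "%.1f%% %s\n", pct, label
    }'
}

# Node recovered in 95 of 100 replicates
support_percent 95 100
```

With 95 of 100 replicates the function reports 95.0% and labels the node as supported; with 60 of 100 it would label it weak.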

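Positive (diversifying) selection: favours new variants, so amino-acid-changing substitutions accumulate (dN/dS > 1).
Negative (purifying) selection: removes deleterious variants, so sequences stay conserved (dN/dS < 1).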

Sequence to analyze for phylogeny........
Molecular sequence (DNA, protein)
Molecular presence (RFLP, isozyme, RAPD, ISSR, AFLP)

Aligner: BLAT, GMAP
Multiple alignment: Clustal, MUSCLE, T-Coffee
Alignment refinement: Gblocks
Substitution matrices: BLOSUM, PAM, WAG, JTT, Dayhoff
Model selection: jModelTest. Models: GTR (General Time Reversible); in RAxML notation GTRCAT and GTRGAMMA (DNA), PROTGAMMAJTT (protein)
Phylogenetic analysis: MEGA, Mesquite, BioNJ, MrBayes, PAML, PAUP, PhyML, RAxML, SeaView, BEAST
PAML: Phylogenetic Analysis Using Maximum Likelihood
File formats: Phylip, RAxML, Nexus
Tools and the files they accept:

MEGA:

PAUP: nexus (.nex)

MrBayes: nexus (.nex)
PhyML, RAxML: phylip (.phy)
Others: fasta (.fa)
Phylogenetic tree view (visualization): SplitsTree, Drawtree, TreeDyn, FigTree (trees are usually exchanged in Newick format)

PHYLIP: PHYLogeny Inference Package
RAxML: Randomized Axelerated Maximum Likelihood (based on maximum likelihood)
RAxML is very popular because it is fast and produces maximum likelihood trees with good likelihood scores. It accepts PHYLIP-format files.
fasta------------>phylip------------>tree

STEP 1: Code to convert fasta sequence to phylip format (convert.sh)........
#!/bin/sh
# convert.sh converts a FASTA file to PHYLIP format.
# PHYLIP is similar to FASTA, but each sequence sits on one line after its name, and a header line gives the number of sequences and their length (n, m) before the alignment.
# The input must already be aligned: PHYLIP expects every sequence to have the same length.
# Print the usage and stop if the script is not called with exactly one argument
if [ $# -ne 1 ]; then
    echo "USAGE: ./script <fasta-file>"
    exit 1
fi

# Every FASTA record starts with ">", so counting those lines gives the number of sequences
numSpec=$(grep -c "^>" "$1")

# Turn each header into ";name<", join all lines, delete spaces, drop the leading ";", then turn "<" into the space between name and sequence
# Note: \w (a GNU sed extension) matches only letters, digits and "_", so a header such as ">|cow|" yields an empty name
tmp=$(sed "s/>[ ]*\(\w*\).*/;\1</" "$1" | tr -d "\n" | tr -d ' ' | sed 's/^;//' | tr "<" " ")
# Length of the first sequence that follows a space
length=$(($(echo $tmp | sed 's/[^ ]* \([^;]*\);.*/\1/' | wc -m) - 1))

echo "$numSpec $length"
echo $tmp | tr ";" "\n"
--------------------
data_file
>|cow|
ATCGGGGCTGCGTGAAAAAAAAATTGC
>|egret|
AGGGTCCAATGTTAACTTTCATGCGCTCG
>|turtle|
AGGTAAACCGTGAGCGGGCGGGATG
>|rabbit|
TATTGACTGACCCGGGCAATTCGTG
>|goat|
TTGAAAACCCGTGGGTGCGGGGCCCCGGG
--------------------
execution:
sh convert.sh data_file
--------------------
output
5 29
ATCGGGGCTGCGTGAAAAAAAAATTGC
AGGGTCCAATGTTAACTTTCATGCGCTCG
AGGTAAACCGTGAGCGGGCGGGATG
TATTGACTGACCCGGGCAATTCGTG
TTGAAAACCCGTGGGTGCGGGGCCCCGGG
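
The example above loses the taxon names and its sequences differ in length, which a phylogeny program is unlikely to accept. As a rough alternative, here is a minimal awk-based sketch that keeps the names and refuses input whose sequences are not all the same length. The function name `fasta2phylip` is hypothetical, and the one-line-per-taxon "relaxed" PHYLIP layout it writes is an assumption about what the downstream tool reads.

```sh
#!/bin/sh
# fasta2phylip (hypothetical helper): reads aligned FASTA on stdin,
# writes one-line-per-taxon PHYLIP on stdout.
fasta2phylip() {
    awk '
        /^>/ {                                   # header line starts a new record
            n++
            h = substr($0, 2)
            sub(/^[ \t]+/, "", h)                # drop blanks right after ">"
            sub(/[ \t].*$/, "", h)               # keep only the first word
            gsub(/[^A-Za-z0-9_]/, "", h)         # strip characters such as "|"
            name[n] = h
            next
        }
        n > 0 {                                  # sequence line: append to current record
            gsub(/[ \t\r]/, "")
            seq[n] = seq[n] $0
        }
        END {
            if (n == 0) { print "error: no sequences found" > "/dev/stderr"; exit 1 }
            len = length(seq[1])
            for (i = 2; i <= n; i++)
                if (length(seq[i]) != len) {
                    print "error: " name[i] " has length " length(seq[i]) ", expected " len > "/dev/stderr"
                    exit 1
                }
            print n, len                         # PHYLIP header: taxa and sites
            for (i = 1; i <= n; i++) print name[i], seq[i]
        }
    '
}
```

Usage would be `fasta2phylip < aligned.fa > aligned.phy`. Run on the data_file above it would stop with an error, because those sequences have not been aligned to a common length.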
############################################
STEP 2: Code to convert phylip sequence into tree........
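# -s input alignment, -n suffix for the output file names, -m substitution model, -p random seed for the parsimony starting trees, -T number of threads
# -N number of runs, or number of bootstrap replicates when combined with -b or -x
# -b random seed that turns on standard bootstrapping, -x random seed that turns on rapid bootstrapping
# -f algorithm: "-f d" is the default rapid hill-climbing ML search; "-f a" runs a rapid bootstrap analysis and a search for the best-scoring ML tree in one program run
# -c number of rate categories for the CAT models
# -m GTRGAMMA: GTR substitution model with optimized substitution rates and the GAMMA model of rate heterogeneity
# The parameters have to be tuned to the requirements of the analysis
raxml -s phylip.phy -n phylip.raxml.signalTree -m GTRCAT -f a -T 2 -p 12345 -x 1000 -N 300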

#Tree from DNA file (recent RAxML versions also expect a parsimony seed, -p)
raxml-hpc -T 8 -m GTRGAMMA -p 12345 -s file.phylip -f d -n output
raxml-hpc -T 8 -m GTRGAMMA -p 12345 -s file.phylip -x 12345 -N 500 -n output.500rbs

#Single tree from protein file
raxmlHPC -s file.phy -n file.raxml.singleTree -c 4 -f d -m PROTGAMMAJTT -p 12345

#A set of bootstrap trees from protein file
raxmlHPC -s file.phy -n file.raxml -c 4 -f d -m PROTGAMMAJTT -p 12345 -b 234534251 -N 10
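
Before starting a long run it can be worth checking that the PHYLIP header agrees with the alignment body. The following is a minimal sketch, assuming the sequential layout with one "name sequence" pair per line; the function name `check_phylip` is made up for illustration and is not part of RAxML.

```sh
#!/bin/sh
# check_phylip (hypothetical helper): reads a one-line-per-taxon PHYLIP file
# on stdin and compares the header with the actual content.
check_phylip() {
    awk '
        BEGIN { bad = 0; seen = 0 }
        NR == 1 { want_n = $1; want_len = $2; next }   # header: taxa, sites
        NF >= 2 {
            seen++
            s = $0
            sub(/^[^ \t]+[ \t]+/, "", s)               # drop the name column
            gsub(/[ \t\r]/, "", s)                     # drop blanks inside the sequence
            if (length(s) != want_len) {
                printf "line %d: %s has %d sites, header says %d\n", NR, $1, length(s), want_len
                bad = 1
            }
        }
        END {
            if (seen != want_n) {
                printf "found %d sequences, header says %d\n", seen, want_n
                bad = 1
            }
            if (!bad) print "OK"
            exit (bad ? 1 : 0)
        }
    '
}
```

A possible use is `check_phylip < file.phy && raxmlHPC ...`, so that the tree search only starts when the check prints OK.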
---------------
# Tree from a multiple sequence alignment (concatenated core genes, made by Roary)
raxmlHPC -m GTRGAMMA -p 12345 -s core_gene_alignment.aln -n NAME

# Run RAxML in bootstrap mode
raxmlHPC -m GTRGAMMA -p 12345 -s core_gene_alignment.aln -n NAME_bootstrap -f a -x 12345 -N 100 -T 12
# Results (open with Forester)
RAxML_bestTree.NAME_bootstrap - best-scoring ML tree
RAxML_bipartitions.NAME_bootstrap - best-scoring ML tree with support values
RAxML_bipartitionsBranchLabels.NAME_bootstrap - best-scoring ML tree with support values as branch labels
RAxML_bootstrap.NAME_bootstrap - all bootstrapped trees
RAxML_info.NAME_bootstrap - program info
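
To spot weakly supported nodes without opening a viewer, the support values can be pulled straight out of the Newick text. This is a rough sketch that assumes supports are written as integer node labels directly after a closing parenthesis, as in the RAxML_bipartitions file; the function name `weak_nodes` is hypothetical.

```sh
#!/bin/sh
# weak_nodes [THRESHOLD] (hypothetical helper): reads Newick text on stdin and
# prints every internal-node support value below THRESHOLD (default 70).
weak_nodes() {
    grep -oE '\)[0-9]+' |              # a ")" followed by digits is a node label
        tr -d ')' |                    # keep only the number
        awk -v t="${1:-70}" '$1 < t'   # print values under the threshold
}

# Small made-up tree: one node with support 95, one with support 60
printf '((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.1)60:0.05);\n' | weak_nodes 70
```

On a real run this would be called as `weak_nodes 70 < RAxML_bipartitions.NAME_bootstrap`. Empty output means no labelled node fell below the threshold.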

Paraphyletic taxon: includes the most recent common ancestor but not all of its descendants.
Monophyletic taxon: includes the most recent common ancestor and all of its descendants.
Probabilistic analysis of a concatenated alignment: limited by its large demands on memory and computing time.
Supertree methods: focus on the topology (structure) of the phylogenetic tree rather than on the evolutionary divergences associated with it.


Monday, December 14, 2015
