Sunday, December 20, 2015

Bioinformatics (3): Databases, jargon, acronyms, tools, emerging areas, omics...

#######################Databases################
BioCyc: Collection of organism-specific pathway/genome databases
COG: Clusters of Orthologous Groups; a protein database generated by comparing predicted and known proteins across completely sequenced genomes
DAVID: Database for Annotation, Visualization and Integrated Discovery
DDBJ: DNA Data Bank of Japan (nucleotide sequence database)
DEG: Database of essential genes
DOOR: Database of Prokaryotic operons
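EggNOG: Orthologous groups of genes with functional annotation
EMBL: European Molecular Biology Laboratory nucleotide sequence database
ENCODE: Encyclopedia of DNA Elements (an annotation database)
Ensembl: Vertebrate and other eukaryote genomes
GenBank: NCBI's annotated collection of publicly available nucleotide sequences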
GEO: Gene Expression Omnibus (repository of high throughput gene expression data and hybridization arrays, chips, microarrays)
GO: Gene Ontology
GOLD: Genomes OnLine Database (genome and metagenome sequencing projects)
HMDB: Human Metabolome Database 
IMG: Integrated Microbial Genomes comparative analysis system 
KEGG: Kyoto Encyclopedia of Genes and Genomes
MetaCyc: Metabolic pathways
MG-RAST: Metagenomic samples
miRBase: the microRNA database (http://www.mirbase.org/)
nr: non-redundant (Protein search ) (All non-redundant GenBank CDS translations)
OMIM: Online Mendelian Inheritance in Man
Panther: Protein ANalysis THrough Evolutionary Relationships
PATRIC: Bacterial bioinformatics
PDB: Protein Data Bank
Pfam: Protein families
PIR: Protein Information Resource
PRF: Protein Research Foundation
Rebase: Restriction enzymes,  DNA methyltransferases
RefSeq: Reference Sequence
SRA: Sequence Read Archive (formerly Short Read Archive; stores raw sequencing reads)
Swiss-Prot: Manually curated protein sequence database
###########################Bioinformatics acronyms#################################
ANN: Artificial neural networks
ASD: Autism Spectrum Disorders
BAC: Bacterial Artificial Chromosome
BAM: Binary Alignment Map
BLASR: Basic Local Alignment with Successive Refinement (a mapper)
BLAST: Basic Local Alignment Search Tool
BUSCO: Benchmarking Universal Single Copy Orthologs
CDS: Coding Sequence
COG: Clusters of Orthologous Groups

CRISPR: Clustered regularly interspaced short palindromic repeats

DCA: DNA Composition Analysis
DDBJ: DNA Data Bank of Japan
EMBL: European Molecular Biology Laboratory
EST: Expressed Sequence Tag
FBA: Flux balance analysis 
fMRI: Functional Magnetic Resonance Imaging (for brain imaging)
GATK: Genome Analysis Toolkit
GEE: Generalized estimating equations 
GML: Graph Modelling Language
GRN: Genetic Regulatory Network 
GSS: Genome Survey Sequences
GWAS: Genome-wide association studies (e.g. association between a SNP and each phenotype)
HDF: Hierarchical Data Format (sequencers generate high throughput data in this format)
HSP: Heat Shock Protein (in BLAST output, High-scoring Segment Pair)
HGAP: Hierarchical Genome Assembly Process
HMP: Human Microbiome Project

HGP: Human Genome Project
HTGS: High-Throughput Genome Sequence
MG-RAST: Metagenomics Rapid Annotation using Subsystem Technology
MANOVA: Multivariate analysis of variance
NCBI: National Center for Biotechnology Information
NIPT: Non-invasive Prenatal Testing
ORF: Open Reading Frame
QTL: Quantitative Trait Loci 
RAD: Restriction-site Associated DNA
RFLP: Restriction fragment length polymorphism (exploits variations in homologous DNA sequences)
RMSD: Root Mean Square Deviation
SAM: Sequence Alignment Map
SMRT: Single Molecule Real Time
SMS: Single Molecule Sequencing
TFBS: Transcription factor binding sites
TMHMM: Transmembrane Hidden Markov Model (membrane protein topology prediction)
TSS: Transcription start site
TU: Transcription unit
VCF: Variant Call Format
WGS: Whole Genome Sequencing
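
Of the acronyms above, RMSD is the one that is really a formula: the square root of the mean squared distance between matched atoms of two structures. A minimal Python sketch, assuming the two coordinate lists are already superimposed and in the same atom order (the function name rmsd is just for illustration; no optimal superposition is attempted):

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances over all matched atom pairs
    total = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(total / len(coords_a))

structure_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(structure_1, structure_2))
```

In practice the structures are usually superimposed first (structure viewers and modelling packages do this for you), so treat this only as the last step of the calculation.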
##########Bioinformatics  jargon###########
Bioinformatics is a multidisciplinary field, so it is full of jargon. Some key terms are listed below.
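Annotation: Attaching biological information (metadata) to sequence data
Base calling: Identifying the nucleotide base at each position from the raw sequencer signal
Bit score: Raw alignment score normalized so that scores from different searches can be compared
Bootstrap: Any test that relies on random resampling with replacement
Built-in functions: Functions that come with the standard package
Canonical: Conventional
Central dogma: Transcription uses DNA to make RNA, and then translation uses RNA to make proteins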
Clonal expansion: Progeny or daughter cells arising from single parent cell
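Cluster computing: Parallel computing across many nodes with load distribution (e.g. monitored with Ganglia)
Consensus sequence: Calculated order of the most frequent residue at each position of an alignment
Coverage depth: Sequencing depth, i.e. the number of times a nucleotide is read
Deterministic: When the returned result is always the same for the same input
Epigenetics: Study of heritable changes in gene expression that do not involve changes in the DNA sequence (the complete set of such modifications on a genome is the epigenome)
E-value: Number of hits with a score at least this good that are expected by chance (values near 0 are best; a lower value means a more significant match)
Frameshift: An indel whose length is not a multiple of three, which changes the reading frame of the ORF
GC bias: Dependence between GC content and coverage depth of a genomic region
Genotyping: Determining which genetic variants an individual or isolate carries (often used for lineage determination)
Horizontal transfer: Transfer of genes between organisms other than from parent to offspring
Inversion: Sequence segment reversed in orientation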
In vitro passage: Changes in the cells or microbes due to serial culturing
Isoform: Splice variants
Ka/Ks: Ratio of the non-synonymous to the synonymous substitution rate
Moonlighting protein: A protein that performs more than one function
Non-synonymous substitution: A base change that changes the encoded amino acid
Promoter: Region upstream of the start position (here, about 100 bases)
Pleiotropy: When one gene influences more than one phenotypic trait
Pulling: Extracting
Quality trimming: Removal of low-quality (likely wrongly called) bases from reads
Redundancy of the genetic code: Multiple codons for one amino acid
Synteny: Conservation of gene order between genomic regions derived from one ancestral genomic region
Translocation: Sequence moved to another position
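
Two of the terms above, consensus sequence and coverage depth, are easier to see in code than in words. The following is a minimal Python sketch, not how any particular tool does it: the function names and the toy inputs are made up for illustration, the sequences are assumed to be already aligned to equal length, and read placements are assumed to be 0-based, half-open (start, end) intervals on the reference.

```python
from collections import Counter

def consensus(aligned_seqs):
    """Most frequent residue in each column of equal-length aligned sequences.

    Ties are broken in favour of the residue seen first in that column.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )

def coverage_depth(ref_length, read_intervals):
    """Per-base depth: how many reads cover each reference position."""
    depth = [0] * ref_length
    for start, end in read_intervals:
        # Clip each read to the reference so out-of-range ends are ignored
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth

print(consensus(["ACGT", "ACGA", "TCGT"]))            # ACGT
print(coverage_depth(6, [(0, 4), (2, 6), (3, 5)]))    # [1, 1, 2, 3, 2, 1]
```

Real pipelines compute depth from the alignments stored in a BAM file rather than from hand-written intervals, but the counting idea is the same.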

File formats

aln: Alignment
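Axt: Pairwise alignment format (UCSC)
BAM: Binary Alignment Map (compressed binary version of SAM)
BED: Browser Extensible Data (up to 12 columns, of which the first 3 are required; the 11th is blockSizes, the 12th is blockStarts)
bigWig: Indexed binary format for dense, continuous data (binary form of WIG)
Chain: Pairwise alignment format that allows gaps in both sequences (UCSC)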
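Fasta: Text file in which each sequence record starts with a header line beginning with the > symbol
fna: FASTA file of nucleic acid sequences
gbk: GenBank flat file with detailed annotation (NCBI format)
genePred: Table format for gene predictions (UCSC)
GFF: General Feature Format (consists of one line per feature)
hmm: Profile hidden Markov model file (has position-specific scores)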
jnlp: Java Network Launch Protocol file that needs to be run with Java Web Start
json: JavaScript Object Notation
md: Short for Markdown, a plain-text markup format (Markdown editors show a preview of the edit)
Microarray:
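Net: Hierarchical arrangement of chained alignments (UCSC)
pileup: Per-position summary of aligned bases (generated by SAMtools; SNPs are extracted from it)
PLINK: For genotypic analysis
SAM: Sequence Alignment Map
vCard: Simple text file format for contact information
VCF: Variant Call Format (8 fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO; followed by FORMAT and one column per sample when genotypes are present)
WIG: Wiggle Track Format (compact)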
XML: Extensible Markup Language
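
To see what two of these formats look like in practice, here is a minimal Python sketch that reads FASTA text and pulls the block coordinates out of a 12-column BED line. The function names and example records are made up for illustration. The sketch assumes well-formed input, and that blockStarts are offsets relative to the start coordinate in column 2, as in the UCSC BED description.

```python
def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word after '>' as the record name
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def bed12_blocks(line):
    """Return absolute (start, end) pairs for the blocks of a BED12 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom_start = int(fields[1])
    sizes = [int(x) for x in fields[10].split(",") if x]    # 11th column: blockSizes
    starts = [int(x) for x in fields[11].split(",") if x]   # 12th column: blockStarts
    return [(chrom_start + s, chrom_start + s + size) for s, size in zip(starts, sizes)]

fasta_text = ">seq1 example record\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(fasta_text))

bed_line = "\t".join(
    ["chr1", "1000", "5000", "geneA", "0", "+", "1000", "5000", "0", "2", "500,1000,", "0,3000,"]
)
print(bed12_blocks(bed_line))
```

For real work a maintained parser (for example from Biopython or BEDTools) is a safer choice, since these formats have many corner cases that the sketch ignores.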
Software
Aligners (mappers): Bowtie, BWA, BBMap, SOAPaligner (these give SAM, BAM or other alignment files)
Local re-aligners (around indels): GATK (gives realigned BAM files; BQSR, listed below, recalibrates base qualities rather than realigning)
ABySS: Paired-end sequence assembler designed for short reads
ARAGORN: To find tRNA genes
BAMtools: To generate coverage files
BEDTools: To manipulate genomic interval files (BED, GFF, VCF, BAM)
Bowtie: Ultrafast short read aligner (takes reads and a reference and aligns them; the SAM output is then typically converted to BAM and sorted, e.g. with SAMtools)
BreakDancer: For genome-wide detection of structural variants
Broad's variant calling software: for pathogens, cancer
BQSR: Base Quality Score Recalibration
GATK: Genome Analysis Toolkit 
Gephi: for visualizing and analyzing large network graphs
HMMER: For sequence alignment and homolog finding using profile hidden Markov models
kallisto: For near-optimal RNA-Seq quantification
khmer: For working with DNA shotgun sequencing data from genomes, transcriptomes, metagenomes, and single cells
Mauve: Multiple genome alignment that handles large genome rearrangements
Picard: To manipulate SAM/BAM files
Prodigal: To predict CDS in prokaryotic genomes
Prokka: For de novo annotation of prokaryotic genomes
SAMtools: For manipulating SAM/BAM alignments in NGS analysis
tbl2asn: Generates sequence records for submission to GenBank
VarScan: Mutation Caller
velvet: de novo assembler to build long continuous sequences (contigs)
Web tools (interfaces)
Galaxy: Web-based platform for NGS data analysis 
MetaboAnalyst:  For metabolomic data analysis
SMART: Simple Modular Architecture Research Tool
UCSC: A genome browser
Bioinformatics Resources
Rosalind
StackExchange
SuperUser
SEQanswers
Bioconductor consortium (from R community)
Protein 3D structure prediction...........
Ab initio: QUARK. It uses Monte Carlo simulation to predict protein structure, even without an available global template.
Comparative: MODELLER, Phyre
Protein 3D structure viewer...........
PyMol
RasMol
Swiss pdb Viewer
Emerging areas.......
Comparative genomics
Regulatory genomics
Systems biology
Epigenomics
Cancer biology
Next generation sequencing
De novo genome assembly
Functional annotation
Gene prediction
Metagenomics
------------------------------------------------------------------------------------------------------------
Ome............
Exome: Protein coding parts of genes
Genome:  Complete set of DNA
Interactome: Whole set of molecular interactions
Metabolome: Entire set of small molecules
Patome: All patents
Proteome: All proteins
Reactome: All biological pathways in an organism
Transcriptome: Entire set of RNA molecules
------------------------------------------------------------------------------------------------------------------
Indels can arise through replication errors, recombination, or mobile genetic elements.
Deletions can be one bp long or span an entire gene.
Frameshifts can cause gene fusions or restore functions that were previously ablated (lost).
A nonsense mutation (introduction of a stop codon) causes ORF truncation.
Mutation of a stop codon causes ORF extension.
Coverage is calculated from the BAM file.
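
The effects of these mutations on an ORF can be checked with a few lines of Python. This is only an illustrative sketch: the toy sequence and the function name translate_orf are made up, the standard genetic code is assumed, and translation simply starts at the first base and stops at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_orf(dna):
    """Translate codon by codon from the first base until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

wild_type = "ATGGCTGAAAAATGGTAAGGTTAG"            # ATG GCT GAA AAA TGG TAA GGT TAG
nonsense = wild_type[:14] + "A" + wild_type[15:]  # TGG -> TGA: new stop codon
stop_loss = wild_type[:15] + "C" + wild_type[16:] # TAA -> CAA: stop codon lost
frameshift = wild_type[:3] + wild_type[4:]        # 1 bp deletion shifts the frame

print(translate_orf(wild_type))   # MAEKW
print(translate_orf(nonsense))    # MAEK     (ORF truncated)
print(translate_orf(stop_loss))   # MAEKWQG  (ORF extended to the next stop)
print(translate_orf(frameshift))  # MLKNGKV  (different protein after the deletion)
```

In the frameshift example the original stop codon is no longer in frame, so translation runs on into the downstream sequence, which is how a frameshift can end up fusing neighbouring genes.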


Partitioning approach for metagenome assembly:
Preprocessing includes digital normalization, knot removal, and genome separation/binning/strain extraction on raw reads.

Digital normalization and partitioning are effective methods for assembling large metagenomic datasets. The assembled data were uploaded to MG-RAST for annotation.
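
As a rough illustration of digital normalization, the idea as commonly described (for example in khmer) is to drop a read when the median abundance of its k-mers, counted over the reads kept so far, has already reached a cutoff. The Python sketch below is a simplification under stated assumptions: it uses an exact in-memory counter instead of the memory-efficient counting structure a real tool would use, and the function name, k and cutoff values are arbitrary choices for the toy data.

```python
from collections import Counter
from statistics import median

def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers is still below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)  # only kept reads contribute to the counts
    return kept

reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
print(digital_normalization(reads))
# ['ACGTACGT', 'ACGTACGT', 'TTTTGGGG']
```

Redundant copies of the highly covered read are discarded while the rare read is kept, which is what makes the downstream assembly of large metagenomic data cheaper.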
