Exploring the choppy water of coding: January 2016

Friday, January 29, 2016

Bio (4): BACTERIA....

In 1995, the first bacterial genomes were sequenced

First bacterial genome to be sequenced: H. influenzae

Average protein coding content of a bacterial genome is 40 to 97 %

A typical bacterial genome is around 5 million bp, encodes about 5000 proteins

The largest genome, Sorangium cellulosum strain So0157-2, has 14,782,125 bp and contains 11,599 genes

The smallest genome, Candidatus Nasuia deltocephalinicola strain NAS-ALF, has 112,091 bp and codes for only 137 proteins

The GC content of the finished bacterial genomes ranges from a bit less than 15% to about 85%
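
GC content is the share of G and C bases in a sequence. A minimal Python sketch of the calculation follows; the function name `gc_content` and the example sequence are illustrative assumptions, not taken from any particular tool.

```python
def gc_content(seq):
    """Return the GC content of a DNA sequence as a percentage."""
    seq = seq.upper()
    # Count only unambiguous bases so that N or gap characters do not skew the result
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / acgt


print(gc_content("ATGCGGATTGCAGGT"))
```

Ignoring ambiguous bases is a design choice; some tools divide by the full sequence length instead, which gives slightly lower values for sequences containing N.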

Extremophiles: thermotolerant, psychrotolerant, and psychrotrophic bacteria

Members of a species are not necessarily “equal” or even similar, in terms of their (protein-coding) gene content

Depending on the species, the variation in gene content and genome size can be quite considerable, with some pan-genomes, like E. coli, being very “open”; other pan-genomes, such as that for Bacillus anthracis, contain very few extra genes, and can be considered “closed”.

Species can vary by more than a megabase

e.g. Haemophilus influenzae HK1212 (1.0 Mb) versus F3047 (2.0 Mb)

Burkholderia pseudomallei THE (6.3 Mb) versus MSHR520 (7.6 Mb)

Core gene families: families with at least one member in at least 95% of genomes
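
The 95% rule can be sketched in Python as below. This is only an illustration: the input layout (a dict from family name to the set of genomes carrying it), the function name `core_families` and the toy genome names are all assumptions.

```python
def core_families(presence, percent=95):
    """Return families present in at least `percent` % of genomes.

    presence maps a family name to the set of genomes that contain
    at least one member of that family (hypothetical input layout).
    """
    genomes = set()
    for members in presence.values():
        genomes.update(members)
    n = len(genomes)
    # Integer arithmetic avoids floating-point surprises at exactly 95%
    return sorted(fam for fam, members in presence.items()
                  if n and len(members) * 100 >= percent * n)


# Toy data: 20 hypothetical genomes g1..g20
all_genomes = set("g%d" % i for i in range(1, 21))
presence = {
    "famA": set(all_genomes),            # in 20/20 genomes
    "famB": all_genomes - {"g7"},        # in 19/20 genomes
    "famC": all_genomes - {"g1", "g2"},  # in 18/20 genomes
}
print(core_families(presence))
```

Here the set of genomes is taken as the union of all genomes mentioned in the input, so a genome with no gene families at all would not be counted.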

Serratia symbiotica str. Cinara cedri has a protein-coding density of 38%, is an insect co-symbiont, having 58 pseudogenes.

Redundancy arises from gene duplications

Repeat sequences and parasitic DNA that seem to bear no function to the organism.

Bacterial genomes are not always evolving towards optimal efficiency

Increased number of tRNAs and rRNAs is correlated with a faster growth rate

Insertions and deletions arise from recombination events

All bacterial genomes have at least one copy of the 23S, 16S and 5S rRNA genes.

The genetic code allows for 62 possible anticodons for tRNAs, but since these have to cover only the 20 standard amino acids, the theoretical minimum for a genome would be 20 tRNA genes.

The number of anticodons identified per genome has not exceeded 47 (out of 62 possible), and averages between 33 and 35

Bacteria control mobile elements through post-segregation killing systems

Tuesday, January 26, 2016

Shell (6): Linux/bash commands for day-to-day activity.......

#Move to another directory
cd Desktop

# Move to directory up in hierarchy
cd ..

#List directory
ls

#To determine the size of a file (good way to spot empty files before deleting them)
ls -l

#Count the number of files in a directory
ls | wc -l
ls -1 | wc -l

#Create a new file
touch file
touch a b c d

#Copy the folder and its contents from the host to local PC folder
cp -R full_source_path full_destination_path
e.g. cp -R /share/projects/data/genomes /home/pseema/Desktop/new_hypothetical_IS
cp -R annotation/ /home/pseema/Desktop

#Create directory
mkdir dir_name
mkdir dir1 dir2 dir3

#Remove directory
rm -r dir_name

#Remove file
rm file
rm a b c

#Rename file
mv file1 file2

#Rename directory
mv dir1 dir2

#Move file to a new directory
mv file ~/dir
mv *.txt ~/dir

#Check history of commands used
history

#Find files with size 0
find . -size 0

#Delete files with size 0 (empty files). Before executing this critical command, change to the specified directory
cd dir_name
find . -size 0 -delete
find path_to_directory -size 0 -delete
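
Because `-delete` is irreversible, it can help to list the empty files first. The following Python sketch does the listing step only; the function name `find_empty_files` is an illustrative assumption.

```python
import os


def find_empty_files(root):
    """Yield paths of regular files under root whose size is 0 bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip broken symlinks and other non-regular entries
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                yield path
```

Calling `list(find_empty_files("dir_name"))` returns the candidates for inspection; nothing is deleted by this sketch.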

#Find the number of files in a directory (plain ls -l adds a "total" line, which inflates the count by one)
ls -1 | wc -l

# Find files whose names match a pattern, count them with wc -l, or print their contents by passing the find output to cat
find path_name/pattern_*
e.g. find /home/pseema/hypothetical_analysis/result_files/*.only_header
find pattern_* | wc -l
cat `find pattern_*`

#To concatenate two files (one after the other, not side by side)
cat file1 file2 > file_merged
cat *.fasta > file_merged

#Remove duplicate lines, keeping the first occurrence of each (blank lines are always kept)
awk '!NF || !seen[$0]++' file

#Print isolate names n times (times the number of protein)
printf 'pattern\n%.0s' {1..5}
for i in `seq 5`; do echo "pattern";done

#Merge the two files side by side, e.g. the repeated isolate names next to the protein header names
pr -m -t file1 file2 > combined_file

#Now grep your pattern
grep 'pattern' file

#Sort file alphabetically (-u also drops duplicate lines)
sort -u file > sorted_length_file

#Sort the file to find lines in the order of maximum frequency
sort file | uniq -c | sort -n -r > outfile
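
The same frequency ranking can be sketched in Python with `collections.Counter`. The function name `line_frequencies` and the sample gene names are illustrative assumptions, and ties are broken alphabetically here, which may differ from the order the shell pipeline gives.

```python
from collections import Counter


def line_frequencies(lines):
    """Return (count, line) pairs, most frequent first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    # Sort by descending count, then alphabetically so ties come out in a stable order
    return sorted(((n, text) for text, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


sample = ["katG\n", "rpoB\n", "katG\n", "gyrA\n", "katG\n", "rpoB\n"]
for n, text in line_frequencies(sample):
    print("%7d %s" % (n, text))
```

For a real file, pass the open file handle instead of the sample list.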

#Show items common to file1 and file2 (comm expects both files to be sorted)
comm -12 file1 file2
# Find rows present only in file1, compared on the first two columns (file2 is read first, then file1)
awk 'FNR == NR { h[$1,$2]; next }; !($1 SUBSEP $2 in h)' file2 file1

#These items occur only in file1 (accessory in 1)
comm -23 file1 file2

#These items occur only in file2 (accessory in 2)
comm -13 file1 file2
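
The three `comm` comparisons map onto set operations. A small Python sketch, assuming each file has been read into a list of items (the function name `compare_items` and the gene names are hypothetical):

```python
def compare_items(items1, items2):
    """Return (common, only_in_1, only_in_2) as sorted lists."""
    set1, set2 = set(items1), set(items2)
    return (sorted(set1 & set2), sorted(set1 - set2), sorted(set2 - set1))


file1_items = ["geneA", "geneB", "geneC"]
file2_items = ["geneB", "geneC", "geneD"]
common, only1, only2 = compare_items(file1_items, file2_items)
print(common)
print(only1)
print(only2)
```

Unlike `comm`, sets do not need sorted input, but they also discard duplicate lines.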

#Count lines (emulates "wc -l")
awk 'END{print NR}' file1

#Prints all the lines with the given pattern
grep 'pattern' file

#Renaming file names
path_dir=/home/pseema/dir
gene_name="inh"
resistant_list="resistant_list.txt"
#Creates a list and outputs into a txt file
ls /share/apps/pacbio/consensus_files > /home/pseema/dir/consensus_ls.txt
#remove filenames with .fasta
sed '/\.fasta/d' /home/pseema/dir/consensus_ls.txt > /home/pseema/dir/consensus_ls_nofasta.txt
#keep only first 6 characters of each line
cat /home/pseema/dir/consensus_ls_nofasta.txt | cut -c 1-6 > /home/pseema/dir/isolate_id.txt
#replace each dot with a hyphen in each line (the dot must be escaped, otherwise it matches any character)
sed 's/\./-/g' /home/pseema/dir/isolate_id.txt > /home/pseema/dir/id_corrected.txt
#concatenation (side by side, tab-separated)
paste -d"\t" /home/pseema/dir/id_corrected.txt /home/pseema/dir/consensus_ls_nofasta.txt > /home/pseema/dir/merged.txt

#Keep only the first file for each isolate ($1 is the isolate name)
awk '{a[$1]++}!(a[$1]-1)' /home/pseema/dir/merged.txt > /home/pseema/dir/unique_merged.txt
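
The awk one-liner prints a line only the first time its first field is seen. A Python sketch of the same idea follows; the function name `first_per_key` and the example rows are made up for illustration.

```python
def first_per_key(lines):
    """Keep only the first line seen for each value of the first field."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()
        key = fields[0] if fields else ""
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


rows = ["1-0012\t1.0012_a", "1-0012\t1.0012_b", "2-0034\t2.0034_a"]
print(first_per_key(rows))
```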

Tuesday, January 19, 2016

Bio (8): Nucleases, methyltransferases..........

Enzymes fall into six classes: hydrolases, transferases, lyases, oxidoreductases, isomerases and ligases. Examples of each class are hydrolases (peptidase, lipase, esterase, phosphatase, dehalogenase, deacylase), transferases (methyl, acyl, amino (transaminase), adenosyl, rhamnosyl, sulfuryl (sulfurase); kinase; polymerase), lyases (dehydratase, decarboxylase, cyclase), oxidoreductases (peroxidase, hydroxylase, oxidase, reductase, dehydrogenase, dioxygenase), isomerases (racemase, epimerase, mutase) and ligases (synthase).

#Enzymes often have domains belonging to multiple classes, making things complicated.

#As enzymes are pH and temperature-sensitive, they often undergo changes and become hypothetical. It misleads our understanding of pathogen biology.

#Chymotrypsin fold has His490, Asp552 and Ser646

-----------------------------------------------------
A nuclease is an enzyme capable of cleaving the phosphodiester bonds between the nucleotide subunits of nucleic acids
Nuclease might be endonucleases or exonucleases; also might be deoxyribonuclease (cleaving DNA) and ribonuclease (cleaving RNA).
Nuclease is a subgroup of the hydrolases.
Restriction nucleases cut nucleic acid at specific restriction sites and produce restriction fragments (used as a tool in recombinant DNA technology). They can be Type I or Type II (recognizing 4 to 8 bp sequences).
Homing endonucleases are a collection of endonucleases encoded either as freestanding genes within introns, as fusions with host proteins, or as self-splicing inteins. They catalyze the hydrolysis of genomic DNA within the cells that synthesize them. Homing endonucleases bind very long (12 to 40 bp), even asymmetric, recognition sequences.
The HNH domain in endonucleases forms a zinc finger.
S1/P1 nuclease (a 36 kDa glycoprotein): cleaves RNA and single-stranded DNA
S1 nuclease needs a Zn2+ cofactor for catalysis (so starving bacteria of Zn might help to kill them. In the host Zn is abundant; does in vitro medium contain Zn?)
PIN (N-terminus of the PilT protein) domain cleaves single stranded RNA in a sequence dependent manner
#####################################################
Methyltransferases are enzymes that methylate their substrates
class I: have a Rossmann fold for binding S-Adenosyl methionine (SAM).
class II methyltransferases: contain a SET domain
class III methyltransferases: membrane associated
They can be classified as protein methyltransferases, DNA methyltransferases, Natural product methyltransferases, and Non-SAM dependent methyltransferases.

SAM is the classical methyl donor for methyltransferases
The methionine sulfur of SAM serves as the leaving group when the methyl group is transferred to the nucleophilic site on the enzyme substrate.
Dam methylase (Deoxyadenosine methylase) adds a methyl group to the adenine of the sequence 5'-GATC-3' in DNA
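
Potential Dam target sites can be located by scanning a sequence for GATC. This is a minimal Python sketch; the function name `find_sites` and the example sequence are assumptions. Since GATC is its own reverse complement, scanning one strand finds the sites on both.

```python
def find_sites(seq, motif="GATC"):
    """Return 0-based start positions of every (possibly overlapping) motif occurrence."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping occurrences are not missed
        start = seq.find(motif, start + 1)
    return positions


print(find_sites("ttGATCaaGATCGATCcc"))
```

The adenine that Dam methylates sits one base after each reported start position.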

Bio (7): Mycobacterium tuberculosis facts......

Mtb has a GC content of 65.53%
4% of its genome is PE and PPE genes (proline and glutamic acid rich)
Proline is non-polar, while glutamic acid is acidic (high proline means high stress)
PGRS is GC rich: PGRS genes in M. tuberculosis have regions with 80-90% GC content
MPTR has many tandem repeats.
PE has PGRS
PPE has MPTR
M. tuberculosis possesses a very high number of toxin-antitoxin (TA) systems in its chromosome, 79 in total, regrouping both well-known (68) and novel (11) families, with some of them being strongly induced in drug-tolerant persisters. IS ORFs are flanked by inverted repeats (IR) and direct repeats (DR).
Beijing clade of East Asian lineage have high number of IS6110 (16-24)
Great option to rearrange genome. Hundreds of insertion points and preferred loci have been found.
Genomic unstable regions are targeted by a variety of mobile genetic elements.
Some IS are stable, some are not.
in vitro conditions might affect IS copy number (M. tuberculosis H37Rv has 17-19 IS6110 copy numbers)
Intergenic region are not junk, they code for gene expression-changing small RNAs (sRNA)
Two component regulatory system: devR/devS
dosR is a regulatory protein
These proteins control HSP proteins and facilitate adaptation, promote dormancy
IS6110 have been seen to cause overexpression of regulatory gene dosR
Duplication is common. It can be as long as 350kb
pncA is a lineage marker, rather than resistance marker
Spoligotyping (based on spacers) and MIRU (based on repeat units) are lineage assignment techniques
Beijing clade has both ancient (N) and modern(W) branches (the division based on location of IS6110)
About two-thirds of the genome is core genome (about 2,100 genes).
HGT is less common in M. tuberculosis, yet it is there. Many antibiotic resistance genes have been found in the isolates.
Known mutations.........(*Codon for gene; *Position for promoter)
INH resistance
katG gene: S315T, katG315N (polar to polar, just size big)
inhA promoter: −15C/T
ahpC-oxyR region:
furA-katG:
fabG1-inhA:
efpA, fadE24, iniA,iniB, iniC, kasA, nat, ndh, Rv1772, Rv1592c, Rv0340, and srmR genes
RIF resistance
rpoB: rpoB531, rpoB526
FQ resistance
gyrA:Ala90Val, Ser91Pro, Asp94 (Gly/Ala/His/Asn)
gyrB: 495, 516, 533
Lineages (There are 7, out of which 4 are most-studied)
The Indo- Oceanic lineage: EAI
East African-Indian lineage: CAS
East Asian lineage: Beijing
Euro-American lineage: Ural, X, Haarlem, LAM, S and T

TubercuList database is based on the below strain of H37Rv. Mycobacterium tuberculosis H37Rv complete genome

GenBank: AL123456.3

Transposon in Mtb

Tn1721: codes for inducible tetracycline resistance (1 copy)
Tn10: two genes (tetC and tetD) were identified and located (2 copies)
Phages
Lytic phages might be captured by CRISPR, lysogenic might be kept in genome, as an extra MGE. Sometimes, phages are adjacent, and they are highly polymorphic (17% cover and 55% identity)
Phage capsid family protein
Caudovirus prohead protease
Phage terminase, small subunit
phage T7 F exclusion suppressor FxsA
Putative prophage phiRv2 integrase

Bacterial phages:
Lytic: Enterobacteria phage T2, T4, T6
Lysogenic: phage lambda of E. coli
T4 (ds DNA phage) infects E. coli

Phage therapy: Using phage to kill bacteria (instead of using fungi or Streptomyces-elaborated polyketides). Lytic phage is more suitable for therapy.
Lysogenic phage can detach and become a plasmid.
Some lesser-known bacterial enzymes and their functions:
Anthranilate synthase: pathway of tryptophan biosynthesis
Isonitrile hydratase: caprolactam degradation

Down-regulation and up-regulation of genes
RNase III activity lowered by stresses, and cold shock

Adhesin protein expression induced by immune response

Bio (8): Biogenic amines................

Biogenic amine neurotransmitters

Common biogenic amines : histamine, tyramine, cadaverine, 2-phenylethylamine, spermine, spermidine, putrescine, tryptamine, agmatine,octopamine, dopamine

Conversion of lysine into cadaverine

Major 5 types

Dopamine, norepinephrine (noradrenaline), and epinephrine (adrenaline) (3 catecholamines)

Histamine and serotonin

#Catecholamines are derived from Tyrosine. These amines are neurotransmitters in a sympathetic limb of the autonomic nervous system and in the CNS. Dopamine is involved in motivation, reward, addiction, behavioral reinforcement, and coordination of bodily movement.

#Histamine is derived from Histidine. It mediates arousal and attention, and also acts as a pro-inflammatory signal released from mast cells in response to allergic reactions or tissue damage.

#Serotonin is derived from Tryptophan. Like melatonin, it is an indolamine.

Amphetamine is a trace biogenic amine, a potent central nervous system (CNS) stimulant that is used in the treatment of attention deficit hyperactivity disorder (ADHD), narcolepsy, and obesity. It is often abused as a recreational drug.

Monday, January 18, 2016

IT (20): Codes for animation loops..........

#ASCII-art Box and Comment Drawing
sudo apt-get install boxes
echo "This is a test" | boxes
echo -e "\n\tSeema \n\tAutumn girl" | boxes -d bird

#Install ASCIIQuarium
cd /tmp
wget http://www.robobunny.com/projects/asciiquarium/asciiquarium.tar.gz
tar -zxvf asciiquarium.tar.gz
cd asciiquarium_1.0/
sudo cp asciiquarium /usr/local/bin
sudo chmod 0755 /usr/local/bin/asciiquarium

/usr/local/bin/asciiquarium
perl /usr/local/bin/asciiquarium
#Christmas tree making using Perl module called Acme::POE::Tree
perl -MCPAN -e 'install Acme::POE::Tree'
perl -MAcme::POE::Tree -e 'Acme::POE::Tree->new()->run()'

#!/usr/bin/perl
#Customized tree.pl
use Acme::POE::Tree;
my $tree = Acme::POE::Tree->new(
{
star_delay => 1.5, # shimmer star every 1.5 sec
light_delay => 2, # twinkle lights every 2 sec
run_for => 10, # automatically exit after 10 sec
}
);

$tree->run();
#Penguin swarm
sudo apt-get install xpenguins
xpenguins
xpenguins -l

xpenguins --theme "Big Penguins" --theme "Turtles"

#Snowfall
sudo apt-get install xsnow
xsnow
xsnow -bg blue -sc snow
xsnow -snowflakes 10000 -delay 0
xsnow -notrees -nosanta

#banner (dotted graphics)
man banner
banner Happy

Thursday, January 14, 2016

Language: Perl (Bioperl)............

http://perldoc.perl.org/
perldoc
perldoc -f sprintf
perldoc List::Util
--------------------------------------------------------
bioperl: perl modules for life sciences data and analysis
Modules are interfaces to data types: Sequences, Alignments, Features, Locations, Databases
Common modules: List::Util, Getopt::Long, Statistics::Descriptive
http://bioperl.org
http://github.com/bioperl
http://www.bioperl.org/
http://www.bioperl.org/wiki/HOWTOs

Bio Modules: SeqIO, DB::Fasta, DB::GenBank, TreeIO, AlignIO, SearchIO
Objects: Bio::Seq, Bio::DB::Fasta, Bio::DB::GenBank, Bio::TreeIO, Bio::AlignIO
Methods: seq() , length() , id() , description()
---------------------------------------------------------
SeqIO can both read and write sequences

---------------------------------------------------------
#Count the number of FASTA sequences
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqfile = "sequences.fa";
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -file   => $seqfile);
my $count = 0;
while ( my $seq = $in->next_seq ) {
    $count++;
}

print "Sequence number is $count\n";
---------------------------------------------------------
#Count the number of bases
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $seqfile = "sequences.fa";
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -file   => $seqfile);
my $count = 0;
while ( my $seq = $in->next_seq ) {
    $count += $seq->length;
}
print "Number of bases is $count\n";
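
For comparison, both counts can be obtained without any library by reading the FASTA text directly. The Python sketch below is an illustration only; the function name `fasta_stats` and the example records are assumptions.

```python
def fasta_stats(lines):
    """Return (number_of_sequences, total_bases) for FASTA-formatted lines."""
    n_seqs = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_seqs += 1           # each header line starts a new record
        else:
            n_bases += len(line)  # a sequence may be wrapped over several lines
    return n_seqs, n_bases


example = [">seq1 demo", "ATGCGG", "ATTG", ">seq2", "CAGGT"]
print("Sequences: %d, bases: %d" % fasta_stats(example))
```

For a real file, pass an open file handle (for instance for `sequences.fa`) instead of the example list.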
---------------------------------------------------------
#Convert file formats and output the sequences
#!/usr/bin/perl -w
use Bio::SeqIO;
my $seqfile = "sequences.gbk";
my $in = Bio::SeqIO->new(-format => 'genbank',
                         -file   => $seqfile);
my $out = Bio::SeqIO->new(-format => 'fasta',
                          -file   => ">outputfile.fa");
while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
}
---------------------------------------------------------
#Fast random access to Fasta seq databases
use Bio::DB::Fasta;
my $dir = shift @ARGV;
my $dbh = Bio::DB::Fasta->new($dir);
my $seq = $dbh->get_Seq_by_acc("SEQ128");
my $seqstr = $dbh->seq("chr1", 9087, 12375);
---------------------------------------------------------
#To query GenBank
use Bio::DB::GenBank;
use Bio::SeqIO;
my $db = Bio::DB::GenBank->new;
my $seq = $db->get_Seq_by_acc("NM_206028.1");
my $out = Bio::SeqIO->new(-format => 'fasta');
$out->write_seq($seq);
---------------------------------------------------------
#Convert from nexus to newick format
use Bio::TreeIO;
my $in = Bio::TreeIO->new(-format => 'nexus',
                          -file   => shift @ARGV);
my $out = Bio::TreeIO->new(-format => 'newick');
while ( my $tree = $in->next_tree ) {
    $out->write_tree($tree);
}
---------------------------------------------------------
#Multiple alignment
use Bio::AlignIO;
# First argument: input alignment; second argument: output file
my $in = Bio::AlignIO->new(-format => 'clustalw',
                           -file   => shift @ARGV);
# The ">" prefix opens the second file for writing
my $out = Bio::AlignIO->new(-format => 'phylip',
                            -file   => ">" . shift(@ARGV));
while ( my $aln = $in->next_aln ) {
    $out->write_aln($aln);
}
---------------------------------------------------------
#Seq database search
use Bio::SearchIO;
my $in = Bio::SearchIO->new(-format => 'blast',
                            -file   => shift @ARGV);
while ( my $r = $in->next_result ) {
    print $r->query_name, "\n";
    while ( my $h = $r->next_hit ) {
        print "\t", $h->name, " ", $h->significance, "\n";
        while ( my $hsp = $h->next_hsp ) {
            print "\t\t", $hsp->query->start, "..", $hsp->query->end, "\n";
            print "\t\t", $hsp->hit->start, "..", $hsp->hit->end, "\n";
            print "\t\t", $hsp->evalue, " ", $hsp->frac_identical, " ",
                $hsp->frac_conserved, "\n";
            print "\t\t", $hsp->query_string, "\n";
        }
    }
}
---------------------------------------------------------

Wednesday, January 13, 2016

IT (17): Sharing and storage tools....

Google drive
https://drive.google.com/drive/my-drive
create
share
name it
invite people

Dropbox

ownCloud

Mendeley

Bio (13): Immunology basics...........

Cells have receptors, including G protein-coupled receptors
The H4 receptor binds histamine.
The binding leads to eosinophil migration and mast cell recruitment

Thursday, January 7, 2016

Bio (12): Amino acid codon table............

#CODON TABLE (codons, number of codons, degenerate IUPAC form; two codons per amino acid is the most common case)
Alanine (Ala) (A): GCA, GCC, GCG, GCT---4---GCN
Arginine(Arg) (R): CGT, CGC, CGA, CGG, AGA, AGG---6---CGN, MGR
Asparagine(Asn) (N): AAT, AAC---2---AAY
Aspartic acid(Asp)(D): GAT, GAC---2---GAY
Cysteine(Cys)(C): TGT, TGC---2---TGY
Glutamine (Gln)(Q): CAA, CAG---2---CAR
Glutamic acid (Glu)(E): GAA, GAG---2---GAR
Glycine (Gly)(G): GGT, GGC, GGA, GGG---4---GGN
Histidine(His) (H): CAT, CAC---2---CAY
Isoleucine(Ile) (I): ATT, ATC, ATA---3---ATH
Leucine(Leu) (L): TTA, TTG, CTT, CTC, CTA, CTG---6---YTR, CTN
Lysine(Lys) (K): AAA, AAG---2---AAR
Methionine (Met) (M): ATG---1
Phenylalanine(Phe) (F): TTT, TTC---2---TTY
Proline(Pro) (P): CCT, CCC, CCA, CCG---4---CCN
Serine(Ser) (S): TCT, TCC, TCA, TCG, AGT, AGC---6---TCN, AGY
Threonine (Thr) (T): ACT, ACC, ACA, ACG---4---ACN
Tryptophan (Trp) (W): TGG---1
Tyrosine (Tyr) (Y): TAT, TAC---2---TAY
Valine (Val) (V): GTT, GTC, GTA, GTG---4---GTN
*STOP TAA, TGA, TAG---3---TAR, TRA
#Arg, Leu, Ser have six codons
#Met and Trp have 1 codon each
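
The table can be rebuilt and checked programmatically. The Python sketch below encodes the standard genetic code as a 64-letter string in T, C, A, G order; the names `CODON_TABLE`, `codons_by_aa` and `translate` are illustrative choices, and `translate` assumes the input contains only A, C, G and T.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(("".join(codon), aa)
                   for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS))

# Group codons by the amino acid (one-letter code) they encode
codons_by_aa = defaultdict(list)
for codon, aa in sorted(CODON_TABLE.items()):
    codons_by_aa[aa].append(codon)


def translate(dna):
    """Translate a DNA string codon by codon; an incomplete trailing codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


six_fold = sorted(aa for aa, codons in codons_by_aa.items() if len(codons) == 6)
print(six_fold)
print(translate("ATGCGGATTGCAGGT"))
```

Printing `codons_by_aa` gives the same codon lists as the table above, which makes it easy to confirm the per-amino-acid counts.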
Polar amino acids
Glutamine - Gln - Q
Asparagine - Asn - N
Histidine - His - H
Serine - Ser - S
Threonine - Thr - T
Tyrosine - Tyr - Y
Cysteine - Cys - C
Methionine - Met - M
Tryptophan - Trp - W

Non-polar (aliphatic or aromatic) amino acids
Alanine - Ala - A
Isoleucine - Ile - I
Leucine - Leu - L
Phenylalanine - Phe - F
Valine - Val - V
Proline - Pro - P
Glycine - Gly - G

Charged amino acids
Arginine - Arg - R
Lysine - Lys - K
Aspartic acid - Asp - D
Glutamic acid - Glu - E

Essential amino acids (9): Phenylalanine, valine, threonine, tryptophan, methionine, leucine, isoleucine, lysine, and histidine (F V T W M L I K H)

Conditionally-essential amino acids (6): Arginine, cysteine, glycine, glutamine, proline and tyrosine (R C G Q P Y)

Non-essential amino acids (5): Alanine, aspartic acid, asparagine, glutamic acid and serine (A D N E S)

* This chart retrieved from http://www.sigmaaldrich.com/ is useful
Proteins have a hydrophobic core and a hydrophilic surface (to form hydrogen bonds with water).
So, if the surface of a protein becomes hydrophobic, the protein can't interact with water and loses stability.

Wednesday, January 6, 2016

Learning the terminal text editors: vi/vim, emacs...

#Create a text file
vi /tmp/download.txt
#vim is improved vi

#To switch off backup creation, add to the vimrc
set nobackup
set nowritebackup

#Alternatively, create a directory called vimtmp in your home directory and keep backup and swap files there
set backupdir=~/vimtmp
set directory=~/vimtmp

i: to start typing (inserting)

Esc: to move from insert mode to normal mode
#The following commands work in Esc or normal mode
x: to remove current character
X: to remove character to the left
A: to insert text at the end of line
u: undo action
0: cursor goes to start of the line
$: cursor goes to end of the line
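^: cursor goes to first non-blank character of the line
w: cursor jumps to the start of the next word
b: cursor jumps back to the start of the current or previous word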

h: go left
l: go right
k: go up
j: go down
R: replace mode (typed text overwrites existing characters)
:w Enter: save
:q Enter: quit

Some important software....

#To read Linux info documentation in colors
apt-get install pinfo
pinfo page
pinfo grep

#To show difference in colors
yum install colordiff
sudo apt-get install colordiff

colordiff file1 file2
diff -u file1 file2 | colordiff
diff -u file1 file2 | colordiff | less -R
diff file1 file2 | remark /usr/share/regex-markup/diff

grc diff file1 file2

#To convert HTML to pdf
sudo apt-get install wkhtmltopdf
sudo ln -s /usr/bin/wkhtmltopdf /usr/local/bin/html2pdf

Shell (7): Alias and .bashrc...........

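#A terminal normally starts the shell via /usr/bin/login
#Put the line "source ~/.bashrc" in your login profile (e.g. ~/.profile) so that ~/.bashrc is read at login
#To reload ~/.bashrc in the current session
source ~/.bashrc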

#For adding features to ~/.profile
echo 'source $GROUPHOME/.config/xx-rc' >> ~/.profile

#An alias is a shortcut for a command; to make it permanent, define it in the ~/.bashrc file
#To see existing aliases
alias
#syntax
alias name=value
alias name='/path/to/script.sh arg1'
e.g.
alias c='clear'
alias .4='cd ../../../../'
alias .5='cd ../../../../..'
alias info='pinfo'
alias vi='vim'
alias grep='grep --color'

alias update='yum update'

#Unalias an alias
unalias c
unalias aliasname

IT (6): Computer Acronyms and jargon........

ACID: Atomicity, Consistency, Isolation, Durability (database properties)
ADT: Abstract Data Types
ANSI: American National Standards Institute
API: Application Programming Interface (a set of routines, protocols, and tools for building software applications)
ARGC: Argument count
ARGV: Argument vector
ASCII: American Standard Code for Information Interchange
AWK: Aho Weinberger Kernighan

BOM: Byte order mark

CGI: Common Gateway Interface (Standard environment for web servers to interface with executable programs)
CLI: Command Line interface
CLI: Common Language Infrastructure
CLISP: Common LISP
CMS: Content Management Systems (e.g. WordPress, Joomla, Drupal, Xibo)
DBM: Database management
DNS: Domain Name System
endl: End of line (used for flushing the stream in C++)
EOF: End of file
EVM: Earned Value Management (tracks progress and future of a project)
fdisk: Fixed disk or format disk (used to create, resize, delete, change, copy and move partitions on a hard drive)
FIFO: First-in first-out
FOSS: Free and Open Source Software
FS: Field separator
FSF: Free Software Foundation
FTP: File Transfer Protocol
GCC: GNU Compiler Collection. A compiler system (gcc)
GIMP: GNU Image Manipulation Program
GPL: General Public License
GUI: Graphical User Interface
GUID: Globally Unique Identifier
HRF: Human Readable Format
HTML: HyperText Markup Language
IDE: Integrated development environment
IFS: Internal Field Separator (Used by the parser for word splitting after expansion)
IGV: Integrative Genomics Viewer (Visualizing results)
IP: Internet Protocol
ISO: International Organization for Standardization
JSON: JavaScript Object Notation (a lightweight data-interchange format)
JVM: Java Virtual Machine
LAMP: Linux, Apache, MySQL, PHP
LAN: Local Area Network
LDAP: Lightweight Directory Access Protocol
LIFO: Last-in first-out
LISP: LISt Processing
MD5: Message-Digest Algorithm 5 (to create hash value)
MPI: Message Passing Interface
NF: Number of Fields
NoSQL: Not only SQL
NR: Number of Records
OFS: Output Field Separator
ORS: Output Record Separator
PAN: Personal Area Network
PID: Process Identification Number
RDBMS: Relational Database Management System
Scala: Scalable Language
sed: stream editor
SLOC: Source lines of code
SRA: Sequence Read Archive, formerly Short Read Archive (stores NGS datasets)
SSH: Secure Shell (encrypted network protocol)
stdin : standard input
SVG: Scalable Vector Graphics
varchar: Variable-length Characters
VoIP: Voice over IP
##############Computer Jargon###############

Who said computation is boring?..Difficult sure it is but fun element is not lacking..

I like Greek mythology and literature...knowledge fascinates me..and so does computing jargon.

Access modifier: private (visible only inside the class), protected (visible inside the class and its subclasses), public (can be accessed from anywhere)
Apache: A software foundation, most widely used web server
Balkanization: Fragmentation into many smaller, uncooperative regions

Buffer: A hunk of memory to hold data
Clobbering: Overwriting the contents of a file or computer memory
Cloning: copying the exact version
Crawler: A program that visits websites and reads their pages
Currying: Translating functions with tuple arguments into function with single arguments
Cron: A job scheduler (cron job is task performed at regular intervals).
Daemon: In multi-tasking operating systems, it runs in the background
Data wrangling (data munging ): Changing data formats from one form to another, manipulating clumsy data
Defragmentation: To improve I/O operations (e.g loading, extracting)
Dependencies: Subordinate but essential things
Deprecated: Disapproved
Didactic: Intended to teach
Docker: Packages software with all its dependencies (e.g. an enterprise application) into one self-contained container, which runs in any environment.
filehandles: A number that the operating system assigns temporarily to a file
fsck: File system consistency check
Globbing: Expanding wildcard patterns (e.g. *.txt) into matching file names
GNU: GNU's Not Unix
Heuristic: From experience
inode: A data structure used to represent a filesystem object
Iterable: Any list, file, string which can be manipulated with for loop is iterable
Iterator: Can be iterated only once
MapReduce: Software for processing and generating large data sets in parallel-manner
Metadata: Ancillary information, software revision information
Netfilter : A Linux OS firewall, controlled by iptables
Operator: a symbol to perform a job (arithmetic, relational, logical, bitwise, assignment)
Overhead: Excess computation time, memory, bandwidth and other resources for a rather simple goal
Overriding: Replacing a parent-implemented method by child class in Java
Parsing: Analyzing code into parts and describing their syntactic roles
Pipeline: Many commands put together in a script to achieve a result.
Readme file: contains information about other files in a directory or archive and is commonly distributed with computer software
Refactoring: The process of cleaning up a program to make it more usable or easier to understand or less complicated
regex: Language to describe patterns in strings
Regular expression: A pattern that describes a set of strings
Router: A device acting as a gateway,connecting two networks (e.g. connecting LAN to WAN)
Segmentation fault: when the program attempts to access memory it has either not been assigned by the operating system, or is otherwise not allowed to access.
shebang: #!/bin/sh -x (Simple text files become Bash scripts when adding a shebang line as first line, saying which program should read and execute this text file)
Stub: Generated something, but left to be filled up
subroutine: A sequence of codes that perform a specific task and can be used in other programs
tarball or tarfile: A group or archive of files that are bundled together using the tar command and usually have the .tar file extension. If the tarfile is compressed using gzip command the tarball will end with tar.gz.
Webinar: web conferencing
Variable: Reserved memory locations to store values
VMware: Cloud and virtualization software
VoIP: Technologies for the delivery of voice communications and multimedia sessions
Zombie: A process is a 'zombie' if it has terminated but still has an entry in the process table
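
The difference between the "Iterable" and "Iterator" entries above is easiest to see in a few lines of Python. This is a small illustration with made-up variable names:

```python
numbers = [1, 2, 3]            # a list is an iterable: it can be looped over repeatedly
first_pass = [n for n in numbers]
second_pass = [n for n in numbers]

iterator = iter(numbers)       # an iterator hands out each item once, then is exhausted
once = list(iterator)
again = list(iterator)         # nothing is left the second time

print(first_pass, second_pass)
print(once, again)
```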
------------------------------------------------------------------------------------------------------

Monday, January 4, 2016

Python (3): Biopython.............

Examples from http://biopython.org/
ftp://ftp.expasy.org/databases (required dat files can be downloaded from here)
-------------------
Execute script as: python script.py
#To check Installation of Biopython
import Bio
# Frequently-used modules of python are Seq, pairwise2, AlignIO
-----------------------------------------------------------------------------------------------

#Import Seq module from Bio library of python
#!/usr/bin/python
from Bio.Seq import Seq
#Create a sequence object
my_seq = Seq('ATGCGGATTGCAGGT')
#Print the sequence, its length, reverse complement, mRNA and protein
print('DNA seq is: %s' % my_seq)
print('Seq %s is %i bases long' % (my_seq, len(my_seq)))
print('Reverse complement of DNA is: %s' % my_seq.reverse_complement())
print('Transcribed mRNA is: %s' % my_seq.transcribe())
print('Translated protein is: %s' % my_seq.translate())
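
To see what the reverse complement and transcription steps do under the hood, here is a plain-Python sketch that does not use Biopython. The function names are illustrative and the code assumes an unambiguous A/C/G/T sequence.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Complement every base and read the result backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


dna = "ATGCGGATTGCAGGT"
print(reverse_complement(dna))
print(transcribe(dna))
```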
-----------------------------------------------------------------------------------------------
#Import pairwise2 module from Bio library of python for sequence alignment
#!/usr/bin/python
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
for i in pairwise2.align.globalxx("GGTCCTTAG", "TTTCGGAAG"):
    print(format_alignment(*i))

from Bio.Align.Applications import ClustalwCommandline
help(ClustalwCommandline)
#clustalw
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
print(cline)
#Matrix-based protein sequence alignment
#!/usr/bin/python
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.SubsMat import MatrixInfo as matlist
matrix = matlist.blosum62
for i in pairwise2.align.globaldx("RTSMNRWT", "MPSTW", matrix):
    print(format_alignment(*i))

#Multiple sequence alignment (MSA). MSA can be made by loading an alignment file (.aln) using AlignIO module

#!/usr/bin/python
from Bio import AlignIO
align = AlignIO.read("opuntia.aln", "clustal")
print(align)
len(align)
for record in align:
    print("%s %i" % (record.id, len(record)))
#Extract first row
print(align[0].id)
#Extract last row
print(align[-1].id)
#Extract one column as a string (here index 2, i.e. column 3)
print(align[:, 2])
#Extract first five columns
print(align[:, :5])

#To find the count of A, T, C, G (their % too) in fasta sequence
#Use annotation file .ffn for it (It has individual gene ATCG sequence)
#Run as: python GC_content.py
#!/usr/bin/python
from Bio import SeqIO

#Opened a FASTA file
input_file = open('L_11232015.ffn', 'r')
#Opened a output file
output_file = open('GC_content','w')
#Header line
#Header looks like Gene   A   C   G   T   Length   CG%
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')

for cur_record in SeqIO.parse(input_file, "fasta"):
    #count nucleotides in this record...
    gene_name = cur_record.name
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    #GC content as a percentage, to match the CG% header (empty records give 0)
    cg_percentage = 100.0 * (C_count + G_count) / length if length else 0.0
    output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
        (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
    output_file.write(output_line)

output_file.close()
input_file.close()

#QueryResult (compare query to the reference for hits)
from Bio import SearchIO
blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
print(blast_qresult)

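#Open a Swiss-Prot file (the handles below are alternatives; use whichever fits the source)
#Local file
handle = open("myswissprotfile.dat")

#Gzipped local file
import gzip
handle = gzip.open("myswissprotfile.dat.gz")

#File on the web (Python 2; in Python 3 use urllib.request.urlopen instead)
import urllib
handle = urllib.urlopen("http://www.somelocation.org/data/someswissprotfile.dat")

#Directly from ExPASy by accession number
from Bio import ExPASy
handle = ExPASy.get_sprot_raw(myaccessionnumber)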

#To read one Swiss-Prot record from the handle
from Bio import SwissProt
record = SwissProt.read(handle)
print(record.description)
print(record.organism_classification)

from Bio.ExPASy import Prosite
handle = open("prosite.dat")
records = Prosite.parse(handle)
record = next(records)
record.accession
record.name

from Bio.ExPASy import Enzyme
handle = open("enzyme.dat")
records = Enzyme.parse(handle)
ecnumbers = [record["ID"] for record in records]

Shell (8): User inputs.........

The command 'read' is a shell builtin. It reads from stdin and assigns it to a variable.

#Asks for input, then prints it

echo -n "Enter your name and press [ENTER]: "
read user_name
echo "Your name is: $user_name"

#Code asks user name; reads and stores the user information in a variable
echo Hello, who am I talking to?
read user_name
echo "Good Morning, $user_name!"

#Code asks user name and passwords and furnishes the login information.
read -p 'Username: ' user_var
read -sp 'Password: ' pass_var
echo
echo "Hi $user_var, here are your login details: "

#Code asks the user about two of their favorite flowers
echo "Name two of your favorite flowers:"
read flower1 flower2
echo Your first favorite flower is: $flower1
echo Your second favorite flower is: $flower2
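
When `read flower1 flower2` receives more than two words, the last variable takes the rest of the line. The Python sketch below mimics that splitting for whitespace-separated input; the function name `split_like_read` and the sample input are assumptions made for illustration.

```python
def split_like_read(line, n_vars):
    """Split a line into n_vars fields the way `read var1 var2 ...` does for
    whitespace-separated input: the last variable receives the rest of the line."""
    fields = line.strip().split(None, n_vars - 1)
    # Variables with no matching word are left empty
    return fields + [""] * (n_vars - len(fields))


flower1, flower2 = split_like_read("rose tulip and lily", 2)
print("Your first favorite flower is: %s" % flower1)
print("Your second favorite flower is: %s" % flower2)
```

This is why typing three flower names at the prompt leaves the second and third together in the second variable.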