Exploring the choppy water of coding: May 2016

Tuesday, May 3, 2016

Allergens: Types, sources.......

GENERAL
######PLANTS#####
Ole e: Olea europaea (Common olive)
Sin a: Sinapis alba (White mustard)
2S albumin: Ricinus communis (Castor bean)
Pectate lyase: Cryptomeria japonica (Japanese cedar) (Cupressus japonica)
Expansin-B1: Zea mays (Maize)
Superoxide dismutase: Olea europaea (Common olive)
Small rubber particle protein: Hevea brasiliensis (Para rubber tree)
Exopolygalacturonase: Platanus acerifolia (London plane tree)
Major pollen allergen Bet v 1-A: Betula pendula (European white birch) (Betula verrucosa)
Profilin-2: Phleum pratense (Common timothy)
Pectinesterase 1: Olea europaea (Common olive)
Non-specific lipid-transfer protein: Ambrosia artemisiifolia (Short ragweed)
Profilin-1: Phleum pratense (Common timothy)
Pectate lyase 1: Ambrosia artemisiifolia (Short ragweed)
Pectate lyase 2: Ambrosia artemisiifolia (Short ragweed)
Bet v 1-L: Betula pendula (European white birch) (Betula verrucosa)
Amb a 3: Ambrosia artemisiifolia var. elatior (Short ragweed)
Pectinesterase 2: Olea europaea (Common olive)
Phl p 5b: Phleum pratense (Common timothy)
Polygalacturonase: Cryptomeria japonica (Japanese cedar)
Expansin-B11: Zea mays (Maize)
Lol p 1: Lolium perenne (Perennial ryegrass)
Profilin-4: Corylus avellana (European hazel) (Corylus maxima)
Actinidain: Actinidia deliciosa (Kiwi)
Polygalacturonase: Juniperus ashei (Ozark white cedar)
Esterase: Hevea brasiliensis (Para rubber tree)
Protein DOWNSTREAM OF FLC: Arabidopsis thaliana (Mouse-ear cress)
Major allergen Api g 1: Apium graveolens (Celery)
Alpha-amylase inhibitor BMAI-1: Hordeum vulgare (Barley)
Superoxide dismutase [Cu-Zn]: Olea europaea (Common olive)
Lactoylglutathione lyase: Oryza sativa subsp. japonica (Rice)
Profilin-1: Zea mays (Maize)
Ambrosia artemisiifolia (Short ragweed)
Bra j 1-E: Brassica juncea (Indian mustard) (Sinapis juncea)
Glucan endo-1,3-beta-glucosidase: Prunus avium (Cherry)
Non-specific lipid-transfer protein: Apium graveolens (Celery)
Dau c 1: Daucus carota (Wild carrot)
Pollen allergen KBG 41: Poa pratensis (Kentucky bluegrass)
Lol p 5a: Lolium perenne (Perennial ryegrass)
Profilin-2-5: Olea europaea (Common olive)
######FUNGI#####
60S acidic ribosomal protein P2: Alternaria alternata (Alternaria rot fungus)
Alcohol dehydrogenase 1: Candida albicans (Yeast)
Enolase: Cladosporium herbarum
Glucoamylase: Trichophyton mentagrophytes
Cla h 7: Cladosporium herbarum
Ribonuclease mitogillin: Aspergillus fumigatus
Fructose-bisphosphate aldolase: Candida albicans (strain SC5314 / ATCC MYA-2876) (Yeast)
60S acidic ribosomal protein P2: Cladosporium herbarum
Enolase: Alternaria alternata (Alternaria rot fungus)

######NEMATODE#####
Polyprotein ABA-1: Ascaris suum (Pig roundworm)
Major allergen Ani s 1: Anisakis simplex (Herring worm)
######ARTHROPODS#####
Pilosulin-3a: Myrmecia pilosula (Jack jumper ant) (Australian jumper ant)
Peptidase 1: Psoroptes ovis (Sheep scab mite)
Hyaluronidase A: Vespula vulgaris (Yellow jacket) (Wasp)
Eur m 3: Euroglyphus maynei (Mayne's house dust mite)
Peptidase 1: Dermatophagoides pteronyssinus (European house dust mite)
Mite group 2 allergen Lep d: Lepidoglyphus destructor (Storage mite)
Peptidase 1: Dermatophagoides farinae (American house dust mite)
Mite group 2 allergen Der p 2: Dermatophagoides pteronyssinus (European house dust mite)
Phospholipase A1: Solenopsis invicta (Red imported fire ant)
Melittin: Apis mellifera (Honeybee)
Pilosulin-1: Myrmecia pilosula (Jack jumper ant) (Australian jumper ant)
Hyaluronidase: Apis mellifera (Honeybee)
Aspartic protease Bla g 2: Blattella germanica (German cockroach) (Blatta germanica)
Peptidase 1: Euroglyphus maynei (Mayne's house dust mite)
Phospholipase A1 1: Dolichovespula maculata (Bald-faced hornet)
Venom allergen 3: Solenopsis invicta (Red imported fire ant)
Der p 3: Dermatophagoides pteronyssinus (European house dust mite)
Der f 3: Dermatophagoides farinae (American house dust mite)
Arginine kinase AK: Penaeus monodon (Giant tiger prawn)
Venom dipeptidyl peptidase 4: Apis mellifera (Honeybee)
Phospholipase A1: Vespula maculifrons (Eastern yellow jacket) (Wasp)
######FISH#####
Parvalbumin beta: Gadus morhua subsp. callarias (Baltic cod)
######BIRDS#####
Ovalbumin: Gallus gallus (Chicken)
Ovotransferrin: Gallus gallus (Chicken)
Lysozyme C: Gallus gallus (Chicken)
Ovomucoid: Gallus gallus (Chicken)
######MAMMALS#####
Minor allergen Can f 2: Canis lupus familiaris (Dog) (Canis familiaris)
Major allergen I polypeptide chain: Felis catus (Cat)
Allergen Fel d 4: Felis catus (Cat) (Felis silvestris catus)
Major urinary protein: Rattus norvegicus (Rat)
Allergen Bos d 2: Bos taurus (Bovine)
Protein S100-A7: Bos taurus (Bovine)
Latherin: Equus caballus (Horse)
Major allergen Equ c 1: Equus caballus (Horse)
-----------------
SPECIFIC
#Cashew, Pistachio

Vicilin-like protein, 2s albumin, Ana o 2, 11S globulin

#Almond, peach

pru1, Pru du, Non-specific lipid-transfer protein

#Tomato

Profilin, pectate lyase

#Peanut

Conglutin-7, Defensin, Ara h, Profilin, Non-specific lipid-transfer protein

#Avocado

Endochitinase

#Kiwi

Actinidain, Cysteine proteinase inhibitor, Thaumatin-like protein, Act d

Kiwellin, Kirola, Non-specific lipid-transfer protein, Endochitinase, Bet v

#Persimmon

Expansin, Non-specific lipid-transfer protein

#Celery

Non-specific lipid-transfer protein, Chlorophyll a-b binding protein, Api g, Profilin

#Kidney bean

Pathogenesis-related protein 1

Pectate lyase

#Egg

Ovalbumin

Ovotransferrin

Lysozyme C

Ovomucoid

Serum albumin

#Shrimp, lobster

Tropomyosin

Arginine kinase

Pen a

Lit v

Sarcoplasmic calcium-binding protein

#Mussel

Tropomyosin

Endo-beta-1,4-glucanase

#Fish

Alpha-enolase

Beta-enolase

Parvalbumin beta

Fructose-bisphosphate aldolase A

#Octopus

Arginine kinase

#Silk moth

SCP-related protein

Arginine kinase

Apolipoprotein of lipid transfer

#Rubber

Patatin

MY SCRIPT (2): Unique genes finding, their analysis, wrapper..

#!/bin/bash
#Code to find out unique genes
#Run as: sh unique_genes_finding.sh |& tee all_isolate_gene_profile
#mkdir /home/pseema/denovo_analysis/result_files/unique_genes
#find /home/pseema/denovo_analysis/result_files/*.only_header
while read strain;
do
while read isolate;
do
echo "#################Starting $isolate..####################"
#Extract all columns except column1
awk '{$1=""; print $0}' /home/pseema/denovo_analysis/result_files/$isolate.only_header > /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name
echo "****Total number of proteins in $isolate: ******"
cat /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name | wc -l
awk '!/hypothetical/' /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name > /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins
echo "******Number of non-hypothetical proteins in $isolate: *****"
cat /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins | wc -l
sort -u /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins > /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted

#Show the proteins common to file1 and file2 (comm -12; -21 is equivalent)
echo "**Proteins common to $strain and $isolate: **"
comm -12 /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate
cat /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate
cp /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate /home/pseema/denovo_analysis/result_files/unique_genes
echo "**Proteins common to $strain and $isolate done**"

#These proteins occur only in $strain (only column1)
echo "**Proteins unique to $strain: **"
comm -23 /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/not_in.$isolate
cat /home/pseema/denovo_analysis/result_files/not_in.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/not_in.$isolate
cp /home/pseema/denovo_analysis/result_files/not_in.$isolate /home/pseema/denovo_analysis/result_files/unique_genes
echo "Unique protein search for $strain done"

#These proteins occur only in $isolate (only column2)
echo "**Proteins unique to $isolate: **"
comm -13 /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/only_in.$isolate
cat /home/pseema/denovo_analysis/result_files/only_in.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/only_in.$isolate
cp /home/pseema/denovo_analysis/result_files/only_in.$isolate /home/pseema/denovo_analysis/result_files/unique_genes
echo "Unique protein search for $isolate done"

echo "********$isolate done********"
done < /home/pseema/denovo_analysis/input_files/isolate_list
#done < /home/pseema/denovo_analysis/input_files/IO_isolates
#done < /home/pseema/denovo_analysis/input_files/EAS_isolates
#done < /home/pseema/denovo_analysis/input_files/EAI_isolates
#done < /home/pseema/denovo_analysis/input_files/EAM_isolates

done < /home/pseema/denovo_analysis/input_files/strain_list
#done < /home/pseema/denovo_analysis/input_files/IO_isolates
#done < /home/pseema/denovo_analysis/input_files/EAS_isolates
#done < /home/pseema/denovo_analysis/input_files/EAI_isolates
#done < /home/pseema/denovo_analysis/input_files/EAM_isolates
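
The three comm calls in the script above boil down to one set intersection and two set differences. Here is a minimal Python sketch of the same comparison, assuming one protein name per line in each input. The function name and the sample protein names are made up for illustration and are not part of the original script.

```python
def compare_protein_lists(strain_names, isolate_names):
    """Mimic comm -12, -23 and -13 on two collections of protein names."""
    strain_set = set(strain_names)
    isolate_set = set(isolate_names)
    return {
        "in_both": sorted(strain_set & isolate_set),       # like comm -12
        "only_strain": sorted(strain_set - isolate_set),   # like comm -23
        "only_isolate": sorted(isolate_set - strain_set),  # like comm -13
    }


if __name__ == "__main__":
    # Hypothetical protein names, for illustration only
    strain = ["catalase", "enolase", "gyrase subunit A"]
    isolate = ["enolase", "gyrase subunit A", "urease"]
    for label, names in compare_protein_lists(strain, isolate).items():
        print(label, len(names), names)
```

Unlike comm, sets do not need the inputs to be sorted beforehand, which removes one way for the comparison to go wrong silently.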
-----------------------------------------------------
#!/bin/bash
#Code to analyze data for unique genes
#Execute as: sh unique_genes_analysis.sh |& tee all_isolate_gene_analysis
#mkdir /home/pseema/denovo_analysis/result_files/unique_genes
#find *.matches_comm_12 | wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/in_both.*` > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common
echo "Common protein pool when the isolates were compared to each other..."
#cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_reduced
echo "Unique proteins in the common protein pool..."
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_reduced | wc -l

#find *.matches_comm_23 | wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/not_in.*` > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1 | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1 > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq_reduced
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq_reduced | wc -l
#find *.matches_comm_13 | wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/only_in.*`> /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2 | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2 > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq_reduced
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq_reduced | wc -l

#Find lines matching a given pattern
awk '/Proteins unique to/' all_isolate_gene_profile > /home/pseema/denovo_analysis/result_files/unique_genes/pattern_files

#Find the line that immediately follows each line matching a given pattern
awk 'f{print;f=0} /Proteins unique to/{f=1}' all_isolate_gene_profile > /home/pseema/denovo_analysis/result_files/unique_genes/next_lines

#Paste these two files side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/pattern_files /home/pseema/denovo_analysis/result_files/unique_genes/next_lines > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes

#Extract only column 4
awk '{print $4}' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate
#find difference between two consecutive lines in the generated file
#Extract only odd number lines
awk 'NR%2==1' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_odd

#Extract only even number lines
awk 'NR%2==0' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_even

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_odd /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_even > /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns_isolates

#Find difference between two consecutive lines in the generated file
#Extract only odd number lines
awk 'NR%2==1' /home/pseema/denovo_analysis/result_files/unique_genes/next_lines > /home/pseema/denovo_analysis/result_files/unique_genes/only_odd

#Extract only even number lines
awk 'NR%2==0' /home/pseema/denovo_analysis/result_files/unique_genes/next_lines > /home/pseema/denovo_analysis/result_files/unique_genes/only_even

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/only_odd /home/pseema/denovo_analysis/result_files/unique_genes/only_even > /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns
#Find difference between two columns of the file
awk 'NF > 0 { print $0 "\t" ($1 - $2) }' /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns > /home/pseema/denovo_analysis/result_files/unique_genes/diff_columns

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns_isolates /home/pseema/denovo_analysis/result_files/unique_genes/diff_columns > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_gene_diff

#Print content between two patterns
echo "*****Isolate-specific unique protein*****"
awk '/Proteins unique to/ {flag=1;next} /Unique protein search/{flag=0} flag {print}' all_isolate_gene_profile && awk '/Unique protein search for/' all_isolate_gene_profile

#To find the common genes in all the files
echo "The core genes are......"
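#Read the isolate names from the list file (a bare "for isolate" would loop over the script arguments instead)
while read isolate;
do
#Note: ">" overwrites the output on every pass, so the file ends up holding only the last isolate's de-duplicated list
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins > /home/pseema/denovo_analysis/result_files/unique_genes/indispensable_genes
done < /home/pseema/denovo_analysis/input_files/isolate_list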

#To find the shared genes in all the files (it checks from folder to folder to find the shared genes)
echo "The shared genes are......"
#To get rid of backup files
#find . -name '*~' -exec rm {} \;
#Pass the files to awk directly so that FILENAME is set (it is empty when awk reads from a pipe)
#File names are joined with "," because the paths themselves contain "/"
awk 'END {
for (R in rec) {
    n = split(rec[R], t, ",")
    if (n > 1)
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }
}
{
rec[$0] = rec[$0] ? rec[$0] "," FILENAME : FILENAME
}' /home/pseema/denovo_analysis/result_files/*.only_functional_proteins_sorted
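
The core/shared gene idea in the script above (how many isolate files does each protein name appear in?) can be sketched in Python as follows. This is only an illustration under the assumption that each file has already been read into a list of protein names. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_file_count(files_to_names):
    """Group protein names by the number of files that contain them."""
    name_to_files = defaultdict(set)
    for filename, names in files_to_names.items():
        for name in names:
            name_to_files[name].add(filename)
    by_count = defaultdict(list)
    for name, files in name_to_files.items():
        by_count[len(files)].append(name)
    return {count: sorted(names) for count, names in by_count.items()}


if __name__ == "__main__":
    # Hypothetical isolates and protein names, for illustration only
    data = {
        "isolate_A": ["enolase", "catalase", "urease"],
        "isolate_B": ["enolase", "catalase"],
        "isolate_C": ["enolase", "lipase"],
    }
    groups = group_by_file_count(data)
    print("core genes:", groups.get(len(data), []))
    for count in sorted(groups):
        print(count, groups[count])
```

Names found in every file are the candidate core genes, and names with a count between 2 and the number of files minus 1 are shared by only some isolates. Unlike the awk report, this sketch also keeps the names that occur in a single file.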
-----------------------------------------
#!/bin/bash
#Wrapper to call all related scripts
#Code to find out unique genes
sh unique_genes_finding.sh |& tee all_isolate_gene_profile
#sh unique_genes_finding.sh |& tee IO_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAS_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAI_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAM_isolate_gene_profile

#Code to analyze data for unique genes
sh unique_genes_analysis.sh |& tee all_isolate_gene_analysis

Monday, May 2, 2016

Tools to learn and work to do......

Alignment...
Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq, RNA-seq, ribosome profiling and others.

Bowtie stores the reference genome sequence in an extremely economical data structure called the FM index, which allows the genome to be searched rapidly.
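
To get a feel for how an FM index supports fast searching, here is a small teaching sketch in Python. It builds the Burrows-Wheeler transform with a naive suffix sort and counts exact pattern occurrences by backward search. This is not how Bowtie is implemented (real tools use compressed, sampled tables and far better construction algorithms), and the function names are my own.

```python
def build_fm_index(text):
    """Build a naive FM index (C table and Occ table) for text."""
    text += "$"  # sentinel: assumed absent from text and smaller than every other symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return c_table, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact matches of pattern using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of rows in the sorted suffix list
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(count_occurrences("GATTACA", index))
    print(count_occurrences("CC", index))
```

The search takes one step per pattern character regardless of the genome length, which is the property that makes this kind of index attractive for read alignment.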

TopHat uses Bowtie as an alignment ‘engine’

Mauve?
#To run the Mauve GUI from within Terminal
#Add directory with executables to Mauve path
cd Mauve/
ls
cd mauve_2.3.1/
./Mauve
File
Align with progressive Mauve
Select the executable folder (by navigation)
Mauve Console starts running (1-2 minutes for two full genomes)
Viewing the alignment
Zoom in    Ctrl + Up
Scroll display left    Ctrl + Left
Scroll display right    Ctrl + Right
Large left scroll    Shift + Ctrl + Left
Large right scroll    Shift + Ctrl + Right
Tool ---------> Export ---------> Export SNPs

Indel determination..

What's the logic used to pull information from a VCF file?
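
One possible answer, as a rough Python sketch: in a VCF data line the first five tab-separated columns are CHROM, POS, ID, REF and ALT, and comparing the lengths of REF and each ALT allele separates SNPs from insertions and deletions. The function name is hypothetical, and symbolic alleles such as <DEL>, "*" or "." are not handled here.

```python
def classify_vcf_line(line):
    """Return a list of (chrom, pos, ref, alt, kind) for one VCF line."""
    if line.startswith("#"):
        return []  # header and meta-information lines carry no variant
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt_field = fields[:5]
    calls = []
    for alt in alt_field.split(","):  # ALT may list several alleles
        if len(ref) == 1 and len(alt) == 1:
            kind = "SNP"
        elif len(alt) > len(ref):
            kind = "insertion"
        elif len(alt) < len(ref):
            kind = "deletion"
        else:
            kind = "MNP"  # same length, more than one base
        calls.append((chrom, int(pos), ref, alt, kind))
    return calls


if __name__ == "__main__":
    # Made-up records, for illustration only
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",
        "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
        "chr1\t300\t.\tA\tACG,T\t50\tPASS\t.",
    ]
    for row in example:
        for call in classify_vcf_line(row):
            print(call)
```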

RPS-BLAST

#Reverse Position-Specific BLAST (RPS-BLAST), used at the command line

#extract just these *.smp files from the large archive (cdd.tar.gz).

#run the formatrpsdb tool to build a database:

formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0

#This creates eight files (Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps), which together make up the database.

#Compare

rpsblast -i rpoD.faa -d Sigma -e 0.00001

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml

#If comparing with Pfam database

rpsblast -i rpoD.faa -d Pfam -e 0.00001

#comparing entire genome with the Sigma database made earlier.

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

#Analyzing RPS-BLAST output with Biopython

#For the smaller xml file

from Bio.Blast import NCBIXML

#Print every query, each matching domain and each HSP
for record in NCBIXML.parse(open("rpoD.xml")):
    print("QUERY: %s" % record.query)
    for align in record.alignments:
        print(" MATCH: %s..." % align.title[:60])
        for hsp in align.hsps:
            print(" HSP, e=%f, from position %i to %i"
                  % (hsp.expect, hsp.query_start, hsp.query_end))
            #Show short alignments in full and truncate long ones
            if hsp.align_length < 60:
                print(" Query: %s" % hsp.query)
                print(" Match: %s" % hsp.match)
                print(" Sbjct: %s" % hsp.sbjct)
            else:
                print(" Query: %s..." % hsp.query[:57])
                print(" Match: %s..." % hsp.match[:57])
                print(" Sbjct: %s..." % hsp.sbjct[:57])
print("Done")

#For the large xml file

from Bio.Blast import NCBIXML

for record in NCBIXML.parse(open("NC_003197.xml")):
    #We want to ignore any queries with no search results:
    if record.alignments:
        print("QUERY: %s..." % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                print(" %s HSP, e=%f, from position %i to %i"
                      % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
print("Done")

That should give you the following output - note there is only

#Running RPS-BLAST from Biopython

#Adjust the file locations to match your own:
rpsblast_db = "C:\\Blast\\cdd\\Sigma"
rpsblast_exe = "C:\\Blast\\bin\\rpsblast.exe"
query_filename = "rpoD.faa"
#query_filename = "NC_003197.faa"
E_VALUE_THRESH = 0.00001 #Adjust the expectation cut-off here

#Bio.Blast.NCBIStandalone ships only with older Biopython releases
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML

output_handle, error_handle = NCBIStandalone.rpsblast(
    rpsblast_exe, rpsblast_db, query_filename, expectation=E_VALUE_THRESH)

for record in NCBIXML.parse(output_handle):
    #We want to ignore any queries with no search results:
    if record.alignments:
        print("QUERY: %s..." % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                print(" %s HSP, e=%f, from position %i to %i"
                      % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
                assert hsp.expect <= E_VALUE_THRESH
print("Done")
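
If the NCBIStandalone module is not available in your Biopython install, one alternative is to call the executable through Python's subprocess module. The sketch below only reuses the command-line flags already shown in these notes (-i, -d, -e, -m 7), so it assumes the same legacy rpsblast program. The function names are my own and nothing here is an official Biopython interface.

```python
import subprocess


def build_rpsblast_command(exe, database, query, evalue=0.00001):
    """Assemble the rpsblast call with the flags used in the notes above."""
    return [exe, "-i", query, "-d", database, "-e", str(evalue), "-m", "7"]


def run_rpsblast(exe, database, query, evalue=0.00001):
    """Run rpsblast and return its XML output as text (needs rpsblast installed)."""
    command = build_rpsblast_command(exe, database, query, evalue)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == "__main__":
    # Only print the command here, since running it needs a local BLAST install
    print(" ".join(build_rpsblast_command("rpsblast", "Sigma", "rpoD.faa")))
```

The returned XML text can be wrapped in io.StringIO and handed to NCBIXML.parse in place of a file handle.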