Monday, September 26, 2016

Bone diseases: Information and hypotheses.........

The SOST gene (which encodes sclerostin) acts to lower bone mass; its deficiency induces lifelong bone gain.
Some old people, who normally have osteoporosis (low bone mass), still show bone growth here and there; this may be due to dysregulation of the SOST gene.
Bone overgrowth diseases: sclerosteosis, van Buchem disease, and autosomal dominant craniodiaphyseal dysplasia

Hormonal drugs can interfere with bones, rendering them porous.

Monday, July 25, 2016

Breast cancer: Information and hypotheses.........

Breast tissue is made of adipose cells and lymphatic glands..
The suspected gene BRCA1 is on chromosome 17 and BRCA2 is on chromosome 13
ER (estrogen) -positive cancer:
Luminal A cell lines: MCF-7 and T47D
Metastatic (triple-negative, i.e. not ER-positive): MDA-MB-231

Common endocrine-disrupting agents: aluminum, parabens, triclosan, phthalates, perfumes
#Mammography detects tumors in breast tissue. However, its radiation is not entirely risk-free: accumulated radiation doses can themselves initiate mutations.

#Persistent infection and allergen exposure can cause lymph glands to swell. Sensing danger, the aromatase enzyme will be overexpressed, producing estrogen, which can cause hyperplasia...

#It's good to know that some tumors disappear undetected and untreated. That means that if the source of inflammation or abuse is removed, the signalling is likely to correct itself and the immune attack is likely to subside. The proteasome complex might degrade the offending proteins.

#As prevention is always better than therapy, one must not abuse the body through a bad lifestyle.
A healthy lifestyle includes simple, minimally processed, balanced food, lower exposure to chemicals (pesticides, cosmetics, cleaning agents, food additives), physical activity, vitamin D from the sun, and a stress-free life.

Thursday, July 21, 2016

Cancer types and cell lines..........

Cancer is a heterogeneous disease. Above all, therapeutic success is unpredictable.
It is a result of inflammation caused by perturbed proteases, wreaking havoc with the normal functioning of the body. Pesticides and household endocrine disruptors are increasing the risk of cancer.
Personalized medicine is required for treatment, as the genetics of each individual is different.
Poor prognosis, metastasis, and a high relapse rate make cancer deadly.
Mapping the mutations, genes and their pathways can reveal a lot about cancer.
Diagnosis: Ultrasound, colonoscopy, mammography..
Current therapeutic strategies include: surgery (e.g., mastectomy), chemotherapy, radiotherapy, molecular targeted therapy

TCGA: The Cancer Genome Atlas consortium
MURINE......
26L5:  murine colon carcinoma
B16BL6:  murine melanoma,
murine Lewis lung carcinoma
HUMAN....
A375 : human melanoma
A498: Renal carcinoma
A549:  human lung adenocarcinoma
AMC-HN-4:  malignant human head and neck
BT474: human breast
ChaGo: human bronchogenic
CNE1:  nasopharyngeal carcinoma
DU145:  hormone-resistant prostate cancer
GBM : human glioblastoma
HCC: Hepatocellular carcinoma
HeLa: human cervix adenocarcinoma
Hep-G2:human liver
HT-1080:  human fibrosarcoma
KATO-III: human gastric
LNCaP: hormone-sensitive prostate cancer
MCF-7:  human breast cancer ERα+
mCRPC: metastatic castration resistant prostate cancer
PBMC: human peripheral blood mononuclear cell
PC-3:  human prostate cancer (androgen-independent)
SW620: human colon
U87MG:  human glioblastoma
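Normal cell lines (control)
CH-liver
HCT116: human colorectal carcinoma (a cancer line, not a normal control)
HS27: fibroblast
HT29: human colorectal adenocarcinoma (a cancer line, not a normal control)
SW480 cells: human colorectal adenocarcinoma (a cancer line, not a normal control)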

Human genes associated with different cancers/cancer-associated genes:
Colon: BCL9L, RBM10, CTCF, and KLF5
Cervical adenocarcinoma:
Breast cancer: BRCA1 and BRCA2
Ovarian cancer: BRCA1 and BRCA2
Well-known cancer pathways
Wnt pathway
Canonical
Wnt binds to its receptor Frizzled and the co-receptor LRP-5/6.
This suppresses GSK-3ß phosphorylation of ß-catenin, so ß-catenin is no longer degraded.
ß-Catenin accumulates and moves into the nucleus.
There it binds to LEF/TCF transcription factors, which activate Wnt target genes.
Non-canonical
Wnt signals independently of ß-catenin: it binds Frizzled and/or receptor tyrosine kinase co-receptors (e.g., ROR2, RYK), and Dishevelled relays the signal downstream.

Tuesday, May 3, 2016

Allergens: Types, sources.......

GENERAL
#######PLANTS#####
Ole e: Olea europaea (Common olive)
Sin a: Sinapis alba (White mustard)
2S albumin: Ricinus communis (Castor bean)
Pectate lyase: Cryptomeria japonica (Japanese cedar) (Cupressus japonica)
Expansin-B1: Zea mays (Maize)
Superoxide dismutase: Olea europaea (Common olive)
Small rubber particle protein: Hevea brasiliensis (Para rubber tree)
Exopolygalacturonase: Platanus acerifolia (London plane tree)
Major pollen allergen Bet v 1-A: Betula pendula (European white birch) (Betula verrucosa)
Profilin-2: Phleum pratense (Common timothy)
Pectinesterase 1: Olea europaea (Common olive)
Non-specific lipid-transfer protein:  Ambrosia artemisiifolia (Short ragweed)
Profilin-1 : Phleum pratense (Common timothy)
Pectate lyase 1: Ambrosia artemisiifolia (Short ragweed)
Pectate lyase 2: Ambrosia artemisiifolia (Short ragweed)
Bet v 1-L: Betula pendula (European white birch) (Betula verrucosa)
Amb a 3: Ambrosia artemisiifolia var. elatior (Short ragweed)
Pectinesterase 2: Olea europaea (Common olive)
Phl p 5b: Phleum pratense (Common timothy)
Polygalacturonase: Cryptomeria japonica (Japanese cedar)
Expansin-B11: Zea mays (Maize)
Lol p 1: Lolium perenne (Perennial ryegrass)
Profilin-4: Corylus avellana (European hazel) (Corylus maxima)
Actinidain: Actinidia deliciosa (Kiwi)
Polygalacturonase: Juniperus ashei (Ozark white cedar)
Esterase: Hevea brasiliensis (Para rubber tree)
Protein DOWNSTREAM OF FLC: Arabidopsis thaliana (Mouse-ear cress)
Major allergen Api g 1: Apium graveolens (Celery)
Alpha-amylase inhibitor BMAI-1: Hordeum vulgare (Barley)
Superoxide dismutase [Cu-Zn]: Olea europaea (Common olive)
Lactoylglutathione lyase: Oryza sativa subsp. japonica (Rice)
Profilin-1: Zea mays (Maize)
Ambrosia artemisiifolia (Short ragweed)
Bra j 1-E: Brassica juncea (Indian mustard) (Sinapis juncea)
Glucan endo-1,3-beta-glucosidase: Prunus avium (Cherry)
Non-specific lipid-transfer protein: Apium graveolens (Celery)
Dau c 1: Daucus carota (Wild carrot)
Pollen allergen KBG 41: Poa pratensis (Kentucky bluegrass)
Lol p 5a: Lolium perenne (Perennial ryegrass)
Profilin-2-5: Olea europaea (Common olive)
######FUNGI#####
60S acidic ribosomal protein P2: Alternaria alternata (Alternaria rot fungus)
Alcohol dehydrogenase 1: Candida albicans (Yeast)
Enolase:  Cladosporium herbarum
Glucoamylase: Trichophyton mentagrophytes
Cla h 7: Cladosporium herbarum
Ribonuclease mitogillin: Aspergillus fumigatus
Fructose-bisphosphate aldolase: Candida albicans (strain SC5314 / ATCC MYA-2876) (Yeast)
60S acidic ribosomal protein P2: Cladosporium herbarum
Enolase: Alternaria alternata (Alternaria rot fungus)

######NEMATODE#####
Polyprotein ABA-1: Ascaris suum (Pig roundworm)
Major allergen Ani s 1: Anisakis simplex (Herring worm)
######ARTHROPODS#####
Pilosulin-3a: Myrmecia pilosula (Jack jumper ant) (Australian jumper ant)
Peptidase 1: Psoroptes ovis (Sheep scab mite)
Hyaluronidase A: Vespula vulgaris (Yellow jacket) (Wasp)
Eur m 3: Euroglyphus maynei (Mayne's house dust mite)
Peptidase 1: Dermatophagoides pteronyssinus (European house dust mite)
Mite group 2 allergen Lep d: Lepidoglyphus destructor (Storage mite)
Peptidase 1: Dermatophagoides farinae (American house dust mite)
Mite group 2 allergen Der p 2: Dermatophagoides pteronyssinus (European house dust mite)
Phospholipase A1: Solenopsis invicta (Red imported fire ant)
Melittin: Apis mellifera (Honeybee)
Pilosulin-1: Myrmecia pilosula (Jack jumper ant) (Australian jumper ant)
Hyaluronidase: Apis mellifera (Honeybee)
Aspartic protease Bla g 2: Blattella germanica (German cockroach) (Blatta germanica)
Peptidase 1: Euroglyphus maynei (Mayne's house dust mite)
Phospholipase A1 1: Dolichovespula maculata (Bald-faced hornet)
Venom allergen 3: Solenopsis invicta (Red imported fire ant)
Der p 3: Dermatophagoides pteronyssinus (European house dust mite)
Der f 3: Dermatophagoides farinae (American house dust mite)
Arginine kinase AK: Penaeus monodon (Giant tiger prawn)
Venom dipeptidyl peptidase 4: Apis mellifera (Honeybee)
Phospholipase A1: Vespula maculifrons (Eastern yellow jacket) (Wasp)
######FISH#####
Parvalbumin beta: Gadus morhua subsp. callarias (Baltic cod) 
######BIRDS#####
Ovalbumin: Gallus gallus (Chicken)
Ovotransferrin: Gallus gallus (Chicken)
Lysozyme C: Gallus gallus (Chicken)
Ovomucoid: Gallus gallus (Chicken)
######MAMMALS#####
Minor allergen Can f 2: Canis lupus familiaris (Dog) (Canis familiaris)
Major allergen I polypeptide chain: Felis catus (Cat)
Allergen Fel d 4: Felis catus (Cat) (Felis silvestris catus)
Major urinary protein: Rattus norvegicus (Rat)
Allergen Bos d 2: Bos taurus (Bovine)
Protein S100-A7: Bos taurus (Bovine)
Latherin : Equus caballus (Horse)
Major allergen Equ c 1: Equus caballus (Horse)
-----------------
SPECIFIC
#Cashew, Pistachio
Vicilin-like protein, 2S albumin, Ana o 2, 11S globulin
#Almond, peach
pru1, Pru du, Non-specific lipid-transfer protein
#Tomato
Profilin, pectate lyase
#Peanut
Conglutin-7, Defensin, Ara h, Profilin, Non-specific lipid-transfer protein
#Avocado
Endochitinase
#Kiwi
Actinidain, Cysteine proteinase inhibitor, Thaumatin-like protein, Act d
Kiwellin, Kirola, Non-specific lipid-transfer protein, Endochitinase, Bet v
#Persimmon
Expansin, Non-specific lipid-transfer protein
#Celery
Non-specific lipid-transfer protein, Chlorophyll a-b binding protein, Api g, Profilin
#Kidney bean
Pathogenesis-related protein 1
Pectate lyase
#Egg
Ovalbumin
Ovotransferrin
Lysozyme C
Ovomucoid
Serum albumin
#Shrimp, lobster
Tropomyosin
Arginine kinase
Pen a
Lit v
Sarcoplasmic calcium-binding protein
#Mussel
Tropomyosin
Endo-beta-1,4-glucanase
#Fish
Alpha-enolase
Beta-enolase
Parvalbumin beta
Fructose-bisphosphate aldolase A
#Octopus 
Arginine kinase
#Silk moth
SCP-related protein
Arginine kinase
Apolipoprotein of lipid transfer
#Rubber
Patatin

MY SCRIPT (2): Unique genes finding, their analysis, wrapper..

#!/bin/sh
#Code to find out unique genes
#Run as: sh unique_genes_finding.sh |& tee all_isolate_gene_profile

#Create the output directory first, so that the cp commands below copy files into it
mkdir -p /home/pseema/denovo_analysis/result_files/unique_genes
#find /home/pseema/denovo_analysis/result_files/*.only_header
while read strain;
do
while read isolate;
do

echo "#################Starting $isolate..####################"
#Extract all columns except column1
awk '{$1=""; print $0}' /home/pseema/denovo_analysis/result_files/$isolate.only_header > /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name
echo "****Total number of proteins in $isolate: ******"
cat /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name | wc -l
awk '!/hypothetical/' /home/pseema/denovo_analysis/result_files/$isolate.only_protein_name  >  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins
echo "******Number of non-hypothetical proteins in $isolate: *****"
cat  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins | wc -l
sort -u  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins > /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted

#Shows common proteins to file 1 and file2 (option -12 or -21 can be used to achieve it)
echo "**Proteins common to $strain and $isolate: **"
comm -12  /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate
cat /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate
cp /home/pseema/denovo_analysis/result_files/in_both.$strain.$isolate  /home/pseema/denovo_analysis/result_files/unique_genes
echo "**Proteins common to $strain and $isolate done**"

#These proteins occur only in $strain (only column1)
echo "**Proteins unique to $strain: **"
comm -23  /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/not_in.$isolate
cat /home/pseema/denovo_analysis/result_files/not_in.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/not_in.$isolate
cp /home/pseema/denovo_analysis/result_files/not_in.$isolate  /home/pseema/denovo_analysis/result_files/unique_genes
echo "Unique protein search for $strain done"


#These proteins occur only in $isolate (only column2)
echo "**Proteins unique to $isolate: **"
comm -13  /home/pseema/denovo_analysis/result_files/$strain.only_functional_proteins_sorted  /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins_sorted > /home/pseema/denovo_analysis/result_files/only_in.$isolate
cat /home/pseema/denovo_analysis/result_files/only_in.$isolate | wc -l
cat /home/pseema/denovo_analysis/result_files/only_in.$isolate
cp /home/pseema/denovo_analysis/result_files/only_in.$isolate  /home/pseema/denovo_analysis/result_files/unique_genes
echo "Unique protein search for $isolate done"

echo "********$isolate done********"
done < /home/pseema/denovo_analysis/input_files/isolate_list
#done < /home/pseema/denovo_analysis/input_files/IO_isolates
#done < /home/pseema/denovo_analysis/input_files/EAS_isolates
#done < /home/pseema/denovo_analysis/input_files/EAI_isolates
#done < /home/pseema/denovo_analysis/input_files/EAM_isolates

done < /home/pseema/denovo_analysis/input_files/strain_list
#done < /home/pseema/denovo_analysis/input_files/IO_isolates
#done < /home/pseema/denovo_analysis/input_files/EAS_isolates
#done < /home/pseema/denovo_analysis/input_files/EAI_isolates
#done < /home/pseema/denovo_analysis/input_files/EAM_isolates
-----------------------------------------------------
#!/bin/sh
#Code to analyze data for unique genes
#Execute as:  sh unique_genes_analysis.sh |& tee all_isolate_gene_analysis
#mkdir /home/pseema/denovo_analysis/result_files/unique_genes
#find *.matches_comm_12 |  wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/in_both.*` > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common
echo "Common protein pool when the isolates were compared to each other..."
#cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_reduced
echo "Unique proteins in the common protein pool..."
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_common_reduced | wc -l

#find *.matches_comm_23 |  wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/not_in.*` > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1 | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1 > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq_reduced
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column1_uniq_reduced  | wc -l
#find *.matches_comm_13 |  wc -l
cat `find /home/pseema/denovo_analysis/result_files/unique_genes/only_in.*`> /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2 | wc -l
uniq /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2  > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq | wc -l
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq > /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq_reduced
cat /home/pseema/denovo_analysis/result_files/unique_genes/all_isolates_only_column2_uniq_reduced  | wc -l

#Find lines to a given pattern
awk '/Proteins unique to/'  all_isolate_gene_profile > /home/pseema/denovo_analysis/result_files/unique_genes/pattern_files

#Find lines next to a given pattern
awk 'f{print;f=0} /Proteins unique to/{f=1}' all_isolate_gene_profile > /home/pseema/denovo_analysis/result_files/unique_genes/next_lines

#Paste these two files side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/pattern_files /home/pseema/denovo_analysis/result_files/unique_genes/next_lines > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes

#Extract only column 4
awk '{print $4}' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate
#find difference between two consecutive lines in the generated file
#Extract only odd number lines
awk 'NR%2==1' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate  > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_odd

#Extract only even number lines
awk 'NR%2==0' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate  > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_even

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_odd /home/pseema/denovo_analysis/result_files/unique_genes/isolate_diff_unique_genes_only_isolate_only_even > /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns_isolates

#Find difference between two consecutive lines in the generated file
#Extract only odd number lines
awk 'NR%2==1' /home/pseema/denovo_analysis/result_files/unique_genes/next_lines  > /home/pseema/denovo_analysis/result_files/unique_genes/only_odd

#Extract only even number lines
awk 'NR%2==0' /home/pseema/denovo_analysis/result_files/unique_genes/next_lines  > /home/pseema/denovo_analysis/result_files/unique_genes/only_even

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/only_odd /home/pseema/denovo_analysis/result_files/unique_genes/only_even > /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns
#Find difference between two columns of the file
awk 'NF > 0 { print $0 "\t" ($1 - $2) }' /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns > /home/pseema/denovo_analysis/result_files/unique_genes/diff_columns

#Paste the extracted columns side by side
paste -d' ' /home/pseema/denovo_analysis/result_files/unique_genes/merged_columns_isolates /home/pseema/denovo_analysis/result_files/unique_genes/diff_columns > /home/pseema/denovo_analysis/result_files/unique_genes/isolate_gene_diff

#Print content between two patterns
echo "*****Isolate-specific unique protein*****"
awk '/Proteins unique to/ {flag=1;next} /Unique protein search/{flag=0} flag {print}' all_isolate_gene_profile && awk '/Unique protein search for/' all_isolate_gene_profile

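#To find the common genes in all the files
echo "The core genes are......"
#Pool the de-duplicated protein names of every isolate into one file.
#Note: this pooled list is only the starting point; names present in every isolate still have to be counted from it.
: > /home/pseema/denovo_analysis/result_files/unique_genes/indispensable_genes
while read isolate
do
awk '!NF || !seen[$0]++' /home/pseema/denovo_analysis/result_files/$isolate.only_functional_proteins  >> /home/pseema/denovo_analysis/result_files/unique_genes/indispensable_genes
done < /home/pseema/denovo_analysis/input_files/isolate_list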

#To find the shared genes in all the files (it checks from folder to folder to find the shared genes)
echo "The shared genes are......"
#To get rid of backup files
#find . -name '*~' -exec rm {} \;
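#The files are passed to awk directly (not through cat) so that FILENAME is set.
#";" is used as the separator because the file paths themselves contain "/".
awk '{
  # Remember every file in which this protein name occurs
  rec[$0] = rec[$0] ? rec[$0] ";" FILENAME : FILENAME
  }
END {
  for (R in rec) {
    n = split(rec[R], t, ";")
    if (n > 1)
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }
  }' /home/pseema/denovo_analysis/result_files/*.only_functional_proteins_sorted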
  -----------------------------------------
#!/bin/bash
#Wrapper to call all related scripts (run it with bash, since |& is a bash feature)
#Code to find out unique genes
sh unique_genes_finding.sh |& tee all_isolate_gene_profile
#sh unique_genes_finding.sh |& tee IO_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAS_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAI_isolate_gene_profile
#sh unique_genes_finding.sh |& tee EAM_isolate_gene_profile

#Code to analyze data for unique genes
sh unique_genes_analysis.sh |& tee all_isolate_gene_analysis
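
#The same comparisons in Python (a sketch, not the scripts above)
The comm -12 / -23 / -13 steps used in the scripts above are plain set operations, so the logic can also be sketched with Python sets. This is only a minimal sketch: the function names and the toy protein names are made up for illustration, and in practice the lists would be read from the *.only_protein_name files. The core genes are taken here as the intersection of all isolates.

```python
def read_names(lines):
    """Return the set of non-empty, non-hypothetical protein names."""
    return {ln.strip() for ln in lines
            if ln.strip() and "hypothetical" not in ln}


def compare(strain, isolate):
    """Mimic comm -12, -23 and -13 on two sets of protein names."""
    return {
        "in_both": strain & isolate,        # comm -12
        "only_strain": strain - isolate,    # comm -23
        "only_isolate": isolate - strain,   # comm -13
    }


def core_genes(name_sets):
    """Names present in every set (intersection across all isolates)."""
    sets = list(name_sets)
    return set.intersection(*sets) if sets else set()


# Hypothetical toy data standing in for the per-isolate name files
strain = read_names(["DNA gyrase subunit A", "hypothetical protein", "catalase"])
isolate_a = read_names(["catalase", "ESX-1 secretion protein", "DNA gyrase subunit A"])
isolate_b = read_names(["catalase", "lipase"])

result = compare(strain, isolate_a)
core = core_genes([strain, isolate_a, isolate_b])
print("common:", sorted(result["in_both"]))
print("only in isolate:", sorted(result["only_isolate"]))
print("core:", sorted(core))
```

Sets also remove duplicates automatically, so the separate sort -u and uniq steps are not needed in this form.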

Monday, May 2, 2016

Tools to learn and work to do......

Alignment...
Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq, RNA-seq, ribosome profiling and others.
Bowtie  uses an extremely economical data structure called the FM index to store the reference genome sequence and allows it to be searched rapidly. 
TopHat uses Bowtie as an alignment ‘engine’ 
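
#How an FM index search works (toy sketch)
To get a feel for why the FM index allows rapid searching, here is a small sketch of the idea: build the Burrows-Wheeler transform of the reference and count exact matches of a read by backward search. It is only an illustration under simplifying assumptions (naive suffix sorting, uncompressed tables, exact matches only); Bowtie's real index is far more compact and also handles mismatches. The function names and the toy reference are made up.

```python
def build_fm_index(reference):
    """Build a toy FM index (C table and occurrence table) for a reference."""
    text = reference + "$"  # "$" marks the end and sorts before A, C, G, T
    # Naive suffix array: fine for a toy example, too slow for a genome
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that are smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count_matches(read, first, occ, n):
    """Count exact occurrences of read using backward search."""
    lo, hi = 0, n  # half-open range of rows in the sorted suffix list
    for ch in reversed(read):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


first, occ, n = build_fm_index("ACGTACGTGACG")
print(count_matches("ACG", first, occ, n))
print(count_matches("GT", first, occ, n))
print(count_matches("TT", first, occ, n))
```

The search takes one step per base of the read, regardless of how long the reference is, which is the property that makes this kind of index attractive for short-read alignment.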

Mauve?
#To run the Mauve GUI from within Terminal 
#Add directory with executables to Mauve path
cd Mauve/
ls
cd mauve_2.3.1/
./Mauve 

File
Align with progressive Mauve
Select the executable folder (by navigation)
Mauve Console starts running (1-2 minutes for two full genomes)
 Viewing the alignment
Zoom in: Ctrl + Up
Scroll display left: Ctrl + Left
Scroll display right: Ctrl + Right
Large left scroll: Shift + Ctrl + Left
Large right scroll: Shift + Ctrl + Right
Tool ---------> Export ---------> Export SNPs   

Indel determination..
What's the logic used to pull indel information from a VCF file?
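
One possible logic, as a sketch: in a VCF record the REF and ALT alleles sit in columns 4 and 5, and an allele is an indel when REF and ALT differ in length (ALT longer = insertion, ALT shorter = deletion). The function names and the toy records below are made up for illustration; real pipelines usually rely on a dedicated VCF library or bcftools, and symbolic alleles such as <DEL> need separate handling.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair by comparing allele lengths."""
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"  # missing, spanning-deletion, symbolic or breakend allele
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def indels_from_vcf(lines):
    """Yield (chrom, pos, ref, alt, kind) for every indel allele."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):  # multi-allelic sites list several ALTs
            kind = classify_variant(ref, alt)
            if kind in ("insertion", "deletion"):
                yield chrom, pos, ref, alt, kind


# Hypothetical toy records (tab-separated, as in a real VCF body)
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tA\tATG\t50\tPASS\t.",
    "chr1\t300\t.\tCTT\tC,CT\t50\tPASS\t.",
]
indels = list(indels_from_vcf(toy_vcf))
for record in indels:
    print(record)
```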


RPS-BLAST
#Reversed Position Specific BLAST (RPS-BLAST), used at the command line
#extract just these *.smp files from the large archive (cdd.tar.gz).
#run the formatrpsdb tool to build a database:
formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0
#creates the eight files i.e. Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps which together make up the database.
#Compare
rpsblast -i rpoD.faa -d Sigma -e 0.00001
rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt
rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml
#If comparing with Pfam database
rpsblast -i rpoD.faa -d Pfam -e 0.00001
#comparing entire genome with the Sigma database made earlier.
rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt
rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

#Analyzing RPS-BLAST output with Biopython
#For the smaller xml file
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("rpoD.xml")):
    print("QUERY: %s" % record.query)
    for align in record.alignments:
        print(" MATCH: %s..." % align.title[:60])
        for hsp in align.hsps:
            print(" HSP, e=%f, from position %i to %i"
                  % (hsp.expect, hsp.query_start, hsp.query_end))
            # Show short alignments in full, truncate long ones
            if hsp.align_length < 60:
                print(" Query: %s" % hsp.query)
                print(" Match: %s" % hsp.match)
                print(" Sbjct: %s" % hsp.sbjct)
            else:
                print(" Query: %s..." % hsp.query[:57])
                print(" Match: %s..." % hsp.match[:57])
                print(" Sbjct: %s..." % hsp.sbjct[:57])
print("Done")


#For the large xml file
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("NC_003197.xml")):
    #We want to ignore any queries with no search results:
    if record.alignments:
        print("QUERY: %s..." % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                print(" %s HSP, e=%f, from position %i to %i"
                      % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
print("Done")



#Running RPS-BLAST from Biopython
#Adjust the file locations to match your own:
rpsblast_db = "C:\\Blast\\cdd\\Sigma"
rpsblast_exe = "C:\\Blast\\bin\\rpsblast.exe"

query_filename = "rpoD.faa"
#query_filename = "NC_003197.faa"

E_VALUE_THRESH = 0.00001 #Adjust the expectation cut-off here

#Bio.Blast.NCBIStandalone is no longer part of current Biopython, so rpsblast is called directly here.
#The options are the same legacy ones used on the command line above (-m 7 requests XML output).
import io
import subprocess
from Bio.Blast import NCBIXML

completed = subprocess.run(
    [rpsblast_exe, "-i", query_filename, "-d", rpsblast_db,
     "-e", str(E_VALUE_THRESH), "-m", "7"],
    capture_output=True, check=True)
output_handle = io.BytesIO(completed.stdout)

for record in NCBIXML.parse(output_handle):
    #We want to ignore any queries with no search results:
    if record.alignments:
        print("QUERY: %s..." % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                print(" %s HSP, e=%f, from position %i to %i"
                      % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
                assert hsp.expect <= E_VALUE_THRESH
print("Done")

Friday, April 29, 2016

Python: Codes ........

Practical Computing for Biologists by Cliburn Chan
#To clone or copy a list in python
import copy
class Foo(object):
    def __init__(self, val):
         self.val = val
    def __repr__(self):
        return str(self.val)

foo = Foo(1)
a = ['foo', foo]
b = a[:]
c = list(a)
d = copy.copy(a)
e = copy.deepcopy(a)

# edit original list and instance
a.append('baz')
foo.val = 5

print('original: %r\n slice: %r\n list(): %r\n copy: %r\n deepcopy: %r'
      % (a, b, c, d, e))
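#Creating a list of lists (each sublist is a separate object, so changing one leaves the others untouched)
myList = [[1] * 4 for n in range(3)]
#lst1 = [1]*4; lst = [lst1]*3   #here all three sublists would be the same object
print(myList)
myList[0][0] = 5
print(myList)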
#check if a word is a palindrome
word = input("Enter a word: ")
if word == word[::-1]:
    print("%s is a palindrome!" % word)
else:
    print("%s is not a palindrome" % word)
#count the number of vowels in a word
name = input("What's your name? ")
num_vowels = 0
for vowel in 'aeiou':
    num_vowels += name.count(vowel)
print("Hello %s, there are %d vowels in your name." % (name, num_vowels))

#Substitution of regex
import re
find = r'(\d+)\s+(\w{3})[\w\,\.]*\s+(\d+)\sat\s(\d+):(\d+)\s+([-\d\.]+)\s+([-\d\.]+).*'
replace = r'\3\t\2.\t\1\t\4\t\5\t\6\t\7'
for line in open('examples/Ch3observations.txt'):
    newline = re.sub(find, replace, line)
    print(newline, end='')
#Fitting the curve (Using 4 parameters, logistic equation). A plot will be generated
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
def logistic4(x, A, B, C, D):
    """4PL logistic equation."""
    return ((A-D)/(1.0+((x/C)**B))) + D
def residuals(p, y, x):
    """Deviations of data from fitted 4PL curve"""
    A,B,C,D = p
    err = y-logistic4(x, A, B, C, D)
    return err
def peval(x, p):
    """Evaluated value at x with current parameters."""
    A,B,C,D = p
    return logistic4(x, A, B, C, D)
# Make up some data for fitting and add noise
# In practice, y_meas would be read in from a file
x = np.linspace(0,20,20)
A,B,C,D = 0.5,2.5,8,7.3
y_true = logistic4(x, A, B, C, D)
y_meas = y_true + 0.2*npr.randn(len(x))
# Initial guess for parameters
p0 = [0, 1, 1, 1]
# Fit equation using least squares optimization
plsq = leastsq(residuals, p0, args=(y_meas, x))
# Plot results
plt.plot(x,peval(x,plsq[0]),x,y_meas,'o',x,y_true)
plt.title('Least-squares 4PL fit to noisy data')
plt.legend(['Fit', 'Noisy', 'True'], loc='upper left')
for i, (param, actual, est) in enumerate(zip('ABCD', [A,B,C,D], plsq[0])):
    plt.text(10, 3-i*0.5, '%s = %.2f, est(%s) = %.2f' % (param, actual, param, est))
plt.savefig('logistic.png')
#Simulation-based statistics (bootstrap and permutation resampling)
#Sampling without replacement
import numpy.random as npr
npr.random(5)
npr.random((3,4))
npr.normal(5, 1, 4)
npr.randint(1, 7, 10)
npr.uniform(1, 7, 10)
npr.binomial(n=10, p=0.2, size=(4,4))
x = [1,2,3,4,5,6]
npr.shuffle(x)
x
npr.permutation(10)
----------
#Sampling with replacement
import numpy as np
import numpy.random as npr
data = np.array(['tom', 'jerry', 'mickey', 'minnie', 'pocahontas'])
idx = npr.randint(0, len(data), (4,len(data)))
idx
samples_with_replacement = data[idx]
samples_with_replacement
#Bootstrapping (higher-order function)
import numpy as np
import numpy.random as npr
import pylab
def bootstrap(data, num_samples, statistic, alpha):
    """Returns bootstrap estimate of 100.0*(1-alpha) CI for statistic."""
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]  # resample the data passed in, not the global x
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])
if __name__ == '__main__':
    # data of interest is bimodal and obviously not normal
    x = np.concatenate([npr.normal(3, 1, 100), npr.normal(6, 2, 200)])
    # find the 95% CI of the mean from 100,000 bootstrap samples
    low, high = bootstrap(x, 100000, np.mean, 0.05)
    # make plots
    pylab.figure(figsize=(8,4))
    pylab.subplot(121)
    pylab.hist(x, 50, histtype='step')
    pylab.title('Histogram of data')
    pylab.subplot(122)
    pylab.plot([-0.03,0.03], [np.mean(x), np.mean(x)], 'r', linewidth=2)
    pylab.scatter(0.1*(npr.random(len(x))-0.5), x)
    pylab.plot([0.19,0.21], [low, low], 'r', linewidth=2)
    pylab.plot([0.19,0.21], [high, high], 'r', linewidth=2)
    pylab.plot([0.2,0.2], [low, high], 'r', linewidth=2)
    pylab.xlim([-0.2, 0.3])
    pylab.title('Bootstrap 95% CI for mean')
    pylab.savefig('examples/boostrap.png')

    # the same function gives the 95% CI of the standard deviation
    low, high = bootstrap(x, 100000, np.std, 0.05)
#Permutation sampling (to find p-value)
import numpy as np
import numpy.random as npr
import pylab
def permutation_resampling(case, control, num_samples, statistic):
    """Returns p-value that statistic for case is different
    from statistic for control."""
    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []
    for i in range(num_samples):
        xs = npr.permutation(combined)
        diff = statistic(xs[:num_case]) - statistic(xs[num_case:])
        diffs.append(diff)
    diffs = np.array(diffs)  # an array is needed for the element-wise comparisons below
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff))/float(num_samples)
    return pval, observed_diff, diffs
if __name__ == '__main__':
    # make up some data
    case = [94, 38, 23, 197, 99, 16, 141]
    control = [52, 10, 40, 104, 51, 27, 146, 30, 46]
    # find p-value by permutation resampling
    pval, observed_diff, diffs = \
        permutation_resampling(case, control, 10000, np.mean)
    # make plots
    pylab.title('Empirical null distribution for differences in mean')
    pylab.hist(diffs, bins=100, histtype='step', density=True)
    pylab.axvline(observed_diff, c='red', label='diff')
    pylab.axvline(-observed_diff, c='green', label='-diff')
    pylab.text(60, 0.01, 'p = %.3f' % pval, fontsize=16)
    pylab.legend()
    pylab.savefig('examples/permutation.png')
#Data visualization (for exploratory data analysis)
import numpy as np
import pylab
xs = np.loadtxt('anscombe.txt')
for i in range(4):
    x = xs[:,i*2]
    y = xs[:,i*2+1]
    # least-squares line y = m*x + c for this pair of columns
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    pylab.subplot(2,2,i+1)
    pylab.scatter(x, y)
    pylab.plot(x, m*x+c, 'r')
    pylab.axis([2,20,0,14])
pylab.savefig('anscombe.png')

#Working with relational databases (connect, execute, iterate)
import sqlite3
con = sqlite3.connect('pcfb.sqlite')
r = con.execute('select * from people')
for i in r:
    print(i)
r = con.execute('select p.name, e.name from people as p join experiment as e where e.researcher == p.id')
for i in r:
    print('Name: %s\n\tExperiment: %s' % (i[0], i[1]))





 #source code available from git repository. Need to self-compile
 #easy_install/pip installation 

Sunday, April 17, 2016

Allergy, inflammation, pesticides, pathogens, diseases, domains: Insights and my hypotheses........

Science is undergoing overhaul for good. Pre-established facts are being overthrown. 
In nature, everything is related. This ought to be kept in focus while hypothesizing.
#Allergy/inflammation/hormone insights
When similarity is high, proteins are isozymes or isoforms. When it is lower, they are different proteins but homology remains.
Check homology between chitin (NAGA-NAGA), peptidoglycan (NAGA-NAMA), cuticle.
Pollen, virus domains and bacterial surface proteins are the same.
Virus and cockroach allergen have same domain (kelch, jacalin)
Check similarity between gibberellin, auxin, insect pheromone, animal hormone,
The pathogenesis of pathogens, allergens and venoms is the same.........just the latter two can't replicate, so the victim might survive. Venom comes in a high dose, so it can coagulate blood and kill.
All irritants provoke immune system and cause neural inflammation, leading to different diseases. Brain malaria, brain dengue, brain fever are nothing but body's immune system trying too hard to get rid of the pathogens and in turn causing harm to the host body itself.
Stress causes strange surface protein formation in bacteria which human antibody can't trap.


Immunodeficiency and autoimmunity is a wrong classification. It's better explained as immune activation-led inflammation.
Pasteurella sp. caused the death of 50K antelopes; could it be from consumption of bad water?
Chymotrypsin can degrade galacturonase
Trypsin can cleave leather, collagen, keratin and can sequester co-factor metals like Ni, Cd.
Lathyrus plant causes neural paralysis by manipulating host protein via its lectins. Any pulse consumed too often will cause the same.
Any form of irritant disturbs host proteases. 
Most critical: Serine-, cysteine-, and metalloproteases (cause charge relay system)
So, lead to allergy, infectious disease, cancer, autoimmune disease

Bacterial ubiquitous protease: AAA proteases, degP (acts as chaperone as well as peptidase) or sortase (cys) ( substrates function as adhesins, internalins, blood clotting and immune evasion factors, and transporters for nutrients )
Bacterial pathogenic protease: clostripain (cys), collagenase (cause gas gangrene), botulinum neurotoxin, tetanus neurotoxin

Serine/threonine-protein kinase (pknA-B)
Ser/Thr phosphatase

Serine protease: chymotrypsin (S1), subtilisin (S8) 

To counter it protease inhibitors are given in the form of drugs or antibiotics. Chemotherapy is nothing but protease inhibitors and DNA gyrase inhibitors (to prevent DNA replication). Venoms, sea cucumber, ginseng etc. do the same. Peptide can be protease inhibitor. Myoactive neuropeptide NGIWY amide was  isolated from the holothurian. Sea urchin Strongylocentrotus purpuratus.
Protease inhibitors: glycoproteins or sulfated polysaccharides
These are present from virus to human. These inhibitors-coding  gene family have undergone duplication in higher animals. 
Serpin superfamily:
all-beta, immunoglobulin (Ig) fold:  (bacteria), chagasin and amoebiasin
Glycoproteins: Kazal type (viral, fungal, termite, jellyfish, sea cucumber, human (SPINK)) or Kunitz type (mostly in venoms, e.g. spider, tick, snake). Both types have been found in helminths and bacteria

Protease inhibition by: lock-and-key type, conformational change and consequent kinetic trapping of an enzyme intermediate


*Mice homozygous for a SPINK gene knockout show postnatal lethality, growth retardation, dehydration, autophagic degeneration of acinar cells resulting in pancreas atrophy, small intestine degeneration, and a small spleen.


Glycosylation of proteins occurs in diabetes. It inactivates proteins that need to be activated. Protease can help in this scenario. Herbal drugs cause glycosylation, i.e. they make activated protease inactive again. (Most proteins are glycoproteins. Loss of the glyco part might activate or inactivate the protein.) Some hormones are glycoproteins. The allergen cuts the glyco moiety, so the hormone can't act. Carbohydrate-degrading domains are known to play a role in fungal pathogenicity.
Non-pathogenicity is only a matter of time.
The person is not exposed to the allergen anymore, but the antibody already formed is blocking the glycoproteins. Female menstrual cycle disturbance is an example of it. The hormones required for it are being destroyed by the antibody. That's what they call autoimmunity.
Breast infection...a type of inflammation...as it is adipose tissue, pollutants are attracted here..
No one can underestimate the power of hormone....it causes bloom in bare plant, it brings puberty in an innocent child...
Why do women face most allergies in their late 20's and early 30's? Maybe because estrogen level is high at this phase. (LTP (tomato, pulses, fruits), insect, allergy).


Is wisdom teeth eruption related to nerve growth factor?
Inflammation is the cause of all diseases...personalized diet is needed for healthy living.
Catalytic site of protease contain or leach Ni, Mo, Cd or  Zn atoms. So, they become active and unleash immune response.
Due to disulfide arrangement, sulphur-rich proteins (cysteine protease as well) are stable against thermal and enzymatic degradation.
Sulfites in food can cause bronchial constriction, which can cause asthma (e.g preserved fries, namkeen, pizza)
Common acids in fruits: malic acid, tartaric acid, oxalic acid
Para-phenylenediamine (PPD) in ‘black henna’, hair dye, black rubber cause allergy
Polysaccharide (ligand)-----Dectin-1 (receptor) on APC (macrophage, dendritic cell)--- (Syk, NF-κB signaling, and cytokine release) --->TNF-α and IL-6 secretion
Dectin-1 blocking reagent: Laminarin 
#Pesticides
Did pesticides or hormones cause protein misfolding (prions) and mad cow disease in cattle?
In the 80's and 90's, excessive use of pesticides in animal husbandry led to neural damage in cattle; when their corpses were devoured by scavengers (vultures), the latter succumbed. Pesticides caused inflammation in cattle, which caused mad cow disease, the causative agent of which was described as prions.....the misfolded proteins.....By devouring the carcasses, the vulture population almost vanished.
It is possible that the pesticides in food disrupted female reproductive system and caused birth defects, including autism in children.
Are food additives and other food chemicals mimicking estrogen and causing early puberty in girls?
Dairy farm animals are advertised as free-range grazing cattle...but then preservatives are mixed into the milk..
Pesticides manipulate our serine proteases.  Formaldehyde killed a young instantly.
Facts are emerging that rampant use of pesticides is affecting nervous system of farmers, leading to depression and suicidal tendencies.


Drug resistance in pathogens and cancer in humans are outcomes of one stress.......drug abuse, pollution
Deodorant use may be linked to diseases like cancer and Alzheimer's.
#Pathogen
Surface antigens: repeat proteins of gram-positive cocci
Internalin: a repeat protein in Listeria
Transcriptional regulation (histidine kinases)
Chemotaxis (methyl-accepting proteins)
Catabolite repression (adenylate cyclases)
Modulation of enzyme activity (diguanylate cyclases and phosphodiesterases)
PhoQ histidine kinase, essential for resistance to antimicrobial peptides, is present in a variety of enteric pathogens. IS cause over-expression in some Mtb PhoQ.
Hyperthermophilic bacteria (i.e. Aquifex and Thermotoga) and archaea (e.g. Pyrococcus, Thermococcus, Methanothermus and Sulfolobus): despite the small set of studied systems, it is clear that super-slow protein unfolding is a dominant strategy to allow these proteins to function at extreme temperatures.
Repeat sequences are supposed to play a role in protein–protein interactions
Chloramphenicol acetyltransferase attaches an acetyl group to chloramphenicol, which prevents the drug from binding to ribosomes. It leads to drug resistance. In vitro culture-driven picture is a messed up picture.
Because of its excellent blood-brain barrier penetration (superior to any of the cephalosporins), chloramphenicol remains the first-choice treatment for staphylococcal brain abscesses.
Clostridium perfringens enterotoxin gene is on a transposable element
Iron-regulated heparin-binding hemagglutinin capacity of M. tuberculosis


Helicobacter pylori: uses CagA to perturb a host cell signaling pathway, which leads to development of peptic ulcer
Plasmodium falciparum: histidine-rich proteins that facilitate its survival inside red blood cells 
Plant-pathogenic oomycetes: the multifunctional elicitin molecules facilitate infection by triggering host tissue necrosis; serve as a sterol-carrying protein
The elicitin-encoding gene is more highly expressed at body temperature than at room temperature. Something like this must be happening in Mtb replication. Stimulation of clathrin-mediated endocytosis by the elicitin.
##Intracellular pathogens
Mycobacterium tuberculosis
Coxiella burnetii
Legionella pneumophila
Brucella abortus

Rickettsia conorii
#Diseases
Diseases are geographical, largely based on diet and lifestyle.
Asians have vitiligo and diabetes, but little peanut allergy or Alzheimer's disease. Also, cancer deaths were fewer, though pollution is causing cancer even in villages. Peanut has long been cultivated in the Indian subcontinent (it is native to South America), so people there may have adapted to metabolize it, which might explain the absence of allergy or anaphylaxis.
The high incidence of diabetes might be due to a cereal-based diet. Eating too much sugar will activate carbohydrate-cleaving enzymes. These enzymes will disturb glycoproteins. The signalling system goes awry. Moderation is the answer. Diabetes and hormonal disturbance are tied together. That's why diabetic women may struggle to conceive. Diabetic people have low sexual drive, as the hormones for the stimulation are getting destroyed.
In Western countries, the incidence of Alzheimer's, Parkinson's, multiple sclerosis, autism and cancer is high. Cheese and alcohol seem to be the culprits. Cheese is serine and tyrosine rich, so it might be manipulating serine, tyrosine protease and serine/tyrosine receptors in the brain. Alcohol causes liver cirrhosis, so enzymes can't be formed.

Diseases: Heart: Acute myocardial infarction
Lungs: Chronic obstructive pulmonary disease (COPD)
Neural: Autism
Fungal: Aspergillosis
Bacteria: Tuberculosis
Cystic fibrosis: A disease of cells producing mucus, sweat and digestive juice (lungs, liver, pancreas)
#Genetic

Alzheimer's: Processed foods produce toxins, they cause inflammation,  build-up of plaques, impaired cognitive function. Processed foods such as white breads, pasta, processed meats and cheeses cause inflammation.

Autism: Causes of autism might be: pesticides (and other endocrine disruptors), alcohol, drugs (anti-depressants). Study approaches: microbiome of faeces, brain imaging, behavioral study
Vitiligo: Caucasian families with co-segregation of vitiligo and Hashimoto thyroiditis
Pfeiffer syndrome is strongly associated with mutations of fibroblast growth factor receptors 1 and 2.
A hypnic jerk is an involuntary twitch which occurs just as a person is beginning to fall asleep.
Disease-causing mutations: His to alanine; Phe to glycine
#mutagenesis results,  pharmacophore data
Increased circulating galanin levels in serum contribute to the development of metabolic syndrome
Matrix molecules, i.e. collagens and proteoglycans. Defective hydroxylation of collagen causes scurvy.
Intermediate filaments (IF) 
type I: acidic cytokeratins
type II: basic cytokeratins
type III: vimentin, desmin, glial fibrillary acidic protein (GFAP), peripherin, and plasticin
type IV: neurofilaments L, H and M, alpha-internexin and nestin
type V: nuclear lamins A, B1, B2 and C
Mutations in long coiled-coil proteins cause diseases


Mutations in BRCA1 result in truncated proteins


95% of the cases of chronic myelogenous leukemia contain the Philadelphia chromosome, which results from a reciprocal translocation between chromosomes 9 and 22.
BRCA1: Hereditary breast cancer, hereditary ovarian cancer

 kinase activators, including epidermal growth factor (EGF) and the tumor promoting phorbol ester 12-O-tetradecanylphorbol-13-acetate (TPA)
Antibiotics affect host: G protein-coupled receptors,  intracellular calcium signals,  membrane cholesterol distribution.
Proton pump inhibitors and H2 blockers can reduce acid production
Hydrocortisone prevents itching, eczema, psoriasis
Glucosamine and chondroitin sulfate are needed for joint health

Chromothripsis: Chromosomal rearrangement due to DNA damage in micronuclei
Most health issues start a vicious circle in us.
#Genome
Pyrimidines: cytosine, thymine, and uracil
Purines: adenine, guanine
CG bond
AT bond, AU bond
Y: Any pyrimidine (C, T)
Out of 4 bases, A and C can be methylated
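The base classes and pairings listed above can be written down as a small lookup sketch. The function names here (base_class, reverse_complement, transcribe) are hypothetical helpers made up for illustration, not from any library; the R code for "any purine" is the standard IUPAC counterpart of Y.

```python
PURINES = set('AG')
PYRIMIDINES = set('CTU')

# IUPAC ambiguity codes: Y = any pyrimidine, R = any purine
IUPAC = {'Y': 'CT', 'R': 'AG'}

# DNA base pairs: A with T, C with G
DNA_PAIR = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def base_class(base):
    """Return 'purine' or 'pyrimidine' for a single base letter."""
    base = base.upper()
    if base in PURINES:
        return 'purine'
    if base in PYRIMIDINES:
        return 'pyrimidine'
    raise ValueError('not a standard base: %r' % base)


def reverse_complement(seq):
    """Complement each DNA base and reverse the strand."""
    return ''.join(DNA_PAIR[b] for b in reversed(seq.upper()))


def transcribe(seq):
    """DNA to RNA: thymine is replaced by uracil (the AU pair)."""
    return seq.upper().replace('T', 'U')


print(base_class('u'))
print(reverse_complement('ATGC'))
print(transcribe('ATGC'))
```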
Differential networks have recently been introduced as a powerful way to study the dynamic rewiring capabilities of an interactome in response to changing environmental conditions or stimuli. 
Paralogous genes are homologous genes that occur within one species and have diverged after a duplication event. 

Orthologous genes are homologous genes that diverged after a speciation event.
Protein expansion is primarily due to indels in intrinsically disordered regions
PCR amplification is one of the major sources of duplicates, which are usually introduced during sequencing library amplification.
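One common way to spot such duplicates is to flag reads that map to exactly the same place. The sketch below is a simplified heuristic under my own assumptions: the Read tuple and mark_duplicates function are hypothetical, and real duplicate-marking tools consider more than chromosome, start and strand.

```python
from collections import namedtuple

# Hypothetical minimal read record, for illustration only
Read = namedtuple('Read', 'name chrom start strand')


def mark_duplicates(reads):
    """Return one flag per read: True if an earlier read had the
    same chromosome, start position and strand."""
    seen = set()
    flags = []
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        flags.append(key in seen)
        seen.add(key)
    return flags


# Made-up reads: r2 repeats the coordinates of r1
reads = [
    Read('r1', 'chr1', 100, '+'),
    Read('r2', 'chr1', 100, '+'),
    Read('r3', 'chr1', 100, '-'),
    Read('r4', 'chr2', 100, '+'),
]
flags = mark_duplicates(reads)
for read, is_dup in zip(reads, flags):
    print(read.name, 'duplicate' if is_dup else 'unique')
```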
Chromatin immunoprecipitation (ChIP) methodology: examine gene regulation in living cells
Plasmid isolation: CsCl/ethidium bromide gradient ultracentrifugation
Protein sizing: Western blot analysis
Localisation: immuno-histochemical staining
Aberrant or additional reactive bands: Southern blotting


Chimeric transcript detection: by RT-PCR
At high GC, coverage drops
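To check this against one's own data, GC content can be computed per window and compared with coverage in the same windows. A minimal sketch follows; gc_content and windowed_gc are hypothetical helper names, and the window size is an arbitrary choice.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / float(len(seq))


def windowed_gc(seq, window):
    """GC fraction of consecutive, non-overlapping windows.
    A trailing partial window is ignored."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


# Made-up sequence: a GC-only stretch followed by an AT-only stretch
example = 'GGCCAATT'
print(gc_content(example))
print(windowed_gc(example, 4))
```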
Indel events most frequently occur in surface-exposed loops.
Human body contains about 10^13 human cells and about 10^14 bacterial, fungal, and protozoan cells.
About 45% of human genome is transposable element.
Human genome size is 3,234.83 Mb (about 3.2 billion bases)
1.5% of the genome is CDS (20k-25k genes)
Each of the 23 chromosomes has pseudogenes: 59 (chr 18) to 1,130 (chr 1)
Confirmed protein range: 2,012 (in chromosome 1) to 45 (in Y chromosome)
X chromosome codes for 815 genes
mtDNA has genes for only 13 proteins
Most variations are in chr 2 and 1
Longest chr: 1, 2
Shortest chr: mtDNA, 21, 22, 19, Y, 20
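The percentages above are easier to picture as absolute sizes. This is a quick back-of-the-envelope calculation using only the figures quoted in these notes (3,234.83 Mb, 1.5% CDS, 45% transposable elements, 10^13 and 10^14 cells); the variable names are my own.

```python
genome_mb = 3234.83          # genome size quoted above, in megabases
cds_fraction = 0.015         # 1.5% coding sequence
te_fraction = 0.45           # about 45% transposable elements

cds_mb = genome_mb * cds_fraction
te_mb = genome_mb * te_fraction

human_cells = 10 ** 13
microbial_cells = 10 ** 14
# Microbial cells per human cell, from the two estimates above
cell_ratio = microbial_cells // human_cells

print('CDS: about %.1f Mb' % cds_mb)
print('Transposable elements: about %.1f Mb' % te_mb)
print('Microbial cells per human cell: %d' % cell_ratio)
```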
Large-scale sequencing efforts: 1000 Genomes, ExAC (Exome Aggregation Consortium), Scripps Wellderly, UK10K
Reference Variant Store (RVS) stores 400 million distinct variants observed in more than 80,000 human samples. (https://rvs.u.hpc.mssm.edu/)
Exhaustive annotation using tools such as snpEff, ANNOVAR, or VEP
Predictions of deleteriousness by SIFT, PolyPhen2, PROVEAN
Curated variant databases such as dbSNP, ClinVar, HGMD, OMIM, COSMIC
GEMINI: A software package for exploring variation in personal genomes and family-based genetic studies.
Well-studied disease genes and mutations
Breast cancer: BRCA2 (chr 13)
Cystic fibrosis: CFTR (chr 7)
Cytochrome b: MTCYB (mtDNA)
Hemoglobin: HBB (chr 11)

miRNA: Regulator of gene expression

snRNA: Small nuclear RNA (processes pre-mRNA and regulates transcription factors)
snoRNA: Small nucleolar RNA

After myosin and actin, titin is the third most abundant protein in human muscle

#Proteins/ Domains
Some proteins are fast-evolving 
Domains are common currencies of protein function that nature rearranges to create novel activities (function and evolutionary aspects can be learnt from them). Domains do not generally appear de novo but by shuffling and rearrangement of existing domains. Cache_2 is predicted to originate from GAF-PAS fold. Domain-swap analysis revealed that the COOH-terminal leucine-rich repeat.
Flo11 flocculin belongs to a family of proteins involved in invasive growth, cell-cell adhesion, and mating, many of which can substitute for each other under abnormal conditions. Flo11 flocculin in yeast gives the cell a wide range of phenotypes (multicellular structures such as biofilms, flors, or filaments), depending on the strain and the environmental conditions. Does it happen in Mtb too? Sure. If it is present in virus and cockroach, it must be in bacteria too.
All recognition-related proteins are glycosylated (they bind to mannose or other carbohydrates). This explains the cell-cell interaction capacity of FLO11-expressing cells.
Insoluble and inactive proteins are co-produced due to codon bias, protein folding, phosphorylation, glycosylation, mRNA stability and promoter strength.
Not only enzymes, even adhesins are pH dependent.
Also, the enzymes are hypothetical, as they need cofactors to be active. The culture medium might be lacking in them.
Heparinase needs substrate to be activated. (may not activate in vitro)
These are called transposases in the case of DNA elements and integrases in the case of the best-characterized RNA elements, the retroviruses and retrotransposons.
ATPase are cell-surface, membrane traversing proteins


coiled coil proteins: c-Fos,c-jun,tropomyosin

N-terminal amino acid of a protein is an important determinant of its half-life
CpG motifs are considered pathogen-associated molecular patterns (PAMPs). CpG PAMP is recognized by the pattern recognition receptor (PRR) Toll-Like Receptor 9 (TLR9), which is constitutively expressed only in B cells and plasmacytoid dendritic cells (pDCs) 

Protein families have arisen during evolution by gene duplication and divergence