Gen*NY*Sis Center for Excellence in Cancer Genomics

University at Albany, State University of New York

Research Overview

Our laboratory pursues a wide variety of research areas in bioinformatics that can be broadly grouped under novel method development, data mining and knowledge discovery, machine learning, and the development of web servers and software for bioinformatics applications. Computational methods are essential tools for efficient data mining from the vast amount of genomic data. Our primary research focuses on developing such tools, using them for knowledge discovery, and making them accessible to the biomedical research community via web-based applications. Developing a good bioinformatics application is a challenging task that involves finding the right discriminative features from the training data, designing the algorithm (rules and heuristics), implementation (writing the code), performance optimization (testing) and making the application accessible to the public (web servers/databases).


Research Projects

Click below for a detailed description of each research project

 

Global analysis of protein-protein interactions in cancer-associated genes

Protein-protein interaction (PPI) studies have been widely used to understand the molecular functions or biological processes associated with different disease systems including cancer. While focused studies on individual cancers have generated valuable information, a global and comparative analysis of datasets from different cancer types has not been done. In this study, we carried out bioinformatic analysis of PPIs corresponding to differentially expressed genes from microarray studies of various tumour tissues. We compared the biological processes and molecular functions (based on GO terms) associated with PPIs of various cancer types and identified a set of functions or processes that are common to all cancers, and those that are specific to only one cancer or to a few cancer types. Similarly, protein interaction networks in nucleic acid metabolism were compared to identify the common and specific clusters of proteins across different cancer types. The methodology developed in this work can be applied to study similar systems, while the results of this study can provide the basis for further experimental investigations of protein interaction networks associated with cancer.


Protein interaction networks of the nucleic acid metabolism pathway. a) NFYA - Nuclear transcription factor Y subunit alpha, HIF1A - Hypoxia inducible factor 1 alpha, NRIP1 - Nuclear receptor interacting protein 1, JUN - Transcription factor activator protein 1, NCOA2 - Nuclear receptor co-activator 2, NR4A1 - Nuclear receptor sub-family 4 group A member 1; ATF4 - Activating transcription factor 4 (Cyt), JUN - Transcription factor activator protein 1 (Nuc), C/ATF4 - Cyclic AMP-dependent transcription factor ATF-4 (Cyt) (Nuc); b) WWTR1 (TAZ) - WW domain-containing transcription regulator protein (Nuc), NKX2.1 - Homeobox protein Nkx-2.1 (Nuc); c) Rxra - Retinoic acid receptor RXR-alpha (Nuc), crebbp - CREB binding protein (Nuc), Thra - Thyroid hormone receptor alpha (Nuc); d) NDK - Nucleoside diphosphate kinase, Nm23 - Nucleoside diphosphate kinase, mitochondrial (precursor) (Mito).
Published articles on this project
  • Guda P, Chittur S, Guda C
    Comparative analysis of protein-protein interactions in cancer associated genes
    Genomics, Proteomics & Bioinformatics (2009) 7:25-36



 

Inference and comparison of domain-domain interactions in proteins

A vast majority of proteins must interact with other proteins to perform their intended functions. Proteins are made of functional modules known as domains, which create the interface of an interaction through highly specific recognition events. Thus, knowledge of domain-domain interactions (DDIs) is very important for understanding the nature and significance of protein-protein interactions (PPIs). Currently, the number of experimentally known DDIs is very small, which warrants the development of computational inference methods for predicting functionally significant DDIs.

We created a comprehensive, non-redundant dataset of 209,165 experimentally-derived PPIs by combining datasets from five major interaction databases. We introduced an integrated scoring system that uses a novel combination of a set of five orthogonal scoring features covering the probabilistic, evolutionary, evidence-based, spatial and functional properties of interacting domains, which can map the interacting propensity of two domains in many dimensions. This method outperforms similar existing methods both in the accuracy of prediction and in the coverage of domain interaction space. We predicted a set of 52,492 high-confidence DDIs to carry out cross-species comparison of DDI conservation in eight model species including human, mouse, Drosophila, C. elegans, yeast, Plasmodium, E. coli and Arabidopsis. Our results show that only 23% of these DDIs are conserved in at least two species and only 3.8% in at least 4 species, indicating a rather low conservation across species. Pair-wise analysis of DDI conservation revealed a 'sliding conservation' pattern between the evolutionarily neighboring species. Our methodology and the high-confidence DDI predictions generated in this study can help to better understand the functional significance of PPIs at the modular level, thus can significantly impact further experimental investigations in systems biology research.
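The conservation percentages quoted above are fractions of predicted DDIs found in at least a given number of species. A minimal sketch of that tally follows, assuming each predicted domain pair has already been mapped to the set of species in which it is observed. The function name, the data layout and the toy domain identifiers are illustrative assumptions, not the published pipeline.

```python
from typing import Dict, FrozenSet, Set

# A domain pair is stored as a frozenset so that (A, B) and (B, A) are the same key.
DomainPair = FrozenSet[str]


def fraction_conserved(ddi_species: Dict[DomainPair, Set[str]], min_species: int) -> float:
    """Return the fraction of DDIs observed in at least `min_species` species."""
    if not ddi_species:
        return 0.0
    hits = sum(1 for species in ddi_species.values() if len(species) >= min_species)
    return hits / len(ddi_species)


# Toy data: hypothetical domain names and species assignments.
toy_ddis = {
    frozenset(("domA", "domB")): {"human"},
    frozenset(("domA", "domC")): {"human", "mouse"},
    frozenset(("domB", "domD")): {"human", "mouse", "yeast", "Drosophila"},
    frozenset(("domE", "domE")): {"E. coli"},  # a domain interacting with itself
}

print(fraction_conserved(toy_ddis, 2))
print(fraction_conserved(toy_ddis, 4))
```

Storing each pair as an unordered set avoids counting the same interaction twice when the two domains are listed in opposite order.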


Schematic diagram showing the derivation of datasets


Cumulative distribution of positive and negative test datasets against the entire range of prediction scores

Published articles on this project
  • Guda C, King BR, Pal LR, Guda P.
    A top-down approach to infer and compare domain-domain interactions across eight model organisms
    PLoS ONE (2009) 4:e5096 [Pubmed]



 

Tracing the evolutionary origin of functional modules in the human proteome

The functional repertoire of the human proteome is an incremental collection of functions accomplished by protein domains that evolved along the Homo sapiens lineage. Therefore, knowledge of the origin of these functionalities provides a better understanding of domain and protein evolution in humans. This study reports a unique approach for understanding the evolution of the human proteome by tracing the origin of its constituent domains hierarchically along the Homo sapiens lineage. The uniqueness of this method lies in subtractive searching of functional and conserved domains in the human proteome, which results in higher efficiency in detecting their origins. From these analyses, the nature of protein evolution and trends in domain evolution can be observed in the context of the entire human proteome. The method adopted here also helps delineate the degree of divergence of functional families that occurred during the course of evolution.

This approach to tracing the evolutionary origin of functional domains in the human proteome facilitates a better understanding of their functional versatility and provides insights into the functionality of hypothetical proteins present in the human proteome. This work elucidates the origin of functional and conserved domains in human proteins, their distribution along the Homo sapiens lineage, the occurrence frequency of different domain combinations and proteome-wide patterns of their distribution, providing insights into the evolutionary solution to the increased complexity of the human proteome.
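One way to picture subtractive searching is as a loop over lineage nodes in which every domain assigned an origin is removed from the query set before the next node is searched. The sketch below assumes the search starts at the most ancient node and that homolog detection has already been reduced to a set of hits per node; both are assumptions made for illustration, and the names and toy data are hypothetical rather than taken from the published method.

```python
from typing import Dict, Iterable, List, Set, Tuple


def subtractive_search(
    query_domains: Iterable[str],
    node_order: List[str],
    hits_at_node: Dict[str, Set[str]],
) -> Tuple[Dict[str, str], Set[str]]:
    """Assign each domain to the first node (in `node_order`) where it has a hit.

    Domains that receive an origin are subtracted from the query set, so later
    nodes are searched with a progressively smaller set of domains.
    """
    remaining = set(query_domains)
    origin: Dict[str, str] = {}
    for node in node_order:
        found = remaining & hits_at_node.get(node, set())
        for domain in found:
            origin[domain] = node
        remaining -= found  # the subtractive step
    return origin, remaining


# Node codes as in the figure legend: bacteria, eukaryota, metazoa, chordata,
# mammalia, primates, Homo sapiens. The hits below are invented toy data.
nodes = ["B", "E", "T", "C", "M", "P", "H"]
toy_hits = {
    "B": {"dom1"},
    "E": {"dom1", "dom2"},  # dom1 was already assigned at B, so only dom2 is new
    "T": {"dom3"},
}
origins, unassigned = subtractive_search({"dom1", "dom2", "dom3", "dom4"}, nodes, toy_hits)
print(origins, unassigned)
```

Because assigned domains leave the query set, each later search handles fewer queries, which is one plausible reading of the efficiency gain described above.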


Flow diagram of subtractive searching method depicting the process of tracing the evolutionary origin of human domains

Cartoon diagram of different representative proteins containing Pfam-A family EGF (epidermal growth factor) with remote homologs found at different nodes along the lineage using subtractive searching method. For each sequence, SWISS-PROT identifier is given and EGF domain is shown along with the node name where it has found its remote homolog in that protein sequence. The codes for different nodes are: B, bacteria; E, eukaryota; T, metazoa; C, chordata; M, mammalia; P, primates; H, Homo sapiens. Other functionally significant domain names in protein sequences are given in the legend.
Published articles on this project
  • Pal LR, Guda C
    Tracing the origin of functional and conserved domains in the human proteome: implications for protein evolution at the modular level.
    BMC Evolutionary Biology (2006) 6:91 [Pubmed]



 

ngLOC: A Bayesian method for estimating the subcellular proteomes of eukaryotes

We present a method called ngLOC - an n-gram based Bayesian classifier - that predicts the localization of a protein sequence over ten distinct subcellular organelles. A ten-fold cross-validation result shows an overall accuracy of 89% for sequences localized to a single organelle, and 82% for those localized to multiple organelles. An enhanced version of ngLOC was developed to allow the dynamic adjustment of the model parameters that are specific to a proteome being estimated. Using this method, we have estimated the subcellular proteomes of eight eukaryotic organisms including yeast, nematode, fruitfly, mosquito, zebrafish, chicken, mouse, and human. To our knowledge, this study reports the first estimation of ten distinct subcellular proteomes for eight eukaryotic model organisms.
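To make the idea of an n-gram based Bayesian classifier concrete, here is a minimal naive Bayes sketch over overlapping n-grams with add-one style smoothing. It is not the ngLOC implementation: the class name, the smoothing scheme and the toy sequences are assumptions chosen only to illustrate the technique.

```python
import math
from collections import Counter, defaultdict


def extract_ngrams(sequence, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NGramBayes:
    """Toy multinomial naive Bayes classifier over protein n-grams."""

    def __init__(self, n=4, alpha=1.0):
        self.n = n
        self.alpha = alpha                        # smoothing pseudo-count (assumed)
        self.ngram_counts = defaultdict(Counter)  # location -> n-gram frequency table
        self.class_counts = Counter()             # location -> number of training sequences
        self.vocabulary = set()

    def train(self, sequence, location):
        grams = extract_ngrams(sequence, self.n)
        self.ngram_counts[location].update(grams)
        self.class_counts[location] += 1
        self.vocabulary.update(grams)

    def log_scores(self, sequence):
        total_seqs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary) + 1     # one extra slot for unseen n-grams
        scores = {}
        for location in self.class_counts:
            counts = self.ngram_counts[location]
            total = sum(counts.values())
            score = math.log(self.class_counts[location] / total_seqs)  # log prior
            for gram in extract_ngrams(sequence, self.n):
                score += math.log((counts[gram] + self.alpha) / (total + self.alpha * vocab_size))
            scores[location] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Invented toy sequences; CYT and NUC are used purely as example labels.
model = NGramBayes(n=4)
model.train("MKKLLAAAMKKLL", "CYT")
model.train("PPRRKKRRPPRRKK", "NUC")
print(model.predict("KKLLAAA"))
```

Working in log space keeps the product of many small probabilities from underflowing on long sequences.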


The n-gram model for representing proteins in ngLOC. This figure depicts the process of extracting n-grams from an example protein sequence for cytoplasm (CYT), and shows how the table of frequencies of n-grams maintained by the model is updated accordingly. For this example, n = 4.

Comparison of predictions from three methods on the ngLOC dataset. Three methods, PSORT, pTARGET, and ngLOC, were evaluated by comparing the Matthews Correlation Coefficient (MCC) for each localization. The MCC was chosen because it provides a balanced measure between sensitivity and specificity for each class. *The LYS location was omitted from PSORT predictions, as PSORT predicts this class as part of the vesicular secretory pathway.
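For reference, the MCC for one localization can be computed from a one-vs-rest confusion table. The sketch below uses the standard MCC formula. The helper names and the convention of returning 0.0 when the denominator is zero are assumptions of this example, not details from the published evaluation.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-table counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention assumed here when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


def mcc_for_class(true_labels, predicted_labels, target):
    """Treat `target` as the positive class and every other label as negative."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target and predicted == target:
            tp += 1
        elif truth != target and predicted != target:
            tn += 1
        elif predicted == target:
            fp += 1
        else:
            fn += 1
    return mcc(tp, tn, fp, fn)


# Toy labels for illustration only.
truth = ["CYT", "CYT", "NUC", "NUC"]
predicted = ["CYT", "NUC", "NUC", "NUC"]
print(round(mcc_for_class(truth, predicted, "CYT"), 4))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).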
Published articles on this project
  • King BR, Guda C
    ngLOC: An n-gram based Bayesian method for estimating the subcellular proteomes of eukaryotes
    Genome Biology (2007) 8:R68 [Pubmed]

  • King BR, Guda C
    Semi-supervised learning for classification of protein sequence data
    Scientific Programming (2008) 16:5-29


  • King BR, Latham L, Guda C
    Estimation of subcellular proteomes in bacterial species
    The Open Applied Informatics Journal (2009) 3:1-11 [PDF]



 

Reconstruction of amino acid metabolic pathways in human mitochondria

Mitochondria are subcellular organelles in eukaryotic cells with their own genome and protein synthesis machinery. Functionally, mitochondria are known as the powerhouses of the cell; however, they also play an important role in the metabolism of lipids, amino acids, vitamins, nucleotides etc. Hence, genetic and/or metabolic alterations in this organelle are causative or contributing factors to over 100 known human diseases including cancer, apoptosis, diabetes II, Parkinson’s disease, Alzheimer’s disease etc. Mitochondrial function is under the control of two genomes: their own and that of the host nucleus. In humans, the mitochondrial genome is a circular DNA molecule of 16.5 kb, and there are 2-10 copies of the genome per mitochondrion. Additionally, depending on the cell type, each cell contains anywhere from 10 to 2,000 mitochondria. The human mitochondrial genome codes for 22 tRNAs, 2 rRNAs and only 13 polypeptides, all of which are involved in oxidative phosphorylation. About 1500 proteins estimated to function in mitochondria are encoded by the nuclear genome, synthesized in the cytoplasm and transported into mitochondria. Hence, several metabolic pathways span multiple subcellular locations that include mitochondria. We are primarily interested in studying all the metabolic and disease pathways associated with human mitochondrial proteins.

We have used a bioinformatics approach for the identification and reconstruction of metabolic pathways associated with amino acid metabolism in human mitochondria. Human mitochondrial proteins determined through experimental and computational methods have been superposed on the reference pathways from the KEGG database to identify mitochondrial pathways. Enzymes at the entry and exit points of each reconstructed pathway were identified and mitochondrial solute carrier proteins were determined, where applicable. Intermediate enzymes in the mitochondrial pathways were identified based on the annotations available from public databases, from evidence in current literature, or from our MITOPRED program, which predicts the mitochondrial localization of proteins. Through integration of data derived from experimental, bibliographical and computational sources, we reconstructed the amino acid metabolic pathways in human mitochondria, which could help to better understand mitochondrial metabolism and its role in human health.
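The superposition step can be thought of as intersecting a set of mitochondrial proteins with the enzymes listed for each reference pathway, while keeping track of the evidence for each assignment. The sketch below works on plain in-memory dictionaries with placeholder identifiers. It does not call KEGG or MITOPRED, and all names and data are hypothetical.

```python
from typing import Dict, List, Set


def superpose(
    reference_pathways: Dict[str, Set[str]],
    mito_evidence: Dict[str, Set[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Map each reference pathway to its enzymes that have mitochondrial evidence.

    `mito_evidence` maps an enzyme identifier to the evidence sources
    (for example experimental, literature or computational) supporting it.
    Pathways with no mitochondrial enzymes are left out of the result.
    """
    result: Dict[str, Dict[str, List[str]]] = {}
    for pathway, enzymes in reference_pathways.items():
        hits = {e: sorted(mito_evidence[e]) for e in enzymes if e in mito_evidence}
        if hits:
            result[pathway] = hits
    return result


# Placeholder pathway names and enzyme identifiers, not real database entries.
toy_pathways = {
    "pathway_1": {"enz_a", "enz_b", "enz_c"},
    "pathway_2": {"enz_d"},
}
toy_evidence = {
    "enz_a": {"experimental"},
    "enz_c": {"computational", "literature"},
}
print(superpose(toy_pathways, toy_evidence))
```

Keeping the evidence source next to each enzyme makes it easy to separate experimentally supported steps from those that rest only on prediction.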


Overview of the reconstructed amino acid metabolic network in human mitochondria. Shaded areas represent the cytoplasmic segments of the pathways. 2-KG - 2-Ketoglutarate; α-AAS - α-aminoadipate semialdehyde; 2-OA - 2-Oxoadipate; XAA - Xanthurenic acid; 3-H,L-KYN - 3-Hydroxy-L-kynurenine; L-KYN - L-kynurenine; AACoA - Acetoacetyl CoA; ACoA - Acetyl CoA; OA - Oxaloacetate; 3-H-3-MGCoA - 3-Hydroxy-3-methylglutaryl CoA; 4-M-2-OP - 4-Methyl-2-oxopentanoate; SCoA - Succinyl CoA; alpha-KG - alpha-Ketoglutarate; 3-M-2-OP - 3-Methyl-2-oxopentanoate; 2-O-IP - 2-oxo-isopentanoate; MM - Methylmalonate; L-MMCoA - L-Methylmalonyl-CoA; 2-OP - 2-Oxopropanol; 2-A-3-KB - 2-amino-3-ketobutyrate; CO2 - Carbon dioxide; NH3 - Ammonia; 5,10 MTHF - 5,10 Methylene tetrahydrofolate; CP - Carbamoyl phosphate; CIT - Citrulline; NO - Nitric oxide; ORT - Ornithine; GA - Guanidinoacetate; P5C - Pyrroline-5-carboxylate; GABA - Gamma-aminobutyric acid; SAM - S-adenosylmethionine.

Mitochondrial dysfunction has been implicated as the probable cause of several cancer cell phenotypes. Carcinoma cells exhibit altered energy production, elevated membrane potential, elevated generation of reactive oxygen species (ROS), diminished apoptotic capacity etc., all of which are associated with dysfunction of the electron transport chain (ETC). ETC dysfunction is a result of mutations to mitochondrial as well as nuclear DNA, and the nuclear genome coordinates the synthesis and translocation of the majority of the oxidative machinery. Studying the mitochondrial pathways is fundamental to understanding the cross-talk between mitochondrial signals and other organelles that can greatly influence cell behavior in carcinoma cells. In the future, we are interested in reconstructing the mitochondrial pathways associated with apoptosis and cancer biology. This research could help establish mitochondria as a potential target in cancer drug development.

Published articles related to this project
  • Guda P, Guda C, Subramaniam S
    Reconstruction of pathways associated with amino acid metabolism in human mitochondria
    Genomics, Proteomics & Bioinformatics (2007) 5:166-176 [Pubmed]


  • Guda P, Subramaniam S, Guda C
    MitoProteome: Human heart mitochondrial protein sequence database In: Cardiovascular Proteomics, Methods and Protocols.
    Methods in Molecular Biology (2006) 357:375-384 [Pubmed]


  • Guda C, Guda P, Fahy E, Subramaniam S.
    MITOPRED: a web server for genome-scale prediction of mitochondrial proteins.
    Nucleic Acids Research (2004) 32: W372-W374 [Pubmed]


  • Guda C, Fahy E, Subramaniam S.
    MITOPRED: A genome-scale method for prediction of nuclear-encoded mitochondrial proteins.
    Bioinformatics (2004) 20:1785-1794 [Pubmed]



 

Motif recognition in voltage-gated ion channel proteins

Voltage-gated ion channels (VGC) mediate selective diffusion of ions across cell membranes to enable many vital cellular processes. Three-dimensional structure data are virtually lacking for VGC proteins due to limitations in the crystallization of these mostly hydrophobic transmembrane proteins. Therefore, to better understand their function, there is a need to identify conserved patterns using sequence analysis methods. VGC proteins assemble as functional tetramers from four monomer subunits in K+ ion channels or from four repeats of a single polypeptide in the Ca2+ and Na+ channel sub-families. For Ca2+ and Na+ channel proteins, we generated profiles for each repeat and created profile-to-profile alignments for all repeats using a phylogenetic guide tree built from the consensus sequences of repeats. In this study, we identified several new conserved patterns specific to each transmembrane segment (TMS) of the voltage-sensing and the pore-forming modules in each sub-family. For Ca2+ and Na+ channels, the functional theme of pattern conservation is similar in almost all segments, while it differs from that of the K+ channel proteins, except in the S4 segment of the voltage-sensing module. For each sub-family, we also identified residues conserved at 50% or more in each TMS, along with their biological significance and disease associations in humans.
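The 50% conservation criterion can be illustrated with a per-column tally over a multiple sequence alignment. This is a minimal sketch, assuming an alignment given as equal-length strings with `-` for gaps and a fraction computed over all sequences. The function name and toy alignment are illustrative, and the published analysis worked on profile-to-profile alignments rather than this simplified input.

```python
from collections import Counter


def conserved_positions(alignment, threshold=0.5, gap="-"):
    """Return (column, residue, fraction) for columns whose most common residue
    occurs in at least `threshold` of all sequences. Gaps never count as a residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_sequences = len(alignment)
    conserved = []
    for column in range(length):
        counts = Counter(row[column] for row in alignment if row[column] != gap)
        if not counts:
            continue  # column made up entirely of gaps
        residue, count = counts.most_common(1)[0]
        fraction = count / n_sequences
        if fraction >= threshold:
            conserved.append((column, residue, fraction))
    return conserved


# Invented four-sequence alignment for illustration.
toy_alignment = ["ACDE", "ACDF", "ATD-", "GCDK"]
print(conserved_positions(toy_alignment))
```

Dividing by the total number of sequences, rather than by the number of non-gap residues, means a column with many gaps is less likely to be reported as conserved.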


Conserved motifs in the voltage-sensing module of calcium, sodium and potassium ion channel proteins. S1-S4 are transmembrane segments in the voltage-sensing module of VGC proteins
Published articles on this project
  • Guda P, Bourne PE, Guda C
    Conserved motifs in voltage-sensing and pore-forming modules of voltage-gated ion channel proteins
    Biochem. Biophys. Res. Commun. (2007) 352:292-298 [Pubmed]




 

pTARGET: Prediction of protein subcellular localization

We developed a new prediction method, pTARGET, that can predict proteins targeted to nine different subcellular locations in eukaryotic animal species. The nine subcellular locations are cytoplasm, endoplasmic reticulum, extracellular/secretory, Golgi, lysosomes, mitochondria, nucleus, plasma membrane and peroxisomes. Predictions are based on location-specific protein functional domains and the amino acid compositional differences across subcellular locations. Overall, this method can predict 68-87% of the true positives at accuracy rates of 96-99%. Comparison of the prediction performance against PSORT showed that pTARGET prediction rates are higher by 11-60% in 6 of the 8 locations tested. In addition, the pTARGET method is robust enough for genome-scale prediction of protein subcellular localization, since it does not rely on the presence of signal or target peptides.
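Amino acid composition, one of the two feature types mentioned above, is simply the fraction of each of the 20 standard residues in a sequence. The sketch below computes that vector and a Euclidean distance between two compositions. How pTARGET actually scores compositional differences is not described here, so the distance measure and all names are assumptions for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid; other characters are ignored."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            valid += 1
    if valid == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: count / valid for aa, count in counts.items()}


def composition_distance(comp_a, comp_b):
    """Euclidean distance between two composition vectors (an assumed measure)."""
    return math.sqrt(sum((comp_a[aa] - comp_b[aa]) ** 2 for aa in AMINO_ACIDS))


# Toy sequences for illustration only.
comp = aa_composition("AAKK")
print(comp["A"], comp["K"])
print(composition_distance(aa_composition("AAAA"), aa_composition("KKKK")))
```

A composition vector of this kind could be compared against an average composition per location, although the actual comparison used by the method may differ.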

Availability: A public web server based on the pTARGET method is accessible at the URL http://bioinformatics.albany.edu/~ptarget. Datasets used for developing pTARGET can be downloaded from this web server. Source code will be available on request from the corresponding author.


Comparison of the prediction performance of pTARGET and PSORT
Published articles on this project
  • Guda C
    pTARGET: A web server for predicting protein subcellular localization
    Nucleic Acids Research (2006) 34:W210-W213 [Pubmed]


  • Guda C, Subramaniam S.
    pTARGET: A new method for predicting protein sub-cellular localization in eukaryotes
    Bioinformatics (2005) 21: 3963-3969 [Pubmed]



 

DMAPS database
(Database of Multiple Alignments for Protein Structures)

The database of multiple alignments for protein structures (DMAPS) provides instant access to pre-computed multiple structure alignments for all protein structure families in the Protein Data Bank (PDB). Protein structure families have been obtained from four distinct classification methods including SCOP, CATH, ENZYME and CE, and multiple structure alignments have been built for all families containing at least three members, using CE-MC software. Currently, multiple structure alignments are available for 3050 SCOP-, 3087 CATH-, 664 ENZYME- and 1707 CE-based families. A web-based query system has been developed to retrieve multiple alignments for these families using the PDB chain ID of any member of a family. Multiple alignments can be viewed or downloaded in six different formats, including JOY/html, TEXT, FASTA, PDB (superimposed coordinates), JOY/postscript and JOY/rtf. DMAPS is accessible online at http://bioinformatics.albany.edu/~dmaps.


A screenshot of DMAPS query results page
Published articles related to this project
  • Guda C, Pal LR, Shindyalov IN.
    DMAPS: A Database of Multiple Alignments for Protein Structures
    Nucleic Acids Research (2006) 34: D273-276 [Pubmed]


  • Guda C, Lu S, Scheeff E, Bourne PE, Shindyalov IN.
    CE-MC: A Multiple Protein Structure Alignment Server.
    Nucleic Acids Research (2004) 32: W100-W103 [Pubmed]


  • Guda C, Scheeff ED, Bourne PE, Shindyalov IN.
    A new algorithm for the alignment of multiple protein structures using Monte Carlo optimization.
    Proceedings of the Pacific Symposium on Biocomputing (2001) pp. 275-286 [PDF] [Pubmed]



 

Phylogenetic analysis of CZH domain proteins

The Rho family of small GTPases are important regulators of multiple cellular activities and, most notably, reorganization of the actin cytoskeleton. Dbl-homology (DH)-domain-containing proteins are the classical guanine nucleotide exchange factors (GEFs) responsible for activation of Rho GTPases. However, members of a newly discovered family can also act as Rho-GEFs. These CZH proteins include: CDM (Ced-5, Dock180 and Myoblast city) proteins, which activate Rac; and zizimin proteins, which activate Cdc42. The family contains 11 mammalian proteins and has members in many other eukaryotes. The GEF activity is carried out by a novel, DH-unrelated domain named the DOCKER, CZH2 or DHR2 domain. CZH proteins have been implicated in cell migration, phagocytosis of apoptotic cells, T-cell activation and neurite outgrowth, and probably arose relatively early in eukaryotic evolution.


Phylogenetic analysis of the CZH1 and CZH2 domains. Multiple alignments were built using the CLUSTALW program, and the distances between all pairs of sequences in the multiple alignment were determined. Phylogenetic trees were generated using the Neighbor-Joining method and drawn using the TREEVIEW program. Scale bar represents 0.1 nucleotide substitutions per site.
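The pairwise distances that feed a Neighbor-Joining tree can be illustrated with the simple p-distance, the fraction of aligned positions at which two sequences differ. The caption does not state which distance measure was used, so p-distance is an assumption here, as are the function names and the toy alignment.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites among positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip positions with a gap in either sequence
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(alignment):
    """Return {(name_i, name_j): distance} for every ordered pair of sequences."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n]) for m in names for n in names}


# Hypothetical sequence names and toy aligned sequences.
toy = {"seq1": "ACDE", "seq2": "ACDF", "seq3": "AC-E"}
matrix = distance_matrix(toy)
print(matrix[("seq1", "seq2")], matrix[("seq2", "seq3")])
```

A matrix of this form is the input that a Neighbor-Joining implementation would cluster into a tree.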
Published articles on this project
  • Meller N, Merlot S, Guda C
    CZH proteins: a new family of Rho-GEFs
    Journal of Cell Science (2005) 118: 4937-4946 [Pubmed]

  • Meller N, Westbrook JM, Shannon JD, Guda C, Schwartz MA
    Function of the N-terminus of zizimin1: autoinhibition and membrane targeting.
    Biochemical Journal (2008) 409:525-533 [Pubmed]




Comparative analysis of plant chloroplast genomes

We developed computational methods for comparative analysis of the complete chloroplast genomes of solanaceous crop species and grass species. Specifically, we analyzed the intergenic spacer regions of these genomes in an all-against-all fashion to compare and contrast their similarities and differences.


Gene map of tomato and potato chloroplast genomes: comparative analysis
Published articles on this project
  • Saski C, Lee SB, Fjellheim S, Guda C, Jansen RK, Tomkins J, Rognli OA, Daniell H, Clarke JL.
    Complete chloroplast genome sequences of Hordeum vulgare, Sorghum bicolor and Agrostis stolonifera, and comparative analyses with other grass genomes.
    Theoretical and Applied Genetics (2007) 115:571-590 [Pubmed]


  • Daniell H, Lee SB, Grevich J, Saski C, Guda C, Tomkins J, Jansen RK.
    Complete chloroplast genome sequence of Solanum tuberosum, Lycopersicon esculentum and comparative analyses with other Solanaceous genomes.
    Theoretical and Applied Genetics (2006) 112:1503-1518 [Pubmed]