ERGO™ Overview: Integrated Genomics
 
 
 

1. Introduction: The ERGO™ system
2. Loading a new genome in ERGO (Identification of ORFs)
3. Genome-wide Functional Annotation
  3.1. Initial automated functional annotations
  3.2. Manual functional annotations
  3.3. Automated Pathway annotations
  3.4. Manual Pathway annotations
  3.5. Detailed function annotations
References


 

1. Introduction: The ERGO™ system

Since 1995, over 200 microbial organisms have been completely sequenced (1). Since the first sequence, it has become evident that the single most important tool for interpreting a new genome sequence is a thorough analysis and the integration of the data for comparative genomics. The success of the comparative analysis depends directly on the efficiency of integration, which in turn is determined by the diversity of the organisms, the quality of the annotations, and the level of detail of the cellular reconstructions.

Integrated Genomics, Inc. designed the ERGO™ bioinformatics suite to accommodate such data integration and to provide the tools necessary to support the comparative analysis of genomes and the generation of sophisticated metabolic and cellular reconstructions (2). Emerging from PUMA and WIT, which were previously developed at Argonne National Laboratory (3,4), ERGO™ is a third-generation bioinformatics suite offered exclusively by Integrated Genomics at: http://ergo.integratedgenomics.com/ERGO/. The ERGO system extends a genome analysis strategy into a multi-dimensional environment that supports both automatic and manual genome-wide curation. Rather than just repackaging known information, ERGO integrates genomic information with biochemical data, literature, and high-throughput analysis into a comprehensive, user-friendly network of metabolic and non-metabolic pathways. In contrast to conventional systems, the ERGO user can take into account sequence similarity, protein and gene context clustering, occurrence profiles, regulatory and expression data, as well as functional hierarchies in order to arrive at the best possible set of functional predictions. In fact, using the ERGO system, a major part of the metabolism of an organism can be reconstructed entirely in silico (5). The cyclical nature of the integration of these information types continually improves our knowledge and understanding of the complex dynamics of living organisms.

The current version of the ERGO™ database contains 618 complete or nearly complete genomes, of which 319 are Bacteria, 116 Eukarya, 34 Archaea and 149 Viruses (Figure 1). In total, these genomes contain over 1,300,000 Open Reading Frames (ORFs), more than 60% of which have a functional annotation. The percentage of annotated genes is higher for the bacterial genomes, reaching an average of 70%. Every genome that goes into the ERGO system is annotated from scratch, whether it has been sequenced at Integrated Genomics or at another sequencing center. More than 450 of the genomes are available by subscription or as part of a stand-alone ERGO server package from Integrated Genomics.



Figure 1. Phylogenetic distribution of the number of complete (red bars) and gapped (white bars) genomes integrated in the ERGO™ bioinformatics suite.

The ERGO system integrates many different types of data, summarized in Table 1. These include genomic and pathway-related data (both metabolic and non-metabolic pathways), regulatory data, as well as 'proteomics' data such as gene essentiality and expression data. The genomic data include genome contigs, locations of ORFs and their translations, locations of RNAs, locations of insertion elements, functional assignments (along with their history records) and a number of proprietary gene clustering tools. The primary tools cluster the ORFs according to sequence similarity (i.e. ortholog, paralog and protein clusters) or gene context (i.e. chromosomal and fusion clusters). The ortholog clusters are essentially bi-directional best hits across different genomes, while paralog clusters are homologs within the same genome. Protein family clustering is a new clustering technology being developed at IG. It is based on the extensively manually curated ORF database of ERGO and is an attempt to produce protein families in which all ORFs share strong sequence homology and have the same predicted function. More than 60% of the ORFs in ERGO are currently connected to these sets of clusters. Protein clustering information is only available through subscription or purchase of a stand-alone ERGO server. The principle of chromosomal and fusion clustering, and its importance in function prediction, have been reported previously (6,7).

Table 1. Summary of Data types in ERGO
Genomic Data
  • DNA sequence data assembled into contigs (from over 400 genomes)
  • ORFs and their Location (Graphical visualization of ORFs on a contig)
  • Translation of ORFs
  • Pre-computed sequence similarities for each ORF (against the entire database)
  • Functional assignments of proteins (with their history records)
  • RNA assignments
  • Identification and localization of insertion elements (ISs)
  • Ortholog clusters
  • Paralog clusters
  • Protein family clusters
  • Chromosomal clusters
  • Fusion clusters
Pathway Data
  • Chemical structures
  • Enzyme records
  • Metabolic pathways
  • Non-metabolic pathways
  • Cellular overviews (networks of metabolic and non-metabolic pathways)
  • Functional hierarchies (Functional roles organized into Gene Ontologies)
Regulatory Data
Essentiality Data
Expression Data


 

2. Loading a new genome in ERGO (Identification of ORFs)

In order to incorporate a genome into ERGO™, all the potential ORFs must first be identified. There are numerous tools now available to support the identification of genes, and better tools are constantly being developed. The reason for the rapid advances is clear: many of the decisions required in gene identification are based on sequence similarity to previously identified genes, and the set of well-annotated genomes is growing rapidly. As the pool of characterized genes grows, algorithms that exploit this information will produce more consistent and accurate predictions. Gene identification within prokaryotic genomes is substantially more reliable than within eukaryotic genomes; the absence of introns and the fact that the sequence is often of higher quality are undoubtedly major factors. The remaining problems relate to the choice of start positions, the detection of frameshifts, and the identification of short genes. These problems are real, but relatively minor. Integrated Genomics has developed proprietary software to address them.

The DNA sequence and its putative coding regions are then loaded into ERGO™ as a package. The installation follows a standard protocol which includes (i) de novo calculation of sequence similarities for all the newly predicted ORFs against the entire non-redundant set of ORFs (over 2 million) in ERGO using the FASTA algorithm and (ii) re-calculation of clusters based on sequence (i.e. orthologs, paralogs, protein clusters), or gene context (i.e. chromosomal 'neighborhoods' and fusion events) for the entire database. The installation is followed by a multi-step annotation routine established at IG and applied for the annotation of more than 200 genomes. The process culminates with building of metabolic reconstruction models, which are represented by wire diagrams of subsystems and cellular pathways connected to gene sequences.


 

3. Genome-wide Functional Annotation

Genome annotation consists of a series of automated and manual procedures carried out by a combination of sophisticated algorithms and an experienced team of professional annotators at Integrated Genomics. The main steps are shown in Figure 2. In general, there are three stages of functional assignment for a given genome: two stages before the metabolic reconstruction is completed, and one stage after.



Figure 2. Genome curation steps at the ERGO™ bioinformatics suite.

 

3.1. Initial automated functional annotations

The automatic annotations in ERGO™ are the result of a multi-step process, parts of which are summarized in Table 2.

Table 2. First round of analysis: Automatic annotation steps
  • Identify RNA genes: This is a semi-automated procedure. Currently we identify only tRNAs and rRNAs.
  • Identify protein-encoding genes: See above.
  • Estimate phylogenetic distances: We attempt to determine an approximate position in the phylogenetic tree and the organism's closest neighbors, and to estimate the distance between the organism and others in the tree. This is used at several points within the analysis that follows.
  • Calculate protein scores: Compute similarities of all ORFs of the query genome against the ERGO non-redundant database.
  • Compute bi-directional best hits (BBHs) between protein-encoding genes: A gene X in organism Gx is a bi-directional best hit of a gene Y in organism Gy if and only if X is the closest gene to Y in Gx and Y is the closest gene to X in Gy. These are used constantly to attempt to find corresponding genes in distinct genomes. There is a broad literature on the use and misuse of BBHs.
  • Compute pairs of close bi-directional best hits (PCBBHs): These are used heavily in the computation of 'functional coupling based on chromosomal clusters'.
  • Compute pairs of close homologs (PCHs): As with PCBBHs, these are used in the computation of 'functional coupling based on chromosomal clusters'.
  • Compute pinned orthologs: Pinned orthologs are used in constructing the 'pinned regions' displays in ERGO.
  • Compute preserved operons: Preserved operons are estimates of sets of related genes that are clustered into what appears to be 'an operon', but no real assertion is being made of co-regulation.
  • Compute paralog families: These are important in classes of genes like transposases, transporters, regulatory proteins, and 2-component signaling systems.
  • Compute protein families: We consider a family to be a set of homologous proteins with the same non-hypothetical function (obviously, both the homology and the identity of function are estimates).
  • Compute profiles (of both families and organisms): Profiles are not widely used at this stage. ERGO provides tools to compute which families act as signatures for sets of organisms based on 'family profiles'.
  • Compute 'spreadsheets' between closely-related genomes: This determines both core functionality and sets of genes which are local to subsets of the genomes.
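
The bi-directional best hit definition in Table 2 can be illustrated with a short Python sketch. It is not ERGO's implementation: the input layout (a list of directed hits and a gene-to-genome dictionary), the function name, and the example gene identifiers are all assumptions made for illustration, and the scores stand in for whatever similarity measure the similarity search produces.

```python
def bidirectional_best_hits(hits, genome_of):
    """Return the set of BBH pairs found in a list of directed similarity hits.

    hits      : iterable of (query_gene, subject_gene, score), higher = closer
    genome_of : dict mapping each gene to its genome (hypothetical layout)
    """
    # best[(gene, other_genome)] = (score, closest gene in other_genome)
    best = {}
    for query, subject, score in hits:
        target_genome = genome_of[subject]
        if genome_of[query] == target_genome:
            continue  # within-genome hits are paralog candidates, not BBHs
        key = (query, target_genome)
        # Strict '>' keeps the first hit seen when scores tie.
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    pairs = set()
    for (query, _), (_, subject) in best.items():
        # X and Y are a BBH only if Y's best hit back in X's genome is X.
        reverse = best.get((subject, genome_of[query]))
        if reverse is not None and reverse[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy data: two genomes, A and B, with two genes each (made-up identifiers).
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
hits = [
    ("a1", "b1", 90), ("a1", "b2", 40),
    ("a2", "b1", 70), ("a2", "b2", 30),
    ("b1", "a1", 90), ("b1", "a2", 70),
    ("b2", "a1", 40), ("b2", "a2", 30),
]
bbh_pairs = bidirectional_best_hits(hits, genome_of)
```

In this toy data only a1 and b1 choose each other; a2 and b2 both point at a gene whose own best hit is elsewhere, so they form no BBH.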

 

3.2. Manual functional annotations

Manual annotations can lead to far more accurate and detailed predictions than any automatic tool. Although it has been argued that manual analysis cannot keep up with the extraordinarily large volumes of sequence data produced, Integrated Genomics has achieved this through a combination of a systematic approach and the integration of the data into a single system. The systematic approach includes a manual inspection of the automatically assigned functions, as well as an exhaustive manual study of every single gene, using both proprietary and publicly available tools (Table 3). Here, all questionable assignments and cases of weak homology are evaluated using sequence similarity search tools.

Table 3. Second round of analysis: Manual annotation steps
  • Examine un-annotated ORFs with strong hits to ORFs with functions: At the end of the automatic annotation round, genes may remain without a function although they have strong sequence similarity to other genes of known function. This happens when the program encounters equally strong hits to different functions. A more detailed analysis is therefore needed here to distinguish between the alternative options.
  • Reconcile models: This step entails a manual inspection of all the differences between the automatic annotations in ERGO and those generated by motif/domain databases like Pfam, COGs, InterPro, etc. This step increases both the coverage and the accuracy.
  • Examine gene context: Examine the physical layout of the genes on the chromosome, looking for previously unrevealed functional relations between genes.
  • Examine paralog families: Examine the ORFs that are part of paralog families and remain un-annotated. Several such cases can be annotated with general family names, rather than assigning exact specificity.
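
The first step of Table 3 concerns ORFs whose strongest hits point to different functions. A minimal sketch of how such cases might be separated from unambiguous ones is given below; the thresholds `min_score` and `tie_margin`, the function name, and the three status labels are hypothetical and are not taken from ERGO.

```python
def classify_orf(hits, min_score=100.0, tie_margin=0.05):
    """Decide whether an ORF can be annotated automatically.

    hits : list of (score, function) pairs for one ORF against annotated ORFs;
           function is None when the matched ORF itself has no function.
    min_score and tie_margin are illustrative thresholds, not ERGO values.
    """
    strong = [(s, f) for s, f in hits if s >= min_score and f is not None]
    if not strong:
        return ("unannotated", None)

    top = max(s for s, _ in strong)
    # Functions whose best hit scores within tie_margin of the top score.
    candidates = sorted({f for s, f in strong if s >= top * (1 - tie_margin)})
    if len(candidates) == 1:
        return ("auto", candidates[0])
    # Comparably strong hits to different functions: leave for a curator.
    return ("manual_review", candidates)


ambiguous = classify_orf([(250, "kinase"), (245, "phosphatase"), (120, "kinase")])
clear = classify_orf([(250, "kinase"), (150, "phosphatase")])
weak = classify_orf([(50, "kinase")])
```

Here `ambiguous` is routed to manual review with both candidate functions attached, which mirrors the situation the table describes: the similarity evidence is strong, but it does not single out one function.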

Since functional annotations have traditionally been based on similarity to genes of known function, ERGO provides online access to sequence similarity tools such as BLASTP and PSI-BLAST, with searches submitted to the NCBI server. In addition, queries can be submitted to more sensitive sequence similarity search tools such as the motif/pattern databases Pfam, PROSITE, ProDom, InterPro and COGs.

One of ERGO's most significant features is its comparative annotation environment, which provides quality checks for both the automatic annotations and the manual analysis. To this end, a user may request a comparison of all the different annotations available for the genes of a particular genome (Table 3). These annotations come either from other users of the ERGO system or from external databases (whose annotations have already been integrated into ERGO). Whenever possible, all the function predictions from SwissProt, TrEMBL, or PIR are included for the genomes, as well as those based on Pfam and COGs.

Furthermore, as the number of genome sequences grows, we have incorporated additional methods that rely on gene context rather than on sequence similarity (7,11). Based on the tendency of functionally related bacterial genes to cluster along the chromosome, it is now possible to extend our ability to predict functions beyond sequence similarity (11). We calculated such 'chromosomal clusters' with a bi-directional best-hit algorithm for all the ortholog genes throughout all the genomes in the ERGO database. Since only about one third of the genes are clustered this way in an average bacterial genome, a large number of genomes is needed for the method to work. With ERGO's content of more than 350 prokaryotic genomes, chromosomal clusters coupled with organism-specific functional pathways have become a powerful tool for predicting functions for 'missing' genes (and gene families) and for genes with weak homology. One can suggest a functional role for an unknown ORF by cross-referencing the chain of biochemical reactions with an ORF cluster in any genome. A similar approach is gene-fusion analysis, which is based on the observation that two or more ORFs that are separate ("components") in one organism often have their orthologs fused into a single protein (a "composite") in another (7). Such fusions sometimes yield a functional clue for unknown 'components': if one of the separate ORFs does not have a known function, perhaps it is related to its 'Siamese twin' domain with known function, and vice versa. Gene fusions are particularly important for eukaryotes, where up to 55% of all genes can be fused in a given genome (A. thaliana, C. elegans).
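
The gene-fusion idea described above can be sketched in a few lines of Python. This is a simplified illustration rather than the published method (7) or ERGO's code: it assumes that alignments are already available as (component, composite, start, end) tuples with coordinates on the composite protein, and it only checks that two components from the same genome match essentially non-overlapping regions of one composite in another genome. The `max_overlap` tolerance and all identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_fusions(alignments, genome_of, max_overlap=10):
    """Find composite proteins covered by two separate components.

    alignments : iterable of (component, composite, start, end), where start
                 and end are 1-based positions on the composite protein
    genome_of  : dict mapping each protein to its genome
    max_overlap: residues by which the two matched regions may overlap
                 (an illustrative tolerance, not a published value)
    """
    regions = defaultdict(list)
    for component, composite, start, end in alignments:
        if genome_of[component] != genome_of[composite]:
            regions[composite].append((component, start, end))

    fusions = []
    for composite, matched in regions.items():
        for (c1, s1, e1), (c2, s2, e2) in combinations(matched, 2):
            # Both components must be distinct ORFs of the same genome.
            if c1 == c2 or genome_of[c1] != genome_of[c2]:
                continue
            overlap = min(e1, e2) - max(s1, s2) + 1  # negative when disjoint
            if overlap <= max_overlap:
                fusions.append((composite, tuple(sorted((c1, c2)))))
    return fusions


# Toy example: two ORFs of genome X match the two halves of one ORF of genome Y.
genome_of = {"x_orf1": "X", "x_orf2": "X", "z_orf5": "Z", "y_orf9": "Y"}
alignments = [
    ("x_orf1", "y_orf9", 1, 250),
    ("x_orf2", "y_orf9", 260, 450),
    ("z_orf5", "y_orf9", 255, 450),  # different genome, so never paired with X
]
fusion_calls = find_fusions(alignments, genome_of)
```

If x_orf1 had a known function and x_orf2 did not, the fusion call would suggest looking for a functional link between the two, which is the kind of clue the paragraph describes.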

Overall, the combined use of the above tools, along with the detailed multi-step manual curation supported by the ERGO system, results in a significant improvement in function prediction.

 

3.3. Automated Pathway annotations

Once a function has been predicted confidently, it may be connected to a particular metabolic or cellular pathway that already exists in the ERGO™ pathway collection. The level of detail and coverage at this step is directly related to the number of pathways present in the ERGO system. Over the years, IG has been compiling a database of functional pathways dubbed IG-Pathdb. It now contains over 5,000 cellular pathways (the majority of which are metabolic), and new ones are being added daily. Each metabolic pathway entry stores information about metabolites, reactions, and the corresponding enzymatic functions. The non-metabolic pathways, unlike the metabolic ones, represent either lists of functionally related genes (e.g. genes of the large ribosomal subunit, or genes of type IV protein secretion) or general lists of process-related functions (e.g. general transcription activators, or phage proteins). Most of the pathways were extracted from the experimental literature and connected to specific gene sequences in the genomes of the ERGO database. With improvements in annotation technology, many pathways are now deduced directly from the sequenced genomes, using the metabolite compounds as connecting nodes and a set of rules. When a genome with annotated ORFs is added to ERGO, a set of pathways is automatically assigned to the organism based on a collection of pathway templates. During this automated step, only the pathways in which every functional role is connected to at least one gene are assigned (via the functions already assigned). Each function can be connected to a number of different or alternative pathways.
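
The rule stated above, that a pathway is asserted automatically only when every one of its functional roles is connected to at least one gene, can be written as a short Python sketch. The data layout (templates as sets of role names, one function per ORF) and every name below are assumptions for illustration; IG-Pathdb's actual schema is not described in this overview.

```python
from collections import defaultdict


def assert_pathways(templates, gene_functions):
    """Assert the pathways whose functional roles are all covered by genes.

    templates      : dict mapping pathway name -> set of functional roles
    gene_functions : dict mapping gene -> assigned function (None if unknown)
    Returns a dict: asserted pathway -> {role: sorted list of genes}.
    """
    genes_by_role = defaultdict(set)
    for gene, role in gene_functions.items():
        if role is not None:
            genes_by_role[role].add(gene)

    asserted = {}
    for pathway, roles in templates.items():
        # Every role needs at least one gene; empty templates are skipped.
        if roles and all(genes_by_role.get(role) for role in roles):
            asserted[pathway] = {r: sorted(genes_by_role[r]) for r in roles}
    return asserted


# Placeholder pathway and role names, not real IG-Pathdb entries.
templates = {
    "pathway_1": {"role_A", "role_B"},
    "pathway_2": {"role_B", "role_C"},
}
gene_functions = {"orf1": "role_A", "orf2": "role_B", "orf3": "role_B", "orf4": None}
asserted = assert_pathways(templates, gene_functions)
```

In this example pathway_2 is not asserted because role_C has no gene, while role_B is shared by both templates, reflecting the point that one function can be connected to several pathways.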

 

3.4. Manual Pathway annotations

In the second round, an expert user can manually perform a 'reality check' on the set of asserted pathways (particularly the alternative ones), or assert additional ones, according to the literature data concerning the organism's 'lifestyle', as well as its biochemistry and genetics. Cellular pathways are connected into larger functional subsystems, such as amino-acid metabolism, oxidative phosphorylation, lipid metabolism, secretion, etc. This is a partially automated task done at IG by professional curators specializing in particular subsystems. Based on their expert knowledge of a subsystem (with all the alternatives among hundreds of organisms in ERGO), curators first look for the asserted pathways from their functional subsystem. Then they determine the set of pathways which must be found in the organism under study because they are essential for the organism. These additional pathways will not have been asserted previously, because gene associations are missing for one or more of the functional roles in the pathway. Once identified, these pathways can be used to find the missing functions that escaped the initial similarity-based analyses.

 

3.5. Detailed function annotations

The detailed knowledge of every step in our collection of metabolic pathways allows us to identify the missing steps of a pathway for a particular organism. We then go into a third round of manual annotations and try to predict these missing steps. This third and final step of annotation entails a directed approach that is reversed compared to the first two rounds. In this highly laborious step, the query is the function predicted to be present, and the target is the gene, which is now expected to be identified; in the first two rounds, the query was the gene that had been predicted to exist and the target was the function that remained unidentified. If most (or some) functions in a given pathway are connected to genes which are neighbors on the chromosome, then that may yield a functional clue: if one of the ORFs in this neighborhood is without an assigned function, then perhaps it has the function in the pathway that has no gene connected.

Table 4. Third round of analysis: Focused manual annotation steps
  • Pathway assertions: Using a combination of automatic tools and manual analysis, all possible cellular pathways are asserted for an organism.
  • Identification of 'missing functions': Identification of the functional roles in asserted pathways for which no gene was identified.
  • Verify that functions match the functional roles of the pathways (controlled vocabulary cleanup): Check for instances in which functions were assigned but do not connect to existing pathways/subsystems (this often leads to the addition of more pathways or subsystems and to more accurate annotations).
  • Examine gene context: Examine the physical layout of the genes on the chromosome, looking for previously unrevealed functional relations between neighboring genes.
  • Search for possible unidentified genes: If all the above steps fail, examine the physical layout of genes on the chromosome, looking for unacceptable overlaps between genes or unusually long gaps between genes.

Thus, the combination of cellular pathways and gene context tools available in ERGO provides an ideal framework not only to identify and connect all possible functions to genes, but also to predict which functions should be present and to facilitate the discovery of their corresponding genes (Figure 3).
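
The reversed search of the third round, where the query is a missing function and the target is a gene, can be sketched as follows. This is an illustrative simplification under stated assumptions: gene order on a contig is given as a plain list, a neighborhood is defined by a hypothetical `window` of positions, and all function and variable names are invented for the example rather than taken from ERGO.

```python
def missing_roles(roles, gene_functions):
    """Functional roles of an asserted pathway that have no gene connected."""
    assigned = {f for f in gene_functions.values() if f is not None}
    return sorted(set(roles) - assigned)


def neighborhood_candidates(roles, gene_functions, gene_order, window=2):
    """Un-annotated ORFs lying close to genes of the same pathway.

    gene_order : genes in their order along one contig
    window     : maximum distance, in gene positions, that counts as
                 'neighboring' (an illustrative cutoff)
    """
    pathway_positions = [
        i for i, gene in enumerate(gene_order) if gene_functions.get(gene) in roles
    ]
    candidates = []
    for i, gene in enumerate(gene_order):
        unannotated = gene_functions.get(gene) is None
        if unannotated and any(abs(i - j) <= window for j in pathway_positions):
            candidates.append(gene)
    return candidates


# Placeholder roles and ORF identifiers.
roles = {"role_A", "role_B", "role_C"}
gene_order = ["orf1", "orf2", "orf3", "orf4", "orf5", "orf6", "orf7"]
gene_functions = {
    "orf1": "role_A", "orf2": None, "orf3": "role_B", "orf4": "other_role",
    "orf5": None, "orf6": None, "orf7": "another_role",
}
gaps = missing_roles(roles, gene_functions)
candidates = neighborhood_candidates(roles, gene_functions, gene_order)
```

Here role_C is the missing function, and orf2 and orf5 are the un-annotated ORFs close enough to pathway genes to be worth examining first; orf6 lies outside the window. A curator would still need to confirm any such candidate with sequence-based evidence.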



Figure 3. Schematic representation of the process of function identification based on the combination of tools related to gene neighborhood and metabolic pathways in ERGO.


 

References

1. Bernal, A., Ear, U. and Kyrpides, N. (2001) Genomes OnLine Database (GOLD): a monitor of genomes projects world-wide. Nucleic Acids Res. 29, 126-127.
2. Overbeek, R., Larsen, N., Walunas, T., D'Souza, M., Pusch, G., Selkov Jr., E., Liolios, K., Joukov, V., Kaznadzey, D., Anderson, I., Bhattacharyya, A., Burd, H., Gardner, W., Hanke, P., Kapatral, V., Mikhailova, N., Vasieva, O., Osterman, A., Vonstein, V., Fonstein, M., Ivanova, N. and Kyrpides, N. (2003) The ERGO™ Genome Analysis and Discovery System. Nucleic Acids Res. 31, 164-171.
3. Overbeek, R., Larsen, N., Smith, W., Maltsev, N. and Selkov, E. (1997) Representation of function: the next step. Gene 191, GC1-GC9.
4. Overbeek, R., Larsen, N., Pusch, G.D., D'Souza, M., Selkov Jr., E., Kyrpides, N., Fonstein, M., Maltsev, N. and Selkov, E. (2000) WIT: Integrated System for High-throughput Genome Sequence Analysis and Metabolic Reconstruction. Nucleic Acids Res. 28, 123-125.
5. Selkov, E., Overbeek, R., Kogan, Y., Chu, L., Vonstein, V., Holmes, D., Silver, S., Haselkorn, R. and Fonstein, M. (2000) Functional analysis of gapped microbial genomes: amino acid metabolism of Thiobacillus ferrooxidans. Proc. Natl. Acad. Sci. U S A 97, 3509-3514.
6. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D. and Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U S A 96, 2896-2901.
7. Enright, A., Iliopoulos, I., Kyrpides, N. and Ouzounis, C. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86-90.
8. Mironov, A.A., Novichkov, P.S. and Gelfand, M.S. (2001) Pro-Frame: similarity-based gene recognition in eukaryotic DNA sequences with errors. Bioinformatics 17, 13-15.
9. Novichkov, P.S., Gelfand, M.S. and Mironov, A.A. (2001) Gene recognition in eukaryotic DNA by comparison of genomic sequences. Bioinformatics 17, 1011-1018.
10. Mironov, A.A., Fickett, J.W. and Gelfand, M.S. (1999) Frequent alternative splicing of human genes. Genome Res. 9, 1288-1293.
11. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D. and Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U S A 96, 2896-2901.


 
ERGO family:
  • ERGO™ bioinformatics suite: property of Integrated Genomics, Inc. IG provides access to ERGO™ through a fee-based subscription.
  • Publicly available version of ERGO™: the server and the associated data are free of charge for academic and non-commercial use only.