ERGO™ ERGO Tutorial Integrated Genomics
 
 
 

 1. Starting with ERGO™ bioinformatics suite
 1.1. User-specific configuration
 1.2. Starting with the ERGO-Light bioinformatics suite
 2. Available genomes
 2.1. User configuration of Genomes page
 3. Statistics
 3.1. General Statistics Page
 3.1.1. DNA-related Statistics
 3.1.2. ORFs-related Statistics
 3.1.3. Pathway/Clusters related Statistics
 3.2. Organism Statistics Page (Stats)
 3.2.1. DNA-related Statistics
 3.2.1.1. DNA contigs
 3.2.1.1.1. DNA contig viewer (Contig ID)
 3.2.1.1.2. Table formatted list of ORFs (# ORFs)
 3.2.2. ORFs-related Statistics
 3.2.3. Function-related Statistics
 3.2.4. Pathway-related Statistics
 3.2.5. Cluster-related Statistics
 3.2.6. Domain-related Statistics
 3.2.6.1. Unique Pfam domains
 3.2.6.2. Unique COG domains
 3.2.7. Combination-related Statistics
 3.2.8. Compare/reconcile assignment differences


 

1. Starting with ERGO™ bioinformatics suite

Integrated Genomics Inc. welcomes you to the ERGO™ bioinformatics suite. The front page of ERGO (see below) is designed to give the user direct access to the system, as well as general information about the company and the system. For ease of description, the front page is divided below into 4 distinct sections:

The first section provides information on the current status of the system, as well as the name of the logged-in user (in the case above: User guest). The genomes number, shown at the top left, gives the number of genomes currently provided by this version of the system (in the case above: 402 genomes). It also serves as a link to the list of available genomes (see section 2 below). The cumulative general statistics for all the organisms in the system are available through the Statistics link at the top middle of the page (see section 3.1 below). By clicking on the user name, one can also change the User-Name at any time.
The second section is the green menu bar, which provides access to the different data sets, tools and pages in the ERGO system. Not all of these data and tools are available in the Light version of the system. The Data menu provides access to various data sets in ERGO (for a detailed description see below in section xxx). These include Functions statistics and Clusters statistics, as well as data from Microarrays or Transposomics (not available in the Light version). The Tools menu provides access to the various tools available in the system (for a detailed description see below in section xxx). These are mostly related to whole-genome comparisons. This menu is not available in the Light version. The Query menu provides access to the various query pages in ERGO, including keyword searches and uploaded sequence/pattern queries (for a detailed description see below in section xxx). The Configure menu allows user-specific configuration of the system (see section 1.1 below). Finally, the Help menu gives the user access to this tutorial, as well as contact information and general acknowledgements.


 
1.1. User-specific configuration
(not available on the Light edition)

As mentioned above, the ERGO™ system supports the creation of a user-specific environment throughout several of the provided tools and pages. We suggest that you start by configuring your environment from the front page (user-defined configuration is disabled in the Light version). From the green menu bar at the top of the page, the user can click on the arrow next to the Configure menu (see the menu inside the red box) in order to: (a) change the User Name; (b) select a preferred organism; or (c) select a preferred group of organisms, as shown below:

When the user chooses to select a preferred organism, a new window appears with the list of the genomes available on the subscribed ERGO system. The user can select any one of these genomes (only one at a time) by clicking on the radio button in front of the preferred organism. The selected organism is then saved in the preferences of this particular User-Name. When the user next logs in to the system under this name, the selected organism will appear automatically.

The colored buttons in front of the organism names denote the different domains:
- Bacterial,
- Archaea,
- Viral,
- Eukarya.
The list of organisms is sorted alphabetically by default, but it can also be sorted by domain and then by name (by selecting this option from the menu next to: Sort the organisms below). Once the preferred organism is selected, the user may click on the Submit Organism Preference button. The Delete Preference button restores the saved preferences to those in place before the selection.
As soon as the user selects an organism and clicks Submit Organism Preference, a new (brown) menu bar appears below the green one, with information and tools that apply specifically to the selected organism. For example, if the user selects Bacillus cereus, the new menu bar will be labeled Bacillus cereus, and a list of tools specific to this organism will be presented next to the name.

From the Configure menu, the user can also select a preferred set of organisms by clicking on Set Organism Group. This permits the selection of multiple organisms, on which the user can later perform a number of queries. A number of predefined organism sets already exist in the system (for example, Archaea, Bacteria, etc., see below). However, the user can create new sets by selecting the buttons in the first column and saving the selected organisms through the menu next to the Save Loaded Group As button.


 

1.2. Starting with the ERGO-Light bioinformatics suite

The Light edition, as its name denotes, is a streamlined version of the enterprise-scale ERGO™ Bioinformatics Suite, which is currently available only through subscription. ERGO-Light is essentially a read-only version, lacking the annotation tools and the ability to configure a user environment. However, Light contains a full set of individual genome analysis tools and all the primary data types found in ERGO, including full contig data, protein sequences and annotations, and chromosomal clustering information. Users can take advantage of Integrated Genomics' full metabolic and non-metabolic pathway database (containing over 5000 pathways) and a sample of the highly curated bacterial metabolic reconstruction.
Most ERGO pages include help buttons, presented as blue boxes with a question mark. We recommend reading these for general information on the contents of each page. In several cases, however, the help text refers to user-defined tools that are absent from the Light edition.
We recommend that you start by clicking on the "genomes" number link at the top of the front page. This number denotes the number of organisms currently available through this version of the ERGO system. The link will take you to a page that lists the names of these organisms, with further links to pages of information about those genomes (see below).


 

2. Available genomes

To see the list of available genomes, the user can click at any time on the "genomes" number link at the top of the front page. In the ERGO-Light version, this displays the list of the 7 genomes available at this point. The user may click on the blue box with the question mark for further help on the information provided in this page (including information on tools not available in the Light version).

The columns and controls of this page are:

  • D: Domain name (the icon shown is for Bacteria);
  • ID: the two or three letter code of the organism;
  • Stats: a link to a page with the genome's general statistics (see section 3.2.);
  • Model-Graph: a link to a graphical representation of the cellular overview (only parts of this are available through ERGO-Light; a full subscription is necessary to access the complete overview);
  • Model-Text: a link to the textual overview of the genome, which is automatically generated from the graphical overview;
  • allO: the total number of ORFs identified so far for each genome, with a link to a table that lists all of them;
  • AP: the total number of pathways asserted for the organism, with a link to the list of these pathways;
  • Switch-Groups: a menu that allows the user to switch from this group of genomes to other sub-groups, which either exist in the system by default or can be created by the user (see section 1.1.). For example, one may choose to see only the genomes sequenced by Integrated Genomics Inc. (select IG Organisms), or only the complete and published genomes (select Complete and Published). In the complete system (available on subscription), the user can in addition define new sets and view the corresponding list through this page.

The data/links provided in the Genomes page can also be configured according to the needs of the user. This ability is again absent from the Light version.


 
2.1. User configuration of Genomes page
(not available on the Light edition)

The full enterprise version of the ERGO system allows the user to configure many of the pages throughout ERGO.

The data/links provided through the Genomes page can also be configured by clicking on the corresponding button on the top right of the page: Configure Page (selected in the red box).

The Genomes Configuration Panel (shown at the right) allows the user to add further types of data to the data/links presented on the above page. In fact, all the data types that are presented further down in the Organism Statistics Page (section 3.2.) can also be displayed here in an organism-comparative mode. Once the desired data types are selected, the user should click the Submit Choices button; the above Organisms Overviews Page is then updated automatically, and the Genomes Configuration Panel disappears.


 

3. Statistics

The ERGO system provides general data statistics at any time, either for the overall system or individually for each of the integrated genomes.


 

3.1. General Statistics Page

To access the cumulative general statistics of the system, the user can click on the Statistics link at the top of the ERGO page (see section 1 above). This leads to a page that looks like the one below:

The data types for which the statistics are presented above are shown in the first column (Data Category); the actual numbers for these data types are shown in the second column (Counts), and the percentage out of the total is presented in the third column (% of Total). The data types are organized into three distinct functional units: (i) DNA-related statistics, (ii) ORFs-related statistics, (iii) Pathway and Cluster related statistics.


 

3.1.1. DNA-related Statistics

The Cumulative Statistics for DNA include (a) the total number of base pairs sequenced in all genomes (DNA total sequenced, bases), as well as the subset of those base pairs that are part of coding regions (DNA coding sequences, bases); (b) the DNA base pairs consisting only of AGCT (DNA, bases (AGCT only)), as well as statistics for the A+T (DNA A+T content, bases) and the G+C (DNA G+C content, bases) base pairs. Finally, the total number of DNA contigs is also reported.
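The DNA counts described above can be illustrated with a minimal Python sketch. It is not ERGO's implementation: the function names, the input format (a dictionary of contig sequences), and the choice of denominator for the percentage are assumptions made for illustration, and the coding-base count is simply passed in rather than derived from ORF coordinates.

```python
from collections import Counter


def dna_statistics(contigs, coding_bases=0):
    """Tally DNA statistics for a dict mapping contig IDs to sequence strings.

    The dictionary keys reuse the labels from the statistics page; everything
    else here is a hypothetical illustration.
    """
    counts = Counter()
    for sequence in contigs.values():
        counts.update(sequence.upper())
    total = sum(counts.values())
    return {
        "DNA total sequenced, bases": total,
        "DNA coding sequences, bases": coding_bases,
        # Ambiguity codes such as N are excluded from the AGCT-only count.
        "DNA, bases (AGCT only)": sum(counts[base] for base in "ACGT"),
        "DNA A+T content, bases": counts["A"] + counts["T"],
        "DNA G+C content, bases": counts["G"] + counts["C"],
        "DNA contigs": len(contigs),
    }


def percent_of_total(count, total):
    """Percentage for a '% of Total' column (the denominator is the caller's choice)."""
    return round(100.0 * count / total, 2) if total else 0.0


# Toy sequences, not real genome data.
example = {"contig_1": "ATGCGCNNAT", "contig_2": "GGCCTA"}
stats = dna_statistics(example, coding_bases=9)
for label, value in stats.items():
    print(f"{label}: {value}")
```

Counting ambiguous bases separately is what makes the "AGCT only" figure differ from the total number of sequenced bases.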


 

3.1.2. ORFs-related Statistics

The Cumulative Statistics for the ORFs include:

  • the total number of predicted ORFs in all genomes (ORFs total);
  • the total number of ORFs that have an assigned function (ORFs with assigned function), as well as the subset of those that have a function but no sequence similarity to any other ORF in the system (ORFs with function but no similarity);
  • the number of ORFs without any assigned function (ORFs without assigned function), as well as two subsets of those: the number of ORFs that have neither a function nor sequence similarity (the FastA cut-off used is a P-value better than 0.01) (ORFs without function or similarity), and the number of ORFs without function but with sequence similarity to other ORFs in the system (ORFs without function, with similarity);
  • the number of ORFs that are connected to asserted pathways (ORFs in asserted pathways);
  • the number of ORFs that are not connected to asserted pathways (ORFs not in asserted pathways), as well as the subset of these ORFs that have a function but are still not connected to asserted pathways (ORFs with assigned function but no pathway): this latter set of ORFs arises either because the pathway to which the ORFs are connected is not yet asserted in the reference organism, or simply because no such pathway exists in the ERGO system yet;
  • the number of ORFs that are connected to the general functional overview (ORFs in the functional overview) (see below section xxx, for the description of what the functional overview is);
  • the number of ORFs that are found in protein clusters (ORFS in protein clusters) (not available on the Light edition);
  • the number of ORFs in clusters of paralog genes (ORFs in paralog clusters);
  • the number of ORFs that have a match in the COGs database of NCBI (ORFs in COGs);
  • the number of ORFs that have a hit in the Pfam database of Washington University (ORFs in Pfam);
  • the number of ORFs identified in chromosomal clusters (ORFs in chromosomal clusters);
  • the number of ORFs that participate in fusion events (ORFs in possible fusion events), as well as two subsets of those: the number of ORFs which participate in fusion events as composite (ORFs in possible fusion events as composites), as well as those that participate as components (ORFs in possible fusion events as components).
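The way the first few ORF categories above nest inside one another can be sketched as follows. This is a hypothetical illustration, not ERGO code: the `Orf` record, its field names, and the convention that an unassigned function is stored as `None` are all assumptions, and the similarity flag stands in for the FastA comparison mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Orf:
    """Hypothetical ORF record; the field names are illustrative only."""
    orf_id: str
    function: Optional[str]     # None when no function has been assigned
    has_similarity: bool        # True if similar to another ORF in the system
    in_asserted_pathway: bool


def orf_statistics(orfs):
    """Count ORFs per category, using the labels of the statistics page."""
    with_fn = [o for o in orfs if o.function is not None]
    without_fn = [o for o in orfs if o.function is None]
    return {
        "ORFs total": len(orfs),
        "ORFs with assigned function": len(with_fn),
        "ORFs with function but no similarity":
            sum(not o.has_similarity for o in with_fn),
        "ORFs without assigned function": len(without_fn),
        "ORFs without function or similarity":
            sum(not o.has_similarity for o in without_fn),
        "ORFs without function, with similarity":
            sum(o.has_similarity for o in without_fn),
        "ORFs in asserted pathways":
            sum(o.in_asserted_pathway for o in orfs),
        "ORFs not in asserted pathways":
            sum(not o.in_asserted_pathway for o in orfs),
        "ORFs with assigned function but no pathway":
            sum(not o.in_asserted_pathway for o in with_fn),
    }


# Toy records with made-up IDs and functions.
orfs = [
    Orf("ORF0001", "DNA polymerase III alpha subunit", True, True),
    Orf("ORF0002", "Spore coat protein", False, False),
    Orf("ORF0003", None, True, False),
    Orf("ORF0004", None, False, False),
]
orf_stats = orf_statistics(orfs)
```

Under these assumptions, the two "without function" subsets always add up to "ORFs without assigned function", and the with/without function counts add up to "ORFs total".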


 

3.1.3. Pathway/Clusters related Statistics

The Cumulative Statistics for the Pathway/Clusters include:

  • the total number of Pathways that have been asserted in all genomes (Pathways asserted total);
  • the total number of Paralog clusters (Paralog clusters, total), as well as two subsets of those: the number of paralog clusters in which at least one (but not all) of the ORFs has an unknown function (Paralog clusters, some members hypothetical), and the number of paralog clusters in which all the ORFs have unknown functions (Paralog clusters, all members hypothetical).
The paralog clusters in which at least one (but not all) of the members is hypothetical are useful for the manual annotation process: if some members of the cluster have a function, it may be possible to assign a more general family function to the rest of the members (if a specific function cannot be applied).
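The distinction between the two paralog-cluster subsets can be sketched in a few lines of Python. The input format (a dictionary mapping a cluster ID to the list of its members' functions, with `None` for an unknown function) and the function name are assumptions made for this illustration only.

```python
def classify_paralog_clusters(clusters):
    """Count paralog clusters by how many members lack a known function.

    clusters: dict mapping a (hypothetical) cluster ID to a list of function
    strings, one per member ORF, where None marks an unknown function.
    """
    summary = {
        "Paralog clusters, total": 0,
        "Paralog clusters, some members hypothetical": 0,
        "Paralog clusters, all members hypothetical": 0,
    }
    for functions in clusters.values():
        unknown = sum(f is None for f in functions)
        summary["Paralog clusters, total"] += 1
        if functions and unknown == len(functions):
            summary["Paralog clusters, all members hypothetical"] += 1
        elif unknown > 0:
            # At least one, but not all, members are hypothetical: these are
            # the candidates for transferring a family-level function.
            summary["Paralog clusters, some members hypothetical"] += 1
    return summary


# Toy clusters with made-up IDs and functions.
clusters = {
    "P1": ["ABC transporter", None],
    "P2": [None, None],
    "P3": ["Histidine kinase", "Histidine kinase"],
}
cluster_summary = classify_paralog_clusters(clusters)
```

Clusters such as P3, in which every member already has a function, count toward the total only.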


 

3.2. Organism Statistics Page (Stats)

To access the organism Statistics page, the user can click on the Stats link on the Genomes page (see section 2 above). The organism Statistics page gives a brief overview of the current status of the genome data in the system. The contents of this summary page (displayed below for the genome of Bacillus cereus) are updated daily on IG's internal server only, and are passed along to the public servers when these are updated. From the top of the statistics page, the user may switch to the statistics of any other organism in the system at any point, using the Switch Organisms menu. (Please note that not all of the data presented in the table below are available in the Light edition.)

As with the cumulative general statistics (section 3.1.), the statistics provided for each organism are organized in distinct sections of data types, which are displayed in the third column (Data Category); the exact numbers corresponding to these data types are shown in the fourth column (Counts), and the percentage out of the total is presented in the fifth column (% of Total). The blue numbers are links to other pages with the corresponding data. The first two columns allow the retrieval of data based on combinations of the pre-existing data categories. In the first column (I: Intersect) the user selects the categories to be included, or intersected, and in the second column (S: Subtract) the categories to subtract from the first ones. For example, one may choose to select from the first column all the ORFs with assigned function, and then subtract from those the ORFs that have a match in Pfam and in COGs.
The data types, similar to the general statistics, are organized into eight distinct functional units:

  • DNA-related statistics (section 3.2.1.),
  • ORFs-related statistics (section 3.2.2.),
  • Function related Statistics (section 3.2.3.),
  • Pathway related statistics (section 3.2.4.),
  • Cluster related statistics (section 3.2.5.),
  • Domain related statistics (section 3.2.6.),
  • Perform functions on different types of statistics (section 3.2.7.),
  • Compare/reconcile assignment differences (section 3.2.8.).


 

3.2.1. DNA-related Statistics

The DNA statistics for an organism include data types that are also available in the cumulative statistics: (a) the total number of base pairs sequenced in the genome (DNA total sequenced, bases), as well as two subsets of those: the base pairs that are part of coding regions (DNA coding sequences, bases), and the G+C base pairs (DNA G+C content, bases); (b) the total number of available DNA contigs (DNA contigs).


 

3.2.1.1. DNA contigs

From the Data Category column (section 3.2.), the user may choose to see information related to the DNA contigs. To do so, one can click on the number (under the Counts column) that corresponds to this data category (in the table of section 3.2. above this would be the number 2, since only 2 contigs are available for this organism), which leads to the contig table:

Here, the first column (Contig ID) provides access to a graphical overview of the contig (see below: section 3.2.1.1.1.), the second (Contig Length) provides access to the entire DNA sequence of the contig and the third column (# ORFs) provides access to a list of the ORFs in the contig and their functions formatted in a table (see below: section 3.2.1.1.2.).


 

3.2.1.1.1. DNA contig viewer (Contig ID)

From the above contig table, the user can choose to see a graphical overview of the contig with the identified ORFs and RNAs on it. To do so, one can click on the Contig ID, which is presented in the first column of the table above. In this case, the organism Bacillus cereus is completely sequenced and the entire genome is organized into two contigs (one chromosomal and one plasmid). By clicking on the chromosome, the user accesses the contig map:

On this map, red arrows represent RNAs and blue arrows represent ORFs. By placing the cursor on any of the RNAs or ORFs, the user can see the gene ID and the predicted function. By clicking on any of the ORFs, the user goes to the ORF page (see below Section ???). At the bottom of the map, the user has several options for modifying the map:

  • the first scroll-down menu allows the user to switch the map between different contigs (Change contig to);
  • the third menu at the bottom of the figure above allows the user to select the DNA region that will be projected on the map (Region). The default window on the map is 500Kb, but this can be expanded to cover the entire genome, or narrowed to a smaller portion of it, by typing the appropriate coordinates here. For example, to see the entire map of the complete genome of B. cereus (a 5.4Mb genome), all that is required is to type 1 to 5500000 in the Region window and then click Redraw. Alternatively, the user can step through the genome in 500Kb windows (or windows of another size, which can be defined by the previous menu), by following the arrow on the middle-right of the table;
  • finally, the user may also choose to color the ORFs according to their functional category. This is done through the second scroll-down menu (Color). For example, one may choose to view all genes related to Information Processing (i.e. transcription, translation, replication etc.).

To do this, the user selects the functional category from the Color menu and then clicks the Redraw button. All selected ORFs (i.e. all ORFs related to Information Processing in this example) are colored blue, the RNAs remain red, and all the other ORFs become gray (see figure below). Note that only ORFs whose function is also part of a subsystem can be colored in this way on the contig map.
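Stepping through a contig in fixed-size Region windows, as described in the list above, amounts to the following arithmetic. This is only a sketch: it assumes 1-based, inclusive coordinates and non-overlapping windows, neither of which is stated in this tutorial, and the function name is hypothetical.

```python
def region_windows(contig_length, window=500_000):
    """Yield (start, end) coordinates that tile a contig window by window.

    Assumes 1-based inclusive coordinates and non-overlapping windows; the
    last window is shortened so that it ends at the contig end.
    """
    start = 1
    while start <= contig_length:
        end = min(start + window - 1, contig_length)
        yield start, end
        start = end + 1


# A 5.4Mb contig viewed in the default 500Kb windows.
windows = list(region_windows(5_400_000))
print(len(windows), windows[0], windows[-1])
```

With these assumptions a 5.4Mb contig is covered by eleven windows, the last of which spans only the final 400Kb.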


 

3.2.1.1.2. Table formatted list of ORFs (# ORFs)

From the contig table of section 3.2.1.1., the user may choose to see all the ORFs identified in a particular contig, in table format. To do so, the user clicks on the number of ORFs appearing in the third column (# ORFs). The result is a page where all the ORFs are presented in a table organized by:

  • the ORF IDs (these are also linked to the actual ORF pages) (see section ???);
  • the DNA coordinates (Begin and End of the Gene);
  • the transcription orientation (Strand);
  • the size in amino acids (Length); and
  • the Function of the gene.


 

3.2.2. ORFs-related Statistics

The ORF-related Statistics cover the same data types as the cumulative system statistics (section 3.1.2.), here counted for the selected organism:

  • the total number of predicted ORFs in the genome (ORFs total);
  • the total number of ORFs that have an assigned function (ORFs with assigned function), as well as the subset of those that have a function but no sequence similarity to any other ORF in the system (ORFs with function but no similarity);
  • the number of ORFs without any assigned function (ORFs without assigned function), as well as two subsets of those: the number of ORFs that have neither a function nor sequence similarity (the FastA cut-off used is a P-value better than 0.01) (ORFs without function or similarity), and the number of ORFs without function but with sequence similarity to other ORFs in the system (ORFs without function, with similarity);
  • the number of ORFs that are connected to asserted pathways (ORFs in asserted pathways);
  • the number of ORFs that are not connected to asserted pathways (ORFs not in asserted pathways), as well as the subset of these ORFs that have a function but are still not connected to asserted pathways (ORFs with assigned function but no pathway): this latter set of ORFs arises either because the pathway to which the ORFs are connected is not yet asserted in the reference organism, or simply because no such pathway exists in the ERGO system yet;
  • the number of ORFs that are connected to the general functional overview (ORFs in the functional overview) (see below section xxx, for the description of what the functional overview is);
  • the number of ORFs that are found in protein clusters (ORFS in protein clusters) (not available on the Light edition);
  • the number of ORFs in clusters of paralog genes (ORFs in paralog clusters);
  • the number of ORFs that have a match in the COGs database of NCBI (ORFs in COGs);
  • the number of ORFs that have a hit in the Pfam database of Washington University (ORFs in Pfam);
  • the number of ORFs identified in chromosomal clusters (ORFs in chromosomal clusters);
  • the number of ORFs that participate in fusion events (ORFs in possible fusion events), as well as two subsets of those: the number of ORFs which participate in fusion events as composite (ORFs in possible fusion events as composites), as well as those that participate as components (ORFs in possible fusion events as components).


 

3.2.3. Function-related Statistics

The statistics related to function include:

  • Functions assigned: this is the total number of different functions identified in this organism. More than one ORF may have exactly the same function. By following the link under the Counts column, the user can see the list of those functions;
    • Functions assigned hypothetical: this is a sub-category of the assigned functions, which includes all the hypothetical functions;
    • Functions assigned, connected to asserted pathways: this is another sub-category of the assigned functions, which includes those functions that are also connected to asserted pathways;
    • Functions assigned, not connected to asserted pathways: this is the list of the assigned functions that are not yet connected to any asserted pathway.
    A function falls into this set because it is hypothetical, because no pathway in the pathway database includes it yet, or because such a pathway exists but is not asserted in this organism.
Finally, two more statistics are reported here, related to "missing" functions:
  • Functions missing from asserted pathways: these are the functions that are expected to be found in the organism, because the pathway they belong to is believed to be present (i.e. has been asserted) in this particular organism;
  • Functions with no sequence: this is essentially the same as the previous case, the only difference being that the gene encoding the "missing" function has not yet been cloned in any organism, and therefore cannot be identified on the basis of sequence similarity. Functions of this type have so far been described only biochemically.
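The two "missing" statistics above are essentially set differences, which the following Python sketch makes explicit. The function name, the argument names and the data layout (each asserted pathway mapped to the set of functions it requires) are assumptions for illustration; the pathway and function names in the example are placeholders, not ERGO data.

```python
def missing_functions(asserted_pathways, assigned_functions, functions_with_sequence):
    """Return (missing, no_sequence) sets of function names.

    asserted_pathways: dict mapping a pathway name to the set of functions
        that the pathway requires (hypothetical layout).
    assigned_functions: functions already assigned to ORFs of the organism.
    functions_with_sequence: functions for which a sequence is known in at
        least one organism.
    """
    expected = set().union(*asserted_pathways.values())
    # Expected because the pathway is asserted, but not assigned to any ORF.
    missing = expected - set(assigned_functions)
    # The subset of missing functions that cannot be found by similarity.
    no_sequence = {f for f in missing if f not in functions_with_sequence}
    return missing, no_sequence


# Placeholder names only.
pathways = {"pathway_1": {"function_a", "function_b", "function_c"}}
assigned = {"function_a"}
with_sequence = {"function_a", "function_b"}
missing, no_sequence = missing_functions(pathways, assigned, with_sequence)
```

Here `function_b` is missing but could still be found by sequence similarity, while `function_c` is missing and has no known sequence anywhere.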


 

3.2.4. Pathway-related Statistics

There is only a single entry here, reporting the total number of pathways that are considered present (i.e. have been asserted) in the query organism (Pathways asserted total). By clicking on the number corresponding to this data type, the user can see the list of all the pathways asserted in the organism, each followed by the ORFs that are connected to it.


 

3.2.5. Cluster-related Statistics

ERGO presents statistics for two types of computed clusters here:

  • the total number of Protein clusters found in the query genome (Protein clusters, total);
  • the total number of Paralog protein clusters identified in the genome (Paralog clusters, total). As presented above for the cumulative general statistics (section 3.1.), there are also two subcategories of Paralog clusters here: those in which at least one (but not all) of the ORFs is hypothetical (Paralog clusters, some members hypothetical), and those in which all the ORFs have unknown function (Paralog clusters, all members hypothetical).


 

3.2.6. Domain-related Statistics

The last two types of available statistics concern the unique domains derived from the Pfam database (Unique Pfam domains) and the COGs database (Clusters, orthologous groups (COGs)).


 

3.2.6.1. Unique Pfam domains

By clicking on the number corresponding to the Pfam domains, the user can access a table that displays all the different Pfam domains identified in this genome, together with the number of ORFs corresponding to each domain (see table below).

By clicking on an individual Pfam domain, the user is transferred to the Pfam database at Washington University. By clicking on the number of ORFs, the user can access the individual ORFs in the ERGO system that are characterized by the corresponding domain.


 

3.2.6.2. Unique COG domains

By clicking on the number corresponding to the Clusters of Orthologous Groups, the user can access a table (see below) that organizes all the different COG functions into functionally related categories (COG category). Next to each of these categories, three types of data are presented:

  • the total number of ORFs associated with each COG functional category (ORFs);
  • the number of ORFs that either have a function (Function) or are without any predicted function (No function);
  • the number of ORFs that are in asserted pathways (Pathway) or are not yet in any asserted pathways (No pathway).

As with the Pfam domains, the user can click on any of the numbers here to see a detailed list of the ORFs in each category, as well as a breakdown of each general functional category into a more detailed list of function descriptions. This table is quite useful during the annotation process, since it lets the user estimate (according to the functions predicted by COGs) how many ORFs can be annotated and how many can be connected to pathways.
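The layout of the COG category table described above can be mimicked with a simple tally. The sketch below is not ERGO's code: the input format (one tuple per ORF) and the function name are assumptions, the example rows are made up, and an ORF is assumed to be counted once in a single category.

```python
from collections import defaultdict


def cog_category_table(orfs):
    """Build per-category counts with the column names of the COG table.

    orfs: iterable of (cog_category, has_function, in_asserted_pathway)
    tuples, one per ORF (hypothetical layout).
    """
    table = defaultdict(lambda: {
        "ORFs": 0, "Function": 0, "No function": 0,
        "Pathway": 0, "No pathway": 0,
    })
    for category, has_function, in_pathway in orfs:
        row = table[category]
        row["ORFs"] += 1
        row["Function" if has_function else "No function"] += 1
        row["Pathway" if in_pathway else "No pathway"] += 1
    return dict(table)


# Made-up rows for illustration.
rows = [
    ("Translation", True, True),
    ("Translation", True, False),
    ("Translation", False, False),
    ("Transcription", True, True),
]
cog_table = cog_category_table(rows)
```

In each row, Function plus No function equals ORFs, and likewise Pathway plus No pathway, which is what allows the table to serve as a quick estimate of annotation progress.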


 

3.2.7. Combination-related Statistics

At the bottom of the statistics table, the user can retrieve a combination of statistics (Get selected ORFs), defined by the check-boxes in the first two columns (presented on Part 1a above).
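The Intersect/Subtract selection behind Get selected ORFs can be thought of as plain set operations, as in the sketch below. This is an illustration under assumptions, not ERGO's implementation: the function name and data layout are hypothetical, and it assumes that every category ticked under Subtract is removed from the result.

```python
def get_selected_orfs(categories, intersect, subtract=()):
    """Combine data categories the way the I and S check-boxes suggest.

    categories: dict mapping a data-category label to a set of ORF IDs.
    intersect: labels ticked in the I (Intersect) column.
    subtract: labels ticked in the S (Subtract) column.
    """
    if not intersect:
        return set()
    selected = set.intersection(*(set(categories[name]) for name in intersect))
    for name in subtract:
        selected -= set(categories[name])
    return selected


# Made-up ORF IDs; the labels follow the statistics page.
categories = {
    "ORFs with assigned function": {"ORF1", "ORF2", "ORF3", "ORF4"},
    "ORFs in Pfam": {"ORF1", "ORF5"},
    "ORFs in COGs": {"ORF2"},
}
selected = get_selected_orfs(
    categories,
    intersect=["ORFs with assigned function"],
    subtract=["ORFs in Pfam", "ORFs in COGs"],
)
```

This reproduces the example given for the Organism Statistics Page: ORFs with an assigned function, minus those with a match in Pfam or in COGs.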


 

3.2.8. Compare/reconcile assignment differences

In addition to all of the options above, the full version of ERGO™ allows the user to compare the annotations of different users, or of different databases, with those made by Integrated Genomics. For this purpose, an additional link appears at the bottom of the Statistics page of each genome.
The Compare/Reconcile assignments difference link of the organism statistics (section 3.2.) leads to a page that looks like the figure below (in this case, for the genome of Bacillus anthracis).

The first column lists the names of the different users (User). These can be either different databases or different curators. The boxes in the second column allow the user to select each of the different users, while the other two columns display the number of differences from the logged-in user (for either Functions or Pathways). Once one or more of the boxes are selected, the user can click on the Reconcile functions button (or Reconcile pathways, depending on which differences are to be compared). If all users are selected, the following table is presented after clicking the Reconcile functions button:

This table presents the differences in the assigned functions between the login user (i.e. visitor) and the selected set of pre-existing users (in this case: COGs, Pfam, SwissProt, and TIGR). The table is organized in five columns:

  • the first column presents the ID of the ORF (ORF ID). If more than one of the selected users has information for the same ORF, this ORF is presented in multiple rows (a change in the gray-scale intensity of the background color marks the change from one ORF to the next);
  • the second column presents the name of the user (User);
  • the third column presents the function predicted by the user of the second column (Function assignments);
  • the fourth column has a yellow arrow for each entry. This is used to transfer a specific annotation from a particular user to the model generated by the login user;
  • the fifth column presents the function of the login user (in this case: visitor) (Visitor's function assignment).

ERGO is an interactive system that supports the creation of user-specific models (that is, user-specific annotations, pathways and reconstructions). Every login user can therefore either accept the annotations provided by the system or modify them as desired. These changes are saved in the system and become this user's default assignments. Any other login user can also view annotations made by other users, through the reconcile differences page or through the individual ORF page (see below in section ???). Wherever the login user does not change the default assignments made by the ERGO curation team, these assignments remain the defaults of the login user as well.

ERGO's annotations use a very specific vocabulary, and the system does not recognize synonyms as denoting essentially the same function. For example, for the first ORF presented in the table above (i.e. RBAT0001), all three databases, PIRnr, Pfam and TIGR, agree with ERGO's function prediction, even though neither the vocabulary nor the precise annotation of any of these users is identical to ERGO's. Therefore, this table displays all the ORFs for which annotations are available under any of the selected users, and provides a comparative view against those of the login user.

However, by displaying all the information available across the different users (whether they essentially agree or not), this table allows a relatively fast, comparative inspection of all the assignments made by any of the selected users, in order to identify possible cases of disagreement that require further investigation.
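The behavior described above, where annotations that agree in substance but differ in wording still show up in the table, is what one would get from an exact comparison of function strings. The sketch below illustrates that idea only; the function name, the data layout, the ORF IDs and the function strings are all hypothetical, and ERGO's actual comparison rule is not documented in this tutorial.

```python
def function_differences(login_assignments, other_users):
    """List rows (orf_id, user, their_function, login_function) that differ.

    login_assignments: dict mapping ORF ID to the login user's function.
    other_users: dict mapping a user name to that user's own dict of
        ORF ID -> function (hypothetical layout).
    """
    rows = []
    for user in sorted(other_users):
        for orf_id, function in sorted(other_users[user].items()):
            mine = login_assignments.get(orf_id)
            # Exact string comparison: synonyms and case changes are
            # reported as differences and left for a curator to judge.
            if function != mine:
                rows.append((orf_id, user, function, mine))
    # Stable sort, so rows for the same ORF end up next to each other.
    rows.sort(key=lambda row: row[0])
    return rows


# Made-up IDs and annotations.
login = {
    "ORF0001": "Chromosomal replication initiator protein DnaA",
    "ORF0002": "DNA polymerase III, beta subunit",
}
others = {
    "Pfam": {"ORF0001": "Bac_DnaA family"},
    "TIGR": {
        "ORF0001": "chromosomal replication initiator protein DnaA",
        "ORF0002": "DNA polymerase III, beta subunit",
    },
}
differences = function_differences(login, others)
```

In this toy example both annotations for ORF0001 are listed, even though they describe the same protein, while the identical ORF0002 annotation is not.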


 
ERGO family:
 
 ERGO™ bioinformatics suite is property of Integrated Genomics Inc.

IG is providing access to the ERGO™ through fee-based subscription
 
  Publicly available version of ERGO™

The server and the associated data are free of any charge for academic and non-commercial use only