
Integrated Genomics Inc. welcomes you to the ERGO™ bioinformatics suite. The front
page of ERGO (see below) is designed to provide the user with direct access to the system, as
well as with general information regarding the company and the system. For ease of description,
the front page is divided below into 4 distinct sections:
The first section provides information on the current status of the system, as well
as the name of the logged-in User (in the case above: User guest). The genomes number,
presented at the top left, shows at any given time the number of genomes provided by
this version of the system (in the case above: 402 genomes). It also serves as a link to the list
of available genomes (see below section 2). The cumulative general statistics for all the
organisms in the system are available by clicking the Statistics link, located at the
top middle of the page (see below section 3.1). By clicking the name of the User, one can also
change the User-Name at any time.
The second section is the green menu bar, which provides access to different data sets,
tools and pages in the ERGO system. Not all of these data and tools are available in the Light
version of the system. The Data menu bar provides access to various data sets in ERGO (for a
detailed description see below in section xxx). These include Functions statistics, Clusters
statistics, as well as data from Microarrays or Transposomics (not available in the Light version).
The Tools menu bar provides access to various tools available in the system (for a detailed
description see below in section xxx). These are mostly related to whole genome comparisons.
This menu is not available in the Light version. The Query menu bar provides access to various
query pages in ERGO, including keyword searches and sequence/pattern upload queries (for a
detailed description see below in section xxx). The Configure menu bar allows for user-specific
configuration of the system (see below in section 1.1). Finally, the Help menu bar gives the user
access to this tutorial, as well as contact information and general acknowledgements.

(not available on the Light edition)
As mentioned above, the ERGO™ system supports the creation of a user-specific
environment throughout several of the provided tools/pages. We suggest that you start by
configuring your environment from the front page (user-defined configuration is disabled in the
Light version).
From the green menu bar at the top of the page, the user can click on the arrow next to
the Configure menu bar (see the menu inside the selected red box) in order to:
(a) change the User Name;
(b) select a preferred organism; or
(c) select a preferred group of organisms, as shown below:
Choosing to select a preferred organism opens a new window with the list of the genomes
that are available on the subscribed ERGO system. The user can select any one of those genomes
(only one at a time) by clicking on the radio button in front of the preferred organism. The
selected organism is then saved in the preferences of this particular User-Name. When the User
logs into the system again using this name, the selected organism will automatically appear.
The colored buttons in front of the organism names denote the different domains:
- Bacterial,
- Archaea,
- Viral,
- Eukarya.
The list of organisms is presented alphabetically by default, but it can also be presented
by Domain and then name (by selecting this option from the menu next to the label
Sort the organisms below). Once the preferred organism is selected, the user may click on the
Submit Organism Preference button. The Delete Preference button restores the saved preferences
to those before the selection.
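The two sort orders described above can be illustrated with a minimal Python sketch. The organism list is only an illustrative sample, and the assumption that domains themselves are ordered alphabetically is ours; the tutorial does not state how ERGO orders the domains.

```python
# Illustrative (name, domain) pairs; the real list comes from the ERGO genomes window.
organisms = [
    ("Saccharomyces cerevisiae", "Eukarya"),
    ("Bacillus cereus", "Bacteria"),
    ("Methanococcus jannaschii", "Archaea"),
    ("Bacillus anthracis", "Bacteria"),
]

# Default presentation: alphabetical by organism name.
by_name = sorted(organisms, key=lambda org: org[0])

# Alternative presentation: by domain first, then by name within each domain.
# Assumption: domains are compared alphabetically.
by_domain_then_name = sorted(organisms, key=lambda org: (org[1], org[0]))

for name, domain in by_domain_then_name:
    print(f"{domain:10s} {name}")
```

Sorting on a tuple key keeps organisms of the same domain together while preserving the alphabetical order inside each group.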
As soon as the user selects an organism and clicks Submit Organism Preference, a
new menu bar (brown) appears below the green one, with information and tools that
apply specifically to the selected organism. For example, if the user selects Bacillus cereus, the
new menu bar will be labeled Bacillus cereus, and a list of tools specific to this organism
will be presented next to the name.
From the Configure menu, the user can also select a preferred set of organisms by
clicking Set Organism Group. This permits the selection of multiple organisms, on which
the user can later perform a number of queries. A number of pre-selected sets of organisms
already exist in the system (for example, Archaea, Bacteria, etc., see below). However, the
user can create new sets by selecting the buttons in the first column and saving the
selected organisms using the menu next to the Save Loaded Group As button.

The Light edition, as its name denotes, is a streamlined version of the enterprise-scale
ERGO™ Bioinformatics Suite, which is currently available only through subscription.
ERGO-Light is essentially a read-only version, lacking the annotation tools and the ability to
configure a user environment. However, Light contains a full set of individual genome analysis
tools and all the primary data types found in ERGO, including full contig data, protein sequences
and annotations, and chromosomal clustering information. Users can take advantage of Integrated
Genomics' full metabolic and non-metabolic pathway database (containing over 5000 pathways)
and a sample of the highly-curated bacterial metabolic reconstruction
In most ERGO pages, there are help buttons, presented as blue boxes with a question
mark. We recommend that the user read these for general information on what the page contains.
In several cases, however, the information provided refers to user-defined tools that are absent
from the Light edition.
We recommend that you start by clicking the "genomes" number link at the top of the
front page. This number denotes the number of organisms currently available through this
version of the ERGO system. The link will take you to a page that lists the names of
these organisms, as well as further links to pages with information about those genomes (see
below).

To see the list of available genomes, the User can click at any point on the "genomes"
number link at the top of the front page. In the ERGO-Light version, this brings up the list of
the 7 genomes available at this point. The user may click on the blue box with the question
mark for further help related to the information provided on this page (including information on
tools not available in the Light version).
The columns and menus of this page are:
- D: the Domain, shown as a colored button (in the case above, the one for Bacteria);
- ID: the two- or three-letter code of the organism;
- Stats: a link to a page with the genome's general statistics (see section 3.2.);
- Model-Graph: a link to a graphical representation of the cellular overview (only parts of this
are available through ERGO-Light; a full subscription is necessary to access the complete overview);
- Model-Text: a link to the textual overview of the genome, which is automatically generated
from the graphical overview;
- allO: the total number of ORFs identified at the moment for each genome, with a link to a
table that lists all of them;
- AP: the total number of pathways asserted for the organism, with a link to this list of pathways;
- Switch-Groups: this menu allows a user to switch from this group of genomes to other
sub-groups, which either exist by default in the system or can be created by the user (see
section 1.1.). For example, one may choose to see only the genomes sequenced by
Integrated Genomics Inc. (select IG Organisms), or only the complete and published genomes
(select Complete and Published). In the complete system (available on subscription), the user can
in addition define new sets and view the list through this page.
The data/links provided on this genomes page can also be configured according to the
needs of the User. This ability is again absent from the Light version.

(not available on the Light edition)
The full enterprise version of the ERGO system provides the user with the ability to
configure many of the pages throughout ERGO.
The data/links provided through the Genomes page can also be configured by clicking on the
corresponding button at the top right of the page: Configure Page (selected in the red box).
The Genomes Configuration Panel (shown at the right) allows the User to add further types of
data to the data/links presented on the above page. In fact, all the data types that are presented
further down on the Organisms Statistics page (Section 3.2.) can also be displayed here in an
organism-comparative mode. Once the desired data types are selected, the Submit Choices
button should be clicked; the above Organisms Overviews Page will automatically be updated, and
the Genomes Configuration Panel will disappear.

The ERGO system provides at any point general data statistics either for the overall system
or individually for each of the integrated genomes.

To access the Cumulative general Statistics of the system, the User can click on the
Statistics link at the top of the ERGO page (see above Section 1). This will lead to a page that
looks like the one below:
The data types for which the statistics are presented are shown in the first column
(Data Category); the actual numbers for these data types are shown in the second column
(Counts), and the percentage out of the total is presented in the third column (% of Total).
The data types are organized into three distinct functional units:
(i) DNA-related statistics,
(ii) ORFs-related statistics,
(iii) Pathway and Cluster related statistics.

The Cumulative Statistics for DNA include:
(a) the total number of base pairs sequenced in all genomes (DNA total sequenced, bases), as well
as the subset of those base pairs that are part of coding regions (DNA coding sequences, bases);
(b) the DNA base pairs consisting only of AGCT (DNA, bases (AGCT only)), as well as statistics
for the A+T (DNA A+T content, bases) and the G+C (DNA G+C content, bases) base pairs.
Finally, the total number of DNA contigs is also reported.
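To make these DNA categories concrete, the following Python sketch tallies them from a list of contig sequences. It is an illustration only: the function names are hypothetical, and the choice of the total sequenced bases as the denominator for the % of Total column is an assumption, since the tutorial does not say how ERGO computes it.

```python
from collections import Counter


def dna_statistics(contigs):
    """Tally base counts over a list of contig sequences (plain strings)."""
    counts = Counter()
    for seq in contigs:
        counts.update(seq.upper())  # count every character, case-insensitively
    total = sum(counts.values())
    at = counts["A"] + counts["T"]
    gc = counts["G"] + counts["C"]
    return {
        "DNA total sequenced, bases": total,
        "DNA, bases (AGCT only)": at + gc,  # excludes ambiguity codes such as N
        "DNA A+T content, bases": at,
        "DNA G+C content, bases": gc,
        "DNA contigs": len(contigs),
    }


def percent_of_total(count, total):
    """Assumed definition of the '% of Total' column."""
    return round(100.0 * count / total, 2) if total else 0.0


stats = dna_statistics(["ATGCNNAT", "ggcc"])
gc_percent = percent_of_total(stats["DNA G+C content, bases"],
                              stats["DNA total sequenced, bases"])
print(stats, gc_percent)
```

Coding-sequence bases are left out of the sketch because they depend on ORF coordinates rather than on the sequence alone.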

The Cumulative Statistics for the ORFs include:
- the total number of predicted ORFs in all genomes (ORFs total);
- the total number of ORFs that have an assigned function (ORFs with assigned function), as well
as the subset of those that have a function but no sequence similarity to any other ORF in the
system (ORFs with function but no similarity);
- the number of ORFs without any assigned function (ORFs without assigned function), as well
as two subsets of those: the number of ORFs that have neither a function nor sequence similarity
(the FastA cut-off score used is a P-value better than 0.01) (ORFs without function or similarity),
and the number of ORFs without function but with sequence similarity to other ORFs in the
system (ORFs without function, with similarity);
- the number of ORFs that are connected to asserted pathways (ORFs in asserted pathways);
- the number of ORFs that are not connected to asserted pathways (ORFs not in asserted pathways),
as well as the subset of these ORFs that do have a function but are still not connected to
asserted pathways (ORFs with assigned function but no pathway). This latter set of ORFs arises
either because the pathway to which the ORFs are connected is not yet asserted for the
reference organism, or simply because no such pathway exists in the ERGO system yet;
- the number of ORFs that are connected to the general functional overview
(ORFs in the functional overview) (see below section xxx for the description of what the
functional overview is);
- the number of ORFs that are found in protein clusters (ORFS in protein clusters) (not
available on the Light edition);
- the number of ORFs in clusters of paralog genes (ORFs in paralog clusters);
- the number of ORFs that have a match in the COGs database of NCBI (ORFs in COGs);
- the number of ORFs that have a hit in the Pfam database of Washington University (ORFs in Pfam);
- the number of ORFs identified in chromosomal clusters (ORFs in chromosomal clusters);
- the number of ORFs that participate in fusion events (ORFs in possible fusion events), as well
as two subsets of those: the ORFs that participate in fusion events as composites
(ORFs in possible fusion events as composites), and those that participate as components
(ORFs in possible fusion events as components).

The Cumulative Statistics for the Pathway/Clusters include:
- the total number of Pathways that have been asserted in all genomes (Pathways asserted total);
- the total number of Paralog clusters (Paralog clusters, total), as well as two subsets of
those: the paralog clusters in which at least one (but not all) of the ORFs has unknown function
(Paralog clusters, some members hypothetical), and the paralog clusters in which all
the ORFs have unknown function (Paralog clusters, all members hypothetical).
The paralog clusters with at least one (but not all) hypothetical members are
useful for the manual annotation process: if some members of the cluster
have a function, then it is possible to assign a more general family function to the rest of the
members (if a specific function cannot be applied).
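The distinction between the two paralog cluster subsets can be sketched in a few lines of Python. This is a minimal illustration, assuming each cluster is given as a list of its members' functions with None standing for an unknown (hypothetical) function; the cluster names and functions are made up.

```python
from collections import Counter


def classify_paralog_cluster(functions):
    """Classify one cluster from its members' functions (None = unknown function)."""
    unknown = sum(1 for f in functions if f is None)
    if unknown == 0:
        return "no members hypothetical"
    if unknown == len(functions):
        return "all members hypothetical"
    return "some members hypothetical"


# Hypothetical clusters, for illustration only.
clusters = {
    "cluster_1": ["kinase", None],      # candidate for a family-level annotation
    "cluster_2": [None, None],
    "cluster_3": ["kinase", "kinase"],
}

tally = Counter(classify_paralog_cluster(members) for members in clusters.values())
print(tally)
```

Clusters classified as "some members hypothetical" are the ones where the annotated members may suggest a general family function for the remaining ORFs.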

To access the Organism "Statistics" page, the User can click on the Stats link on the genomes
page (see above Section 2). The Organism "Statistics" page gives a brief
overview of the current status of the genome data in the system. The contents of this summary
page (as displayed below for the genome of Bacillus cereus) are updated on a daily basis on IG's
internal server only, and are passed along to the public servers when those are updated. From
the top of the statistics page, the user may switch to the statistics of any other organism in the
system at any point, using the Switch Organisms menu bar. (Please note that not all of the data
presented in the table below are available in the Light edition).
As with the Cumulative general statistics (of section 3.1.), the statistics provided for the
Organisms are organized in distinct sections of data types, which are displayed in the third column
(Data Category); the exact numbers corresponding to these data types are shown in the fourth
column (Counts), and the percentage out of the total is presented in the fifth column
(% of Total). The blue colored numbers provide links to other pages with these data. The first two
columns allow the retrieval of data based on combinations of the pre-existing data categories. In
the first column (I: Intersect) the user selects the categories to be included, or intersected, and
in the second column (S: Subtract) the categories to subtract from the first ones. For example,
one may choose to select from the first column all the ORFs with assigned function, and then
subtract from those the ORFs that have a match in Pfam and in COGs.
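The Intersect and Subtract columns behave like set operations on lists of ORFs. The Python sketch below shows one plausible reading, under the assumption that every checked Subtract category is removed in turn (so an ORF present in either Pfam or COGs is dropped). The function name and the ORF IDs are hypothetical and are not part of ERGO.

```python
def get_selected_orfs(categories, intersect, subtract):
    """Combine ORF sets: intersect the chosen categories, then subtract the others.

    categories: dict mapping a Data Category label to a set of ORF IDs.
    intersect:  labels checked in the I column.
    subtract:   labels checked in the S column.
    """
    if not intersect:
        return set()
    selected = set.intersection(*(set(categories[name]) for name in intersect))
    for name in subtract:
        selected -= set(categories[name])
    return selected


# Made-up ORF IDs, for illustration only.
categories = {
    "ORFs with assigned function": {"ORF0001", "ORF0002", "ORF0003", "ORF0004"},
    "ORFs in Pfam": {"ORF0002", "ORF0005"},
    "ORFs in COGs": {"ORF0003"},
}

result = get_selected_orfs(
    categories,
    intersect=["ORFs with assigned function"],
    subtract=["ORFs in Pfam", "ORFs in COGs"],
)
print(sorted(result))
```

The result here contains the ORFs that have a function but no match in either database, which is the kind of list a curator might want to inspect by hand.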
The data types, as in the general statistics, are organized into eight distinct functional
units:
- DNA-related statistics (section 3.2.1.),
- ORFs-related statistics (section 3.2.2.),
- Function related Statistics (section 3.2.3.),
- Pathway related statistics (section 3.2.4.),
- Cluster related statistics (section 3.2.5.),
- Domain related statistics (section 3.2.6.),
- Perform functions on different types of statistics (section 3.2.7.),
- Compare/reconcile assignment differences (section 3.2.8.).

The DNA-related statistics for an organism include the same data types available for the
cumulative statistics, which are:
(a) the total number of base pairs sequenced in the genome (DNA total sequenced, bases), as well
as two subsets of those: the base pairs that are part of coding regions
(DNA coding sequences, bases), and the G+C base pairs (DNA G+C content, bases);
(b) the total number of available DNA contigs (DNA contigs).

From the Data category (Section 3.2.), the user may choose to see information related to
the DNA contigs. To do so, one can click on the number (under the Counts column) that
corresponds to this Data category (in the case of the Section 3.2. table above it would be the
number 2, since there are only 2 contigs available for this organism), and get to the contig table:
Here, the first column (Contig ID) provides access to a graphical overview of the contig
(see below: section 3.2.1.1.1.), the second (Contig Length) provides access to the entire DNA
sequence of the contig, and the third column (# ORFs) provides access to a list of the ORFs in
the contig and their functions, formatted as a table (see below: section 3.2.1.1.2.).

From the above contig table, the user can choose to see a graphical overview of the contig
with the identified ORFs and RNAs on it. To do so, one can click on the Contig ID, which is
presented in the first column of the table above. In this case, the organism Bacillus cereus is
completely sequenced and the entire genome is organized into two contigs (one chromosomal
and one plasmid). By clicking on the chromosome, the user will access the contig map:
On this map, red arrows represent RNAs and blue arrows represent ORFs. By placing
the cursor on any of the RNAs or ORFs, the user can see the geneID and the predicted
function. By clicking on any of these ORFs, the user will go to the ORF page (see below Section
???). At the bottom of the map the user has some options for modifying the map:
- The first scroll-down menu (Change contig to) allows the user to switch the map between
different contigs;
- The third menu at the bottom of the figure above (Region) allows the user to select the DNA
region that will be projected on the map. The default window on the map is 500Kb, but
this can be expanded to cover the entire genome, or narrowed to a smaller portion of it, by typing
the appropriate coordinates here. For example, if a user would like to see the entire map of the
complete genome of B. cereus (which is a 5.4Mb genome), all that is required is to type 1 to
5500000 in the Region windows and then click Redraw. Alternatively, the user can proceed
through the genome in 500Kb windows (or another size, which can be defined in the previous
menu), by following the arrow on the middle-right of the table;
- Finally, the user may also choose to color the ORFs according to their functional category.
This is made possible through the second scroll-down menu (Color). For example, one may choose
to view all genes related to Information Processing (i.e. transcription, translation, replication etc.).
From the Color menu, the user selects this functional category and then clicks the
Redraw button. All selected ORFs (i.e. here all ORFs related to Information Processing) will be
colored blue, while the RNAs remain red and all the other ORFs become gray (see figure
below). Evidently, only the ORFs that have a function which is also part of a subsystem can be
colored in this way on the contig map.
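Stepping through a contig in fixed-size windows, as the arrow on the map does, amounts to generating consecutive coordinate ranges. The following Python sketch shows the arithmetic only. The function name is hypothetical, the 1-based inclusive coordinates are an assumption, and 5,400,000 is used as an approximate length for the 5.4Mb B. cereus genome.

```python
def region_windows(contig_length, window=500_000):
    """Yield consecutive (begin, end) regions, 1-based and inclusive."""
    begin = 1
    while begin <= contig_length:
        end = min(begin + window - 1, contig_length)  # last window may be shorter
        yield begin, end
        begin = end + 1


# Approximate chromosome length, for illustration only.
windows = list(region_windows(5_400_000))
print(len(windows), windows[0], windows[-1])
```

With the default 500Kb window, a contig of this length takes eleven steps to traverse, the last of which covers only the remaining 400Kb.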

From the contig table of section 3.2.1.1., the user may choose to see all the ORFs that are
identified in a particular contig, in table format. To do so, the user clicks on the number of
ORFs appearing in the third column (# ORFs). The result is a page where all the ORFs
are presented in a table organized with:
- the ORF IDs (these are also linked to the actual ORF pages) (see section ???);
- the DNA coordinates (Begin and End of the Gene);
- the transcription orientation (Strand);
- the size in amino acids (Length); and
- the Function of the gene.

The ORF-related Statistics include the same data types available for the cumulative
system statistics (of section 3.1.2.):
- the total number of predicted ORFs in the genome (ORFs total);
- the total number of ORFs that have an assigned function (ORFs with assigned function), as well
as the subset of those that have a function but no sequence similarity to any other ORF in the
system (ORFs with function but no similarity);
- the number of ORFs without any assigned function (ORFs without assigned function), as well
as two subsets of those: the number of ORFs that have neither a function nor sequence similarity
(the FastA cut-off score used is a P-value better than 0.01) (ORFs without function or similarity),
and the number of ORFs without function but with sequence similarity to other ORFs in the
system (ORFs without function, with similarity);
- the number of ORFs that are connected to asserted pathways (ORFs in asserted pathways);
- the number of ORFs that are not connected to asserted pathways (ORFs not in asserted pathways),
as well as the subset of these ORFs that do have a function but are still not connected to
asserted pathways (ORFs with assigned function but no pathway). This latter set of ORFs arises
either because the pathway to which the ORFs are connected is not yet asserted for the
reference organism, or simply because no such pathway exists in the ERGO system yet;
- the number of ORFs that are connected to the general functional overview
(ORFs in the functional overview) (see below section xxx for the description of what the
functional overview is);
- the number of ORFs that are found in protein clusters (ORFS in protein clusters) (not
available on the Light edition);
- the number of ORFs in clusters of paralog genes (ORFs in paralog clusters);
- the number of ORFs that have a match in the COGs database of NCBI (ORFs in COGs);
- the number of ORFs that have a hit in the Pfam database of Washington University (ORFs in Pfam);
- the number of ORFs identified in chromosomal clusters (ORFs in chromosomal clusters);
- the number of ORFs that participate in fusion events (ORFs in possible fusion events), as well
as two subsets of those: the ORFs that participate in fusion events as composites
(ORFs in possible fusion events as composites), and those that participate as components
(ORFs in possible fusion events as components).

The statistics related to function include:
- Functions assigned: the total number of different functions identified in this organism.
More than one ORF may have exactly the same function. By following the link under the Counts
column, a user can see the list of those functions;
- Functions assigned hypothetical: a sub-category of the assigned functions, which includes
all the hypothetical functions;
- Functions assigned, connected to asserted pathways: another sub-category of the assigned
functions, which includes those functions that are also connected to asserted pathways;
- Functions assigned, not connected to asserted pathways: the list of the assigned functions
that are not connected to any asserted pathways yet. A function in this set may be
hypothetical, or there may not yet be any pathway in the pathway database that includes it, or
there is such a pathway but it is not asserted in this organism.
Finally, there are two more reported statistics here related to "missing" functions:
- Functions missing from asserted pathways: the functions that are expected to be found in
the organism, based on the fact that the pathway they belong to is believed to be present in
(i.e. has been asserted for) this particular organism;
- Functions with no sequence: essentially the same as the previous case, the only difference
being that the gene encoding the "missing" function has not been cloned yet in any organism, and
therefore cannot be identified based on sequence similarity. These types of functions have so
far been described only biochemically.
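The two "missing" categories can be illustrated with a short Python sketch. It treats Functions with no sequence as the subset of missing functions for which no sequence is known in any organism, which is our reading of the description above rather than a documented ERGO rule. All pathway and function names are placeholders.

```python
def missing_functions(pathway_functions, asserted_pathways,
                      assigned_functions, functions_with_sequence):
    """Return (missing from asserted pathways, missing and without known sequence)."""
    expected = set()
    for pathway in asserted_pathways:
        expected |= set(pathway_functions[pathway])
    # Expected because the pathway is asserted, but not assigned to any ORF.
    missing = expected - set(assigned_functions)
    # Of those, the ones that cannot be found by sequence similarity at all.
    no_sequence = {f for f in missing if f not in functions_with_sequence}
    return missing, no_sequence


# Placeholder pathways and functions, for illustration only.
pathway_functions = {"P1": {"F1", "F2", "F3"}, "P2": {"F4"}}
missing, no_sequence = missing_functions(
    pathway_functions,
    asserted_pathways=["P1"],
    assigned_functions={"F1"},
    functions_with_sequence={"F1", "F2", "F4"},
)
print(sorted(missing), sorted(no_sequence))
```

In this example F2 is a target for a similarity search, whereas F3 is known only biochemically and cannot be located that way.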

There is only a single entry here, reporting the total number of pathways that are
considered present (i.e. have been asserted) in the query organism (Pathways asserted total).
By clicking on the number corresponding to this data type, the user can see the list of all the
pathways asserted for the organism, with the ORFs that are connected to each of those
pathways listed next to them.

ERGO presents statistics for two types of computed clusters here:
- the total number of Protein clusters found in the query genome (Protein clusters, total);
- the total number of Paralog protein clusters identified in the genome (Paralog clusters, total).
As presented above in the cumulative general statistics (section 3.1.), there are also two
subcategories for the Paralog clusters here: those that have at least one (but not all) hypothetical
ORF (Paralog clusters, some members hypothetical), and those for which all the ORFs have
unknown function (Paralog clusters, all members hypothetical).

The last two types of available statistics have to do with unique domains generated
from the Pfam database (Unique Pfam domains) and the COGs database
(Clusters, orthologous groups (COGs)).

By clicking on the number corresponding to the Pfam domains, the user can access a
table that displays all the different Pfam domains identified in this genome, together with the
number of ORFs corresponding to each domain (see table below).
By clicking on an individual Pfam domain, the user is transferred to the Pfam database at
Washington University. By clicking on the number of ORFs, the user can access the
individual ORFs in the ERGO system that are characterized by the corresponding domain.
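A domain table of this kind is an inversion of the per-ORF domain hits. The Python sketch below shows the idea, assuming the hits are available as a mapping from ORF ID to a list of Pfam accessions. The ORF IDs and the pairing of ORFs with accessions are made up for illustration.

```python
from collections import defaultdict


def orfs_per_domain(orf_domains):
    """Invert {ORF ID: [domain, ...]} into {domain: sorted list of ORF IDs}."""
    table = defaultdict(set)
    for orf_id, domains in orf_domains.items():
        for domain in domains:
            table[domain].add(orf_id)  # a set avoids double-counting repeated hits
    return {domain: sorted(orfs) for domain, orfs in sorted(table.items())}


# Made-up hits, for illustration only.
orf_domains = {
    "ORF0001": ["PF00005"],
    "ORF0002": ["PF00005", "PF00072"],
    "ORF0003": ["PF00072", "PF00072"],  # the same domain found twice in one ORF
}

domain_table = orfs_per_domain(orf_domains)
for domain, orfs in domain_table.items():
    print(domain, len(orfs), orfs)
```

The length of each list corresponds to the number shown next to a domain, and the list itself to the ORFs reached by clicking that number.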

By clicking on the number corresponding to the Clusters of Orthologous Groups, the
user can access a table (see below) in which all the different COG functions are organized into
functionally related categories (COG category). Next to each of these categories three types of
data are presented:
- the total number of ORFs associated with each COG functional category (ORFs);
- the number of ORFs that either have a function (Function) or are without any predicted
function (No function);
- the number of ORFs that are in asserted pathways (Pathway) or are not yet in any asserted
pathways (No pathway).
As with the Pfam domains, the user can click on any of the numbers here to see a
detailed list of the ORFs in each category, and in addition a breakdown of each of the general
functional categories into a more detailed list of function descriptions. This table is quite useful
during the annotation process, since a user can form a rough estimate (according to the functions
predicted by COGs) of the ORFs that can be annotated, as well as of the ORFs that can be
connected to pathways, always according to COGs.
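The five counts shown per COG category can be reproduced with a simple tally. The Python sketch below assumes each ORF is described by a tuple of ID, COG category, assigned function (None when there is none) and a flag for membership in an asserted pathway. The record layout, category names and ORF IDs are hypothetical.

```python
from collections import defaultdict


def summarize_cog_categories(orfs):
    """orfs: iterable of (orf_id, cog_category, function_or_None, in_asserted_pathway)."""
    table = defaultdict(lambda: {"ORFs": 0, "Function": 0, "No function": 0,
                                 "Pathway": 0, "No pathway": 0})
    for _orf_id, category, function, in_pathway in orfs:
        row = table[category]
        row["ORFs"] += 1
        row["Function" if function else "No function"] += 1
        row["Pathway" if in_pathway else "No pathway"] += 1
    return dict(table)


# Made-up records, for illustration only.
orfs = [
    ("ORF0001", "Translation", "ribosomal protein", True),
    ("ORF0002", "Translation", None, False),
    ("ORF0003", "Energy production", "ATP synthase subunit", False),
]

summary = summarize_cog_categories(orfs)
for category, row in summary.items():
    print(category, row)
```

Within each row, Function plus No function and Pathway plus No pathway both add up to ORFs, which gives a quick consistency check on the table.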

At the bottom of the statistics table, a user can retrieve the ORFs for a combination of
statistics (Get selected ORFs), chosen with the check-boxes available in the first two columns
(presented in Part 1a above).

In addition to all of the options above, the full version of ERGO™ offers the
user the ability to compare the annotations of different users, or different databases, with those
made by Integrated Genomics. To accomplish that, an additional link appears at the bottom of
the Statistics page of each genome.
The Compare/Reconcile assignments difference link of the organism statistics
(Section 3.2.) leads to a page that looks like the figure below (in this case for
the genome of Bacillus anthracis).
In the first column, the names of the different users appear (User). These can be either
different databases or different curators. The boxes of the second column allow the user to select
each of the different users, while the other two columns display the number of differences from
the login user (either for Functions or Pathways). Once one or more of the boxes are selected,
the user can click on the Reconcile functions button (or Reconcile pathways, depending on
which differences are to be compared). If all users are selected, the following table is presented
after clicking the Reconcile functions button:
This table presents the differences in the assigned functions between the login user (i.e.
visitor) and the pre-existing set of selected users (in this case: COGs, Pfam,
SwissProt, and TIGR). The table is organized in five columns:
- the first column presents the ID of the ORF (ORF ID). If more than one of the selected users
have information for the same ORF, then this ORF is presented in multiple rows (a change in the
gray-scale intensity of the background color makes it easier to see where one ORF ends and the
next begins);
- the second column presents the name of the user (User);
- the third column presents the function predicted by the user of the second column
(Function assignments);
- the fifth column presents the function of the login user, in this case visitor
(Visitor's function assignment);
- the fourth column has a yellow arrow for each entry. This is used to transfer a specific
annotation from a particular user to the model generated by the login user.
ERGO is an interactive system, which supports the creation of user-specific models (that
is, user-specific annotations, pathways and reconstructions). Therefore, every login user
can either accept the annotations provided by the system or modify them as the user
chooses. These changes are saved in the system and become this user's default assignments.
Any other login user can also view such annotations made by other users through the
reconcile differences page, or through the individual ORF page (see below in section ???).
Wherever the login user does not change the default assignments
made by the ERGO curation team, these assignments will also be the defaults of the login
user.
ERGO's annotations have a very specific vocabulary, and synonyms are not recognized
as having essentially the same function. For example, for the first ORF presented in the table
above (i.e. RBAT0001), all three databases, PIRnr, Pfam and TIGR, agree with ERGO's
function prediction, even though the vocabulary or the precise annotation is not identical
for any of the different users. Therefore, this table will display all the ORFs for which
there are annotations available under any of the selected users, and provide a comparative view
against those of the login user.
However, by displaying all the information available across different users (whether
they essentially agree or not), this table allows for a relatively fast inspection of all the
assignments made by any of the selected users, in a comparative manner, in order to identify
possible cases of disagreement that would require further investigation.
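The effect of comparing annotations without recognizing synonyms can be seen in a small Python sketch. It assumes a plain exact-string comparison and a per-user mapping from ORF ID to function. Both are our assumptions, and the function name, ORF IDs and function names are made up; ERGO's actual comparison is not documented here.

```python
def reconcile_functions(login_assignments, other_users):
    """List (ORF ID, user, user's function, login user's function) where strings differ."""
    rows = []
    for orf_id in sorted(login_assignments):
        own = login_assignments[orf_id]
        for user in sorted(other_users):
            theirs = other_users[user].get(orf_id)
            # Exact comparison: a synonym counts as a difference.
            if theirs is not None and theirs != own:
                rows.append((orf_id, user, theirs, own))
    return rows


# Made-up assignments, for illustration only.
login_assignments = {
    "ORF0001": "DNA polymerase III beta subunit",
    "ORF0002": "hypothetical protein",
}
other_users = {
    "Pfam": {"ORF0001": "DNA polymerase III beta chain"},
    "TIGR": {"ORF0001": "DNA polymerase III beta subunit", "ORF0002": "transporter"},
}

rows = reconcile_functions(login_assignments, other_users)
for row in rows:
    print(row)
```

The first row differs only in wording, while the second points to a real disagreement. Telling the two apart is left to the curator inspecting the table.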