Phenotype Enrichment Analysis

(1) SPECIES OF GENE SETS
(2) PHENOTYPES TO BE ANALYZED
(3) GENE SETS TO BE ANALYZED
(4) ALTERNATIVE HYPOTHESIS

RESULT OF EXAMPLE DATASETS

Example Dataset 1 Human genes exhibiting expression changes in response to Zika virus (ZIKV) infection.
Example Dataset 2 Budding yeast genes (Example Dataset 2.1) and fission yeast (Schizosaccharomyces pombe) genes (Example Dataset 2.2) arising from lineage-specific duplication events after the divergence of budding and fission yeasts.
Example Dataset 3 Zebrafish orthologs showing reduced expression in eyes of cave-dwelling Sinocyclocheilus species.
Example Dataset 4 C. elegans genes expressed in gonads.
Example Dataset 5 Genes that are highly expressed after a blood-meal by malaria vector mosquito.
Example Dataset 6 Highly expressed human genes in vascular smooth muscle cells of patients with giant cell arteritis (GCA).

ONLINE TUTORIALS

We designed modPhEA to functionally interpret gene sets based on phenotypes that were previously found to result from genetic perturbations in several model organisms (Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Mus musculus and Saccharomyces cerevisiae) and on disease phenotypes resulting from spontaneous mutations in humans (Homo sapiens). To identify functions that represent phenotypic categories, input gene sets are tested for enrichment or depletion of phenotypes that are predefined in modPhEA (see “Example Datasets 1-5”), or of phenotypes that are customized by the user (see “Example Dataset 6”). Users can perform analyses that focus on phenotypes derived from mutagenesis experiments, knockdown experiments, or both. In addition, modPhEA supports analyses of genes from sequenced and annotated animal and fungal genomes (see the full list of supported species).

DATA INPUT

Prior to submitting gene sets for functional interpretation, information regarding the following four sections should be provided: (1) SPECIES OF GENE SETS, (2) PHENOTYPES TO BE ANALYZED, (3) GENE SETS TO BE ANALYZED, and (4) ALTERNATIVE HYPOTHESIS.

(1) SPECIES OF GENE SETS

Since the genomes of different species vary in their gene content, the organism from which the gene set(s) to be analyzed was obtained must first be specified. Currently, there are 132 animal and 50 fungal species which can be analyzed by modPhEA, and these are alphabetically listed in a drop-down menu as shown in Figure 1.

Figure 1. An alphabetized list of species whose gene set(s) can be analyzed is provided in modPhEA.

(2) PHENOTYPES TO BE ANALYZED

With modPhEA, phenotype enrichment analyses can be conducted based on any one of six model organism systems. The organism in which the phenotypes are to be investigated should be specified in “Source of phenome” (as shown in Figure 2A). When the species for the gene set to be analyzed is selected (in the previous step), the phenome of the most closely related model organism included in modPhEA is automatically assigned. Users can then decide to proceed with the analysis based on the recommended phenome, or they can select an alternative phenome.

Next, users need to specify whether the enrichment analysis to be performed is based on predefined phenotypes (Figure 2B) or customized phenotypes (Figure 2C). The predefined phenotypes available in modPhEA are hierarchically structured according to the phenotype ontology obtained from the primary databases that are commonly used by each of the organism-specific communities (see Table 1). The hierarchical structures of the predefined phenotypes are organized from the highest level to the lowest level for comprehensiveness. The available phenotypes can also be narrowed down by deselecting the box, “all levels” (as shown in Figure 2B).

Figure 2. Interface that allows the user to indicate which phenotypes are to be analyzed (see animation tutorial).

Table 1. Available sources of phenotype ontology in modPhEA.

Model organism (common name)	Source of phenotype ontology (URL)
S. cerevisiae (budding yeast)	OBO Foundry (http://www.obofoundry.org/ontology/apo.html)
C. elegans (roundworm)	WormBase (ftp://ftp.wormbase.org/pub/wormbase/releases/WS253/ONTOLOGY/)
D. melanogaster (fruit fly)	Flybase (ftp://ftp.flybase.net/releases/FB2016_03/precomputed_files/ontologies/)
D. rerio (zebrafish)	ZFIN (https://zfin.org/downloads)
M. musculus (house mouse)	MGI (http://www.informatics.jax.org/downloads/reports/)
H. sapiens (human)	HPO (http://human-phenotype-ontology.github.io/downloads.html)

In modPhEA, several example datasets for analyses based on predefined phenotypes are available. Example Dataset 1 demonstrates an analysis based on human or mouse predefined phenotypes, while Example Datasets 2-5 demonstrate analyses of gene sets based on predefined phenotypes of budding yeast, zebrafish, worm, and fruit fly, respectively. Details regarding the source gene sets and the analytical results of each example dataset are described below in the section “EXAMPLE DATASETS”.

In addition, modPhEA allows users to conduct enrichment analyses on a series of customized phenotypes. A customized phenotype can be created by combining any number of predefined phenotypic terms. Example Dataset 6 demonstrates how a customized phenotype can be created and analyzed. The disease “giant cell arteritis (GCA)” often has manifestations of vasculitis (HP:0002633), granulomatosis (HP:0002955), amaurosis fugax (HP:0100576), facial palsy (HP:0010628), renal amyloidosis (HP:0001917), dysphagia (HP:0002015), trismus (HP:0000211), and encephalopathy (HP:0001298). Accordingly, in this example, the customized phenotype, “giant cell arteritis (GCA)”, is composed of the HP terms listed above. For the analysis, a gene is considered to be associated with the GCA phenotype if it has at least one annotated phenotype that matches any of the designated GCA-associated HP terms or their downstream terms in the phenotype ontology. Multiple customized phenotypes can be created for an analysis in modPhEA as well. A list of customized phenotypes to be examined for enrichment in gene sets can be supplied with a file in the suggested format, or by manually adding the terms one-by-one into a user-friendly interface as demonstrated here.
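The membership rule for a customized phenotype can be sketched in a few lines of Python. This is only an illustration of the rule described above, not modPhEA's implementation: the ontology fragment, the `TOY:` child terms, the gene names, and the function names are all hypothetical.

```python
from collections import deque

# Toy fragment of a phenotype ontology: parent term -> direct child terms.
# The TOY: identifiers are hypothetical placeholders, not real HPO terms.
CHILDREN = {
    "HP:0002633": ["TOY:0000001", "TOY:0000002"],
    "TOY:0000001": ["TOY:0000003"],
}

# The HP terms that make up the customized GCA phenotype in Example Dataset 6.
GCA_TERMS = ["HP:0002633", "HP:0002955", "HP:0100576", "HP:0010628",
             "HP:0001917", "HP:0002015", "HP:0000211", "HP:0001298"]

# Hypothetical gene-to-phenotype annotations.
ANNOTATIONS = {
    "geneA": ["TOY:0000003"],  # downstream of vasculitis in the toy tree
    "geneB": ["HP:0002015"],   # directly annotated with dysphagia
    "geneC": ["TOY:0000099"],  # unrelated term
}


def expand_terms(seed_terms, children):
    """Return the seed terms together with all of their downstream terms."""
    seen = set(seed_terms)
    queue = deque(seed_terms)
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def genes_with_phenotype(annotations, seed_terms, children):
    """Genes with at least one annotation inside the expanded term set."""
    terms = expand_terms(seed_terms, children)
    return {gene for gene, annots in annotations.items() if terms & set(annots)}


gca_genes = genes_with_phenotype(ANNOTATIONS, GCA_TERMS, CHILDREN)
print(sorted(gca_genes))  # ['geneA', 'geneB']
```

Here geneA counts as GCA-associated only through a downstream term, which is the case the rule is meant to capture.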

Phenotypic data for each model organism implemented in modPhEA were obtained from various genetic approaches. Since these approaches can differ in sensitivity and may produce data subject to approach-specific biases, modPhEA allows a user to perform an analysis based on a subset of phenotypic data derived from the selected approaches as listed in “Advanced Options” (shown in Figure 3).

Figure 3. Gene data from (1) mutagenesis, (2) knockdown, or (3) other studies can be selected for inclusion.

(3) GENE SETS TO BE ANALYZED

Two types of enrichment analyses are available in modPhEA. The first analysis type identifies enriched or depleted phenotypes of a given gene set, and the remaining genes in the genome are used as the background reference (by selecting the option, “Against rest of genes in the genome”). The second analysis type detects differentially enriched phenotypes by comparing two given gene sets. For both analysis types, the title of the gene set should be provided (A in Figure 4) and the genes on the list should be entered as Ensembl gene IDs separated by (a) a tab, (b) a return, (c) a comma, (d) a semicolon, or (e) a single space. The gene list can also be supplied by uploading a text file (B in Figure 4). In the analysis performed by modPhEA, genes that belong to organisms other than the species selected for the gene list are recognized and automatically excluded. For instance, if “Mus musculus” is selected as the organism to be analyzed, and a gene that belongs to “Homo sapiens” is included in the input gene set, the human gene will be discarded from the analysis.

Figure 4. The input fields of a query gene set. Data can be entered by uploading a file or by directly copying/pasting the data into the text field.
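For users who prepare gene lists programmatically, the accepted separators and the species filter can be mimicked as follows. This is a minimal sketch under stated assumptions: how modPhEA actually recognizes foreign genes is not described here, so the check against Ensembl stable ID prefixes (`ENSG` for human, `ENSMUSG` for mouse) is an assumption, and the function names and example IDs are illustrative only.

```python
import re

# Ensembl stable gene ID prefixes for two species. Using the prefix to decide
# species membership is an assumption made for this sketch.
SPECIES_PREFIX = {
    "Homo sapiens": "ENSG",
    "Mus musculus": "ENSMUSG",
}


def parse_gene_list(text):
    """Split on tabs, returns, commas, semicolons, or spaces; drop empty items."""
    tokens = re.split(r"[\t\r\n,; ]+", text.strip())
    return [token for token in tokens if token]


def keep_species(gene_ids, species):
    """Keep only IDs made of the species prefix followed by 11 digits."""
    pattern = re.compile(re.escape(SPECIES_PREFIX[species]) + r"\d{11}")
    return [gene for gene in gene_ids if pattern.fullmatch(gene)]


# Illustrative input in Ensembl ID format, mixing all allowed separators.
raw = "ENSMUSG00000041147\tENSMUSG00000059552, ENSG00000139618;ENSMUSG00000017167\n"

gene_ids = parse_gene_list(raw)
mouse_ids = keep_species(gene_ids, "Mus musculus")
print(len(gene_ids), len(mouse_ids))  # 4 3
```

With “Mus musculus” selected, the single human ID is dropped and the three mouse IDs remain, mirroring the behavior described for the input form.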

(4) ALTERNATIVE HYPOTHESIS

The alternative hypothesis for an enrichment analysis should be specified (Figure 5). Based on the numbers of genes that are associated or not associated with the phenotypic term under investigation, a 2×2 contingency table is constructed for each phenotypic term of the selected levels. Fisher’s exact test is then applied to obtain a P-value for each phenotype. The option, “differentially enriched”, indicates a two-sided test of the null hypothesis of no enrichment/depletion of the examined phenotype. The options, “Set 1 Enriched” and “Set 1 Depleted”, apply a one-sided test. The P-values are FDR corrected and Bonferroni corrected for multiple testing.

Figure 5. Alternative hypothesis options are available for enrichment analyses.
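The statistics behind the three alternative hypotheses can be reproduced outside modPhEA with the standard library alone. The sketch below is not modPhEA's code. It assumes the common "sum of tables no more probable than the observed one" definition for the two-sided test and the Benjamini-Hochberg procedure for the FDR correction, since the exact FDR method is not named above. The counts and function names are hypothetical.

```python
from math import comb


def fisher_exact(a, b, c, d, alternative="two-sided"):
    """P-value of Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: Set 1 genes with the phenotype      b: Set 1 genes without it
    c: reference genes with the phenotype  d: reference genes without it
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    total = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / total

    if alternative == "greater":   # corresponds to "Set 1 Enriched"
        return sum(prob(x) for x in range(a, hi + 1))
    if alternative == "less":      # corresponds to "Set 1 Depleted"
        return sum(prob(x) for x in range(lo, a + 1))
    # Two-sided: sum all tables that are no more probable than the observed one.
    p_obs = prob(a)
    p_sum = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p_sum)


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):   # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 8 of 10 Set 1 genes and 1 of 6 reference genes
# are annotated with the phenotype.
p_two_sided = fisher_exact(8, 2, 1, 5)             # 280/8008, about 0.035
p_enriched = fisher_exact(8, 2, 1, 5, "greater")   # 196/8008, about 0.024

raw_p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw_p))
print(benjamini_hochberg(raw_p))
```

In a real analysis, one such table is built per phenotypic term, and the two corrections are applied across the P-values of all tested terms.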

OUTPUT PAGE

After completing the above sections, users can click the “Submit” button to initiate an enrichment analysis of phenotypes of a chosen model organism. Users can then determine if the analysis should proceed. Once the analysis is complete, the results will be presented as shown in Figure 6. The top section of the results page will provide information regarding the gene sets according to the parameters selected by the user and the color code will indicate enrichment or depletion of the phenotypes (A in Figure 6). The bottom section of the results page will provide a list of the analyzed phenotypes and the corresponding enrichment/depletion status. There are two options for displaying the results (B in Figure 6) and the user can switch between the modes to either show all of the results or only the multiple-testing corrected results (Figure 8a). To download the results, plain text or Excel file formats are available (Figure 6C and Figure 8b).

The default display mode for the results (“Result in tree”) presents the enriched or depleted phenotypes in a hierarchical structure based on phenotype ontology. This mode of display is only available when predefined phenotypes are selected for the analysis (see Figure 2B). In this mode, the statistical details of each term are available by clicking on the term of interest. The display option, “Result in List”, presents the enriched/depleted phenotypes as a list, and the reported phenotypic terms are sorted by P-value by default. When the display option, “Show significant result”, is selected, only the enriched or depleted phenotypes that have a P-value, FDR corrected P-value, or Bonferroni-corrected P-value less than 0.05 are displayed.
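The list view and the significance filter can be imitated on downloaded results with a short script. This sketch is only an approximation: the record layout, the column names (`p`, `fdr`, `bonferroni`), and the phenotype labels are hypothetical and do not reflect the actual columns of the downloaded files. The choice of which of the three values to threshold is left as a parameter.

```python
# Hypothetical result records; the values are made up for illustration.
results = [
    {"term": "phenotype B", "p": 0.030, "fdr": 0.045, "bonferroni": 0.090},
    {"term": "phenotype C", "p": 0.400, "fdr": 0.400, "bonferroni": 1.000},
    {"term": "phenotype A", "p": 0.002, "fdr": 0.006, "bonferroni": 0.006},
]


def as_list(records):
    """List view: records sorted by raw P-value, smallest first."""
    return sorted(records, key=lambda record: record["p"])


def significant(records, column="p", alpha=0.05):
    """Keep records whose chosen P-value column is below the cutoff."""
    return [record for record in as_list(records) if record[column] < alpha]


print([r["term"] for r in as_list(results)])
print([r["term"] for r in significant(results, column="bonferroni")])
```

With these made-up values, phenotypes A and B pass the raw and FDR cutoffs, while only phenotype A passes the stricter Bonferroni cutoff.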

The processed data and results can be downloaded from modPhEA by selecting the option, “Download results in list format (TEXT file)”, “Download results in list format (Excel file)”, or “Download results in tree format (Excel file)”. Please note that the .xlsx Excel files are only supported by Microsoft Office 2007 and above.

Figure 6. Listed results of differentially enriched phenotypes.

Figure 7. Results of differentially enriched phenotypes in a hierarchical structure.

Figure 8a. The display mode for the results can be switched between a show-all mode and a multiple-testing corrected result mode.

Figure 8b. The analyzed gene sets and the results of an analysis can be downloaded as a plain text file or as an Excel file that lists the results in a hierarchical format.