Tuesday, March 6, 2012

Pathway Analysis for High-Throughput Genomics Studies

I get a lot of requests in the core about running a "pathway analysis." Someone ran a handful of gene expression arrays, or better yet, ran an RNA-seq experiment (with replicates!). These, and many other kinds of high-throughput assays (GWAS, ChIP-seq, etc.) result in a list of genes and some associated p-value, fold change, or other statistic.

Here's some R code to download public data from a study on susceptibility to colorectal cancer. I realize that this is an oversimplified design - that's not the focus here. You can read more about proper design and contrast matrices in the limma vignette. You can read more about the study and the samples in the paper or on the GEO page.

A thoughtful and thorough analysis doesn't end with a list of genes and P-values. You spent way too much money on an experiment just to end up with a list of genes and p-values at some arbitrary cutoff. I often see investigators going down the list and cherry-picking genes that they think are important or might play a role, and trying to create a story involving those cherry-picked genes. I believe it's better to be more agnostic, taking a data-driven approach to framing results in a functional context.

How do you do this? This review recently published in PLoS Comp Bio is an excellent place to start:

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

I wrote an evaluation of this paper at Faculty of 1000 - here's an expanded version. The paper gives an overview of the available methods for pathway analysis, and categorizes these methods into three functional classes: over-representation analysis, functional class scoring, and topology-based methods. The paper discusses the advantages and limitations of each approach, as well as future development directions.

Over-representation analysis is what you would typically do with something like gene ontology annotations. The Gene Ontology (GO) project is an effort to describe gene products from all organisms using a consistent language (ontologies are formal representations of knowledge domains; GO does this with cell biology). A biological process is typically made up of a group of genes, as opposed to an individual gene alone. The idea of over-representation analysis is that if a biological process is abnormal in a given study, the co-functioning genes are more likely to be selected as a relevant group by the high-throughput screening technologies. For example, if 10% of the your genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the whole human genome (this is the gene population background) that are kinases, then you have significant over-representation in your genes for kinases. There are many, many tools for doing over-representation analysis using gene ontology terms, but they all rely on the same idea, typically using a hypergeometric test. You can create directed acyclic graphs like the one below, color coding nodes based on the statistical significance of the term overrepresentation in your gene list. See the paper above for a discussion of limitations of over-representation analysis, and also check out this paper on the use and misuse of gene ontology annotations.

Functional Class Scoring. One of the biggest limitations of over-representation analysis is that you have to arbitrarily decide what you call significant and what you don't. If you use an FDR threshold of 0.05 and fold change cutoff of 2, you'll lose all those genes with a fold change of 1.95 or FDR 0.051, which are arguably just as important as the genes just under the arbitrary cutoff. Further, over-representation analyses only use the number of genes and ignore how strongly those genes are associated with whatever you're studying. Pathway analysis methods that the above review classifies as "Functional Class Scoring" methods use all the genes you have as well as their association statistics (e.g. fold change), and compute a running enrichment score for gene groupings (based on some functional knowledge like gene ontology or KEGG pathways). Gene Set Enrichment Analysis is one of the more popular approaches in this category. The plot below shows the kind of results you get from GSEA.  From the GSEA documentation: "The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set. Gene sets with a distinct peak at the beginning (such as the one shown here) or end of the ranked list are generally the most interesting." So this shows that you have significant enrichment in the cancer phenotype (labeled "LL") for genes involved in the KEGG Spliceosome pathway.

Topology based methods. The above methods don't take into account how genes interact with each other (e.g. activation, inhibition, phosphorylation, direct/indirect interaction, ubiquitination, etc). Pathway topology methods this extra information in computing pathway-level statistics. I've recently started using the bioconductor SPIA package (Signaling Pathway Impact Analysis), which integrates lists of differentially expressed genes, their fold changes, and pathway topology, to identify pathways associated with condition you're studying. The code below runs SPIA on the analysis we did above. Make sure you successfully loaded all the R packages in the code above.

You can read more about the results SPIA is giving you in their paper and vignette. Briefly, each pathway is represented by a point, and the x and y axes are showing increasing evidence for the involvement of this pathway in your disease (the total "perturbation" in the pathway based on your gene expression data).

It's worth noting that all of the above mentioned methods have limitations. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects - which is something we'll get from RNA-seq experiments. Annotations are often incorrect and inaccurate, and we don't have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure rather than a pure statistical solution and analytical endpoint.

Despite the limitations, the kinds of methods discussed here and reviewed elsewhere (see below) are still very useful for extracting biological meaning and framing results from high-throughput experiments in a functional context. The bioinformatics core here at UVA can do any of the kinds of analyses discussed above. Please fill out a consultation request form and I'll be happy to talk to you about what kinds of insight you may be able to glean from your existing data or in an experiment you're planning.

Other useful literature on the subject:

Holmans, P., Green, E. K., Pahwa, J. S., Ferreira, M. a R., Purcell, S. M., Sklar, P., Owen, M. J., et al. (2009). Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. American journal of human genetics, 85(1), 13-24. doi:10.1016/j.ajhg.2009.05.011

Huang, D. W., Sherman, B. T., & Lempicki, R. a. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1), 1-13. doi:10.1093/nar/gkn923

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. (C. A. Ouzounis, Ed.)PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature reviews. Genetics, 9(7), 509-15. doi:10.1038/nrg2363

Wang, K., Li, M., & Bucan, M. (2007). Pathway-Based Approaches for Analysis of Genomewide Association Studies. American journal of human genetics, 81(6), 1278-1283. doi:10.1086/522374

Yaspan, B. L., & Veatch, O. J. (2011). Strategies for Pathway Analysis from GWAS Data. Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.], Chapter 1(October), Unit1.20. doi:10.1002/0471142905.hg0120s71


  1. Thanks for noticing our paper and giving it such a great plug! -- Atul Butte

  2. Hey,
    I just hopped over to your site via Stumbleupon. Not somthing I would normally read, but I liked your thoughts none the less. Thanks for making something worth reading.

  3. Thank you - very comprehensive!

  4. This was a great summary. One of my struggles working in maize is that many of the ontology and gene pathway information has been worked out in Arabidopsis. What I am seeing is that one copy or some copies are involved in the same pathways, but evolution pushes the duplicates in other directions. And a priori you can't tell which copy is coopted internally for alternate function.

  5. Often neglected is the key difference between "competitive" and "self-contained" tests. It's all about how the null hypothesis the test will reject.

    In brief,

    competitive gene-set test look for a higher than expected number of differentially expressed genes, or otherwise "highly ranked" or "experimentally positive" genes in your set compared to the gene universe.

    self-contained tests look for differences at the gene-set level. If you have array data, you have to define a difference index using all genes in the gene-set. If you have rare variants, you will compare the "load" of rare variants in cases to controls (basically doing an "association test" at the gene-set level).

    When the test is based on random sampling, gene sampling is typically used by competitive tests, phenotype / class-label sampling is typically used by self-contained tests.

    In my experience, competitive are used a lot for gene expression microarray data, especially when the replicates are reproducible and classes well-separated. Self-contained are prevalent for (rare) genetic variants.

    Here's one review for more:

  6. Hi,

    I'm analyzing a microarray data and stuck at the design matrix command.

    I have few samples, in which I divided them into two groups i.e. "Controls" & "Diseased".

    I assigned the samples information as a factor with two levels "C" & "D".

    In what exact order should I assign the levels? If I assign levels as c(C, D) then I get the logFC value as say X and if I do it as c(D, C) I get the logFC value as -X.

    Should I start with the diseased group (D) or control group (C)?

    In default, in what exact order the design matrix take the levels?

    My code is as follows:

    info$Group<-factor(info$Group, levels=c("C","D"))

    # ^^^ Which one should I consider first in the above command? The diseased group or control group? ^^^



    fit<-lmFit(exp, design)
    contr.str <- c()
    for(i in 1:(len-1))contr.str<-c(contr.str, paste(lev[(i+1):len], lev[i], sep="-"))
    contr.mat<-makeContrasts(contrasts=contr.str, levels=lev)
    fit2<-contrasts.fit(fit, contr.mat)
    top <- topTable(fit2, number=nrow(fit2), adjust.method="fdr")

  7. Hi Dr. Turner,

    Thanks for this useful post (and for many others on Getting Genetics Done). I am trying to use Pathway Express from RontoTools (which as I understand is advanced version of spia) for microarray analysis. I get an error when I use the option "updateCache = TRUE" in keggPathwayGraphs command in order to download the latest KEGG pathways. I was wondering if you have encountered this kind of error before. Am I getting this error because I do not have license for KEGG files?

    kpg <- keggPathwayGraphs("hsa", updateCache = TRUE, verbose = F)

    NA element in edges.
    Error in validObject(.Object) : invalid class “graphNEL” object: FALSE


    1. Hi Shruti. Thanks for the comment, and thanks for introducing me to ROntoTools - I didn't know about this package. I don't know how to solve the problem you're having. You might try SEQanswers.

  8. Hi Dr. Turner,
    I would like to put in a plug for a paper (no connection to my work and no other conflict of interest; I read it for a class project). It's another review of pathway analysis methods, and it's got a nice set of references that I don't think I would have found anywhere else. When I saw on Google Scholar that it only had one citation, I got sad.

    It focuses on network-topology based methods. Hope you or your readers will enjoy!
    Eric Kernfeld

    Full citation
    Liu, Y., & Chance, M. R. (2013). Pathway analyses and understanding disease associations. Current Genetic Medicine Reports, 1(4), 10.1007/s40142–013–0025–3. http://doi.org/10.1007/s40142-013-0025-3



Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.