Friday, June 7, 2013

ENCODE ChIP-Seq Significance Tool: Which TFs Regulate my Genes?

I collaborate with several investigators on gene expression projects using both microarray and RNA-seq. After I show a collaborator which genes are dysregulated in a particular condition or tissue, the most common question I get is "what are the transcription factors regulating these genes?"

This isn't the easiest question to answer. You could look at transcription factor binding site position weight matrices like those from TRANSFAC and come up with a list of all factors that potentially hit that site, then perform some kind of enrichment analysis on that. But this involves some programming, and is based solely on sequence motifs, not experimental data.
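To give a feel for the "some programming" involved in the motif route, here is a minimal sketch of scanning a sequence with a position weight matrix. The four-column matrix, the uniform 0.25 background, the threshold, and the function names are all made up for illustration; none of it comes from TRANSFAC or any real motif database.

```python
from math import log2

def pwm_log_odds(pwm, background=0.25):
    """Convert per-position base probabilities into log-odds scores.

    Assumes every probability is non-zero (i.e. pseudocounts were already added).
    """
    return [{base: log2(p / background) for base, p in column.items()}
            for column in pwm]

def scan(sequence, log_odds, threshold):
    """Return (position, site, score) for every window scoring >= threshold.

    Assumes an uppercase sequence containing only A, C, G and T.
    """
    width = len(log_odds)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(column[base] for column, base in zip(log_odds, window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits

# Toy motif that prefers the site "TGAC" (hypothetical, for illustration only).
TOY_PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

if __name__ == "__main__":
    for position, site, score in scan("AATGACGG", pwm_log_odds(TOY_PWM), threshold=4.0):
        print(position, site, round(score, 2))
```

Even with a scan like this in hand, you would still need to run it for every factor's matrix across every promoter and then test for enrichment, and the result would still be a sequence-based prediction rather than evidence of binding.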

The ENCODE consortium spent over $100M and generated hundreds of ChIP-seq experiments for different transcription factors and histone modifications across many cell types (if you don't know much about ENCODE, go read the main ENCODE paper, and Sean Eddy's very fair commentary). Regardless of what you might consider "biologically functional", the ENCODE project generated a ton of data, and much of this data is publicly available. But that still doesn't help answer our question, because genes are often bound by multiple TFs, and TFs can bind many regions. We need to perform an enrichment (read: hypergeometric) test to assess an over-representation of experimentally bound transcription factors around our gene targets of interest ("around" also implies that some spatial boundary must be specified). To date, I haven't found a good tool to do this easily.
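For the curious, the hypergeometric test itself is only a few lines. The sketch below uses nothing but the Python standard library and asks: if I draw my gene list at random from the background, how likely is it that at least this many of the genes are bound by a given factor? The function name, argument names, and numbers are my own placeholders, not anything taken from a particular tool.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_bound_background, n_query, n_bound_query):
    """Upper-tail hypergeometric p-value, P(X >= n_bound_query).

    n_background        genes in the background set
    n_bound_background  background genes bound by the factor
    n_query             genes in the list of interest (a subset of the background)
    n_bound_query       genes of interest bound by the factor
    """
    if not 0 <= n_bound_background <= n_background:
        raise ValueError("bound background genes must fit inside the background")
    if not 0 <= n_bound_query <= n_query <= n_background:
        raise ValueError("query counts must fit inside the background")

    total = comb(n_background, n_query)
    max_k = min(n_query, n_bound_background)
    # Sum the probability of seeing k bound genes for every k at or above the observed count.
    tail = sum(
        comb(n_bound_background, k) * comb(n_background - n_bound_background, n_query - k)
        for k in range(n_bound_query, max_k + 1)
    )
    return tail / total

if __name__ == "__main__":
    # Toy numbers: 10 background genes, 4 bound by the factor,
    # 3 genes of interest, 2 of which are bound.
    print(f"{hypergeom_enrichment_p(10, 4, 3, 2):.4f}")  # 0.3333
```

Note how the background size feeds directly into the p-value, which is why the choice of background set matters so much. In practice you would run this once per factor and then correct for multiple testing.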

Raymond Auerbach and Bin Chen in Atul Butte's lab recently developed a resource to address this very common need, called the ENCODE ChIP-Seq Significance Tool.

The paper: Auerbach et al. Relating Genes to Function: Identifying Enriched Transcription Factors using the ENCODE ChIP-Seq Significance Tool. Bioinformatics (2013): 10.1093/bioinformatics/btt316.

The software: ENCODE ChIP-Seq Significance Tool

This tool takes a list of "interesting" (significant, dysregulated, etc.) genes as input and identifies the ENCODE transcription factors whose binding sites are enriched around those genes. Head over to the tool, select the ID type you're using (Ensembl, Symbol, etc.), and paste in your list of genes. You can also specify your background set (this has big implications for the significance testing using the hypergeometric distribution). Scroll down some more to tell the tool how far upstream and downstream you want to look from the transcription start/end site or whole gene, select an ENCODE cell line (or ALL), and hit submit.
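To make the upstream/downstream window concrete, here is a rough sketch of how one might decide which factors have a peak near a gene's transcription start site. This is not the tool's actual implementation, which the post doesn't describe. The `Gene` and `Peak` records, the function names, the inclusive coordinates, and the 500 bp defaults are all assumptions for illustration.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"

class Peak(NamedTuple):
    factor: str
    chrom: str
    start: int
    end: int

def tss_window(gene, upstream, downstream):
    """Return (chrom, lo, hi) spanning `upstream` bp before and `downstream` bp after the TSS."""
    if gene.strand == "+":
        tss = gene.start
        lo, hi = tss - upstream, tss + downstream
    elif gene.strand == "-":
        # On the minus strand the TSS is the rightmost coordinate and "upstream" points right.
        tss = gene.end
        lo, hi = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unknown strand: {gene.strand!r}")
    return gene.chrom, max(lo, 0), hi

def factors_bound_near(gene, peaks, upstream=500, downstream=500):
    """Set of factors with at least one peak overlapping the gene's TSS window."""
    chrom, lo, hi = tss_window(gene, upstream, downstream)
    return {p.factor for p in peaks
            if p.chrom == chrom and p.start <= hi and p.end >= lo}

if __name__ == "__main__":
    gene = Gene("GENE1", "chr1", 1000, 5000, "+")  # made-up gene
    peaks = [Peak("TF_A", "chr1", 600, 700), Peak("TF_B", "chr1", 4900, 5100)]
    print(factors_bound_near(gene, peaks))
```

Counting, for each factor, how many genes in your list and in the background come back "bound" gives you exactly the four numbers a hypergeometric test needs. Widening or narrowing the window changes those counts, so it is worth trying more than one setting.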

You're then presented with a list of transcription factors that are most likely regulating your input genes (based on overrepresentation of ENCODE ChIP-seq binding sites). Your results can then be saved to CSV or PDF. You can also click on a number in the results table and get a list of genes that are regulated by a particular factor (the numbers do not appear as hyperlinks in my browser, but clicking the number still worked).

At the very bottom of the page, you can load example data that they used in the supplement of their paper, and run through the analysis presented therein. The lead author, Raymond Auerbach, even made a very informative screencast on how to use the tool:

Now, if I could only find a way to do something like this with mouse gene expression data.


  1. In theory, would it be possible to use ENCODE ChIP-Seq Significance Tool on data?

  2. Hello, Cscan does the same thing and works on mouse data too. We just updated it a few days ago with tons of data from ENCODE, mouseENCODE and modENCODE.

  3. A mouse version of the ENCODE ChIP-Seq significance tool could easily be included (our codebase wouldn't change much and it is just a matter of processing the mouse peaks). That being said, I did my PhD work in an ENCODE/modENCODE analysis lab so I'll explain why we limited the Bioinformatics submission to human.

    The reason why we limited the tool to human at this time is because the human ENCODE peaks have been uniformly scored by the Analysis Working Group (this means they are scored against the same background data set, replicate agreement determined by the same statistics, etc).

    The issue with the modENCODE and mouseENCODE data sets is that since those Consortia have not written an integrative paper yet, the data on those sites are submitted by the individual labs and can be processed using different criteria. No current tool has these "unified peak calls" (I don't know if they exist for mouse yet and the worm/fly versions fall under the modENCODE embargo until they publish. One doesn't mess with the NHGRI ENCODE Data Release policy :o). We knew this was going to be a problem for review (heck, as a reviewer I would have called any similar manuscript on that myself) so we stuck to human data for the paper.

    If there is demand, I'd be happy to put up a version using the public mouse peaks with a disclaimer that these peaks were submitted by the labs and are not normalized, etc., to be replaced with the official unified peaks later. Just let me know if you/the community would find that useful.

    1. Thanks for the detailed back story, Raymond. I also agree that waiting until the mouse consortium writes their integrative paper with normalized "unified" peak calls, as in the human data, might be of more value than using the individual lab data. I look forward to the mouse/mod version of this tool!

    2. Hi, I study mouse immune cell development. I went ahead and put in genes from my microarray experiment. Your program gave back the transcription factors critical to the exact stage of development I am studying. I assume it didn't use the genes with names not conserved between mice and humans.

      So... that worked really well.

      Patrick Collins

    3. Cool. I've been apprehensive about trying this, but glad to hear that you saw what you expected.

  4. Take a look at ismara:


Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.