Thursday, June 30, 2011

Mapping SNPs to Genes for GWAS Enrichment Analysis

There are several tools available for conducting a post-hoc analysis of GWAS data looking for enrichment of significant SNPs using literature or pathway based resources. Examples include GRAIL, ALLIGATOR, and WebGestalt among others (see SNPath R Package). Since gene enrichment and pathway analysis essentially evolved from methods for analyzing gene expression data, many of these tools require specific gene identifiers as input.

To prep GWAS data for analyses such as these, you'll need to ensure that your SNPs have current map position information, and assign each SNP to genes. To update your map information, the easiest resource to use is the UCSC Tables browser.

To use UCSC, go to the TABLES link on the browser page. Select the current assembly, "Variation and Repeats" for group, "Common SNPs(132)" for the track, and "snp132Commmon" for the table. Now you'll need to upload your SNP list. You can manually edit the map file using a tool like Excel or Linux commands like awk to extract the rs_ids for your SNPs into a text file. Upload this file by clicking the "Upload List" button in the "identifiers" section. Once this file has uploaded, select "BED - browser extensible data" for the output format and click "Get Output". The interface will ask you if you want to add upstream or downstream regions -- for SNPs this makes little sense, so leave these at zero. Also, ensure that "Send to Galaxy" is not checked.

This will export a BED file containing the positions of all your SNPs in the build you selected, preferably the most current one -- let's call it SNPs.bed. BED files were developed by the folks behind the UCSC browser to document genomic annotations. The basic format is:


Because BED files were designed to list genomic ranges, SNPs are listed with a range of 1 BP, meaning the StartPosition is the location of the SNP and the EndPosition is the location of the SNP + 1.

The next step is to map these over to genes. To save a lot of effort resolving identifiers (which is no fun, I promise), I have generated a list of ~17,000 genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers. These files are available here:

Hugo Genes
Entrez Genes
Ensembl Genes

The genes listed in these files were generated by comparing the cross-references between the Ensembl and Entrez-gene databases. To arrive at a set of "consensus" genes, I selected only genes where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene. Nearly all cases of inconsistent cross-referencing are genes annotated based on electronic predictions, or genes with multiple or inconsistent mappings. For the purpose of doing enrichment analysis, these would be ignored anyway, as they aren't likely to show up in pathway or functional databases. For these genes, we then obtained the HUGO approved gene identifier. The coordinates for all genes are mapped using hg19/GRChB37.

BED files are plain text and easy to modify. If you wanted to add 10KB to the gene bounds, for example, you could alter these files using excel formulas, or with a simple awk command:

awk '{printf("%d\t%d\t%d\t%s\n", $1, $2-10000, $3+10000, $4); }' ensemblgenes.bed

You'll need to download a collection of utilities called BEDTools. If you have access to a Linux system, this should be fairly straightforward. These utilities are already installed on Vanderbilt's Linux application servers, but to compile the program yourself, download the .tar.gz file above, and use the following commands to compile:

tar -zxvf BEDTools..tar.gz
cd BEDTools
make clean
make all
ls bin

# copy the binaries to a directory in your PATH. e.g.,
sudo cp bin/* /usr/local/bin

Once you get BEDTools installed, there is a utility called intersectBed. This command allows you to examine the overlap of two BED files. Since we have our collection of SNPs and our collections of Genes in BED file format already, all we need to do is run the command.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb

This will print to the screen any SNP that falls within an EntrezGene boundary, along with the coordinates from each entry in each file. To strip this down to only RS numbers and gene identifiers, we can pipe this through awk.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb | awk '{printf("%s\t%s\n", $4, $8);}'

This will produce a tab-delimited file with SNPs that fall within genes identified by EntrezGene ID number. For HUGO names or Ensembl gene IDs, use the corresponding BED file.

Wednesday, June 29, 2011

Nucleic Acids Research Web Server Issue

Nucleic Acids Research just published its Web Server Issue, featuring new and updates to existing web servers and applications for genomics and proteomics research. In case you missed it, be sure to check out the Database Issue that came out earlier this year.

This web server issue has lots of papers on tools for microRNA analysis, and protein/RNA secondary structure analysis and annotation. Here are a few that sounded interesting for those doing systems genomics and trying to put findings into a functional, biologically relevant context:

g:Profiler—a web server for functional interpretation of gene lists (2011 update)

Abstract: Functional interpretation of candidate gene lists is an essential task in modern biomedical research. Here, we present the 2011 update of g:Profiler (, a popular collection of web tools for functional analysis. g:GOSt and g:Cocoa combine comprehensive methods for interpreting gene lists, ordered lists and list collections in the context of biomedical ontologies, pathways, transcription factor and microRNA regulatory motifs and protein–protein interactions. Additional tools, namely the biomolecule ID mapping service (g:Convert), gene expression similarity searcher (g:Sorter) and gene homology searcher (g:Orth) provide numerous ways for further analysis and interpretation. In this update, we have implemented several features of interest to the community: (i) functional analysis of single nucleotide polymorphisms and other DNA polymorphisms is supported by chromosomal queries; (ii) network analysis identifies enriched protein–protein interaction modules in gene lists; (iii) functional analysis covers human disease genes; and (iv) improved statistics and filtering provide more concise results. g:Profiler is a regularly updated resource that is available for a wide range of species, including mammals, plants, fungi and insects.

KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases

Abstract: High-throughput experimental technologies often identify dozens to hundreds of genes related to, or changed in, a biological or pathological process. From these genes one wants to identify biological pathways that may be involved and diseases that may be implicated. Here, we report a web server, KOBAS 2.0, which annotates an input set of genes with putative pathways and disease relationships based on mapping to genes with known annotations. It allows for both ID mapping and cross-species sequence similarity mapping. It then performs statistical tests to identify statistically significantly enriched pathways and diseases. KOBAS 2.0 incorporates knowledge across 1327 species from 5 pathway databases (KEGG PATHWAY, PID, BioCyc, Reactome and Panther) and 5 human disease databases (OMIM, KEGG DISEASE, FunDO, GAD and NHGRI GWAS Catalog). KOBAS 2.0 can be accessed at

ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework

Abstract: Genome-wide association study (GWAS) is widely utilized to identify genes involved in human complex disease or some other trait. One key challenge for GWAS data interpretation is to identify causal SNPs and provide profound evidence on how they affect the trait. Currently, researches are focusing on identification of candidate causal variants from the most significant SNPs of GWAS, while there is lack of support on biological mechanisms as represented by pathways. Although pathway-based analysis (PBA) has been designed to identify disease-related pathways by analyzing the full list of SNPs from GWAS, it does not emphasize on interpreting causal SNPs. To our knowledge, so far there is no web server available to solve the challenge for GWAS data interpretation within one analytical framework. ICSNPathway is developed to identify candidate causal SNPs and their corresponding candidate causal pathways from GWAS by integrating linkage disequilibrium (LD) analysis, functional SNP annotation and PBA. ICSNPathway provides a feasible solution to bridge the gap between GWAS and disease mechanism study by generating hypothesis of SNP → gene → pathway(s). The ICSNPathway server is freely available at

AnnotQTL: a new tool to gather functional and comparative information on a genomic region

Abstract: AnnotQTL is a web tool designed to aggregate functional annotations from different prominent web sites by minimizing the redundancy of information. Although thousands of QTL regions have been identified in livestock species, most of them are large and contain many genes. This tool was therefore designed to assist the characterization of genes in a QTL interval region as a step towards selecting the best candidate genes. It localizes the gene to a specific region (using NCBI and Ensembl data) and adds the functional annotations available from other databases (Gene Ontology, Mammalian Phenotype, HGNC and Pubmed). Both human genome and mouse genome can be aligned with the studied region to detect synteny and segment conservation, which is useful for running inter-species comparisons of QTL locations. Finally, custom marker lists can be included in the results display to select the genes that are closest to your most significant markers. We use examples to demonstrate that in just a couple of hours, AnnotQTL is able to identify all the genes located in regions identified by a full genome scan, with some highlighted based on both location and function, thus considerably increasing the chances of finding good candidate genes. AnnotQTL is available at

Génie: literature-based gene prioritization at multi genomic scale

Abstract: Biomedical literature is traditionally used as a way to inform scientists of the relevance of genes in relation to a research topic. However many genes, especially from poorly studied organisms, are not discussed in the literature. Moreover, a manual and comprehensive summarization of the literature attached to the genes of an organism is in general impossible due to the high number of genes and abstracts involved. We introduce the novel Génie algorithm that overcomes these problems by evaluating the literature attached to all genes in a genome and to their orthologs according to a selected topic. Génie showed high precision (up to 100%) and the best performance in comparison to other algorithms in most of the benchmarks, especially when high sensitivity was required. Moreover, the prioritization of zebrafish genes involved in heart development, using human and mouse orthologs, showed high enrichment in differentially expressed genes from microarray experiments. The Génie web server supports hundreds of species, millions of genes and offers novel functionalities. Common run times below a minute, even when analyzing the human genome with hundreds of thousands of literature records, allows the use of Génie in routine lab work. Availability:

Nucleic Acids Research: Web Server Issue

Wednesday, June 22, 2011

Steal This Blog!

I wanted to contribute any content and code I post here to the R Programming Wikibook so I made a slight change to the Creative Commons license on this blog. All the written content is now cc-by-sa and all the code here is still open source BSD. So feel free to wholesale copy, modify, share, or redistribute anything you find here, just include a link back to the site.

Tuesday, June 14, 2011

Displaying Regression Coefficients from Complex Analyses

Genome-wide association studies have produced a wealth of new genetic associations to numerous traits over the last few years. As such, new studies of these phenotypes often attempt to replicate previous associations in their samples, or examine how the effects of these SNPs are altered by environmental factors or clinical subtypes. This growing trend means that the results section must become multi-dimensional, accounting for all the ways by which samples were partitioned for analysis.

A great way to display regression coefficients is with forest plots, and an excellent example can be seen in the previous post.

There are various ways to produce these useful plots, but our colleague Sarah Pendergrass recently published a very nice tool for creating high-dimensional forest plots, like the one seen below.

This tool, called Synthesis-View provides several additional plot types that make the results of complex analyses much more accessible. You can read about the details here.

While Sarah designed this for the examination of multi-ethnic cohort studies, it could easily be adapted to simultaneously plot coefficients stratified by sex, age groups, or clinical subtypes.

Tuesday, June 7, 2011

Replication of >180 Genetic Associations with Self-Report Medical Data

DNA genotyping and sequencing are getting cheaper every day. As Oxford Nanopore CTO Clive Brown recently discussed at Genomes Unzipped, when the cost of a full DNA sequence begins to fall below $1000, the value of having that information far outweighs the cost of data generation.

Participant collection and ascertainment, however, isn't getting cheaper any time soon, spurring a burgeoning interest in using DNA biobanks and electronic medical records (EMRs) for genomics research (reviewed here). In fact, this is exactly the focus of the eMERGE network - a consortium of five sites having biobanks linked to electronic medical records for genetic research. The first order of business for the eMERGE network was assessment - can DNA biobanks + EMRs be used for genetic research? This question was answered with a demonstration project using Vanderbilt University's BioVU biobank+EMR. Here, 21 previously reported associations to five complex diseases were tested for association to electronically abstracted phenotypes from BioVU's EMR. This forest plot shows that for the 21 previously reported associations (red), five replicated at a nominal significance threshold, and for the rest, the calculated odds ratios (blue) trended in the same direction as the reported association.

While electronic phenotyping is much cheaper than ascertainment in the traditional sense, it can still be costly and labor intensive, involving interative cycles of manual medical record chart review followed by refinement of natural language processing algorithms. In many instances, self-report can be comparably accurate, and much easier to obtain (for example, compare the eMERGE network's hypothyroidism phenotyping algorithm with simply asking the question: "Have you ever been diagnosed with Hashmoto's Thyroiditis?").

This is the approach to genetic research 23andMe is taking. Joyce Tung presented some preliminary results at last year's ASHG conference, and now you can read the preprint of the research paper online at Nature Preceedings - "Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data". In this paper the authors amassed self-reported data from >20,000 genotyped 23andMe customers on 50 medical phenotypes and attempted to replicate known findings from the GWAS catalog. Using only self-reported data, the authors were able to replicate at a nominal significance level 75% of the previously reported associations that they were powered to detect. Figure 1 from the preprint is similar to the figure above, where blue X's are prior GWAS hits and black dots and lines represent their ORs and 95% CI's:

One might ask whether there is any confounding due to the fact that 23andMe customers can view trait/disease risks before answering questions (see this paper and this discussion by Luke Jostins). The authors investigated this and did not observe a consistent or significant effect of seeing genetic risk results before answering questions.  There's also a good discussion regarding SNPs that failed to replicate, and a general discussion of using self-report from a recontactable genotyped cohort for genetic studies.

Other than 23andMe's customer base, I can't think of any other genotyped, recontactable cohort that have incentive to fill out research surveys for this kind of study. But this team has shown here and in the past that this model for performing genetic research works and is substantially cheaper than traditional ascertainment or even mining electronic medical records. I look forward to seeing more research results from this group!

Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data

Monday, June 6, 2011

Agilent Integrated Biology Grant Program

Agilent Technologies is fostering integrated, whole-systems approaches to biological research through two $75,000 grants. The application deadline is August 12, 2011.

Funds will support academic or nonprofit research projects covering the development of open source Agilent-compatible software tools for integrating data from different omics platforms—genomics, transcriptomics, proteomics, and metabolomics. Click here for full details on eligibility, submission, and the review process.

Grant 1: Validating Protein Pathway Information—Integrating Proteomic Data with Transcriptomic or Metabolomic Data Sets. The purpose of this initiative is to support the development or improvement of advanced mass spectrometry informatics tools that drive integration of gene expression, metabolomic, and targeted proteomics data. Specifically, we are looking for proposals that focus on automation of targeted mass spectrometry-based proteomics experiments (e.g., SRM, exact mass) aimed at hypothesis testing or validation of protein pathways and/or interaction networks generated by integrating existing transcriptomic and/or metabolomic datasets.

Grant 2: Modeling Disease Progression—Combining Gene Expression and Copy Number Variation Data. The purpose of this initiative is to support the development of advanced microarray and next-generation sequencing-based informatics tools to drive the integration of gene expression and genomic copy number data. Specifically, we are looking for proposals that focus on the correlation of copy number events and whole transcriptome measurements aimed at hypothesis testing or validation of disease progression models.

Agilent Integrated Biology Grant Program

Wednesday, June 1, 2011

Resources for Pathway Analysis, Functional Annotation, and Interactome Reconstruction with Omics Data

I just read a helpful paper on pathway analysis and interactome reconstruction:

Tieri, P., Fuente, A. D., Termanini, A., & Franceschi, C. (2011). Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis. In Bioinformatics for Omics Data, Methods in Molecular Biology, vol. 719. doi: 10.1007/978-1-61779-027-0. (PubMed).

The authors give a description about how each of these tools can be used in pathway analysis and functional annotation, along with an example of using several of these resources for mapping the interactome for transcription factor NF-kappa-B.

While a more extensive list of hundreds of tools and databases for biological pathway analysis can be found at Pathguide, this looks like a good starting point.

APID Agile Protein Interaction DataAnalyzer –
Ariadne Genomics Pathway Studio –
BIND Biomolecular Interaction Network Database –
BioGRID The Biological General Repository for Interaction Datasets –
BiologicalNetworks –
CellDesigner –
DIP Database of Interacting Proteins –
GenMAPP Gene Map Annotator and Pathway Profiler –
HPRD Human Protein Reference Database –
HUBBA Hub objects analyzer –
Ingenuity Systems –
KEGG Kyoto Encyclopedia of Genes and Genomes –
MINT the Molecular INTeraction database –
NCI-Nature Pathway Interaction Database –
Pathguide: the pathway resource list –
Pathway Commons –
R Project for Statistical Computing –
SBW Systems Biology Workbench –
TRED Transcriptional Regulatory Element Database –

Integrating Omics data for signaling pathways, interactome reconstruction, and functional analysis.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.