Friday, April 10, 2015

Translational Bioinformatics Year In Review

Per tradition, Russ Altman gave his "Translational Bioinformatics: The Year in Review" presentation at the close of the AMIA Joint Summit on Translational Bioinformatics in San Francisco on March 26th.  This year, papers came from six key areas (and a final Odds and Ends category).  His full slide deck is available here.

I always enjoy this talk because it routinely points me to new collections of data and new software tools that are useful for a variety of analyses; as such, I thought I would highlight these resources from his talk this year.

GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database
Some of you may have accessed the Johnson and O'Donnell catalog of GWAS results published in 2009.  This data set was a more extensive collection of GWAS findings than the popular NHGRI GWAS catalog, as it did not impose a genome-wide significance threshold for reported associations.  The GRASP database is a similar effort, reporting numerous attributes of each study.
A zip archive of the full data set (a flat file) is available here.

Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome
This paper tackles the enormously complex task of diagnosing rare genetic diseases using a combination of genetic variants (from a VCF file), a list of phenotype characteristics (fed from the Human Phenotype Ontology), and a few other aspects of the disease.
The online tool called PhenIX is available here.

A network based method for analysis of lncRNA disease associations and prediction of lncRNAs implicated in diseases
Here, Yang et al. examine relationships between known long non-coding RNAs and disease using graph propagation.  Their underlying database, however, was generated using PubMed mining along with some manual curation.
Their lncRNA-Disease database is available here.

SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci
This tool performs a type of SNP set enrichment analysis, designed to look at functional enrichment in the context of specific tissues and cell types.  The tool is a C++ executable, available for download here.
The data sources underlying the SNPsea algorithm are available here.

Human symptoms-disease network
Here Zhou et al. systematically extract a symptom-disease network by exploiting MeSH annotations.  They compiled a list of 322 symptoms and 4,442 diseases from the MeSH vocabulary and documented their co-occurrence within PubMed.  Using this disease-symptom network, the authors explore the biological underpinnings of certain symptoms by looking at shared genomic elements between diseases with similar symptoms.
The full list of ~130,000 edges in their disease-symptom network is available here.

A circadian gene expression atlas in mammals: implications for biology and medicine
This fascinating paper explores temporal patterns of gene expression across 12 mouse organs.  By systematically collecting transcriptome data from these tissues at two-hour intervals, the authors construct a temporal atlas of gene expression and show that 43% of protein-coding genes have a circadian expression profile.
The accompanying CircaDB database is available online here.

dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
The authors of dRiskKB mine MEDLINE citations using a controlled disease vocabulary, in this case the Human Disease Ontology, to generate pairs of diseases that co-occur with specific patterns in abstract text. These pairs are ranked with a scoring algorithm and provide a new resource for disease co-morbidity relationships.
The flat file data driving dRiskKB can be found online here.

A tissue-based map of the human proteome
In this major effort, a group of investigators has published the most detailed atlas of human protein expression to date.  The transcriptome has been extensively studied across human tissues, but it remains unclear to what extent transcriptional activity reflects translation into protein.  Best of all, the data are searchable via a beautiful website.
The underlying data from the Human Protein Atlas is available here.

R User Group Recap: Heatmaps and Using the caret Package

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.
library(dplyr)  # for the pipe and select()
library(NMF)    # provides aheatmap()

iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2", 
         distfun = "pearson", treeheight=c(200, 50), 
         fontsize=13, cexCol=.7, 
         filename="heatmap.png", width=8, height=16)

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.
First, get a look at the data with the featurePlot function in the caret package:
library(caret)  # featurePlot() wraps lattice plotting functions

featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, we split the data into training and testing sets and used caret to automate training and testing of both random forest and partial least squares models with repeated 10-fold cross-validation (see the code; a minimal sketch also follows the results below). In this case random forest outperformed PLS, and performed fairly well overall:
                     setosa versicolor virginica
Sensitivity            1.00       1.00      0.00
Specificity            1.00       0.50      1.00
Pos Pred Value         1.00       0.50       NaN
Neg Pred Value         1.00       1.00      0.67
Prevalence             0.33       0.33      0.33
Detection Rate         0.33       0.33      0.00
Detection Prevalence   0.33       0.67      0.00
Balanced Accuracy      1.00       0.75      0.50
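For reference, here is a minimal sketch of the workflow described above. This is not the exact code from Mark's demo: the 80/20 split proportion, the random seed, the number of CV repeats, and the choice of "rf" and "pls" model methods are assumptions based on the description and results shown.
set.seed(42)

# Hold out 20% of the data for testing (split proportion is an assumption)
in_train = createDataPartition(iris$Species, p = 0.8, list = FALSE)
training = iris[in_train, ]
testing  = iris[-in_train, ]

# Repeated 10-fold cross-validation, as described above
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Train a random forest and a partial least squares model
fit_rf  = train(Species ~ ., data = training, method = "rf",  trControl = ctrl)
fit_pls = train(Species ~ ., data = training, method = "pls", trControl = ctrl)

# Compare resampling results, then evaluate the better model on the test set
resamps = resamples(list(rf = fit_rf, pls = fit_pls))
summary(resamps)
confusionMatrix(predict(fit_rf, testing), testing$Species)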
A big thanks to Mark and Steve at Hemoshear for putting this together!

Thursday, February 26, 2015

Using and Abusing Data Visualization: Anscombe's Quartet and Cheating Bonferroni

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
Let’s load and view the data. There’s a built-in dataset, but I munged the data into a tidy format and included it in an R package that I wrote primarily for myself.
# If you don't have Tmisc installed, first install devtools, then install
# Tmisc from GitHub:
#   install.packages('devtools')
#   devtools::install_github('stephenturner/Tmisc')
library(dplyr)
library(Tmisc)
data(quartet)
str(quartet)
## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...
set   x     y
I    10  8.04
I     8  6.95
I    13  7.58
II   10  9.14
II    8  8.14
II   13  8.74
III  10  7.46
III   8  6.77
III  13 12.74
IV    8  6.58
IV    8  5.76
IV    8  7.71
Now, let’s compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))
## Source: local data frame [4 x 6]
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817
Looks like each dataset has the same mean and standard deviation of both x and y, and the same correlation coefficient between x and y.
Now, let’s plot y versus x for each set with a linear regression trendline displayed on each plot:
library(ggplot2)
p = ggplot(quartet, aes(x, y)) + geom_point()
p = p + geom_smooth(method = lm, se = FALSE)
p = p + facet_wrap(~set)
p  # display the plot
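The fitted regression lines are nearly identical as well. As a quick check (an addition, not from the original post; it assumes the broom package is installed), the intercept and slope come out to roughly 3 and 0.5 for every set:
# Quick check: each set's fitted line is approximately y = 3 + 0.5x
library(broom)  # dplyr was loaded above
quartet %>%
  group_by(set) %>%
  do(tidy(lm(y ~ x, data = .)))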

This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.
With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.
Here’s a simple simulation showing why that doesn’t fly. In this example, I’m simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact the null hypothesis is true, because that’s what I’ve simulated. Now we can make a scatterplot of each predictor variable against our outcome and look at those plots.
ndset = 64
n = 100
d = data_frame(
  set = factor(rep(1:ndset, each = n)),
  x = rnorm(n * ndset),
  y = rep(rnorm(n), ndset))
d  # peek at the simulated data
## Source: local data frame [6,400 x 3]
##    set       x       y
## 1    1  1.3710  1.2546
## 2    1 -0.5647  0.0936
## 3    1  0.3631 -0.0678
## 4    1  0.6329  0.2846
## 5    1  0.4043  1.0350
## 6    1 -0.1061 -2.1364
## 7    1  1.5115 -1.5967
## 8    1 -0.0947  0.7663
## 9    1  2.0184  1.8043
## 10   1 -0.0627 -0.1122
## .. ...     ...     ...
ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)

Now, if I were to go through this data and compute the p-value for the linear regression of each x on y, I’d get a uniform distribution of p-values, my type I error rate would be where it should be, and my FDR- and Bonferroni-corrected p-values would almost all be 1 (a quick histogram check follows the results below). This is what we expect: remember, the null hypothesis is true.
results = d %>%
  group_by(set) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
  mutate(bon = p.adjust(p, method = "bonferroni")) %>%
  mutate(fdr = p.adjust(p, method = "fdr"))
## Source: local data frame [64 x 4]
##    set      p   bon   fdr
## 1    1 0.2738 1.000 0.749
## 2    2 0.2125 1.000 0.749
## 3    3 0.7650 1.000 0.900
## 4    4 0.2094 1.000 0.749
## 5    5 0.8073 1.000 0.900
## 6    6 0.0132 0.844 0.749
## 7    7 0.4277 1.000 0.820
## 8    8 0.7323 1.000 0.900
## 9    9 0.9323 1.000 0.932
## 10  10 0.1600 1.000 0.749
## .. ...    ...   ...   ...
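As a quick sanity check (an addition, not in the original post), a histogram of all 64 p-values looks roughly uniform:
# Under the null hypothesis, p-values should be roughly uniform on [0, 1]
hist(results$p, breaks = 10, main = "P-values under the null", xlab = "p")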

BUT, if I were to look at those plots above and cherry-pick which hypotheses to test based on how strong the correlation looks, my type I error rate will skyrocket. Looking at the plot above, it looks like the x variables 6, 28, 41, and 49 have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?
results %>% filter(set %in% c(6, 28, 41, 49))
## Source: local data frame [4 x 4]
##   set      p   bon   fdr
## 1   6 0.0132 0.844 0.749
## 2  28 0.0338 1.000 0.749
## 3  41 0.0624 1.000 0.749
## 4  49 0.0898 1.000 0.749
When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I'll say it again, the null hypothesis is true in this experiment, because I've simulated random data). In other words, my type I error rate is now completely out of control, with half of the cherry-picked tests falsely significant at the p<0.05 level. You'll notice that the Bonferroni and FDR-corrected p-values (correcting for all 64 tests) are still not significant.

The moral of the story here is to always look at your data, but don't "cheat" by choosing which statistical tests to perform based solely on that visualization exercise.

Wednesday, February 4, 2015

Microbial Genomics: the State of the Art in 2015

Current Opinion in Microbiology recently published a special issue on genomics. In an excellent editorial overview, “Genomics: The era of genomically-enabled microbiology”, Neil Hall and Jay Hinton summarize the state of the field in microbial genomics, highlight recent contributions, and provide a synopsis of each of the reviews in the issue. Their editorial goes into a little more depth, but here’s a rundown of the reviews in this special issue. There’s a lot of good stuff here!
Quantitative bacterial transcriptomics with RNA-seq (James Creecy and Tyrrell Conway) discusses RNA-seq in bacteria and how transcriptome analysis adds a wealth of annotation information to the genome.
One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly (Sergey Koren and Adam Phillippy) describes newer long-read sequencing technologies and their characteristics, discusses how microbial genomes can be easily and automatically finished using these methods for under $1,000, and discusses challenges for microbial and metagenome assembly.
Using comparative genomics to drive new discoveries in microbiology (Daniel Haft) describes progress using comparative genomics to make new discoveries, and takes the reader on a “bioinformatics journey” to describe a code-breaking exercise in comparative genomics that starts with weak hypotheses and uses genomics to fill in the biological picture.
Taking the pseudo out of pseudogenes (Ian Goodhead and Alistair Darby) reviews how pseudogenes are surprisingly prevalent, and discusses how problems with genome annotation can be addressed by combining multiple “omics” data.
Ten years of pan-genome analyses (George Vernikos et al.) describes how pan-genome analyses provide a framework for predicting and modeling genomic diversity, where the “core genome” of many bacterial species constitutes only a minority of genes.
Lateral gene transfers and the origins of the eukaryote proteome: a view from microbial parasites (Robert Hirt et al.) reviews the dynamic nature of lateral gene transfer, its role in microbial diversity, how it contributes to eukaryotic genomes, and how once again integrating different “omics” methodologies is needed to recognize the extent to which LGT affects eukaryotes.
The application of genomics to tracing bacterial pathogen transmission (Nicholas Croucher and Xavier Didelot) reviews how bacterial whole-genome sequencing gives you the ultimate resolution for investigating direct pathogen transmission, distinguishing transmission chains, and defining outbreaks. If you haven’t kept up with this quickly growing body of literature, this review is a great place to start catching up.
The impact of genomics on population genetics of parasitic diseases (Daniel Hupalo et al.) describes the influence of genomics on parasite population genetics and how burgeoning genomic data has enabled new types of investigations, and focuses on Plasmodium population genomics as a foundation for studies of neglected parasites.

Tuesday, February 3, 2015

R + ggplot2 Graph Catalog

Joanna Zhao’s and Jenny Bryan’s R graph catalog is meant to be a complement to the physical book, Creating More Effective Graphs, but it’s a really nice gallery in its own right. The catalog shows a series of different data visualizations, all made with R and ggplot2. Click on any of the plots and you get the R code necessary to generate the data and produce the plot.

You can use the panel on the left to filter by plot type, graphical elements, or the chapter of the book if you’re actually using it. All of the code and data used for this website is open-source, in this GitHub repository. Here's an example for plotting population demographic data by county that uses faceting to create small multiples:

this_base = "fig08-15_population-data-by-county"

my_data = data.frame(
  Race = c("White", "Latino", "Black", "Asian American", "All Others"),
  Bronx = c(194000, 645000, 415000, 38000, 40000),
  Kings = c(855000, 488000, 845000, 184000, 93000),
  New.York = c(703000, 418000, 233000, 143000, 39000),
  Queens = c(733000, 556000, 420000, 392000, 128000),
  Richmond = c(317000, 54000, 40000, 24000, 9000),
  Nassau = c(986000, 133000, 129000, 62000, 24000),
  Suffolk = c(1118000, 149000, 92000, 34000, 26000),
  Westchester = c(592000, 145000, 123000, 41000, 23000),
  Rockland = c(205000, 29000, 30000, 16000, 6000),
  Bergen = c(638000, 91000, 43000, 94000, 18000),
  Hudson = c(215000, 242000, 73000, 57000, 22000),
  Passiac = c(252000, 147000, 60000, 18000, 12000))

my_data_long = melt(my_data, id = "Race",
            = "county", = "population")

my_data_long$county = factor(
  my_data_long$county, c("New.York", "Queens", "Kings", "Bronx", "Nassau",
                         "Suffolk", "Hudson", "Bergen", "Westchester",
                         "Rockland", "Richmond", "Passiac"))

my_data_long$Race = factor(
  my_data_long$Race,
  levels = rev(c("White", "Latino", "Black", "Asian American", "All Others")))

p = ggplot(my_data_long, aes(x = population / 1000, y = Race)) +
  geom_point() +
  facet_wrap(~ county, ncol = 3) +
  scale_x_continuous(breaks = seq(0, 1000, 200),
                     labels = c(0, "", 400, "", 800, "")) +
  labs(x = "Population (thousands)", y = NULL) +
  ggtitle("Fig 8.15 Population Data by County") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "grey60"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.margin = unit(0, "lines"),
        plot.title = element_text(size = rel(1.1), face = "bold", vjust = 2),
        strip.background = element_rect(fill = "grey80"),
        axis.ticks.y = element_blank())


ggsave(paste0(this_base, ".png"),
       p, width = 6, height = 8)

Keep in mind not all of these visualizations are recommended. You’ll find pie charts, ugly grouped bar charts, and other plots for which I can’t think of any sensible name. Just because you can use the add_cat() function from Hilary Parker’s cats package to fetch a random cat picture from the internet and create an annotation_raster layer to add to your ggplot2 plot, doesn’t necessarily mean you should do such a thing for a publication-quality figure. But if you ever needed to know how, this R graph catalog can help you out.

this_base = "0002_add-background-with-cats-package"

## devtools::install_github("hilaryparker/cats")
## library(help = "cats")

p = ggplot(mpg, aes(cty, hwy)) +
  add_cat() +

ggsave(paste0(this_base, ".png"), p, width = 6, height = 5)

Tuesday, January 20, 2015

Microbiome Digest Blog

I have a noteworthy blogs tag on this blog that I had sort of forgotten about and haven't used in years. But I recently started reading one that definitely qualifies for the distinction.

The Microbiome Digest is written by Elisabeth Bik, a scientist studying the microbiome at Stanford. It's a near-daily compilation of papers and popular press articles mostly relating to microbiome research, split up into categories like the human microbiome, the non-human microbiome (soil, animal, plants, other environments), metagenomics and bioinformatics methods, reviews, news articles, and other general science or career advice articles.

I imagine Elisabeth spends hours each week culling the huge onslaught of literature into these highly relevant digests. I wish someone else would do the same for other areas I care about so I don't have to. I subscribe to the RSS feed and the email list so I never miss a post. If you're at all interested in metagenomics or microbiome research, I suggest you do the same!

Microbiome Digest

Wednesday, January 14, 2015

Using the microbenchmark package to compare the execution time of R expressions

I recently learned about the microbenchmark package while browsing through Hadley’s advanced R programming book. I’ve done some quick benchmarking using system.time() in a for loop and taking the average, but the microbenchmark function in the microbenchmark package makes this much easier. Hadley gives the example of taking the square root of a vector using the built-in sqrt function versus the mathematical equivalent of raising the vector to the power of 0.5.
library(microbenchmark)
x = runif(100)
microbenchmark(sqrt(x), x ^ 0.5)
By default, microbenchmark runs each argument 100 times to get an average look at how long each evaluation takes. Results:
Unit: nanoseconds
    expr  min     lq    mean median     uq   max neval
 sqrt(x)  825  860.5 1212.79  892.5  938.5 12905   100
   x^0.5 3015 3059.5 3776.81 3101.5 3208.0 15215   100
On average sqrt(x) takes 1212 nanoseconds, compared to 3776 for x^0.5. That is, the built-in sqrt function is about 3 times faster. (This was surprising to me. Anyone care to comment on why this is the case?)
Now, let’s try it on something just a little bigger. This is similar to a real-life application I faced where I wanted to compute summary statistics of some value grouping by levels of some other factor. In the example below we’ll use the nycflights13 package, which is a data package that has info on 336,776 outbound flights from NYC in 2013. I’m going to go ahead and load the dplyr package so things print nicely.
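The printout below comes from loading the packages mentioned above and printing the flights tbl (a minimal sketch; this step wasn't shown explicitly in the original post):
library(dplyr)
library(nycflights13)
flights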
Source: local data frame [336,776 x 16]

year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1  2013     1   1      517         2      830        11      UA  N14228
2  2013     1   1      533         4      850        20      UA  N24211
3  2013     1   1      542         2      923        33      AA  N619AA
4  2013     1   1      544        -1     1004       -18      B6  N804JB
5  2013     1   1      554        -6      812       -25      DL  N668DN
6  2013     1   1      554        -4      740        12      UA  N39463
7  2013     1   1      555        -5      913        19      B6  N516JB
8  2013     1   1      557        -3      709       -14      EV  N829AS
9  2013     1   1      557        -3      838        -8      B6  N593JB
10 2013     1   1      558        -2      753         8      AA  N3ALAA
..  ...   ... ...      ...       ...      ...       ...     ...     ...
Variables not shown: flight (int), origin (chr), dest (chr), air_time
(dbl), distance (dbl), hour (dbl), minute (dbl)
Let’s say we want to know the average arrival delay (arr_delay) broken down by each airline (carrier). There’s more than one way to do this.
Years ago I would have used the built-in aggregate function.
aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE)
This gives me the results I’m looking for:
   Group.1          x
1       9E  7.3796692
2       AA  0.3642909
3       AS -9.9308886
4       B6  9.4579733
5       DL  1.6443409
6       EV 15.7964311
7       F9 21.9207048
8       FL 20.1159055
9       HA -6.9152047
10      MQ 10.7747334
11      OO 11.9310345
12      UA  3.5580111
13      US  2.1295951
14      VX  1.7644644
15      WN  9.6491199
16      YV 15.5569853
Alternatively, you can use the sqldf package, which feels natural if you’re used to writing SQL queries.
sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier")
Not long ago I learned about the data.table package, which is good at doing these kinds of operations extremely fast.
library(data.table)
flightsDT = data.table(flights)
flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier]
Finally, there’s my new favorite, the dplyr package, which I covered recently.
flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE))
Each of these will give you the same result, but which one is faster? That’s where the microbenchmark package becomes handy. Here, I’m passing all four evaluations to the microbenchmark function, and I’m naming those “base”, “sqldf”, “datatable”, and “dplyr” so the output is easier to read.
mbm = microbenchmark(
  base = aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE),
  sqldf = sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier"),
  datatable = flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier],
  dplyr = flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)),
  times = 50)  # run each expression 50 times (neval = 50 in the output below)
Here’s the output:
Unit: milliseconds
      expr     min      lq    mean  median      uq     max neval
      base 1487.39 1521.12 1544.73 1539.96 1554.55 1676.25    50
     sqldf  867.14  880.34  892.24  887.88  897.28  982.91    50
 datatable    4.12    4.57    5.29    4.89    5.43   18.69    50
     dplyr   14.49   15.53   16.59   15.86   16.58   25.04    50
In this example, data.table was clearly the fastest on average. dplyr took ~3 times longer, sqldf took ~180x longer, and the base aggregate function took over 300 times longer. Let’s visualize those results using ggplot2 (microbenchmark has an autoplot method available, and note the log scale):
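A minimal sketch of that plot (the exact plotting code isn't shown in the post; autoplot() is the method microbenchmark provides for ggplot2):
library(ggplot2)
autoplot(mbm)  # plots the timing distributions; the time axis is log-scaled by default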

In this example data.table and dplyr were both relatively fast, with data.table being just a few milliseconds faster. Sometimes this will matter, other times it won’t. This is a matter of personal preference, but I personally find the data.table incantation not the least bit intuitive compared to dplyr. The way we pronounce flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)) is: “take flights, then group that data by the carrier variable, then summarize the data by taking the mean of arr_delay.” The dplyr syntax, for me, is much easier to use and extend to much more complex data management and analysis tasks, so I’ll sacrifice those few milliseconds of program run time for the minutes or hours of programmer debugging time saved. But if you’re planning on running a piece of code over, for instance, millions or more simulations, then those few milliseconds might be important to you. The microbenchmark package makes benchmarking easy for small pieces of code like this.
The code used for this analysis is consolidated here on GitHub.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.