Saturday, January 14, 2017

RStudio Conference 2017 Recap

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown.
This is not a complete list by any means — with split sessions I could only go to half the talks at most. Here are some noncomprehensive notes and links to slides and resources for some of the awesome things are doing with R and RStudio that I learned about at the RStudio Conference.

Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and the notion of splitting functions into commands that do something, as compared to queries that calculate something, and how it’s generally a good idea to keep these different functionalties contained in their own separate functions. (Contrast this to things like lm that both computes values and does things, like printing those values to the screen, making it difficult to capture (see broom).
I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals.
Charlotte Wickham’s 2-hour purrr tutorial was awesome. Here’s a link to a shared dropbox folder with code, challenges, slides, data, etc. The purrr package is a core package in the tidyverse, and I’ll be replacing many of the base ?apply and plyr ??ply functions that I still use here and there. The map_* functions are integral to working with nested list-columns in dplyr, and I think I’m finally starting to grok how to work with these.
Jenny Bryan gave a great talk on list columns. You can see her slides here. Jenny also put together this excellent tutorial with lots of worked examples and code snippets. And if you need some example list data structures for more practice or for teaching that aren’t foo/bar/iris/mtcars-level boring, see her repurrrsive package. Related to this, for more on list columns and purrr map functions, start reading at the “Many Models” section of Hadley’s R for Data Science book.
Julia Silge, data scientist at Stack Overflow, gave a great introduction to tidy text mining with R. You can read Julia and David’s Tidy Text Mining with R book here online (the book was authored in Rmarkdown using bookdown!).
Andrew Flowers, data journalist and former writer at FiveThirtyEight gave the second day’s keynote address on finding and telling stories using R. He gave a series of examples illustrating six motivating features that make data stories worth telling, along with potential danger inherent to each one:
  1. Novelty (potential danger: triviality)
  2. Outlier (spurious result; see also, p-hacking)
  3. Archetype (oversimplification)
  4. Trend (variance)
  5. Debunking (confirmation bias)
  6. Forecast (overfitting)
Yihui Xie led a two-hour tutorial on advanced RMarkdown. You can see his slides here. The rticles package has LaTeX Journal Article Templates for R Markdown for various journals. The tufte package now supports both PDF and HTML output. See an example here. Yihui’s xaringan package ports the remark.js library for slideshows into R. Careful. Yihui warns that you may not sleep after learning about how cool remark.js is. Yihui showed an early version of the in-development blogdown package that can build blog-aware static websites using the blazing-fast and well-documented Hugo static site generator. Finally, the bookdown package is just awesome. It takes multiple RMarkdown documents as input and renders into multiple output formats (screen-readable ebook, PDF, epub, etc.). It looks great for writing books and technical documentation with pushbutton publishing to multiple output formats with some nice built-in styles out of the box. Some examples:
Finally, a few gems from other talks that I jotted down:
  • Chester Ismay gave a great talk on teaching introductory statistics using R, with the open-source course textbook written in RMarkdown using bookdown.
  • Bob Rudis talked about using pipes (%>%), and pipes within pipes, and best piping practices. See his slides here.
  • Hilary Parker talked about the idea of an analysis development, (and analysis developers), drawing similarities to software development/developers. Hilary discussed this once before on the excellent podcast that she and Roger Peng host, and you can probably find it in their Conversations On Data Science ebook that summarize and transcribe these conversations.
  • Simon Jackson introduced corrr package for exploring and manipulating correlations and correlation matrices in a tidy way.
  • Gordon Shotwell introduced the easymake package that generates Makefiles from a data frame using R.
  • Karthik Ram quickly introduced several of the (many) rOpenSci packages related to data publication, data access, scientific literature access, scalable & reproducible computing, databases, visualization, taxonomy, geospatial analysis, and many utility tools for data analysis and manipulation.

With split sessions I missed more than half the talks. Lots of people here are active on Twitter, and you can catch many more notes and tidbits on the #rstudioconf hashtag. The meeting was superbly organized, I learned a ton, and I enjoyed meeting in person many of the folks I follow on Twitter and elsewhere online. A few days of 80-degree weather in mid-January didn’t hurt either. I’ll definitely be coming again next year. Kudos to the rstudio::conf organizers and speakers!
All the talks were recorded and will supposedly find their way to at some point soon. I’ll update this post with a link when that happens.

Monday, September 19, 2016

Primers in computational biology

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.


How does multiple testing correction work?

What is principal component analysis?

SNP imputation in association studies

How does gene expression clustering work?

What is a hidden Markov model?

What is a support vector machine?

What is the expectation maximization algorithm?

Where did the BLOSUM62 alignment score matrix come from?

What are DNA sequence motifs?

How does DNA sequence motif discovery work?

What are decision trees?

What is dynamic programming?

What is Bayesian statistics?

What are artificial neural networks?

How does eukaryotic gene prediction work?

How to map billions of short reads onto genomes

How to apply de Bruijn graphs to genome assembly

Thursday, June 30, 2016

Syntax Highlight Code in Keynote or Powerpoint

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.
  1. Install homebrew if you don’t have it already and brew install highlight.
  2. highlight -O rtf myfile.ext | pbcopy to highlight code to a formatted text converter in RTF output format, and copy the result to the system clipboard.
  3. Paste into Keynote or Powerpoint.
If I’ve got some code in a file called eset_pca.R:
I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.

Wednesday, June 1, 2016

Covcalc: Shiny App for Calculating Coverage Depth or Read Counts for Sequencing Experiments

How many reads do I need? What's my sequencing depth? These are common questions I get all the time. Calculating how much sequence data you need to hit a target depth of coverage, or the inverse, what's the coverage depth given a set amount of sequencing, are both easy to answer with some basic algebra. Given one or the other, plus the genome size and read length/configuration, you can calculate either. This was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.

Check out the app here:

And the source code on GitHub:

Give it your read length, whether you're using single- or paired-end sequencing, select a genome or enter your own. Then, select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or number of reads sequenced, and the output text below is automatically updated.

Shiny App: Coverage / Read Count Calculator

Friday, February 5, 2016

Shiny Developer Conference 2016 Recap

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.


  • reactivity is a beast
  • javascript isn’t cheating
  • there are already a ton of shiny features … and more on the way


For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:
output$plot = renderPlot() is not an imperative to the browser to do a what … it’s a recipe for how the browser should do something.
Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.
If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force directed network that outlines the connections between inputs and outputs.
Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.
Here’s a trivial example:

    ui = fluidPage(
         numericInput("threshold", "mpg threshold", value = 20),

    server = function(input, output) {

        output$size = renderPlot({

            dat = subset(mtcars, mpg > input$threshold)


        output$names = renderText({

            dat = subset(mtcars, mpg > input$threshold)


shinyApp(ui = ui, server = server)
The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.
The code below does the same thing but stores “dat” in a reactive that is only calculated once.

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)


    output$size = renderPlot({



    output$names = renderText({



shinyApp(ui = ui, server = server)


For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.
Below is code for a masterpiece that that does some hide() and show():

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        hidden(actionButton("restart", "restart the game")),

  server = function(input, output) {

        output$game_over =
                "game over, man ... game over"

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")

       observeEvent(input$restart, {


everything else


Adding a brush argument to plotOutput() let’s you click and drag to select a points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package - in this case the brush let’s you zoom to see name frequency over specific range of years.


ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),


server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))


    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "\n click-and-drag over the plot to 'zoom'",
                      y = ""))


    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "",
                      y = ""))


    output$info = renderText({

        "data source: social security administration names from babynames package
}) } shinyApp(ui, server)


A relatively easy way to leverage Shiny reactivity for visual inspection and interaction with data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # create vector of ids that match brushed points and data
            g = which(p$id %in% data$id)

            # return a subset of the original data without brushed points

    runGadget(ui, server)

# run to open plot viewer
# click and drag to brush
# press done return a subset of the original data without brushed points
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also use the same method above but pass the output into a dplyr pipe syntax
# without the selection what is the mean life expectancy by country?
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%


This solves the issue of requiring an input - I’m definitely going to use this so I don’t have to do the return(NULL) work around:
# no need to do do this any more
# inFile = input$file1
#         if (is.null(inFile))
#             return(NULL)

# use req() instead


Super helpful method for digging into the call stack of your R code to see how you might optimize it.
One or two seconds of processing can make a big difference, particularly for a Shiny app …

rstudio connect

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Friday, January 8, 2016

Repel overlapping text labels in ggplot2

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.
But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.

And the result (much better!):
See the ggrepel package vignette for more.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.