Monday, September 19, 2016

Primers in computational biology

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.


How does multiple testing correction work?

What is principal component analysis?

SNP imputation in association studies

How does gene expression clustering work?

What is a hidden Markov model?

What is a support vector machine?

What is the expectation maximization algorithm?

Where did the BLOSUM62 alignment score matrix come from?

What are DNA sequence motifs?

How does DNA sequence motif discovery work?

What are decision trees?

What is dynamic programming?

What is Bayesian statistics?

What are artificial neural networks?

How does eukaryotic gene prediction work?

How to map billions of short reads onto genomes

How to apply de Bruijn graphs to genome assembly

Thursday, June 30, 2016

Syntax Highlight Code in Keynote or Powerpoint

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.
  1. Install homebrew if you don’t have it already and brew install highlight.
  2. highlight -O rtf myfile.ext | pbcopy to highlight code to a formatted text converter in RTF output format, and copy the result to the system clipboard.
  3. Paste into Keynote or Powerpoint.
If I’ve got some code in a file called eset_pca.R:
I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.

Wednesday, June 1, 2016

Covcalc: Shiny App for Calculating Coverage Depth or Read Counts for Sequencing Experiments

How many reads do I need? What's my sequencing depth? These are common questions I get all the time. Calculating how much sequence data you need to hit a target depth of coverage, or the inverse, what's the coverage depth given a set amount of sequencing, are both easy to answer with some basic algebra. Given one or the other, plus the genome size and read length/configuration, you can calculate either. This was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.

Check out the app here:

And the source code on GitHub:

Give it your read length, whether you're using single- or paired-end sequencing, select a genome or enter your own. Then, select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or number of reads sequenced, and the output text below is automatically updated.

Shiny App: Coverage / Read Count Calculator

Friday, February 5, 2016

Shiny Developer Conference 2016 Recap

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.


  • reactivity is a beast
  • javascript isn’t cheating
  • there are already a ton of shiny features … and more on the way


For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:
output$plot = renderPlot() is not an imperative to the browser to do a what … it’s a recipe for how the browser should do something.
Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.
If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force directed network that outlines the connections between inputs and outputs.
Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.
Here’s a trivial example:

    ui = fluidPage(
         numericInput("threshold", "mpg threshold", value = 20),

    server = function(input, output) {

        output$size = renderPlot({

            dat = subset(mtcars, mpg > input$threshold)


        output$names = renderText({

            dat = subset(mtcars, mpg > input$threshold)


shinyApp(ui = ui, server = server)
The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.
The code below does the same thing but stores “dat” in a reactive that is only calculated once.

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)


    output$size = renderPlot({



    output$names = renderText({



shinyApp(ui = ui, server = server)


For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.
Below is code for a masterpiece that that does some hide() and show():

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        hidden(actionButton("restart", "restart the game")),

  server = function(input, output) {

        output$game_over =
                "game over, man ... game over"

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")

       observeEvent(input$restart, {


everything else


Adding a brush argument to plotOutput() let’s you click and drag to select a points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package - in this case the brush let’s you zoom to see name frequency over specific range of years.


ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),


server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))


    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "\n click-and-drag over the plot to 'zoom'",
                      y = ""))


    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            labs(list(title ="% of individuals born with name by year and gender",
                      x = "",
                      y = ""))


    output$info = renderText({

        "data source: social security administration names from babynames package
}) } shinyApp(ui, server)


A relatively easy way to leverage Shiny reactivity for visual inspection and interaction with data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # create vector of ids that match brushed points and data
            g = which(p$id %in% data$id)

            # return a subset of the original data without brushed points

    runGadget(ui, server)

# run to open plot viewer
# click and drag to brush
# press done return a subset of the original data without brushed points
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also use the same method above but pass the output into a dplyr pipe syntax
# without the selection what is the mean life expectancy by country?
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%


This solves the issue of requiring an input - I’m definitely going to use this so I don’t have to do the return(NULL) work around:
# no need to do do this any more
# inFile = input$file1
#         if (is.null(inFile))
#             return(NULL)

# use req() instead


Super helpful method for digging into the call stack of your R code to see how you might optimize it.
One or two seconds of processing can make a big difference, particularly for a Shiny app …

rstudio connect

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Friday, January 8, 2016

Repel overlapping text labels in ggplot2

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.
But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.

And the result (much better!):
See the ggrepel package vignette for more.

Monday, December 14, 2015

GRUPO: Shiny App For Benchmarking Pubmed Publication Output

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

The What

GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.

The How

The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.
The methodology is fairly simple:
  1. Build the search query in Pubmed syntax based on user input parameters.
  2. Extract total number of articles from results.
  3. Output a visualization of the total counts for both selected institutions.
  4. Extract unique article identifiers from results.
  5. Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.

Build Query

The syntax for the searching Pubmed relies on MEDLINE tags and boolean operators. You can peek into how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.
GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term will have to be built multiple times (at least twice to compare results for two institutions) I wrote a helper function called build_query():
# use %y/%m/%d (e.g. 1999/02/14) date format for startDate and endDate arguments

build_query = function(institution, startDate, endDate) {

    if (grepl("-", institution)==TRUE) {                
        split_name = strsplit(institution, split="-")
        search_term = paste(split_name[[1]][1], '[Affiliation]',
                             ' AND ',
                             ' AND ',
                             '[PDAT] : ',
        search_term = gsub("-","/",search_term)
    } else {
        search_term = paste(institution, 
                             ' AND ',
                             '[PDAT] : ',
        search_term = gsub("-","/",search_term)

The if/else logic in there accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a - symbol. In other words, if you passed in “University of Colorado/Boulder” you’d be stuck.
So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:
University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
And for University of Texas-Austin over the same dates you get the following term:
University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.
For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:
search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])
### Run Query
With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:
  • ids (unique Pubmed identifiers for each article in the result list)
  • count (total number of results)
  • retmax (maximum number of results that could have been returned)
  • file (the actual XML record containing the values above)
The following code returns total articles for each of two different searches:
affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)

total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)

Plot Results

The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.
We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.
With a data frame in hand, we can load it into ggplot2 and do some very simple barplotting:

Output Collaborations

Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.
The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.
To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:
collaboration_count = length(intersect(affiliation_search$ids,affiliation_search2$ids)
By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.
NB This approach to assessing collaboration counts is spurious when considering articles published before October 2013, which was when the National Library of Medicine (NLM) began including affiliation tags for all authors.

The Next Steps

What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.
For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.
And neither is any information about the institutions. While all of them are considered to have very high research activity one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.
So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.