Wednesday, August 20, 2014

Do your "data janitor work" like a boss with dplyr

Data “janitor-work”

The New York Times recently ran a piece on wrangling and cleaning data:
Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scrubbing, tidying, or something else, the article above is worth a read (even though it implicitly denigrates the important work that your housekeeping staff does). It’s one of the few “Big Data” pieces that truly appreciates what we spend so much time doing every day as data scientists/janitors/sherpas/cowboys/etc. The article was chock-full of quotable tidbits on the nature of data munging as part of the analytical workflow:
It’s an absolute myth that you can send an algorithm over raw data and have insights pop up…
Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.
As data analysis experts we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation, it’s data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part — if I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place.
To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.

dplyr

Unless you’ve had your head buried in the sand of the data analysis desert for the last few years, you’ve definitely encountered a number of tools in the “Hadleyverse.” These are R packages created by Hadley Wickham and friends that make things like data visualization (ggplot2), data management and split-apply-combine analysis (plyr, reshape2), and R package creation and documentation (devtools, roxygen2) much easier.
The dplyr package introduces a few simple functions, and integrates functionality from the magrittr package that will fundamentally change the way you write R code.
I’m not going to give you a full tutorial on dplyr because the vignette should take you 15 minutes to go through, and will do a much better job of introducing this “grammar of data manipulation” than I could do here. But I’ll try to hit a few higlights.

dplyr verbs

First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation:
  • filter() filters rows from the data frame by some criterion
  • arrange() arranges rows ascending or descending based on the value(s) of one or more columns
  • select() allows you to select one or more columns
  • mutate() allows you to add new columns to a data frame that are transformations of other columns
  • group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of all the values of $y separately for each level of factor variable $group.
Individually, none of these add much to your R arsenal that wasn’t already baked into the language in some form or another. But the real power comes from chaining these commands together, with the %>% operator.

The %>% operator: “then”

This dataset and example was taken directly from, and is described more verbosely in the vignette. The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:
  1. Use the hflights dataset
  2. group_by the Year, Month, and Day
  3. select out only the Day, the arrival delay, and the departure delay variables
  4. Use summarize to calculate the mean of the arrival and departure delays
  5. filter the resulting dataset where the arrival delay or the departure delay is more than 30 minutes.
Here’s an example of how you might have done this previously:
filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
Notice that the order that we write the code in this example is inside out - we describe our problem as: use hflights, then group_by, then select, then summarize, then filter, but traditionally in R we write the code inside-out by nesting functions 4-deep:
filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y) so you can use it to rewrite multiple operations so you can read from left-to-right, top-to-bottom:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Writing the code this way actually follows the order we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% to only dplyr functions. You can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the input to ggplot is always a data.frame, we can munge around a dataset then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot.
library(dplyr)
library(ggplot2)
iris %>%
  group_by(Species) %>%
  summarize(meanSepLength=mean(Sepal.Length)) %>%
  ggplot(aes(Species, meanSepLength)) + geom_bar(stat="identity")
Once you start using %>% you’ll wonder to yourself why this isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and making your code more readable. There's a lot more dplyr can do with databases that I didn't even mention, and if you're interested, you should see the other vignettes on the CRAN package page.
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.

4 comments:

  1. I neglected to mention in the post, but `summarize()` and `summarise()` are synonymous: both spellings do the same thing.

    ReplyDelete
  2. Great post, thanks for this! I love most of the tools in the "Hadley-verse" but was not aware of this one yet. Love the unix-style piping of %>%.

    One problem that I've found with the earlier plyr package was that its performance didn't always scale well with large datasets, multiple groupings, or other complex queries. Instead, I migrated to using the "data.table" package, which can do similar SQL-like operations on (pre-cleaned) data frames. It implements all of the features you describe, minus piping to other functions, but with a performance that is lightning fast even on big tables. After getting over the learning curve, I much prefer the syntax of data.table both for reading and writing R code.

    For example, here is your code snippet implemented with data.table:

    hflights[, list(arr = mean(ArrDelay, na.rm = TRUE), # select and summarize
    dep = mean(DepDelay, na.rm = TRUE)),
    by=list(Year, Month, DayOfMonth) # group by
    ][arr > 30 | dep > 30] # filter via chaining

    Here the verbs are implicit based on their position in the brackets. A filter goes in the first slot, a select or summarize (implemented as a list of expressions) in the second, and grouping or other arguments go in the third slot.

    I look forward to trying out dplyr and comparing to data.table, but I highly recommend spending a little time with this package. Like you say about dplyr, data.table fundamentally changed the way I write R code.

    ReplyDelete
    Replies
    1. I messed around with data.table a while back but found the documentation/vignette lacking. One thing to note - dplyr works with data.tables as well, and I believe one of the vignettes talks about where data.table operations can be faster and others where dplyr excels.

      Delete
    2. Here's an article from May which gives a nice summary :
      http://www.marketingdistillery.com/2014/05/05/working-with-large-data-sets-in-r-the-data-table-module/

      Delete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.