Monday, December 14, 2015

GRUPO: Shiny App For Benchmarking Pubmed Publication Output

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

The What

GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.

The How

The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.
The methodology is fairly simple:
  1. Build the search query in Pubmed syntax based on user input parameters.
  2. Extract total number of articles from results.
  3. Output a visualization of the total counts for both selected institutions.
  4. Extract unique article identifiers from results.
  5. Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.

Build Query

The syntax for searching Pubmed relies on MEDLINE tags and boolean operators. You can peek into how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.
GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term will have to be built multiple times (at least twice, to compare results for two institutions), I wrote a helper function called build_query():
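# use %Y/%m/%d (e.g. 1999/02/14) date format for startDate and endDate arguments
# (dates written with dashes, e.g. 1999-02-14, are converted to slashes below)

build_query = function(institution, startDate, endDate) {

    if (grepl("-", institution)==TRUE) {
        # e.g. "University of Texas-Austin": search each part as its own affiliation term
        split_name = strsplit(institution, split="-")
        search_term = paste(split_name[[1]][1], '[Affiliation]',
                             ' AND ',
                             split_name[[1]][2], '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate, '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    } else {
        search_term = paste(institution, '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate, '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    }

    return(search_term)
}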

The if/else logic in there accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a - symbol. In other words, if you passed in “University of Colorado/Boulder” you’d be stuck.
So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:
University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
And for University of Texas-Austin over the same dates you get the following term:
University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
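As a rough illustration of how the separator limitation mentioned above might be relaxed, here is a minimal sketch of a variant that splits on either "-" or "/". The name build_query_flexible() and its sep_pattern argument are hypothetical and are not part of GRUPO. The sketch also assumes the dates can be parsed by as.Date().

```r
# Hypothetical variant of build_query(); not part of GRUPO itself.
build_query_flexible <- function(institution, startDate, endDate,
                                 sep_pattern = "[-/]") {
  # Split the institution name on any accepted separator and trim whitespace
  parts <- trimws(strsplit(institution, split = sep_pattern)[[1]])

  # Tag every part as an affiliation term and join the parts with AND
  affiliation <- paste0(parts, "[Affiliation]", collapse = " AND ")

  # Format the dates separately, so separators in the name never touch them
  start <- format(as.Date(startDate), "%Y/%m/%d")
  end <- format(as.Date(endDate), "%Y/%m/%d")

  paste0(affiliation, " AND ", start, "[PDAT] : ", end, "[PDAT]")
}

build_query_flexible("University of Colorado/Boulder", "2014-01-01", "2015-01-01")
```

For names without a separator, or with a single "-", this sketch should produce the same terms as the two examples above. Whether a given institution actually uses "-" or "/" only as a locale separator is something you would need to check for each name.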
The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.
For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:
search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])
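To make the wiring concrete, here is a minimal single-file sketch of the widgets and the reactive element. The input ids (institution1, institution2, dates) come from the calls above. The labels, default values, the "terms" output and the single-file layout are assumptions for illustration; GRUPO itself splits this across ui.R and server.R. build_query() is restated compactly so the sketch runs on its own.

```r
library(shiny)

# Compact restatement of build_query() so this sketch is self-contained
build_query <- function(institution, startDate, endDate) {
  parts <- strsplit(institution, split = "-")[[1]]
  affiliation <- paste0(parts, "[Affiliation]", collapse = " AND ")
  term <- paste0(affiliation, " AND ", startDate, "[PDAT] : ", endDate, "[PDAT]")
  gsub("-", "/", term)  # turn 2014-01-01 into 2014/01/01
}

# Two text inputs and one date range picker (labels and defaults are assumed)
ui <- fluidPage(
  textInput("institution1", "First institution", value = "University of Virginia"),
  textInput("institution2", "Second institution", value = "University of Texas-Austin"),
  dateRangeInput("dates", "Publication date range",
                 start = "2014-01-01", end = "2015-01-01"),
  verbatimTextOutput("terms")
)

server <- function(input, output) {
  # Rebuilds both search terms whenever any of the three inputs changes
  search_terms <- reactive({
    list(
      first = build_query(input$institution1, input$dates[1], input$dates[2]),
      second = build_query(input$institution2, input$dates[1], input$dates[2])
    )
  })

  # Show the generated terms so the query can be inspected in the browser
  output$terms <- renderText({
    paste(search_terms()$first, search_terms()$second, sep = "\n")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the sketch locally
```

Because dateRangeInput() returns a length-two vector of dates, input$dates[1] and input$dates[2] map directly onto the startDate and endDate arguments.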

Run Query

With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:
  • ids (unique Pubmed identifiers for each article in the result list)
  • count (total number of results)
  • retmax (maximum number of results that could have been returned)
  • file (the actual XML record containing the values above)
The following code returns total articles for each of two different searches:
affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)

total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)

Plot Results

The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.
We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.
With a data frame in hand, we can load it into ggplot2 and draw a very simple bar plot.
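A minimal sketch of that step might look like the following. The column names follow the description above, but the counts are placeholder values (not real Pubmed results), and the object names and styling choices are assumptions rather than GRUPO's actual plotting code.

```r
library(ggplot2)

institutions <- c("University of Virginia", "University of Texas-Austin")
totals <- c(1200, 1500)  # placeholder values, NOT real Pubmed counts

# The tiny 2 x 2 data frame: one row per institution
plot_data <- data.frame(Institution = institutions, Total.Articles = totals)

# One bar per institution, with bar height equal to the article total
benchmark_plot <- ggplot(plot_data,
                         aes(x = Institution, y = Total.Articles, fill = Institution)) +
  geom_col() +
  labs(x = NULL, y = "Total articles") +
  theme_minimal() +
  theme(legend.position = "none")
```

In a Shiny app, a plot object like this would typically be returned from renderPlot() in server.R, with the totals coming from the search results rather than hard-coded numbers.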

Output Collaborations

Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.
The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.
To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:
collaboration_count = length(intersect(affiliation_search$ids, affiliation_search2$ids))
By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.
NB: This approach to counting collaborations is unreliable for articles published before October 2013, which is when the National Library of Medicine (NLM) began including affiliation tags for all authors.
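The overlap calculation can be tried without hitting the API by using mock search results. In the sketch below, count_collaborations() is a hypothetical helper, and the ids are made-up placeholders rather than real Pubmed identifiers. It also warns when a result holds fewer ids than its count, since intersect() only sees the ids that were actually returned, and the overlap is then a lower bound.

```r
# Hypothetical helper; not part of GRUPO or rentrez.
count_collaborations <- function(search_a, search_b) {
  # Warn if either result was truncated (fewer ids returned than matched)
  for (s in list(search_a, search_b)) {
    if (length(s$ids) < as.numeric(s$count)) {
      warning("Fewer ids returned than the total count; overlap may be undercounted.")
    }
  }
  # Shared identifiers are treated as co-authored publications
  length(intersect(search_a$ids, search_b$ids))
}

# Mock results shaped like the "ids" and "count" elements described above
mock_search1 <- list(ids = c("1001", "1002", "1003", "1004"), count = 4)
mock_search2 <- list(ids = c("1003", "1004", "1005"), count = 3)

count_collaborations(mock_search1, mock_search2)  # 2
```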

The Next Steps

What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.
For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.
And neither is any information about the institutions. While all of them are considered to have very high research activity, one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.
So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.

1 comment:

  1. Great effort! One question about the choices of Universities: will you be adding University of California at San Francisco anytime soon? I know they are a powerhouse institution for biomedical research.




Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.