Thursday, December 16, 2010

Epistasis in New Places

Coming from the lineage of Jason Moore, I am obliged to occasionally remind everyone that biological systems are inherently complex, and to some degree, we should therefore expect statistical models involving those systems to be complex as well.

With the development of GWAS, many approaches to examine epistasis are weighed down by the computational burden of exhaustively conducting billions of statistical tests. With this in mind, several bioinformatics approaches (such as Biofilter and INTERSNP) have focused on looking for gene-gene interactions within biological pathways, ontologies, or protein-protein interaction networks. The assumption underlying these methods is that interactions occur between variants of two different genes – what you could call trans-epistasis.

Considering the epic complexity of the transcription process, the genetics of gene expression seems just as likely to harbor epistasis as biological pathways. Following the excellent work of Barbara Stranger, Jonathan Pritchard, and various other luminaries in this area, Stephen Turner and I examined HapMap genotypes and gene expression levels from corresponding cell lines to look for cis-epistasis.

We found 79 genes where SNP pairs in the gene's regulatory region can interact to influence the gene's expression. What is perhaps most interesting is that there are often large distances between the two interacting SNPs (with minimal LD between them), meaning that most haplotype and sliding window approaches would miss these effects. The full text is available online: "Multivariate analysis of regulatory SNPs: empowering personal genomics by considering cis-epistasis and heterogeneity."

Wednesday, December 15, 2010

Which Reference Management Software do you use? (Reader Poll)

When I started grad school I started using Reference Manager (RefMan), similar to EndNote, to manage my references and bibliographies. It's a real pain, and I often feel like I'm powering my computer with the endless pumping and clicking of the mouse that it takes to import a reference into my library.

Recently I've started using Zotero because of how easy it is to import references, store PDFs, and sync between computers. It also integrates with MS Word and allows you to insert citations and format a bibliography using any of EndNote's styles. And it's free.

Before I make the switch and leave RefMan for good, I would love to see what everyone else here uses to manage references. I know many of you use social bookmarking sites like CiteULike, FriendFeed, and others to save and share literature, but I'm really interested to see what software you use while writing to manage references and format bibliographies, and how satisfied you are with what you use.

Thanks for responding! Check back in a few days and I'll summarize what you all said.

Tuesday, December 14, 2010

Sync your Zotero Library with Dropbox using WebDAV

About a year ago I wrote a post about Dropbox - a free, awesome, cross-platform utility that syncs files across multiple computers and securely backs up your files online. Dropbox is indispensable in my own workflow. I store all my R code, perl scripts, and working manuscripts in my Dropbox. You can also share folders on your computer with other Dropbox users, which makes coauthoring a paper and sharing manuscript files a trivial task. If you're not using it yet, start now.

I've also been using Zotero for some time now to manage my references. What's nice about Zotero over RefMan, EndNote, and others is that it runs inside Firefox, and when you're on a PubMed or journal website, you can save a reference and the PDF to your Zotero library with a single click. Zotero also interfaces with both MS Word and OO.o, and uses all the standard EndNote styles for formatting bibliographies.

You can also sync your Zotero library, including all your references, snapshots of the HTML version of all your articles, and all the PDFs using the Zotero servers. This syncs your library to every other computer you're using. This is nice when you're away from the office and need to look at a paper, but you're not on your institution's LAN and journal articles are paywalled. The problem with Zotero is its low storage limit - you only get 100MB of storage space for free. If you have more papers or references to sync, you have to pay for it.

That's if you use Zotero's servers. You can also sync your library using your own WebDAV server. Go into Zotero's preferences and you'll see this under the sync pane.

Here's where Dropbox comes in handy. You get 2GB for free when you sign up for Dropbox, and you can add tons more space by referring others, filling out surveys, viewing the help pages, etc. I've bumped my free account up to 19GB. Dropbox doesn't support WebDAV by itself, but a 3rd party service, DropDAV, allows you to do this. Just give DropDAV your Dropbox credentials, and you have your own WebDAV server backed by your Dropbox. Now simply point Zotero to sync with your DropDAV server rather than Zotero's servers, and you can sync gigabytes of references and PDFs using your Dropbox.

Why not simply move the location of your Zotero library to a folder in your Dropbox and forget syncing altogether? I did that for a while, but as long as Firefox is open, Zotero holds your library files open, which means they don't sync properly. If you have instances of Firefox open on more than one machine, you're going to run into trouble. Syncing with DropDAV only touches your Dropbox during a Zotero sync operation.

What you'll need:

1. Dropbox. Sign up for a free 2GB Dropbox account. If you use this special referral link, you'll get an extra 250MB for free. Create a folder in your Dropbox called "zotero."

2. DropDAV. Log in here with your Dropbox credentials and you'll have DropDAV up and running.

3. Firefox + Zotero. First, start using Firefox if you haven't already, then install the Zotero extension.

4. Connect Zotero to DropDAV. Go into Zotero's preferences, sync panel. See the screenshot above to set your Zotero library to sync to your Dropbox via WebDAV using DropDAV.

You're done! Now, go out and start saving/syncing gigabytes of papers!

Tuesday, December 7, 2010

Webinar on Revolution R Enterprise

R evangelist David Smith, marketing VP at Revolution Analytics, will be giving a webinar showing off some of the finer features of Revolution R Enterprise - an integrated development environment (IDE) for R that has an enhanced script editor with syntax highlighting, function completion, syntax checking, mouseover help, R code snippets for common tasks, an object browser, a real debugger, and more. Revolution R Enterprise is free for academics. The webinar is tomorrow (Wednesday, December 8) at 9am Pacific time (11am CST), and you can register here.

I've been happy using NppToR - a utility that adds syntax highlighting, code folding, and a hotkey to send lines of R code from Notepad++ (hands down the best text editor for Windows) to the R console. You can read more about NppToR on page 62 of the June issue of the R journal. But it looks like the Revolution R Enterprise IDE has much more to offer. Here's an example of the debugger with breakpoints set.

Webinar - Revolution R Enterprise - 100% R and More

Monday, December 6, 2010

Using the "Divide by 4 Rule" to Interpret Logistic Regression Coefficients

I was recently reading a bit about logistic regression in Gelman and Hill's book on hierarchical/multilevel modeling when I first learned about the "divide by 4 rule" for quickly interpreting coefficients in a logistic regression model in terms of the predicted probabilities of the outcome. The idea is pretty simple. The logistic curve (predicted probabilities) is steepest at its center, where a+ßx=0 and logit-1(a+ßx)=0.5. See the plot below (or use the R code to plot it yourself).

The slope of this curve (the first derivative of the logistic curve) is maximized at a+ßx=0, where it takes the value ße^(a+ßx)/(1+e^(a+ßx))² = ß/4.
So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in the probability of the outcome y=1 per unit increase in x. This approximation is best at the midpoint of x, where predicted probabilities are close to 0.5, which is where most of the data will lie anyhow.

So if your regression coefficient is 0.8, a rough approximation using the ß/4 rule is that a 1 unit increase in x results in at most about a 0.8/4 = 0.2 increase (20 percentage points) in the probability of y=1.
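To see where the 4 comes from, and to check the ß = 0.8 example above, write out the derivative of the inverse logit; the arithmetic at the end is just the exact calculation for comparison:

```latex
% Slope of the inverse logit at any point:
\frac{d}{dx}\,\mathrm{logit}^{-1}(a+\beta x)
  = \frac{\beta\, e^{a+\beta x}}{\left(1+e^{a+\beta x}\right)^{2}}

% At the midpoint a + \beta x = 0 this is maximized:
\frac{\beta\, e^{0}}{\left(1+e^{0}\right)^{2}} = \frac{\beta}{4}

% Exact change for \beta = 0.8 starting from p = 0.5:
\mathrm{logit}^{-1}(0.8) - 0.5
  = \frac{e^{0.8}}{1+e^{0.8}} - 0.5 \approx 0.690 - 0.5 = 0.19
% versus the divide-by-4 approximation 0.8/4 = 0.20.
```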

Tuesday, November 30, 2010

Abstract Art with PubMed2Wordle

While preparing for my upcoming defense, I found a cool little web app called pubmed2wordle that turns a PubMed query into a word cloud using text from the titles and abstracts returned by the query. Here are the results for a PubMed query for me ("turner sd AND vanderbilt"):

And quite different results for where I'm planning to do my postdoc:

Looks useful to quickly get a sense of what other people work on.

Monday, November 29, 2010


On Friday, December 3rd, at 8:00 AM, after copious amounts of coffee, my friend, colleague, and perpetual workout buddy Stephen Turner will defend his thesis.
Knowledge-Driven Genome-wide Analysis of Multigenic Interactions Impacting HDL Cholesterol Level

Join us in room 206 of the Preston Research Building at Vanderbilt Medical Center for the auspicious occasion!

Wednesday, November 24, 2010

How to Not Suck at Powerpoint, and the Secrets of Malcolm Gladwell

Designer Jesse Dee has an entertaining presentation on Slideshare about how to use Powerpoint effectively (although Edward Tufte may assert that such a thing is impossible). These are all things we probably know, but just don't take into consideration enough when we're giving a presentation.

According to Dee, the number one most common mistake is lack of preparation. A survey taken by a company that specializes in presentation skills coaching found that 86% of executives say presentation skills affect their career and income, yet only 25% spend more than two hours preparing for "high-stakes" presentations.

Malcolm Gladwell, author of Blink, Tipping Point, Outliers, and an excellent article on drug development in the New Yorker ("The Treatment: Why is it so Difficult to Develop Drugs for Cancer?"), is known for delivering a seemingly effortless presentation, ending at exactly the right time, without ever looking at his watch (see his TED Talk on spaghetti sauce). When Financial Times writer Gideon Rachman asked how he does it, Gladwell responded, "I know it may not look like this. But it's all scripted. I write down every word and then I learn it off by heart. I do that with all my talks and I've got lots of them."

I've had lots of folks tell me the best way to give a talk is to throw up a few main points and wing it for an hour, but perhaps rote memorization is a more attractive alternative. Then again, in our line of work, where we show lots of data, tables, figures, and statistics, it's already easy enough to bore your audience, and delivering a memorized speech might make this worse. I tend to prefer something in between complete improv and autopilot. What are your favorite tips for presenting scientific, quantitative, or statistical data and results?

You Suck at Powerpoint (Slideshare, via @lifehacker)

Financial Times - Secrets of Malcolm Gladwell

Malcolm Gladwell - The Treatment: Why is it so Difficult to Develop Drugs for Cancer?

Tuesday, November 23, 2010

Randomly Select Subsets of Individuals from a Binary Pedigree .fam File

I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. For the model estimation phase you only need ~200 individuals. Here's a one-line Unix command that will pull out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:

for i in `cut -d ' ' -f 1-2 myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort | cut -d ' ' -f 2 | sed s/,/\ /g | head -n 200

Redirect this output to a file, and then run PLINK using the --keep option with this new file.
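If your system has GNU coreutils, `shuf` does the same job with much less plumbing. A minimal sketch (the toy .fam file generated below is only there to make the example self-contained; with real data, skip that step and use your own file):

```shell
# Build a toy 1000-sample .fam file for illustration only;
# with real data, use your existing myfamfile.fam instead.
for i in $(seq 1 1000); do echo "FAM$i IND$i 0 0 1 -9"; done > myfamfile.fam

# Randomly sample 200 lines, keeping the FID/IID columns for plink --keep
shuf -n 200 myfamfile.fam | awk '{print $1,$2}' > keep.txt
```

Then pass keep.txt to PLINK's --keep option as before.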

Wednesday, November 17, 2010

Syntax Highlighting R Code, Revisited

A few months ago I showed you how to syntax-highlight R code using GitHub Gists for displaying R code on your blog or other online medium. The idea's really simple if you use Blogger - head over to GitHub, paste in your R code, create a public "gist", hit "embed", then copy the javascript onto your blog. However, if you use a hosted blog or other blogging platform that doesn't allow javascript tags within a post, you can't use this method.

While I still prefer using GitHub Gists for archiving and version control, there's an alternative that works where javascript doesn't. Inside-R has a nice tool - the Pretty R Syntax Highlighter - where you simply paste in your R code, and it generates HTML that syntax-highlights your R code. What's more, functions in your R code link back to their documentation on the Inside-R site. Here's an example of some code I posted a while back on making QQ plots of p-values using R base graphics.

Without any highlighting, it's hard to read, and spacing isn't faithfully preserved:

# Define the function
ggd.qqplot = function(pvector, main=NULL, ...) {
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    plot(e,o,pch=19,cex=1, main=main, ...,
        xlim=c(0,max(e)), ylim=c(0,max(o)))
}

# Generate some fake data that deviates from the null
# (illustrative only; pvalues can be any numeric vector of p-values)
pvalues = c(runif(9000), runif(1000)^3)

# Using the ggd.qqplot() function
ggd.qqplot(pvalues)

# Add a title
ggd.qqplot(pvalues, "QQ-plot of p-values using ggd.qqplot")

Here's the same code embedded using GitHub Gists:

And the same code using's Pretty R Syntax Highlighter. Note that function calls are hyperlinks to the function's documentation on inside-R.

# Originally posted at
# Define the function
ggd.qqplot = function(pvector, main=NULL, ...) {
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    plot(e,o,pch=19,cex=1, main=main, ...,
        xlim=c(0,max(e)), ylim=c(0,max(o)))
}
# Generate some fake data that deviates from the null
# (illustrative only; pvalues can be any numeric vector of p-values)
pvalues = c(runif(9000), runif(1000)^3)
# Using the ggd.qqplot() function
ggd.qqplot(pvalues)
# Add a title
ggd.qqplot(pvalues, "QQ-plot of p-values using ggd.qqplot")

Have any other solutions for syntax highlighting R code? Please share in the comments!

Github Gists
Inside-R Pretty R Syntax Highlighter

Tuesday, November 16, 2010

Parallelize IBD estimation with PLINK

Obtaining the probability that zero, one, or two alleles are shared identical by descent (IBD) is useful for many reasons in a GWAS analysis. A while back I showed you how to visualize sample relatedness using R and ggplot2, which requires IBD estimates. PLINK's --genome option uses IBS and allele frequencies to infer IBD. While a recent article in Nature Reviews Genetics on IBD and IBS analysis demonstrates potentially superior approaches, PLINK's approach is definitely the easiest because of PLINK's superior data management capabilities. The problem with IBD inference is that while computation time is linear with respect to the number of SNPs, it's quadratic (read: SLOW) with respect to the number of samples. With GWAS data on 10,000 samples, there are (10,000 choose 2) = 49,995,000 pairwise IBD estimates to compute. This can take quite some time to calculate on a single processor.

A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hints section of the PLINK documentation on IBS clustering. Each of these files contains a list of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks is specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on the cluster (although this can easily be modified to fit other parallelization frameworks, so modify the script as necessary). The script also needs all the minor allele frequencies (which can easily be obtained with the --freq option in PLINK).

One of the first things the script does is parse and split up your .fam file into chunks of N individuals (where N is set by the user - I used 100, and estimation only took ~15 minutes). This can be accomplished with a simple gawk command:

gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list

Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
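Each PBS job boils down to a single PLINK call per pair of chunk files. A hedged sketch that just prints those commands rather than invoking PLINK (the filenames data, data.frq, and the tmp.list* chunks follow the conventions above; pipe the output to your scheduler or to sh):

```shell
# Print one plink command per unordered pair of chunk files (including
# self-pairs), using --genome-lists to restrict each IBD calculation.
files="tmp.list000 tmp.list001 tmp.list002"
i=0
for a in $files; do
  j=0
  for b in $files; do
    if [ "$j" -ge "$i" ]; then
      echo "plink --bfile data --read-freq data.frq --genome --genome-lists $a $b --out data.sub.$i.$j"
    fi
    j=$((j+1))
  done
  i=$((i+1))
done > ibd_jobs.txt
```

With 3 chunks this writes 6 commands (3 self-pairs plus 3 cross-pairs); with 100 chunks of 100 samples each, you'd get 5,050 independent jobs.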

At which point you'll have a set of data.sub.*.genome output files that you can easily concatenate.

Here's the perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.

After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:

cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*

Monday, November 15, 2010

Seminar: A New Measure of Coefficient of Determination for Regression Models

Human Genetics / Biostatistics Associate Professor (and my first statistics teacher) Dr. Chun Li will be giving a talk Wednesday on a new measure of R² for continuous, binary, ordinal, and survival outcomes. Here are the details:

Department of Biostatistics Seminar/Workshop Series

A New Measure of Coefficient of Determination for Regression Models
Chun Li, PhD

Associate Professor, Department of Biostatistics, Vanderbilt University School of Medicine

Wednesday, November 17, 1:30-2:30pm, MRBIII Conference Room 1220
Summary: The coefficient of determination is a measure of the goodness of fit for a model. It is best known as R² in ordinary least squares (OLS) for continuous outcomes. However, as a ratio of values on the squared outcome scale, R² is often not intuitive to think of. In addition, extensions of the definition to other outcome types often have unsatisfactory properties. One approach is to define a ratio of two quantities, but often such a definition does not have an underlying decomposition property that is enjoyed by R². Another approach is to employ the connection of R² with the likelihood ratio statistic in linear regression, where the residuals follow normal distributions; but for discrete outcomes, this will result in a value less than one even for a perfect model fit. For regression models, we propose a new measure of the coefficient of determination: the correlation coefficient between the observed values and the fitted distributions. As a correlation coefficient, this new measure is intuitive and easy to interpret. It takes into account the variation in fitted distributions, and can be readily extended to other outcome types. For OLS, it is numerically the same as R²! We present the new measure for continuous, binary, ordinal, and time-to-event outcomes.

Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

As I mentioned in my recap of the ASHG 1000 Genomes tutorial, I'm going to be imputing some of my own data to 1000 Genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.

I'm starting from a binary pedigree format file (PLINK's bed/bim/fam format), and the first step in the 1000 Genomes imputation cookbook is to store your data in Merlin format, one file per chromosome. Surprisingly there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, and data.fam, the base filename will be "data" without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested this on a Unix machine. You can copy the source code below:
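The core idea of the script is just a loop over chromosomes calling PLINK with --chr each time. The same idea as a minimal shell sketch (assumes your files are data.bed/.bim/.fam; the commands are written to a file for review rather than executed directly):

```shell
# Write one plink command per autosome; inspect, then run with: sh split_by_chr.sh
for chr in $(seq 1 22); do
  echo "plink --bfile data --chr $chr --make-bed --out data.chr$chr"
done > split_by_chr.sh
```

The Perl script adds argument handling and output naming on top of this, but the PLINK invocation is the same.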

Tuesday, November 9, 2010

Video and slides from ASHG 1000 Genomes tutorials

If you missed the tutorial on the 1000 Genomes Project data last week at ASHG, you can now watch the tutorials on YouTube and download the slides online. Here's a recap of the speakers and topics:

Gil McVean, Ph.D.
Professor of Statistical Genetics
University of Oxford

Description of the 1000 Genomes Data
Gabor Marth, D.Sc.
Associate Professor of Biology
Boston College

How to Access the Data
Steve Sherry, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health. Bethesda, Md.

How to Use the Browser
Paul Flicek, Ph.D.
European Molecular Biology Laboratory
Vertebrate Genomics Team
European Bioinformatics Institute (EBI)

Structural Variants
Jan Korbel, Ph.D.
Group Leader, Genome Biology Research Unit
Joint Appointment with EMBL-EBI
European Molecular Biology Laboratory (Heidelberg, Germany)

How to Use the Data in Disease Studies
Jeffrey Barrett, Ph.D.
Team Leader, Statistical and Computational Genetics
Wellcome Trust Sanger Institute
Hinxton, United Kingdom

All the videos and slides are linked at the tutorial page. I found Jeff Barrett's overview of using the 1000 Genomes data for imputation particularly helpful. Also, don't forget about Goncalo Abecasis's 1000 Genomes imputation cookbook, which gives a little more detailed information about formatting, parallelizing code, etc. I'm going to be trying this myself soon, and I'll post tips along the way.

Friday, November 5, 2010

Keep up with what's happening at ASHG 2010

As this year's ASHG meeting starts to wind down, be sure to check out Variable Genome, where Larry Parnell is summarizing what's going on at the talks he's been to. Also see the commentary on Genetic Inference by Luke Jostins. The 1000 Genomes tutorial from Wednesday night will be made available online soon, and the presidential address, plenary sessions, and distinguished speaker symposium talks were recorded and will also soon be online. You can keep up with what's going on in real time by following the #ASHG2010 tag on Twitter.

Friday, October 29, 2010

Reproducible Research in the Omics Era: A Presentation and Panel Discussion

Seminar announcement for Vanderbilt folks:

Vanderbilt-Ingram Cancer Center
Quantitative Sciences Seminar Series


Reproducible Research in the Omics Era:
A Presentation and Panel Discussion
Kevin R. Coombes, PhD
Deputy Chair, Bioinformatics, and Professor of Bioinformatics and
Computational Biology
M.D. Anderson Cancer Center


Keith Baggerly, PhD
Associate Professor, Dept. of Bioinformatics and Computational Biology
M.D. Anderson Cancer Center

Panel Discussion at 1 p.m., following presentations:
Featuring Drs. Baggerly and Coombes, along with
Vanderbilt University School of Medicine’s
Dr. William Pao, Dr. Frank Harrell, and Dr. Yu Shyr

Friday, November 19, 2010
12 noon – 2 PM
214 Light Hall

Thursday, October 28, 2010

PacBio Film, Discussion & Reception/Dinner at ASHG 2010

Pacific Biosciences is hosting a reception and dinner, and is screening their film The New Biology at this year's ASHG meeting. According to a flyer they mailed me, the film will showcase their SMRT sequencing technology and how it can be used to "create predictive models of living systems and gain wisdom about the fundamental nature of life itself." While that last bit is perhaps an overstatement, it should nonetheless be worth attending: the event includes a reception, dinner, and a moderated discussion featuring individuals from the film. Unfortunately this conflicts with the previously mentioned 1000 Genomes tutorial, but if you get waitlisted at the tutorial, sign up for this event at the link below!

Wednesday, November 3, 2010


Smithsonian National Air and Space Museum
Independence Ave at 6th St SW
Washington, DC 20560

RSVP here -

Wednesday, October 27, 2010

Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

While writing my thesis I came across this nice review by Rita Cantor, Kenneth Lange, and Janet Sinsheimer at UCLA, "Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application." Skip the introduction unless you're new to GWAS, in which case you'll probably want to start with this more recent review by Teri Manolio. After skipping the intro you'll find a succinct introduction to meta-analysis for GWAS with lots of very good references, including these among others:

DerSimonian R., Laird N. Meta-analysis in clinical trials. Control. Clin. Trials. 1986;7:177–188.

Fleiss J.L. The statistical basis of meta-analysis. Stat. Methods Med. Res. 1993;2:121–145.

Yesupriya A., Yu W., Clyne M., Gwinn M., Khoury M.J. The continued need to synthesize the results of genetic associations across multiple studies. Genet. Med. 2008;10:633–635.

Lau J., Ioannidis J.P., Schmid C.H. Quantitative synthesis in systematic reviews. Ann. Intern. Med. 1997;127:820–826.

Allison D.B., Schork N.J. Selected methodological issues in meiotic mapping of obesity genes in humans: Issues of power and efficiency. Behav. Genet. 1997;27:401–421.

Ioannidis J.P., Gwinn M., Little J., Higgins J.P., Bernstein J.L., Boffetta P., Bondy M., Bray M.S., Brenchley P.E., Buffler P.A., Human Genome Epidemiology Network and the Network of Investigator Networks. A road map for efficient and reliable human genome epidemiology. Nat. Genet. 2006;38:3–5.

de Bakker P.I., Ferreira M.A., Jia X., Neale B.M., Raychaudhuri S., Voight B.F. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum. Mol. Genet. 2008;17(R2):R122–R128.

Sagoo G.S., Little J., Higgins J.P., Human Genome Epidemiology Network. Systematic reviews of genetic association studies. PLoS Med. 2009;6:e28.

Zeggini E., Ioannidis J.P. Meta-analysis in genome-wide association studies. Pharmacogenomics. 2009;10:191–201.

Egger M., Smith G.D., Phillips A.N. Meta-analysis: Principles and procedures. BMJ. 1997;315:1533–1537.

Ioannidis J.P., Patsopoulos N.A., Evangelou E. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE. 2007;2:e841.

This section covers using imputation in meta-analysis, fixed effects versus random effects meta-analysis, canned software for meta-analysis (such as METAL), Bayesian hierarchical approaches, and references to many applications of meta-analysis in GWAS.

After the meta-analysis section there's a nice section on modeling epistasis, or gene-gene interactions, to prioritize associations with links to other reviews of statistical methods, and brief coverage of data mining procedures like CART, MDR, random forests, conditional entropy methods, neural networks, genetic programming, logic regression, pattern mining, Bayesian partitioning, and penalized regression approaches, again with lots of references. This section also covers parameterization of epistatic models, and covers some of the computation and statistical issues you'll face with the dimensionality problem.

Finally, the review concludes with a section on pathway analysis. As the review admits, pathway analysis in GWAS has no set of strict guidelines or best practices, and new approaches arise every day.

While this review is nearly a year old at this point, I think it's a real gem because of all the references it offers, especially in the meta-analysis and epistasis sections.

AJHG: Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

Thursday, October 14, 2010

Tutorial on the 1000 Genomes Project Data

There will be a (free) tutorial on the 1000 genomes project at this year's ASHG meeting on Wednesday, November 3, 7:00 – 9:30pm. You can register online at the link below. The tutorial will describe the 1000 genomes data, how to access it, and what to do with it. Specifically, the speakers and topics covered are:

1. Introduction
2. Description of the 1000 Genomes data -- Gabor Marth
3. How to access the data -- Steve Sherry
4. How to use the browser -- Paul Flicek
5. Structural variants -- Jan Korbel
6. How to use the data in disease studies -- Jeff Barrett
7. Q&A

Online registration for 1000 genomes tutorial

Hopefully I'll see some of you there. I'm not sure if imputation is covered in this tutorial. If not, I will cover it here in a future post. I'll soon be using Goncalo Abecasis's 1000 Genomes Imputation Cookbook to impute my own data to the 1kG SNPs, and I'll share any tips I discover along the way.

Wednesday, October 6, 2010

Random forests for high-dimensional genomics data

I know I've been MIA for a while. My defense date is December 3, and I've still got a thesis to write! I'll try to post more soon, but in the meantime follow me on Twitter for things that won't make it into a full blog post.

For those at Vanderbilt and the surrounding environs: I saw this announcement for the next cancer biostatistics workshop that looked interesting.

2010 Cancer Biostatistics Workshop

Friday, October 15, 2010
1:00 to 2:00 PM
898B Preston Research Building

Random forests for high-dimensional genomics data

Xi (Steven) Chen, PhD
Assistant Professor
Department of Biostatistics
Cancer Biostatistics Center, Vanderbilt-Ingram Cancer Center

Wednesday, September 29, 2010

Vanderbilt Genetics Symposium: Beyond Disease Dichotomy - Quantitative Traits and Intermediate Phenotypes

About a year ago I reiterated a point made nicely in a Nature Reviews Genetics article, that there is no such thing as a common disorder - only extremes of quantitative traits. Such is the theme of this year's Annual Vanderbilt Genetics Symposium, "Beyond Disease Dichotomy - Quantitative Traits and Intermediate Phenotypes." This is a day-long event held at the Vanderbilt Student Life Center on Wednesday October 13, 8am-4pm. Registration is free but required to attend. Students in our program will be presenting posters, and students in other programs are welcome to submit an abstract as well.  You can check out the full agenda at the link below. Here is the speaker lineup:

Keynote Speakers

Molly Losh, Ph.D.
Jane and Michael Hoffman Assistant Professor of
Communication Sciences & Disorders
Northwestern University

Charles R. Farber, Ph.D.
Assistant Professor of Medicine
University of Virginia

Andrew J. Saykin, PsyD, ABCN
Raymond C. Beeler Professor of Radiology and Imaging Sciences
Professor of Medical and Molecular Genetics
Director, Center for Neuroimaging
Indiana University School of Medicine

Vanderbilt Speakers

Roger Cone, Ph.D.
Professor and Chairman, Department of Molecular
Physiology & Biophysics

Dana Crawford, Ph.D.
Assistant Professor, Department of Molecular
Physiology & Biophysics
Investigator, Center for Human Genetics Research

Karoly Mirnics, Ph.D.
Professor and Vice Chair for Basic Research,
Department of Psychiatry

Vanderbilt Genetics Symposium: Beyond Disease Dichotomy - Quantitative Traits and Intermediate Phenotypes

Monday, September 27, 2010

Towards a More Rigorous Approach to Personalized Medicine

Frank Harrell, chair of our Biostats department, will be giving a seminar entitled "Towards a More Rigorous Approach to Personalized Medicine." As a champion of methods and strategies for reproducible research, Dr. Harrell's lecture on personalized medicine should be interesting.

Frank E Harrell Jr, Professor and Chair, Department of Biostatistics

Wednesday, 29 Sep 10, 1:30-2:30pm, MRBIII Conference Room 1220

Intended Audience: Persons interested in personalized medicine, biomarkers, reproducible research, clinical epidemiology


There are many ways to personalize the diagnosis and treatment of diseases, pharmacogenomics being one of them. Personalization can be based on routinely collected information, molecular signatures, or on repeated trials on the patient whose treatment plan is being devised. However, current emphases in personalized medicine research often ignore characteristics known to impact treatment benefit, in favor of tests that either generate more revenue or are developed with research that is perhaps easier to fund than "low-tech" research. Failure of the research community to fully utilize rich datasets generated by randomized clinical trials only heightens this concern.

Research supporting personalized medicine can be made more rigorous and relevant. For example in acute diseases, multi-period crossover studies can be used to measure individual response to therapy, and these studies can provide an upper bound on the genome by treatment interaction. When patient by treatment interaction is demonstrated, crossover studies can form an ideal basis for pharmacogenomics. However, even with the best within-patient data, group average treatment effects need to be incorporated in order for predictions for individual patients to have high precision.

There are a few ways to do personalized medicine well but a multitude of ways to do it poorly. Biomarker research in particular has not fulfilled its early promises, a major reason being flawed methodology. The flaws include faulty experimental design, bias, overfitting, weak validation, irreproducible data processing and analysis practices, and failure to rigorously show that the new markers add information to readily available clinical data. This will be discussed in terms of Platt's concept of "strong inference", seeking alternative explanations of findings, and sensitivity analysis.

This talk is also a call for the biostatistics and clinical epidemiology communities to be more integrally involved in research related to personalized medicine.

Tuesday, September 21, 2010

Install and load R package "Rcmdr" to quickly install lots of other packages

I recently reformatted my laptop and needed to reinstall R and all the packages that I regularly use. In a previous post I covered R Commander, a nice GUI for R that includes a decent data editor and menus for graphics and basic statistical analysis. Since Rcmdr depends on many other packages, installing and loading Rcmdr like this...

install.packages("Rcmdr", dependencies=TRUE)

...will also install and load nearly every other package you've ever needed to use (except ggplot2, Hmisc, and rms/design). This saved me a lot of time trying to remember which packages I normally use and installing them one at a time. Specifically, installing and loading Rcmdr will install the following packages from CRAN: fBasics, bitops, ellipse, mix, tweedie, gtools, gdata, caTools, Ecdat, scatterplot3d, ape, flexmix, gee, mclust, rmeta, statmod, cubature, kinship, gam, MCMCpack, tripack, akima, logspline, gplots, maxLik, miscTools, VGAM, sem, mlbench, randomForest, SparseM, kernlab, HSAUR, Formula, ineq, mlogit, np, plm, pscl, quantreg, ROCR, sampleSelection, systemfit, truncreg, urca, oz, fUtilities, fEcofin, RUnit, quadprog, mlmRev, MEMSS, coda, party, ipred, modeltools, e1071, vcd, AER, chron, DAAG, fCalendar, fSeries, fts, its, timeDate, timeSeries, tis, tseries, xts, foreach, DBI, RSQLite, mvtnorm, lme4, robustbase, mboost, coin, xtable, sandwich, zoo, strucchange, dynlm, biglm, rgl, relimp, multcomp, lmtest, leaps, effects, aplpack, abind, RODBC.

Anyone else have a solution for batch-installing packages you use on a new machine or fresh R installation? Leave it in the comments!
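
For what it's worth, here's the approach I'd sketch (file names here are just placeholders): dump the list of installed packages to a file before wiping the machine, then diff that list against whatever ships with the fresh installation.

```r
## On the old installation: save the names of all installed packages
pkgs <- rownames(installed.packages())
save(pkgs, file = "installed_packages.RData")

## On the fresh installation: reinstall whatever is missing
load("installed_packages.RData")  # restores `pkgs`
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```

The nice thing about the setdiff is that base and recommended packages (and anything Rcmdr already pulled in) get skipped automatically.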

Monday, September 13, 2010

Empowering Personal Genomics by Considering Regulatory Cis-Epistasis and Heterogeneity

Will Bush and I just heard that our paper "Multivariate Analysis of Regulatory SNPs: Empowering Personal Genomics by Considering Cis-Epistasis and Heterogeneity" was accepted for publication and a talk at the Personal Genomics session of the 2011 Pacific Symposium on Biocomputing.

Your humble GGD contributors embarked on our first collaborative paper using genome-wide transcriptome data and genome-wide SNP data from HapMap lymphoblastoid cell lines to examine an alternative mechanism for how epistasis might affect human traits. Many human traits are driven by alterations in gene expression, and it's known that common genetic variation affects the expression of nearby genes. We also know that epistasis is ubiquitous and affects human traits. Combining these three ideas, is it possible that genetic variation can interact epistatically to exert a cis-regulatory effect on the expression of nearby genes? If so, what is the genomic and statistical structure of these epistatically interacting multilocus models? Are genes which are affected by cis-epistasis associated with complex human disease or morphological phenotypes? If so, how might we use this knowledge to guide the reanalysis of existing datasets? We addressed these questions here using experimental data from HapMap cell lines. If you're interested in seeing the paper please email me, or try to catch our talk at PSB (a meeting worth going to!).

Abstract: Understanding how genetic variants impact the regulation and expression of genes is important for forging mechanistic links between variants and phenotypes in personal genomics studies.  In this work, we investigate statistical interactions among variants that alter gene expression and identify 79 genes showing highly significant interaction effects consistent with genetic heterogeneity.  Of the 79 genes, 28 have been linked to phenotypes through previous genomic studies.  We characterize the structural and statistical nature of these 79 cis-epistasis models, and show that interacting regulatory SNPs often lie far apart from each other and can be quite distant from the gene they regulate.  By using cis-epistasis models that account for more variance in gene expression, investigators may improve the power and replicability of their genomics studies, and more accurately estimate an individual's gene expression level, improving phenotype prediction.

Pacific Symposium on Biocomputing 2011

Tuesday, September 7, 2010

Embed R Code with Syntax Highlighting on your Blog

Note 2010-11-17: there's more than one way to do this. See the updated post from 2010-11-17.

If you use Blogger or even WordPress you've probably found that it's complicated to post code snippets with spacing preserved and syntax highlighting (especially for R code). I've discovered a few workarounds that involve hacking the Blogger HTML template and linking to someone else's javascript templates, but it isn't pretty and I'm relying on someone else to perpetually host and maintain the necessary javascript. GitHub Gists make this really easy. GitHub is a source code hosting and collaborative/social coding website, and makes it very easy to post, share, and embed code snippets with syntax highlighting for almost any language you can think of.

Here's an example of some R code I posted a few weeks ago on making QQ plots of p-values using R base graphics.
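
For those who can't see the embedded gist, the core of a p-value QQ plot in base graphics is only a few lines (a minimal sketch, not the exact code from that post):

```r
# Simulate p-values under the null; substitute your own vector of p-values
set.seed(42)
pvals <- runif(10000)

# Observed vs. expected -log10(p) quantiles
obs    <- -log10(sort(pvals))
expctd <- -log10(ppoints(length(pvals)))

plot(expctd, obs, xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")  # points should hug this line under the null
```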

The Perl highlighter also works well. Here's some code I posted recently to help clean up PLINK output:

Simply head over to gist.github.com and paste in your code, select a language for syntax highlighting, and hit "Create Public Gist." The embed button will give you a line of HTML that you can paste into your blog to embed the code directly.
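
That embed code is just a one-line script tag; it looks something like this (the gist ID below is a made-up placeholder):

```html
<script src="https://gist.github.com/123456.js"></script>
```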

Finally, if you're using WordPress you can get the GitHub Gist plugin for WordPress to get things done even faster. A big tip of the hat to economist J.D. Long (blogger at Cerebral Mastication) for pointing this out to me.

Thursday, September 2, 2010

Rebecca Skloot (HeLa) to speak at Vanderbilt September 7

Rebecca Skloot, author of bestselling The Immortal Life of Henrietta Lacks (Amazon, $14), will be speaking here at Vanderbilt next Tuesday at noon in 208 Light Hall. This is one you don't want to miss. Be sure to get there a few minutes early. When 208 fills up they'll have overflow in 202 with a live webcast. RSVP to for a free lunch. On a related note, apparently Oprah and Alan Ball (screenwriter, Six Feet Under, True Blood) will be teaming up with HBO to produce a movie based on the book. No doubt this will stir up a much needed dialog about the nature of informed consent in scientific research. If you don't know the story about the origin of HeLa cells, you can get the quick summary on Wikipedia. Better yet, buy the book.

Tuesday, August 31, 2010

Writing my Thesis - Follow me on Twitter

A few weeks ago I suddenly reached the point that every graduate student once thought would never come - time to start writing my thesis. With a blank page and a blinking cursor staring me in the face, it's time to compile all the published and unpublished work I've accumulated over the last few years and wordsmith this pile of papers and results into a single cohesive unit. And since job prospects are starting to materialize and my public defense is now set for December 3rd, I have little time to spare away from my thesis-writing cave.

I'll still post any important announcements and reviews of interesting literature I come across, but since I'm mostly writing, the typical blog posts about software, R code, statistical and analytical tips may be more sparse than usual over the next few weeks.

I'll be posting and linking to interesting things on Twitter that I might not have time to expand into a full blog post, so follow me at @genetics_blog. I'm looking forward to meeting some of you at ASHG and/or IGES this year. Stop by my poster at IGES or my talk at the "Methods in Statistical Genetics" session at ASHG Friday, November 5, Room 202, 5:45 pm.

@genetics_blog on Twitter

Monday, August 16, 2010

Deducer: R and ggplot2 GUI

Last year I introduced you to R Commander, a nice graphical user interface (GUI) for R for those of you who are still hesitant to leave the clicky-box style research a la SPSS for the far superior reproducible research using R. As most of you know I'm a huge fan of ggplot2. Many of you came to the short course Hadley Wickham gave here a few weeks ago on ggplot2 and plyr. I just came across Deducer, another GUI for R that also allows you to build plots using the ggplot2 framework through the graphical interface. See the video below and the related videos on the YouTube page for a quick preview of what Deducer can do.

Deducer: Intuitive Data Analysis using R and ggplot2

Friday, August 13, 2010

Success of GWAS Approach Demonstrated by Latest Lipids Meta-Analysis

Last year I linked to a series of perspectives in NEJM with contrasting views on the success or failure of GWAS - David Goldstein's paper and Nick Wade's synopsis that soon followed in the New York Times being particularly pessimistic. Earlier this year I was swayed by an essay in Cell by Jon McClellan and Mary-Claire King condemning the common disease common variant hypothesis and chalking up most GWAS hits to population stratification. Their main argument - that the frequency of the risk allele is hypervariable even within European populations - was persuasive until Kai Wang pointed out on this blog, and recently expanded in a correspondence in Cell, that McClellan and King used cohorts with extremely small sample sizes to estimate allele frequencies, and that the locus in question was no more variable than the average SNP in European populations. (There are a series of three letters in Cell, including a response by McClellan and King, that are definitely worth reading here, here, and here.)

One of the main arguments in David Goldstein's 2009 NEJM paper is that doing GWAS with increasingly larger sample sizes will not yield meaningful discoveries, especially if the newly detected loci explain such a small proportion of the heritability of the trait being studied. A paper published last week in Nature provides empirical evidence refuting such claims. "Biological, clinical and population relevance of 95 loci for blood lipids" by Teslovich et al (with Goncalo Abecasis, Cristen Willer, Sekar Kathiresan, Leena Peltonen, Kari Stefansson, Yurii Aulchenko, Chiara Sabatti, Robert Hegele, Francis Collins, and many, many other co-authors) presents a meta-analysis of blood lipid levels in over 100,000 samples from multiple ethnic groups. This study identified 95 loci (59 novel) associated with total cholesterol, LDL, HDL, or triglycerides, together explaining 10-12% of the variation in these traits. A handful of these loci demonstrated clear clinical and/or biological significance: several are common variants in or near genes harboring rare variants known to cause extreme dyslipidemias, and for others the authors demonstrated altered lipid levels in mice after disturbing the regulation of the newly discovered genes. Furthermore, most of the newly discovered loci were significant in non-European populations with the same direction of association.

This study demonstrates that combining studies by meta-analysis to achieve the massive sample sizes needed to detect extremely small effects can produce both clinically and biologically meaningful GWAS discoveries. It also demonstrates that most of the significant results are in fact associated with lipid traits across global populations, which has implications for enabling personal genomics / personalized medicine in non-European populations. Furthermore, as Teri Manolio noted in her recent review in NEJM, one cannot equate variance explained with potential clinical importance: the type 2 diabetes-associated genes PPARG and KCNJ11 and the psoriasis-associated IL12B encode proteins that are drug targets for thiazolidinediones, sulfonylureas, and anti-p40 antibodies respectively, yet all of these associations have odds ratios less than 1.45. So a GWAS with >100,000 samples uncovers new loci with extremely small effects... while these loci alone may not be useful today for treatment or clinical risk stratification, it's difficult to judge the importance of these loci until you perturb the system with a pharmaceutical or some other environmental intervention.

A true testament to the success of GWAS, the paper is a pleasure to read (even though, unfortunately, the real substance of the paper is buried in the 19 tables and 3 figures of the 83-page supplement).

Biological, clinical and population relevance of 95 loci for blood lipids

Tuesday, August 10, 2010

Accuracy of Individualized Risk Estimates for Personalized Medicine

Lucila Ohno-Machado, Professor of Medicine and Chief of the Division of Biomedical Informatics at UC-San Diego, will be giving a talk on "Accuracy of Individualized Risk Estimates for Personalized Medicine" next week, August 18, noon-1pm in 202 Light Hall. This should be an interesting perspective from a scientist with medical training on the utility of personal genomics tools in making healthcare decisions.

Bio: Lucila Ohno-Machado, MD, PhD, is Professor of Medicine and founding chief of the Division of Biomedical Informatics at the University of California San Diego. She received her medical degree from the University of Sao Paulo and her doctoral degree in Medical Information Sciences from Stanford University. Prior to her current role, she was director of the training program for the Harvard-MIT-Tufts-Boston University consortium in Boston, and director of the Decision Systems Group at Brigham and Women's Hospital, Harvard Medical School. Her research focuses on the development of new evaluation methods for predictive models of disease, with special emphasis on the analysis of model calibration and implications in healthcare. She is an elected member of the American College of Medical Informatics, the American Institute for Medical and Biological Engineering, and the American Society for Clinical Investigation. She is associate editor for the Journal of the American Medical Informatics Association, and will become Editor-in-Chief in January 2011. Dr. Ohno-Machado will discuss the problems with evaluating individual risk estimates and predictive models based on binary outcomes using existing methods. She will present alternative methods for evaluating calibration of risk assessment tools and discuss implications in healthcare practice.

Abstract: Medical decision support tools are increasingly available on the Internet and are being used by lay persons as well as health care professionals. The goal of some of these tools is to provide an "individualized" prediction of future health care related events (e.g.,  prognosis of breast cancer given specific information about the individual). Under the umbrella of "personalized" medicine, these individualized prognostic assessments are sought as a means to replace general prognostic information with specific probability estimates that pertain to a small stratum to which the patient belongs, and ultimately specifically to each patient. Subsequently, these estimates are used to inform decision making and are therefore of critical importance for public health. In this presentation, I will discuss the problems with assessing the quality of individual estimates, present existing and proposed tools for evaluating prognostic models, and discuss implications for individual counseling.

This should be an interesting talk, and very relevant to current regulatory issues surrounding personal genomics.

Monday, August 9, 2010

Quickly Find the Class of data.frame vectors in R

Aviad Klein over at My ContRibution wrote a convenient R function to list the classes of all the vectors that make up a data.frame. You would think apply(kyphosis,2,class) would do the job but it doesn't - apply first coerces the data frame to a matrix, so it calls every column a character class. Aviad wrote an elegant little function that does the job perfectly without having to load any external package:

allClass <- function(x) {unlist(lapply(unclass(x), class))}

Here it is in action:

> # load the CO2 dataset
> data(CO2)
> # look at the first few rows
> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
> # this doesn't work
> apply(CO2,2,class)
      Plant        Type   Treatment        conc      uptake 
"character" "character" "character" "character" "character" 
> # this does
> allClass <- function(x) {unlist(lapply(unclass(x),class))}
> allClass(CO2)
   Plant1    Plant2      Type Treatment      conc    uptake 
"ordered"  "factor"  "factor"  "factor" "numeric" "numeric" 

Nice tip, Aviad.
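
One caveat you can see in the output above: because Plant is an ordered factor (two classes, "ordered" and "factor"), unlisting splits it into Plant1 and Plant2. If you'd rather have exactly one class label per column, a small sapply variant works (just a sketch of mine; allClass above is Aviad's original):

```r
# Report only the first class of each column, so columns with
# multiple classes (e.g. ordered factors) aren't split in two
firstClass <- function(x) sapply(x, function(v) class(v)[1])
firstClass(CO2)
# Plant: "ordered", Type: "factor", Treatment: "factor",
# conc: "numeric", uptake: "numeric"
```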

Tuesday, August 3, 2010

Convert PLINK output to tab or comma delimited CSV using Perl

Last week Will showed you a bash script version of a sed command covered here a while back that would convert PLINK output from its default variable space-delimited format to a more database-loading-friendly tab- or comma-delimited file. A commenter asked how to do this on Windows, so I'll share the way I do this using a Perl script, which you can run on Windows after installing ActivePerl. First copy the code below and save it as a script somewhere in your path.


# (c) Stephen D. Turner 2010
# This is free open-source software.
# See

#!/usr/bin/perl
use strict;
use warnings;

my $help = "\nUsage: $0 <input whitespace file(s)> <tab or comma>\n\n";
die $help if @ARGV<2;

my $delimiter  = pop @ARGV;  # last argument: "tab" or "comma"
my @inputfiles = @ARGV;      # remaining arguments: files to convert

die $help unless ($delimiter=~/tab/i || $delimiter=~/comma/i);

if ($delimiter =~ /comma/i) {
    foreach (@inputfiles) {
        open (IN,  "<$_")     or die "Can't read $_: $!";
        open (OUT, ">$_.csv") or die "Can't write $_.csv: $!";
        while (<IN>) {
            $_ =~ s/^\s+//;  # Trim whitespace at beginning
            $_ =~ s/\s+$//;  # Trim whitespace at end
            $_ =~ s/\s+/,/g; # Remaining whitespace into commas
            #$_ =~ s/NA/-9/g; # If you want to recode NA as -9
            print OUT "$_\n";
        }
        close IN;
        close OUT;
    }
} elsif ($delimiter =~ /tab/i) {
    foreach (@inputfiles) {
        open (IN,  "<$_")     or die "Can't read $_: $!";
        open (OUT, ">$_.tab") or die "Can't write $_.tab: $!";
        while (<IN>) {
            $_ =~ s/^\s+//;  # Trim whitespace at beginning
            $_ =~ s/\s+$//;  # Trim whitespace at end
            $_ =~ s/\s+/\t/g; # Remaining whitespace into tabs
            #$_ =~ s/NA/-9/g; # If you want to recode NA as -9
            print OUT "$_\n";
        }
        close IN;
        close OUT;
    }
} else {
    die $help;
}
Run the program with the first argument(s) as the PLINK output file(s) you want to convert, and the last argument as either "comma" or "tab" without the quotes. It'll create another file in the current directory ending with either .csv or .tab. Look below to see it in action.

turnersd@provolone:~/tmp$ ls
turnersd@provolone:~/tmp$ cat plink.qassoc
 CHR         SNP         BP    NMISS       BETA         SE         R2        T            P
   1   rs3094315     742429     3643    -0.2461     0.2703  0.0002275  -0.9102       0.3628
   1  rs12562034     758311     3644    -0.1806     0.3315  8.149e-05  -0.5448       0.5859
   1   rs3934834     995669     3641    0.04591     0.2822  7.271e-06   0.1627       0.8708
   1   rs9442372    1008567     3645     0.1032     0.2063  6.868e-05   0.5002       0.6169
   1   rs3737728    1011278     3644     0.1496     0.2268  0.0001195   0.6598       0.5094
   1   rs6687776    1020428     3645    -0.5378     0.2818   0.000999   -1.909      0.05639
   1   rs9651273    1021403     3643     0.2002     0.2264  0.0002149   0.8847       0.3764
   1   rs4970405    1038818     3645    -0.4994     0.3404  0.0005903   -1.467       0.1425
   1  rs12726255    1039813     3645    -0.4515     0.2956  0.0006398   -1.527       0.1268
turnersd@provolone:~/tmp$ plink.qassoc comma
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat plink.qassoc.csv
turnersd@provolone:~/tmp$ plink.qassoc tab
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat
CHR     SNP     BP      NMISS   BETA    SE      R2      T       P
1       rs3094315       742429  3643    -0.2461 0.2703  0.0002275       -0.9102 0.3628
1       rs12562034      758311  3644    -0.1806 0.3315  8.149e-05       -0.5448 0.5859
1       rs3934834       995669  3641    0.04591 0.2822  7.271e-06       0.1627  0.8708
1       rs9442372       1008567 3645    0.1032  0.2063  6.868e-05       0.5002  0.6169
1       rs3737728       1011278 3644    0.1496  0.2268  0.0001195       0.6598  0.5094
1       rs6687776       1020428 3645    -0.5378 0.2818  0.000999        -1.909  0.05639
1       rs9651273       1021403 3643    0.2002  0.2264  0.0002149       0.8847  0.3764
1       rs4970405       1038818 3645    -0.4994 0.3404  0.0005903       -1.467  0.1425
1       rs12726255      1039813 3645    -0.4515 0.2956  0.0006398       -1.527  0.1268
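
And if you don't have Perl handy at all, an awk one-liner does the same job (not from the original post - just an alternative): assigning $1 to itself forces awk to rebuild the line using the output field separator, which trims the leading whitespace and collapses the variable-width runs in one shot.

```shell
# Tab-delimited: re-split each line on whitespace, re-join with tabs
awk -v OFS='\t' '{$1=$1; print}' plink.qassoc > plink.qassoc.tab

# Comma-delimited version
awk -v OFS=',' '{$1=$1; print}' plink.qassoc > plink.qassoc.csv
```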

Tuesday, July 27, 2010

Hadley Wickham's ggplot2 / Data Visualization Course Materials

Hadley Wickham, creator of ggplot2, an immensely popular framework for Tufte-friendly data visualization using R, is teaching two short courses at Vanderbilt this week. Once we opened registration to Vanderbilt students and staff we instantly filled all the available seats, so unfortunately I wasn't able to announce the course here. But the good news is that Hadley's made all the data, code, and slides from the course available online here. We weren't able to record the course, but David Smith over at Revolutions posted links to videos of Hadley teaching a similar course a few months ago.

Hadley Wickham - Visualizing Data workshop at Vanderbilt

Thursday, July 22, 2010

Webcast this Morning: House Committee on Energy and Commerce hearing on DTC Genetic Testing

A live webcast of the House Committee on Energy and Commerce hearing on “Direct-to-Consumer Genetic Testing and the Consequences to the Public Health" is available at this link. I had trouble viewing the webcast in Firefox - I had to save the link and open it with VLC media player to get it working. You can also follow the #HouseDTC hashtag on Twitter.

In case you missed the FDA public meeting on oversight of Laboratory Developed Tests (LDTs), Dan Vorhaus over at Genomics Law Report posted recaps of day 1 and day 2 of the meeting.

Update July 22 1:48pm CDT: this webcast is over. You can read a written testimony from all the witnesses along with statements from committee members here. You can also follow ongoing discussion on Twitter, and of course check Genomics Law Report in the next day or two for further analysis by Dan Vorhaus.

Update July 23: The testimony from Gregory Kutz at the Government Accountability Office (referred to as the "GAO Report," available online here) in addition to the discussion at yesterday's House hearing on DTC testing caused quite a stir. 23andMe quickly responded with a thorough point-by-point rebuttal of major points made in the GAO report. Dan Vorhaus at Genomics Law Report posted a very thorough and thoughtful summary on yesterday's events and discussion. Daniel MacArthur has also posted a summary and response on Genomes Unzipped, and the comment string on this post is definitely worth reading through.

Wednesday, July 21, 2010

How to Read a Genome-Wide Association Study (@GenomesUnzipped)

Jeff Barret (@jcbarret on Twitter) over at Genomes Unzipped (@GenomesUnzipped) has posted a nice guide for the uninitiated on how to read a GWAS paper. Barret outlines five critical areas that readers should pay attention to: sample size, quality control, confounding (including population substructure), the replication requirement, and biological significance. It would be nice to see a follow-up post like this on things to look out for in studies that investigate other forms of human genetic variation such as copy number polymorphism, rare variation, or gene-environment interaction.

And this is also a convenient point for me to mention Genomes Unzipped - a collaborative blog covering topics relevant to the personal genomics industry, featuring posts by several of my favorite bloggers including Daniel MacArthur (of Genetic Future), Luke Jostins (of Genetic Inference), Dan Vorhaus (of Genomics Law Report), Jan Aerts (Saaien Tist), Jeff Barret, Caroline Wright, Katherine Morley, and Vincent Plagnol. GNZ, as it's called, has only been live for about two weeks, but looks like a good one to follow as the personal genomics industry begins to mature over the next few years.

Genomes Unzipped: How to Read a Genome-Wide Association Study
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.