Tuesday, August 31, 2010

Writing my Thesis - Follow me on Twitter

A few weeks ago I suddenly reached the point that every graduate student once thought would never come - time to start writing my thesis. With a blank page and a blinking cursor staring me in the face it's time to compile all of my published and unpublished work I've accumulated over the last few years and wordsmith this pile of papers and results into a single cohesive unit. And since job prospects are starting to materialize and because my public defense is now set for December 3rd, I have little time to spare away from my thesis writing cave.

I'll still post any important announcements and reviews of interesting literature I come across, but since I'm mostly writing, the typical blog posts about software, R code, statistical and analytical tips may be more sparse than usual over the next few weeks.

I'll be posting and linking to interesting things on Twitter that I might not have time to expand into a full blog post, so follow me at http://twitter.com/genetics_blog. I'm looking forward to meeting some of you at ASHG and/or IGES this year. Stop by my poster at IGES or my talk at the "Methods in Statistical Genetics" session at ASHG Friday, November 5, Room 202, 5:45 pm.

@genetics_blog on Twitter

Monday, August 16, 2010

Deducer: R and ggplot2 GUI

Last Year I introduced you to R Commander, a nice graphical user interface (GUI) for R for those of you who are still hesitant to leave the clicky-box style research a la SPSS for the far more superior reproducible research using R. As most of you know I'm a huge fan of ggplot2. Many of you came to the short course Hadley Wickham gave here a few weeks ago on ggplot2 and plyr. I just came across Deducer, another GUI for R that also allows you to build plots using the ggplot2 framework using the graphical interface. See the video below and the related videos on the youtube page for a quick preview of what Deducer can do.

Deducer: Intuitive Data Analysis using R and ggplot2

Friday, August 13, 2010

Success of GWAS Approach Demonstrated by Latest Lipids Meta-Analysis

Last year I linked to a series of perspectives in NEJM with contrasting views on the success or failure of GWAS - David Goldstein's paper and Nick Wade's synopsis that soon followed in the New York Times being particularly pessimistic. Earlier this year I was swayed by an essay in Cell by Jon McClellan and Mary-Claire King condemning the common disease common variant hypothesis and chalking up most GWAS hits to population stratification. The main argument, that the frequency of the risk allele being hypervariable even in European populations - was persuasive until Kai Wang pointed out on this blog and recently expanded in a correspondence in Cell that McClellan and King used a cohort with extremely small sample sizes to estimate allele frequencies, and that this locus was no more variable than the average SNP in European populations. (There are a series of three letters in Cell, including a response by McClellan and King that are definitely worth reading here, here, and here.)

One of the main arguments in David Goldstein's 2009 NEJM paper is that doing GWAS with increasingly larger sample sizes will not yield meaningful discoveries, especially if the newly detected loci explain such a small proportion of the heritability of the trait being studied. A paper published last week in Nature provides empirical evidence refuting such claims. "Biological, clinical and population relevance of 95 loci for blood lipids" by Teslovich et al (with Goncalo Abecasis, Cristen Willer, Sekar Kathiresan, Leena Peltonen, Kari Steffanson, Yurii Aulchenko, Chiara Sabatti, Robert Hegele, Francis Collins, and many, many other co-authors) presents a meta-analysis of blood lipid levels in over 100,000 samples in multiple ethnic groups. This study identified 95 loci (59 novel) associated with either total cholesterol, LDL, HDL, or Triglycerides, explaining 10-12% of the variation in these traits. A handful of these loci demonstrated clear clinical and/or biological significance: several are common variants in or near genes harboring rare variants known to cause extreme dyslipidemias, and with others the authors demonstrated altered lipid levels in mice after disturbing the regulation of several of the newly discovered genes. Furthermore, most of the newly discovered loci were significant in other non-European populations with the same direction of association .

This study demonstrates that combining studies using meta-analysis, achieving massive sample sizes to detect extremely small effects can result in both clinically and biologically meaningful discoveries using GWAS. This study also demonstrates that most of the significant results are in fact associated with lipid traits across global populations, which has implications for enabling personal genomics / personalized medicine in non-European populations. Furthermore, as Teri Manolio noted in her recent review in NEJM, one cannot equate variance explained with potential clinical importance: Type II diabetes associated genes PPARG and KCNJ11 and psoriasis-associated IL12B encode proteins that are drug targets for thiazolidinediones, sulfonylureas, and anti-p40 antibodies respectively, yet all of these associations have odds ratios less than 1.45. So a GWAS with >100,000 samples uncovers new loci with extremely small effects... while these loci alone may not be useful today for treatment or clinical risk stratification, it's difficult to judge the importance of these loci until you perturb the system with pharmaceutical or some other environmental intervention.

A true testament to the success of GWAS, the paper is a pleasure to read, (even though unfortunately the real substance of the paper is buried in the 19 tables and 3 figures in the 83 page supplement).

Biological, clinical and population relevance of 95 loci for blood lipids

Tuesday, August 10, 2010

Accuracy of Individualized Risk Estimates for Personalized Medicine

Lucila Ohno-Machado, Professor of Medicine and Chief of the Division of Biomedical Informatics at UC-San Diego, will be giving a talk on "Accuracy of Individualized Risk Estimates for Personalized Medicine" next week, August 18, noon-1pm in 202 Light Hall. This should be an interesting perspective from a scientist with medical training on the utility of personal genomics tools in making healthcare decisions.

Bio: Lucila Ohno-Machado, MD, PhD, is Professor of Medicine and founding chief of the Division of Biomedical Informatics at the University of California San Diego. She received her medical degree from the University of Sao Paulo and her doctoral degree in Medical Information Sciences from Stanford University. Prior to her current role, she was director of the training program for the Harvard-MIT-Tufts-Boston University consortium in Boston, and director of the Decision Systems Group at Brigham and Women's Hospital, Harvard Medical School. Her research focuses on the development of new evaluation methods for predictive models of disease, with special emphasis on the analysis of model calibration and implications in healthcare. She is an elected member of the American College of Medical Informatics, the American Institute for Medical and Biological Engineering, and the American Society for Clinical Investigation. She is associate editor for the Journal of the American Medical Informatics Association, and will become Editor-in-Chief in January 2011. Dr. Ohno-Machado will discuss the problems with evaluating individual risk estimates and predictive models based on binary outcomes using existing methods. She will present alternative methods for evaluating calibration of risk assessment tools and discuss implications in healthcare practice.

Abstract: Medical decision support tools are increasingly available on the Internet and are being used by lay persons as well as health care professionals. The goal of some of these tools is to provide an "individualized" prediction of future health care related events (e.g.,  prognosis of breast cancer given specific information about the individual). Under the umbrella of "personalized" medicine, these individualized prognostic assessments are sought as a means to replace general prognostic information with specific probability estimates that pertain to a small stratum to which the patient belongs, and ultimately specifically to each patient. Subsequently, these estimates are used to inform decision making and are therefore of critical importance for public health. In this presentation, I will discuss the problems with assessing the quality of individual estimates, present existing and proposed tools for evaluating prognostic models, and discuss implications for individual counseling.

This should be an interesting talk, and very relevant to current regulatory issues surrounding personal genomics.

Monday, August 9, 2010

Quickly Find the Class of data.frame vectors in R

Aviad Klein over at My ContRibution wrote a convenient R function to list the classes of all the vectors that make up a data.frame. You would think apply(kyphosis,2,class) would do the job but it doesn't - it calls every vector a character class. Aviad wrote an elegant little function that does the job perfectly without having to load any external package:  

allClass<-function(x) {unlist(lapply(unclass(x),class))}.

Here it is in action:

> # load the CO2 dataset
> data(CO2)
> # look at the first few rows
> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
> # this doesn't work
> apply(CO2,2,class)
      Plant        Type   Treatment        conc      uptake 
"character" "character" "character" "character" "character" 
> # this does
> allClass <- function(x) {unlist(lapply(unclass(x),class))}
> allClass(CO2)
   Plant1    Plant2      Type Treatment      conc    uptake 
"ordered"  "factor"  "factor"  "factor" "numeric" "numeric" 

Nice tip, Aviad.

Tuesday, August 3, 2010

Convert PLINK output to tab or comma delimited CSV using Perl

Last week Will showed you a bash script version of a sed command covered here a while back that would convert PLINK output from the default variable space-delimited format to a more database-loading-friendly tab or comma delimited file. A commenter asked how to do this on windows, so I'll share the way I do this using a perl script which you can use on windows after installing ActivePerl. First copy the code below and save the file as cleanplink.pl somewhere in your path.


# cleanplink.pl
# (c) Stephen D. Turner 2010 http://www.stephenturner.us/
# This is free open-source software.
# See http://gettinggeneticsdone.blogspot.com/p/copyright.html

my $help = "\nUsage: $0 <input whitespace file> <tab or comma>\n\n";
die $help if @ARGV<2;

die $help unless ($delimiter=~/tab/i|$delimiter=~/comma/i);

if ($delimiter =~ /comma/i) {
    foreach (@inputfiles) {

        open (IN,"<$_");
        open (OUT,">$_.csv");
        while (<IN>) {
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/,/g; #Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
} elsif ($delimiter =~ /tab/i) {
    foreach (@inputfiles) {
        open (IN,"<$_");
        open (OUT,">$_.tab");
        while (<IN>) {
            $_ =~ s/^\s+//;  #Trim whitespace at beginning
            $_ =~ s/\s+$//;  #Trim whitespace at end
            $_ =~ s/\s+/\t/g;#Remaining whitespace into commas
            #$_ =~ s/NA/-9/g;#If you want to recode NA as -9
            print OUT "$_\n";
} else {
    die $help;
Run the program with the first argument(s) as the PLINK output file(s) you want to convert, and the last argument being either "comma" or "tab" without the quotes. It'll create another file in the current directory ending with either .csv or .tab. Look below to see cleanplink.pl in action.

turnersd@provolone:~/tmp$ ls
turnersd@provolone:~/tmp$ cat plink.qassoc
 CHR         SNP         BP    NMISS       BETA         SE         R2        T            P
   1   rs3094315     742429     3643    -0.2461     0.2703  0.0002275  -0.9102       0.3628
   1  rs12562034     758311     3644    -0.1806     0.3315  8.149e-05  -0.5448       0.5859
   1   rs3934834     995669     3641    0.04591     0.2822  7.271e-06   0.1627       0.8708
   1   rs9442372    1008567     3645     0.1032     0.2063  6.868e-05   0.5002       0.6169
   1   rs3737728    1011278     3644     0.1496     0.2268  0.0001195   0.6598       0.5094
   1   rs6687776    1020428     3645    -0.5378     0.2818   0.000999   -1.909      0.05639
   1   rs9651273    1021403     3643     0.2002     0.2264  0.0002149   0.8847       0.3764
   1   rs4970405    1038818     3645    -0.4994     0.3404  0.0005903   -1.467       0.1425
   1  rs12726255    1039813     3645    -0.4515     0.2956  0.0006398   -1.527       0.1268
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc comma
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv
turnersd@provolone:~/tmp$ cat plink.qassoc.csv
turnersd@provolone:~/tmp$ cleanplink.pl plink.qassoc tab
turnersd@provolone:~/tmp$ ls
plink.qassoc  plink.qassoc.csv  plink.qassoc.tab
turnersd@provolone:~/tmp$ cat plink.qassoc.tab
CHR     SNP     BP      NMISS   BETA    SE      R2      T       P
1       rs3094315       742429  3643    -0.2461 0.2703  0.0002275       -0.9102 0.3628
1       rs12562034      758311  3644    -0.1806 0.3315  8.149e-05       -0.5448 0.5859
1       rs3934834       995669  3641    0.04591 0.2822  7.271e-06       0.1627  0.8708
1       rs9442372       1008567 3645    0.1032  0.2063  6.868e-05       0.5002  0.6169
1       rs3737728       1011278 3644    0.1496  0.2268  0.0001195       0.6598  0.5094
1       rs6687776       1020428 3645    -0.5378 0.2818  0.000999        -1.909  0.05639
1       rs9651273       1021403 3643    0.2002  0.2264  0.0002149       0.8847  0.3764
1       rs4970405       1038818 3645    -0.4994 0.3404  0.0005903       -1.467  0.1425
1       rs12726255      1039813 3645    -0.4515 0.2956  0.0006398       -1.527  0.1268
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.