Thursday, April 29, 2010

More on the McClellan / King GWAS essay

First, if you haven't read the comments on my previous post on this paper, go take a look. Thanks to everyone for sharing your thoughts and pointing out some of my own oversights regarding this paper.

One issue in particular deserves more attention than just another comment thread. McClellan and King draw special attention to a study by Kai Wang et al. (2009), Common genetic variants on 5p14.1 associate with autism spectrum disorders, Nature 459:528. A big thanks to Kai Wang for pointing out this particularly egregious misrepresentation by McClellan and King, and the emphasis I added in my own all-too-cursory review. McClellan and King discuss rs4307059, reported by Wang et al. to be associated with autism, as a “particularly dramatic example of the perils of cryptic population stratification”, reasoning that the substructure is a result of large frequency differences across Europe and the allele's fixation in Africa. In fact, the frequency of this SNP is fairly consistent across large cohorts of European ancestry: European Americans (MAF=39%), WTCCC (MAF=38%), POPRES British (MAF=39%), POPRES Spanish (MAF=37%). The extreme estimates (.21-.77) come from extremely small samples (n=7 in Tuscany, MAF=75%, and n=15 in the Orcadian sample, MAF=25%). These sample sizes are way too small to estimate allele frequencies with any stability. In fact, you can see the allele frequency distribution across 51 populations here, which shows that it's quite similar across most of Europe.

Further, using the full Fst data set (which can be downloaded directly at this link), if you sort all Illumina SNPs by the variation in their allele frequencies (more precisely, Fst), rs4307059 lies right in the middle. In other words, it is fairly normal for a SNP with a similar MAF to show this much allele frequency variation among subpopulations in Europe or in HapMap.
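To put rough numbers on the small-sample point above, here is a minimal sketch. It assumes a simple binomial (Wald) standard error on 2n independently sampled chromosomes, which is crude for samples this small. The function names and the "large cohort" size of 5000 are my own illustrative assumptions, not anything from Wang et al.

```python
import math


def maf_standard_error(maf, n_individuals):
    """Binomial standard error of an allele frequency estimated from
    2 * n_individuals chromosomes (assumes independent chromosomes)."""
    n_chromosomes = 2 * n_individuals
    return math.sqrt(maf * (1.0 - maf) / n_chromosomes)


def approx_95_interval(maf, n_individuals):
    """Rough Wald interval, clipped to [0, 1]; crude for very small n."""
    half_width = 1.96 * maf_standard_error(maf, n_individuals)
    return max(0.0, maf - half_width), min(1.0, maf + half_width)


if __name__ == "__main__":
    # n=7 and n=15 are the sample sizes quoted in the post;
    # n=5000 is a hypothetical large cohort used only for contrast.
    samples = [
        ("Tuscan", 0.75, 7),
        ("Orcadian", 0.25, 15),
        ("hypothetical large cohort", 0.39, 5000),
    ]
    for label, maf, n in samples:
        low, high = approx_95_interval(maf, n)
        print(f"{label}: n={n}, estimate={maf:.2f}, "
              f"approx. 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval around the n=7 estimate spans roughly 0.52 to 0.98, so a single small sample says very little about the true frequency in that population.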

There are a few other issues pointed out in the comment thread that deserve attention. McClellan and King asserted, and I emphasized, that most GWAS hits do not replicate. While it's definitely true that nonreplication was a huge issue in genetic association studies in the past and in the early days of GWAS, most GWAS hits that are genome-wide significant (e.g. p<1e-8) DO replicate, and studies done with family designs, which can't be explained away by population stratification, add further evidence that many of these associations are genuine. And simply because a SNP lies outside a region with known biological function doesn't mean we should wave it off so easily. There's a nice discussion of this over at Gene Expression.

Tuesday, April 27, 2010

Discovering New Disease Genes Using Orthologous Phenotypes in Model Organisms

Check out this paper in PNAS and the corresponding synopsis in the New York Times. The authors take a unique approach to finding genes likely to be associated with human traits using orthologous phenotypes in model organisms, or phenologs. The idea is simple. The authors have a database of ~2000 disease-associated genes in humans. To this database they added another ~200,000 gene-trait associations in model organisms including mice, yeast, worms, and plants. Then they look for overlapping sets of orthologous genes from these organisms to identify phenotypes in the model organisms. The related genes causing orthologous phenotypes, or phenologs, are predictive of genes causing disease in humans. For example, the authors found genes responsible for angiogenesis using yeast, breast cancer associated genes in C. elegans, and even genes responsible for deafness using plants.
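One common way to score an overlap between two gene sets like this is a hypergeometric tail probability. The sketch below uses that test as an illustration only; I'm assuming it is a reasonable stand-in rather than claiming it is exactly the authors' procedure, and every count in the example is made up.

```python
from math import comb


def overlap_p_value(n_orthologs, n_human_set, n_model_set, n_overlap):
    """P(overlap >= n_overlap) if the n_model_set genes tied to a model
    organism phenotype were drawn at random from n_orthologs shared genes,
    n_human_set of which are associated with the human disease."""
    total = comb(n_orthologs, n_model_set)
    max_overlap = min(n_human_set, n_model_set)
    tail = sum(
        comb(n_human_set, k) * comb(n_orthologs - n_human_set, n_model_set - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 2000 shared orthologs, 20 linked to a human
    # disease, 30 linked to a model organism phenotype, 5 in both sets.
    p = overlap_p_value(2000, 20, 30, 5)
    print(f"chance of an overlap this large or larger: {p:.3g}")
```

A small probability suggests the two phenotypes share more genes than chance would give, which is what would make the remaining genes in the model organism set interesting candidates for the human disease.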

I remember seeing a talk about this at this year's Pacific Symposium on Biocomputing. You can learn more about the methodology, download all the original data used in the paper, and build your own phenolog database, which could be very useful for disease gene prediction or prioritization of GWAS hits for followup.

PNAS: Systematic discovery of nonobvious human disease models through orthologous phenotypes

New York Times: The Search for Genes Leads to Unexpected Places

Systematic discovery of non-obvious disease models and candidate genes

Monday, April 26, 2010

How today's scientific culture affects young scientists (This is good.)

Here's a very good 3-page essay on how modern scientific policy and culture (e.g. short-term funding, unstable job security, publish-or-perish mindset) is adversely affecting young scientists, causing lots of bright minds to abandon academia in search of other careers (via @WileyScience).

BioEssays: How today's scientific culture affects young scientists

Anyone care to comment?

Abstract: Surviving in academia has become a headache for many young scientists. But not only for them - older researchers too are increasingly preoccupied with the state of science policy and the procedures of scientific evaluation. We here analyze the pressures that prospective scientists like us feel today and compare them to what researchers from the past witnessed. What emerges is that science has undergone a profound cultural change that would have prevented some scholars from the 19th century from making their breakthroughs. While the inner motivation of most scientists at that time was to satisfy their appetite for knowledge, the modern raison d'être of scientists mostly addresses how to provide a living for themselves. Nevertheless, the general awareness of this situation amongst scientists suggests that there is space for an open debate and reform.

Friday, April 23, 2010

Top 10 Algorithms in Data Mining

The authors here invited ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate up to 10 best-known algorithms in data mining, including the algorithm name, justification for nomination, and a representative publication reference. The nominations were then voted on by other IEEE and ACM award winners to narrow them down to a top 10 list. These algorithms are used for association analysis, classification, clustering, statistical learning, and much more. You can read the paper here.

Here are the winners:
  1. C4.5
  2. The k-Means algorithm
  3. Support Vector Machines
  4. The Apriori algorithm
  5. Expectation-Maximization
  6. PageRank
  7. AdaBoost
  8. k-Nearest Neighbor Classification
  9. Naive Bayes
  10. CART (Classification and Regression Trees)
The 2007 paper gives a brief overview of what each method is commonly used for and how it works, along with lots of references. It also has a much more detailed description of how these winners were selected than what I've said here.

The exciting thing is I've seen nearly all of these algorithms used for mining genetic data for complex patterns of genetic and environmental exposures that influence complex disease. See some recent papers at EvoBio and PSB. Further, lots of these methods are implemented in several R packages.
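To give a flavor of how simple some of these methods are at their core, here is a minimal sketch of k-nearest neighbor classification (number 8 on the list) in plain Python rather than R. The toy genotype vectors (coded 0/1/2) and case/control labels are invented for illustration and come from neither the paper nor any real data set.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest
    training points (Euclidean distance)."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    # If the vote is tied, the label counted first among the neighbors wins.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical genotypes at three SNPs, coded as minor allele counts.
    genotypes = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (2, 2, 1), (2, 1, 2), (1, 2, 2)]
    status = ["control", "control", "control", "case", "case", "case"]
    print(knn_predict(genotypes, status, (2, 2, 2)))
    print(knn_predict(genotypes, status, (0, 0, 0)))
```

Real analyses would of course need cross-validation and some thought about the distance measure, which is where the packaged implementations earn their keep.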

Top 10 Algorithms in Data Mining (PDF)

Thursday, April 22, 2010

Havasupai Indians and the Ethical Use of Data

The recent settlement between Arizona State University and the Havasupai Indian tribe is calling attention to (and perhaps challenging) the ideas of informed consent. While I'm sure there are arguments to be made supporting both sides of this case, regardless of your position this is an excellent reminder that there are people's lives behind the alleles in our spreadsheets and PED files.

Wednesday, April 21, 2010

Checklist: Statistical Problems to Document and Avoid

Update 2010-04-21: I forgot to post the link last time. That would have been helpful. Here you go:

Vanderbilt Biostatistics: Statistical Problems to Document and to Avoid


At the Regression Modeling Strategies course I attended a few weeks ago, Frank Harrell pointed out the checklist on the biostatistics department's website of statistical problems to document and avoid. The recommendation was that authors of any paper employing statistical analysis go through this checklist before writing and submitting a manuscript. Some of the topics include:

  • Design and sample size issues
  • Inefficient use of continuous variables (don't categorize!)
  • Assumptions of parametric tests
  • Inappropriate analysis of repeated measures data
  • P-value interpretation
  • Filtering results
  • Missing data
  • Multiple testing concerns
  • Model building and specification
  • Use of stepwise variable selection (don't do it)
  • Overfitting
Be sure to check this out before writing up results, and ideally before you even plan any experiments, especially if you are relatively new to quantitative analysis.

Unix and Perl for Biologists

This looks like a must-read for anyone starting out in computational biology without extensive experience at the command line. The 135-page document linked at the bottom of the Google Group page looks like an excellent primer with lots of examples that could probably be completed in a day or two, and provides a great start for working in a Linux/Unix environment and programming with Perl. This started out as a graduate student course at UC Davis, and is now freely available for anyone who wants to learn Unix and Perl. Also, don't forget about the printable Linux command line cheat sheet I posted here long ago.

Google Groups: Unix and Perl for Biologists

Friday, April 16, 2010

My thoughts on King's "Genetic Heterogeneity" essay in Cell

Update Thursday, April 29, 2010: See further commentary at a newer post here.

Just finished reading Jon McClellan and Mary-Claire King's Genetic Heterogeneity in Human Disease essay in Cell. It's definitely one of the most forthright and compelling essays I've read on the subject of the inadequacy of GWAS for identifying genes that cause complex human disease. The essay starts with an evolutionary perspective. Most human variation is relatively ancient - originating in ancient human populations long before the migration out of Africa. Yet new alleles arise constantly, and because of the relatively recent human population growth, we can be certain that most alleles are actually recent and rare. For a common allele to remain in the population it must withstand evolutionary pressure. If the variation is pathogenic, it must either (1) lead to disease later in life so as not to affect fitness (e.g. Alzheimer's Disease, AMD), or (2) be balanced by positive selection (e.g. hemoglobin genes which cause sickle cell anemia are balanced by positive selection from malaria resistance).

The authors then dive into heterogeneity, citing many examples of human diseases which display both locus heterogeneity (mutations in many different genes lead to the same disease), and allelic heterogeneity (many mutations in the same gene cause the same disease). The authors discuss early-onset breast and ovarian cancer, inherited hearing loss, genetics of lipid metabolism, and severe mental illnesses such as autism or schizophrenia.

Next comes a very nice discussion of the common-disease-common-variant (CDCV) hypothesis and GWAS. Thousands of "risk variants" have been identified from GWAS, yet most of these have no apparent biological function. Since most genotyping platforms select for common variants, and because evolution has ensured that most common variants are neutral, the authors argue that most GWAS findings are neutral, stemming from factors other than a true association with disease risk.

For one, the authors cite a problem we're all well aware of: population stratification. We tend to think that if we eliminate ethnic outliers or control for stratification with PCA or the like, then we've eliminated the problem. Yet the authors point to a GWAS of autism recently published in Nature that provides a striking example of the problem hypervariable alleles can cause. That study found an association with a SNP which had a frequency of 0.65 in cases and 0.61 in controls. All cases and controls were of European descent. Yet the frequency of the risk variant varies from 0.21 to 0.77 across European populations! (N.b. - see the discussion of this point in a newer post). This range in frequency across European populations is 14 times larger than the frequency difference between cases and controls! Even very minimal differences in ancestry between cases and controls could have explained this association rather than true association with autism.
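The arithmetic behind this argument is easy to check. The sketch below takes the 0.21 and 0.77 figures at face value (which, per the N.b. above, is questionable) and assumes a deliberately simplified model of my own in which every sample is a mix of just two subpopulations with those frequencies. The function names are illustrative.

```python
def mixture_frequency(weight_high, freq_high, freq_low):
    """Expected allele frequency in a sample drawn from two subpopulations,
    with weight_high being the fraction from the high-frequency one."""
    return weight_high * freq_high + (1.0 - weight_high) * freq_low


def ancestry_weight_for(target_freq, freq_high, freq_low):
    """Fraction of high-frequency ancestry needed to produce target_freq."""
    return (target_freq - freq_low) / (freq_high - freq_low)


if __name__ == "__main__":
    freq_low, freq_high = 0.21, 0.77     # range quoted in the essay
    case_freq, control_freq = 0.65, 0.61 # frequencies quoted in the essay

    ratio = (freq_high - freq_low) / (case_freq - control_freq)
    print(f"population range / case-control difference: {ratio:.1f}")

    w_case = ancestry_weight_for(case_freq, freq_high, freq_low)
    w_control = ancestry_weight_for(control_freq, freq_high, freq_low)
    print(f"ancestry fraction in cases:    {w_case:.3f}")
    print(f"ancestry fraction in controls: {w_control:.3f}")
    print(f"difference needed:             {w_case - w_control:.3f}")
```

In this toy model a difference of about 0.07 in ancestry proportion between cases and controls is enough to reproduce the 0.65 versus 0.61 split with no disease effect at all, which is the essay's point. Whether the 0.21 to 0.77 range is real is a separate question.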

The authors do give a few examples of where common variants truly affect a common disease (hemoglobin genes and sickle-cell anemia, autoimmune disorders and the MHC region, Alzheimer's disease and APOE, lactose intolerance and alleles in the lactase gene enhancer region). Yet these examples prove two points. (1) All the variants in these genes have a demonstrable effect on the protein or its expression, as opposed to most GWAS findings, and (2) back to the evolutionary perspective, all of these genes have reason to remain common because of their evasion of evolutionary pressure, because they either do not affect reproductive fitness, or are balanced by positive selection.

The authors conclude by offering potential paths going forward, utilizing high-throughput sequencing technologies. One of the problems with sequencing data is not just finding potentially deleterious mutations, but determining which of the many potentially deleterious mutations actually play a role in human disease. One of the most promising strategies is to use next-gen sequencing to trace coinheritance of potential disease-causing alleles with disease within affected families - essentially linkage analysis. Finally, the authors assert that replication in genetics studies should focus on the identification and confirmation of multiple biologically relevant mutations in the same gene. This would provide both biological and epidemiological support for the causality of the gene or pathway in the pathogenesis of the disease.

This essay is definitely worth a read.

Cell: Genetic Heterogeneity in Human Disease

Update Tuesday, April 27, 2010: Keep an eye out over at Genetic Future for an upcoming post pointing out some of the problems with this paper I didn't consider here.

Cell: Genetic Heterogeneity in Human Disease Review

Check out this review essay in Cell: Genetic Heterogeneity in Human Disease, by Jon McClellan and Mary-Claire King. (King's lab, incidentally, was the group that discovered via linkage analysis that the gene for early-onset breast and ovarian cancer lies on chromosome 17q21, nearly 5 years before Myriad Genetics filed for patent protection on the BRCA1/2 genes.) Anyhow, looks like a great review on genetic heterogeneity and GWAS. Thanks @JVJAI.

Cell: Genetic Heterogeneity in Human Disease

UPDATE 4/16/2010: See my synopsis and thoughts on this essay here.

UPDATE 4/29/2010: See further thoughts on this essay here.

Wednesday, April 14, 2010

Cancer Biostatistics Workshop: Overfitting

This month's cancer biostatistics workshop on overfitting will be given by Fei Ye and Zhiguo (Alex) Zhao, both in the Department of Biostatistics and the Cancer Biostatistics Center. This looks like a good one, especially after attending Frank Harrell's regression modeling strategies course a few weeks ago. See the link below for the full 2010 series.

2010 Cancer Biostatistics Workshop Series

Journal Club 4/16/2010

Our Program in Computational Genomics Journal Club is starting again, now the 3rd Friday of each month. The next meeting is this Friday, April 16, at 3pm in the CHGR conference room. As usual, please bring in any articles you've found recently and give a brief overview of why you found them interesting. Also, take a look at these papers related to BioVU by investigators at Vanderbilt:

The BioVU demonstration project:

Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, Basford MA, Brown-Gentry K, Balser JR, Masys DR, Haines JL, Roden DM. Robust Replication of Genotype-Phenotype Associations across Multiple Diseases in an Electronic Medical Record. Am J Hum Genet. 2010 Mar 31.


The PheWAS:

Denny JC, Ritchie MD, Basford M, Pulley J, Bastarache L, Brown-Gentry K, Wang D, Masys DR, Roden DM, Crawford DC. PheWAS: Demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010 Mar 24.

And finally, the PNAS paper from Brad Malin's group, which has generated quite a lot of press (Nature News, Technology Review, Genomics Law Report):

Loukides G, Gkoulalas-Divanis A, Malin B. Anonymization of electronic medical records for validating genome-wide association studies. PNAS.

Tuesday, April 13, 2010

Efficient Mixed-Model Association in GWAS using R

I recently did an analysis for the eMERGE network where I had lots of individuals from a small town in central Wisconsin, many of whom were related to one another. The subjects could not be treated as independent, but I could not use a family-based design either. I ended up taking a mixed-model approach with the previously mentioned GenABEL. You can read about the method here (PubMed).

While researching which methods to use, I ran into a potential problem. All of the methods that account for relatedness (including the method mentioned above) assume you have an ethnically homogeneous population. Yet all of the methods which look for population stratification (Eigenstrat, Structure, etc.) assume samples are unrelated. So what do you do if you have both population stratification AND a high level of relatedness among your samples?

A few weeks ago our graduate student association invited and hosted Dr. Elaine Ostrander here from the NIH to talk about her work with gene mapping in dogs. She mentioned a method she used called Efficient Mixed-Model Association (EMMA) for performing association mapping while simultaneously correcting for relatedness and population structure. Using multiple highly inbred dog breeds represents the extreme case of simultaneously having to deal with substructure, inbreeding, and relatedness. If this method works for association mapping combining several purebred dog breeds, it should work for a less problematic human dataset as well.

EMMA is also implemented in R. You can download the necessary R package from the project's website below.

Efficient Mixed-Model Association (EMMA) website

PubMed: Efficient control of population structure in model organism association mapping.

Abstract: Genomewide association mapping in model organisms such as inbred mouse strains is a promising approach for the identification of risk factors related to human diseases. However, genetic association studies in inbred model organisms are confronted by the problem of complex population structure among strains. This induces inflated false positive rates, which cannot be corrected using standard approaches applied in human association studies such as genomic control or structured association. Recent studies demonstrated that mixed models successfully correct for the genetic relatedness in association mapping in maize and Arabidopsis panel data sets. However, the currently available mixed-model methods suffer from computational inefficiency. In this article, we propose a new method, efficient mixed-model association (EMMA), which corrects for population structure and genetic relatedness in model organism association mapping. Our method takes advantage of the specific nature of the optimization problem in applying mixed models for association mapping, which allows us to substantially increase the computational speed and reliability of the results. We applied EMMA to in silico whole-genome association mapping of inbred mouse strains involving hundreds of thousands of SNPs, in addition to Arabidopsis and maize data sets. We also performed extensive simulation studies to estimate the statistical power of EMMA under various SNP effects, varying degrees of population structure, and differing numbers of multiple measurements per strain. Despite the limited power of inbred mouse association mapping due to the limited number of available inbred strains, we are able to identify significantly associated SNPs, which fall into known QTL or genes identified through previous studies while avoiding an inflation of false positives. An R package implementation and webserver of our EMMA method are publicly available.

Tuesday, April 6, 2010

ProbABEL - package for GWAS analysis of imputed data

I've been using GenABEL for some time now for GWAS analysis with related individuals. It has an excellent set of functions for estimating a kinship matrix from a dense marker panel and then using this in a linear mixed effects model to allow for related individuals in the analysis of a quantitative trait. GenABEL also has many other nice features for analysis and visualization of GWAS data that you can't find in PLINK, and it's free, cross-platform, and implemented in R. I'll write another post about GenABEL later, but here I wanted to note that GenABEL's creator, Yurii Aulchenko, released another package called ProbABEL for genome-wide association analysis of imputed data. ProbABEL can analyze quantitative, binary, and survival outcomes while taking imputation uncertainty into account.
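For a sense of what "estimating a kinship matrix from a dense marker panel" involves, here is a minimal sketch of one widely used marker-based relationship estimator, written in plain Python. I'm not claiming this matches GenABEL's internals or scaling; the function name and the toy genotypes are assumptions for illustration only.

```python
def relationship_matrix(genotypes):
    """Marker-based relationship estimates.

    genotypes: one list per individual of 0/1/2 allele counts, all of the
    same length. Each marker is centered by twice its allele frequency p
    and scaled by 2p(1-p); the result is averaged over polymorphic markers.
    """
    n_individuals = len(genotypes)
    n_markers = len(genotypes[0])
    totals = [[0.0] * n_individuals for _ in range(n_individuals)]
    used = 0
    for k in range(n_markers):
        column = [person[k] for person in genotypes]
        p = sum(column) / (2.0 * n_individuals)
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic in this sample, so it carries no information
        scale = 2.0 * p * (1.0 - p)
        centered = [g - 2.0 * p for g in column]
        for i in range(n_individuals):
            for j in range(n_individuals):
                totals[i][j] += centered[i] * centered[j] / scale
        used += 1
    if used == 0:
        raise ValueError("no polymorphic markers to estimate relatedness from")
    return [[value / used for value in row] for row in totals]


if __name__ == "__main__":
    # Hypothetical genotypes: individuals 0 and 2 are identical at both markers.
    toy = [[0, 2], [2, 0], [0, 2], [2, 0]]
    for row in relationship_matrix(toy):
        print(["%.1f" % value for value in row])
```

With only two markers the estimates are meaningless in practice; the point of a dense panel is that averaging over many thousands of markers makes these pairwise values stable enough to plug into a mixed model.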

BMC Bioinformatics - ProbABEL package for genome-wide association analysis of imputed data

GenABEL homepage

GenABEL tutorial and reference manual

ProbABEL manual

Friday, April 2, 2010

Professor Bush

A short announcement - my friend, colleague, running partner, and GGD contributor Will Bush is now an assistant professor in the Department of Biomedical Informatics, and investigator in the Center for Human Genetics Research here at Vanderbilt.

Thursday, April 1, 2010

Frank Harrell's Regression Modeling Strategies Course Handouts

The previously mentioned Regression Modeling Strategies short course taught by Frank Harrell is nearly over. Here are the handouts (PDF) from the course.

Keep an eye out here: I'll be writing a few more posts in the near future on topics Frank covered in this course.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.