A few months ago I covered an algorithm called EMMA (Efficient Mixed-Model Association), implemented in R, that simultaneously corrects for both population stratification and relatedness in an association study. This method and its software are very useful because most methods that account for relatedness in an association study assume a genetically (ethnically) homogeneous population, while methods that detect and correct for population stratification typically assume individuals are unrelated. The EMMA algorithm accounts for both types of population structure at once by using a linear mixed model with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects.

The original EMMA algorithm, however, is computationally infeasible for datasets with thousands of individuals because the variance component parameters are estimated separately for each marker. On the authors' large GWAS dataset this takes about 10 minutes per marker, so a full scan would take over 6 years on a single processor. A new implementation of the algorithm called EMMAX (Efficient Mixed-Model Association eXpedited) makes the simplifying assumption that, because the effect of any given SNP on the trait is typically small, the variance parameters only need to be estimated once for the entire dataset rather than once for each marker.
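To make the "estimate once, then scan" idea concrete, here is a minimal NumPy sketch. It is not the EMMAX code: the function names (`whiten`, `profile_loglik`, `fit_null_delta`, `scan_markers`) are made up for illustration, and the variance ratio is picked by a simple maximum-likelihood grid search rather than the REML optimization that EMMA itself describes. The kinship matrix `K`, phenotype vector `y`, and genotype matrix `G` (individuals by markers) are assumed inputs.

```python
import numpy as np


def whiten(s, U, delta, M):
    """Rotate M by the eigenvectors of K and rescale each row so that the
    covariance K + delta * I becomes the identity in the new coordinates."""
    return (U.T @ M) / np.sqrt(s + delta)[:, None]


def profile_loglik(delta, s, U, y, X):
    """ML log-likelihood of y ~ N(X b, sg2 * (K + delta * I)), profiled over
    the fixed effects b and the genetic variance sg2."""
    n = len(y)
    yw = whiten(s, U, delta, y[:, None])[:, 0]
    Xw = whiten(s, U, delta, X)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    rss = np.sum((yw - Xw @ beta) ** 2)
    return -0.5 * (n * np.log(2 * np.pi * rss / n) + n + np.sum(np.log(s + delta)))


def fit_null_delta(y, K, X0, grid=None):
    """Estimate delta = (residual variance) / (genetic variance) once,
    under the null model that contains no marker at all."""
    if grid is None:
        grid = np.logspace(-5, 5, 101)  # assumed search range
    s, U = np.linalg.eigh(K)
    s = np.clip(s, 0.0, None)  # guard against tiny negative eigenvalues
    ll = [profile_loglik(d, s, U, y, X0) for d in grid]
    return float(grid[int(np.argmax(ll))]), s, U


def scan_markers(y, K, G, covariates=None):
    """Test every column of G with the variance ratio held fixed.

    Returns the estimated delta and one t-statistic per marker."""
    n = len(y)
    if covariates is None:
        X0 = np.ones((n, 1))
    else:
        X0 = np.column_stack([np.ones(n), covariates])
    delta, s, U = fit_null_delta(y, K, X0)  # the only variance-component fit

    # Whiten everything once; each marker test is then ordinary least squares.
    yw = whiten(s, U, delta, y[:, None])[:, 0]
    X0w = whiten(s, U, delta, X0)
    Gw = whiten(s, U, delta, G)

    tstats = np.empty(G.shape[1])
    for j in range(G.shape[1]):
        Xw = np.column_stack([X0w, Gw[:, j]])
        XtX_inv = np.linalg.inv(Xw.T @ Xw)
        beta = XtX_inv @ Xw.T @ yw
        resid = yw - Xw @ beta
        sigma2 = resid @ resid / (n - Xw.shape[1])
        tstats[j] = beta[-1] / np.sqrt(sigma2 * XtX_inv[-1, -1])
    return delta, tstats
```

The expensive step (the eigendecomposition and likelihood search) runs once in `fit_null_delta`; the per-marker loop only solves a small least-squares problem. A per-marker approach in the style of the original EMMA would instead repeat the likelihood search inside that loop.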

In the paper the authors take the Northern Finland Birth Cohort and estimate genomic control inflation factors (lambda) for three sets of test statistics: uncorrected, adjusted for the top 100 principal components using Eigenstrat, and corrected for structure using the EMMAX algorithm. The inflation factors were closest to 1 for the EMMAX-corrected tests. Further, whereas genomic control simply scales all test statistics downward without changing their rank, the EMMAX correction does change the ranks of the test statistics across SNPs.
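For readers who have not computed an inflation factor before, here is a small sketch of the usual definition: the median of the observed 1-df chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The function names are my own, and the input is assumed to be an array of 1-df chi-square test statistics.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def genomic_inflation(chi2_stats):
    """Inflation factor: observed median statistic over the expected median."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)


def genomic_control_adjust(chi2_stats):
    """Classic genomic control: divide every statistic by the inflation factor.

    Every statistic is divided by the same positive number, so the
    ordering of the SNPs is unchanged."""
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    return chi2_stats / genomic_inflation(chi2_stats)
```

Because the adjustment is a single rescaling, `np.argsort` of the adjusted statistics matches that of the raw ones. A mixed-model correction has no such constraint, which is why the SNP ranks can move.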

A beta version of EMMAX is available online, with a complete version to be released soon. Conveniently, the software is able to take a PLINK transposed ped file and covariate files as input (tped and tfam documentation here).
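If you have not worked with transposed ped files, each line of a PLINK tped file describes one SNP: chromosome, SNP identifier, genetic distance, base-pair position, then two alleles per individual, with `0` marking a missing allele. The sketch below reads one such line and converts it to minor-allele counts. The helper names are hypothetical, and this says nothing about how EMMAX itself parses or encodes genotypes.

```python
from collections import Counter


def parse_tped_line(line):
    """Split one tped line into its four leading fields and a list of
    (allele1, allele2) pairs, one pair per individual."""
    fields = line.split()
    chrom, snp_id = fields[0], fields[1]
    genetic_distance, bp_position = float(fields[2]), int(fields[3])
    alleles = fields[4:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per individual")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return chrom, snp_id, genetic_distance, bp_position, genotypes


def minor_allele_counts(genotypes, missing="0"):
    """Count copies of the less frequent allele for each individual.

    Individuals with a missing allele get None; a monomorphic SNP
    yields 0 for every non-missing individual."""
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    minor = min(counts, key=counts.get) if len(counts) > 1 else None
    result = []
    for pair in genotypes:
        if missing in pair:
            result.append(None)
        else:
            result.append(sum(a == minor for a in pair))
    return result
```

For example, `parse_tped_line("1 rs1 0 1000 A A A G A A 0 0")` gives four genotype pairs, the last of which is missing.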

Nature Genetics Technical Report - Variance component model to account for sample structure in genome-wide association studies

I just wanted to mention that this post won you the shortest post title of the month :)

Best,

Tal

Hi.

Great blog, I love reading it and learning from it!

From the perspective of a biochemist, I just wonder what the differences are between EMMAX and ProbABEL, as both implement a fast mixed model. Both papers state that their approach makes the mixed model feasible for GWAS datasets. Unfortunately, I found no direct comparison with the ProbABEL algorithm in the EMMAX paper. Could anyone advise which software is best in which situation?

Thanks!

Holger

Hi, I just wanted to mention that there is a second method published in the same issue of Nature Genetics that addresses the same issue and solves it more efficiently than EMMAX does. This new method was developed by the original group of people who introduced mixed-model association analysis in the first place in 2007: http://www.nature.com/ng/journal/v38/n2/full/ng1702.html

The link to the new method is below:

Zhang et al. Mixed linear model approach adapted for genome-wide association studies

http://www.nature.com/ng/journal/v42/n4/abs/ng.546.html