Monday, October 19, 2009

Annotated LD-Plots

Often when presenting statistics from a candidate gene study, or a region of interest from a genome-wide association study, it is useful to see various SNP-wise values in the context of linkage disequilibrium patterns. To aid the interpretation of regional association results, we developed a tool called LD-Plus that presents a familiar Haplo-view style correlation plot that is annotated with haplotypes, haplotype blocks, and haplotype frequencies, along with continuous and categorical SNP statistics. An example plot is shown below (click for the larger version).

This plot design displays p-values from association or other types of statistics in the context of quality control measures such as genotyping efficiency, HWE p-values, and minor allele frequency. It also displays association signals relative to LD structure, so the user can clearly see if multiple SNPs in the same LD block show evidence for association. The haplotypes and their relative frequencies are shown along the top of the figure, and gray shading carries the haplotype structure up through the other statistics in the figure. Categorical stats, such as inclusion on genotyping platforms, whether a SNP was imputed or typed, or the consequence of the SNP (coding/non-coding) can be displayed also.

LD-Plus is available via a web interface, and is an evolving project. To give it a try, visit Details on how to properly format input files is available under the "Documentation" link on the site. We'd love your feedback on the design, and what other features we might include in the design!

Thursday, October 15, 2009

Check out the GGD poster at ASHG!

Aloha! If you're going to the American Society of Human Genetics meeting next week, come by our poster (poster #116, abstract #1567/T) and say hello! We would love to meet any of you who read GGD and hear what you think and what you'd like to see in the future.

Wednesday, October 14, 2009

Free Book: Elements of Statistical Learning

The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, one of the best books on data mining and machine learning, is now available free in PDF format.

Download it here or view it online here.

Friday, October 9, 2009

Visualizing sample relatedness in a GWAS using PLINK and R

Strict quality control procedures are extremely important for any genome-wide association study.  One of the first steps you should take when running QC on your GWAS is to look for related samples in your dataset.  This does two things for you.  First, you can get an idea of how many related samples you have in your dataset, and second, if you have access to self-report relationship information, you can identify potential sample mix-ups based on discrepancies between genetic information and self-report.

PLINK allows you to estimate genomewide IBD-sharing coefficients between seemingly unrelated individuals from whole-genome data.  You can read lots more about the particular method they use to infer IBD given IBS and allele frequencies in the PLINK paper.  It's relatively straightforward to compute these estimates.  If you already have your data in pedfile format, use the following command (assuming mydata.ped and are in the current directory where you run PLINK).

plink --file mydata --genome --min 0.05

This will usually take a few days if you use genome-wide data.  You could speed this up by first thinning your dataset to around 100,000 markers using the --thin option to create a smaller dataset.

After you've run this you'll have a file in that directory called plink.genome.  You can read about the output here, but we're really interested in Z1 and Z0, the proportion of markers identical by descent 1 and 0 respectively, for every pair of individuals in the dataset.  You can plot Z1 by Z0 and color code them by the relationship type as coded in the pedfile using this R code.

Here's what the columns of interest will look like:

  RT     Z1     Z0
1 UN 0.1697 0.8294
2 OT 0.5034 0.4879
3 OT 0.4403 0.5439
4 OT 0.5200 0.4674
5 UN 0.1646 0.8311
6 OT 0.5519 0.4481

First, fire up R, go to the directory where your plink results are located and load in the data:

d = read.table("plink.genome", header=T)

Next, set up the plot area. The par option makes the points solid instead of an open circle, and the plot command sets up your plot and axes without adding any points.

with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

Your plot should be empty, like this:

Now use this command to plot points where the relationship type is FS, or full sibling.  Make the points green (col=3).

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))

Do the same thing with half sibs, "other related" (have the same family ID), parent offspring pairs, and unrelated, making each of them a different color.

with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))

You might want to rerun the commands for "other related" and especially half sibs, because they were covered up when you plotted unrelateds.

with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))

Finally, add a legend

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))

For a recap, your code should look like this:

d = read.table("plink.genome", header=T)

with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))

with(subset(d,RT=="FS") , points(Z0,Z1,col=3))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))

legend(1,1, xjust=1, yjust=1, legend=levels(d$RT), pch=16, col=c(3,"darkorange",4,2,1))

What you can see from the above plot is that self-reported parent-offspring pairs share 100% of their alleles IBD=1.  There is a single sib pair at the bottom left where Pr(IBD=1)=0 and Pr(IBD=0)=0.  This means that this sib pair shares 2 alleles identical by descent at every locus across the genome.  This is either a duplicated sample or a set of identical twins.  The full sibs cluster right around the middle of the plot, and you have lots of unrelated or "other related" pairs stretching from completely unrelated (bottom right quadrant) to relatively closely related (2nd or 3rd degree relatives, right around the middle of the plot).  If you see self-reported sibs showing up where the unrelated individuals cluster on this plot, or vice versa, this should clue you in on potential sample identity problems.


Just for fun, let's make the same plot using ggplot2.  Make sure you have the data loaded as above, then install ggplot2 if you haven't already


You only have to install the package once.  Next load the package:


Use this incredibly simple command to make the plot:


I'm not a huge fan of these colors, so lets make them the same as above:

qplot(Z0,Z1,data=d,colour=RT) + scale_colour_manual(value=c(3,"darkorange",4,2,1))

That's a little better.  Obviously the code to do this is MUCH simpler in ggplot2.  The only problem I have with this is that it automatically plots the points in the order of the levels of RT, ie. FS, then HS, then OT, PO, then UN, so you end up covering up all your half-sibs with "other related" and unrelated pairs.  Post a comment if you know how to go back and replot the HS points over the UN points.  For more on ggplot2, see my previous post comparing histograms made using Stata, R base, R lattice and R ggplot2.

Tuesday, October 6, 2009

R Commander: A Basic Statistics GUI for R

R is a great tool with lots of resources for genetics, genome-wide association studies, and many other biological applications.  We've covered several places to find help in R in the past, but if you're still apprehensive about diving into R's command-line interface, fear not.  The R commander is a graphical user interface (GUI) for R that works under Windows, Linux, and Mac.  It features a data editor that looks a lot like an Excel spreadsheet or Stata's data editor.  It also has menus and clicky-boxes to do lots of basic analyses and for producing graphics without ever typing a single command.

R commander depends on lots of other packages to function properly, so to install it along with all it's dependencies, type this in at the R command line:

install.packages("Rcmdr", dependencies=TRUE)

Note that this may take 5-10 minutes depending on your connection since it's downloading lots of extra packages, but you only need to do that once.  Now that it's installed, any time you want to load it, open R and type:


If you're familiar with Stata or SPSS you should find the menus and dialog boxes intuitive and self-explanatory. I've put in a couple screenshots below showing R commander's data editor, GLM tool, and graphics menu.

There are a few other GUIs for R that I've never used, including Rattle (the R Analytical Tool To Learn Easily), and JGR (Java Gui for R).  If you prefer one of these to Rcmdr, let us here why in the comments!

The R Commander: A Basic-Statistics GUI for R

Friday, October 2, 2009

Genomics to Confirm Nationality?

Thanks to Kylee for pointing out this frightening trend from the U.K. Border Agency.

Essentially, the United Kingdom is experimenting with using mitochondrial and Y-chromosome markers to determine if asylum-seekers that flee persecution are actually from the nation in question.

Advocates of this policy don't seem to understand that genetic variation is not nation-specific. Just as you cannot "look" at a person and determine their nationality, you cannot look at their DNA to make that determination either.
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.