Thursday, April 30, 2009

Merging data from different files using R

A reader asked yesterday how you would merge data from two different files. For example, let's say you have a ped file with genotypes for individuals, and another file that had some other data for some of the individuals in the pedfile, like clinical or environmental covariates. How would you go about automatically putting the clinical data from file 2 into the appropriate rows in file 1? Without using a database, the easiest way is probably using the "merge" function in R that will do the trick with a one-line command. Here's a short tutorial to get you started.

First, start up R on the cheeses simply by typing in the uppercase letter R and hit enter. Alternatively you could use R if it's installed on your computer. Now type in this command to view a small dataset I made up. It's not a real pedfile, but it will do the trick. Make sure to put it all on one line:


You should see the dataset displayed:

indiv_id SNP1 SNP2
1 1 1 1
2 2 1 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 1 0
7 7 1 1
8 8 0 1
9 9 1 1
10 10 1 1

We have 10 individuals numbered 1-10 in a column called "indiv_id" that have genotype information at 2 loci. Now let's view another dataset that has fake clinical covariates for a few of those individuals:


Again, you will see the dataset displayed.

indiv_id cov1 cov2
1 1 1.14 74.6
2 3 4.50 79.4
3 4 0.80 48.2
4 6 1.39 68.1
5 7 3.89 84.8
6 8 0.42 86.1

In this dataset, we have clinical information on the individuals in the ped file above, but only on some of the individuals. Now, let's merge what information we have here with the ped file above.

But first, since all we did above was view the datasets, let's run those commands again, but this time storing the tables in a "data frame" (R parlance for dataset). Type in these two commands. The " <- " is an assignment. It means create a data frame called ped from the following table from the internet, and the same with a data frame called meta.

ped <- read.table("")

meta <- read.table("")

Now if you just type in the name of the data frame, e.g. ped, and hit enter, you will see the data frame displayed. Now it's time for the real magic. Type the following command, and I'll explain what it does below:


The output should look like this:

indiv_id SNP1 SNP2 cov1 cov2
1 1 1 1 1.14 74.6
2 2 1 1 NA NA
3 3 0 0 4.50 79.4
4 4 1 0 0.80 48.2
5 5 1 1 NA NA
6 6 1 0 1.39 68.1
7 7 1 1 3.89 84.8
8 8 0 1 0.42 86.1
9 9 1 1 NA NA
10 10 1 1 NA NA

This command uses R's "merge" function to combine the data frames you created above, ped and meta, into a single data frame. It does this by joining the two data frames on a common variable called "indiv_id". The all=TRUE option tells R to join the two data frames, even if one has fewer entries than another. You'll see the NA's listed for a few individuals. NA is R's way of coding missing data. If you didn't include the all=TRUE, observations with any missing data for any variable would be deleted!

Finally, if you want to save this data to a file, re-run the command above, assigning it to a variable, then use the write.table command to write it to your home directory.

superpedfile <- merge(ped,meta,by="indiv_id",all=TRUE)


To exit R, type quit(). To do this on your own data, see the help on read.table by typing ?read.table at the R command line. If anybody knows of a better way to do this post it in the comments!


  1. I forgot to mention that R's way of encoding missing data, with "NA" is usually not acceptable to other programs. Type in ?write.table at the R command line to get help on this. You'll need to use the na="?" option, where the ? here stands for the character you want to use to represent missing data.

  2. This is by far the simplest explanation I have seen about how to use merge, and it did exactly what I wanted. Thank you!

  3. This is great trick, it solved my problem paritally. What if the SNP details for 10 individuals are in rows rather than as they are in the first file above? How would you merge the two files & display SNP information for the individuals where SNPs are arranged in rows as in the output?


  4. Look up the "reshape" command, that might help.

  5. Thanks for this explanation! One question: How do I define what variable to merge on if my columns are not labeled? Can I use column number?


  6. They usually have names, even if they're just V1, V2, V3... Try the command names(yourdataframe) to get the column names. Also look at the help for merge by typing ?merge. You may have to use the by.x and by.y commands if the columns aren't named the same in both dataframes.

  7. Hi,

    If I have two RData files, how can I merge them? Thanks!

  8. Hi,
    Is it possible to merge two files by a column 1 (ID) but replace lets say column 6 of one file with column 2 of another file. I was using plink plugin to extract my .ped file from genomestudio but for some reason column 6 is where the phenotype status should be but everytime I just have values of 0 instead of 1/0 for case control status and I wanted to know if I could merge the phenotype in the column 6 later on in R or something to have a complete .ped file before i can use plink. Any help would be greatly appreciated! Otherwise any help on how I can get column 6 as the phentoype in my .ped file from the plugin would also be helpful, i've tried but it's always missing.

  9. Hey - if you are familiar with plink this is much easier than would be in R. First make your pedfile with all 0's for the trait. Then call plink using the --pheno option where you have a separate phenotype file with family id, individual id, and the phenotype. You can read up on the format of this file on the plink website. you can also use multiple phenotypes in analysis with --pheno in combination with --all-pheno. you can also write out a new pedfile with the phenotype you've chosen using --recode or --make-bed. Hope this helps.

  10. Thanks for that will definately give that a go!

  11. I use gPLINK successfully to most analyses but have problems merging to binary filesets due to error report: mis-matching SNPs --check +/- strands?

    For which I find no gPLINK command / operation. Is there any?

  12. Hi,

    Thanks, this has been very useful! I have one more question, how do i merge 2 data sets with identical variables adding cases.

  13. Great easy explanation!!!! Thankyou!!!

  14. Thank you so much ... This is just what i needed

  15. Thanks for the great explanation. Is it also possible to merge 3 files at once? I tried to do this by including an extra by.z="indiv_id" into the merge function. But this doesn't work. I probably need to do it in 2 steps?
    Thanks in advance!

  16. probably should use a nested merge. E.g:

    merge(dat3, merge(dat1,dat2,by="id",all=T), by="id", all=T)

  17. To add to that, as always, you should create a fake dataset to test this out to make sure it worked as you expected before moving to a larger dataset where problems may be more insidious.

  18. thanks, it is a simple explanation, easy to understand. However, concerning the big database, for instance, I got records from feeding machine for feed intake and time spending in feeders of pigs. Everyday, I got one Zip file containing information about animal ID, feed consumption, time spending...Data recorded over many years. I am looking for the scripts to merge all information from different zip file.
    Thank again

  19. nicely written! thanks

  20. hey anyone guide me how to make .ped file


Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.