Thursday, November 11, 2010

Split up a GWAS dataset (PED/BED) by Chromosome

As I mentioned in my recap of the ASHG 1000 Genomes tutorial, I'm going to be imputing some of my own data to 1000 Genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags.

I'm starting from a binary pedigree format file (PLINK's bed/bim/fam format), and the first step in the 1000 Genomes imputation cookbook is to store your data in Merlin format, one file per chromosome. Surprisingly, there is no option in PLINK to split up a dataset into separate files by chromosome, so I wrote a Perl script to do it myself. The script takes two arguments: 1. the base filename of the binary pedfile (if your files are data.bed, data.bim, and data.fam, the base filename is "data", without the quotes); 2. a base filename for the output files to be split up by chromosome. You'll need PLINK installed for this to work, and I've only tested it on a Unix machine. You can copy the source code below:
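In outline, the script loops over chromosomes 1-22 and calls PLINK once per chromosome. A rough shell sketch of the same logic (a dry run with placeholder filenames; it echoes each PLINK command rather than executing it):

```shell
#!/bin/sh
# Dry-run sketch of the split-by-chromosome logic: one PLINK call per
# chromosome. Remove the leading "echo" to actually run the commands.
split_by_chr() {
  inbase=$1    # base name of the .bed/.bim/.fam files
  outbase=$2   # base name for the per-chromosome output files
  for chr in $(seq 1 22); do
    echo plink --noweb --bfile "$inbase" --chr "$chr" --make-bed --out "${outbase}.chr${chr}"
  done
}

split_by_chr data out
```

Each PLINK call would write out.chr1.bed/.bim/.fam, out.chr2.bed/.bim/.fam, and so on.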


  1. Thanks, Will and Stephen! I have shared this with my colleagues who are in the midst of our GWAS analysis.

  2. This is very helpful, Stephen. I am getting ready to work through the tutorial myself, and my first question, now that I have used your script, is: how do I go from PLINK bed files to Merlin format? Are the PLINK ped format and Merlin format compatible, and if so, wouldn't it be possible to use the --recode option in your script instead of the --make-bed option? Thanks for this and all of your great posts.

  3. Brad, good question. It looks like the ped and Merlin formats are nearly the same. See the Merlin website. To get a ped/map file set from a binary pedfile, use

    plink --bfile basename --recode --out outputname

    The map file needs only three columns, so use

    gawk '{print $1,$2,$4}' data.map > data.map3

    There's one other file you'll need, called the data file, described on the Merlin page under "Describing the pedigree file". All you need here is one row per data item in the pedigree file, indicating the data type (encoded as M - marker, A - affection status, T - quantitative trait, and C - covariate) and providing a one-word label for each item. So if your pedigree data has a quantitative trait as the phenotype (or no phenotype), and the only other items in the pedfile are SNPs, you can use this command to make the .dat file that you'll need for MACH:

    gawk 'BEGIN {print "T","pheno";}{print "M",$2}' data.map > data.dat

    That will make a new .dat file that looks like this:
    T pheno
    M rs3094315
    M rs12562034
    M rs3934834
    M rs9442372
    M rs3737728
    ....... and so on.
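    To see both steps end to end, here is a toy run (the file names and SNP positions are made up; plain awk behaves the same as gawk for these one-liners):

```shell
# Toy 4-column PLINK .map file: chromosome, SNP id, genetic distance, position
printf '1 rs3094315 0 752566\n1 rs12562034 0 768448\n' > toy.map

# Three-column Merlin map: chromosome, SNP name, physical position
awk '{print $1,$2,$4}' toy.map > toy.map3

# .dat file: a "T" row for the phenotype, then one "M" row per marker
awk 'BEGIN {print "T","pheno"}{print "M",$2}' toy.map > toy.dat
```

    toy.map3 then contains lines like "1 rs3094315 752566", and toy.dat starts with "T pheno" followed by one "M" line per SNP.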

  4. When I split, I also split at the centromeres, which is friendlier to memory-hungry imputation jobs. And since modern computers have so many cores, this means I can run more jobs concurrently.

  5. Just wanted to note that if you are on a multiple-processor machine, you can easily parallelize this task by using the GNU Parallel command (which you can install from the GNU website).

    The command would be something like:
    seq 1 22 | parallel -j +0 plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr {} --make-bed --out outbase.chr{}

    Note also that you can replicate the behavior of your Perl script with a one-line bash command:
    for chr in `seq 1 22`; do plink --nonfounders --allow-no-sex --noweb --bfile inbase --chr ${chr} --make-bed --out outbase.chr${chr}; done

    Just thought it might be helpful to point out other ways to get such things done -- it's always good to have more options available!

  6. Do you need to remove the 6th column from the PLINK ped file? MACH's ped files only have 5 columns.

  7. Dear Will and Stephen, I am a Masters student from Canada. I am a statistician looking to develop new methods in the field of genetics. I am taking a genetics course, and for my final project I need to do some data analysis, so I am looking for GWAS data. I kindly request you to suggest some websites from which I can download such data, or, if you have some genotype data, I kindly request you to mail me at

  8. Hi Friends,

    I have data in the following format and am looking for a script to convert it to Merlin format. Please help:

    SNP Chromosome Position ind1 ind2 ind3 ind4 ind5 ind6 ind7
    rs10458597 1 564621 CC CC CC CC CC CC CC
    rs12565286 1 721290 GG GG GG GG GG GG GG
    rs12082473 1 740857 GG GG GG GG GG GG GG
    rs3094315 1 752566 AA AA GA AA AA GA AA
    rs2286139 1 761732 TT TT CT TT TT CT TT
    rs11240776 1 765269 AA AA AA AA AA AA AA
    rs2980319 1 777122 TT TT AT TT TT AT TT
    rs2980300 1 785989 CC CC TC CC CC TC CC
    rs2905036 1 792480 TT TT TT TT TT TT TT
    rs11240777 1 798959 GG GG AG GG GG AG GG
    rs4245756 1 799463 CC CC CC CC CC CC CC
    rs3748597 1 888659 CT CC CC CC CT CC CT
    rs2341354 1 918573 AG GG AG GG AG AG AG
    rs4970403 1 926431 TT TT TT TT TT TT TT
    rs2465126 1 947034 AA AA AA AA AA AA AA

    thanks and best regards,
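    A minimal awk sketch for data in this shape, transposing the SNP-major table into one row per individual (assumptions: the individuals are unrelated founders of unknown sex, and the made-up FAM/ind labels stand in for real pedigree IDs):

```shell
# Toy copy of the SNP-major table above (header row, then one row per SNP)
printf 'SNP Chromosome Position ind1 ind2\nrs1 1 100 AA AG\nrs2 1 200 CC CT\n' > geno.txt

# Transpose to Merlin ped layout: one row per individual with
# famid, person, father, mother, sex, then one genotype per SNP
awk 'NR == 1 { next }   # skip the header row
{
  nsnp++
  nind = NF - 3
  for (i = 4; i <= NF; i++)
    geno[i - 3, nsnp] = substr($i, 1, 1) "/" substr($i, 2, 1)
}
END {
  for (j = 1; j <= nind; j++) {
    printf "FAM%d ind%d 0 0 0", j, j
    for (s = 1; s <= nsnp; s++) printf " %s", geno[j, s]
    printf "\n"
  }
}' geno.txt > geno.ped
```

    The matching three-column map and .dat files would come from the SNP, Chromosome, and Position columns, built the same way as in the reply above.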


  9. Hi,
    I am a graduate student analyzing Affymetrix array 6.0 data. I have done phasing using both the fastPHASE and PHASE programs. Now I want to use the phased output to prepare a sweep input file. Please, can you show me the way to do that?

  10. I am trying to do a meta-analysis of GWAS data, but I am having big trouble managing the raw GWAS data.
    I cannot convert the raw GWAS data, which I downloaded from the site, to ped and map files for PLINK analysis.
    The file name is like this: phs000202.pha002867.txt.

    Could you tell me how to convert the "phs000202.pha002867.txt" file to ped and map files (such as phs000202.pha002867.ped and a corresponding .map file) for GWAS meta-analysis using PLINK.

  11. Does this script work if your genome dataset isn't human and your chromosome names are, for example, "2R", "2L", etc.?

  12. Hi,

    there is an undocumented option in PLINK, --recode-beagle, that splits the data per chromosome and creates input files for imputation in BEAGLE.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.