Friday, August 28, 2009

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load the output into excel or a MySQL database because all the fields are separated by a variable number of spaces. This little command line trick will convert a variable-space delimited PLINK output file to a comma delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called results.hwe.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

Before:

turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206

After:

turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206


If you're interested in the details of what this is doing here you go:

First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is called a regular expression, which is similar to doing a find-and-replace in MS Word. What this does is searches for the thing between the first pair of slashes and replaces it with the thing between the next two slashes. You need the -r option, and the "s" before the first and the "g" after the last slash to make it work right.

/^\s+// is the first regular expression. \s is special and it means means search for whitespace. \s+ means search for any amount of whitespace. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace any whitespace with nothing. This part will trim any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.

3 comments:

  1. I came across this post a couple of months ago and have used this command line tip regularly since then. It has saved me hours of frustration. Thanks so much!

    ReplyDelete
  2. I have used the command countless times (and kept my sanity in the process) - thank you so much!

    ReplyDelete
  3. Thank you thank you thank you! I've put this into a bash script and then made it executable as a command so I can just pipe the file into "2csv", it works like a charm!

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.