Thursday, March 26, 2009

Ditch Excel for AWK!

How often have you needed to extract a certain column from a file, or pull only a few individuals out of a ped file? This is often done with Excel, but that requires loading the data, manipulating it, then saving it back into an acceptable format, which may require converting tabs to spaces or other aggravating things. In many circumstances, a powerful Linux command-line tool, AWK, can accomplish the same thing in far fewer steps (at least once you get the hang of it).

AWK is a small programming language in its own right, and in fact portions of Perl are based on AWK. Let's say you have a file called text.txt and you want to find all lines of that file that contain the word "the" (the match is case-sensitive):

> awk '/the/' text.txt

Or you'd like to see all lines that start with "rs":

> awk '/^rs/' text.txt
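
Here's a small self-contained sketch of the pattern matching above. The file name snps.txt and its contents are made up for illustration; the point is that patterns are case-sensitive and can be restricted to a single column:

```shell
# Build a small, made-up sample file (hypothetical contents for illustration).
printf 'rs123 The first SNP\nrs456 the second SNP\nchr1 not an rs line\n' > snps.txt

# Patterns are case-sensitive: this matches only the line with lowercase "the".
awk '/the/' snps.txt

# Match on one field only: print lines whose first column starts with "rs".
awk '$1 ~ /^rs/' snps.txt
```

Testing a single field with `$1 ~ /^rs/` is handy when "rs" might also turn up at the start of some other column.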

Or perhaps most usefully, you want to skip the first 5 lines of a file and print everything after them:

> awk 'NR > 5' text.txt
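
NR is the number of the line AWK is currently reading, so the same idea extends to keeping a header or printing a range of lines. A quick sketch, assuming a made-up file data.txt with one header line:

```shell
# Made-up sample: a header line followed by six data lines.
printf 'id value\na 1\nb 2\nc 3\nd 4\ne 5\nf 6\n' > data.txt

# Drop the first 5 lines (only "e 5" and "f 6" remain here).
awk 'NR > 5' data.txt

# Keep the header (line 1) and drop lines 2 through 5.
awk 'NR == 1 || NR > 5' data.txt

# Print only lines 2 through 4.
awk 'NR >= 2 && NR <= 4' data.txt
```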

This just scratches the surface of course... for a good tutorial with examples, check out this site:

I'll also look into setting up a post with AWK snippets for common useful procedures...



  1. To expand on this, so you don't have to muck through the documentation on AWK, here's a quick way to pull out the first 6 columns of a PED file (very useful for large files):

    awk '{print $1,$2,$3,$4,$5,$6}' mypedfile.ped > myoutputfile.txt

  2. For very short scripts, Perl can be run straight from the command line with the "-e" option. I started with sed/awk many moons ago, but now I use Perl. Maybe I'm just getting old, but I find it hard to remember the syntax of the other command line tools. Perl was created to integrate shell scripts, sed, awk and the like. I would still say it's good to know awk, but I can't tell you how many times I could have written a Perl script by the time I got an awk script debugged :)
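
Building on the column-extraction one-liner in the first comment, here is a rough sketch of the other task mentioned in the post: pulling a few individuals out of a PED file. The file names (example.ped, keep.txt, subset.ped) and their contents are invented for illustration, and it assumes the individual ID sits in column 2, as in the usual PED layout:

```shell
# Made-up PED-style file: family ID, individual ID, father, mother, sex, phenotype, genotypes.
printf 'F1 I1 0 0 1 2 A A G T\nF1 I2 0 0 2 1 A C G G\nF2 I3 0 0 1 1 C C T T\n' > example.ped

# Hypothetical list of individual IDs to keep, one per line.
printf 'I1\nI3\n' > keep.txt

# While reading the first file (NR == FNR), store each ID from keep.txt;
# then print the PED rows whose individual ID (column 2) was stored.
awk 'NR == FNR { keep[$1]; next } $2 in keep' keep.txt example.ped > subset.ped

# Column extraction for tab-delimited data: split on tabs and write tabs back out.
printf 'F1\tI1\t0\t0\t1\t2\tA A\n' | awk -F'\t' -v OFS='\t' '{ print $1, $2 }'
```

Setting both -F and OFS to a tab keeps the output tab-delimited, which avoids the tabs-to-spaces conversion complained about in the post.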


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.