Tuesday, March 24, 2009

Write Your First Perl Script

And it will do way more than display "Hello, world!" to the screen. An anonymous commenter on one of our Linux posts recently asked how to write scripts that will automate the same analysis on multiple files. While there are potentially several ways to do this, perl will almost always get the job done. Here I'll pose a situation you may run into where perl may help, and we'll break down what this simple perl script does, line by line.

Let's say you want to run something like plink or MDR, or any other program that you have to call at the command line with a configuration or batch file that contains the details of the analysis. For this simple case, let's pretend we're running MDR on a dataset we've stratified 3 ways by race. We have three MDR config files: afr-am.cfg, caucasian.cfg, and hispanic.cfg, each of which will run MDR on the respective dataset. Now, we could manually run MDR three separate times, and in fact, in this scenario that may be easier. But when you need to run dozens, or maybe hundreds of analyses, you'll need a way to automate things. Check out this perl script, and I'll explain what it's doing below. Fire up something like nano or pico, copy/paste this, and save the file as "runMDR.pl":

foreach $current_cfg (@ARGV)
{
# This will run sMDR
`./sMDR $current_cfg`;
}
# Hooray, we're done!

Now, if you call this script from the command line like this, giving the config files you want to run as arguments to the script, it will run sMDR on all three datasets, one after the other:

> perl runMDR.pl afr-am.cfg caucasian.cfg hispanic.cfg

You could also use the asterisk to pass everything that ends with ".cfg" as an argument to the script:

> perl runMDR.pl *.cfg

Okay, let's break this down, step by step.
  1. First, some syntax. Perl ignores everything on a line after the # sign, so you can comment your code and remember what it does later. The little ` marks around ./sMDR $current_cfg are backticks. The backtick key is usually above the tab key on your keyboard. And that semicolon is important.
  2. @ARGV is an array that contains the arguments that you pass to the program (the MDR config files here), and $current_cfg is a variable that takes on the value of each element in @ARGV, one at a time.
  3. Each time $current_cfg takes on a new value, perl will execute the code between the curly braces. So the first time, perl will execute `./sMDR afr-am.cfg`. The stuff between the backticks is executed exactly as if you were typing it into the shell yourself. Here, I'm assuming you have sMDR and afr-am.cfg in the current directory.
  4. Once perl executes the block of code between the braces for each element of @ARGV, it quits, and now you'll have results for all three analyses.
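If you'd like to watch the loop work before pointing it at a real analysis, here is a minimal sketch of a dry run. It is not part of the original script: it swaps in echo as a stand-in for the real program, the name dry_run is made up for illustration, and it turns on "use strict" and "use warnings", which is why the variables are declared with "my".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dry run of the runMDR.pl loop. "echo" stands in for the real program
# (an assumption for illustration), so no analysis is actually run.
sub dry_run {
    my @cfg_files = @_;
    my @report;
    foreach my $current_cfg (@cfg_files) {
        # Backticks run the command and return whatever it printed.
        my $printed = `echo ./sMDR $current_cfg`;
        push @report, "Would run: $printed";
    }
    return @report;
}

# One report line per config file given on the command line.
print dry_run(@ARGV);
```

Saved under a name of your choosing and called with the three config files as arguments, this should print one "Would run: ./sMDR ..." line per file, in the order you gave them.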
A few final thoughts... If the stuff you're automating is going to take a while to complete, you may consider checking out Greg's previous tutorial on screen. Next, if whatever program you're running over and over again displays output to the screen, you'll have to add an extra line to see that output yourself, or write that output to a file. Also, don't forget your comments! Perl can be quick to write but difficult to understand later on, so comment your scripts well. Finally, if you need more help and you can't find it here or here, many of the folks on this hall have used perl for some time, so ask around!
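
On the point about output: backticks capture what the program prints (its standard output) instead of showing it, so you have to do something with it yourself. The following is one possible sketch, not the only way to do it. The name run_and_save and the ".out" file naming are made up for illustration, and it assumes sMDR prints its results to standard output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a program on one config file, save what it printed to "<config>.out"
# (a hypothetical naming scheme), and return that output to the caller.
sub run_and_save {
    my ($program, $cfg) = @_;
    my $output = `$program $cfg`;            # backticks capture standard output
    $output = '' unless defined $output;     # undef if the command could not start
    warn "$program had a problem with $cfg\n" if $? != 0;   # $? is the exit status
    open(my $fh, '>', "$cfg.out") or die "Cannot write $cfg.out: $!";
    print $fh $output;
    close($fh);
    return $output;
}

foreach my $current_cfg (@ARGV) {
    # Print the captured output so it also shows up on the screen.
    print run_and_save('./sMDR', $current_cfg);
}
```

Called the same way as runMDR.pl, this would leave a file such as afr-am.cfg.out next to each config file. Anything the program writes to standard error is not captured by backticks and still goes straight to the terminal.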

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.