Wednesday, August 21, 2013

Utility script for launching bare JAR files

Torsten Seemann compiled a list of minimum standards for bioinformatics command line tools, things like printing help when no commands are specified, including version info, avoid hardcoded paths, etc. These should be obvious to any seasoned software engineer, but many of these standards are not followed in bioinformatics.

#8 on the list was "Don't distribute bare JAR files." This is particularly annoying, requiring a user to invoke the software using something like: java -Xmx1000m -jar /path/on/my/system/to/software.jar . There are a few notable offenders in bioinformatics out there (I'm looking at you, Trimmomatic, snpEff, GATK...).

A very simple solution to the bare JAR file problem is distributing your java tool with a shell script wrapper that makes it easier for your users to invoke. E.g., if I have GATK installed in ~/bin/ngs/gatk/GenomeAnalysisTK.jar, I can create this shell script at ~/bin/ngs/gatk/gatk (replace GenomeAnalysisTK.jar with someOtherBioTool.jar):

Once I make that script executable and include that directory in my path, calling GATK is much simpler:

Yes, I'm fully aware that making my own JAR launcher utility scripts for existing software will make my code less reproducible, but for quick testing and development I don't think it matters. The tip has the best results when JAR files are distributed from the developer with utility scripts for invoking them.

See the post below for more standards that should be followed in bioinformatics software development.

Torsten Seeman: Minimum standards for bioinformatics command line tools

Monday, August 12, 2013

Understanding the ENSEMBL Schema

ENSEMBL is a frequently used resource for various genomics and transcriptomics tasks.  The ENSEMBL website and MART tools provide easy access to their rich database, but ENSEMBL also provides flat-file downloads of their entire database and a public MySQL portal.  You can access this using the MySQL Workbench using the following:
User:     anonymous
Once inside, you can get a sense for what the ENSEMBL schema (or data model) is like.  First, it’s important to understand the ENSEMBL ID system.  Most of the primary entities in the ENSEMBL database (genes, exons, transcripts, proteins) have a formal, stable identifier (beginning with ENSG, ENSE, ENST, and ENSP respectively) that does not change from build to build.  These entries can be found in the gene_stable_id tables.  All of these entities also have an internal identifier (an integer).  Once you have an internal ID for the entity of interest, details of the entity can be found in the genes, exons, transcripts, and translations (proteins) table. For example, the following query will retrieve a list of all transcripts and their exons for a given gene.
SELECT * FROM gene_stable_id a
inner join gene b on a.gene_id = b.gene_id
inner join transcript c on b.gene_id = c.gene_id
inner join exon_transcript d on c.transcript_id = d.transcript_id
inner join exon e on d.exon_id = e.exon_id
inner join transcript_stable_id f on c.transcript_id = f.transcript_id
inner join exon_stable_id g on e.exon_id = g.exon_id
The exon_transcript table contains a mapping of each exon to any transcripts containing it, and also contains a rank to indicate which exon it is relative to a given transcript.  To retrieve exons for a list of genes by their ENSEMBL IDs, these could be loaded into a table and joined to the gene_stable_id table in the query above.  To pull the build 37 chromosome and coordinates for an exon, use the following:
Select a.exon_id,, a.seq_region_start, a.seq_region_end from exon a
inner join seq_region b on a.seq_region_id = b.seq_region_id
inner join coord_system c on b.coord_system_id = c.coord_system_id
where c.version = "GRCh37";
In this query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in this case the chromosome number. 

There are also extensive cross-references in the ENSEMBL database.  To retrieve alternate identifiers for a set of transcripts, execute the following: 
select * from transcript_stable_id a
inner join transcript b on a.transcript_id = b.transcript_id
inner join object_xref c on b.transcript_id = c.ensembl_id
inner join xref d on c.xref_id = d.xref_id
inner join external_db e on d.external_db_id = e.external_db_id
where ensembl_object_type = "Transcript"
limit 20;
ENSEMBL organizes cross-references (xrefs) for all entity types into a single table object_xref.  This table contains an ensemble_object_type field that is a “Transcript”, “Gene”, or “Translation”, and an ensemble_id that matches either a gene_id, transcript_id, or a translation_id.  Replace “transcript” in the above query with “gene” or “translation” to retrieve gene or protein cross-references.  A list of all external cross-reference sources can be found by querying: 
Select db_name from external_db;
There is a great deal of information within the ENSEMBL database that can be accessed using SQL, which for some types of operations is easier than using the MART or web interface.  Full details of the ENSEMBL schema can be found here (

Monday, August 5, 2013

Google Developers R Programming Video Lectures

Google Developers recognized that most developers learn R in bits and pieces, which can leave significant knowledge gaps. To help fill these gaps, they created a series of introductory R programming videos. These videos provide a solid foundation for programming tools, data manipulation, and functions in the R language and software. The series of short videos is organized into four subsections: intro to R, loading data and more data formats, data processing and writing functions. Start watching the YouTube playlist here, or watch an individual lecture below:

1.1 - Initial Setup and Navigation
1.2 - Calculations and Variables
1.3 - Create and Work With Vectors
1.4 - Character and Boolean Vectors
1.5 - Vector Arithmetic
1.6 - Building and Subsetting Matrices
1.7 - Section 1 Review and Help Files
2.1 - Loading Data and Working With Data Frames
2.2 - Loading Data, Object Summaries, and Dates
2.3 - if() Statements, Logical Operators, and the which() Function
2.4 - for() Loops and Handling Missing Observations
2.5 - Lists
3.1 - Managing the Workspace and Variable Casting
3.2 - The apply() Family of Functions
3.3 - Access or Create Columns in Data Frames, or Simplify a Data Frame using aggregate()
4.1 - Basic Structure of a Function
4.2 - Returning a List and Providing Default Arguments
4.3 - Add a Warning or Stop the Function Execution
4.4 - Passing Additional Arguments Using an Ellipsis
4.5 - Make a Returned Result Invisible and Build Recursive Functions
4.6 - Custom Functions With apply()

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.