GNU datamash is a command-line utility that offers simple calculations (e.g. count, sum, min, max, mean, stdev, string coalescing) as well as a rich set of statistical functions, to quickly assess information in textual input files or from a UNIX pipe. Here are a few examples:
E.g., let’s use the
seqcommand to generate some data, and use
datamash sumto add up all the values in the first column:
$ seq 5 1 2 3 4 5 $ seq 5 | datamash sum 1 15
What else? Let’s calculate the mean, 1st quartile, median, 3rd quarile, IQR, sample-standard-deviation, and p-value of Jarque-Bera test for normal distribution, using some data in
$ cat file.txt | datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1 mean(x) q1(x) median(x) q3(x) iqr(x) sstdev(x) jarque(x) 45.32 23 37 61.5 38.5 30.4487 8.0113e-09
Go take a look at more examples, specifically the examples using datamash to parse a GTF file. Datamash can very quickly do things like find the number of isoforms per gene by grouping (collapsing) over gene IDs and counting the number of transcripts, find the number of genes with more than 5 isoforms, find genes transcribed from multiple chromosomes, examine variability in exon counts per gene, etc.