Unicorn is a command-line tool for analyzing SAM/BAM alignment files. It computes alignment-based statistics at multiple levels, including per-reference, per-taxid, and whole-BAM summaries. It can also filter alignments and references based on criteria such as read count, reference length, alignment score, ANI, and sequence complexity. When taxonomy metadata is available, Unicorn can annotate BAM files with taxonomic tags and aggregate results by rank using NCBI-style taxonomy files. This makes it especially useful for metagenomic workflows, reference screening, and quality control of alignment data. It also extracts coverage, ANI, and duplicity metrics from large sequencing datasets. In short, Unicorn turns raw alignments into summaries and filtered outputs for downstream analysis.
Unicorn computes alignment-based statistics from SAM/BAM files. It is aimed at metagenomic and reference-screening workflows where the same alignment file may need per-reference statistics, per-taxon summaries, BAM-wide summaries, or alignment filtering.
The executable is called unicorn. The conda package is currently named
enhjoerning.
conda install -c conda-forge -c bioconda enhjoerning
Clone with submodules and build:
git clone --recursive https://github.com/GeoGenetics/unicorn.git
cd unicorn
make
Building from source requires:
- A C compiler (gcc is fine)
- A C++ compiler with C++20 support (gcc is fine)
- make
- htslib
- Genesis
- klib, included as a submodule
If src/genesis is present, the Makefile builds against the bundled Genesis
submodule. Otherwise it uses a Genesis installation from CONDA_PREFIX.
If htslib is installed in a custom prefix, point HTSSRC to a directory that
contains include/ and lib/:
export HTSSRC=/path/to/htslib-prefix
make
For a conda development environment:
export HTSSRC="$CONDA_PREFIX"
make
./unicorn command [options] -b <in.bam>|<in.sam>
Commands:
refstats Compute per-reference statistics.
bamstats Compute per-BAM statistics.
taxstats Compute per-taxid statistics.
alnfilt Filter alignments based on user-defined criteria.
refstats computes one row of statistics per reference sequence. It is the main command for reference-level filtering, coverage metrics, and adding taxonomy tags to filtered BAM files.
unicorn 2.5.1 b53cd7e
Jun 4 2026 13:36:20
./unicorn refstats [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam [Required]
-t <int>, --threads <int> Number of threads [4]
-o <str>, --outbam <str> Output BAM file with filtered references
-k <int>, --ksize <int> kmer size for duplicity computation [17]
--outstat <str> Print statistics to file <str> [stdout]
--[FILTER] <PARAM> Apply filter "FILTER" with parameter "PARAM"
Example: "--minreads 100" to filter out references with
less than 100 reads.
Filters:
- minreflen <int> Minimum reference length to consider [1]
- minreads <int> Minimum number of reads per reference [1]
- minalnas <int> Minimum alignment score [-Inf]
- maxdust <int> Maximum alignment dust score [100]
--names <str> Taxonomy nodeid to name mapping file.
--nodes <str> Taxonomy nodeid to parent nodeid mapping file.
--acc2tax <str> Accession to taxid mapping file.
Report taxid of reference sequence. Enabled automatically when
--acc2tax, --names and --nodes are provided.
taxid is reported in bam records in custom:
XT:i:<taxid> tag and
XR:i:<taxid> tag in.
taxid column 2 in the output statistics file.
--rank <str> Taxonomic rank for XR tag. [genus]
--qsize <int> Size of queue for coordinate sorted input bam files [1024]
--verbose Print libunicorn's messages.
-h print this help message
Filter aligned.bam to keep only references that are at least 1000 bp long and have at least 10 aligned reads, and only reads whose dust score is 80 or less. Statistics are written to refstats.txt, and the alignments that pass the filters are written to refstats.bam.
unicorn refstats \
-b aligned.bam \
-o refstats.bam \
--minreads 10 \
--minrefl 1000 \
--maxdust 80 \
--outstat refstats.txt
When --acc2tax, --names, and --nodes are provided, refstats can annotate the filtered BAM with taxonomy tags.
unicorn refstats \
-b aligned.bam \
--acc2tax acc2tax.txt \
--names names.dmp \
--nodes nodes.dmp \
--rank genus \
--outbam refstats.bam > refstats.txt
The taxonomy inputs are:
- --nodes: NCBI taxonomy nodes file.
- --names: NCBI taxonomy names file.
- --acc2tax: accession-to-taxid map. It maps BAM reference names to taxids.
refstats writes two integer tags to BAM records:
- XT:i:<taxid>: the direct taxid assigned to the reference accession.
- XR:i:<rank_taxid>: the ancestor taxid at the requested --rank.
It also records how XR was produced in the BAM header:
@CO unicorn:tax-tags XR=rank_taxid rank=genus
This makes the BAM usable by taxstats without reloading the accession-to-taxid
map.
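To make the tag layout concrete, here is a minimal Python sketch of how a downstream script might read the XT and XR tags from one SAM text record and the @CO header line shown above. It relies only on the general SAM convention of tab-separated TAG:TYPE:VALUE optional fields after the 11 mandatory columns. The function names, the reference name ref_A, and the taxids 1000 and 100 are made up for illustration; this is not Unicorn's own parser.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory SAM fields
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def parse_tax_comment(header_line):
    """Parse an '@CO unicorn:tax-tags key=value ...' header line into a dict.

    Returns None for any other header line.
    """
    parts = header_line.split()
    if parts[:2] != ["@CO", "unicorn:tax-tags"]:
        return None
    return dict(part.split("=", 1) for part in parts[2:])


# Hypothetical record: read name, reference name, and taxids are toy values.
record = "\t".join([
    "read1", "0", "ref_A", "100", "30", "50M", "*", "0", "0",
    "A" * 50, "I" * 50, "NM:i:1", "XT:i:1000", "XR:i:100",
])
tags = parse_sam_tags(record)
print(tags["XT"], tags["XR"])
print(parse_tax_comment("@CO unicorn:tax-tags XR=rank_taxid rank=genus"))
```

For real BAM files a library such as pysam or htslib would be the usual choice; the plain-text parsing here only illustrates the layout.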
taxstats computes one row of statistics per taxid at the requested rank. It
can either assign alignments to taxa from an accession map, or consume BAM files
that already contain Unicorn XT and XR tags.
unicorn 2.5.1 b53cd7e
Jun 4 2026 13:36:20
./unicorn taxstats [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam
-t <int>, --threads <int> Number of threads [4]
-k <int>, --ksize <int> kmer size for duplicity computation [17]
--outstat <str> Print statistics to file <str> [stdout]
--[FILTER] <PARAM> Apply filter "FILTER" with parameter "PARAM"
For example "--minreads 100" to filter out taxids with
less than 100 reads.
Available filters:
- minrefl <int> Minimum reference length. [1]
- minreads <int> Minimum number of reads per taxid. [1]
- minmani <float> Minimum mean ANI per taxid. [0]
- minalnas <int> Minimum alignment score [-Inf]
- maxdust <int> Maximum alignment dust score [100]
--acc2tax <str> Accession to taxid mapping file or .khash file.
--names <str> Taxonomy names file.
--nodes <str> Taxonomy nodes file
--rank <str> Taxonomic rank to summarize by. [genus]
--qsize <int> Size of queue for XR sorted input bam files [1024]
--verbose Prints libunicorn's messages.
-h Print this help message
Run taxstats from an accession map:
unicorn taxstats \
-b aligned.bam \
--acc2tax acc2tax.txt.gz \
--names names.dmp \
--nodes nodes.dmp \
--rank genus \
--outstat genus.taxstats.txt
Run taxstats from a BAM already annotated by refstats:
unicorn refstats \
-b alignments.bam \
-o refstats.bam \
--acc2tax acc2tax.txt \
--names names.dmp \
--nodes nodes.dmp > /dev/null
unicorn taxstats \
-b refstats.bam \
--names names.dmp \
--nodes nodes.dmp \
--rank genus \
--outstat genus.taxstats.txt
refstats and taxstats are designed to work together. A common workflow is:
From a metagenomic sample, keep only the reads that mapped to any mammal. Only report mammals with at least 1000 reads.
unicorn refstats \
-b aligned.bam \
--maxdust 50 \
--acc2tax acc2tax.txt.gz \
--names names.dmp \
--nodes nodes.dmp \
--rank class \
--outbam refstat.bam \
--outstat refstats.txt
samtools sort -t XR refstat.bam > refstats.XRsorted.bam
unicorn taxstats \
-b refstats.XRsorted.bam \
--minrefl 1000000000 \
--minreads 10000 \
--minalnas -10 \
--outstat genus.taxstats.txt
samtools view refstats.XRsorted.bam <() |
Unicorn prints numbered headers in each statistics file. The most important
columns are shared across refstats and taxstats:
- num_alns, num_reads: alignments and unique reads contributing to the row.
- mean_readl, stdev_readl, median_readl, mode_readl: read length summary.
- mean_alnnm: mean edit distance from the NM tag.
- mean_alnani, stdev_alnani, median_alnani: alignment ANI summary.
- num_covbases, mean_cov, breath_cov: coverage depth and breadth.
- exp_breath, breath_ratio: expected breadth and observed/expected breadth.
- mean_covcovered, stdev_covcovered, evenness_cov: coverage among covered bases.
- site_density: covered bases per kilobase.
- duplicity: unique canonical k-mers divided by observed k-mers.
- mdust, stdev_dust: sequence complexity summary.
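The duplicity definition above (unique canonical k-mers divided by observed k-mers) can be sketched in a few lines of Python. This is an illustration of the ratio only, not Unicorn's implementation. It assumes that a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, and that k-mers containing characters other than A, C, G, T are skipped; both choices are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def duplicity(reads, k=17):
    """Unique canonical k-mers divided by observed k-mers across all reads."""
    observed = 0
    unique = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if not set(kmer) <= VALID_BASES:
                continue  # assumption: skip k-mers with ambiguous bases
            observed += 1
            unique.add(canonical(kmer))
    return len(unique) / observed if observed else float("nan")


# Two identical toy reads: 14 observed 4-mers, 3 distinct canonical 4-mers.
print(duplicity(["ACGTACGTAC", "ACGTACGTAC"], k=4))
```

Under this definition a value near 1 means most k-mers were seen once, while lower values indicate repeated sequence content among the aligned reads.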
refstats also reports coverage entropy, Gini, normalized entropy,
normalized Gini, and tad80.
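As a rough guide to reading these columns, the sketch below computes entropy and Gini from a per-base depth vector using textbook definitions: Shannon entropy of the depth distribution, entropy divided by log(n), the Gini coefficient, and Gini scaled by n/(n-1). Whether Unicorn uses exactly these formulas and normalizations is an assumption, and tad80 is not covered here.

```python
import math


def coverage_entropy(depth):
    """Shannon entropy (natural log) of depth treated as a distribution over positions."""
    total = sum(depth)
    if total == 0:
        return 0.0
    return -sum((d / total) * math.log(d / total) for d in depth if d > 0)


def normalized_entropy(depth):
    """Entropy divided by its maximum, log(n); 1.0 means perfectly even coverage."""
    n = len(depth)
    return coverage_entropy(depth) / math.log(n) if n > 1 else 0.0


def gini(depth):
    """Gini coefficient of per-base depth; 0.0 means perfectly even coverage."""
    n = len(depth)
    total = sum(depth)
    if n == 0 or total == 0:
        return 0.0
    ranked = sorted(depth)
    weighted = sum(i * d for i, d in enumerate(ranked, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def normalized_gini(depth):
    """Gini rescaled by n / (n - 1) so that a single covered base gives 1.0."""
    n = len(depth)
    return gini(depth) * n / (n - 1) if n > 1 else 0.0


even = [2, 2, 2, 2]
spike = [0, 0, 0, 4]
print(normalized_entropy(even), gini(even))
print(normalized_entropy(spike), normalized_gini(spike))
```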
taxstats reports:
- taxid and name
- num_accessions: number of references assigned to the taxid
- total_length: summed reference length for the taxid
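The relationship between per-reference rows and these per-taxid columns can be sketched as a simple group-by. The row keys reference and length, the helper name, and the toy data are hypothetical, and the mapping passed in is assumed to already point each reference at the taxid of the requested rank. This is not how Unicorn is implemented internally, only what the columns mean.

```python
from collections import defaultdict


def aggregate_by_taxid(ref_rows, ref_to_taxid):
    """Group per-reference rows by taxid, counting references and summing lengths."""
    per_taxid = defaultdict(
        lambda: {"num_accessions": 0, "total_length": 0, "num_alns": 0}
    )
    for row in ref_rows:
        taxid = ref_to_taxid.get(row["reference"])
        if taxid is None:
            continue  # assumption: references without a taxid are skipped
        entry = per_taxid[taxid]
        entry["num_accessions"] += 1
        entry["total_length"] += row["length"]
        entry["num_alns"] += row["num_alns"]
    return dict(per_taxid)


# Toy per-reference rows and a toy reference-to-taxid map.
rows = [
    {"reference": "ref_A", "length": 5000, "num_alns": 12},
    {"reference": "ref_B", "length": 3000, "num_alns": 7},
    {"reference": "ref_C", "length": 9000, "num_alns": 40},
]
mapping = {"ref_A": 100, "ref_B": 100, "ref_C": 200}
print(aggregate_by_taxid(rows, mapping))
```

Only additive quantities are summed here. Counts of unique reads are left out on purpose, because a read that aligns to several references of the same taxon would be counted more than once by a plain sum.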
Unicorn expects NCBI-style taxonomy files:
- names.dmp: taxonomy node ID to name mapping.
- nodes.dmp: taxonomy node ID to parent/rank mapping.
- acc2tax: tab-separated accession-to-taxid mapping.
- .khash: binary accession-to-taxid map used for faster loading.
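For readers unfamiliar with the NCBI dump layout, the sketch below parses a few hand-written lines in the names.dmp and nodes.dmp style, where fields are separated by tab, pipe, tab and each line ends with tab, pipe. The taxids and names are toy values, and keeping only "scientific name" entries is an assumption about what a consumer typically wants, not a statement about Unicorn's loader.

```python
def split_dmp(line):
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")


def parse_nodes(lines):
    """Return (parent, rank_of) dicts keyed by integer taxid."""
    parent, rank_of = {}, {}
    for line in lines:
        taxid, parent_id, rank = split_dmp(line)[:3]
        parent[int(taxid)] = int(parent_id)
        rank_of[int(taxid)] = rank
    return parent, rank_of


def parse_names(lines):
    """Return a taxid -> scientific name dict."""
    names = {}
    for line in lines:
        taxid, name, _unique_name, name_class = split_dmp(line)[:4]
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


# Toy excerpts in the dump format; IDs and names are invented.
nodes_lines = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tfamily\t|\n",
    "100\t|\t10\t|\tgenus\t|\n",
    "1000\t|\t100\t|\tspecies\t|\n",
]
names_lines = [
    "100\t|\tToygenus\t|\t\t|\tscientific name\t|\n",
    "100\t|\tOldname\t|\t\t|\tsynonym\t|\n",
    "1000\t|\tToygenus exemplum\t|\t\t|\tscientific name\t|\n",
]
parent, rank_of = parse_nodes(nodes_lines)
print(parent[1000], rank_of[100], parse_names(names_lines)[100])
```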
Supported rank names are:
species genus family order class phylum kingdom domain
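Resolving a taxid to its ancestor at one of these ranks amounts to walking parent links until a node of the requested rank is found. The following sketch shows that walk on a toy taxonomy. The function name, the in-memory dicts, and the choice to return None when no ancestor has the requested rank are assumptions for illustration; the source does not say how Unicorn handles that case.

```python
SUPPORTED_RANKS = {
    "species", "genus", "family", "order", "class", "phylum", "kingdom", "domain",
}


def ancestor_at_rank(taxid, rank, parent, rank_of):
    """Walk up the taxonomy from taxid and return the first node with the given rank."""
    if rank not in SUPPORTED_RANKS:
        raise ValueError(f"unsupported rank: {rank}")
    seen = set()
    current = taxid
    while current not in seen:  # guards against the root pointing at itself
        if rank_of.get(current) == rank:
            return current
        seen.add(current)
        current = parent.get(current)
        if current is None:
            return None  # taxid missing from the nodes table
    return None  # reached the root without finding the rank


# Toy taxonomy: 1000 (species) -> 100 (genus) -> 10 (family) -> 1 (root).
parent = {1000: 100, 100: 10, 10: 1, 1: 1}
rank_of = {1000: "species", 100: "genus", 10: "family", 1: "no rank"}
print(ancestor_at_rank(1000, "genus", parent, rank_of))
print(ancestor_at_rank(1000, "class", parent, rank_of))
```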
- refstats, bamstats, and taxstats can read ordinary SAM/BAM files.
- alnfilt requires query-sorted or query-grouped input.
- Alignment ANI uses the NM tag and read length.
- Alignment filters that use score require an alignment score tag compatible with Unicorn's score checks.
- Coverage breadth is the fraction of reference bases covered at least once.
- Coverage evenness is computed from coverage on covered bases, so it is best interpreted together with breath_cov.
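A small worked example may help when interpreting the ANI and coverage columns. The Python sketch below computes them from toy data under stated assumptions: ANI is taken as 1 - NM / read length (expressed as a fraction), expected breadth uses the common Poisson approximation 1 - exp(-mean_cov), and site density is read as covered bases per 1000 reference bases. Unicorn's exact formulas may differ; the dictionary keys reuse the column names only for readability.

```python
import math


def alignment_ani(nm, read_length):
    """Assumed definition: fraction of read bases not counted as edits by NM."""
    return 1.0 - nm / read_length


def coverage_summary(intervals, ref_length):
    """Summarise coverage from half-open, 0-based (start, end) alignment intervals."""
    depth = [0] * ref_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    total = sum(depth)
    mean_cov = total / ref_length
    breadth = covered / ref_length
    expected = 1.0 - math.exp(-mean_cov)  # assumption: Poisson approximation
    return {
        "num_covbases": covered,
        "mean_cov": mean_cov,
        "breath_cov": breadth,
        "exp_breath": expected,
        "breath_ratio": breadth / expected if expected > 0 else float("nan"),
        "mean_covcovered": total / covered if covered else 0.0,
        "site_density": 1000.0 * covered / ref_length,
    }


# Two overlapping toy alignments on a 10 bp reference.
summary = coverage_summary([(0, 4), (2, 6)], ref_length=10)
print(summary["breath_cov"], summary["mean_cov"], summary["mean_covcovered"])
print(alignment_ani(nm=2, read_length=50))
```

In this toy case six of ten bases are covered, so breadth is 0.6 while mean depth over covered bases is higher than mean depth over the whole reference, which is why the covered-base metrics are best read alongside breadth.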
MIT License. See LICENSE for details.