Unicorn

unicorn is a command-line tool for analyzing SAM/BAM alignment files. It computes alignment-based statistics at three levels: per reference, per taxid, and for the whole BAM. It can also filter alignments and references on criteria such as read count, reference length, alignment score, ANI, and sequence complexity. When taxonomy metadata is available, unicorn can annotate BAM files with taxonomic tags and aggregate results by rank using NCBI-style taxonomy files. Typical uses are metagenomic workflows, reference screening, quality control of alignment data, and extracting coverage, ANI, and duplicity metrics from large sequencing datasets.

Index

Unicorn computes alignment-based statistics from SAM/BAM files. It is aimed at metagenomic and reference-screening workflows where the same alignment file may need per-reference statistics, per-taxon summaries, BAM-wide summaries, or alignment filtering.

The executable is called unicorn. The conda package is currently named enhjoerning.

Install

From Conda

conda install -c conda-forge -c bioconda enhjoerning

From Source

Clone with submodules and build:

git clone --recursive https://github.com/GeoGenetics/unicorn.git
cd unicorn
make

Source Requirements

Building from source requires:

  • A C compiler (gcc is fine)
  • A C++ compiler with C++20 support (gcc is fine)
  • make
  • htslib
  • Genesis
  • klib, included as a submodule

If src/genesis is present, the Makefile builds against the bundled Genesis submodule. Otherwise it uses a Genesis installation from CONDA_PREFIX.

If htslib is installed in a custom prefix, point HTSSRC to a directory that contains include/ and lib/:

export HTSSRC=/path/to/htslib-prefix
make

For a conda development environment:

export HTSSRC="$CONDA_PREFIX"
make

Run

Commands

./unicorn command [options] -b <in.bam>|<in.sam>

Commands:
  refstats    Compute per-reference statistics.
  bamstats    Compute per-BAM statistics.
  taxstats    Compute per-taxid statistics.
  alnfilt     Filter alignments based on user-defined criteria.

refstats

Computes one row of statistics per reference sequence. It is the main command for reference-level filtering, coverage metrics, and adding taxonomy tags to filtered BAM files.

unicorn 2.5.1 b53cd7e
	Jun  4 2026 13:36:20
./unicorn refstats [options] -b <in.bam>|<in.sam>
Options:
  -b <str>   Input bam|sam [Required]
  -t <int>, --threads <int> Number of threads [4]
  -o <str>, --outbam  <str> Output BAM file with filtered references
  -k <int>, --ksize <int>   kmer size for duplicity computation [17]
  --outstat <str> Print statistics to file <str> [stdout]
  --[FILTER] <PARAM>  Apply filter "FILTER" with parameter "PARAM"
      Example: "--minreads 100" to filter out references with
                 less than 100 reads.
      Filters:
       - minreflen <int>  Minimum reference length to consider  [1]
       - minreads  <int>  Minimum number of reads per reference [1]
       - minalnas  <int>  Minimum alignment score [-Inf]
       - maxdust   <int>  Maximum alignment dust score [100]
  --names   <str> Taxonomy nodeid to name mapping file.
  --nodes   <str> Taxonomy nodeid to parent nodeid mapping file.
  --acc2tax <str> Accession to taxid mapping file.
  Report taxid of reference sequence. Enabled automatically when
  --acc2tax, --names and --nodes are provided.
  taxid is reported in bam records in custom:
  XT:i:<taxid> tag and
  XR:i:<taxid> tag in.
  taxid column 2 in the output statistics file.
  --rank <str>  Taxonomic rank for XR tag. [genus]
  --qsize <int> Size of queue for coordinate sorted input bam files [1024]
  --verbose  Print libunicorn's messages.
  -h         print this help message

Filter aligned.bam to keep only references that have at least 10 aligned reads and are at least 1000 bp long, and only reads with a dust score of 80 or less. The statistics are written to refstats.txt and the alignments that pass the filters to refstats.bam.

unicorn refstats \
  -b aligned.bam \
  -o refstats.bam \
  --minreads 10 \
  --minrefl 1000 \
  --maxdust 80 \
  --outstat refstats.txt

Add Taxonomy To BAM Files

When --acc2tax, --names, and --nodes are provided, refstats can annotate the filtered BAM with taxonomy tags.

unicorn refstats \
  -b aligned.bam \
  --acc2tax acc2tax.txt \
  --names names.dmp \
  --nodes nodes.dmp \
  --rank genus \
  --outbam refstats.bam > refstats.txt

The taxonomy inputs are:

  • --nodes: NCBI taxonomy nodes file.
  • --names: NCBI taxonomy names file.
  • --acc2tax: accession-to-taxid map. It maps BAM reference names to taxids.

refstats writes two integer tags to BAM records:

  • XT:i:<taxid>: the direct taxid assigned to the reference accession.
  • XR:i:<rank_taxid>: the ancestor taxid at the requested --rank.

It also records how XR was produced in the BAM header:

@CO	unicorn:tax-tags	XR=rank_taxid	rank=genus

This makes the BAM usable by taxstats without reloading the accession-to-taxid map.
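
A downstream script can check which rank the XR tags refer to by reading this header comment. The following is a minimal sketch, assuming the @CO line has exactly the tab-separated layout shown above. The function name xr_rank is hypothetical, and a sample header is inlined so the snippet runs without a BAM file.

```shell
# Hypothetical helper: print the rank recorded in a unicorn:tax-tags @CO line.
# Reads SAM header text on stdin; prints nothing if no such line is present.
xr_rank() {
  awk -F'\t' '
    $1 == "@CO" && $2 == "unicorn:tax-tags" {
      for (i = 3; i <= NF; i++)
        if ($i ~ /^rank=/) { sub(/^rank=/, "", $i); print $i; exit }
    }'
}

# Inlined sample header that follows the layout shown above.
rank=$(printf '@HD\tVN:1.6\n@CO\tunicorn:tax-tags\tXR=rank_taxid\trank=genus\n' | xr_rank)
echo "$rank"
```

With a real file, the header text would typically come from samtools view -H refstats.bam piped into the same function.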

taxstats

taxstats computes one row of statistics per taxid at the requested rank. It can either assign alignments to taxa from an accession map, or consume BAM files that already contain Unicorn XT and XR tags.

unicorn 2.5.1 b53cd7e
	Jun  4 2026 13:36:20
./unicorn taxstats [options] -b <in.bam>|<in.sam>
Options:
  -b <str>                     Input bam|sam
  -t <int>, --threads <int>    Number of threads [4]
  -k <int>, --ksize <int>      kmer size for duplicity computation [17]
  --outstat <str> Print statistics to file <str> [stdout]
  --[FILTER] <PARAM>  Apply filter "FILTER" with parameter "PARAM"
      For example "--minreads 100" to filter out taxids with
      less than 100 reads.
      Available filters:
       - minrefl  <int>   Minimum reference length. [1]
       - minreads <int>   Minimum number of reads per taxid. [1]
       - minmani  <float> Minimum mean ANI per taxid. [0]
       - minalnas <int>   Minimum alignment score [-Inf]
       - maxdust  <int>   Maximum alignment dust score [100]
  --acc2tax <str>   Accession to taxid mapping file or .khash file.
  --names <str>     Taxonomy names file.
  --nodes <str>     Taxonomy nodes file
  --rank <str>      Taxonomic rank to summarize by. [genus]
  --qsize <int>     Size of queue for XR sorted input bam files [1024]
  --verbose         Prints libunicorn's messages.
  -h                Print this help message

Run taxstats from an accession map:

unicorn taxstats \
  -b aligned.bam \
  --acc2tax acc2tax.txt.gz \
  --names names.dmp \
  --nodes nodes.dmp \
  --rank genus \
  --outstat genus.taxstats.txt

Run taxstats from a BAM already annotated by refstats:

unicorn refstats \
  -b alignments.bam \
  -o refstats.bam \
  --acc2tax acc2tax.txt.gz \
  --names names.dmp --nodes nodes.dmp > /dev/null
unicorn taxstats \
  -b refstats.bam \
  --names names.dmp \
  --nodes nodes.dmp \
  --rank genus \
  --outstat genus.taxstats.txt

Relation To refstats Taxonomy Tags

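refstats and taxstats are designed to work together. A common workflow is to annotate and filter with refstats, sort the result by the XR tag, and then summarize with taxstats.

For example, from a metagenomic sample, keep only the reads that mapped to any mammal, and report only mammals with at least 10000 reads.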

unicorn refstats \
  -b aligned.bam \
  --maxdust 50 \
  --acc2tax acc2tax.txt.gz \
  --names names.dmp \
  --nodes nodes.dmp \
  --rank class \
  --outbam refstat.bam \
  --outstat refstats.txt

samtools sort -t XR refstat.bam > refstats.XRSorted.bam

unicorn taxstats \
  -b refstats.XRSorted.bam \
  --minrefl 1000000000 \
  --minreads 10000 \
  --minalnas -10 \
  --outstat genus.taxstats.txt


samtools view refstats.XRSorted.bam <() | 
  

Output Columns

Unicorn prints numbered headers in each statistics file. The most important columns are shared across refstats and taxstats:

  • num_alns, num_reads: alignments and unique reads contributing to the row.
  • mean_readl, stdev_readl, median_readl, mode_readl: read length summary.
  • mean_alnnm: mean edit distance from the NM tag.
  • mean_alnani, stdev_alnani, median_alnani: alignment ANI summary.
  • num_covbases, mean_cov, breath_cov: coverage depth and breadth.
  • exp_breath, breath_ratio: expected breadth and observed/expected breadth.
  • mean_covcovered, stdev_covcovered, evenness_cov: coverage among covered bases.
  • site_density: covered bases per kilobase.
  • duplicity: unique canonical k-mers divided by observed k-mers.
  • mdust, stdev_dust: sequence complexity summary.
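
To make the duplicity definition above concrete, here is a small sketch that counts canonical k-mers over a set of sequences. It assumes that the k-mers are drawn from read sequences and that "canonical" means the lexicographically smaller of a k-mer and its reverse complement. Neither detail is confirmed by this README, and the function name duplicity is hypothetical. The example uses k=2 for readability, whereas the documented default for --ksize is 17.

```shell
# Hypothetical helper: unique canonical k-mers / observed k-mers.
# Usage: duplicity K < sequences (one sequence per line).
duplicity() {
  awk -v k="$1" '
    # Reverse complement of s; any base other than A, C, G, T becomes N.
    function revcomp(s,   i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        out = out (c == "A" ? "T" : (c == "C" ? "G" : (c == "G" ? "C" : (c == "T" ? "A" : "N"))))
      }
      return out
    }
    {
      for (p = 1; p + k - 1 <= length($0); p++) {
        kmer = substr($0, p, k)
        rc = revcomp(kmer)
        canon = (kmer < rc) ? kmer : rc      # assumed canonical form
        if (!(canon in seen)) { seen[canon] = 1; unique++ }
        total++
      }
    }
    END { if (total > 0) printf "%.4f\n", unique / total; else print "NA" }
  '
}

# ACGT has the 2-mers AC, CG, GT; GT and AC share one canonical form.
dup=$(printf 'ACGT\n' | duplicity 2)
echo "$dup"   # 2 unique canonical k-mers out of 3 observed: 0.6667
```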

refstats also reports coverage entropy, Gini, normalized entropy, normalized Gini, and tad80.
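
The coverage columns can be illustrated on a toy per-base depth vector. This is only a sketch: the input format (one depth per line, one line per reference base) and the function name cov_summary are hypothetical, and the expected-breadth formula 1 - exp(-mean_cov) is an assumption based on a random read placement model, not something this README states.

```shell
# Hypothetical helper: summarize per-base depths read from stdin.
cov_summary() {
  awk '
    { n++; sum += $1; if ($1 > 0) covered++ }
    END {
      if (n == 0) { print "NA"; exit }
      mean = sum / n                 # mean depth over all reference bases
      breadth = covered / n          # fraction of bases covered at least once
      expected = 1 - exp(-mean)      # ASSUMPTION: random-placement model
      ratio = (expected > 0) ? sprintf("%.3f", breadth / expected) : "NA"
      printf "covered=%d mean_cov=%.3f breadth=%.3f expected=%.3f ratio=%s\n",
             covered, mean, breadth, expected, ratio
    }'
}

# Ten reference bases, three of them covered.
out=$(printf '0\n0\n2\n2\n0\n0\n0\n0\n4\n0\n' | cov_summary)
echo "$out"
```

A ratio well below 1 means the reads pile up on fewer bases than their mean depth would suggest, which is the kind of signal breath_ratio is meant to expose.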

taxstats reports:

  • taxid and name
  • num_accessions: number of references assigned to the taxid
  • total_length: summed reference length for the taxid

Taxonomy Files

Unicorn expects NCBI-style taxonomy files:

  • names.dmp: taxonomy node ID to name mapping.
  • nodes.dmp: taxonomy node ID to parent/rank mapping.
  • acc2tax: tab-separated accession-to-taxid mapping.
  • .khash: binary accession-to-taxid map used for faster loading.
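
Before a long run it can help to sanity-check a text acc2tax file. The sketch below assumes a two-column, tab-separated layout with the accession first and an integer taxid second, and no header line. The README only says the file is tab-separated, so the column position is kept in a variable (taxcol) that may need adjusting. The accession names are placeholders.

```shell
# Build a tiny illustrative map; the third row is deliberately malformed.
tmp=$(mktemp)
printf 'refA\t9606\nrefB\t9913\nrefC\tnot_a_taxid\n' > "$tmp"

# ASSUMPTION: the taxid sits in column 2; change taxcol if your file differs.
taxcol=2

# Count rows whose taxid field is missing or not a plain integer.
bad=$(awk -F'\t' -v c="$taxcol" \
  'NF < c || $c !~ /^[0-9]+$/ { n++ } END { print n + 0 }' "$tmp")
echo "rows with a missing or non-numeric taxid: $bad"
rm -f "$tmp"
```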

Supported rank names are:

species genus family order class phylum kingdom domain
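
A wrapper script can reject an unsupported --rank value before starting a run. This is a minimal sketch with a hypothetical function name, is_supported_rank. It treats rank names as case-sensitive, which is an assumption rather than documented behavior.

```shell
# Hypothetical helper: succeed only for rank names listed in this README.
is_supported_rank() {
  case "$1" in
    species|genus|family|order|class|phylum|kingdom|domain) return 0 ;;
    *) return 1 ;;
  esac
}

for r in genus Genus strain; do
  if is_supported_rank "$r"; then
    echo "$r: supported"
  else
    echo "$r: not in the documented list"
  fi
done
```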

Input Notes

  • refstats, bamstats, and taxstats can read ordinary SAM/BAM files.
  • alnfilt requires query-sorted or query-grouped input.
  • Alignment ANI uses the NM tag and read length.
  • Alignment filters that use score require an alignment score tag compatible with Unicorn's score checks.
  • Coverage breadth is the fraction of reference bases covered at least once.
  • Coverage evenness is computed from coverage on covered bases, so it is best interpreted together with breath_cov.
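
As an illustration of how an identity value can be derived from the NM tag and read length, the sketch below assumes identity = 1 - NM / read_length. Whether Unicorn uses exactly this expression, or reports a fraction or a percentage, is not stated in this README. The function name aln_ani is hypothetical.

```shell
# Hypothetical helper: identity of one alignment from its edit distance (NM)
# and read length, ASSUMING identity = 1 - NM / read_length.
aln_ani() {
  awk -v nm="$1" -v len="$2" 'BEGIN {
    if (len <= 0) { print "NA"; exit }
    printf "%.4f\n", 1 - nm / len
  }'
}

# Edit distance of 3 over a 75 bp read.
ani=$(aln_ani 3 75)
echo "$ani"   # prints 0.9600
```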

License

MIT License. See LICENSE for details.
