snpmer scans FASTA sequences and emits SNP-mers: k-mers where the middle base is variable (A/C/G/T) while the flanking bases are identical.
This is useful for building k-mer sets that represent potential SNP sites without keeping genomic coordinates.
Requirements
- Rust 1.85+ (this project uses the Rust 2024 edition)
- (Optional) zstd CLI to inspect/decompress output
Compile a release binary
cargo build --release
./target/release/snpmer --help
Install locally
cargo install --path .
snpmer \
  --input input.fa.gz \
  --output snpmers.fa.zst \
  --k 21 \
  --threads 16
Full CLI help:
Detect central-base SNP kmers
Usage: snpmer [OPTIONS] --input <INPUT> --output <OUTPUT>
Options:
-i, --input <INPUT> Input FASTA file (optionally .gz)
-o, --output <OUTPUT> Output zstd-compressed FASTA file path
-k, --k <K> K-mer length (must be odd and fit into 2*(k-1) bits <= 128) [default: 21]
--compression-level <COMPRESSION_LEVEL> Compression level for zstd (1-21) [default: 3]
--threads <THREADS> Override rayon worker threads
--shards <SHARDS> Number of hash-map shards (power of two) [default: 8192]
--max-indexed-kmers <MAX_INDEXED_KMERS> Maximum number of distinct (k-1)-mer keys to keep indexed at once; 0 disables the limit [default: 10000000000]
-h, --help Print help
-V, --version Print version
For each k-mer window in the input:
- Take the (k-1) flanking bases (everything except the middle base).
- Canonicalize by choosing the lexicographically smaller of:
  - the forward flanks, or
  - the reverse-complement flanks
  (if the reverse-complement wins, the middle base is complemented as well)
- Record which middle base(s) were observed for those flanks.
A flank key is output as a SNPmer if it was observed with at least two distinct middle bases.
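The steps above can be sketched in Rust as follows. This is a minimal, string-based illustration and not snpmer's actual implementation: the function names (`find_snpmers`, `base_bit`, `complement`) are hypothetical, only uppercase A/C/G/T are accepted, and a tie between forward and reverse-complement flanks is resolved in favour of the forward strand, which the description above does not specify.

```rust
use std::collections::HashMap;

/// Allele-mask bit for a base (A=0x01, C=0x02, G=0x04, T=0x08); None for anything else.
fn base_bit(b: u8) -> Option<u8> {
    match b {
        b'A' => Some(0x01),
        b'C' => Some(0x02),
        b'G' => Some(0x04),
        b'T' => Some(0x08),
        _ => None,
    }
}

/// Complement of a valid base; other bytes are returned unchanged.
fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other,
    }
}

/// Hypothetical helper: returns canonical flank key -> mask of observed middle
/// bases, keeping only keys seen with at least two distinct middle bases.
fn find_snpmers(seqs: &[&str], k: usize) -> HashMap<Vec<u8>, u8> {
    assert!(k >= 3 && k % 2 == 1, "k must be odd and >= 3");
    let mid = k / 2;
    let mut seen: HashMap<Vec<u8>, u8> = HashMap::new();

    for seq in seqs {
        for window in seq.as_bytes().windows(k) {
            // Any non-ACGT character invalidates every window that overlaps it.
            if window.iter().any(|&b| base_bit(b).is_none()) {
                continue;
            }
            // Forward flanks: everything except the middle base.
            let fwd: Vec<u8> = window[..mid]
                .iter()
                .chain(&window[mid + 1..])
                .copied()
                .collect();
            // Reverse-complement flanks.
            let rev: Vec<u8> = fwd.iter().rev().map(|&b| complement(b)).collect();

            // Pick the lexicographically smaller flanks. Assumption: ties keep
            // the forward strand.
            let (key, middle) = if rev < fwd {
                (rev, complement(window[mid]))
            } else {
                (fwd, window[mid])
            };
            let bit = base_bit(middle).expect("middle base was validated above");
            *seen.entry(key).or_insert(0) |= bit;
        }
    }

    // A flank key qualifies only with two or more distinct middle bases.
    seen.retain(|_, mask| mask.count_ones() >= 2);
    seen
}
```

For example, `find_snpmers(&["ACGTA", "ACTTA"], 5)` yields a single key `ACTA` with mask `0x0C` (G and T). Replacing the second sequence with its reverse complement `TAAGT` gives the same result.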
--output is a zstd-compressed FASTA file. Each SNPmer is written as two lines:
>snpmer_<N>|mask=<HH>
<SEQUENCE>
- SEQUENCE is length k and contains an IUPAC ambiguity code at the center position.
- mask is a 4-bit allele mask in hex (HH), using:
  - 0x01 = A
  - 0x02 = C
  - 0x04 = G
  - 0x08 = T
Common biallelic examples:
- A/C → mask=03 → center is M
- A/G → mask=05 → center is R
- A/T → mask=09 → center is W
- C/G → mask=06 → center is S
- C/T → mask=0A → center is Y
- G/T → mask=0C → center is K
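The mask-to-code mapping above follows the standard IUPAC nucleotide codes, so it extends naturally to three and four alleles. The sketch below is illustrative only: `iupac_from_mask` and `format_record` are hypothetical names, not functions taken from snpmer, and the record layout simply mirrors the two-line format described earlier.

```rust
/// Map a 4-bit allele mask (A=0x01, C=0x02, G=0x04, T=0x08) to its IUPAC code.
/// Returns None for an empty mask or a value outside 4 bits.
fn iupac_from_mask(mask: u8) -> Option<char> {
    match mask {
        0x01 => Some('A'),
        0x02 => Some('C'),
        0x03 => Some('M'), // A/C
        0x04 => Some('G'),
        0x05 => Some('R'), // A/G
        0x06 => Some('S'), // C/G
        0x07 => Some('V'), // A/C/G
        0x08 => Some('T'),
        0x09 => Some('W'), // A/T
        0x0A => Some('Y'), // C/T
        0x0B => Some('H'), // A/C/T
        0x0C => Some('K'), // G/T
        0x0D => Some('D'), // A/G/T
        0x0E => Some('B'), // C/G/T
        0x0F => Some('N'), // any base
        _ => None,
    }
}

/// Hypothetical formatter for one output record: a header line with the
/// two-digit hex mask, then the k-mer with the IUPAC code at the center.
fn format_record(n: usize, left: &str, right: &str, mask: u8) -> Option<String> {
    let code = iupac_from_mask(mask)?;
    Some(format!(">snpmer_{n}|mask={mask:02X}\n{left}{code}{right}\n"))
}
```

With these helpers, `format_record(1, "AC", "TA", 0x0C)` produces the header `>snpmer_1|mask=0C` followed by the sequence `ACKTA`. Single-base masks are included for completeness, although a SNPmer as defined above always has at least two bits set.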
Inspect the output:
zstd -d -c snpmers.fa.zst | head
- --k: must be odd and >= 3.
  - Internally the (k-1) flanks are packed into a u128, so k is at most 65.
  - There is also a practical limit tied to --shards because a packed key is split into (shard_idx, quotient) where quotient is stored as a u32. If you see an error like “too large to fit 32-bit quotient”, reduce k or increase --shards (in powers of two).
- --compression-level: zstd level 1–21 (higher = smaller output, slower).
- --threads: sets the Rayon worker pool; the zstd writer is also configured to use up to this many workers.
- --shards: number of hash-map shards (rounded up to the next power of two). More shards can improve parallel merging but add overhead.
- --max-indexed-kmers: soft cap for how many distinct flank keys to keep in memory at once.
  - If the limit is exceeded, snpmer processes shards in multiple passes, deferring some shards to later passes.
  - Set lower to reduce peak memory; set 0 to disable the limit entirely.
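The --k limits described above can be illustrated with a small Rust sketch. It rests on several assumptions that the text does not confirm: bases are encoded as A=0, C=1, G=2, T=3, the shard index is taken from the low bits of the packed key, and the quotient is the remaining high bits. The names `check_k`, `pack_flanks` and `split_key` are hypothetical, and the real layout inside snpmer may differ.

```rust
/// Validate k: odd, >= 3, and 2*(k-1) bits must fit in a u128 (so k <= 65).
fn check_k(k: usize) -> Result<(), String> {
    if k < 3 || k % 2 == 0 {
        return Err(format!("k={k} must be odd and >= 3"));
    }
    let bits = 2 * (k - 1);
    if bits > 128 {
        return Err(format!("k={k} needs {bits} bits, more than 128"));
    }
    Ok(())
}

/// Pack up to 64 flank bases into a u128, two bits per base.
/// Assumption: A=0, C=1, G=2, T=3. Returns None on invalid input.
fn pack_flanks(flanks: &[u8]) -> Option<u128> {
    if flanks.len() > 64 {
        return None;
    }
    let mut key: u128 = 0;
    for &b in flanks {
        let code: u128 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        key = (key << 2) | code;
    }
    Some(key)
}

/// Split a packed key into (shard_idx, quotient).
/// Assumption: the low log2(shards) bits select the shard and the rest
/// form the quotient, which must fit in a u32.
fn split_key(key: u128, shards: usize) -> Result<(usize, u32), String> {
    assert!(shards.is_power_of_two(), "shards must be a power of two");
    let shard_bits = shards.trailing_zeros();
    let shard_idx = (key & (shards as u128 - 1)) as usize;
    let quotient = key >> shard_bits;
    u32::try_from(quotient)
        .map(|q| (shard_idx, q))
        .map_err(|_| format!("quotient {quotient} does not fit in a u32"))
}
```

Under these assumptions, 8192 shards consume 13 bits of the key. A k=21 key (40 bits) leaves a 27-bit quotient, which fits, while an all-T key for k=25 (48 bits) leaves 35 bits and is rejected, which is consistent with the advice to reduce k or increase --shards.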
- Only A/C/G/T are treated as valid bases. Any other character (e.g. N) breaks k-mer windows that overlap it.
- Input FASTA headers are not preserved; output records are generated (snpmer_1, snpmer_2, …).
- Output order is not guaranteed (hash-map iteration and multi-pass partitioning).
- SNPmers do not include positional information; they represent contexts with variable central bases, not coordinates.