
Malfoy/SNPmer

SNPmer

snpmer scans FASTA sequences and emits SNP-mers: k-mers whose flanking bases are identical across occurrences while the middle base varies (A/C/G/T).

This is useful for building k-mer sets that represent potential SNP sites without keeping genomic coordinates.

Build

Requirements

  • Rust 1.85+ (this project uses the Rust 2024 edition)
  • (Optional) zstd CLI to inspect/decompress output

Compile a release binary

cargo build --release
./target/release/snpmer --help

Install locally

cargo install --path .

Usage

snpmer \
  --input  input.fa.gz \
  --output snpmers.fa.zst \
  --k 21 \
  --threads 16

Full CLI help:

Detect central-base SNP kmers

Usage: snpmer [OPTIONS] --input <INPUT> --output <OUTPUT>

Options:
  -i, --input <INPUT>                          Input FASTA file (optionally .gz)
  -o, --output <OUTPUT>                        Output zstd-compressed FASTA file path
  -k, --k <K>                                  K-mer length (must be odd and fit into 2*(k-1) bits <= 128) [default: 21]
      --compression-level <COMPRESSION_LEVEL>  Compression level for zstd (1-21) [default: 3]
      --threads <THREADS>                      Override rayon worker threads
      --shards <SHARDS>                        Number of hash-map shards (power of two) [default: 8192]
      --max-indexed-kmers <MAX_INDEXED_KMERS>  Maximum number of distinct (k-1)-mer keys to keep indexed at once; 0 disables the limit [default: 10000000000]
  -h, --help                                   Print help
  -V, --version                                Print version

What is a “SNPmer” here?

For each k-mer window in the input:

  1. Take the (k-1) flanking bases (everything except the middle base).
  2. Canonicalize by choosing the lexicographically smaller of:
    • the forward flanks, or
    • the reverse-complement flanks
      (if reverse-complement wins, the middle base is complemented as well)
  3. Record which middle base(s) were observed for those flanks.

A flank key is output as a SNPmer if it was observed with at least two distinct middle bases.
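The three steps above can be sketched in a few lines of Rust. This is an illustration of the described procedure, not the tool's actual code: the function names (`complement`, `canonical_flanks`, `base_bit`, `collect_snpmers`) are hypothetical, flank keys are kept as byte vectors rather than packed integers, and the per-base bits are borrowed from the mask encoding listed under "Output format".

```rust
use std::collections::HashMap;

/// Complement of an uppercase DNA base; None for anything other than A/C/G/T.
fn complement(b: u8) -> Option<u8> {
    match b {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None,
    }
}

/// Split one k-mer (odd k >= 3) into its canonical flank key and middle base.
/// Returns None for an invalid length or a window containing a non-ACGT byte.
fn canonical_flanks(kmer: &[u8]) -> Option<(Vec<u8>, u8)> {
    if kmer.len() < 3 || kmer.len() % 2 == 0 {
        return None;
    }
    let mid = kmer.len() / 2;
    // Step 1: forward flanks are everything except the middle base.
    let fwd: Vec<u8> = kmer[..mid].iter().chain(&kmer[mid + 1..]).copied().collect();
    // Reverse-complement flanks: complement each base, reading right to left.
    let rev: Vec<u8> = fwd.iter().rev().map(|&b| complement(b)).collect::<Option<_>>()?;
    let mid_comp = complement(kmer[mid])?;
    // Step 2: keep the lexicographically smaller orientation; if the
    // reverse complement wins, the middle base is complemented too.
    if fwd <= rev {
        Some((fwd, kmer[mid]))
    } else {
        Some((rev, mid_comp))
    }
}

/// Bit for one base, matching the mask encoding in "Output format".
fn base_bit(b: u8) -> u8 {
    match b {
        b'A' => 0x01,
        b'C' => 0x02,
        b'G' => 0x04,
        b'T' => 0x08,
        _ => 0,
    }
}

/// Step 3: record the observed middle bases per flank key, then keep only
/// keys seen with at least two distinct middle bases.
fn collect_snpmers(seq: &[u8], k: usize) -> HashMap<Vec<u8>, u8> {
    assert!(k >= 3 && k % 2 == 1, "k must be odd and >= 3");
    let mut masks: HashMap<Vec<u8>, u8> = HashMap::new();
    for window in seq.windows(k) {
        if let Some((key, mid)) = canonical_flanks(window) {
            *masks.entry(key).or_insert(0) |= base_bit(mid);
        }
    }
    masks.retain(|_, mask| mask.count_ones() >= 2);
    masks
}
```

For example, with k = 3 the sequence `AACC` contains the windows `AAC` and `ACC`, which share the flanks `AC` and differ in the middle base (A vs. C), so the sketch reports a single key with mask 0x03.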

Output format

--output is a zstd-compressed FASTA file. Each SNPmer is written as two lines:

>snpmer_<N>|mask=<HH>
<SEQUENCE>

  • SEQUENCE is length k and contains an IUPAC ambiguity code at the center position.
  • mask is a 4-bit allele mask in hex (HH), using:
    • 0x01 = A
    • 0x02 = C
    • 0x04 = G
    • 0x08 = T

Common biallelic examples:

  • A/C → mask=03 → center is M
  • A/G → mask=05 → center is R
  • A/T → mask=09 → center is W
  • C/G → mask=06 → center is S
  • C/T → mask=0A → center is Y
  • G/T → mask=0C → center is K
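When consuming the output downstream, the header and mask can be decoded with the standard library alone. The sketch below is an assumption-laden helper, not part of snpmer: `parse_header` and `iupac_from_mask` are hypothetical names, and the codes for three or four alleles (V, H, D, B, N) follow the standard IUPAC table, since only the biallelic cases are listed above.

```rust
/// Parse a header of the form `>snpmer_<N>|mask=<HH>` into (N, mask).
/// Returns None if the line does not match that layout.
fn parse_header(line: &str) -> Option<(u64, u8)> {
    let rest = line.strip_prefix(">snpmer_")?;
    let (n, mask_hex) = rest.split_once("|mask=")?;
    let n: u64 = n.parse().ok()?;
    let mask = u8::from_str_radix(mask_hex.trim_end(), 16).ok()?;
    Some((n, mask))
}

/// IUPAC code for a 4-bit allele mask (0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T).
/// Returns None for 0 or for values with bits outside the low nibble.
fn iupac_from_mask(mask: u8) -> Option<char> {
    Some(match mask {
        0x01 => 'A',
        0x02 => 'C',
        0x03 => 'M', // A/C
        0x04 => 'G',
        0x05 => 'R', // A/G
        0x06 => 'S', // C/G
        0x07 => 'V', // A/C/G
        0x08 => 'T',
        0x09 => 'W', // A/T
        0x0A => 'Y', // C/T
        0x0B => 'H', // A/C/T
        0x0C => 'K', // G/T
        0x0D => 'D', // A/G/T
        0x0E => 'B', // C/G/T
        0x0F => 'N', // A/C/G/T
        _ => return None,
    })
}
```

A quick consistency check on a record is to compare `iupac_from_mask(mask)` with the character at the center of the sequence line.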

Inspect the output:

zstd -d -c snpmers.fa.zst | head

Parameter notes

  • --k: must be odd and >= 3.
    • Internally the (k-1) flanks are packed into a u128 at 2 bits per base, so 2*(k-1) must be <= 128 and k is at most 65.
    • There is also a practical limit tied to --shards because a packed key is split into (shard_idx, quotient) where quotient is stored as a u32. If you see an error like “too large to fit 32-bit quotient”, reduce k or increase --shards (in powers of two).
  • --compression-level: zstd level 1–21 (higher = smaller output, slower).
  • --threads: sets the Rayon worker pool; the zstd writer is also configured to use up to this many workers.
  • --shards: number of hash-map shards (rounded up to the next power of two). More shards can improve parallel merging but add overhead.
  • --max-indexed-kmers: soft cap for how many distinct flank keys to keep in memory at once.
    • If the limit is exceeded, snpmer processes shards in multiple passes, deferring some shards to later passes.
    • Set lower to reduce peak memory; set 0 to disable the limit entirely.
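The two limits on --k can be estimated before a run. The sketch below is only an approximation under stated assumptions: `validate_k` mirrors the documented rules (odd, >= 3, 2*(k-1) bits <= 128), while `quotient_fits_u32` assumes that the shard index consumes log2(shards) bits of the packed key and that the remaining bits form the quotient. The README does not spell out how the split is done, so both function names and that bit accounting are hypothetical.

```rust
/// Check the documented constraints on k: odd, >= 3, and 2*(k-1) bits <= 128.
fn validate_k(k: usize) -> Result<(), String> {
    if k < 3 || k % 2 == 0 {
        return Err(format!("k must be odd and >= 3, got {k}"));
    }
    let key_bits = 2 * (k - 1);
    if key_bits > 128 {
        return Err(format!("k = {k} needs {key_bits} key bits, more than a u128 holds"));
    }
    Ok(())
}

/// Hypothetical estimate of whether the quotient fits in a u32.
/// Assumption: the shard index takes log2(shards) bits of the packed key
/// and the quotient takes whatever is left.
fn quotient_fits_u32(k: usize, shards: usize) -> bool {
    // --shards is rounded up to the next power of two.
    let shards = shards.next_power_of_two();
    let shard_bits = shards.trailing_zeros() as usize;
    let key_bits = 2 * k.saturating_sub(1);
    key_bits.saturating_sub(shard_bits) <= 32
}
```

Under that assumption, the defaults (k = 21, 8192 shards) leave 40 - 13 = 27 quotient bits, whereas k = 25 with 8192 shards would need 35 bits and would call for at least 65536 shards.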

Caveats

  • Only A/C/G/T are treated as valid bases. Any other character (e.g. N) breaks k-mer windows that overlap it.
  • Input FASTA headers are not preserved; output records are generated (snpmer_1, snpmer_2, …).
  • Output order is not guaranteed (hash-map iteration and multi-pass partitioning).
  • SNPmers do not include positional information; they represent contexts with variable central bases, not coordinates.
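To illustrate the first caveat, the sketch below enumerates only the k-mer windows that contain no invalid character by splitting the sequence at every non-ACGT byte. The function name `valid_windows` is hypothetical, and the sketch accepts uppercase bases only; how snpmer itself treats lowercase (soft-masked) input is not stated here.

```rust
/// Return every length-k window of `seq` made up solely of A/C/G/T.
/// Any other byte (e.g. N) splits the sequence, so no window spans it.
fn valid_windows(seq: &[u8], k: usize) -> Vec<&[u8]> {
    assert!(k > 0, "k must be positive");
    seq.split(|&b| !matches!(b, b'A' | b'C' | b'G' | b'T'))
        // Runs shorter than k cannot hold a full window.
        .filter(|run| run.len() >= k)
        .flat_map(|run| run.windows(k))
        .collect()
}
```

With k = 3, the sequence `ACGNTTA` yields just `ACG` and `TTA`: the windows `CGN`, `GNT`, and `NTT` are dropped because they overlap the N.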
