A high-performance tool for analyzing k-mer variability across genomes, designed to identify regions of unique sequence complexity. VarProfiler helps researchers understand genomic variation patterns by computing unique k-mer counts in sliding windows across chromosomes.
-
kmer_profiler: Fast C++ tool for k-mer counting in sliding windows
- Efficient 2-bit encoding for k-mers up to 31bp
- Multi-threaded chromosome processing
- Canonical k-mer handling (forward/reverse complement)
-
cenpb_finder: Specialized C++ tool for CENP-B box detection
- Edit distance search with Myers' algorithm
- IUPAC degenerate nucleotide support
- Both strand searching
- Unlimited thread scaling
-
motif_discovery: Genome-wide motif variant discovery tool
- Finds all variants within edit distance of reference motif
- Reports frequency statistics per strand
- Supports IUPAC degenerate codes
- Multi-threaded genome scanning
- Satellite DNA detection: Automatic identification of low k-mer diversity regions
- Advanced centromere prediction: Multi-signal scoring system combining:
- CENP-B box density (primary signal, 40% weight)
- K-mer diversity minima (30% weight)
- Satellite array size context (20% weight)
- Local minimum detection (10% weight)
- CENP-B box quantification: Count and map CENP-B binding sites with biological calibration
- Confidence scoring: High/Medium/Low confidence levels for centromere predictions
- Sequence extraction: Export detected regions as FASTA sequences
- TRF integration: Classify repeats using Tandem Repeat Finder annotations
- Chromosome plots: Individual plots for each chromosome
- Karyotype view: All chromosomes with consistent scaling
- Grouped plots: Size-based grouping for better visibility
- HTML reports: Complete analysis summary with all results
- GFF annotations: Standard format for genome browsers
- C++ compiler with C++17 support (g++ or clang++)
- Python 3.7+
- Required Python packages: pandas, matplotlib
# Clone the repository
git clone https://github.com/aglabx/varprofiler.git
cd varprofiler
# Compile the C++ tools (kmer_profiler and cenpb_finder)
make
# Install Python dependencies
pip install -r requirements.txt
# Or install as a package
pip install .The easiest way to run VarProfiler is using the unified pipeline script that handles all analysis steps automatically:
# Basic usage - runs complete analysis with default parameters
python varprofiler_pipeline.py genome.fa -o results/
# Use predefined profiles for different analysis depths
python varprofiler_pipeline.py genome.fa -o results/ --profile standard
# Use configuration file for reproducible analyses
python varprofiler_pipeline.py genome.fa -c config.json
# Using Makefile shortcuts
make standard-run GENOME=genome.faVarProfiler includes three predefined analysis profiles:
| Profile | K-mer Size | Window Size | Step Size | Use Case |
|---|---|---|---|---|
| quick | 15 | 200kb | 100kb | Fast overview, large genomes |
| standard | 23 | 100kb | 25kb | Default, balanced analysis |
| detailed | 31 | 50kb | 10kb | High resolution, small genomes |
The pipeline creates an organized output directory with:
varprofiler_results/
├── kmer_profiles/ # BED files with k-mer counts
├── satellites/ # Detected regions (GFF, FASTA, summary)
├── plots/ # Chromosome and karyotype visualizations
├── logs/ # Pipeline logs and configuration
└── report.html # Complete HTML report with all results
For reproducible analyses, use a JSON configuration file (see config_example.json):
{
"input_fasta": "genome.fa",
"output_dir": "varprofiler_results",
"kmer_size": 23,
"window_size": 100000,
"step_size": 25000,
"threads": 8,
"detect_satellites": true,
"find_centromeres": true,
"percentile": 5,
"min_satellite_length": 10000,
"trf_annotations": "trf_output.gff"
}Run with config: python varprofiler_pipeline.py genome.fa -c config.json
For more control over individual steps, you can run each component separately:
./kmer_profiler <input.fasta> <output.bed> <kmer_len> <window_size> <step_size> [threads]Parameters:
input.fasta: Input genome file in FASTA formatoutput.bed: Output BED file with k-mer countskmer_len: Length of k-mers (1-31, recommended: 23)window_size: Size of sliding window in bp (e.g., 100000)step_size: Step size for sliding window in bp (e.g., 25000)threads: (Optional) Number of threads to use (default: number of CPU cores)
Examples:
# Using default number of threads (auto-detect CPU cores)
./kmer_profiler genome.fa kmer_counts.bed 23 100000 25000
# Using specific number of threads (e.g., 8 threads)
./kmer_profiler genome.fa kmer_counts.bed 23 100000 25000 8python detect_satellites.py <input.bed> -k <kmer_size> [-o output_prefix] [-g genome.fa]Parameters:
input.bed: BED file generated by kmer_profiler-k, --kmer-size: Required - K-mer size used in analysis-o, --output-prefix: Prefix for output files (default: satellites)-p, --percentile: Percentile threshold for low variability (default: 5)-m, --min-length: Minimum region length in bp (default: 10000)-g, --genome: Genome FASTA file for sequence extraction-t, --trf-annotations: GFF file with TRF annotations for repeat classification--global-threshold: Use global threshold instead of per-chromosome--find-centromeres: Attempt to identify centromere candidates--all-centromeres: Report all centromere candidates, not just best per chromosome--centromere-min-score: Minimum score for centromere candidates (0-100, default: 30)
Output files:
.gff3: GFF3 annotations for all detected regions.fasta: FASTA sequences of detected regions (if genome provided)_centromeres.fasta: FASTA sequences of predicted centromeres with detailed headers_summary.txt: Summary statistics_cenpb.tsv: CENP-B box analysis results (if genome provided)
Example:
# Basic satellite detection
python detect_satellites.py kmer_counts.bed -k 23 -o human_satellites
# With sequence extraction and centromere detection
python detect_satellites.py kmer_counts.bed -k 23 -o human_satellites -g genome.fa --find-centromeres
# With TRF annotations for repeat classification
python detect_satellites.py kmer_counts.bed -k 23 -o human_satellites -t trf_annotations.gff
# Full analysis with all features including CENP-B box detection
python detect_satellites.py kmer_counts.bed -k 23 -o human_satellites -g genome.fa -t trf_annotations.gff --find-centromeres --cenpb-threads 4
# Find ALL potential centromeres (not just best per chromosome)
python detect_satellites.py kmer_counts.bed -k 23 -o all_centromeres -g genome.fa --find-centromeres --all-centromeres --centromere-min-score 40python plot_chromosomes.py <input.bed> -k <kmer_size> [-o output_dir] [-g gff_file]Parameters:
input.bed: BED file generated by kmer_profiler-k, --kmer-size: Required - K-mer size used in analysis (must match kmer_profiler)-o, --output_dir: Directory to save plots (default: kmer_plots)-g, --gff: Optional GFF file with satellite DNA annotations-c, --centromere-gff: Optional GFF file with centromere annotations (e.g., from detect_satellites.py)
Examples:
# Basic visualization (k=23 was used in kmer_profiler)
python plot_chromosomes.py kmer_counts.bed -k 23 -o chromosome_plots
# With satellite DNA annotations overlay (k=31 was used)
python plot_chromosomes.py kmer_counts.bed -k 31 -o chromosome_plots -g satellites.gff
# With centromere regions from detect_satellites.py
python plot_chromosomes.py kmer_counts.bed -k 23 -o chromosome_plots -c satellites.gff3
# Combined: both TRF satellites and detected centromeres
python plot_chromosomes.py kmer_counts.bed -k 23 -o chromosome_plots -g trf_satellites.gff -c detected_regions.gff3
# For diploid genomes with maternal/paternal chromosomes
# Input BED file should contain chromosomes named like: chr1_mat, chr1_pat, chr2_maternal, chr2_paternal, etc.
python plot_chromosomes.py diploid_kmers.bed -k 23 -o diploid_plotsThe tool outputs a standard BED file with the following columns:
- Chromosome name
- Window start position (0-based)
- Window end position
- Count of unique k-mers in the window
- Generates one PNG file per chromosome
- Additional karyotype-wide plot showing all chromosomes together
- High-resolution plots (200 DPI) suitable for publication
- X-axis: Genomic position in megabases (Mb)
- Y-axis: Percentage of unique k-mers in window (0-110%)
- Fixed Y-axis scale for easy comparison across chromosomes
- Orange shading: Satellite DNA regions from GFF annotations (when provided)
- Red shading: Centromere-containing regions (when provided via -c)
- Cyan line/area: K-mer variability profile
- Title includes k-mer size for clarity
Standard Karyotype Plot
- Single image with all chromosomes arranged in grid (3 pairs per row for diploid genomes)
- Consistent scale: 1 Mb has same width across all chromosomes
- Natural chromosome sorting (1, 2, ..., 10, ..., 22, X, Y, Z, W, MT)
- Diploid genome support:
- For diploid assemblies, use
_mator_maternalsuffix for maternal chromosomes and_pator_paternalsuffix for paternal chromosomes - Autosomal chromosomes will be displayed side by side as pairs
- Sex chromosomes (X, Y, Z, W) are always displayed individually, even if they have mat/pat suffixes
- For diploid assemblies, use
- Saved as
karyotype_kmer_distribution.png
Grouped Karyotype Plots
- Chromosomes grouped by size for better visibility of smaller chromosomes
- Each size group saved as separate file with its own scale
- Chromosomes fill at least 60% of plot width in each group
- Automatic grouping based on genome size:
- Large genomes: >150 Mb, 100-150 Mb, 50-100 Mb, 10-50 Mb, Dot chromosomes (<10 Mb)
- Medium genomes: >100 Mb, 50-100 Mb, 25-50 Mb, 10-25 Mb, Dot chromosomes (<10 Mb)
- Small genomes: >50 Mb, 25-50 Mb, 10-25 Mb, 5-10 Mb, Dot chromosomes (<5 Mb)
- Dot chromosomes displayed with up to 8 per row for better space utilization
- Chromosome sizes shown in titles for easy reference
- Saved as separate files:
grouped_Large_gt150_Mb_kmer_distribution.png,grouped_Dot_chromosomes_lt10_Mb_kmer_distribution.png, etc.
VarProfiler uses a sophisticated multi-signal approach for centromere identification, moving beyond simple k-mer minima to a biologically-informed scoring system.
-
CENP-B Box Density (40% weight) - Primary signal
-
2.5 boxes/kb: Maximum score (alpha-satellite rich)
- 1-2.5 boxes/kb: Moderate score (typical arrays)
- <1 box/kb: Low score (degenerate/poor arrays)
- Expected: ~3 boxes/kb in human alpha-satellites
-
-
K-mer Diversity (30% weight) - Conservation signal
- <2%: Maximum score (highly conserved)
- 2-10%: Moderate score
-
10%: Low score
-
Satellite Array Size (20% weight) - Context signal
-
5 Mb: Maximum score (large arrays)
- 1-5 Mb: Moderate score
- <1 Mb: Low score
-
-
Local Minimum (10% weight) - Additional confidence
- Bonus if window is local k-mer minimum within array
- High (≥70): Red - Strong centromeric signature
- Medium (50-69): Orange - Probable centromere
- Low (<50): Yellow - Possible centromere
The cenpb_finder tool searches for CENP-B boxes - conserved ~17bp sequences that mark functional centromeres. The canonical human CENP-B box is YTTCGTTGGAARCGGGA.
./cenpb_finder <genome.fasta> <regions.bed> <output.tsv> [options]Parameters:
genome.fasta: Input genome in FASTA formatregions.bed: BED file with regions to search (e.g., from detect_satellites.py)output.tsv: Output file with CENP-B box counts and positions-p PATTERN: CENP-B box pattern with IUPAC codes (default: YTTCGTTGGAARCGGGA)-d DISTANCE: Maximum edit distance (default: 3)-t THREADS: Number of threads (default: 1, no upper limit)-s: Search forward strand only (default: both strands)-v: Verbose output
Examples:
# Direct usage - search in low k-mer diversity regions
./cenpb_finder genome.fa low_kmer_regions.bed cenpb_boxes.tsv -d 3 -t 192
# Custom pattern for different species
./cenpb_finder mouse.fa satellites.bed mouse_cenpb.tsv -p YTTCGTTGGAWRCGGGA -d 4
# Integrated with satellite detection (automatic)
python detect_satellites.py kmer_counts.bed -k 23 -g genome.fa --find-centromeres \
--cenpb-pattern YTTCGTTGGAARCGGGA --cenpb-distance 3 --cenpb-threads 8Output format (TSV):
#chrom start end kmer_count cenpb_count cenpb_density cenpb_positions
chr1 5000000 5100000 1234 15 0.15 5001234,5002345,5003456,...
chr1 5200000 5300000 1456 23 0.23 5201234,5202345,...
- Edit distance search: Uses Myers' bit-parallel algorithm for efficient approximate matching
- IUPAC support: Handles degenerate nucleotide codes (Y=C/T, R=A/G, W=A/T, etc.)
- Both strand search: Searches forward and reverse complement
- Parallel processing: Multi-threaded with no artificial thread limit
- Memory efficient: Processes regions in parallel without loading entire genome
- Integration: Automatically runs on low k-mer diversity regions when detecting centromeres
CENP-B boxes are binding sites for CENP-B protein, which:
- Helps establish and maintain centromeric chromatin
- Is found in most human centromeres (except Y chromosome)
- Shows conservation across primates and mammals
- High density indicates functional centromeric regions
The CENP-B box count and density are included in the GFF annotations and help confirm functional centromeres.
For non-human genomes or to refine the CENP-B box consensus, use the motif discovery tool:
./motif_discovery <genome.fasta> <motif> <max_distance> <threads> <output.tsv>Parameters:
genome.fasta: Input genome in FASTA formatmotif: Reference motif with IUPAC codes (e.g., YTTCGTTGGAARCGGGA)max_distance: Maximum edit distance (0-10)threads: Number of parallel threadsoutput.tsv: Output TSV with variant statistics
Output TSV columns:
motif: Discovered sequence variant (may be different length for indels)alignment: Visual alignment showing differences (.= match, letter = mismatch,-= gap)edit_distance: True edit distance including insertions/deletionscigar: CIGAR string (M=match, X=mismatch, I=insertion, D=deletion)count_forward: Occurrences on forward strandcount_reverse: Occurrences on reverse strandcount_total: Total occurrencesfreq_forward: Percentage on forward strandfreq_reverse: Percentage on reverse strandfreq_total: Overall percentage
Examples:
# Discover CENP-B variants in human genome
./motif_discovery human.fa YTTCGTTGGAARCGGGA 3 8 cenpb_variants.tsv
# Search for variants in mouse genome with relaxed threshold
./motif_discovery mouse.fa YTTCGTTGGAWRCGGGA 4 16 mouse_cenpb.tsv
# Find exact matches only
./motif_discovery genome.fa TTCGTTGGAAACGGGA 0 4 exact_matches.tsvProgress tracking: The tool displays a real-time progress bar showing:
- Percentage completed
- Megabases processed/total
- Estimated time remaining (ETA)
Use cases:
- Species-specific optimization: Find the most common CENP-B variant in your species
- Motif refinement: Identify degenerate positions in the consensus
- Quality control: Verify expected motif frequencies
- Comparative genomics: Compare motif usage across species
The tool will report the top 10 most frequent variants to the console and save all variants sorted by frequency to the CSV file.
After discovering motif variants and annotating satellites, analyze which motifs are specific to satellite DNA:
# Install required dependency (optional, for progress bars)
pip install tqdm
python analyze_motif_satellites.py <motif_discovery.tsv> <satellite_annotations.tsv> -o enrichment.tsvInput files:
motif_discovery.tsv: Output from motif_discovery toolsatellite_annotations.tsv: Satellite annotations withtrf_arraycolumn containing sequences
Output columns:
motif: Sequence variantgenome_total: Total occurrences in whole genomesatellite_total: Occurrences in satellite DNAsatellite_fraction: Proportion found in satellites (0-1)enrichment: Classification (satellite-enriched or genome-wide)
Examples:
# Basic analysis
python analyze_motif_satellites.py cenpb_variants.tsv satellites.tsv -o cenpb_enrichment.tsv
# Analyze only top 100 motifs
python analyze_motif_satellites.py variants.tsv satellites.tsv --top-motifs 100
# Include breakdown by satellite family
python analyze_motif_satellites.py variants.tsv satellites.tsv --by-typeInterpretation:
- >80% in satellites: Highly satellite-specific (likely functional)
- 50-80% in satellites: Moderately enriched
- <50% in satellites: Genome-wide distribution
- 0% in satellites: Present only outside satellite DNA
Create FASTA files where motif occurrences are highlighted in lowercase:
python visualize_motifs_in_satellites.py <satellites.tsv> <motifs.tsv> -o marked_satellites.faFeatures:
- Converts motif occurrences to lowercase for visualization
- Marks both forward and reverse strand matches
- Preserves original sequence with case-based highlighting
- Adds coverage statistics to headers
Input options:
- Satellite TSV: File with
trf_arraycolumn - Motif TSV: Output from motif_discovery
- Single motif: Use
--single-motif SEQUENCEfor one motif
Examples:
# Mark CENP-B boxes in satellites
python visualize_motifs_in_satellites.py satellites.tsv cenpb_variants.tsv -o marked.fa --top-motifs 10
# Mark single motif
python visualize_motifs_in_satellites.py satellites.tsv dummy --single-motif YTTCGTTGGAARCGGGA -o marked.fa
# Process FASTA directly
python visualize_motifs_in_satellites.py genome.fa motifs.tsv -o marked_genome.fa --fasta-input
# Add statistics to headers
python visualize_motifs_in_satellites.py satellites.tsv motifs.tsv -o marked.fa --statsOutput format:
>sat_1 chr=chr1 pos=1000-2000 period=171 motif_bp=85 coverage=5.0%
ATCGATCGttcgttggaaacgggaATCGATCGATCGATCGATCGttcgttggaaacggga...
Where lowercase letters indicate motif occurrences.
For detailed monomer-level analysis, visualize motifs in decomposed satellite arrays:
python visualize_motifs_in_monomers.py <monomers.tsv> <motifs.tsv> -o marked_monomers.tsvInput format (monomer TSV):
sequence_id orientation index type length is_flank sequence
chr1_array fwd 0 MONOMER 171 FALSE ATCG...
chr1_array fwd 1 MONOMER 171 FALSE CGTA...
Features:
- Marks motifs in each monomer/flank sequence
- Adds
marked_sequencecolumn with lowercase motifs - Adds
motif_countcolumn with number of motif occurrences (for easy filtering) - Calculates per-monomer coverage statistics
- Groups statistics by sequence type (MONOMER, LEFT_FLANK, etc.)
- Identifies monomers with highest motif density
Examples:
# Mark top 10 motifs in all monomers
python visualize_motifs_in_monomers.py monomers.tsv motifs.tsv -o marked.tsv --top-motifs 10
# Process only monomers (skip flanks)
python visualize_motifs_in_monomers.py monomers.tsv motifs.tsv -o marked.tsv --only-monomers
# Generate FASTA output with statistics
python visualize_motifs_in_monomers.py monomers.tsv motifs.tsv -o marked.tsv --fasta marked.fa --stats stats.tsv
# Filter FASTA by minimum coverage
python visualize_motifs_in_monomers.py monomers.tsv motifs.tsv -o marked.tsv --fasta marked.fa --min-coverage 5.0Output columns in TSV:
- All original columns preserved
marked_sequence: Sequence with motifs in lowercasemotif_count: Number of motif occurrences (0, 1, 2, etc.)
Output statistics:
- Sequences with/without motifs (percentage)
- Motif count distribution (0, 1, 2, 3+ motifs)
- Per-monomer motif coverage percentage
- Average coverage by type (MONOMER vs FLANK)
- Top monomers by motif density
Filtering examples (using output TSV):
# Get only sequences with motifs
awk -F'\t' 'NR==1 || $NF>0' marked_monomers.tsv > with_motifs.tsv
# Get sequences with 2+ motifs
awk -F'\t' 'NR==1 || $NF>=2' marked_monomers.tsv > multiple_motifs.tsvThis helps identify:
- Which monomers contain functional motifs (e.g., CENP-B boxes)
- Motif distribution patterns within arrays
- Differences between monomers and flanking sequences
- Evolution of motif presence/absence in tandem arrays
VarProfiler uses an efficient sliding window approach:
- K-mer encoding: Nucleotides are encoded in 2-bit format (A=00, C=01, G=10, T=11)
- Canonical k-mers: For each k-mer, the lexicographically smaller of the forward and reverse complement is used
- Gap handling: K-mers containing 'N' bases are skipped, preventing assembly gaps from being misidentified as low-diversity regions
- Sliding window:
- Initial window: Count all unique k-mers
- Subsequent windows: Add new k-mers entering the window, remove k-mers leaving
- Parallel processing: Each chromosome is processed in a separate thread
- Memory usage: Approximately 1GB per 100Mb of sequence data
- Speed: Processes human genome (~3Gb) in approximately 30-60 minutes on a modern multi-core system
- Optimization tips:
- Use larger step sizes for faster processing
- Adjust window size based on desired resolution
- K-mer length of 23 provides good balance of specificity and speed
- Genome assembly validation: Identify regions of low complexity or repetitive sequences
- Comparative genomics: Compare k-mer profiles between species or individuals
- Variant detection: Identify regions with high sequence variability
- Primer design: Find regions with unique k-mers for specific amplification
- Genome annotation: Correlate k-mer complexity with functional elements
If you use VarProfiler in your research, please cite:
Marina Popova, Aleksey Komissarov (2025)
VarProfiler: A tool for k-mer variability profiling across genomes
bioRxiv (manuscript in preparation)
- Marina Popova
- Aleksey Komissarov
- Claude (AI assistant) - Co-author for code implementation and documentation
MIT License - See LICENSE file for details
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
For questions and support:
- Email: Aleksey Komissarov ad3002@gmail.com
- GitHub Issues: https://github.com/aglabx/varprofiler/issues