Select N maximally diverse texts from M input texts
DiverseN is a production-ready, encoder-agnostic Python framework for selecting the most diverse subset of texts. Given M input texts and a target N, it selects the N texts that maximize diversity.
DiverseN selects a diverse subset from your text collection:

- `selection.N` = number of texts to select
- Given M texts and a target N, it selects the N most diverse texts
- Objective: maximize diversity among the selected texts

Example: from 200 input texts with N=20, DiverseN selects the 20 most diverse.
- Encoder-Agnostic Design: Swap text encoders with a single config change
- Multiple Selection Strategies:
  - Clustering-based (Equal-sized K-Means, HDBSCAN, Spectral)
  - Max-sum optimization (Brute Force, Greedy, Local Search with Tabu)
- Equal-Sized Clustering: `kmeans_equal` strategy for balanced clusters
- Flexible Outlier Handling: Three strategies for HDBSCAN outliers (reassign, singleton, ignore)
- Flexible Distance Metrics: Cosine, Euclidean, BERTScore
- Comprehensive Diversity Metrics: 6 evaluation metrics for comparing approaches
- GPU Acceleration: Automatic CUDA/MPS support for encoding
- Caching: Smart embedding caching to avoid recomputation
- CLI Interface: Easy-to-use commands for all operations
```bash
# Clone the repository
git clone <repository-url>
cd Sentence_Diversification

# Install with pip
pip install -e .

# Install with dev dependencies (testing, linting, formatting)
pip install -e ".[dev]"

# Install with notebook support (Jupyter, matplotlib, seaborn)
pip install -e ".[notebook]"

# Install with both dev and notebook dependencies
pip install -e ".[dev,notebook]"
```

Requirements:

- Python 3.9+ (tested on 3.9, 3.10, 3.11, 3.12, 3.13)
- PyTorch 2.0+
- See `pyproject.toml` for the full dependency list
Create a JSONL file with your texts (one JSON object per line):
{"id": "1", "text": "First text goes here."}
{"id": "2", "text": "Second text goes here."}
{"id": "3", "text": "Third text goes here."}Copy and edit configs/example.yaml:
Copy and edit `configs/example.yaml`:

```yaml
experiment_name: "my_experiment"
seed: 42

data:
  input_path: "data/sample/texts.jsonl"
  output_dir: "outputs/my_experiment"

encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"

# N = number of diverse texts to select
selection:
  N: 20

approach:
  clustering:
    strategy: "kmeans_equal"      # Recommended for equal-sized clusters
    outlier_handling: "reassign"  # Options: reassign, singleton, ignore
    selection_policy: "medoid"
```
```bash
# Generate embeddings
python -m diverseN.cli embed --config configs/example.yaml

# Compute distances
python -m diverseN.cli distances --config configs/example.yaml

# Select diverse texts using max-sum approach
python -m diverseN.cli select --config configs/example.yaml

# Or select using clustering approach
python -m diverseN.cli cluster --config configs/example.yaml

# Or compare both approaches
python -m diverseN.cli compare --config configs/example.yaml
```

Clustering approach: creates N clusters, then selects one representative (medoid) from each cluster.
How it works:
- Create clusters using the configured clustering strategy
- Handle outliers according to the `outlier_handling` setting
- Select the medoid from each cluster (sketched below)
- Return one text per cluster
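A minimal sketch of the medoid step, assuming a precomputed pairwise distance matrix `D` (the function name and shapes are illustrative, not DiverseN's internal API):

```python
import numpy as np

def cluster_medoid(D: np.ndarray, members: list[int]) -> int:
    """Return the cluster member with minimal total distance to the rest."""
    sub = D[np.ix_(members, members)]  # distances restricted to the cluster
    return members[int(np.argmin(sub.sum(axis=1)))]
```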
Strategies:
- `kmeans_equal` (recommended): K-Means with enforced equal-sized clusters. Creates exactly N clusters.
- `spectral`: Graph-based clustering, naturally balanced. Creates exactly N clusters.
- `hdbscan`: Density-based, auto-determines the cluster count (ignores N). With `outlier_handling: "singleton"`, total selected = K natural clusters + O outliers.
When to use: When you want guaranteed representation from all semantic regions.
Max-sum approach: optimizes directly for diversity by maximizing the sum of pairwise distances.
How it works:
- Use the configured diversification strategy to select N texts
- Objective: maximize the sum of pairwise distances among the selected texts
Strategies:
- `greedy`: Fast O(N·M), good quality (sketched below)
- `local_search`: Slower, higher quality (~90-98% of optimal)
- `brute_force`: Optimal, but only feasible for small N
When to use: When you want maximum diversity without semantic clustering constraints.
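A minimal sketch of the greedy idea over a precomputed distance matrix `D` (illustrative only; DiverseN's `greedy` strategy with `init: "farthest_point"` may differ in detail):

```python
import numpy as np

def greedy_max_sum(D: np.ndarray, N: int) -> list[int]:
    """Greedily grow a set of N indices, maximizing summed pairwise distance."""
    M = D.shape[0]
    # Farthest-point init: start from a point on the most distant pair
    selected = [int(np.argmax(D.max(axis=1)))]
    while len(selected) < N:
        remaining = [i for i in range(M) if i not in selected]
        # Marginal gain of adding i = its total distance to the current set
        gains = [D[i, selected].sum() for i in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected
```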
When using HDBSCAN clustering, configure how to handle outliers (texts that don't fit any cluster):
```yaml
approach:
  clustering:
    strategy: "hdbscan"
    outlier_handling: "reassign"  # Options below
```

| Option | Description |
|---|---|
| `reassign` | Reassign outliers to the nearest cluster, then balance cluster sizes |
| `singleton` | Treat each outlier as its own singleton cluster (all outliers are selected; cluster balancing is skipped) |
| `ignore` | Exclude outliers from selection entirely |
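A minimal sketch of what `reassign` does conceptually (HDBSCAN labels outliers as -1; the cluster-balancing step is omitted here, and this is not DiverseN's exact code):

```python
import numpy as np

def reassign_outliers_sketch(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Move each outlier (label -1) to the cluster with the nearest centroid."""
    labels = labels.copy()
    cluster_ids = sorted(set(labels.tolist()) - {-1})
    centroids = np.stack([X[labels == c].mean(axis=0) for c in cluster_ids])
    for i in np.where(labels == -1)[0]:
        nearest = int(np.argmin(np.linalg.norm(centroids - X[i], axis=1)))
        labels[i] = cluster_ids[nearest]
    return labels
```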
DiverseN computes 6 metrics to evaluate the diversity of selected texts:
| Metric | Description | Better |
|---|---|---|
| Avg Pairwise Cosine Distance | Mean cosine distance between all pairs | Higher |
| Min Pairwise Distance | Minimum distance between any pair | Higher |
| Covariance Trace | Sum of variances (total spread) | Higher |
| Covariance Determinant | Log-determinant (generalized variance) | Higher |
| Cluster Entropy | How evenly texts spread across clusters | Higher |
| KL Divergence from Uniform | Deviation from uniform cluster distribution | Lower |
These metrics are automatically computed and included in reports.
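For intuition, a sketch of three of these metrics over L2-normalized embeddings `X` and cluster `labels` (names are illustrative; the library's own implementations live in `diverseN.evaluation.metrics`):

```python
import numpy as np

def avg_pairwise_cosine_distance(X: np.ndarray) -> float:
    dists = 1.0 - X @ X.T              # cosine distance for unit-norm rows
    iu = np.triu_indices(len(X), k=1)  # each unordered pair counted once
    return float(dists[iu].mean())

def cluster_entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())  # higher = more even spread

def kl_divergence_from_uniform(labels: np.ndarray) -> float:
    # KL(p || uniform over K clusters) = log K - H(p); lower = more uniform
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.log(len(p)) + (p * np.log(p)).sum())
```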
Results are saved as CSV files:
Output files:
- `selected_clustering.csv`: Texts selected by clustering approach
- `selected_maxsum.csv`: Texts selected by max-sum approach
- `clusters.csv`: Cluster assignments (clustering approach only)
- `report_clustering.json`/`.md`: Metrics report for clustering
- `report_maxsum.json`/`.md`: Metrics report for max-sum
- `comparison.md`: Side-by-side comparison of both approaches
CSV columns:
| Column | Description |
|---|---|
| `id` | Original text ID |
| `text` | Text content |
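A quick way to inspect a result file, assuming the `output_dir` from the example config and that pandas is installed:

```python
import pandas as pd

# Load the texts chosen by the clustering approach
df = pd.read_csv("outputs/my_experiment/selected_clustering.csv")
print(df[["id", "text"]].head())
```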
```yaml
selection:
  # Number of diverse texts to select
  N: 20
```

```yaml
approach:
  clustering:
    # Strategy: "kmeans_equal" (recommended), "spectral", or "hdbscan"
    strategy: "kmeans_equal"

    # Outlier handling for HDBSCAN: "reassign", "singleton", or "ignore"
    outlier_handling: "reassign"

    # Selection policy: "medoid" (recommended)
    selection_policy: "medoid"

    # Equal-sized K-Means parameters
    kmeans_equal:
      max_iter: 300
      n_init: 10

    # HDBSCAN parameters (if using hdbscan)
    hdbscan:
      min_cluster_size: 10
      min_samples: 5
      metric: "euclidean"

    # Spectral parameters (if using spectral)
    spectral:
      assign_labels: "kmeans"
```

```yaml
approach:
  max_sum:
    # Strategy: "greedy", "local_search", or "brute_force"
    strategy: "greedy"

    greedy:
      init: "farthest_point"

    local_search:
      restarts: 10
      max_iters: 1000
      tabu_window: 25
      allow_stochastic: true
      temperature: 0.1

    brute_force:
      max_combinations: 2000000
      time_limit_sec: 30
```

```yaml
encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"      # "auto" (cuda > mps > cpu), "cuda", "mps", or "cpu"
  batch_size: 64
  normalize: true     # L2-normalize embeddings
  cache_path: "cache/embeddings.npy"
  use_float16: false  # Use half precision to save memory
```

```yaml
distance:
  # Options: "cosine", "euclidean", or "bertscore"
  type: "cosine"
  normalize: false

  # BERTScore-specific settings (only used if type="bertscore")
  bertscore:
    model: "microsoft/deberta-large-mnli"
    idf: false  # Use IDF weighting for rare words
```

Generate embeddings for all texts.
```bash
python -m diverseN.cli embed --config configs/example.yaml
```

Outputs: `embeddings.npy`
Compute pairwise distance matrix.
```bash
python -m diverseN.cli distances --config configs/example.yaml
```

Outputs: `distances.npy`
Select diverse texts using clustering approach.
```bash
python -m diverseN.cli cluster --config configs/example.yaml
```

Outputs: `clusters.csv`, `selected_clustering.csv`, `report_clustering.json`/`.md`
Select diverse texts using max-sum approach.
```bash
python -m diverseN.cli select --config configs/example.yaml
```

Outputs: `selected_maxsum.csv`, `report_maxsum.json`/`.md`
Run both approaches and generate comparison.
```bash
python -m diverseN.cli compare --config configs/example.yaml
```

Outputs: all of the above plus `comparison.md`
Interactive notebooks are available in the `notebooks/` directory (not tracked in git):

```bash
# Install notebook dependencies
pip install -e ".[notebook]"
```

Notebooks demonstrate loading texts, computing distances, running both selection approaches, computing diversity metrics, and comparing results.
- Encoding: 10-60s depending on model and GPU
- Distances: 1-5s
- Selection (Greedy): <1s
- Selection (Local Search): 5-30s
- Clustering: 1-5s
- Enable caching: Set `cache_path` to reuse embeddings
- Use GPU: Set `device: "auto"` to auto-detect CUDA/MPS
- Try float16: Set `use_float16: true` to reduce memory
- Batch size: Increase if you have enough GPU memory
- Strategy selection:
  - Use `greedy` for speed
  - Use `local_search` for better quality
  - Use `kmeans_equal` clustering for balanced clusters
Run the test suite:

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest --override-ini="addopts="

# Run with coverage
pytest --override-ini="addopts=" --cov=diverseN --cov-report=html
```

```
Sentence_Diversification/
├── configs/
│ └── example.yaml # Example configuration (copy and customize)
├── data/
│ └── sample/ # Sample dataset
│ └── texts.jsonl
├── src/diverseN/
│ ├── cli.py # Command-line interface
│ ├── encoders/ # Text encoder implementations
│ ├── distances/ # Distance calculators
│ ├── similarities/ # Alternative similarity metrics (BERTScore)
│ ├── clustering/ # Clustering strategies
│ │ ├── kmeans_equal.py # Equal-sized K-Means
│ │ └── utils.py # Outlier handling, balancing
│ ├── diversify/ # Diversification strategies
│ ├── pipeline/ # Selection orchestration
│ ├── evaluation/ # Metrics and reporting
│ │ └── metrics.py # 6 diversity metrics
│ ├── io/ # Data loading and saving
│ └── utils/ # Logging, seeding, caching
├── notebooks/ # Interactive notebooks (not tracked in git)
├── tests/ # Unit tests
├── outputs/ # Generated results (not tracked in git)
├── cache/ # Embedding cache (not tracked in git)
├── pyproject.toml # Project metadata and dependencies
├── LICENSE # MIT License
└── README.md
```
DiverseN uses clean abstractions via Python ABCs:
```python
class TextEncoder(ABC):
    def encode(self, texts: list[str]) -> np.ndarray: ...

class DistanceCalculator(ABC):
    def pairwise(self, X: np.ndarray) -> np.ndarray: ...

class ClusteringStrategy(ABC):
    def fit_predict(self, X: np.ndarray) -> np.ndarray: ...

class DiversificationStrategy(ABC):
    def select(self, D: np.ndarray, N: int) -> list[int]: ...
```

Key utilities:

```python
from diverseN.clustering.utils import (
    reassign_outliers,       # Reassign outliers to nearest cluster
    balance_clusters,        # Balance cluster sizes
    compute_cluster_medoid,  # Find medoid of a cluster
)

from diverseN.evaluation.metrics import (
    compute_all_metrics,     # Compute all 6 metrics
    average_pairwise_cosine_distance,
    minimum_pairwise_distance,
    covariance_matrix_trace,
    covariance_matrix_determinant,
    cluster_entropy,
    kl_divergence_from_uniform,
)
```

To add a clustering strategy:

- Create `src/diverseN/clustering/my_clustering.py`
- Subclass `ClusteringStrategy` and implement `fit_predict()`
- Register in `cli.py` clustering initialization
To add a diversification strategy:

- Create `src/diverseN/diversify/my_strategy.py`
- Subclass `DiversificationStrategy` and implement `select()`
- Register in `cli.py` strategy initialization
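For example, a hypothetical strategy sketch (the ABC's module path and the registration hook are assumptions; check `cli.py` and the `diversify/` package for the real ones):

```python
import numpy as np
from diverseN.diversify.base import DiversificationStrategy  # assumed path

class RandomBaseline(DiversificationStrategy):
    """Toy strategy: N uniformly random picks, useful as a diversity baseline."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def select(self, D: np.ndarray, N: int) -> list[int]:
        # Ignores the distance matrix apart from its size
        return self.rng.choice(D.shape[0], size=N, replace=False).tolist()
```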
If you use DiverseN in your research, please cite:
```bibtex
@software{diverseN,
  title={DiverseN: A Framework for Text Diversification},
  author={Ron, Tom},
  year={2026},
  url={https://github.com/...}
}
```

MIT License - see LICENSE for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Out of memory:

- Reduce `batch_size`
- Enable `use_float16: true`
- Use a smaller encoder model

Slow encoding or selection:

- Enable GPU: `device: "auto"`
- Use caching: set `cache_path`
- Choose a faster encoder (e.g., `all-MiniLM-L6-v2`)
- Use `greedy` instead of `local_search`

Too many HDBSCAN outliers:

- Set `outlier_handling: "reassign"` to reassign them to the nearest cluster
- Set `outlier_handling: "singleton"` to keep all outliers as individual selections (no cluster balancing)
- Or switch to `kmeans_equal` or `spectral` clustering (these create exactly N clusters with no outliers)
Author: Dr. Tom Ron
For issues, questions, or contributions, please open an issue on GitHub.