DiverseN - Text Diversification Framework

Select N maximally diverse texts from M input texts

DiverseN is a production-ready, encoder-agnostic Python framework for selecting the most diverse subset of texts. Given M input texts and a target N, it selects the N texts that maximize diversity.

Author: Dr. Tom Ron

Key Concept: Selection

DiverseN selects a diverse subset from your text collection:

  • selection.N = number of texts to select
  • Given M input texts and a target N, DiverseN selects the N most diverse texts
  • Objective: Maximize diversity among selected texts

Example: with 200 input texts and N=20, DiverseN selects the 20 most diverse texts from the 200.

Features

  • Encoder-Agnostic Design: Swap text encoders with a single config change
  • Multiple Selection Strategies:
    • Clustering-based (Equal-sized K-Means, HDBSCAN, Spectral)
    • Max-sum optimization (Brute Force, Greedy, Local Search with Tabu)
  • Equal-Sized Clustering: kmeans_equal strategy for balanced clusters
  • Flexible Outlier Handling: Three strategies for HDBSCAN outliers (reassign, singleton, ignore)
  • Flexible Distance Metrics: Cosine, Euclidean, BERTScore
  • Comprehensive Diversity Metrics: 6 evaluation metrics for comparing approaches
  • GPU Acceleration: Automatic CUDA/MPS support for encoding
  • Caching: Smart embedding caching to avoid recomputation
  • CLI Interface: Easy-to-use commands for all operations

Installation

# Clone the repository
git clone <repository-url>
cd Sentence_Diversification

# Install with pip
pip install -e .

# Install with dev dependencies (testing, linting, formatting)
pip install -e ".[dev]"

# Install with notebook support (Jupyter, matplotlib, seaborn)
pip install -e ".[notebook]"

# Install with both dev and notebook dependencies
pip install -e ".[dev,notebook]"

Requirements

  • Python 3.9+ (tested on 3.9, 3.10, 3.11, 3.12, 3.13)
  • PyTorch 2.0+
  • See pyproject.toml for full dependency list

Quick Start

1. Prepare Your Data

Create a JSONL file with your texts (one JSON object per line):

{"id": "1", "text": "First text goes here."}
{"id": "2", "text": "Second text goes here."}
{"id": "3", "text": "Third text goes here."}

2. Configure Your Experiment

Copy and edit configs/example.yaml:

experiment_name: "my_experiment"
seed: 42

data:
  input_path: "data/sample/texts.jsonl"
  output_dir: "outputs/my_experiment"

encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"

# N = number of diverse texts to select
selection:
  N: 20

approach:
  clustering:
    strategy: "kmeans_equal"  # Recommended for equal-sized clusters
    outlier_handling: "reassign"  # Options: reassign, singleton, ignore
    selection_policy: "medoid"

3. Run the Pipeline

# Generate embeddings
python -m diverseN.cli embed --config configs/example.yaml

# Compute distances
python -m diverseN.cli distances --config configs/example.yaml

# Select diverse texts using max-sum approach
python -m diverseN.cli select --config configs/example.yaml

# Or select using clustering approach
python -m diverseN.cli cluster --config configs/example.yaml

# Or compare both approaches
python -m diverseN.cli compare --config configs/example.yaml

Approaches

Clustering-Based Selection

Creates N clusters, then selects one representative (medoid) from each cluster.

How it works:

  1. Create clusters using the configured clustering strategy
  2. Handle outliers according to outlier_handling setting
  3. Select the medoid from each cluster
  4. Return one text per cluster

Strategies:

  • kmeans_equal (recommended): K-Means with enforced equal-sized clusters. Creates exactly N clusters.
  • spectral: Graph-based clustering, naturally balanced. Creates exactly N clusters.
  • hdbscan: Density-based; determines the cluster count automatically (ignores N). With outlier_handling: "singleton", the number of selected texts equals the K natural clusters plus the O outliers.

When to use: When you want guaranteed representation from all semantic regions.
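
To make the medoid step concrete, here is a minimal NumPy sketch of step 3, assuming a precomputed pairwise distance matrix D and an array of cluster labels; it illustrates the idea rather than reproducing the framework's internal implementation:

import numpy as np

def cluster_medoids(D: np.ndarray, labels: np.ndarray) -> list[int]:
    # The medoid of a cluster is the member with the smallest total distance
    # to all other members of that cluster.
    medoids = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]      # indices of this cluster's members
        sub = D[np.ix_(idx, idx)]           # within-cluster distance submatrix
        medoids.append(int(idx[np.argmin(sub.sum(axis=1))]))
    return medoids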

Max-Sum Selection

Optimizes directly for diversity by maximizing the sum of pairwise distances.

How it works:

  1. Use diversification strategy to select N diverse texts
  2. Objective: maximize sum of pairwise distances among selected texts

Strategies:

  • greedy: Fast O(N·M), good quality
  • local_search: Slower, higher quality (~90-98% of optimal)
  • brute_force: Optimal but only feasible for small N

When to use: When you want maximum diversity without semantic clustering constraints.
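
As an illustration of the greedy strategy, the sketch below adds texts one at a time, always picking the candidate that most increases the sum of pairwise distances (assuming a precomputed distance matrix D; the framework's actual greedy implementation, including its farthest_point initialization, may differ in detail):

import numpy as np

def greedy_max_sum(D: np.ndarray, N: int) -> list[int]:
    # Seed with the farthest pair, then repeatedly add the text whose total
    # distance to the already-selected set is largest.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    while len(selected) < N:
        remaining = [k for k in range(len(D)) if k not in selected]
        gains = [D[k, selected].sum() for k in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected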

Outlier Handling (HDBSCAN)

When using HDBSCAN clustering, configure how to handle outliers (texts that don't fit any cluster):

approach:
  clustering:
    strategy: "hdbscan"
    outlier_handling: "reassign"  # Options below

  • reassign: Reassign outliers to the nearest cluster, then balance cluster sizes
  • singleton: Treat each outlier as its own singleton cluster (all outliers are selected; cluster balancing is skipped)
  • ignore: Exclude outliers from selection entirely
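
As a rough sketch of what the reassign option does, the snippet below moves each HDBSCAN outlier (label -1) to the cluster with the nearest centroid; the subsequent cluster-balancing step is omitted, and the framework's own reassign_outliers utility may differ in detail:

import numpy as np

def reassign_outliers_to_nearest(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # X holds the embeddings; HDBSCAN marks outliers with the label -1.
    labels = labels.copy()
    clusters = [c for c in np.unique(labels) if c != -1]
    centroids = np.stack([X[labels == c].mean(axis=0) for c in clusters])
    for i in np.where(labels == -1)[0]:
        dists = np.linalg.norm(centroids - X[i], axis=1)
        labels[i] = clusters[int(np.argmin(dists))]
    return labels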

Diversity Metrics

DiverseN computes 6 metrics to evaluate the diversity of selected texts:

  • Avg Pairwise Cosine Distance: mean cosine distance between all pairs (higher is better)
  • Min Pairwise Distance: minimum distance between any pair (higher is better)
  • Covariance Trace: sum of variances, measuring total spread (higher is better)
  • Covariance Determinant: log-determinant of the covariance matrix, measuring generalized variance (higher is better)
  • Cluster Entropy: how evenly the selected texts spread across clusters (higher is better)
  • KL Divergence from Uniform: deviation from a uniform cluster distribution (lower is better)

These metrics are automatically computed and included in reports.
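
For intuition, here are minimal NumPy versions of two of the six metrics (average pairwise cosine distance and cluster entropy); the exact implementations live in src/diverseN/evaluation/metrics.py and may differ in detail:

import numpy as np

def avg_pairwise_cosine_distance(X: np.ndarray) -> float:
    # Normalize rows, then average (1 - cosine similarity) over all pairs.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    iu = np.triu_indices(len(X), k=1)
    return float(D[iu].mean())

def cluster_entropy(labels: np.ndarray) -> float:
    # Shannon entropy of the cluster-size distribution; higher means the
    # selected texts are spread more evenly across clusters.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())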

Output Format

Results are saved as CSV files:

Output files:

  • selected_clustering.csv: Texts selected by clustering approach
  • selected_maxsum.csv: Texts selected by max-sum approach
  • clusters.csv: Cluster assignments (clustering approach only)
  • report_clustering.json/md: Metrics report for clustering
  • report_maxsum.json/md: Metrics report for max-sum
  • comparison.md: Side-by-side comparison of both approaches

CSV columns:

  • id: Original text ID
  • text: Text content
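
The result files are plain CSV, so they can be inspected with pandas; the sketch below assumes the output_dir from the example config:

import pandas as pd

clustering = pd.read_csv("outputs/my_experiment/selected_clustering.csv")
maxsum = pd.read_csv("outputs/my_experiment/selected_maxsum.csv")

# Texts chosen by both approaches (overlap on the id column).
overlap = set(clustering["id"]) & set(maxsum["id"])
print(f"{len(overlap)} texts selected by both approaches")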

Configuration Reference

Selection Configuration

selection:
  # Number of diverse texts to select
  N: 20

Clustering Configuration

approach:
  clustering:
    # Strategy: "kmeans_equal" (recommended), "spectral", or "hdbscan"
    strategy: "kmeans_equal"

    # Outlier handling for HDBSCAN: "reassign", "singleton", or "ignore"
    outlier_handling: "reassign"

    # Selection policy: "medoid" (recommended)
    selection_policy: "medoid"

    # Equal-sized K-Means parameters
    kmeans_equal:
      max_iter: 300
      n_init: 10

    # HDBSCAN parameters (if using hdbscan)
    hdbscan:
      min_cluster_size: 10
      min_samples: 5
      metric: "euclidean"

    # Spectral parameters (if using spectral)
    spectral:
      assign_labels: "kmeans"

Max-Sum Configuration

approach:
  max_sum:
    # Strategy: "greedy", "local_search", or "brute_force"
    strategy: "greedy"

    greedy:
      init: "farthest_point"

    local_search:
      restarts: 10
      max_iters: 1000
      tabu_window: 25
      allow_stochastic: true
      temperature: 0.1

    brute_force:
      max_combinations: 2000000
      time_limit_sec: 30

Encoder Configuration

encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"          # "auto" (cuda > mps > cpu), "cuda", "mps", or "cpu"
  batch_size: 64
  normalize: true         # L2-normalize embeddings
  cache_path: "cache/embeddings.npy"
  use_float16: false      # Use half precision to save memory
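
Conceptually, this configuration corresponds to the following use of the sentence-transformers library (shown directly for illustration; in practice DiverseN's encoder class handles device selection and caching for you):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = model.encode(
    ["First text goes here.", "Second text goes here."],
    batch_size=64,
    normalize_embeddings=True,   # corresponds to normalize: true
)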

Distance Configuration

distance:
  # Options: "cosine", "euclidean", or "bertscore"
  type: "cosine"
  normalize: false

  # BERTScore-specific settings (only used if type="bertscore")
  bertscore:
    model: "microsoft/deberta-large-mnli"
    idf: false  # Use IDF weighting for rare words
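
For cosine and euclidean, the pairwise matrix is the standard one computed from the embeddings, e.g. with scikit-learn (a minimal sketch; the bertscore option is handled separately by the framework):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(5, 8)                              # stand-in for real embeddings
D_cosine = pairwise_distances(X, metric="cosine")
D_euclid = pairwise_distances(X, metric="euclidean")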

CLI Commands

embed

Generate embeddings for all texts.

python -m diverseN.cli embed --config configs/example.yaml

Outputs: embeddings.npy

distances

Compute pairwise distance matrix.

python -m diverseN.cli distances --config configs/example.yaml

Outputs: distances.npy

cluster

Select diverse texts using clustering approach.

python -m diverseN.cli cluster --config configs/example.yaml

Outputs: clusters.csv, selected_clustering.csv, report_clustering.json/md

select

Select diverse texts using max-sum approach.

python -m diverseN.cli select --config configs/example.yaml

Outputs: selected_maxsum.csv, report_maxsum.json/md

compare

Run both approaches and generate comparison.

python -m diverseN.cli compare --config configs/example.yaml

Outputs: All of the above plus comparison.md

Jupyter Notebooks

Interactive notebooks are available in the notebooks/ directory (not tracked in git):

# Install notebook dependencies
pip install -e ".[notebook]"

Notebooks demonstrate loading texts, computing distances, running both selection approaches, computing diversity metrics, and comparing results.

Performance Tips

For ~1,000 Texts with N=20

  • Encoding: 10-60s depending on model and GPU
  • Distances: 1-5s
  • Selection (Greedy): <1s
  • Selection (Local Search): 5-30s
  • Clustering: 1-5s

Optimization

  1. Enable caching: Set cache_path to reuse embeddings (see the sketch after this list)
  2. Use GPU: Set device: "auto" to auto-detect CUDA/MPS
  3. Try float16: Set use_float16: true to reduce memory
  4. Batch size: Increase if you have enough GPU memory
  5. Strategy selection:
    • Use greedy for speed
    • Use local_search for better quality
    • Use kmeans_equal clustering for balanced clusters
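
A minimal sketch of the kind of embedding cache that tip 1 refers to (a hypothetical helper, not the framework's caching code; the path matches the cache_path used in the encoder config):

import os
import numpy as np

def load_or_encode(texts, encode_fn, cache_path="cache/embeddings.npy"):
    # Reuse cached embeddings when the cache file exists; otherwise encode
    # once and save the result for later runs.
    if os.path.exists(cache_path):
        return np.load(cache_path)
    X = encode_fn(texts)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    np.save(cache_path, X)
    return X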

Testing

Run the test suite:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest --override-ini="addopts="

# Run with coverage
pytest --override-ini="addopts=" --cov=diverseN --cov-report=html

Project Structure

Sentence_Diversification/
├── configs/
│   └── example.yaml       # Example configuration (copy and customize)
├── data/
│   └── sample/            # Sample dataset
│       └── texts.jsonl
├── src/diverseN/
│   ├── cli.py             # Command-line interface
│   ├── encoders/          # Text encoder implementations
│   ├── distances/         # Distance calculators
│   ├── similarities/      # Alternative similarity metrics (BERTScore)
│   ├── clustering/        # Clustering strategies
│   │   ├── kmeans_equal.py    # Equal-sized K-Means
│   │   └── utils.py           # Outlier handling, balancing
│   ├── diversify/         # Diversification strategies
│   ├── pipeline/          # Selection orchestration
│   ├── evaluation/        # Metrics and reporting
│   │   └── metrics.py         # 6 diversity metrics
│   ├── io/                # Data loading and saving
│   └── utils/             # Logging, seeding, caching
├── notebooks/             # Interactive notebooks (not tracked in git)
├── tests/                 # Unit tests
├── outputs/               # Generated results (not tracked in git)
├── cache/                 # Embedding cache (not tracked in git)
├── pyproject.toml         # Project metadata and dependencies
├── LICENSE                # MIT License
└── README.md

Architecture

DiverseN uses clean abstractions via Python ABCs:

from abc import ABC, abstractmethod
import numpy as np

class TextEncoder(ABC):
    @abstractmethod
    def encode(self, texts: list[str]) -> np.ndarray: ...

class DistanceCalculator(ABC):
    @abstractmethod
    def pairwise(self, X: np.ndarray) -> np.ndarray: ...

class ClusteringStrategy(ABC):
    @abstractmethod
    def fit_predict(self, X: np.ndarray) -> np.ndarray: ...

class DiversificationStrategy(ABC):
    @abstractmethod
    def select(self, D: np.ndarray, N: int) -> list[int]: ...

Key Utilities

from diverseN.clustering.utils import (
    reassign_outliers,      # Reassign outliers to nearest cluster
    balance_clusters,       # Balance cluster sizes
    compute_cluster_medoid  # Find medoid of a cluster
)

from diverseN.evaluation.metrics import (
    compute_all_metrics,              # Compute all 6 metrics
    average_pairwise_cosine_distance,
    minimum_pairwise_distance,
    covariance_matrix_trace,
    covariance_matrix_determinant,
    cluster_entropy,
    kl_divergence_from_uniform,
)

Extending DiverseN

Add a New Clustering Strategy

  1. Create src/diverseN/clustering/my_clustering.py
  2. Subclass ClusteringStrategy and implement fit_predict()
  3. Register in cli.py clustering initialization

Add a New Diversification Strategy

  1. Create src/diverseN/diversify/my_strategy.py
  2. Subclass DiversificationStrategy and implement select()
  3. Register in cli.py strategy initialization
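
For step 2, a trivial subclass might look like the sketch below (the select signature follows the DiversificationStrategy ABC shown in the Architecture section; the import path and the random baseline are illustrative assumptions, not part of the framework):

import numpy as np
from diverseN.diversify.base import DiversificationStrategy  # hypothetical module path

class RandomSelection(DiversificationStrategy):
    """Baseline example: pick N indices uniformly at random."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def select(self, D: np.ndarray, N: int) -> list[int]:
        # D is the pairwise distance matrix; a real strategy would use it.
        return self.rng.choice(len(D), size=N, replace=False).tolist()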

Citation

If you use DiverseN in your research, please cite:

@software{diverseN,
  title={DiverseN: A Framework for Text Diversification},
  author={Ron, Tom},
  year={2026},
  url={https://github.com/...}
}

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Troubleshooting

Out of Memory

  • Reduce batch_size
  • Enable use_float16: true
  • Use a smaller encoder model

Slow Performance

  • Enable GPU: device: "auto"
  • Use caching: set cache_path
  • Choose faster encoder (e.g., all-MiniLM-L6-v2)
  • Use greedy instead of local_search

Too Many Outliers (HDBSCAN)

  • Set outlier_handling: "reassign" to reassign to nearest cluster
  • Set outlier_handling: "singleton" to keep all outliers as individual selections (no cluster balancing)
  • Or switch to kmeans_equal or spectral clustering (these create exactly N clusters with no outliers)

Contact

Author: Dr. Tom Ron

For issues, questions, or contributions, please open an issue on GitHub.
