DiverseN - Text Diversification Framework

Select N maximally diverse texts from M input texts

DiverseN is a production-ready, encoder-agnostic Python framework for selecting the most diverse subset of texts. Given M input texts and a target N, it selects the N texts that maximize diversity.

Author: Dr. Tom Ron

Key Concept: Selection

DiverseN selects a diverse subset from your text collection:

  • selection.N = number of texts to select
  • Given M input texts and a target N, DiverseN selects the N most diverse texts
  • Objective: Maximize diversity among selected texts

Example: with 200 input texts and N=20, DiverseN selects the 20 most diverse texts from the 200.

Features

  • Encoder-Agnostic Design: Swap text encoders with a single config change
  • Multiple Selection Strategies:
    • Clustering-based (Equal-sized K-Means, HDBSCAN, Spectral)
    • Max-sum optimization (Brute Force, Greedy, Local Search with Tabu)
  • Equal-Sized Clustering: kmeans_equal strategy for balanced clusters
  • Flexible Outlier Handling: Three strategies for HDBSCAN outliers (reassign, singleton, ignore)
  • Flexible Distance Metrics: Cosine, Euclidean, BERTScore
  • Comprehensive Diversity Metrics: 6 evaluation metrics for comparing approaches
  • GPU Acceleration: Automatic CUDA/MPS support for encoding
  • Caching: Smart embedding caching to avoid recomputation
  • CLI Interface: Easy-to-use commands for all operations

Installation

# Clone the repository
git clone <repository-url>
cd Sentence_Diversification

# Install with pip
pip install -e .

# Install with dev dependencies (testing, linting, formatting)
pip install -e ".[dev]"

# Install with notebook support (Jupyter, matplotlib, seaborn)
pip install -e ".[notebook]"

# Install with both dev and notebook dependencies
pip install -e ".[dev,notebook]"

Requirements

  • Python 3.9+ (tested on 3.9, 3.10, 3.11, 3.12, 3.13)
  • PyTorch 2.0+
  • See pyproject.toml for full dependency list

Quick Start

1. Prepare Your Data

Create a JSONL file with your texts (one JSON object per line):

{"id": "1", "text": "First text goes here."}
{"id": "2", "text": "Second text goes here."}
{"id": "3", "text": "Third text goes here."}

2. Configure Your Experiment

Copy and edit configs/example.yaml:

experiment_name: "my_experiment"
seed: 42

data:
  input_path: "data/sample/texts.jsonl"
  output_dir: "outputs/my_experiment"

encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"

# N = number of diverse texts to select
selection:
  N: 20

approach:
  clustering:
    strategy: "kmeans_equal"  # Recommended for equal-sized clusters
    outlier_handling: "reassign"  # Options: reassign, singleton, ignore
    selection_policy: "medoid"

3. Run the Pipeline

# Generate embeddings
python -m diverseN.cli embed --config configs/example.yaml

# Compute distances
python -m diverseN.cli distances --config configs/example.yaml

# Select diverse texts using max-sum approach
python -m diverseN.cli select --config configs/example.yaml

# Or select using clustering approach
python -m diverseN.cli cluster --config configs/example.yaml

# Or compare both approaches
python -m diverseN.cli compare --config configs/example.yaml

Approaches

Clustering-Based Selection

Creates N clusters, then selects one representative (medoid) from each cluster.

How it works:

  1. Create clusters using the configured clustering strategy
  2. Handle outliers according to outlier_handling setting
  3. Select the medoid from each cluster
  4. Return one text per cluster

Strategies:

  • kmeans_equal (recommended): K-Means with enforced equal-sized clusters. Creates exactly N clusters.
  • spectral: Graph-based clustering, naturally balanced. Creates exactly N clusters.
  • hdbscan: Density-based; determines the cluster count automatically (ignores N). With outlier_handling: "singleton", the number of selected texts equals the K natural clusters plus the O outliers.

When to use: When you want guaranteed representation from all semantic regions.
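
To make the medoid step concrete, here is a minimal NumPy sketch of step 3, assuming a precomputed pairwise distance matrix D and an array of cluster labels; it illustrates the idea rather than reproducing the framework's internal implementation:

import numpy as np

def cluster_medoids(D: np.ndarray, labels: np.ndarray) -> list[int]:
    # The medoid of a cluster is the member with the smallest total distance
    # to all other members of that cluster.
    medoids = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]      # indices of this cluster's members
        sub = D[np.ix_(idx, idx)]           # within-cluster distance submatrix
        medoids.append(int(idx[np.argmin(sub.sum(axis=1))]))
    return medoids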

Max-Sum Selection

Optimizes directly for diversity by maximizing the sum of pairwise distances.

How it works:

  1. Use diversification strategy to select N diverse texts
  2. Objective: maximize sum of pairwise distances among selected texts

Strategies:

  • greedy: Fast O(N·M), good quality
  • local_search: Slower, higher quality (~90-98% of optimal)
  • brute_force: Optimal but only feasible for small N

When to use: When you want maximum diversity without semantic clustering constraints.
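
As an illustration of the greedy strategy, the sketch below adds texts one at a time, always picking the candidate that most increases the sum of pairwise distances (assuming a precomputed distance matrix D; the framework's actual greedy implementation, including its farthest_point initialization, may differ in detail):

import numpy as np

def greedy_max_sum(D: np.ndarray, N: int) -> list[int]:
    # Seed with the farthest pair, then repeatedly add the text whose total
    # distance to the already-selected set is largest.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    while len(selected) < N:
        remaining = [k for k in range(len(D)) if k not in selected]
        gains = [D[k, selected].sum() for k in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected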

Outlier Handling (HDBSCAN)

When using HDBSCAN clustering, configure how to handle outliers (texts that don't fit any cluster):

approach:
  clustering:
    strategy: "hdbscan"
    outlier_handling: "reassign"  # Options below

  • reassign: Reassign outliers to the nearest cluster, then balance cluster sizes
  • singleton: Treat each outlier as its own singleton cluster (all outliers are selected; cluster balancing is skipped)
  • ignore: Exclude outliers from selection entirely
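
As a rough sketch of what the reassign option does, the snippet below moves each HDBSCAN outlier (label -1) to the cluster with the nearest centroid; the subsequent cluster-balancing step is omitted, and the framework's own reassign_outliers utility may differ in detail:

import numpy as np

def reassign_outliers_to_nearest(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # X holds the embeddings; HDBSCAN marks outliers with the label -1.
    labels = labels.copy()
    clusters = [c for c in np.unique(labels) if c != -1]
    centroids = np.stack([X[labels == c].mean(axis=0) for c in clusters])
    for i in np.where(labels == -1)[0]:
        dists = np.linalg.norm(centroids - X[i], axis=1)
        labels[i] = clusters[int(np.argmin(dists))]
    return labels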

Diversity Metrics

DiverseN computes 6 metrics to evaluate the diversity of selected texts:

  • Avg Pairwise Cosine Distance: mean cosine distance between all pairs (higher is better)
  • Min Pairwise Distance: minimum distance between any pair (higher is better)
  • Covariance Trace: sum of variances, measuring total spread (higher is better)
  • Covariance Determinant: log-determinant of the covariance matrix, measuring generalized variance (higher is better)
  • Cluster Entropy: how evenly the selected texts spread across clusters (higher is better)
  • KL Divergence from Uniform: deviation from a uniform cluster distribution (lower is better)

These metrics are automatically computed and included in reports.
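
For intuition, here are minimal NumPy versions of two of the six metrics (average pairwise cosine distance and cluster entropy); the exact implementations live in src/diverseN/evaluation/metrics.py and may differ in detail:

import numpy as np

def avg_pairwise_cosine_distance(X: np.ndarray) -> float:
    # Normalize rows, then average (1 - cosine similarity) over all pairs.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    iu = np.triu_indices(len(X), k=1)
    return float(D[iu].mean())

def cluster_entropy(labels: np.ndarray) -> float:
    # Shannon entropy of the cluster-size distribution; higher means the
    # selected texts are spread more evenly across clusters.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())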

Output Format

Results are saved as CSV files:

Output files:

  • selected_clustering.csv: Texts selected by clustering approach
  • selected_maxsum.csv: Texts selected by max-sum approach
  • clusters.csv: Cluster assignments (clustering approach only)
  • report_clustering.json/md: Metrics report for clustering
  • report_maxsum.json/md: Metrics report for max-sum
  • comparison.md: Side-by-side comparison of both approaches

CSV columns:

  • id: Original text ID
  • text: Text content
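
The result files are plain CSV, so they can be inspected with pandas; the sketch below assumes the output_dir from the example config:

import pandas as pd

clustering = pd.read_csv("outputs/my_experiment/selected_clustering.csv")
maxsum = pd.read_csv("outputs/my_experiment/selected_maxsum.csv")

# Texts chosen by both approaches (overlap on the id column).
overlap = set(clustering["id"]) & set(maxsum["id"])
print(f"{len(overlap)} texts selected by both approaches")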

Configuration Reference

Selection Configuration

selection:
  # Number of diverse texts to select
  N: 20

Clustering Configuration

approach:
  clustering:
    # Strategy: "kmeans_equal" (recommended), "spectral", or "hdbscan"
    strategy: "kmeans_equal"

    # Outlier handling for HDBSCAN: "reassign", "singleton", or "ignore"
    outlier_handling: "reassign"

    # Selection policy: "medoid" (recommended)
    selection_policy: "medoid"

    # Equal-sized K-Means parameters
    kmeans_equal:
      max_iter: 300
      n_init: 10

    # HDBSCAN parameters (if using hdbscan)
    hdbscan:
      min_cluster_size: 10
      min_samples: 5
      metric: "euclidean"

    # Spectral parameters (if using spectral)
    spectral:
      assign_labels: "kmeans"

Max-Sum Configuration

approach:
  max_sum:
    # Strategy: "greedy", "local_search", or "brute_force"
    strategy: "greedy"

    greedy:
      init: "farthest_point"

    local_search:
      restarts: 10
      max_iters: 1000
      tabu_window: 25
      allow_stochastic: true
      temperature: 0.1

    brute_force:
      max_combinations: 2000000
      time_limit_sec: 30

Encoder Configuration

encoder:
  name: "sentence-transformers"
  model: "WhereIsAI/UAE-Large-V1"
  device: "auto"          # "auto" (cuda > mps > cpu), "cuda", "mps", or "cpu"
  batch_size: 64
  normalize: true         # L2-normalize embeddings
  cache_path: "cache/embeddings.npy"
  use_float16: false      # Use half precision to save memory
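
Conceptually, this configuration corresponds to the following use of the sentence-transformers library (shown directly for illustration; in practice DiverseN's encoder class handles device selection and caching for you):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = model.encode(
    ["First text goes here.", "Second text goes here."],
    batch_size=64,
    normalize_embeddings=True,   # corresponds to normalize: true
)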

Distance Configuration

distance:
  # Options: "cosine", "euclidean", or "bertscore"
  type: "cosine"
  normalize: false

  # BERTScore-specific settings (only used if type="bertscore")
  bertscore:
    model: "microsoft/deberta-large-mnli"
    idf: false  # Use IDF weighting for rare words
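
For cosine and euclidean, the pairwise matrix is the standard one computed from the embeddings, e.g. with scikit-learn (a minimal sketch; the bertscore option is handled separately by the framework):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(5, 8)                              # stand-in for real embeddings
D_cosine = pairwise_distances(X, metric="cosine")
D_euclid = pairwise_distances(X, metric="euclidean")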

CLI Commands

embed

Generate embeddings for all texts.

python -m diverseN.cli embed --config configs/example.yaml

Outputs: embeddings.npy

distances

Compute pairwise distance matrix.

python -m diverseN.cli distances --config configs/example.yaml

Outputs: distances.npy

cluster

Select diverse texts using clustering approach.

python -m diverseN.cli cluster --config configs/example.yaml

Outputs: clusters.csv, selected_clustering.csv, report_clustering.json/md

select

Select diverse texts using max-sum approach.

python -m diverseN.cli select --config configs/example.yaml

Outputs: selected_maxsum.csv, report_maxsum.json/md

compare

Run both approaches and generate comparison.

python -m diverseN.cli compare --config configs/example.yaml

Outputs: All of the above plus comparison.md

Jupyter Notebooks

Interactive notebooks are available in the notebooks/ directory (not tracked in git):

# Install notebook dependencies
pip install -e ".[notebook]"

Notebooks demonstrate loading texts, computing distances, running both selection approaches, computing diversity metrics, and comparing results.

Performance Tips

For ~1,000 Texts with N=20

  • Encoding: 10-60s depending on model and GPU
  • Distances: 1-5s
  • Selection (Greedy): <1s
  • Selection (Local Search): 5-30s
  • Clustering: 1-5s

Optimization

  1. Enable caching: Set cache_path to reuse embeddings (see the sketch after this list)
  2. Use GPU: Set device: "auto" to auto-detect CUDA/MPS
  3. Try float16: Set use_float16: true to reduce memory
  4. Batch size: Increase if you have enough GPU memory
  5. Strategy selection:
    • Use greedy for speed
    • Use local_search for better quality
    • Use kmeans_equal clustering for balanced clusters
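
A minimal sketch of the kind of embedding cache that tip 1 refers to (a hypothetical helper, not the framework's caching code; the path matches the cache_path used in the encoder config):

import os
import numpy as np

def load_or_encode(texts, encode_fn, cache_path="cache/embeddings.npy"):
    # Reuse cached embeddings when the cache file exists; otherwise encode
    # once and save the result for later runs.
    if os.path.exists(cache_path):
        return np.load(cache_path)
    X = encode_fn(texts)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    np.save(cache_path, X)
    return X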

Testing

Run the test suite:

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest --override-ini="addopts="

# Run with coverage
pytest --override-ini="addopts=" --cov=diverseN --cov-report=html

Project Structure

Sentence_Diversification/
├── configs/
│   └── example.yaml       # Example configuration (copy and customize)
├── data/
│   └── sample/            # Sample dataset
│       └── texts.jsonl
├── src/diverseN/
│   ├── cli.py             # Command-line interface
│   ├── encoders/          # Text encoder implementations
│   ├── distances/         # Distance calculators
│   ├── similarities/      # Alternative similarity metrics (BERTScore)
│   ├── clustering/        # Clustering strategies
│   │   ├── kmeans_equal.py    # Equal-sized K-Means
│   │   └── utils.py           # Outlier handling, balancing
│   ├── diversify/         # Diversification strategies
│   ├── pipeline/          # Selection orchestration
│   ├── evaluation/        # Metrics and reporting
│   │   └── metrics.py         # 6 diversity metrics
│   ├── io/                # Data loading and saving
│   └── utils/             # Logging, seeding, caching
├── notebooks/             # Interactive notebooks (not tracked in git)
├── tests/                 # Unit tests
├── outputs/               # Generated results (not tracked in git)
├── cache/                 # Embedding cache (not tracked in git)
├── pyproject.toml         # Project metadata and dependencies
├── LICENSE                # MIT License
└── README.md

Architecture

DiverseN uses clean abstractions via Python ABCs:

from abc import ABC, abstractmethod
import numpy as np

class TextEncoder(ABC):
    @abstractmethod
    def encode(self, texts: list[str]) -> np.ndarray: ...

class DistanceCalculator(ABC):
    @abstractmethod
    def pairwise(self, X: np.ndarray) -> np.ndarray: ...

class ClusteringStrategy(ABC):
    @abstractmethod
    def fit_predict(self, X: np.ndarray) -> np.ndarray: ...

class DiversificationStrategy(ABC):
    @abstractmethod
    def select(self, D: np.ndarray, N: int) -> list[int]: ...

Key Utilities

from diverseN.clustering.utils import (
    reassign_outliers,      # Reassign outliers to nearest cluster
    balance_clusters,       # Balance cluster sizes
    compute_cluster_medoid  # Find medoid of a cluster
)

from diverseN.evaluation.metrics import (
    compute_all_metrics,              # Compute all 6 metrics
    average_pairwise_cosine_distance,
    minimum_pairwise_distance,
    covariance_matrix_trace,
    covariance_matrix_determinant,
    cluster_entropy,
    kl_divergence_from_uniform,
)

Extending DiverseN

Add a New Clustering Strategy

  1. Create src/diverseN/clustering/my_clustering.py
  2. Subclass ClusteringStrategy and implement fit_predict()
  3. Register in cli.py clustering initialization

Add a New Diversification Strategy

  1. Create src/diverseN/diversify/my_strategy.py
  2. Subclass DiversificationStrategy and implement select()
  3. Register in cli.py strategy initialization
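
For step 2, a trivial subclass might look like the sketch below (the select signature follows the DiversificationStrategy ABC shown in the Architecture section; the import path and the random baseline are illustrative assumptions, not part of the framework):

import numpy as np
from diverseN.diversify.base import DiversificationStrategy  # hypothetical module path

class RandomSelection(DiversificationStrategy):
    """Baseline example: pick N indices uniformly at random."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def select(self, D: np.ndarray, N: int) -> list[int]:
        # D is the pairwise distance matrix; a real strategy would use it.
        return self.rng.choice(len(D), size=N, replace=False).tolist()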

Citation

If you use DiverseN in your research, please cite:

@software{diverseN,
  title={DiverseN: A Framework for Text Diversification},
  author={Ron, Tom},
  year={2026},
  url={https://github.com/...}
}

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Troubleshooting

Out of Memory

  • Reduce batch_size
  • Enable use_float16: true
  • Use a smaller encoder model

Slow Performance

  • Enable GPU: device: "auto"
  • Use caching: set cache_path
  • Choose faster encoder (e.g., all-MiniLM-L6-v2)
  • Use greedy instead of local_search

Too Many Outliers (HDBSCAN)

  • Set outlier_handling: "reassign" to reassign to nearest cluster
  • Set outlier_handling: "singleton" to keep all outliers as individual selections (no cluster balancing)
  • Or switch to kmeans_equal or spectral clustering (these create exactly N clusters with no outliers)

Contact

Author: Dr. Tom Ron

For issues, questions, or contributions, please open an issue on GitHub.
