# SDPype - Synthetic Data Pipeline
Reproducible synthetic data generation pipeline with DVC and experiment tracking.
## Features
- Multiple SDG libraries (SDV, Synthcity, Synthpop)
- DVC pipeline for reproducibility
- Experiment versioning with metadata tracking
- Statistical similarity evaluation
- Detection-based quality assessment
- Recursive training for multi-generation analysis
## Quick Start
```bash
# Setup repository
uv run sdpype setup
# Configure experiment in params.yaml
vim params.yaml
# Run pipeline
uv run sdpype pipeline
# View available models
uv run sdpype models
```

## Data Preparation

The `transform` command in `prepare_mimic_iii_mini_cli.py` implements a two-stage sampling strategy designed to:
- Prevent data leakage at the episode level
- Create realistic train/test splits with natural class distributions
- Preserve a representative unsampled population for deployment testing
First, we randomly sample 20,000 ICU stay episodes (ICUSTAY_ID) from the full population to create a "study cohort":
```
All 61,532 episodes
         ↓
Random sample 20,000 episodes
      (Study Cohort)
```
Why random? This mimics realistic research practice where you collect a representative sample for your study.
The 20,000-episode cohort is then split 50/50 into training and test sets using random sampling to preserve natural class distributions:
```
Study Cohort (20,000)
         ↓
  Random 50/50 split
         ↓
├─ Train: 10,000 episodes (natural READMIT distribution)
└─ Test:  10,000 episodes (natural READMIT distribution)
```
Why random? Natural class distributions are essential for training synthetic data generators and evaluating hallucination detection models. Balanced splits would create unrealistic training conditions.
All remaining episodes (not in the study cohort) become the "unsampled" dataset:
```
Remaining ~41,000 episodes
         ↓
  Unsampled (Population)
```
Why keep all? The unsampled dataset represents the full diverse population with natural class distribution, useful for testing deployment scenarios.
| Dataset | Size | READMIT Rate | Purpose |
|---|---|---|---|
| Training | 10,000 | ~10.3% (natural) | Synthetic data generator training |
| Test | 10,000 | ~10.3% (natural) | Model evaluation with realistic distributions |
| Unsampled | ~41,000 | ~10.3% (natural) | Deployment/population testing |
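The two-stage sampling described above can be reproduced with a short pandas sketch. The column names (`ICUSTAY_ID`, `READMIT`) and the input file come from this section; the variable names and the exact sampling calls are illustrative assumptions, not the actual code in `prepare_mimic_iii_mini_cli.py`.

```python
import pandas as pd

# Illustrative sketch of the two-stage sampling; details may differ from
# prepare_mimic_iii_mini_cli.py.
rng_seed = 42
population = pd.read_excel("MIMIC-III-mini-population.xlsx")  # ~61,532 episodes

# Stage 1: random study cohort of 20,000 ICU stay episodes
cohort_ids = (
    population["ICUSTAY_ID"]
    .drop_duplicates()
    .sample(n=20_000, random_state=rng_seed)
)
cohort = population[population["ICUSTAY_ID"].isin(cohort_ids)]
unsampled = population[~population["ICUSTAY_ID"].isin(cohort_ids)]  # ~41k episodes

# Stage 2: random 50/50 split of the cohort, preserving natural READMIT rates
train_ids = cohort_ids.sample(frac=0.5, random_state=rng_seed)
train = cohort[cohort["ICUSTAY_ID"].isin(train_ids)]
test = cohort[~cohort["ICUSTAY_ID"].isin(train_ids)]

print(train["READMIT"].mean(), test["READMIT"].mean(), unsampled["READMIT"].mean())
```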
The sampling ensures no episode-level leakage:
- Each `ICUSTAY_ID` appears in exactly ONE split
- No ICU stay is shared between train/test/unsampled
Note on patient-level overlap: The same patient (SUBJECT_ID) may have different ICU episodes in different splits. This is:
- ✓ Acceptable for episode-level prediction (predicting readmission for a specific ICU stay)
- ❌ Not acceptable for patient-level prediction (predicting patient readmission risk)
Our use case is episode-level prediction, so patient overlap is expected and not considered leakage.
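The guarantee is easy to verify: the `ICUSTAY_ID` sets must be disjoint, while `SUBJECT_ID` overlap is allowed. A minimal sketch, assuming the `train`/`test`/`unsampled` DataFrames from the sketch above:

```python
# Episode-level leakage: these intersections must be empty.
assert not set(train["ICUSTAY_ID"]) & set(test["ICUSTAY_ID"])
assert not set(train["ICUSTAY_ID"]) & set(unsampled["ICUSTAY_ID"])
assert not set(test["ICUSTAY_ID"]) & set(unsampled["ICUSTAY_ID"])

# Patient-level overlap is expected and only reported, not treated as leakage.
shared_patients = set(train["SUBJECT_ID"]) & set(test["SUBJECT_ID"])
print(f"Patients with episodes in both train and test: {len(shared_patients)}")
```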
```bash
uv run prepare_mimic_iii_mini_cli.py transform \
  MIMIC-III-mini-population.xlsx \
  --output MIMIC-III-mini-core \
  --sample 10000 \
  --seed 42
```

This creates:
- `dseed42/MIMIC-III-mini-core_sample10000_dseed42_training.csv` (10k, natural distribution)
- `dseed42/MIMIC-III-mini-core_sample10000_dseed42_test.csv` (10k, natural distribution)
- `dseed42/MIMIC-III-mini-core_sample10000_dseed42_unsampled.csv` (~41k, natural distribution)
```bash
uv run prepare_mimic_iii_mini_cli.py transform \
  MIMIC-III-mini-population.xlsx \
  --output MIMIC-III-mini-core \
  --sample 10000 \
  --seed 42 \
  --keep-ids
```

The `--keep-ids` flag:
- Preserves ICUSTAY_ID, SUBJECT_ID, HADM_ID in output files
- Runs comprehensive validation checks
- Displays leakage detection report
- Useful for debugging and quality assurance
When using `--keep-ids`, the command displays a validation report:

```
════════════════════════════════════════════════════════════
Data Split Validation
════════════════════════════════════════════════════════════
1. Episode-Level Leakage Check (ICUSTAY_ID):
   Train ∩ Test:      0  ✓ No leakage
   Train ∩ Unsampled: 0  ✓ No leakage
   Test ∩ Unsampled:  0  ✓ No leakage

2. Patient-Level Overlap (SUBJECT_ID):
   Train ∩ Test:      959 patients
   Train ∩ Unsampled: 2,639 patients
   Test ∩ Unsampled:  2,663 patients
   ℹ Patient-level overlap is EXPECTED for episode-level prediction.
     Same patient can have different ICU episodes in different splits.

3. Target Distribution (READMIT):
   Training:  0.1050 (1,050 / 10,000)
   Test:      0.1042 (1,042 / 10,000)
   Unsampled: 0.1036 (4,301 / 41,532)
   ✓ Train/Test distributions similar (diff: 0.0008)

Natural distribution sampling preserves realistic class imbalances.
════════════════════════════════════════════════════════════
```
This two-stage approach provides:

- Realistic Research Scenario
  - Stage 1 (random cohort) = "We collected 20k patients for our study"
  - Stage 2 (random split) = "We split them randomly, preserving natural distributions"
  - Stage 3 (population) = "We test on the broader population"
- Realistic Data
  - All splits preserve natural class distributions
  - Essential for synthetic data generation and hallucination detection
  - Reflects real-world deployment scenarios
- Validation & Debugging
  - `--keep-ids` preserves ID columns for validation
  - Comprehensive leakage detection
  - Patient-level overlap tracking
This methodology follows best practices for:
- Clinical machine learning (episode-level prediction)
- Random sampling with natural distributions
- Data leakage prevention (ICUSTAY_ID-level splitting)
## Installation

```bash
# Clone repository
git clone <repository-url>
cd sdpype

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .
```

## Pipeline Stages

The pipeline consists of the following stages:
- train_sdg: Train synthetic data generator
- generate_synthetic: Generate synthetic data
- statistical_similarity: Compare statistical similarity between original and synthetic data
- detection_evaluation: Assess synthetic data quality using classifier-based detection methods
```bash
# Run specific stages (after editing params.yaml)
uv run sdpype stage train_sdg
uv run sdpype stage generate_synthetic
uv run sdpype stage statistical_similarity
uv run sdpype stage detection_evaluation
```

SDPype provides statistical similarity evaluation to assess the quality of synthetic data:
```bash
# Run statistical similarity evaluation (after setting up experiment in params.yaml)
uv run sdpype stage statistical_similarity
```

## Hallucination Detection

SDPype includes advanced capabilities for identifying "hallucinations" - synthetic records that contain unrealistic patterns not present in real data.
When generating synthetic data, the generator may create records that look realistic but contain patterns that don't exist in real data. Training models on these hallucinations can lead to poor predictions on real patients/cases.
Example: A synthetic patient record might combine symptoms in ways that never occur in reality. A model trained on this data would learn false patterns.
We use Data Shapley to measure each synthetic record's contribution to model performance on real test data:
- Positive Shapley value: Record improves predictions → valuable data ✓
- Zero: Record has negligible impact
- Negative Shapley value: Record harms predictions → hallucination!
⚠️ Core concept: Like measuring team members' contributions - add people one at a time and see if the team improves or gets worse.
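The estimator behind `--num-samples` and `--max-coalition-size` is, conceptually, truncated Monte Carlo Shapley (Ghorbani & Zou, 2019): shuffle the synthetic records, add them one at a time, and credit each record with the change in test score it causes. A simplified sketch of that idea (not SDPype's actual implementation), using a generic scikit-learn classifier on numpy arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tmc_shapley(X_syn, y_syn, X_test, y_test, num_samples=20, max_coalition_size=1500):
    """Truncated Monte Carlo estimate of per-record Data Shapley values (sketch)."""
    n = len(X_syn)
    values = np.zeros(n)
    rng = np.random.default_rng(42)
    for _ in range(num_samples):
        order = rng.permutation(n)
        prev_score = 0.0  # simplified baseline for the empty coalition
        # Truncation: only the first max_coalition_size records of each permutation
        for k in range(1, min(max_coalition_size, n) + 1):
            idx = order[:k]
            if len(np.unique(y_syn[idx])) < 2:
                continue  # cannot fit a classifier until both classes appear
            model = LogisticRegression(max_iter=1000).fit(X_syn[idx], y_syn[idx])
            score = model.score(X_test, y_test)
            values[order[k - 1]] += score - prev_score  # marginal contribution
            prev_score = score
    return values / num_samples

# Records with negative values hurt test performance -> hallucination candidates.
```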
```bash
# Basic usage
uv run sdpype downstream mimic-iii-valuation \
  --train-data experiments/data/synthetic/train.csv \
  --test-data experiments/data/real/test.csv \
  --num-samples 20 \
  --max-coalition-size 1500

# Using optimized parameters from training (auto-loads encoding)
uv run sdpype downstream mimic-iii-valuation \
  --train-data experiments/data/synthetic/train.csv \
  --test-data experiments/data/real/test.csv \
  --lgbm-params-json experiments/models/downstream/lgbm_readmission_*.json \
  --num-samples 20 \
  --max-coalition-size 1500
```

Console Output:
```
Negative values (potential hallucinations): 1,234 (17.3%)
95% CI: [16.4%, 18.2%] (Wilson score interval)
```
Interpretation: 17.3% of synthetic records have negative Shapley values (hallucinations), with 95% confidence the true percentage is between 16.4-18.2%.
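The Wilson interval can be reproduced from the counts alone. A minimal sketch using `statsmodels`, assuming a total of roughly 7,133 records (inferred from 1,234 negatives at 17.3%; this total is an illustration, not a reported figure):

```python
from statsmodels.stats.proportion import proportion_confint

n_negative, n_total = 1_234, 7_133  # n_total inferred from the example above
low, high = proportion_confint(n_negative, n_total, alpha=0.05, method="wilson")
print(f"{n_negative / n_total:.1%} negative, 95% CI: [{low:.1%}, {high:.1%}]")
# -> roughly 17.3% negative, 95% CI: [16.4%, 18.2%]
```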
CSV Output (experiments/data_valuation/data_valuation_*.csv):
| Features... | target | shapley_value | sample_index |
|---|---|---|---|
| [data] | 1 | -0.00347 | 0 |
| [data] | 0 | +0.00123 | 1 |
Sorted by Shapley value (most harmful first) for easy identification and removal.
Two key parameters control the tradeoff:
--num-samples (Number of random shuffles)
- Controls variance (stability of estimates)
- Higher = smoother estimates, lower = noisier but faster
- Recommended: 10-50
--max-coalition-size (Early truncation)
- Controls bias (completeness of evaluation)
- How many records to test together per shuffle
- Larger = more thorough, smaller = faster but may miss patterns
- Recommended: 20-30% of dataset size
Which matters more? For hallucination detection, coalition size matters more than the number of samples. A record might look harmless in small groups but become obviously bad in larger coalitions.
Recommended settings for 5,000-10,000 records:
| Use Case | `--num-samples` | `--max-coalition-size` | Time | Quality |
|---|---|---|---|---|
| Quick test | 10 | 1000 | 1-3 hours | Good for initial screening |
| Balanced (recommended) | 20 | 1500 | 4-8 hours | Best speed/accuracy |
| Thorough | 50 | 2000 | 12-20 hours | Highest accuracy |
Rule of thumb:
- Coalition size should be 20-30% of your dataset
- Num samples should be at least 10-20
- Prioritize larger coalition size over more samples
Typical workflow with the results:

1. Identify hallucinations: Filter for `shapley_value < 0`
2. Remove them: Create a cleaned dataset excluding the bottom 10-20%
3. Retrain: Train the final model on the cleaned synthetic data
4. Improved performance: Better predictions on real data
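As a concrete example of steps 1-2, the CSV output can be filtered directly with pandas. The input and output file names below are placeholders; use the actual `data_valuation_*.csv` produced by your run:

```python
import pandas as pd

# Placeholder path; substitute the data_valuation_*.csv from your run.
valuation = pd.read_csv("experiments/data_valuation/data_valuation_example.csv")

# Step 1: flag hallucination candidates (negative Shapley value)
hallucinations = valuation[valuation["shapley_value"] < 0]
print(f"Flagged {len(hallucinations)} of {len(valuation)} records")

# Step 2: drop the most harmful 20% (the file is sorted most-harmful-first)
n_drop = int(0.20 * len(valuation))
cleaned = valuation.iloc[n_drop:].drop(columns=["shapley_value", "sample_index"])
cleaned.to_csv("experiments/data/synthetic/train_cleaned.csv", index=False)  # placeholder output path
```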
## Configuration

Edit `params.yaml` to configure experiments:

```yaml
experiment:
  seed: 42
  name: "baseline"
  description: "Baseline CTGAN experiment"

sdg:
  library: sdv
  model_type: ctgan
  parameters:
    epochs: 500
    batch_size: 100
```

```bash
# List all available SDG models
uv run sdpype models
```

SDV Models:
- `gaussiancopula` - Fast Gaussian copula model, good for tabular data
- `ctgan` - High quality generative adversarial network
- `tvae` - Tabular Variational Autoencoder
- `copulagan` - Mix of copula and GAN methods
Synthcity Models:
- `ctgan` - CTGAN implementation
- `tvae` - Tabular VAE
- `rtvae` - Robust Tabular VAE with beta divergence
- `arf` - Adversarial Random Forests for density estimation
- `marginaldistributions` - Simple baseline sampling from marginal distributions independently
- `adsgan` - AdsGAN for anonymization through data synthesis with privacy guarantees
- `aim` - AIM for differentially private synthetic data with formal privacy guarantees
- `decaf` - DECAF for generating fair synthetic data using causally-aware networks
- `pategan` - PATEGAN for differentially private synthetic data using teacher ensemble aggregation
- `privbayes` - PrivBayes for differentially private data release via Bayesian networks
- `nflow` - Normalizing flows
- `ddpm` - Diffusion model
- `bayesiannetwork` - Bayesian Network using probabilistic graphical models
## Project Structure

```
sdpype/
├── experiments/
│   ├── data/
│   │   ├── raw/          # Raw input data
│   │   ├── processed/    # Preprocessed data
│   │   └── synthetic/    # Generated synthetic data
│   ├── models/           # Trained models
│   └── metrics/          # Evaluation results
├── sdpype/               # Source code
│   ├── cli/              # CLI commands
│   ├── core/             # Core functionality
│   ├── evaluation/       # Evaluation metrics
│   ├── training.py       # Model training
│   ├── generation.py     # Data generation
│   └── evaluate.py       # Evaluation pipeline
├── params.yaml           # Configuration
├── dvc.yaml              # Pipeline definition
├── recursive_train.py    # Recursive training script
├── trace_chain.py        # Chain tracing utility
└── README.md
```
## Example Workflow

```bash
# 1. Set up baseline experiment
vim params.yaml   # Set: experiment.name="baseline", experiment.seed=42
uv run sdpype pipeline

# 2. Try different model
vim params.yaml   # Set: sdg.model_type="gaussiancopula"
uv run sdpype pipeline

# 3. Compare results
uv run sdpype model list
```

## Recursive Training

Run multiple generations where each generation trains on synthetic data from the previous generation:
```bash
# Run 10 generations
python recursive_train.py --generations 10

# Resume from checkpoint
python recursive_train.py --generations 20 --resume

# Resume specific chain
python recursive_train.py --resume-from MODEL_ID --generations 20
```

Analyze metric degradation across generations:
```bash
# Trace a generation chain
python trace_chain.py MODEL_ID

# Export to CSV
python trace_chain.py MODEL_ID --format csv --output chain.csv

# Generate plot
python trace_chain.py MODEL_ID --plot --plot-output degradation.png
```

## Standalone Evaluation Tools

SDPype provides standalone CLI tools for specific evaluation tasks:
Comprehensive synthesis quality analysis using SDMetrics QualityReport with pairwise correlation matrices.
```bash
# Run quality analysis
python sdmetrics_quality_cli.py \
  --real experiments/data/real/test.csv \
  --synthetic experiments/data/synthetic/test.csv \
  --metadata experiments/data/metadata.json \
  --output quality_report.json
```

What it does:
- Generates three correlation matrices:
- Real data correlations (with pairwise deletion and confidence metrics)
- Synthetic data correlations (complete data analysis)
- Quality scores (SDMetrics similarity assessment)
- Computes distribution similarity for individual columns
- Analyzes correlation preservation across column pairs
- Provides diagnostic analysis with flags (✓/⚠/✗) for each pair
Key features:
- Uses pairwise deletion to handle missing values in real data
- Supports mixed data types (numerical, categorical, boolean)
- Confidence flags based on sample sizes (✓ ≥1000, ⚠ 500-999, ✗ 50-499, - <50)
- Diagnostic analysis identifies learning failures and spurious correlations
- JSON export for automated pipelines
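To illustrate the pairwise-deletion idea behind the real-data matrix (pandas' `DataFrame.corr` already drops missing values pairwise), here is a minimal comparison sketch. It is restricted to numeric columns for brevity, unlike the CLI, which also handles categorical and boolean types; the file paths reuse the ones from the command above:

```python
import pandas as pd

real = pd.read_csv("experiments/data/real/test.csv")            # may contain NaNs
synthetic = pd.read_csv("experiments/data/synthetic/test.csv")  # complete data

numeric_cols = real.select_dtypes("number").columns
real_corr = real[numeric_cols].corr()   # pairwise deletion of missing values
syn_corr = synthetic[numeric_cols].corr()

# Sample size behind each real-data coefficient, i.e. the basis for the
# ✓/⚠/✗ confidence flags
present = real[numeric_cols].notna().astype(int)
pair_counts = present.T.dot(present)

# Absolute difference highlights poorly preserved correlations
print((real_corr - syn_corr).abs().round(2))
```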
Interpreting results:
- Quality scores ≥0.8: Excellent preservation
- Quality scores ≥0.6: Good preservation
- Quality scores <0.6: Poor preservation
- Diagnostic flags:
- ✓ = Good preservation of correlations
- ⚠ = Mixed quality or moderate issues
- ✗ = Poor preservation or learning failure
Compute k-anonymity metrics for privacy assessment:
```bash
# Run k-anonymity analysis
python k_anonymity_cli.py \
  --population data/population.csv \
  --reference data/reference.csv \
  --synthetic data/synthetic.csv \
  --metadata data/metadata.json \
  --qi-cols "age,gender,zipcode" \
  --output k_anonymity_results.json
```
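The core k-anonymity computation is simple: group records by the quasi-identifier columns and take the smallest group size. A minimal sketch of that idea (not the full `k_anonymity_cli.py` logic), using the same `--qi-cols` as above:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Smallest equivalence-class size over the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers, dropna=False).size().min())

synthetic = pd.read_csv("data/synthetic.csv")
k = k_anonymity(synthetic, ["age", "gender", "zipcode"])
print(f"k-anonymity of synthetic data: {k}")  # higher k means harder to single out records
```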
## Aggregating Results Across Runs

When analyzing results from multiple experimental runs, SDPype provides two aggregation tools:

- `aggregate_metrics_cli.py`: Simple bootstrap aggregation for independent runs
- `aggregate_hybrid_stratified_metrics_cli.py`: Hierarchical aggregation for nested experimental designs
For experiments with a hierarchical structure (e.g., multiple data subsets × multiple model seeds), use the hybrid stratified approach to avoid pseudoreplication and obtain valid statistical inferences.
Use aggregate_hybrid_stratified_metrics_cli.py when your experimental design has:
- Multiple data subsets (dseeds) - different train/test splits or data samples
- Multiple model training seeds (mseeds) per data subset
- Nested structure: mseeds are nested within dseeds
Example: 6 data subsets × 6 model seeds = 36 total runs
Problem: Treating all 36 runs as independent observations inflates statistical confidence because model seeds within the same data subset are correlated (they all train on the same data).
Solution: Use hierarchical aggregation that respects the nested structure.
Step 1: Average across model seeds (mseeds) within each data subset (dseed)
- For each dseed, compute the mean across all mseeds
- This produces one representative value per data subset
- Smooths out model training variance
Step 2: Compute t-based confidence intervals on dseed averages
- Treat dseed averages as independent observations (n = number of dseeds)
- Use t-distribution for small sample sizes (critical for n < 30)
- Calculate both confidence intervals (CI) and prediction intervals (PI)
Confidence Interval (95%):
CI = x̄ ± t(α/2, df) × (s / √n)
where:
- `x̄` = mean of dseed averages
- `s` = standard deviation of dseed averages
- `n` = number of data subsets (dseeds)
- `df` = n - 1 (degrees of freedom)
- `t(α/2, df)` = critical value from t-distribution (e.g., 2.571 for n=6, 95% CI)
Prediction Interval (95%):
PI = x̄ ± t(α/2, df) × s × √(1 + 1/n)
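Both intervals follow directly from the dseed-level averages. A minimal sketch using `scipy.stats.t`; the six metric values are illustrative placeholders, not real results:

```python
import numpy as np
from scipy import stats

# Illustrative dseed-level averages for one metric (one value per data subset)
dseed_means = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78])  # n = 6 dseeds

n = len(dseed_means)
x_bar = dseed_means.mean()
s = dseed_means.std(ddof=1)              # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)    # ≈ 2.571 for n = 6, 95% intervals

ci_half = t_crit * s / np.sqrt(n)         # confidence interval half-width
pi_half = t_crit * s * np.sqrt(1 + 1 / n) # prediction interval half-width

print(f"mean = {x_bar:.3f}")
print(f"95% CI: [{x_bar - ci_half:.3f}, {x_bar + ci_half:.3f}]")
print(f"95% PI: [{x_bar - pi_half:.3f}, {x_bar + pi_half:.3f}]")
```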
Confidence Interval (Inner band, colored):
- "Where the true population mean likely is"
- Quantifies uncertainty about the mean
- Narrows as sample size increases
Prediction Interval (Outer band, gray):
- "Where a NEW data subset would likely fall"
- Accounts for both estimation uncertainty AND natural variation
- Remains wider than CI even with large samples
- More relevant for predicting future experimental outcomes
Example (n=6 dseeds):
- CI: Mean ± 2.571 × (s / √6) ≈ Mean ± 1.05×s
- PI: Mean ± 2.571 × s × √(1 + 1/6) ≈ Mean ± 2.78×s

The prediction interval is roughly 2.6× wider, honestly representing the variation you'd expect in a new experiment.
```bash
# Aggregate results from hierarchical design
python aggregate_hybrid_stratified_metrics_cli.py \
  "./mimic_iii_baseline_dseed*_mseed*/" \
  -s results.html \
  -c aggregate_metrics.csv \
  -i dseed_averages.csv

# Output files:
# - results.html: Interactive visualizations with CI and PI bands
# - aggregate_metrics.csv: Summary statistics per generation
# - dseed_averages.csv: Intermediate dseed-level averages
```

Advantages of this approach:

- Avoids pseudoreplication: Properly handles nested structure (Hurlbert, 1984)
- Correct for small samples: Uses t-distribution instead of normal approximation
- Honest uncertainty: Prediction intervals show realistic variation
- Statistically valid: Provides proper basis for scientific claims
If your experimental runs are truly independent (e.g., completely different datasets, not just different seeds on the same data), use the simpler aggregate_metrics_cli.py instead.
## References

- Hurlbert, S. H. (1984). "Pseudoreplication and the design of ecological field experiments." Ecological Monographs, 54(2), 187-211.
  - Foundational paper on pseudoreplication in experimental design
- Lazic, S. E. (2010). "The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis?" BMC Neuroscience, 11(5).
  - Modern application of pseudoreplication concepts
- Student (Gosset, W. S.). (1908). "The probable error of a mean." Biometrika, 6(1), 1-25.
  - Original t-distribution paper for small-sample inference
- Meeker, W. Q., Hahn, G. J., & Escobar, L. A. (2017). Statistical Intervals: A Guide for Practitioners and Researchers (2nd ed.). Wiley.
  - Comprehensive guide to confidence and prediction intervals
- Quinn, G. P., & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists. Cambridge University Press.
  - Practical guide to nested experimental designs
- Cumming, G. (2014). "The new statistics: Why and how." Psychological Science, 25(1), 7-29.
  - Modern perspective on estimation and confidence intervals
- Ghorbani, A., & Zou, J. (2019). "Data Shapley: Equitable valuation of data for machine learning." Proceedings of the 36th International Conference on Machine Learning (ICML), 97, 2242-2251.
  - Foundational paper on Data Shapley for valuing training examples
- Jia, R., et al. (2019). "Towards efficient data valuation based on the Shapley value." Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS).
  - Efficient algorithms for computing Data Shapley values
```bibtex
@article{hurlbert1984pseudoreplication,
  title={Pseudoreplication and the design of ecological field experiments},
  author={Hurlbert, Stuart H},
  journal={Ecological Monographs},
  volume={54},
  number={2},
  pages={187--211},
  year={1984}
}

@article{lazic2010pseudoreplication,
  title={The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis?},
  author={Lazic, Stanley E},
  journal={BMC Neuroscience},
  volume={11},
  number={5},
  year={2010}
}

@article{student1908probable,
  title={The probable error of a mean},
  author={Student},
  journal={Biometrika},
  volume={6},
  number={1},
  pages={1--25},
  year={1908}
}

@book{meeker2017statistical,
  title={Statistical Intervals: A Guide for Practitioners and Researchers},
  author={Meeker, William Q and Hahn, Gerald J and Escobar, Luis A},
  edition={2},
  year={2017},
  publisher={Wiley}
}

@book{quinn2002experimental,
  title={Experimental Design and Data Analysis for Biologists},
  author={Quinn, Gerry P and Keough, Michael J},
  year={2002},
  publisher={Cambridge University Press}
}

@article{cumming2014new,
  title={The new statistics: Why and how},
  author={Cumming, Geoff},
  journal={Psychological Science},
  volume={25},
  number={1},
  pages={7--29},
  year={2014}
}

@inproceedings{ghorbani2019data,
  title={Data Shapley: Equitable valuation of data for machine learning},
  author={Ghorbani, Amirata and Zou, James},
  booktitle={International Conference on Machine Learning},
  pages={2242--2251},
  year={2019}
}

@inproceedings{jia2019towards,
  title={Towards efficient data valuation based on the Shapley value},
  author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Hynes, Nick and G{\"u}rel, Nezihe Merve and Li, Bo and Zhang, Ce and Song, Dawn and Spanos, Costas J},
  booktitle={International Conference on Artificial Intelligence and Statistics},
  year={2019}
}
```

## Model Management

```bash
# List all trained models
uv run sdpype model list
# Get detailed model info
uv run sdpype model info MODEL_ID
```

## CLI Reference

```bash
# Setup repository
uv run sdpype setup
# Run complete pipeline
uv run sdpype pipeline
# Run specific stage
uv run sdpype stage train_sdg
# Show repository status
uv run sdpype status
# List available models
uv run sdpype models
# Model management
uv run sdpype model list
uv run sdpype model info MODEL_ID
# Metrics inspection
uv run sdpype metrics list
uv run sdpype metrics compare MODEL_ID1 MODEL_ID2
# Purge experiments (destructive)
uv run sdpype purge --yes
```

## Model ID Format

Models are identified with the following format:
```
library_modeltype_refhash_roothash_trnhash_gen_N_cfghash_seed
```

Example:

```
sdv_gaussiancopula_0cf8e0f5_852ba944_852ba944_gen_0_9eadbd5d_51
```
Components:
- `library`: SDG library (sdv, synthcity, etc.)
- `modeltype`: Model type (gaussiancopula, ctgan, etc.)
- `refhash`: Reference data hash
- `roothash`: Root training data hash (generation 0)
- `trnhash`: Current training data hash
- `gen_N`: Generation number
- `cfghash`: Configuration hash
- `seed`: Random seed
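Because the format is underscore-delimited with a fixed number of fields, it can be unpacked mechanically. A hypothetical helper (not part of the SDPype API), assuming no underscores within individual field values:

```python
def parse_model_id(model_id: str) -> dict:
    """Split a model ID into its documented components (hypothetical helper)."""
    (library, modeltype, refhash, roothash, trnhash,
     gen_label, gen_n, cfghash, seed) = model_id.split("_")
    assert gen_label == "gen"  # sanity check on the fixed 'gen' marker
    return {
        "library": library,
        "model_type": modeltype,
        "ref_hash": refhash,
        "root_hash": roothash,
        "train_hash": trnhash,
        "generation": int(gen_n),
        "config_hash": cfghash,
        "seed": int(seed),
    }

print(parse_model_id("sdv_gaussiancopula_0cf8e0f5_852ba944_852ba944_gen_0_9eadbd5d_51"))
```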
## Development

```bash
# Install development dependencies
uv sync --all-extras

# Run tests
pytest

# Format code
ruff format .

# Lint
ruff check .
```

## License

MIT License - see LICENSE file for details
## Citation

If you use SDPype in your research, please cite:
```bibtex
@software{sdpype,
  title = {SDPype: Synthetic Data Pipeline},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/sdpype}
}
```