
insurance-datasets


Synthetic UK insurance datasets with known data generating processes. Built for testing pricing models — this is the dataset used throughout the Burning Cost training course.

When you are developing a GLM, a gradient boosted tree, or any other pricing algorithm, you need data where you know what the right answer is. Real policyholder data is messy, access-controlled, and the true coefficients are unknown. This package gives you clean, realistic synthetic data where the true parameters are published — so you can verify your implementation produces the right coefficients.

Two datasets are available:

  • Motor: UK personal lines motor insurance. 18 columns covering driver age, NCD, ABI vehicle group, area band, and more. Frequency and severity from a known Poisson-Gamma DGP.
  • Home: UK household insurance. 16 columns covering property value, flood zone, construction type, subsidence risk, and security level. Same Poisson-Gamma structure.

Installation

pip install insurance-datasets

Or with uv:

uv add insurance-datasets

To use polars output:

uv add "insurance-datasets[polars]"
# or
pip install "insurance-datasets[polars]"

💬 Questions or feedback? Start a Discussion. Found it useful? A ⭐ helps others find it.

Requires Python 3.10+. Dependencies: numpy, pandas. Polars is optional: uv add polars.

Quick start

from insurance_datasets import load_motor, load_home

motor = load_motor(n_policies=50_000, seed=42)
home  = load_home(n_policies=50_000, seed=42)

print(motor.shape)   # (50000, 18)
print(home.shape)    # (50000, 16)

If you prefer polars, pass polars=True:

motor_pl = load_motor(n_policies=50_000, seed=42, polars=True)
home_pl  = load_home(n_policies=50_000, seed=42, polars=True)

print(type(motor_pl))  # <class 'polars.dataframe.frame.DataFrame'>

Polars is an optional dependency. If it is not installed you will get a clear ImportError with install instructions.
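The same optional-dependency guard is easy to replicate in your own code. A minimal sketch of the pattern (illustrative only — this is not the package's actual internals):

```python
def require_polars():
    """Import polars lazily, failing with an actionable message if absent."""
    try:
        import polars as pl
    except ImportError as exc:
        raise ImportError(
            "polars=True requires the optional polars dependency; "
            'install it with: uv add "insurance-datasets[polars]"'
        ) from exc
    return pl
```

The lazy import keeps pandas-only users free of the extra dependency while still surfacing a clear fix when polars output is requested.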


Motor dataset

load_motor() returns one row per policy. Default: 50,000 policies, inception years 2019–2023.

Columns

Column Type Description
policy_id int Sequential identifier
inception_date date Policy start
expiry_date date Policy end (may be < 12 months for cancellations)
inception_year int Calendar year of inception — use for cohort splits
vehicle_age int 0–20 years
vehicle_group int ABI group 1–50
driver_age int 17–85
driver_experience int Years licensed
ncd_years int 0–5 (UK NCD scale)
ncd_protected bool Protected NCD flag
conviction_points int Total endorsement points (0 = clean)
annual_mileage int 2,000–30,000 miles
area str ABI area band A–F (A = rural/low risk, F = inner city)
occupation_class int 1–5
policy_type str 'Comp' or 'TPFT'
claim_count int Number of claims in period
incurred float Total incurred cost (£); 0.0 if no claims
exposure float Earned years (< 1.0 for cancellations)

True DGP — frequency

Poisson frequency model with log-linear predictor:

log(lambda) = log(exposure) + intercept
            + vehicle_group_coef * vehicle_group
            + driver_age_young * I(driver_age < 25)
            + driver_age_old   * I(driver_age >= 70)
            + ncd_years_coef   * ncd_years
            + area_B * I(area == 'B') ... area_F * I(area == 'F')
            + has_convictions  * I(conviction_points > 0)

Ages 25–29 blend linearly from the young-driver load down to zero by age 30.
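The blend can be written as a simple piecewise function of age, using the published young-driver loading of 0.55. The linear taper matches the prose description above; the exact interpolation inside the generator is an assumption:

```python
def young_driver_loading(driver_age: int, young_coef: float = 0.55) -> float:
    """Additive log-frequency loading for young drivers.

    Full loading below 25, linear taper over ages 25-29, zero from age 30.
    """
    if driver_age < 25:
        return young_coef
    if driver_age < 30:
        return young_coef * (30 - driver_age) / 5
    return 0.0
```

Note the function is continuous at 25 (full loading) and reaches exactly zero at 30, so a binary under-25 indicator in a fitted model slightly misspecifies the 25–29 band.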

from insurance_datasets import MOTOR_TRUE_FREQ_PARAMS

print(MOTOR_TRUE_FREQ_PARAMS)
# {'intercept': -3.2, 'vehicle_group': 0.025, 'driver_age_young': 0.55,
#  'driver_age_old': 0.3, 'ncd_years': -0.12, 'area_B': 0.1, 'area_C': 0.2,
#  'area_D': 0.35, 'area_E': 0.5, 'area_F': 0.65, 'has_convictions': 0.45}

The intercept -3.2 is the log-space baseline: exp(-3.2) ≈ 4.1% is the frequency for a base-category risk (area A, no convictions, zero NCD, at vehicle group 0) with one year of exposure. The portfolio average frequency is approximately 10% per year after all factor loadings push the mean up — young drivers, higher vehicle groups, and urban areas all contribute positively.
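Plugging the published parameters into the linear predictor is a quick sanity check. For a hypothetical mid-market risk — vehicle group 20, area D, full NCD, clean licence, one year of exposure:

```python
import math

# Values copied from MOTOR_TRUE_FREQ_PARAMS above
params = {
    "intercept": -3.2, "vehicle_group": 0.025,
    "ncd_years": -0.12, "area_D": 0.35,
}

log_lam = (
    params["intercept"]
    + params["vehicle_group"] * 20   # ABI group 20
    + params["ncd_years"] * 5        # 5 NCD years
    + params["area_D"]               # area band D
)
annual_freq = math.exp(log_lam)
print(f"{annual_freq:.3%}")  # about 5.2% expected claims per year
```

The full NCD discount roughly cancels the vehicle-group and area loadings here, leaving this risk near the 4–5% baseline.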

True DGP — severity

Gamma severity model with shape=2 (coefficient of variation ~0.71):

from insurance_datasets import MOTOR_TRUE_SEV_PARAMS

print(MOTOR_TRUE_SEV_PARAMS)
# {'intercept': 7.8, 'vehicle_group': 0.018, 'driver_age_young': 0.25}

Baseline severity at intercept 7.8 gives a mean of roughly £2,440.
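Both figures follow from standard Gamma facts: for shape k the coefficient of variation is 1/sqrt(k), and with a log link the baseline mean is exp(intercept). A minimal verification:

```python
import math

shape = 2.0
cv = 1 / math.sqrt(shape)       # coefficient of variation of a Gamma with shape k
baseline_mean = math.exp(7.8)   # log-link intercept -> baseline mean severity

print(round(cv, 2))             # 0.71
print(round(baseline_mean))     # 2441, i.e. roughly GBP 2,440
```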


Home dataset

load_home() returns one row per policy. Default: 50,000 policies, inception years 2019–2023.

Columns

Column Type Description
policy_id int Sequential identifier
inception_date date Policy start
expiry_date date Policy end
inception_year int Calendar year of inception
region str UK region (ONS groupings, 12 values)
property_value int Buildings sum insured (£); regional log-normal
contents_value int Contents sum insured (£)
construction_type str 'Standard', 'Non-Standard', or 'Listed'
flood_zone str Environment Agency zone: 'Zone 1', 'Zone 2', 'Zone 3'
is_subsidence_risk bool High-subsidence area (London clay, Midlands)
security_level str 'Basic', 'Standard', or 'Enhanced'
bedrooms int 1–5
property_age_band str Construction era: Pre-1900, 1900–1945, 1945–1980, 1980–2000, Post-2000
claim_count int Number of claims in period
incurred float Total incurred cost (£); 0.0 if no claims
exposure float Earned years

True DGP — frequency

from insurance_datasets import HOME_TRUE_FREQ_PARAMS

print(HOME_TRUE_FREQ_PARAMS)
# {'intercept': -2.8, 'property_value_log': 0.18,
#  'construction_non_standard': 0.4, 'construction_listed': 0.25,
#  'flood_zone_2': 0.3, 'flood_zone_3': 0.85, 'subsidence_risk': 0.55,
#  'security_standard': -0.1, 'security_enhanced': -0.25}

property_value_log and contents_value_log are centred log transforms of the raw values — log(value / reference) with references of £250,000 and £30,000: log(property_value / 250_000) and log(contents_value / 30_000) respectively.
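For example, a hypothetical £400,000 property with £45,000 contents gives:

```python
import math

property_value_log = math.log(400_000 / 250_000)   # log(1.6)
contents_value_log = math.log(45_000 / 30_000)     # log(1.5)

print(round(property_value_log, 3))  # 0.47
print(round(contents_value_log, 3))  # 0.405
```

Centring on the reference values means a property at exactly £250,000 with £30,000 contents contributes zero to the linear predictor, so the intercept is interpretable as the reference-property baseline.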

True DGP — severity

Gamma severity model with shape=1.5 (CV ~0.82 — household claims are more volatile than motor):

from insurance_datasets import HOME_TRUE_SEV_PARAMS

print(HOME_TRUE_SEV_PARAMS)
# {'intercept': 8.1, 'property_value_log': 0.35,
#  'flood_zone_3': 0.45, 'contents_value_log': 0.22}

Baseline severity at intercept 8.1 gives a mean of roughly £3,300.


Verifying your model against the true parameters

The point of a known DGP is that you can check your implementation. Here is a worked GLM example.

The DGP includes driver_age_young and driver_age_old effects in the frequency model. Including driver age in the fitted model matters: omit it and omitted variable bias pushes part of the age effect onto correlated factors such as NCD and convictions. The worked example below fits the full specification:

import numpy as np
import statsmodels.api as sm
from insurance_datasets import load_motor, MOTOR_TRUE_FREQ_PARAMS

df = load_motor(n_policies=50_000, seed=42)
df["has_convictions"] = (df["conviction_points"] > 0).astype(int)
df["driver_age_young"] = (df["driver_age"] < 25).astype(int)
df["driver_age_old"] = (df["driver_age"] >= 70).astype(int)

for band in ["B", "C", "D", "E", "F"]:
    df[f"area_{band}"] = (df["area"] == band).astype(int)

features = [
    "vehicle_group", "ncd_years", "has_convictions",
    "driver_age_young", "driver_age_old",
    "area_B", "area_C", "area_D", "area_E", "area_F",
]
X = sm.add_constant(df[features])

result = sm.GLM(
    df["claim_count"],
    X,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"].clip(lower=1e-6)),
).fit(disp=False)

print("Parameter recovery:")
print(f"  vehicle_group:     fitted={result.params['vehicle_group']:.4f}  true={MOTOR_TRUE_FREQ_PARAMS['vehicle_group']:.4f}")
print(f"  ncd_years:         fitted={result.params['ncd_years']:.4f}  true={MOTOR_TRUE_FREQ_PARAMS['ncd_years']:.4f}")
print(f"  convictions:       fitted={result.params['has_convictions']:.3f}  true={MOTOR_TRUE_FREQ_PARAMS['has_convictions']:.3f}")
print(f"  driver_age_young:  fitted={result.params['driver_age_young']:.3f}  true={MOTOR_TRUE_FREQ_PARAMS['driver_age_young']:.3f}")

At 50k policies, slope estimates should be within a few percent of the true values. Omitting driver_age_young from the formula introduces omitted variable bias — NCD and convictions will both appear inflated because they are correlated with driver age in this portfolio.

Verifying a flood zone relativity (home)

from insurance_datasets import load_home

df = load_home(n_policies=50_000, seed=42)

z1 = df[df["flood_zone"] == "Zone 1"]
z3 = df[df["flood_zone"] == "Zone 3"]
ratio = (z3["claim_count"].sum() / z3["exposure"].sum()) / (z1["claim_count"].sum() / z1["exposure"].sum())
print(f"Zone 3 vs Zone 1 frequency ratio: {ratio:.2f}x")
# True DGP implies exp(0.85) = 2.34x — you should be close to this unadjusted

Design choices

Why Poisson-Gamma and not something more exotic? GLMs are the industry standard for personal lines pricing in the UK. The DGP matches what a correctly specified production model would use. If you want to test Tweedie models, use raw incurred as the response — the data supports it.

Why no missing values? This is a testing dataset. Missing value imputation is a separate problem. Mixing the two makes it harder to isolate algorithm correctness.

Why 50,000 policies as the default? Below about 10,000 policies, coefficient estimates become noisy enough that a correct implementation can look wrong. At 50,000 the estimates are stable. For quick unit tests, 1,000–5,000 is sufficient.
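The sample-size guidance follows from standard error scaling: GLM coefficient standard errors shrink roughly as 1/sqrt(n), so a rough rule of thumb for how much noisier a smaller sample will be is:

```python
import math

def relative_se(n_small: int, n_large: int = 50_000) -> float:
    """Approximate ratio of coefficient standard errors at two sample sizes,
    assuming the usual 1/sqrt(n) scaling."""
    return math.sqrt(n_large / n_small)

print(round(relative_se(1_000), 1))   # 7.1 -> ~7x noisier than the 50k default
print(round(relative_se(10_000), 2))  # 2.24
```

This is why a correct implementation can look wrong on a few thousand policies: the estimates are unbiased but several times noisier.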

Why is the home DGP simpler than motor? Motor pricing in the UK has more rating variables with stronger interactions. The home DGP reflects a less mature pricing environment where a handful of factors (flood zone, construction type, subsidence) dominate.

Why inception_year and not accident_year? In actuarial triangles, accident year means the year a loss occurred — which for a synthetic portfolio without claim development would need to be modelled separately from inception. This column is simply the year the policy incepted, so inception_year is the right term.


Performance

Benchmarked on Databricks serverless (Python 3.11, seed=42, n=50,000 policies). Full script: benchmarks/run_benchmark.py.

The benchmark fits correctly-specified Poisson GLMs and compares the fitted coefficients to the published true DGP parameters. This is the primary intended use of the package: verifying that a GLM implementation recovers the right answer on data where the right answer is known.

Motor frequency (Poisson GLM, n=50,000)

Parameter True Fitted Bias
vehicle_group 0.0250 0.0239 −0.0011
ncd_years −0.1200 −0.1245 −0.0045
has_convictions 0.4500 0.5119 +0.0619
driver_age_young 0.5500 0.4434 −0.1066
driver_age_old 0.3000 0.2063 −0.0937

Parameter RMSE (5 main factors): 0.069. 95% CI coverage: 60% (3/5 parameters). Load time 0.50s, fit time 0.46s.

The two age parameters have the largest bias, driven by the blended transition zone in the DGP (ages 25–29 blend linearly down from the young-driver load). A binary indicator slightly misspecifies the edge, which is expected. The parameters that matter most for pricing decisions — vehicle group, NCD, and area bands — are recovered with negligible bias.

Omitted variable bias (dropping driver_age)

Parameter True Full model Omit age Bias (omit)
vehicle_group 0.025 0.024 0.024 −0.001
ncd_years −0.120 −0.125 −0.154 −0.034
has_convictions 0.450 0.512 0.578 +0.128

Omitting driver age inflates the NCD coefficient by 24% and the convictions loading by 13%. This is the classic omitted variable bias: driver age is correlated with NCD (young drivers have fewer NCD years) and with conviction history. When the model cannot see age, it pushes those effects onto the correlated variables.

Home frequency (Poisson GLM, n=50,000)

Parameter True Fitted Bias
property_value_log 0.180 0.150 −0.030
construction_non_standard 0.400 0.314 −0.086
construction_listed 0.250 0.310 +0.060
flood_zone_2 0.300 0.281 −0.019
flood_zone_3 0.850 0.924 +0.074
subsidence_risk 0.550 0.635 +0.085
security_standard −0.100 −0.127 −0.027
security_enhanced −0.250 −0.292 −0.042

Parameter RMSE (8 factors): 0.059. Zone 3 vs Zone 1 raw frequency ratio: 2.56x (true DGP implies exp(0.85) = 2.34x). The GLM estimate of 2.52x is close; the raw ratio is slightly inflated by marginal composition effects.


Running the tests

Tests include a GLM coefficient recovery check and require statsmodels:

uv add --dev statsmodels
uv run pytest

Capabilities

The notebook at notebooks/insurance_datasets_demo.py loads both datasets at 50,000 policies and runs Poisson GLMs for frequency and Gamma GLMs for severity on both motor and home, comparing fitted coefficients to the published true values. It demonstrates:

  • GLM coefficient recovery: Motor frequency Poisson GLM recovers all major parameters (vehicle group, NCD, area band, convictions, driver age) within a few percent of their true values at 50k policies.
  • Severity recovery: Gamma GLM with log link recovers the vehicle group and young driver severity parameters accurately.
  • Home DGP validation: Flood zone, construction type, and subsidence coefficients are recovered from the home dataset, including the Zone 3 frequency uplift of roughly 2.3x relative to Zone 1.
  • Ground truth as a testing tool: The MOTOR_TRUE_FREQ_PARAMS and HOME_TRUE_FREQ_PARAMS dicts let you quantify how far any modelling implementation deviates from the correctly specified answer — something impossible with real data.
  • Reproducibility: All datasets are fully deterministic given a seed; load_motor(seed=42) always returns the same 50,000 policies.

Related libraries

Library Why it's relevant
insurance-synthetic Generate portfolio-fitted synthetic data — use when you need data matched to your own book rather than a fixed DGP
insurance-interactions GLM interaction detection — use this dataset to validate that the CANN pipeline recovers known interaction structure
insurance-cv Walk-forward cross-validation — this dataset gives a controlled environment to benchmark CV strategies
insurance-validation Model validation tools — use with this dataset to check validation metrics against known true parameters

All Burning Cost libraries and course


Licence

MIT. See LICENSE.