Automated detection of missing interaction terms in UK personal lines GLMs.
A Poisson frequency GLM for motor insurance with 12 rating factors has 66 possible pairwise interactions. Manually searching them — fitting, testing, reviewing 2D actual-vs-expected plots — takes days and is driven by intuition rather than data. You will miss interactions that are not obvious from marginal plots, and you will spend time testing pairs that are irrelevant.
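The pair counts are just binomial coefficients; a quick sketch shows how fast the search space grows with the number of rating factors:

```python
from math import comb

# Pairwise interaction counts grow quadratically with the number of factors
for n_factors in (12, 50):
    print(n_factors, comb(n_factors, 2))   # 12 -> 66, 50 -> 1225
```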
The standard manual process:
1. Fit a GBM to get a benchmark prediction
2. Loop over pairs of factors; produce 2D A/E plots
3. Identify where the multiplicative GLM assumption breaks down
4. Test candidate interactions via likelihood-ratio test
5. Repeat
This library automates steps 2–4.
The pipeline has three stages:
Stage 1 - CANN: Train a Combined Actuarial Neural Network (Schelldorfer & Wüthrich 2019) on the residuals of your existing GLM. The CANN uses a skip connection so it starts from the GLM prediction and only learns what the GLM is missing. After training, any deviation of the CANN from zero encodes structure the GLM cannot express — interactions.
Stage 2 - NID: Apply Neural Interaction Detection (Tsang et al. 2018) to the trained CANN weights. The algorithm reads the interaction structure directly from the weight matrices: two features can only interact if they both contribute to the same first-layer hidden unit. The NID score for a pair (i, j) is:
d(i,j) = Σ_s z_s · min(|W1[s,i]|, |W1[s,j]|)
where z_s is how much first-layer unit s influences the output — computed as the propagated sum of absolute weights from layer 2 to the output (not a matrix product). This gives a ranked list of candidate interactions in milliseconds after training.
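The NID score can be sketched directly from the definition above. This is an illustrative numpy implementation, not the library's own code: it assumes a hypothetical MLP with first-layer weights `W1` (hidden × features), second-layer weights `W2`, and output weights `w_out`.

```python
import numpy as np
from itertools import combinations

def nid_pairwise_scores(W1, W2, w_out):
    """Sketch of pairwise NID scores from MLP weight matrices.

    z_s (influence of first-layer unit s on the output) is computed by
    propagating absolute weights toward the output, not by a signed
    matrix product.
    """
    z = np.abs(W2).T @ np.abs(w_out)          # influence of each first-layer unit
    A = np.abs(W1)                            # (hidden_units, n_features)
    scores = {}
    for i, j in combinations(range(A.shape[1]), 2):
        # A pair only scores if both features feed the same hidden unit
        scores[(i, j)] = float(z @ np.minimum(A[:, i], A[:, j]))
    return scores
```

With toy weights where features 0 and 1 share a hidden unit and feature 2 sits in its own unit, the (0, 1) pair scores highest and pairs involving feature 2 score zero.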
Stage 3 - GLM testing: For each top-K candidate pair, refit the GLM with the interaction added and compute a likelihood-ratio test statistic. The output table includes deviance improvement, AIC/BIC, p-values (Bonferroni corrected), and n_cells — the parameter cost of adding each interaction.
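The Stage 3 statistic itself is standard: for nested GLMs, the deviance drop when the interaction is added is asymptotically chi-squared with degrees of freedom equal to the added parameter count. A minimal sketch (the deviance numbers are hypothetical):

```python
from scipy.stats import chi2

def lr_test(deviance_base, deviance_full, df):
    """Likelihood-ratio test for nested GLMs (sketch).

    The deviance difference is asymptotically chi2(df), where df is the
    number of parameters the interaction adds (n_cells for cat x cat).
    """
    stat = deviance_base - deviance_full
    return stat, chi2.sf(stat, df)

# Hypothetical deviances; a 4x4 cat x cat pair costs (4-1)*(4-1) = 9 df
stat, p = lr_test(12450.3, 12401.8, df=9)
```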
Both Poisson (frequency) and Gamma (severity) families are supported.
Requires uv add "insurance-interactions[torch]". The CANN and NID stages use PyTorch.
import polars as pl
import numpy as np
from insurance_interactions import InteractionDetector, build_glm_with_interactions
# Generate synthetic UK motor data to run this example end to end.
# In production, supply your actual rating factor DataFrame, claim counts,
# fitted GLM predictions, and exposure weights.
rng = np.random.default_rng(42)
N = 10_000
age_band = pl.Series('age_band', rng.choice(['<25', '25-40', '40-60', '60+'], size=N))
vehicle_group = pl.Series('vehicle_group', rng.choice(['A', 'B', 'C', 'D'], size=N))
ncd = pl.Series('ncd', rng.integers(0, 10, size=N))
annual_mileage = pl.Series('annual_mileage', rng.integers(3000, 30000, size=N))
X_train = pl.DataFrame([age_band, vehicle_group, ncd, annual_mileage])
# Exposure and claim counts with a known age_band x vehicle_group interaction
exposure_train = rng.uniform(0.1, 1.0, size=N)
base_rate = 0.06
young_hv = ((age_band == '<25') & (vehicle_group == 'D')).to_numpy().astype(float)
# True data-generating process: includes the interaction
mu_true = base_rate * exposure_train * (1 + 0.4 * young_hv)
y_train = rng.poisson(mu_true)
# GLM baseline: intercept-only (does NOT know about the interaction).
# This is the key: the CANN learns the residual between what the GLM predicts
# and what the data shows. If you pass mu_true as glm_predictions, the CANN
# sees flat residuals and NID detects nothing.
mu_glm_train = base_rate * exposure_train # intercept-only GLM
detector = InteractionDetector(family="poisson")
detector.fit(
X=X_train,
y=y_train,
glm_predictions=mu_glm_train,
exposure=exposure_train,
)
# Ranked interaction table with deviance gains and LR test results
print(detector.interaction_table())
# Top recommended interactions (significant after Bonferroni correction)
suggested = detector.suggest_interactions(top_k=5)
# [("age_band", "vehicle_group"), ("age_band", "ncd"), ...]
# Refit GLM with approved interactions
final_model, comparison = build_glm_with_interactions(
X=X_train,
y=y_train,
exposure=exposure_train,
interaction_pairs=suggested,
family="poisson",
)
print(comparison)

The interaction table contains one row per candidate pair:
| Column | Description |
|---|---|
| feature_1, feature_2 | Factor names |
| nid_score | Raw NID score (higher = stronger detected interaction in CANN) |
| nid_score_normalised | Normalised to [0, 1] for interpretability |
| n_cells | Parameter cost: (L_i - 1)(L_j - 1) for cat×cat |
| delta_deviance | Deviance reduction when adding this pair to the GLM |
| delta_deviance_pct | As a percentage of base GLM deviance |
| lr_chi2, lr_df, lr_p | Likelihood-ratio test statistic and p-value |
| recommended | True if significant after Bonferroni correction |
The n_cells column is important for credibility decisions: a strong interaction requiring 200 new parameters may be less useful than a moderate one requiring 4.
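The parameter-cost arithmetic is simple enough to check by hand:

```python
# Parameter cost of a categorical x categorical interaction:
# (levels_i - 1) * (levels_j - 1) extra coefficients.
def n_cells(levels_i: int, levels_j: int) -> int:
    return (levels_i - 1) * (levels_j - 1)

print(n_cells(4, 4))    # age_band x vehicle_group, both 4 levels -> 9
print(n_cells(9, 24))   # a hypothetical wider pair -> 184
```

A pair with n_cells in the hundreds spreads its deviance gain over many thinly-populated cells, which is exactly the credibility concern flagged above.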
The core package does not include PyTorch (the CANN + NID pipeline requires it).
Install with the torch extra to use InteractionDetector:
uv add "insurance-interactions[torch]"

Without torch, only the GLM testing functions (test_interactions, build_glm_with_interactions)
and NID scoring utilities are available.
With SHAP interaction validation (requires CatBoost):
uv add "insurance-interactions[shap]"
Training is controlled via DetectorConfig:
from insurance_interactions import DetectorConfig, InteractionDetector
cfg = DetectorConfig(
cann_hidden_dims=[32, 16], # MLP architecture
cann_n_epochs=300,
cann_n_ensemble=5, # Average over 5 training runs for stable NID
cann_patience=30, # Early stopping patience
top_k_nid=20, # NID pairs to forward to GLM testing
top_k_final=10, # Interactions in final suggest_interactions()
mlp_m=True, # MLP-M variant: reduces false positive interactions
nid_max_order=2, # 2 = pairwise; 3 = also compute three-way
alpha_bonferroni=0.05, # Significance level after Bonferroni correction
)
detector = InteractionDetector(family="poisson", config=cfg)

Setting mlp_m=True activates the MLP-M architecture (Tsang et al. 2018): each feature gets its own small univariate network to absorb the main effect, forcing the main MLP to model only interactions. This reduces false positive interactions at the cost of more training parameters. Recommended for datasets with strongly correlated features (e.g. age and NCD).
cann_n_ensemble=3 (or more) trains multiple CANN runs with different random seeds and averages the NID scores. CANN training is stochastic; a single run may produce unstable weight matrices. Three runs is a reasonable default; five is better for production use.
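The averaging step itself is trivial; a sketch with hypothetical per-pair scores from three seeded runs shows why it stabilises the ranking:

```python
import numpy as np

# Hypothetical NID scores for 3 candidate pairs from 3 independently
# seeded CANN fits; a single run's ranking can flip between seeds.
runs = np.array([
    [0.9, 0.2, 0.1],   # seed 0
    [0.7, 0.3, 0.2],   # seed 1
    [0.8, 0.1, 0.3],   # seed 2
])
mean_scores = runs.mean(axis=0)            # average over ensemble members
ranking = np.argsort(mean_scores)[::-1]    # strongest pair first
```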
Run the detector separately for frequency and severity:
freq_detector = InteractionDetector(family="poisson")
freq_detector.fit(X=X, y=claim_counts, glm_predictions=mu_freq_glm, exposure=exposure)
sev_detector = InteractionDetector(family="gamma")
sev_detector.fit(X=X_claims, y=claim_amounts, glm_predictions=mu_sev_glm, exposure=claim_counts)

In practice, frequency and severity interactions differ. Young driver × sports car interactions are typically stronger in frequency. Severity interactions are noisier due to the higher variance in claim amounts.
The CANN is from Schelldorfer & Wüthrich (2019), "Nesting Classical Actuarial Models into Neural Networks" (SSRN 3320525). NID is from Tsang, Cheng & Liu (2018), "Detecting Statistical Interactions from Neural Network Weights" (ICLR 2018). The direct application of this pipeline to insurance GLMs is in Lindström & Palmquist (2023), "Detection of Interacting Variables for GLMs via Neural Networks" (European Actuarial Journal).
The CANN architecture:
μ_CANN(x) = μ_GLM(x) * exp(NN(x; θ))
The GLM prediction enters as a fixed log-space offset. The output layer of the neural network is zero-initialised so the CANN equals the GLM exactly at the start of training. The network then learns only the residual structure — which, in a well-specified GLM missing interactions, corresponds to those interaction terms.
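A minimal numpy sketch of the skip connection (illustrative only, not the library's implementation) makes the zero-initialisation property concrete: with a zero-initialised output layer, exp(NN(x)) = exp(0) = 1, so the CANN reproduces the GLM exactly at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 policies, 3 encoded features
W1 = rng.normal(size=(3, 8))         # first-layer weights (random init)
w_out = np.zeros(8)                  # output layer: zero-initialised

nn_out = np.tanh(X @ W1) @ w_out     # == 0 everywhere at initialisation
mu_glm = np.full(5, 0.06)            # GLM predictions enter as a fixed offset
mu_cann = mu_glm * np.exp(nn_out)    # equals mu_glm exactly before training
```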
- NID depends on the CANN having converged. Poor training (small dataset, high learning rate, too few epochs) produces unreliable weight matrices. Use cann_n_ensemble ≥ 3 and check cann.val_deviance_history.
- Very small datasets (< 5,000 policies) may not provide enough signal for the CANN to learn stable residual structure. The LR tests still work, but the NID ranking may be noisy.
- NID is not a statistical test — it produces a ranking, not p-values. The LR test in Stage 3 provides the statistical rigour.
- Correlated features (age and NCD in UK motor) can spread interaction signal across spurious pairs. The MLP-M variant with L1 sparsity partially mitigates this.
- The GLM refit step uses glum. If your rating engine uses a different GLM package, the n_cells, delta_deviance, and LR statistics are still valid; just refit your own model with the suggested interaction pairs.
UK actuaries working under PRA SS1/23 model risk governance and FCA Consumer Duty pricing rules need interaction decisions to be auditable. This library is designed to support that: it produces a ranked table with test statistics, not a black-box model. The actuary decides which interactions to add; the library provides the shortlist and the evidence.
databricks/benchmark.py — the definitive benchmark:
CANN+NID vs exhaustive LR testing vs main-effects-only GLM on 15,000 synthetic UK motor
policies with 10 rating factors and 3 planted interactions (strong, moderate, and weak).
Measures out-of-sample Poisson deviance, interaction recovery, and projects the time
saving at scale (10 to 50 features). Run this first to understand what the library does
and does not deliver.
insurance_interactions_demo.py — the workflow demo:
the full CANN + NID + GLM testing pipeline end-to-end on a simpler 5-feature dataset.
Good for understanding the API before running the benchmark.
Finding the Interactions Your GLM Missed — how CANN + NID automates the interaction search and why manual 2D A/E plots miss the non-obvious pairs.
A UK motor book with 50 rating factors has C(50, 2) = 1,225 candidate interaction pairs. Exhaustive pairwise likelihood-ratio testing has two problems at this scale:
- Time: fitting 1,225 GLMs on 50,000 policies with 50 features each takes roughly 30–60 minutes on a CPU. That is before any review.
- Sensitivity: Bonferroni correction at 1,225 tests means the threshold drops to p < 0.000041. An interaction with a 0.25 log-point effect can easily fail to reach this threshold even when it is real.
CANN+NID addresses both. One neural network fit screens all 1,225 pairs via weight-matrix analysis. Only the top 15 candidates receive GLM likelihood-ratio tests, so the Bonferroni threshold is p < 0.0033 — roughly 82x less stringent than under exhaustive testing.
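The threshold arithmetic can be checked directly:

```python
# Bonferroni thresholds: NID screening cuts the number of formal LR tests
# from all 1,225 pairs down to the top 15 candidates.
alpha = 0.05
thr_exhaustive = alpha / 1225                 # ~4.1e-5
thr_screened = alpha / 15                     # ~3.3e-3
print(round(thr_screened / thr_exhaustive))   # -> 82
```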
| Metric | CANN+NID | Exhaustive LR testing | Notes |
|---|---|---|---|
| Candidate pairs | 1,225 | 1,225 | same problem |
| GLM fits required | ~15 (top-K) | 1,225 | CANN replaces 98.8% of GLM fits |
| Bonferroni threshold | p < 0.0033 | p < 0.000041 | 82x difference in detectable effect size |
| Wall-clock time (CPU) | 5–10 min | ~40–60 min (extrapolated) | CANN training time is fixed regardless of feature count |
| Planted interactions recovered (3 planted) | expected 2–3 / 3 | not practical to run | exhaustive: threshold too strict for moderate effects |
| False positives after correction | expected 0–1 | expected 0–3 | NID pre-screening reduces spurious discoveries |
Run benchmarks/benchmark_50features.py on Databricks to reproduce. The benchmark generates 50,000 synthetic policies with 50 features (10 genuine rating factors + 40 noise covariates), plants 3 interactions, times both approaches, and prints the full comparison.
With 10 features (C(10,2) = 45 pairs), exhaustive testing is fast (a few seconds) and the Bonferroni threshold is p < 0.0011 — perfectly workable. CANN+NID with compact settings (n_ensemble=2, n_epochs=150) can underperform exhaustive testing on this toy problem. This is expected and honest: if you only have 10 features, exhaustive testing is fine.
Benchmarked on Databricks (2026-03-16), n_ensemble=2, n_epochs=150:
| Metric | Exhaustive LR testing | CANN+NID (compact settings) |
|---|---|---|
| Pairs fitted / tested | 45 | 45 screened -> top 10 tested |
| Runtime (Databricks CPU) | 43.3s | 34.7s |
| True positives (2 planted) | 2 / 2 | 0 / 2 |
| False positives (Bonferroni-corrected) | 5 | 1 |
The CANN missed both planted interactions at compact settings — a 2-run ensemble trained for only 150 epochs produces unstable NID rankings. Exhaustive testing found both but returned 5 false positives alongside them.
Run benchmarks/benchmark.py to see this result. It is kept in the repo because the 10-feature result is honest: use exhaustive testing when it is practical, CANN+NID when it is not.
The crossover point is roughly 20–25 features (C(20,2) = 190, C(25,2) = 300 pairs). Above that, CANN+NID is faster and more sensitive.
| Library | What it does |
|---|---|
| shap-relativities | Extract rating relativities from GBMs — use the GBM as the benchmark that reveals where the GLM is missing structure |
| bayesian-pricing | Hierarchical Bayesian models for thin rating cells — once interactions are identified, thin interaction cells need partial pooling |
| insurance-datasets | Synthetic UK motor and home datasets — use to validate the detector recovers known interaction structure |
| insurance-cv | Walk-forward cross-validation for pricing models — use to assess whether adding interactions improves out-of-sample performance |
| insurance-causal | Double Machine Learning for causal inference — establishes whether detected interactions are genuine causal drivers |
| insurance-synthetic | Synthetic portfolio generation — create datasets with known interaction structure to validate detection |