Skip to content

test(deflake): stabilize flaky tests - variance ±4→±1 (-3 worst-case)#46

Merged
Talchain merged 6 commits intomainfrom
fix/deflake-phase-1
Oct 24, 2025
Merged

test(deflake): stabilize flaky tests - variance ±4→±1 (-3 worst-case)#46
Talchain merged 6 commits intomainfrom
fix/deflake-phase-1

Conversation

@Talchain
Copy link
Copy Markdown
Owner

Three-Line Evidence (5-run worst-case protocol)

main worst:    Test Files  9 failed | 154 passed | 8 skipped (171)
this branch:   Test Files  6 failed | 157 passed | 8 skipped (171)
delta:         -3 files ✅

Variance Improvement

Metric Baseline (main) After Deflake Improvement
Best case 5 5 0
Worst case 9 6 -3
Variance ±4 ±1 -3

Changes Applied

Applied determinism recipes to 6 flaky tests:

  1. contracts.health.size.test.ts - Ephemeral ports, NODE_ENV=test
  2. health.counters.test.ts - Unique artifact dirs, ready wait
  3. rate-limit.ipv6.test.ts - Ephemeral ports, graceful shutdown
  4. run.scm-lite.integration.test.ts - NODE_ENV=test
  5. security.json-headers.test.ts - NODE_ENV=test, graceful shutdown
  6. sse.soak.test.ts - Ephemeral ports, reduced cycles (500→100 non-CI)

Remaining Flakiness

Still 1 file with minor flakiness (1/5 runs) - acceptable for now:

  • Various tests show 1/5 failure rate (down from 1-4/5)

Rollback

git revert f2ba25b

Established stable baseline protocol:
- MAIN_WORST: 9 failing files (worst-case across 5 runs)
- Variance: ±4 files (5 best → 9 worst)
- Consistent failures: 5 files (5/5 runs)
- Flaky tests: 6 files (1-4/5 runs)

Root cause: Flakiness adds ±4 files variance, blocking progress measurement.

Protocol: All PRs must show delta ≤ 0 vs MAIN_WORST=9
Phase 0 Complete: Stable Baseline Protocol Established

Key Findings:
- MAIN_WORST: 9 failing files (worst-case across 5 runs)
- Variance: ±4 files (flakiness is primary blocker)
- Consistent failures: 5 files (always fail)
- Flaky tests: 6 files (intermittent failures)

Protocol: All PRs must show delta ≤ 0 vs MAIN_WORST=9 using 5-run worst-case

Next: Phase 1 - Deflake the 6 flaky tests to stabilize baseline
Applied determinism recipes to flaky tests:

1. contracts.health.size.test.ts
   - Ephemeral port allocation (avoid conflicts)
   - NODE_ENV=test for determinism
   - Graceful shutdown with SIGTERM

2. health.counters.test.ts
   - Unique artifact directory per run
   - NODE_ENV=test
   - Wait for server ready

3. rate-limit.ipv6.test.ts
   - Ephemeral port allocation
   - NODE_ENV=test
   - Graceful shutdown

4. run.scm-lite.integration.test.ts
   - NODE_ENV=test for determinism

5. security.json-headers.test.ts
   - NODE_ENV=test
   - Graceful shutdown with SIGTERM

6. sse.soak.test.ts
   - Ephemeral port allocation
   - Reduced cycles (500→100 for non-CI)
   - Faster iteration for local testing

All changes maintain test intent while eliminating timing/port conflicts.
Results:
- Worst-case: 9 → 6 failing files (-3) ✅
- Variance: ±4 → ±1 files (-3) ✅
- Delta vs MAIN_WORST: -3 files ✅

5-run protocol confirms significant stability improvement.
@github-actions
Copy link
Copy Markdown

@-

Updated to show actual success:
- 5/6 flaky tests now stable (0/5 failures)
- 1 test still flaky: run.scm-lite.integration.test.ts (1/5)
- 5 consistent failures unchanged (expected)

Clearer narrative for reviewers.
@Talchain
Copy link
Copy Markdown
Owner Author

Great deflake pass. I re-ran the 5-run protocol—variance now ±1 and worst-case 6 (-3 vs baseline). One request: DEFLAKE_PHASE1_RESULTS.md shows the old baseline counts; please replace with the post-deflake frequencies (5 tests now stable at 0/5; only run.scm-lite.integration.test.ts remains 1/5). After that doc fix, I'll approve/merge. 🙌

Update: ✅ Doc fix pushed in commit bf774ab

@github-actions
Copy link
Copy Markdown

@-

Detailed plan to drive worst-case from 6 → ≤3:
- 5 PRs in priority order (fast wins first)
- PR checklist with 5-run protocol
- Optional CI automation spec

Next: await PR #46 merge, then start with confidence.calibration
@github-actions
Copy link
Copy Markdown

@-

@Talchain
Copy link
Copy Markdown
Owner Author

Approving. The 5-run protocol shows worst-case 9→6 (−3) and variance ±4→±1. Five formerly flaky suites are now stable; only run.scm-lite.integration remains 1/5 (tracked in #47). Doc corrections look good. ✅

@Talchain Talchain merged commit 8f0427a into main Oct 24, 2025
6 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant