System Crashes in Production: Process Explosion & Resource Exhaustion (Ubuntu 22.04, v2.1.30-2.1.32) #23484

# System Crashes in Production Environment: Process Explosion & Resource Exhaustion

## Environment

**Claude Code Version:** 2.1.30 - 2.1.32
**OS:** Ubuntu 22.04 LTS (Jammy)
**Kernel:** Linux 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP
**Architecture:** x86_64
**Platform:** AWS EC2 Instance (16-core, 60GB RAM)
**Node.js:** Multiple versions via PM2

## System Context

This is a **production EC2 instance** running:
- 60+ PM2-managed Node.js services
- Docker containers (signal-cli-rest-api and others)
- Claude Code CLI for development/maintenance tasks
- Normal process count: ~864 processes (268 node, 213 shell processes)
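
A per-command baseline like the one above can be captured with a small helper. This is a minimal sketch: `count_by_comm` is a hypothetical name, and it assumes its input is one command name per line, as produced by `ps -eo comm=`.

```bash
# Tally command names read from stdin (one per line) and print
# "<count> <name>", highest count first; ties are ordered by name.
count_by_comm() {
  awk '{ c[$0]++ } END { for (k in c) print c[k], k }' | sort -k1,1nr -k2
}
```

Running `ps -eo comm= | count_by_comm | head` on a healthy system gives a reference point to compare against during an incident.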

## Issue Summary

Multiple **complete system crashes** occurred while Claude Code was running in this high-load production environment. While we cannot definitively prove Claude Code caused the crashes, it was active during incidents involving:

1. **Process explosion** - 73,217 shell processes at time of crash
2. **npm/Node.js crash** (SIGABRT) in claude-shared-context directory
3. **Resource exhaustion** - System load 117.47 (729% above normal)
4. **Complete system failure** requiring hard reboot

## Critical Incidents

### Incident 1: Complete System Crash (2026-02-01)

**Duration:** 16 hours 20 minutes of cascading failures
**Outcome:** System unresponsive, hard reboot required

**Timeline:**

| Time | Event | Details |
|------|-------|---------|
| 00:00 | Docker OOM killer begins | 9,080+ kills in 4 minutes, Java processes exhausting memory |
| 00:04 | npm process crashes | SIGABRT in `/home/fanning/claude-shared-context`, core dump available |
| 01:26 | System degradation | Load: 117.47, I/O wait: 75.5% |
| 15:10 | Shell process explosion begins | 15,134 processes |
| 16:15 | Process count critical | 73,217 shell processes (+58,083 in 65 minutes) |
| 16:20 | **COMPLETE SYSTEM CRASH** | Process table exhausted, system unresponsive |
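
The growth rate implied by the 15:10 and 16:15 samples can be worked out directly. The helper name `growth_per_minute` is hypothetical; the figures are the ones from the timeline.

```bash
# Average growth in processes per minute between two samples.
# Arguments: start_count end_count elapsed_minutes
growth_per_minute() {
  awk -v s="$1" -v e="$2" -v m="$3" 'BEGIN { printf "%.1f\n", (e - s) / m }'
}

growth_per_minute 15134 73217 65   # samples taken at 15:10 and 16:15
```

This prints `893.6`, which is roughly 15 new shell processes every second, sustained for over an hour.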

**Root causes identified:**
- Shell process explosion (73,217 `[sh]` processes)
- Docker container OOM loop (restart policy created infinite fork cascade)
- npm crash (SIGABRT) with core dump in claude-shared-context directory
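
To find containers that can restart without an upper bound, the restart policies can be audited. This is a sketch under assumptions: `unbounded_restarts` is a hypothetical helper, and it expects one `<name> <policy> <max-retries>` line per container. It does not assert which policy was in effect during the incident.

```bash
# Read "<name> <restart-policy> <max-retry-count>" lines on stdin and
# print the names of containers whose policy has no retry ceiling.
# "on-failure" with a retry count of 0 means "retry without limit".
unbounded_restarts() {
  awk '$2 == "always" || $2 == "unless-stopped" ||
       ($2 == "on-failure" && $3 == 0) { print $1 }'
}
```

The input can be produced with `docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}} {{.HostConfig.RestartPolicy.MaximumRetryCount}}' $(docker ps -aq)` and piped into the helper.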

### Incident 2: PM2 Crash Loop (2026-01-27)

**Duration:** 2 hours 32 minutes
**Root Cause:** Port conflict causing infinite restart loop
**Impact:** 8,414 crash iterations, all PM2 processes killed (including Claude Code services)
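
At 8,414 iterations in 152 minutes, the loop restarted about 55 times per minute. A pre-start check for an occupied port is one way to fail fast instead. The sketch below is illustrative only: `port_in_use` is a hypothetical helper, port 3000 is a placeholder, and the input is assumed to be the output of `ss -ltnH` (listening TCP sockets, numeric, no header).

```bash
# Succeed (exit 0) if the given TCP port appears as a listening local
# address in `ss -ltnH` output supplied on stdin; exit 1 otherwise.
port_in_use() {
  awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}
```

For example, `ss -ltnH | port_in_use 3000 && echo "port 3000 is already taken"` could run before a service is started.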

## Evidence Linking to Claude Code

### npm Core Dump
```
Location: /var/crash/_usr_bin_node.1002.crash
Size: 3.5 MB
Signal: SIGABRT (6)
Working Directory: /home/fanning/claude-shared-context
Timestamp: 2026-02-05 14:15
```

The crash occurred in the Claude Code shared context repository, suggesting Claude Code or a related npm process was active.
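
The header fields of the report can be read without unpacking the core. This is a minimal sketch: `crash_summary` is a hypothetical helper, and the field names are assumed from Apport's plain-text `Key: value` report layout.

```bash
# Print selected header fields from an Apport crash report.
# -a forces text mode in case the report contains binary-looking data.
crash_summary() {
  grep -aE '^(Date|ExecutablePath|ProcCmdline|ProcCwd|Signal): ' "$1"
}
```

Usage would be `crash_summary /var/crash/_usr_bin_node.1002.crash`. The `ProcCmdline` field, if present, should show which npm command was running, which would help confirm or rule out a link to Claude Code.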

### System State During Incidents
- Claude Code was running multiple sessions
- Heavy file I/O operations via Read/Write/Edit tools
- 60+ PM2 services competing for resources
- High I/O wait (75.5%) coinciding with Claude Code file operations

## Questions for Anthropic Team

### Subprocess Management
1. Does Claude Code properly clean up bash subprocesses spawned by the Bash tool?
2. Are there known issues with subprocess accumulation in long-running sessions or high-load environments?
3. What is the expected subprocess count for typical Claude Code usage?
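
Subprocess accumulation under a single session can be sampled by counting the descendants of its process ID. This is a sketch: `count_descendants` is a hypothetical helper, and it assumes `<pid> <ppid>` pairs on stdin, as printed by `ps -eo pid=,ppid=`.

```bash
# Count all descendants (children, grandchildren, ...) of a root PID.
# Input: "<pid> <ppid>" pairs on stdin. Usage: count_descendants <root_pid>
count_descendants() {
  awk -v root="$1" '
    { parent[$1] = $2 }
    END {
      n = 0
      for (pid in parent) {
        # Walk up the ancestor chain until we hit the root or run out.
        p = pid; hops = 0
        while ((p in parent) && hops++ < 100000) {
          p = parent[p]
          if (p == root) { n++; break }
        }
      }
      print n
    }'
}
```

For example, `ps -eo pid=,ppid= | count_descendants "$CLAUDE_PID"`, where `CLAUDE_PID` is a placeholder for the session's PID. Orphaned subprocesses are reparented away from the session and drop out of this count, so a low number does not by itself rule out a leak.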

### Memory & Resource Management
1. Are there known memory leaks in versions 2.1.30-2.1.32?
2. Does Claude Code have resource limits or throttling mechanisms for subprocess spawning?
3. How does Claude Code behave when system resources are constrained (high process count, high I/O wait)?

### Production Environment Best Practices
1. Should Claude Code be run with process/memory limits in production environments?
2. Are there recommended ulimits or cgroup settings?
3. How does Claude Code interact with PM2-managed services and Docker containers?
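
As a starting point for discussion, one way to cap process creation for a single command is a subshell `ulimit`. This is a sketch, not a recommended configuration: `run_limited` is a hypothetical wrapper, and the value 4096 in the usage note is a placeholder.

```bash
# Run a command with a lowered per-user process limit (RLIMIT_NPROC).
# The limit is set inside a subshell, so the calling shell is unaffected.
# Usage: run_limited <max_user_processes> <command> [args...]
run_limited() {
  local max=$1
  shift
  ( ulimit -u "$max" && exec "$@" )
}
```

Usage might look like `run_limited 4096 claude`, assuming the CLI is invoked as `claude`. Note that this limit counts every process owned by the user, not just the command's own children, so on a host where 60+ PM2 services share the account it needs generous headroom. A per-process-tree cap would need a cgroup-based mechanism instead, such as systemd's `TasksMax=` setting.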

## Related Known Issues

This report may be related to:
- Issue #20777: Memory leak on Linux causing 20GB+ RAM usage and system crash
- Issue #22042: Critical memory regression in 2.1.27 - OOM crash
- Issue #21378: Memory leak causes freeze after 20+ minutes (15GB RAM consumption)
- Issue #16135: Background process termination crashes Claude Code in Docker containers

However, this report adds a **new dimension**: process explosion (73K+ shell processes) in a production environment with concurrent services.

## Mitigation Deployed (Now Stable)

After the crashes, we implemented:

1. **Process limits** (per-user `nproc` ulimits, in `limits.conf` syntax):
   ```
   fanning soft nproc 8192
   fanning hard nproc 16384
   ```

2. **Docker resource limits**:
   ```bash
   --pids-limit=100 --memory=512m --restart=on-failure:3
   ```

3. **Automated monitoring** (1-minute cron):
   - Auto-remediation at 1,000 processes
   - Emergency kill at 10,000 processes
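
The two thresholds above can be expressed as a small classifier that a cron job calls each minute. This is a sketch only: `classify_proc_count` is a hypothetical name, and the remediation and kill actions themselves are left out.

```bash
# Map a process count to an action tier.
# Defaults are the thresholds listed above: 1,000 and 10,000.
# Usage: classify_proc_count <count> [warn_threshold] [critical_threshold]
classify_proc_count() {
  local count=$1 warn=${2:-1000} crit=${3:-10000}
  if [ "$count" -ge "$crit" ]; then
    echo emergency
  elif [ "$count" -ge "$warn" ]; then
    echo remediate
  else
    echo ok
  fi
}
```

A cron-driven script could call `classify_proc_count "$(ps -e -o pid= | wc -l)"` and branch on the result. Which processes are counted (all, or only shells) is an assumption here.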

**Result:** System stable for 2+ days with no issues.

## Reproduction Environment

We cannot reliably reproduce the crash, but the environment had the following characteristics:
- Ubuntu 22.04 EC2 instance
- 60+ concurrent Node.js PM2 services
- Docker containers running alongside Claude Code
- Long-running Claude Code sessions performing heavy file I/O
- High normal process count (800-1000 processes)

## Request

**Guidance on:**
1. Best practices for running Claude Code in production environments with high process counts
2. Whether Claude Code should have built-in resource limits
3. Subprocess cleanup mechanisms in Claude Code
4. Analysis of the npm core dump (can provide if helpful)

## Supporting Documentation

Complete forensic analysis available in our repository:
- Complete system crash forensics
- PM2 crash incident report
- Crash recovery procedures
- System monitoring scripts

Core dump and additional logs available upon request.

## Current Status

- **System:** Stable with mitigation measures
- **Claude Code:** Running version 2.1.32
- **Willing to test:** Available for diagnostic testing or additional data collection

---

**Filed by:** Production System Administrator
**Contact:** Available via GitHub
**Logs/Dumps Available:** Yes

