System Crashes in Production Environment: Process Explosion & Resource Exhaustion
Environment
Claude Code Version: 2.1.30 - 2.1.32
OS: Ubuntu 22.04 LTS (Jammy)
Kernel: Linux 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP
Architecture: x86_64
Platform: AWS EC2 Instance (16-core, 60GB RAM)
Node.js: Multiple versions via PM2
System Context
This is a production EC2 instance running:
- 60+ PM2-managed Node.js services
- Docker containers (signal-cli-rest-api and others)
- Claude Code CLI for development/maintenance tasks
- Normal process count: ~864 processes (268 node, 213 shell processes)
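A per-name baseline like the one above can be collected on Linux by reading `/proc`. The following is an illustrative sketch, not the tooling used for this report; the function name `count_processes_by_name` and its `proc_root` parameter are assumptions introduced here.

```python
import os
from collections import Counter


def count_processes_by_name(proc_root="/proc"):
    """Count running processes grouped by command name (Linux only).

    Hypothetical helper: reads <proc_root>/<pid>/comm for every numeric
    directory entry. Processes that exit mid-scan are skipped.
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory (e.g. "self", "meminfo")
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                counts[f.read().strip()] += 1
        except OSError:
            continue  # process vanished or is unreadable
    return counts
```

Calling `count_processes_by_name().most_common(5)` lists the five most numerous command names, which makes a sudden growth in `sh` or `node` entries easy to spot against the baseline.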
Issue Summary
Multiple complete system crashes occurred while Claude Code was running in this high-load production environment. While we cannot definitively prove Claude Code caused the crashes, it was active during incidents involving:
- Process explosion - 73,217 shell processes at time of crash
- npm/Node.js crash (SIGABRT) in claude-shared-context directory
- Resource exhaustion - System load 117.47 (729% above normal)
- Complete system failure requiring hard reboot
Critical Incidents
Incident 1: Complete System Crash (2026-02-01)
Duration: 16 hours 20 minutes of cascading failures
Outcome: System unresponsive, hard reboot required
Timeline:
| Time | Event | Details |
|---|---|---|
| 00:00 | Docker OOM killer begins | 9,080+ kills in 4 minutes, Java processes exhausting memory |
| 00:04 | npm process crashes | SIGABRT in /home/fanning/claude-shared-context, core dump available |
| 01:26 | System degradation | Load: 117.47, I/O wait: 75.5% |
| 15:10 | Shell process explosion begins | 15,134 processes |
| 16:15 | Process count critical | 73,217 shell processes (+58,083 in 65 minutes) |
| 16:20 | COMPLETE SYSTEM CRASH | Process table exhausted, system unresponsive |
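To put the 15:10 and 16:15 rows in perspective, the average growth rate between them can be worked out directly from the table. This is a small arithmetic sketch; the helper name `growth_per_minute` is an assumption, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def growth_per_minute(start_time, start_count, end_time, end_count):
    """Average number of new processes per minute between two samples.

    Times are "HH:MM" strings on the same day (an assumption of this sketch).
    """
    fmt = "%H:%M"
    elapsed = datetime.strptime(end_time, fmt) - datetime.strptime(start_time, fmt)
    minutes = elapsed.total_seconds() / 60
    return (end_count - start_count) / minutes


# Values taken from the timeline rows at 15:10 and 16:15.
rate = growth_per_minute("15:10", 15_134, "16:15", 73_217)
print(f"{rate:.0f} new shell processes per minute")  # prints "894 new shell processes per minute"
```

That is roughly 15 new shell processes every second, sustained for over an hour.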
Root causes identified:
- Shell process explosion (73,217 `[sh]` processes)
- Docker container OOM loop (restart policy created infinite fork cascade)
- npm crash (SIGABRT) with core dump in claude-shared-context directory
Incident 2: PM2 Crash Loop (2026-01-27)
Duration: 2 hours 32 minutes
Root Cause: Port conflict causing infinite restart loop
Impact: 8,414 crash iterations, all PM2 processes killed (including Claude Code services)
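One way to keep a port conflict from turning into a restart loop is to check the port before starting the service and stop if it is taken. The sketch below is only an illustration of that idea, not something deployed here; `port_is_free` is a hypothetical helper, and it assumes the service binds a TCP port on the given host.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical pre-start check. The result can go stale immediately,
    so it reduces restart loops rather than eliminating them.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False  # address already in use, or not permitted
        return True
```

A launcher script could call `port_is_free(3000)` (the port number is an example) and exit with a clear error rather than letting the process manager retry indefinitely.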
Evidence Linking to Claude Code
npm Core Dump
Location: /var/crash/_usr_bin_node.1002.crash
Size: 3.5 MB
Signal: SIGABRT (6)
Working Directory: /home/fanning/claude-shared-context
Timestamp: 2026-02-05 14:15
The crash occurred in the Claude Code shared context repository, suggesting Claude Code or a related npm process was active.
System State During Incidents
- Claude Code was running multiple sessions
- Heavy file I/O operations via Read/Write/Edit tools
- 60+ PM2 services competing for resources
- High I/O wait (75.5%) coinciding with Claude Code file operations
Questions for Anthropic Team
Subprocess Management
- Does Claude Code properly clean up bash subprocesses spawned by the Bash tool?
- Are there known issues with subprocess accumulation in long-running sessions or high-load environments?
- What is the expected subprocess count for typical Claude Code usage?
Memory & Resource Management
- Are there known memory leaks in versions 2.1.30-2.1.32?
- Does Claude Code have resource limits or throttling mechanisms for subprocess spawning?
- How does Claude Code behave when system resources are constrained (high process count, high I/O wait)?
Production Environment Best Practices
- Should Claude Code be run with process/memory limits in production environments?
- Are there recommended ulimits or cgroup settings?
- How does Claude Code interact with PM2-managed services and Docker containers?
Related Known Issues
This report may be related to:
- Issue #20777: Memory leak on Linux causing 20GB+ RAM usage and system crash
- Issue #22042: Critical memory regression in 2.1.27 - OOM crash
- Issue #21378: Memory leak causes freeze after 20+ minutes (15GB RAM consumption)
- Issue #16135: Background process termination crashes Claude Code in Docker containers
However, this report adds a new dimension: process explosion (73K+ shell processes) in a production environment with concurrent services.
Mitigation Deployed (Now Stable)
After the crashes, we implemented:
- Process limits (ulimits):
  `fanning soft nproc 8192`
  `fanning hard nproc 16384`
- Docker resource limits:
  `--pids-limit=100 --memory=512m --restart=on-failure:3`
- Automated monitoring (1-minute cron):
  - Auto-remediation at 1,000 processes
  - Emergency kill at 10,000 processes
Result: System stable for 2+ days with no issues.
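The two monitoring thresholds and the `nproc` limits can be expressed as a small check for a cron job to run. The following is an illustrative sketch rather than the deployed monitoring script. The names `classify` and `nproc_limits` are assumptions, as is treating each threshold as inclusive. Which processes are counted is left to the caller.

```python
import resource

# Thresholds from the mitigation list; inclusive comparison is an assumption.
REMEDIATE_AT = 1_000
EMERGENCY_AT = 10_000


def classify(process_count):
    """Map a process count to the action the monitor would take."""
    if process_count >= EMERGENCY_AT:
        return "emergency"
    if process_count >= REMEDIATE_AT:
        return "remediate"
    return "ok"


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for the current process.

    Useful for confirming that the limits.conf entries took effect in a
    new login session. An unlimited value is reported as resource.RLIM_INFINITY.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)
```

With these thresholds, the 15,134 processes seen at 15:10 in Incident 1 would already have been classified as `"emergency"`, more than an hour before the process table was exhausted.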
Reproduction Environment
We cannot reliably reproduce the crash, but the environment had the following characteristics:
- Ubuntu 22.04 EC2 instance
- 60+ concurrent Node.js PM2 services
- Docker containers running alongside Claude Code
- Long-running Claude Code sessions performing heavy file I/O
- High normal process count (800-1000 processes)
Request
Guidance on:
- Best practices for running Claude Code in production environments with high process counts
- Whether Claude Code should have built-in resource limits
- Subprocess cleanup mechanisms in Claude Code
- Analysis of the npm core dump (can provide if helpful)
Supporting Documentation
Complete forensic analysis available in our repository:
- Complete system crash forensics
- PM2 crash incident report
- Crash recovery procedures
- System monitoring scripts
Core dump and additional logs available upon request.
Current Status
- System: Stable with mitigation measures
- Claude Code: Running version 2.1.32
- Willing to test: Available for diagnostic testing or additional data collection
Filed by: Production System Administrator
Contact: Available via GitHub
Logs/Dumps Available: Yes