System Crashes in Production: Process Explosion & Resource Exhaustion (Ubuntu 22.04, v2.1.30-2.1.32) #23484

@fanning

System Crashes in Production Environment: Process Explosion & Resource Exhaustion

Environment

Claude Code Version: 2.1.30 - 2.1.32
OS: Ubuntu 22.04 LTS (Jammy)
Kernel: Linux 6.8.0-1021-aws #23~22.04.1-Ubuntu SMP
Architecture: x86_64
Platform: AWS EC2 Instance (16-core, 60GB RAM)
Node.js: Multiple versions via PM2

System Context

This is a production EC2 instance running:

  • 60+ PM2-managed Node.js services
  • Docker containers (signal-cli-rest-api and others)
  • Claude Code CLI for development/maintenance tasks
  • Normal process count: ~864 processes (268 node, 213 shell processes)
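For anyone comparing against their own baseline, a per-command process count like the one above can be collected with a short script. This is a minimal sketch, not the tooling used on this host: it assumes a Linux `/proc` filesystem, and `count_processes_by_name` is a hypothetical helper name.

```python
import os
from collections import Counter


def count_processes_by_name(proc_root="/proc"):
    """Return a Counter mapping command name -> number of live processes.

    Assumes the Linux procfs layout, where each PID has a directory
    containing a `comm` file with the short command name.
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                counts[f.read().strip()] += 1
        except OSError:
            continue  # process exited between listdir() and open()
    return counts
```

Calling `count_processes_by_name()` and reading `counts["node"]` and `counts["sh"]` gives the node and shell figures; `sum(counts.values())` gives the total.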

Issue Summary

Multiple complete system crashes occurred while Claude Code was running in this high-load production environment. While we cannot definitively prove Claude Code caused the crashes, it was active during incidents involving:

  1. Process explosion - 73,217 shell processes at time of crash
  2. npm/Node.js crash (SIGABRT) in claude-shared-context directory
  3. Resource exhaustion - System load 117.47 (729% above normal)
  4. Complete system failure requiring hard reboot

Critical Incidents

Incident 1: Complete System Crash (2026-02-01)

Duration: 16 hours 20 minutes of cascading failures
Outcome: System unresponsive, hard reboot required

Timeline:

Time | Event | Details
00:00 | Docker OOM killer begins | 9,080+ kills in 4 minutes, Java processes exhausting memory
00:04 | npm process crashes | SIGABRT in /home/fanning/claude-shared-context, core dump available
01:26 | System degradation | Load: 117.47, I/O wait: 75.5%
15:10 | Shell process explosion begins | 15,134 processes
16:15 | Process count critical | 73,217 shell processes (+58,083 in 65 minutes)
16:20 | COMPLETE SYSTEM CRASH | Process table exhausted, system unresponsive
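The growth rate implied by the 15:10 and 16:15 entries can be worked out directly. The sketch below uses only the two counts and times from the timeline; `growth_per_minute` is a hypothetical helper name.

```python
from datetime import datetime

# Two observations from the timeline: (clock time, shell process count).
first = (datetime(2026, 2, 1, 15, 10), 15_134)
second = (datetime(2026, 2, 1, 16, 15), 73_217)


def growth_per_minute(a, b):
    """Average number of processes added per minute between two samples."""
    minutes = (b[0] - a[0]).total_seconds() / 60
    return (b[1] - a[1]) / minutes


rate = growth_per_minute(first, second)
print(f"{second[1] - first[1]:,} new processes, about {rate:.0f} per minute")
# prints "58,083 new processes, about 894 per minute"
```

That is roughly 15 new shell processes per second, sustained for over an hour.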

Root causes identified:

  • Shell process explosion (73,217 [sh] processes)
  • Docker container OOM loop (restart policy created infinite fork cascade)
  • npm crash (SIGABRT) with core dump in claude-shared-context directory

Incident 2: PM2 Crash Loop (2026-01-27)

Duration: 2 hours 32 minutes
Root Cause: Port conflict causing infinite restart loop
Impact: 8,414 crash iterations, all PM2 processes killed (including Claude Code services)
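A restart loop of this kind can be cut short if the service checks its port before starting and exits with a clear error instead of crashing repeatedly. The following is a minimal sketch of such a pre-start check, not what was deployed here; `port_is_free` is a hypothetical helper, and it assumes a TCP service listening on the given host.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False  # typically EADDRINUSE: another process holds the port
        return True
```

The check is inherently racy (another process can take the port between the check and the real bind), so it only helps with diagnosis and does not replace a restart cap in the process manager.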

Evidence Linking to Claude Code

npm Core Dump

Location: /var/crash/_usr_bin_node.1002.crash
Size: 3.5 MB
Signal: SIGABRT (6)
Working Directory: /home/fanning/claude-shared-context
Timestamp: 2026-02-05 14:15

The crash occurred in the Claude Code shared context repository, suggesting Claude Code or a related npm process was active.

System State During Incidents

  • Claude Code was running multiple sessions
  • Heavy file I/O operations via Read/Write/Edit tools
  • 60+ PM2 services competing for resources
  • High I/O wait (75.5%) coinciding with Claude Code file operations

Questions for Anthropic Team

Subprocess Management

  1. Does Claude Code properly clean up bash subprocesses spawned by the Bash tool?
  2. Are there known issues with subprocess accumulation in long-running sessions or high-load environments?
  3. What is the expected subprocess count for typical Claude Code usage?

Memory & Resource Management

  1. Are there known memory leaks in versions 2.1.30-2.1.32?
  2. Does Claude Code have resource limits or throttling mechanisms for subprocess spawning?
  3. How does Claude Code behave when system resources are constrained (high process count, high I/O wait)?

Production Environment Best Practices

  1. Should Claude Code be run with process/memory limits in production environments?
  2. Are there recommended ulimits or cgroup settings?
  3. How does Claude Code interact with PM2-managed services and Docker containers?

Related Known Issues

This report may be related to:

However, this report adds a new dimension: process explosion (73K+ shell processes) in a production environment with concurrent services.

Mitigation Deployed (Now Stable)

After the crashes, we implemented:

  1. Process limits (ulimits):

    fanning soft nproc 8192
    fanning hard nproc 16384
  2. Docker resource limits:

    --pids-limit=100 --memory=512m --restart=on-failure:3
  3. Automated monitoring (1-minute cron):

    • Auto-remediation at 1,000 processes
    • Emergency kill at 10,000 processes
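The two thresholds in item 3 amount to a simple three-way decision on each cron run. The sketch below shows only that decision logic; the actual monitoring script is not reproduced here. It assumes the thresholds apply to the shell process count, and the action names are hypothetical.

```python
# Thresholds from the mitigation list; which count they apply to
# (assumed here: shell processes) is an assumption of this sketch.
REMEDIATE_AT = 1_000
EMERGENCY_AT = 10_000


def classify(process_count):
    """Map a process count to the action the monitor would take."""
    if process_count >= EMERGENCY_AT:
        return "emergency-kill"
    if process_count >= REMEDIATE_AT:
        return "auto-remediate"
    return "ok"


print(classify(213))     # prints "ok" (normal shell process count)
print(classify(1_500))   # prints "auto-remediate"
print(classify(15_134))  # prints "emergency-kill" (count at 15:10 in Incident 1)
```

At the roughly 894 processes per minute seen in Incident 1, a 1-minute interval would have caught the explosion shortly after it crossed the first threshold.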

Result: System stable for 2+ days with no issues.
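To confirm that the nproc limits are in effect for a given session, the limits can be read from inside a process started in that session. This is a minimal sketch assuming Linux and Python's standard `resource` module; `nproc_limits` and `describe` are hypothetical helper names.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = nproc_limits()
print(f"nproc soft={describe(soft)} hard={describe(hard)}")
```

Entries in limits.conf are applied by pam_limits when a login session starts, so processes launched by systemd units or other non-PAM paths may not inherit them. Running this check from within each launch path shows which ones are actually covered.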

Reproduction Environment

We cannot reliably reproduce the crash, but the environment has the following characteristics:

  • Ubuntu 22.04 EC2 instance
  • 60+ concurrent Node.js PM2 services
  • Docker containers running alongside Claude Code
  • Long-running Claude Code sessions performing heavy file I/O
  • High normal process count (800-1000 processes)

Request

Guidance on:

  1. Best practices for running Claude Code in production environments with high process counts
  2. Whether Claude Code should have built-in resource limits
  3. Subprocess cleanup mechanisms in Claude Code
  4. Analysis of the npm core dump (can provide if helpful)

Supporting Documentation

Complete forensic analysis available in our repository:

  • Complete system crash forensics
  • PM2 crash incident report
  • Crash recovery procedures
  • System monitoring scripts

Core dump and additional logs available upon request.

Current Status

  • System: Stable with mitigation measures
  • Claude Code: Running version 2.1.32
  • Willing to test: Available for diagnostic testing or additional data collection

Filed by: Production System Administrator
Contact: Available via GitHub
Logs/Dumps Available: Yes

Labels: duplicate (this issue or pull request already exists)
