Skip to content

Conversation

@carlos-gn
Copy link
Contributor

@carlos-gn carlos-gn commented Dec 12, 2025

Summary

Fixes #2440

When -egress or -ingress proxy containers are restarted via Docker (e.g., docker restart <workload>-egress), Squid enters an infinite crash loop because it finds a stale PID file from the previous instance.

Root Cause

  1. Squid writes a PID file to /tmp/squid.pid on startup
  2. On docker restart, Docker sends SIGTERM and Squid begins a 30-second graceful shutdown
  3. Docker's default restart timeout is 10 seconds - it sends SIGKILL before Squid can clean up
  4. Container restarts with the stale PID file still present
  5. New Squid instance refuses to start, thinking another instance is running

This doesn't affect thv restart because ToolHive uses a 30-second stop timeout giving Squid enough time to gracefully shutdown and clean up the PID file.

Fix

Changed pid_filename /tmp/squid.pid to pid_filename none in the generated Squid config. Without a PID file, there's nothing to go stale.

Note: The PID file was only used by Squid internally to prevent multiple instances from running. Since Docker already ensures only one container process runs, the PID file is redundant. Nothing in ToolHive depends on it - the container healthcheck uses ps aux | grep squid, not the PID file.

References

Test plan

  • Run thv run <server> --isolate-network
  • Run docker restart <server>-egress
  • Verify docker logs <server>-egress shows clean startup (no FATAL errors)
  • Verify container survives multiple restarts
  • Run task test to ensure no regressions

@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Dec 12, 2025
@codecov
Copy link

codecov bot commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.82%. Comparing base (7bee0c8) to head (a90f25f).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3022   +/-   ##
=======================================
  Coverage   56.82%   56.82%           
=======================================
  Files         335      335           
  Lines       33474    33474           
=======================================
  Hits        19022    19022           
  Misses      12868    12868           
  Partials     1584     1584           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

  Squid proxy containers enter an infinite crash loop when restarted via
  Docker directly because the PID file persists across restarts.

  Change pid_filename from /tmp/squid.pid to none so Squid doesn't create
  a PID file that can become stale on abrupt container restarts.

  The PID file was only used by Squid internally to prevent multiple
  instances. Since Docker ensures only one container runs, it's redundant.
  The container healthcheck uses ps aux | grep squid, not the PID file.

  Fixes stacklok#2440

Signed-off-by: carlos <21148423+carlos-gn@users.noreply.github.com>
@carlos-gn carlos-gn force-pushed the fix/squid-error-loop branch from 5f23ebf to 377828a Compare December 12, 2025 19:59
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 12, 2025
@carlos-gn carlos-gn marked this pull request as ready for review December 12, 2025 20:03
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 12, 2025
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 13, 2025
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 16, 2025
@JAORMX
Copy link
Collaborator

JAORMX commented Dec 16, 2025

@claude can you review this against squid security best practices?

@claude
Copy link
Contributor

claude bot commented Dec 16, 2025

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 17, 2025
@JAORMX JAORMX merged commit 0a7bacf into stacklok:main Dec 17, 2025
30 checks passed
@carlos-gn carlos-gn deleted the fix/squid-error-loop branch December 17, 2025 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ingress and egress containers go into error loop if restarted

3 participants