Skip to content

Conversation

@fredericlefeurmou
Copy link
Contributor

@fredericlefeurmou fredericlefeurmou commented Dec 16, 2025

Closes #3079

Problem

When a remote MCP server becomes unreachable (e.g., Datadog MCP), the health check fails and the proxy is stopped. However, the workload status is not updated, causing thv list to report the server as "running" even though nothing is listening on the port.

This creates a "ghost status" problem where:

  • thv list shows the server as "running"
  • Cursor shows the MCP as enabled
  • lsof -i :PORT shows nothing listening
  • Cursor gets stuck on "Loading tools..."

Solution

Add a callback mechanism that notifies the runner when a health check fails, allowing it to update the workload status to unhealthy.

Changes

  • Add HealthCheckFailedCallback type to transport types
  • Add onHealthCheckFailed callback to TransparentProxy
  • Pass callback from HTTPTransport to TransparentProxy
  • Runner sets callback for remote servers to update status manager
  • Call callback in monitorHealth() before stopping proxy

Result

When a remote MCP server dies:

  1. Health check detects the failure
  2. Callback notifies the runner
  3. Runner updates status to unhealthy
  4. thv list shows accurate status

@fredericlefeurmou fredericlefeurmou force-pushed the fix/remote-workload-health-status-callback branch 2 times, most recently from f60f1fc to 59d6d41 Compare December 16, 2025 23:52
When a remote MCP server becomes unreachable, the health check fails
and the proxy is stopped. However, the workload status was not being
updated, causing thv list to report the server as running even though
nothing was listening on the port.

This fix adds a callback mechanism that notifies the runner when a
health check fails, allowing it to update the workload status to
unhealthy.

Closes stacklok#3079

Signed-off-by: Frederic Le Feurmou <fredericlefeurmou@gmail.com>
@fredericlefeurmou fredericlefeurmou force-pushed the fix/remote-workload-health-status-callback branch from 59d6d41 to 9f7dc5e Compare December 17, 2025 00:35
@amirejaz amirejaz requested a review from Copilot December 17, 2025 12:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a "ghost status" issue where remote MCP servers that become unreachable continue to show as "running" in thv list, even though the health check has failed and the proxy has stopped. The solution adds a callback mechanism that allows the proxy to notify the runner when health checks fail, enabling proper workload status updates.

Key Changes:

  • Added HealthCheckFailedCallback type to allow notification of health check failures
  • Integrated callback into the transparent proxy health monitoring system
  • Connected the callback to the status manager in the runner for remote servers

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/transport/types/transport.go Defines HealthCheckFailedCallback type and adds OnHealthCheckFailed field to Config struct
pkg/transport/proxy/transparent/transparent_proxy.go Adds onHealthCheckFailed callback parameter to constructor and invokes it when health check fails
pkg/transport/proxy/transparent/transparent_test.go Updates all test calls to NewTransparentProxy with nil callback parameter
pkg/transport/http.go Adds SetOnHealthCheckFailed setter method and passes callback to TransparentProxy
pkg/runner/runner.go Sets health check failure callback for remote servers to update workload status to unhealthy
cmd/thv/app/proxy.go Passes nil callback for local proxy command (not needed for local proxies)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

amirejaz and others added 2 commits December 17, 2025 14:23
Removed the unused `OnHealthCheckFailed HealthCheckFailedCallback` code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Dec 17, 2025
@codecov
Copy link

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 39.28571% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.80%. Comparing base (0a7bacf) to head (aea9a24).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 12 Missing ⚠️
pkg/transport/http.go 0.00% 3 Missing ⚠️
...g/transport/proxy/transparent/transparent_proxy.go 84.61% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3077      +/-   ##
==========================================
- Coverage   56.82%   56.80%   -0.03%     
==========================================
  Files         335      335              
  Lines       33474    33492      +18     
==========================================
+ Hits        19022    19025       +3     
- Misses      12868    12883      +15     
  Partials     1584     1584              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX merged commit 7c7db9a into stacklok:main Dec 17, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workload status not updated when health check fails for remote MCP servers

3 participants