@fioan89 fioan89 commented Dec 2, 2025

Netflix reported an issue that only seems to reproduce on Linux (we've only tested Ubuntu so far).
I can’t reproduce it on macOS. First, here’s some context:

  1. Polling workspaces:
    Coder Toolbox polls the deployment every 5 seconds for workspace updates.
    These updates (new workspaces, deletions, status changes) are stored in a
    cached “environments” list (an oversimplified explanation). When a URI is executed,
    we reset the content of the list and run the login sequence, which re-initializes
    the HTTP poller and CLI using the new deployment URL and token. A new polling loop
    then begins populating the environments list again.

  2. Cache monitoring:
    Toolbox watches this cached list for changes—especially status changes, which determine
    when an SSH connection can be established.
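
As a rough illustration of the two points above, here is a minimal sketch of a cache that applies one poll result and reports what changed. All names (`Workspace`, `Change`, `EnvironmentCache`, `readyForSsh`) are hypothetical and not taken from the plugin or from Toolbox, and the 5-second timer loop is left out.

```kotlin
// Hypothetical model types; none of these names come from the Coder plugin or Toolbox.
enum class Status { STARTING, READY }

data class Workspace(val id: String, val status: Status)

sealed interface Change {
    data class Added(val workspace: Workspace) : Change
    data class Removed(val id: String) : Change
    data class StatusChanged(val workspace: Workspace, val from: Status) : Change
}

class EnvironmentCache {
    private val byId = LinkedHashMap<String, Workspace>()

    /** Applies one poll result to the cached list and reports what changed. */
    fun applyPoll(polled: List<Workspace>): List<Change> {
        val changes = mutableListOf<Change>()
        val polledIds = polled.map { it.id }.toSet()
        // Deletions: cached workspaces that are missing from the poll result.
        for (id in byId.keys.toList()) {
            if (id !in polledIds) {
                byId.remove(id)
                changes.add(Change.Removed(id))
            }
        }
        // Additions and status changes.
        for (workspace in polled) {
            val previous = byId.put(workspace.id, workspace)
            if (previous == null) {
                changes.add(Change.Added(workspace))
            } else if (previous.status != workspace.status) {
                changes.add(Change.StatusChanged(workspace, previous.status))
            }
        }
        return changes
    }

    /** Clears the cache, as happens when a URI is executed. */
    fun reset() {
        byId.clear()
    }
}

/** IDs of workspaces that a change set reports as READY, i.e. candidates for SSH. */
fun readyForSsh(changes: List<Change>): List<String> =
    changes.mapNotNull { change ->
        when (change) {
            is Change.Added -> change.workspace.takeIf { it.status == Status.READY }?.id
            is Change.StatusChanged -> change.workspace.takeIf { it.status == Status.READY }?.id
            is Change.Removed -> null
        }
    }
```

In this sketch the watcher acts only on reported changes, so anything that makes a poll look like "no change" also suppresses the SSH trigger.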

In Netflix’s case, they launched Toolbox, created a workspace from the Dashboard, and the
poller added it to the environments list. When the workspace switched from starting to ready,
they used a URI to connect to it. The URI reset the list, then the poller repopulated it. But
because the list had the same IDs (but new object references), Toolbox didn’t detect any changes.
As a result, it never triggered the SSH connection. This issue only reproduces on Linux, but it
might explain some of the sporadic macOS failures Atif mentioned in the past.
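
To make the "same IDs, new object references" point concrete, here is a small sketch. It is only an illustration of the symptom: how Toolbox actually compares entries has not been confirmed, and `Environment`, `sameIds`, and `sameInstances` are hypothetical names.

```kotlin
// Hypothetical environment type. It is a regular class (not a data class),
// so two instances with the same fields are still distinct objects.
class Environment(val id: String, val status: String)

/** Reference comparison: true only when both lists hold the very same objects. */
fun sameInstances(old: List<Environment>, new: List<Environment>): Boolean =
    old.size == new.size && old.indices.all { old[it] === new[it] }

/** ID comparison: true when both lists contain the same IDs in the same order. */
fun sameIds(old: List<Environment>, new: List<Environment>): Boolean =
    old.map { it.id } == new.map { it.id }

/** Models the URI flow: the list is reset, then the poller rebuilds it from scratch. */
fun repopulateAfterReset(old: List<Environment>): List<Environment> =
    old.map { Environment(it.id, it.status) }
```

After `repopulateAfterReset`, `sameIds` returns true while `sameInstances` returns false. A watcher that only reacts when the IDs differ would therefore have nothing to act on, even though every object in the list was replaced.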

I need to dig deeper into the Toolbox bytecode to determine whether this is a Toolbox bug, but
it does seem like Toolbox wasn’t designed to switch cleanly between multiple deployments and/or users.
The current Coder plugin behavior, always performing a full login sequence on every URI, is also suboptimal.
It only really makes sense in these scenarios:

  1. Toolbox started with deployment A, but the URI targets deployment B.
  2. Toolbox started with deployment A/user X, but the URI targets deployment A/user Y.

But this design is inefficient for the most common case: connecting via URI to a workspace on the
same deployment as the same user. While working on the fix, I realized that scenario (2) is not realistic:
on the same host machine, why would multiple users log into the same deployment via Toolbox? The whole
fix revolves around recreating the HTTP client and updating the CLI with the new token, instead of
going through the full authentication steps, when the URI deployment URL is the same as the
currently opened URL.
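
The decision described above could be sketched roughly as follows. This is not the code from this PR: `Session`, `UriAction`, `planUriHandling`, and `sameDeployment` are hypothetical names, and ignoring case and a trailing slash when comparing URLs is an assumption of this sketch.

```kotlin
import java.net.URI

// Hypothetical types; a sketch of the decision only, not the plugin's implementation.
data class Session(val deploymentUrl: URI, val token: String)

sealed interface UriAction {
    /** Different deployment (or nothing open yet): run the full login sequence. */
    data class FullLogin(val url: URI, val token: String) : UriAction

    /** Same deployment: recreate the HTTP client and update the CLI with the new token. */
    data class RefreshCredentials(val token: String) : UriAction
}

/** Assumption: URLs match when they are equal ignoring case and a trailing slash. */
fun sameDeployment(a: URI, b: URI): Boolean =
    a.toString().trimEnd('/').equals(b.toString().trimEnd('/'), ignoreCase = true)

/** Picks the lightweight path only when the URI targets the currently opened deployment. */
fun planUriHandling(current: Session?, targetUrl: URI, token: String): UriAction =
    if (current != null && sameDeployment(current.deploymentUrl, targetUrl)) {
        UriAction.RefreshCredentials(token)
    } else {
        UriAction.FullLogin(targetUrl, token)
    }
```

With this shape the environments list does not need to be reset in the same-deployment case, which is the case that was failing.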

Small improvement: we no longer emit environment and SSH connection
trigger events from new coroutines. StateFlow in Kotlin is a hot,
conflated flow that keeps only the most recent value. In other words,
we can update the value immediately, without launching a new coroutine
and without blocking the current thread.
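
A minimal before/after sketch of that change, using the `kotlinx.coroutines` `MutableStateFlow` API. The holder class and its member names are hypothetical, and the flow carries plain workspace IDs to keep the example short.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.launch

// Hypothetical holder; the class and member names are illustrative only.
class EnvironmentEvents(private val scope: CoroutineScope) {
    private val _environments = MutableStateFlow<List<String>>(emptyList())
    val environments: StateFlow<List<String>> = _environments.asStateFlow()

    /** Before: a coroutine is launched just to call the suspending emit(). */
    fun publishViaCoroutine(ids: List<String>) {
        scope.launch { _environments.emit(ids) }
    }

    /** After: assigning `value` is non-suspending and thread-safe, so no coroutine is needed. */
    fun publishDirectly(ids: List<String>) {
        _environments.value = ids
    }
}
```

Because `StateFlow` conflates values using `equals`, assigning a value equal to the current one does not notify collectors in either variant.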