system deployments: fix scheduler write skew from stale snapshot#27497
Open
fb1fc22 to 70fdf20
tgross commented Feb 12, 2026
70fdf20 to 6e34f8f
tgross added a commit that referenced this pull request Feb 13, 2026
This problem isn't especially unique to Nomad, as virtually every web application in the wild displays write skew. But pointing it out here in our somewhat unique state store is helpful for onboarding and a good reminder. Ref: #27497
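As a rough illustration of the stale-snapshot overwrite that commit message refers to, here is a minimal Go sketch. The `deploymentState` and `store` types and the `naiveUpsert` function are hypothetical simplifications for this example only; they are not Nomad's actual state store API.

```go
package main

import "fmt"

// deploymentState is a simplified, hypothetical stand-in for a deployment
// record. It is not Nomad's real struct.
type deploymentState struct {
	PlacedAllocs  int // written by the scheduler's plan
	HealthyAllocs int // written by client health updates
}

// store is a toy state store holding a single record by value.
type store struct{ state deploymentState }

// snapshot returns a copy of the current state, as a scheduler worker
// would see it at the start of an evaluation.
func (s *store) snapshot() deploymentState { return s.state }

// naiveUpsert replaces the whole record with whatever the writer computed.
func (s *store) naiveUpsert(d deploymentState) { s.state = d }

// staleOverwrite replays the interleaving: snapshot, client update, then a
// plan upsert built from the stale snapshot.
func staleOverwrite() deploymentState {
	s := &store{state: deploymentState{PlacedAllocs: 1, HealthyAllocs: 0}}

	snap := s.snapshot()      // scheduler worker takes its snapshot
	s.state.HealthyAllocs = 1 // client marks the allocation healthy
	snap.PlacedAllocs = 2     // scheduler builds its plan from the stale copy
	s.naiveUpsert(snap)       // plan upsert clobbers the client's update

	return s.state
}

func main() {
	// The client's health update is gone after the plan is applied.
	fmt.Printf("%+v\n", staleOverwrite()) // {PlacedAllocs:2 HealthyAllocs:0}
}
```

Because the writer replaces the whole record, any field that changed after the snapshot was taken is silently reverted, even though the scheduler never meant to touch it.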
tehut reviewed Feb 13, 2026
During a system deployment, if a client update marking an allocation healthy happens after a scheduler worker has taken a snapshot, the plan will overwrite the `HealthyAllocs` count on the deployment. The client never writes another health update, so the deployment gets stuck even though all allocations are marked healthy. Likewise, if a scheduler worker takes a snapshot and the deployment is promoted before the plan is submitted, the number of canaries will be mutated and the promotion flag potentially removed. Break the `DeploymentState` into "client owned" vs "server owned" fields, and have the update that comes from upserting a plan treat the client owned fields as authoritative for that data. Enforce in the FSM that flipping `Promoted` to true for the deployment state is a one-way operation. Fixes: #27382
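The approach described above can be sketched roughly as follows. This is not the code from this PR: the struct is a trimmed, hypothetical version of `DeploymentState`, the split of fields into client owned and server owned is an assumption made for illustration, and `mergePlanState` is an invented helper name. Canary counts are left out because the description does not say how they are reconciled.

```go
package main

import "fmt"

// DeploymentState is a trimmed, hypothetical version of the real struct.
// Which fields belong to which owner is an assumption for this sketch.
type DeploymentState struct {
	// Server owned: computed by the scheduler and written via the plan.
	Promoted     bool
	DesiredTotal int
	PlacedAllocs int

	// Client owned: written by allocation health updates from clients.
	HealthyAllocs   int
	UnhealthyAllocs int
}

// mergePlanState (hypothetical helper) combines the state currently in the
// store with the state carried by a plan that may have been computed from a
// stale snapshot.
func mergePlanState(existing, planned DeploymentState) DeploymentState {
	merged := planned

	// Client owned fields: the store is authoritative, so the plan's
	// possibly stale counts are discarded.
	merged.HealthyAllocs = existing.HealthyAllocs
	merged.UnhealthyAllocs = existing.UnhealthyAllocs

	// Promotion is one-way: once true in the store, a plan cannot clear it.
	merged.Promoted = existing.Promoted || planned.Promoted

	return merged
}

func main() {
	// State in the store after a client health update and a promotion.
	existing := DeploymentState{Promoted: true, DesiredTotal: 3, PlacedAllocs: 3, HealthyAllocs: 3}
	// Plan computed from a snapshot taken before both of those writes.
	planned := DeploymentState{Promoted: false, DesiredTotal: 3, PlacedAllocs: 3, HealthyAllocs: 2}

	fmt.Printf("%+v\n", mergePlanState(existing, planned))
}
```

With this merge, the plan still controls the fields the scheduler computes, while the health counts and the promotion flag survive a plan built from an older snapshot.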
6e34f8f to a486dd0
During a system deployment, if a client update marking an allocation healthy happens after a scheduler worker has taken a snapshot, the plan will overwrite the `HealthyAllocs` count on the deployment. The client never writes another health update, so the deployment gets stuck even though all allocations are marked healthy.

Likewise, if a scheduler worker takes a snapshot and the deployment is promoted before the plan is submitted, the number of canaries will be mutated and the promotion flag potentially removed.

Break the `DeploymentState` into "client owned" vs "server owned" fields, and have the update that comes from upserting a plan treat the client owned fields as authoritative for that data. Enforce in the FSM that flipping `Promoted` to true for the deployment state is a one-way operation.

Fixes: #27382
Ref: https://hashicorp.atlassian.net/browse/NMD-1197