devices: add support for first_available device prioritization #27391

Draft
chrisboulton wants to merge 1 commit into hashicorp:main from chrisboulton:device-first-available

Conversation

@chrisboulton

(Note: this is in a bit of a draft state right now. That said, I'd love feedback from HashiCorp on the chances of having something like this incorporated and how best to align it with y'all's design goals -- a rough first pass on the design would be amazing. Don't spend a lot of time on the changes themselves until we're happy with the design and I've done more of my own homework.)

This PR introduces a new first_available block for device requests in Nomad job specifications. This enables more flexible device scheduling by allowing you to specify a prioritized list of device reservation sizes, where the scheduler attempts each option in order and selects the first one that can be fulfilled.

This is particularly useful in heterogeneous clusters with varying device types (such as a mix of different GPU models), where you want to prioritize one type of GPU over another, but the number of devices (GPUs) needed to carry the workload differs between types.

A concrete example: I've got a workload which fits on a single 96GB GH200, but if I don't have that available I can also carry it on two H100s with 80GB of memory each. I want to be able to do this in one job, and have Nomad figure out what the resource reservation should be. Today, this needs multiple jobs or multiple task groups (sketched below), because device only accepts a single reservation size (count).
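
For reference (not part of this PR), a rough sketch of what that workaround looks like today, using two task groups with only one active at a time -- the job/group names, image, and group counts here are purely illustrative:

job "inference" {
  # one task group per acceptable GPU shape
  group "gh200" {
    task "serve" {
      driver = "docker"

      config {
        image = "example/inference:latest" # illustrative
      }

      resources {
        device "nvidia/gpu" {
          count = 1
          constraint {
            attribute = "${device.attr.model}"
            value = "GH200"
          }
        }
      }
    }
  }

  # second group kept at count = 0 and scaled up (or run as a separate job)
  # when no GH200 capacity is available
  group "h100" {
    count = 0

    task "serve" {
      driver = "docker"

      config {
        image = "example/inference:latest" # illustrative
      }

      resources {
        device "nvidia/gpu" {
          count = 2
          constraint {
            attribute = "${device.attr.model}"
            value = "H100"
          }
        }
      }
    }
  }
}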

To support this, the following is introduced:

device "nvidia/gpu" {
  # i would prefer this workload to land on a GH200 and if it does, it needs one GPU
  first_available {
    count = 1
    constraint {
      attribute = "${device.attr.model}"
      value = "GH200"
    }
  }
  # otherwise, i'll take a pair of H100s
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value = "H100"
    }
  }
}

With a job configuration like this, Nomad will first try to schedule the workload on a single GH200. If that's not available, it will then try to schedule it on two H100s. If neither option can be satisfied, the job will fail to place.

count, affinity, and constraint without first_available are supported as before (a quick reminder of that existing syntax is sketched below).
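
For context, the existing device syntax this PR leaves untouched looks roughly like this -- the model name and memory threshold are illustrative values modelled on the Nomad device block documentation:

device "nvidia/gpu" {
  count = 2

  # hard requirement: only consider this model (illustrative value)
  constraint {
    attribute = "${device.attr.model}"
    value = "H100"
  }

  # soft preference: prefer devices with more memory (illustrative threshold)
  affinity {
    attribute = "${device.attr.memory}"
    operator = ">="
    value = "40 GiB"
    weight = 50
  }
}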

Implementation Notes

I'm open to feedback on the implementation of this -- this was just the first take that came to mind.

first_available is an ordered list of options where the first match wins. Inside first_available, constraint is supported, which lets you perform additional filtering per option.

first_available and count are mutually exclusive at the device level.
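
Put differently, under that rule something like the following would presumably be rejected at validation time (exact error wording is an implementation detail):

device "nvidia/gpu" {
  # invalid: a device-level count ...
  count = 1

  # ... combined with a first_available block
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value = "H100"
    }
  }
}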

Alternative Approach

Would it make sense to have a syntax like this instead, where the constraints are specified inline instead of in their own constraint block?

device "nvidia/gpu" {
  first_available {
    count = 1
    attribute = "${device.attr.model}"
    value = "GH200"
  }
}

Testing Notes

I've gone in and added a bunch of E2E tests for this as a first pass - these cover the existing device scheduling functionality and the new first_available functionality.

I've only run these tests locally - I've not used the Terraform E2E test suite, and am mostly certain (given I let Claude do the work almost exclusively for the tests) that at least the TF test setup needs some work... but otherwise, the tests themselves are passing and seem to do the right thing.

AI Use

I noticed a new callout for this in the contributing guidelines, so to be upfront about it:

  • A bunch of the initial scaffolding I implemented myself. For work around the scheduler (especially the feasibility checks), I asked for help from Opus 4.5 w/ CC -- mostly because I've not navigated that part of Nomad much before. On review, it looks like the right things are happening - these are design decisions I'd probably make myself... but I still need to do a more exhaustive review before I'm willing to say I'm happy with the approach.
  • CC was used to generate the bulk of the tests, especially in the E2E suites -- I'm pretty happy with these.

@jrasell
Member

jrasell commented Jan 22, 2026

Hi @chrisboulton and thanks for raising this PR, adding all the detail, and clearly having read our documentation. Given the size of the addition I think a good first step would be to open up an issue where we can better discuss the use cases and design specifics. I'll be able to raise this internally and get the right people involved to try and move it forward. That being said, a quick glance by a few of us indicates we do like this idea, so we'd be keen to see it progress.

@chrisboulton
Author

Hey @jrasell let's do it! 🚀 #27402
