Skip to content

Allow Agent Reload with API - /v1/agent/reload#27106

Open
benjamin-lykins wants to merge 21 commits intohashicorp:mainfrom
benjamin-lykins:feat/config-reload-api
Open

Allow Agent Reload with API - /v1/agent/reload#27106
benjamin-lykins wants to merge 21 commits intohashicorp:mainfrom
benjamin-lykins:feat/config-reload-api

Conversation

@benjamin-lykins
Copy link
Contributor

@benjamin-lykins benjamin-lykins commented Nov 14, 2025

Description

Provide an API endpoint to reload agent config without elevated permissions on the infrastructure running Nomad.

  • Provides ability to reload config remotely.
  • Leverage Nomad Token + ACL to reload.

Testing

Reloading log level without ACL

curl -k  -X PUT \
  https://localhost:4646/v1/agent/reload
{"message":"agent configuration reloaded"}

Attempting without token (ACL Enabled):

curl -k  -X PUT \  
https://localhost:4646/v1/agent/reload
Permission denied

Attempting with token (ACL Enabled; Agent = Write):

curl -k -X PUT \                                                    
  --header "X-Nomad-Token: 79d77f6a-e168-3bbb-8fa7-e668b2b09346" \
  https://localhost:4646/v1/agent/reload 
{"message":"agent configuration reloaded"}

Attempting with token (ACL Enabled; Agent = Read):

curl -k -X PUT \
  --header "X-Nomad-Token: 2cde0074-d0ac-53e1-a6dc-3f416a8e1737" \
  https://localhost:4646/v1/agent/reload
Permission denied

Invalid Config File:

curl -k -X PUT \
  --header "X-Nomad-Token: 79d77f6a-e168-3bbb-8fa7-e668b2b09346" \
  https://localhost:4646/v1/agent/reload
Error loading /Users/benjamin.lykins/git/nomad/nomad/bin/config/general.hcl: failed to decode HCL file /Users/benjamin.lykins/git/nomad/nomad/bin/config/general.hcl: At 3:16: literal not terminated

Links

Fixes: #24048

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@benjamin-lykins benjamin-lykins requested review from a team as code owners November 14, 2025 19:19
@tgross tgross self-requested a review November 17, 2025 15:09
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjamin-lykins there's a failing test not touched here but clearly related to this work. Let's get that resolved before we review:

=== FAIL: command/agent TestServer_Reload_TLS_Certificate (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x30fefa3]

goroutine 87962 [running]:
testing.tRunner.func1.2({0x3898300, 0x69a5e70})
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1875 +0x35b
panic({0x3898300?, 0x69a5e70?})
	/opt/hostedtoolcache/go/1.25.3/x64/src/runtime/panic.go:783 +0x132
github.com/hashicorp/nomad/command/agent.(*Agent).Reload(0xc00c83c700, 0xc00d476d08)
	/home/runner/work/nomad/nomad/command/agent/agent.go:1533 +0x63
github.com/hashicorp/nomad/command/agent.TestServer_Reload_TLS_Certificate(0xc00a9f8540)
	/home/runner/work/nomad/nomad/command/agent/agent_test.go:1197 +0x1d7
testing.tRunner(0xc00a9f8540, 0x42c8720)
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1997 +0x465

DONE 5542 tests, 12 skipped, 4 failures in 182.433s

@benjamin-lykins
Copy link
Contributor Author

@benjamin-lykins there's a failing test not touched here but clearly related to this work. Let's get that resolved before we review:

=== FAIL: command/agent TestServer_Reload_TLS_Certificate (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x30fefa3]

goroutine 87962 [running]:
testing.tRunner.func1.2({0x3898300, 0x69a5e70})
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1875 +0x35b
panic({0x3898300?, 0x69a5e70?})
	/opt/hostedtoolcache/go/1.25.3/x64/src/runtime/panic.go:783 +0x132
github.com/hashicorp/nomad/command/agent.(*Agent).Reload(0xc00c83c700, 0xc00d476d08)
	/home/runner/work/nomad/nomad/command/agent/agent.go:1533 +0x63
github.com/hashicorp/nomad/command/agent.TestServer_Reload_TLS_Certificate(0xc00a9f8540)
	/home/runner/work/nomad/nomad/command/agent/agent_test.go:1197 +0x1d7
testing.tRunner(0xc00a9f8540, 0x42c8720)
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.3/x64/src/testing/testing.go:1997 +0x465

DONE 5542 tests, 12 skipped, 4 failures in 182.433s

I think I know what is causing this. I'll get it fixed.

@benjamin-lykins
Copy link
Contributor Author

@tgross fixed the test, which then uncovered another broken test after another I changed the return response. both of those were cleaned up and looks happier now here.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start @benjamin-lykins!

In addition to the comments I've left, make sure you've added a changelog entry file. You can do that via make cl. Once this is getting closer to being ready, we'll need a docs PR to go with this, which is now going to be over in the web-unified-docs repo.


newConf := DefaultConfig()

for _, path := range currConf.Files {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the -config flag in the command line arguments for this agent points to a directory, this Files field will contain only the files that were in that directory at the time we started the agent. If you add a new file, it won't be present on reload. We may need to figure out a way to get the original -config flag attached to the config for this to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this one, wasn't thinking of it or realized when I originally ran through it. Made some modifications to save the -config flag values and tested locally without issue. Tested if I passed over two -config as well and worked as expected. I was also able to reuse the logic in LoadConfig to check each path, which seems to work out well. Lastly, I tested adding files with bad configuration and got a decent response back, but I could also just send a canned message back

curl -k  -X PUT \
  https://localhost:4646/v1/agent/reload
Error loading /Users/benjamin.lykins/git/nomad/nomad/bin/config1/badfile.hcl: failed to decode HCL file /Users/benjamin.lykins/git/nomad/nomad/bin/config1/badfile.hcl: At 1:6: key 'asdf' expected start of object ('{') or assignment ('=')


newConf.Files = append([]string(nil), currConf.Files...)

if err := s.agent.Reload(newConf); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take a look at the signal handler for reloading, you'll see that this step only reloads the agent configuration and won't push any changes to the server, client, or HTTP server configuration. We should at least have parity with the signal handler here, which calls (*Command) handleReload. We may need to figure out a way to extract that function so that it's reachable here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be done, and I'll start looking ahead at how to get this going. I saw this when I was mucking around with TLS rotations with server/client, and it was only updating the more general agent config.

@benjamin-lykins
Copy link
Contributor Author

@tgross

  • Testing : Made the updates with the testing, I realized I did see that in the contributor guides. I just realized I have some lines I need to clean up from when I was moving them from testify to test.
  • -config : Updated to store the config values which can be passed when starting up an agent. Tested on my end without issue, and made additional notes under your comment.

@tgross tgross added theme/api HTTP API and SDK issues type/enhancement labels Dec 11, 2025
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the config paths, if a path to a directory was used (not just the file names). If a new file is added, it gets picked up.

Comment on lines +1628 to +1697
// FullReload reloads the agent configuration including server, client, and HTTP server.
func (a *Agent) FullReload(newConfig *Config) error {
shouldReloadAgent, shouldReloadHTTP := a.ShouldReload(newConfig)
if shouldReloadAgent {
a.logger.Debug("starting reload of agent config")
if err := a.Reload(newConfig); err != nil {
a.logger.Error("failed to reload the config", "error", err)
return err
}
}

if s := a.Server(); s != nil {
if newConfig.Server == nil {
a.logger.Debug("skipping server reload - no server config in new configuration")
} else {
a.logger.Debug("starting reload of server config")
sconf, err := convertServerConfig(newConfig)
if err != nil {
a.logger.Error("failed to convert server config", "error", err)
return err
}

// Finalize the config to get the agent objects injected in
a.finalizeServerConfig(sconf)

// Reload the config
if err := s.Reload(sconf); err != nil {
a.logger.Error("reloading server config failed", "error", err)
return fmt.Errorf("reloading server config failed: %w", err)
}
}
}

if client := a.Client(); client != nil {
if newConfig.Client == nil {
a.logger.Debug("skipping client reload - no client config in new configuration")
} else {
a.logger.Debug("starting reload of client config")
clientConfig, err := convertClientConfig(newConfig)
if err != nil {
a.logger.Error("failed to convert client config", "error", err)
return err
}

// Finalize the config to get the agent objects injected in
if err := a.finalizeClientConfig(clientConfig); err != nil {
a.logger.Error("failed to finalize client config", "error", err)
return err
}

if err := client.Reload(clientConfig); err != nil {
a.logger.Error("reloading client config failed", "error", err)
return fmt.Errorf("reloading client config failed: %w", err)
}
}
}

// reload HTTP server after we have reloaded both client and server, in case
// we error in either of the above cases. For example, reloading the http
// server to a TLS connection could succeed, while reloading the server's rpc
// connections could fail.
if shouldReloadHTTP {
if err := a.reloadHTTPServers(); err != nil {
a.logger.Error("failed to reload HTTP servers", "error", err)
}
}

return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New method, which I pretty much just grabbing from what was in command.go. I was able to test and get it to work - atleast I haven't stumbled into any issues yet. Created a new test for this method as well, although I could expand it out to cover more specific scenarios (only client, only server, etc).

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see where you're going with this, but there's an architectural issue. See my comment for more details.

@@ -1189,64 +1195,12 @@ func (c *Command) handleReload() error {
newConf.LogLevel = c.agent.GetConfig().LogLevel
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting all this below into its own function makes sense, but we've skipped over everything above this point. As it turns out, the readConfig function in particular is doing a lot of work around reading in the correct files and merging the configuration with the command line arguments. If we extract this into the agent FullReload without that work, we'll end up dropping any command line args that are used to set configuration.

My hunch is that it's just hard to extract all this because the section above is tied to Command and not Agent. What if instead of trying to remove the work below from command and sticking it in agent, we pass a closure from Command into Agent when it's constructed? That is, we keep all the logic in Command but pass Agent a function with a signature func() error that's called from both.

Ex.

func (c *Command) reloaderFunc() func() error {
	return func() error {
 		return c.handleReload()
	}
}
func NewAgent(config *Config, 
	logger log.InterceptLogger, logOutput io.Writer, 
	inmem *metrics.InmemSink,
	configReloader func() error,
) (*Agent, error) {
	// ...
}


func (a *Agent) Reload() error {
   return a.configReloader()
}

You'd probably need to change handleReload so that it returns all its errors for the caller to run c.Ui with as needed (not all errors are fatal though, so that's a quirk). But that'd let you avoid all the extra validation code you're writing here -- it's already being done in the handleReload and if it's not we can add it there and get it for both code paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme/api HTTP API and SDK issues type/enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

agent configuration reload API

2 participants