metrics: add total http connections metric for agent#26756
metrics: add total http connections metric for agent#26756mismithhisler wants to merge 1 commit intomainfrom
Conversation
| serverInitializationErrors error | ||
|
|
||
| connCount int = 0 | ||
| countMux *sync.Mutex = &sync.Mutex{} |
There was a problem hiding this comment.
Adding a lock to the ConnState callback could potentially introduce some contention if there are many incoming requests and a short telemetry.collectionInterval. This may be worth running some benchmarks.
There was a problem hiding this comment.
I scripted a very basic test hitting the nomad api concurrently. Comparing the times to a previous version of Nomad without this metric, it showed no meaningful difference in total time. This is not surprising as the handling of connections is already bottlenecked by the lock within the connection limiter.
| connMux.Unlock() | ||
|
|
||
| // Call connection limiter if enabled | ||
| if connLimit > 0 { |
There was a problem hiding this comment.
We're already introducing some closures here with the connCount and mutex, adding the connLimit as one more closure allows us to condense the logic a bit.
There was a problem hiding this comment.
I had to do a little truth table for myself to verify we hadn't changed the behavior unexpectedly. 😁 I think this holds for both the old and new versions so LGTM
| non-TLS or no handshake timeout |
connLimit > 0 | call connLimiter? | set deadlines? |
|---|---|---|---|
| yes | yes | yes | no |
| yes | no | no | no |
| no | yes | yes | yes |
| no | no | no | yes |
|
@mismithhisler Since the docs content has migrated to the web-unified-docs repo, the updates in this PR need to be recreated in the other repo. I can do the docs update or help, once Eng decides which release this is going into. |
|
|
||
| // lock connCount to avoid torn reads, as this is updated by ConnState callbacks | ||
| countMux.Lock() | ||
| metrics.SetGauge([]string{"nomad", "agent", "http", "connections"}, float32(connCount)) |
There was a problem hiding this comment.
For something as simple as this int, would an atomic be better?
There was a problem hiding this comment.
Yeah I'd probably make this an atomic, so you never have to worry about holding it open or where to unlock it. Just mind you don't accidentally copy it, same as with the mutex.
tgross
left a comment
There was a problem hiding this comment.
Looking good! Don't forget a changelog.
|
|
||
| // lock connCount to avoid torn reads, as this is updated by ConnState callbacks | ||
| countMux.Lock() | ||
| metrics.SetGauge([]string{"nomad", "agent", "http", "connections"}, float32(connCount)) |
There was a problem hiding this comment.
Yeah I'd probably make this an atomic, so you never have to worry about holding it open or where to unlock it. Just mind you don't accidentally copy it, same as with the mutex.
| connMux.Unlock() | ||
|
|
||
| // Call connection limiter if enabled | ||
| if connLimit > 0 { |
There was a problem hiding this comment.
I had to do a little truth table for myself to verify we hadn't changed the behavior unexpectedly. 😁 I think this holds for both the old and new versions so LGTM
| non-TLS or no handshake timeout |
connLimit > 0 | call connLimiter? | set deadlines? |
|---|---|---|---|
| yes | yes | yes | no |
| yes | no | no | no |
| no | yes | yes | yes |
| no | no | no | yes |
|
@mismithhisler what do you think about closing out #7893 while we're here and adding some metrics on rejected connections too? |
Description
This PR adds tracking of http connections and emits those as the metric
nomad.agent.http.connection. We use a gauge for this metric so we can track long lived connections.Testing & Reproduction steps
In one terminal, run a dev agent. In another terminal, run
nomad monitor. Wait thein_memory_collection_interval. In a third terminal, runcurl -s localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name | contains("nomad.nomad.agent.http"))'. Observe the value of1.0for the total connections.Links
Docs deploy preview: https://nomad-git-f-add-total-http-connections-metric-hashicorp.vercel.app/nomad/docs/reference/metrics#agent-metrics
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad website documentation to reflect this. Refer to
the website README for docs guidelines. Please also consider whether the
change requires notes within the upgrade guide.
Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.