
Conversation

@cognifloyd
Member

@cognifloyd cognifloyd commented Jul 6, 2021

This allows us to specify .Values.st2sensorcontainer.deployments instead of hard-coding each sensor pod in .Values.st2.packs.sensors. It uses the ST2 hash range partitioning method to enable this behavior.
See: https://docs.stackstorm.com/reference/sensor_partitioning.html#hash

This may be helpful in private k8s clusters where hardware resources are constrained, so that only a few sensor containers can be running.

  • sensor partitioning with hash ranges
  • fix lint
  • add changelog entry
  • documentation

Props to @emptywee for figuring out how to calculate the hash ranges automatically. I translated what he did in a bash script into the helm chart.
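The automatic calculation can be sketched in Python (a hypothetical helper, not the chart's actual code; the 32-bit hash space and the `start..end` range syntax follow the StackStorm sensor partitioning docs linked above):

```python
# Sketch of splitting the sensor hash space evenly across N deployments.
# Assumption: StackStorm's hash partitioner operates on a 32-bit hash space.
HASH_SPACE = 2**32

def hash_ranges(deployments: int) -> list[str]:
    """Return one contiguous `start..end` hash range per deployment."""
    step = HASH_SPACE // deployments
    ranges = []
    for i in range(deployments):
        start = i * step
        # The last deployment absorbs any remainder so the full space is covered.
        end = HASH_SPACE if i == deployments - 1 else (i + 1) * step
        ranges.append(f"{start}..{end}")
    return ranges
```

For example, `hash_ranges(3)` splits the space into three non-overlapping ranges that together cover `0..2**32`.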

TODO:

  • documentation in values.yaml and README (including adjustments to current docstrings)

@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Jul 6, 2021
@cognifloyd cognifloyd force-pushed the dynamic-st2sensors branch from 62a1015 to bf37cc5 Compare July 6, 2021 22:15
@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/M PR that changes 30-99 lines. Good size to review. labels Jul 8, 2021
@cognifloyd cognifloyd added RFR and removed WIP labels Jul 8, 2021
@cognifloyd cognifloyd requested a review from arm4b July 8, 2021 02:46
@cognifloyd cognifloyd changed the title Allow partitioning sensors by hash range + replicas instead of one-sensor-per-pod Allow partitioning sensors with hash ranges across multiple deployments instead of one-sensor-per-pod Jul 8, 2021
@cognifloyd
Member Author

K. I think this is ready for review. I renamed st2sensorcontainer.replicas to st2sensorcontainer.deployments to more accurately reflect how the hash range partitioning is done. Plus I added documentation in several places to explain the option.

@cognifloyd cognifloyd force-pushed the dynamic-st2sensors branch 5 times, most recently from 177e7d2 to 9534789 Compare July 17, 2021 13:43
@cognifloyd cognifloyd added WIP and removed RFR labels Jul 17, 2021
@cognifloyd cognifloyd marked this pull request as draft July 17, 2021 17:07
@cognifloyd
Member Author

Hmm. I think I can break this up into two PRs. I'll submit that in a few minutes.

@cognifloyd cognifloyd force-pushed the dynamic-st2sensors branch 3 times, most recently from 72e88db to 8135045 Compare July 18, 2021 06:15
@cognifloyd
Member Author

cognifloyd commented Jul 18, 2021

#221 and #222 should be merged before this one will be ready to review.
#221 makes st2sensorcontainer provide defaults for each pod in st2.packs.sensors. This makes the change easier to review, since introducing the $sensor and $name vars is easier to follow at a glance without introducing hash_ranges at the same time.
#222 (edit: and #246) makes the difference between sensor modes (all-sensors-in-one-pod and one-sensor-per-pod) explicit and less magical.

Then, this PR can focus on the sensor partitioning methods. It turns out that, by adding hash ranges, there are 3 sensor modes. (Yes, there were 2 modes before; it just wasn't obvious, which is why I added #222.) Quoting from the docs I just updated in this PR:

It is possible to run st2sensorcontainer(s) in one of these modes:

  1. run all sensors in one pod (1 deployment with 1 pod, the default); or
  2. run multiple sensors per pod (2+ deployments with 1 pod each) using hash range partitions; or
  3. run one sensor per pod using st2.packs.sensors.

To use the deployments (modes 1 and 2 in this list), st2.packs.sensors must be empty.
For one-sensor-per-pod, define defaults in st2sensorcontainer and add config for each sensor to st2.packs.sensors.
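As hypothetical values.yaml fragments (the key names come from this thread; the pack and sensor names are invented for illustration), modes 2 and 3 look roughly like:

```yaml
# Mode 2: multiple sensors per pod, hash-range partitioned across 3 deployments.
st2sensorcontainer:
  deployments: 3
st2:
  packs:
    sensors: []   # must be empty for modes 1 and 2

# Mode 3: one sensor per pod; st2sensorcontainer provides the defaults.
# st2:
#   packs:
#     sensors:
#       - name: my-sensor          # invented example name
#         ref: my_pack.MySensor    # invented example sensor ref
```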

There was a default sensor container defined in st2.packs.sensors to ensure that there would be at least one sensor container.
Despite the inline documentation, that wasn't obvious: it just looked like an example that hadn't been commented out.

So, this improves the user experience by

  1. using st2sensorcontainer.deployments = 1 to cover that use case (all sensors in one pod), so that st2.packs.sensors is only for defining pods in the one-sensor-per-pod mode;
  2. allowing sensors to be scaled up slightly by merely increasing st2sensorcontainer.deployments (hash range partitioning essentially becomes an implementation detail); and
  3. still supporting full HA with one sensor per pod via st2.packs.sensors for anyone who needs it.

@cognifloyd cognifloyd force-pushed the dynamic-st2sensors branch 3 times, most recently from 472d38e to 6347e7f Compare July 27, 2021 18:47
@pull-request-size pull-request-size bot removed the size/L PR that changes 100-499 lines. Requires some effort to review. label Sep 30, 2021
Member

@arm4b arm4b left a comment


Thanks for the PR!
I definitely like the idea that sensor partitioning is auto-detected just based on the number.

Though, it's definitely a complex feature: the implementation, the configuration flow, and even more moving pieces added to an already complicated chart.
Historically, we didn't plan to implement something like this in K8s for prod use, considering how K8s manages Pod health, availability, self-healing, and failover. One-sensor-per-container was therefore considered the recommended production approach, as it is compatible with the K8s model.

Such a significant addition to the chart functionality needs thorough testing and feedback.
It would be nice if someone from the community could try this PR and share their experience here, especially around the different sensor modes and how switching affects the environment.

@cognifloyd
Member Author

I definitely like the idea that sensor partitioning is auto-detected just based on the number.

Internally, we've had a script auto-calculating hash ranges on sensor pod startup since at least November 2017 with StackStorm 2.4.1 (I just looked through the git history). I hope this is useful for others as well, so I re-implemented that Docker entrypoint script using Helm's template primitives.
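In Helm template primitives, the same arithmetic can be sketched roughly like this (a hypothetical illustration, not the chart's actual template; variable names are invented and the 32-bit hash space is an assumption):

```yaml
{{/* Split an assumed 32-bit hash space evenly across the deployments. */}}
{{- $deployments := int .Values.st2sensorcontainer.deployments }}
{{- $hashSpace := 4294967296 }}
{{- $step := div $hashSpace $deployments }}
{{- range $i := until $deployments }}
{{- $start := mul $i $step }}
{{/* The last deployment takes the remainder so the whole space is covered. */}}
{{- $end := ternary $hashSpace (mul (add $i 1) $step) (eq $i (sub $deployments 1)) }}
hash_range_{{ $i }}: "{{ $start }}..{{ $end }}"
{{- end }}
```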

considering how K8s manages the Pods health, availability, self-healing, failover. And so one-sensor-for-container was considered as a production recommended way that's compatible with the K8s model.

In our environment, we provide the StackStorm platform and various team members can add packs that may or may not have sensors. They only need to learn the stackstorm layer, not the kubernetes layer as deploying StackStorm follows a completely different deployment model than everything else we deploy in kubernetes. So, defining the sensors is out-of-scope when installing StackStorm via the helm chart. All I can do is configure how many pods should be available for sensors, and other devs or devops people (who do not have access to deploy directly to kubernetes) will add their packs+sensors to StackStorm.

Hash ranges are the only method StackStorm supports for dynamic sensor partitioning. So, that's the method I need in order to install StackStorm in Kubernetes.

As far as "how K8s manages the Pods health, availability, self-healing, failover": I chose to use deployments instead of replicas precisely because that takes advantage of all of these Kubernetes-native features. The hash_range is node-specific configuration, so configuring all sensor nodes within the same deployment would have caused headaches whenever a particular sensor node crashes or needs to be moved to a different host (e.g., K8s follows a LIFO approach to replicas, so you can't easily restart individual replicas). So one-sensor-per-pod is not the only method that takes advantage of these K8s health-management features; the hash-range-based some-sensors-per-pod mode also maximizes use of them, given the constraint that I can't know how many sensors need to run at StackStorm installation time.
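For context, the node-specific configuration each sensor node ends up with follows the hash partitioner format from the StackStorm sensor partitioning docs linked above (the node name and range values here are illustrative, not what the chart generates):

```yaml
# st2.conf fragment (INI format) for one sensor node, per the StackStorm docs.
# [sensorcontainer]
# sensor_node_name = sensor-node-0          # illustrative node name
# partition_provider = name:hash, hash_ranges=0..1431655765
```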

Such a significant addition to the chart functionality needs a good testing and feedback.
It would be nice if someone from the community could try this PR and share their experience here, especially around the different sensor modes and how switching affects the environment.

Hash range partitioning has been a complex feature to configure on its own. This method simplifies it significantly for people running HA in K8s, so maybe people will start preferring the hash-range strategy over all-sensors-in-one-pod mode once this is merged. Do you know anyone who is providing StackStorm as a platform-style service? Could you ping them to see if they can test this? Very few people seem to be working with stackstorm-ha, so I'm not sure who we can get to provide the feedback you want.

@cognifloyd
Member Author

Such a significant addition to the chart functionality needs a good testing

OK, I'm not sure how to add tests for this. Does anyone have any suggestions?

Is there a way to query StackStorm for which partitions are active? I can use st2 sensor list to list sensors, but I don't see a way to list the running sensorcontainers and which ranges they cover.

Maybe we can query kubernetes for the pods and check the annotations, but would such a test have any value? And how do I do kubernetes queries from within the bats tests? We don't have kubectl installed on the st2actionrunner image that the st2tests-pod uses... Would we have to manually craft k8s API requests to run via curl?

@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/M PR that changes 30-99 lines. Good size to review. labels Dec 21, 2021
@cognifloyd cognifloyd force-pushed the dynamic-st2sensors branch 2 times, most recently from 4c7d181 to 97d8c3b Compare December 21, 2021 18:40
@pull-request-size pull-request-size bot added size/M PR that changes 30-99 lines. Good size to review. and removed size/L PR that changes 100-499 lines. Requires some effort to review. labels Feb 15, 2022
@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/M PR that changes 30-99 lines. Good size to review. labels Mar 1, 2022
Member Author

@cognifloyd cognifloyd left a comment


I just added unit tests for this feature.

Comment on lines 1168 to +1171
{{- $one_sensor_per_pod := not ($.Values.st2.packs.sensors | empty) }}
{{- range ($one_sensor_per_pod | ternary ($.Values.st2.packs.sensors) (until 1)) }}
{{- $sensor := omit $.Values.st2sensorcontainer "name" "ref" "postStartScript" }}
{{- $some_sensors_per_pod := gt $sensor_deployments_count 1 }}
{{- if and $one_sensor_per_pod $some_sensors_per_pod }}
{{- fail "The sensor values are ambiguous. To use one-sensor-per-pod, use `st2.packs.sensors`. To use multiple-sensors-per-pod, use `st2sensorcontainer.deployments > 1`. Do not use both!" }}
Member Author


I just added a safety check to help users avoid ambiguous sensor configuration. If st2.packs.sensors is not empty, then st2sensorcontainer.deployments must be the default (1 or less).

Member


👍

@cognifloyd cognifloyd requested a review from arm4b March 2, 2022 04:21
Member

@arm4b arm4b left a comment


Thanks for all the efforts making it possible 💯

Comment on lines +126 to +129
You can increase the number of `st2sensorcontainer` pods by increasing the number of deployments.
The replicas count is still only `1` per deployment, but the sensors are distributed between the deployments using
[Sensor Hash Range Partitioning](https://docs.stackstorm.com/reference/sensor_partitioning.html#hash).
The hash ranges are calculated automatically based on the number of deployments.
Member


The docs look really clear 👍
Hope more users will find it helpful to configure the sensor pods automatically.

@cognifloyd
Member Author

Woohoo! Thanks for the review!


Labels

feature RFR size/L PR that changes 100-499 lines. Requires some effort to review.
