Conversation
Force-pushed from 58b8992 to 4668ec4.
/cc @wonderfly
At a high level, this looks ok for phase 2. I would like more detail on the following:
Will do a detailed pass tomorrow.
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Force-pushed from 4668ec4 to 0bd5266.
Thanks for the quick review. I have updated the PR to address your comments. PTAL.
| reservation grows), or running multiple Kubelets on a single node. |
| Kubernetes nodes typically run many OS system daemons in addition to Kubernetes daemons like the kubelet, the runtime, etc., and user pods. |
| Kubernetes assumes that all the compute resources available in a node, referred to as `Capacity`, are available for user pods. |
| In reality, system daemons use a non-trivial amount of resources and their availability is critical for the stability of the system. |
| 1. Its resource consumption is tied to the number of pods running on a node. |
| Note that the hierarchy below recommends having dedicated cgroups for the kubelet and the runtime to individually track their usage. |
probably meant to put this section below in a code block
| . . . .tasks(container processes) |
| . . . |
| . . +..PodOverhead |
| . . . .tasks(per-pod processes) |
seems like this would take docker-level support for putting all pause container processes in a single cgroup, if that is indeed what this PodOverhead cgroup is for.
nevermind, i see now that PodGuaranteed is not a literal name, but a placeholder for the pod-level cgroup.
Not necessary. Each pod overhead process can be in its own cgroup too, which might be helpful for tracking and termination purposes.
| . . . |
| . . ... |
| `systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically. |
So... the kubelet is creating a cgroup and putting itself in it? Then moving the docker daemon or containerd to that cgroup as well?
Yeah. Kubelet does this to support older distros. This is legacy behavior.
derekwaynecarr
left a comment
i want to push a little harder to see if we can get some way of reporting pressure relative to allocatable capacity.
| On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from |
| placing **any** new pods on the node since they will be rejected by the `kubelet` in admission. |
| ## Enforcing Node Allocatable |
is this intended for this pr? node allocatable for 1.6 is not local storage aware? i would expect this to be in the local storage pr.
never mind, i see the reference below.
| . .tasks(docker-engine, containerd) |
| . |
| . |
| +..kubepods or kubepods.slice (Node Allocatable enforced here by Kubelet) |
it would be nice if we call out that this is dynamically created on kubelet startup whereas the others need to be pre-provisioned in the OS image.
i see the note below that says the same, so ignore above.
| ### Phase 2 - Enforce Allocatable on Pods |
| **Status**: Targeted for v1.6 |
| . . . |
| . . ... |
| `systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically. |
we need to check that systemreserved and kubereserved address distinct non-overlapping parts of the cgroup tree. should error if we observe that one covers the other.
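The non-overlap check asked for above is mechanical: neither reserved cgroup may be an ancestor of (or equal to) the other. A minimal sketch of that check in Python (`cgroups_overlap` is a hypothetical helper for illustration, not kubelet code):

```python
import posixpath

def cgroups_overlap(a: str, b: str) -> bool:
    """Return True if one absolute cgroupfs path covers the other,
    i.e. the paths are equal or one is an ancestor of the other."""
    a = posixpath.normpath(a)
    b = posixpath.normpath(b)
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")

# Distinct sibling subtrees are fine; nesting is the error kubelet should report.
assert not cgroups_overlap("/system.slice", "/kubereserved")
assert cgroups_overlap("/system", "/system/kubelet")
```

Note the `"/"`-suffixed prefix test: a plain string prefix check would wrongly flag siblings like `/system` and `/system2` as overlapping.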
| In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via the Summary metrics API. |
| `Storage` will also be introduced as a reservable resource in this phase. |
| Support for evictions based on Allocatable will be introduced in this phase. |
i am a little worried that doing this separately will cause problems. basically, things will just get OOM killed and never moved when utilization starts to reach allocatable for memory. it seems like memory pressure not being reported back to the scheduler for allocatable is a step back.
Thinking about this more, users could set their eviction thresholds higher to cover system+kube reserved in the interim.
> it seems like memory pressure not being reported back to the scheduler for allocatable is a step back.

That's true. We need to prevent new pods from being scheduled. But look below.

> Thinking about this more, users could set their eviction thresholds higher to cover system+kube reserved in the interim.
Let's assume the following:
- user sets their eviction threshold to, say, `85%`
- they enforce `SystemReserved` & `KubeReserved`, which are set to `5%` each
- System and kube components use up to capacity.
- Hence Allocatable is `90%`.

Pods cannot use more than 75% of the memory. If they use more than that, they will be evicted anyways.
Given that, do we have to do anything extra? Am I missing something?
I feel eviction thresholds limit the available capacity on the nodes for pods. If a node only runs Guaranteed pods, today, they cannot use up to Allocatable if user space evictions are turned on, and that worries me.
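The arithmetic in the example above can be spelled out. A sketch using the illustrative percentages from the comment (85% eviction threshold, 5% per reservation; these are not kubelet defaults):

```python
def pod_usable_fraction(eviction_threshold: float,
                        system_reserved: float,
                        kube_reserved: float) -> float:
    """Fraction of node memory pods can actually use before eviction
    triggers, assuming the daemons consume their full reservations."""
    return eviction_threshold - (system_reserved + kube_reserved)

# Allocatable = capacity minus the two reservations: 100% - 5% - 5% = 90%.
allocatable = 1.00 - 0.05 - 0.05
assert abs(allocatable - 0.90) < 1e-9

# But evictions start at 85% node usage, and daemons use 10%,
# so pods are effectively capped at 75%, well below Allocatable.
usable = pod_usable_fraction(0.85, 0.05, 0.05)
assert abs(usable - 0.75) < 1e-9
```

This is the gap the comment worries about: the eviction threshold, not Allocatable, becomes the effective cap on pod memory.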
Posted kubernetes/kubernetes#41234 to implement Phase 2 mentioned in this proposal.
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Force-pushed from 90950d6 to f1d5e24.
| To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage. |
| Together, evictions and node allocatable help improve node stability. |
| As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds. |
Not sure what the "(overall node usage)" is trying to clarify.
Possibly rephrase: evictions are based on overall node usage relative to Capacity
| Once Kubelet supports `storage` as an `Allocatable` resource, Kubelet will perform evictions whenever the total storage usage by pods exceeds node allocatable. |
| The trigger threshold for storage evictions will not be user configurable for the purposes of `Allocatable`. |
| Kubelet will evict pods once the `storage` usage is greater than or equal to `98%` of `Allocatable`. |
I realize this is an implementation detail, but what are we trying to protect against by setting the threshold to 98%? Why not 100%, since the overall node is protected by the eviction-hard threshold?
Yes. That makes sense!
I also think we should consider reusing the minimum reclaim value here. In theory, the value should be configured to reduce hitting the same threshold in quick succession. In my estimation, this should be the same, regardless of which threshold is being hit. Under what circumstances would anyone want the allocation-min-reclaim to be different from eviction-min-reclaim?
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
cc: @ethernetdan
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@derekwaynecarr Updated PR based on sig-node discussion.
@derekwaynecarr can I get an approval on this PR?
… On Wed, Feb 15, 2017 at 2:50 PM Vish Kannan ***@***.***> wrote:
@derekwaynecarr <https://github.com/derekwaynecarr> can I get an approval
on this PR?
derekwaynecarr
left a comment
I agree with 99% of this proposal. See comments and fix typos.
| Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby bringing down usage to `95%` of `Allocatable`. |
| These thresholds apply for both storage `capacity` and `inodes`. |
| *Note that these values are subject to change based on feedback from production.* |
My two cents is that this text is premature pending agreement on the local storage proposal. Can we remove this text and just state in the future storage will be a part of allocatable. I am not able to agree at this time on lack of configuration and it's also not a planned 1.6 feature.
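The 98% trigger and 95% reclaim target quoted from the proposal can be sketched as a loop. This is an illustration of the described policy only, with the numbers taken straight from the text; the eviction *ordering* here (largest consumer first) is a simplification, since real kubelet eviction ordering is QoS-aware:

```python
def pods_to_evict(usages, allocatable, trigger=0.98, target=0.95):
    """Once total pod storage usage reaches trigger*allocatable, pick pods
    (largest-first here, for illustration) until usage falls to or below
    target*allocatable. Returns the usages of the evicted pods."""
    total = sum(usages)
    if total < trigger * allocatable:
        return []  # below the 98% trigger: nothing to do
    evicted = []
    for usage in sorted(usages, reverse=True):
        if total <= target * allocatable:
            break
        evicted.append(usage)
        total -= usage
    return evicted

# 100Gi allocatable; pods using 60+30+8 = 98Gi hits the 98% trigger,
# and evicting the 60Gi pod brings usage well below the 95% target.
assert pods_to_evict([60, 30, 8], 100) == [60]
assert pods_to_evict([40, 30, 8], 100) == []
```

Making `target` a shared minimum-reclaim parameter, as the comment suggests, would just mean reusing the same `trigger - target` gap across threshold types.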
| By explicitly reserving compute resources, the intention is to avoid overcommitting the node and not have system daemons compete with user pods. |
| The resources available to system daemons and user pods will be capped based on user specified reservations. |
| If `Allocatable` is available, the scheduler uses that instead of `Capacity`, thereby not overcommitting the node. |
| together in the `/system` raw container). |
| together in the `/system` raw container on non-systemd nodes). |
| ## Kubelet Eviction Thresholds |
| Kubelet evicts pods based on QoS and user configured eviction thresholds. |
| More details in [this doc](./kubelet-eviction.md#enforce-node-allocatable) |
| From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot exceed `Allocatable`. |
Cannot exceed allocatable
| If we enforce Node Allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi`, in which case evictions will not be performed unless kernel memory consumption is above `100Mi`. |
| In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Thresholds`. |
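The quoted rule is simple arithmetic: the `kubepods` cgroup limit is padded by the hard eviction threshold so kubelet gets a window to evict gracefully before the memcg OOM killer fires. A sketch using the `28.9Gi` / `100Mi` figures from the example above:

```python
GI = 1024 ** 3
MI = 1024 ** 2

def pods_cgroup_limit(node_allocatable: int, hard_eviction_threshold: int) -> int:
    """Top-level pods cgroup memory limit, per the proposal text:
    Node Allocatable plus the hard eviction threshold."""
    return node_allocatable + hard_eviction_threshold

allocatable = int(28.9 * GI)
limit = pods_cgroup_limit(allocatable, 100 * MI)

# Pods can burst into the 100Mi eviction window above Allocatable,
# and kubelet evicts before the cgroup limit (and OOM kill) is reached.
assert limit == allocatable + 100 * MI
```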
| 1. A container runtime on Kubernetes nodes is not expected to be used outside of the Kubelet. |
| 1. Its resource consumption is tied to the number of pods running on a node. |
| Note that the hierarchy below recommends having dedicated cgroups for the kubelet and the runtime to individually track their usage. |
| ``` |
| `systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically. |
Can you add text clarifying how users communicate to kubelet the cgroups that were preconfigured?
| * If `--cgroups-per-qos=false`, then this flag has to be set to `""`. Otherwise it's an error and kubelet will fail. |
| * It is recommended to drain and restart nodes prior to upgrading to v1.6. This is necessary for the `--cgroups-per-qos` feature anyway, which is expected to be turned on by default in `v1.6`. |
| * Users intending to turn off this feature can set this flag to `""`. |
| * Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified. |
Is the syntax for this flag driver specific?
No. It's expected to be an absolute cgroupfs path.
| * It is recommended to drain and restart nodes prior to upgrading to v1.6. This is necessary for the `--cgroups-per-qos` feature anyway, which is expected to be turned on by default in `v1.6`. |
| * Users intending to turn off this feature can set this flag to `""`. |
| * Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified. |
| * Specifying `system-reserved` value in this flag is invalid if `--system-reserved-cgroup` flag is not specified. |
Same question here: do I say system.slice or /system?
See clarifying text below.
| 2. `--kube-reserved-cgroup=<absolute path to a cgroup>` |
| * This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation. |
| 3. `--system-reserved-cgroup=<absolute path to a cgroup>` |
I wish we could do a sensible default value here. /system is almost universally wrong
Defaulting SGTM. It will help with metrics collection. I'd tackle that in v1.7 once all the distro owners digest this proposal.
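The flag dependencies discussed in the thread above reduce to a few validation rules. A hypothetical sketch (the flag names come from this proposal, but `validate_enforce_node_allocatable` is an illustration, not kubelet source):

```python
def validate_enforce_node_allocatable(enforce,
                                      cgroups_per_qos,
                                      kube_reserved_cgroup="",
                                      system_reserved_cgroup=""):
    """Raise ValueError on the invalid flag combinations described above."""
    if not cgroups_per_qos and enforce:
        raise ValueError(
            "--enforce-node-allocatable must be '' when --cgroups-per-qos=false")
    if "kube-reserved" in enforce and not kube_reserved_cgroup:
        raise ValueError("kube-reserved requires --kube-reserved-cgroup")
    if "system-reserved" in enforce and not system_reserved_cgroup:
        raise ValueError("system-reserved requires --system-reserved-cgroup")

# Valid: default enforcement on pods only, and enforcement on a
# pre-provisioned system cgroup identified by an absolute cgroupfs path.
validate_enforce_node_allocatable(["pods"], True)
validate_enforce_node_allocatable(["pods", "system-reserved"], True,
                                  system_reserved_cgroup="/system.slice")
```

Note `/system.slice` is passed as an absolute cgroupfs path, per the clarification that the systemd naming scheme isn't supported.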
Overall looks good to me.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@derekwaynecarr PTAL
@wonderfly thanks for the quick review
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
I added another flag to help with the rollout of the changes in this PR. Specifically, it helps ignore Hard Eviction thresholds while computing Node Allocatable.
@derekwaynecarr Can I get an LGTM on this PR?
derekwaynecarr
left a comment
The new flag makes sense since it reduces allocatable.
| * This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation. |
| * Example: `/system.slice`. Note that absolute paths are required and the systemd naming scheme isn't supported. |
| 4. `--experimental-node-allocatable-ignore-eviction-threshold` |
/lgtm
Merging this PR based on LGTM.
Automatic merge from submit-queue

Enforce Node Allocatable via cgroups

This PR enforces node allocatable across all pods using a top level cgroup as described in kubernetes/community#348. This PR also provides an option to enforce `kubeReserved` and `systemReserved` on user specified cgroups. This PR will by default make kubelet create top level cgroups even if `kubeReserved` and `systemReserved` is not specified and hence `Allocatable = Capacity`.

```release-note
New Kubelet flag `--enforce-node-allocatable` with a default value of `pods` is added which will make kubelet create a top level cgroup for all pods to enforce Node Allocatable. Optionally, `system-reserved` & `kube-reserved` values can also be specified separated by comma to enforce node allocatable on cgroups specified via `--system-reserved-cgroup` & `--kube-reserved-cgroup` respectively. Note the default value of the latter flags are "". This feature requires a **Node Drain** prior to upgrade failing which pods will be restarted if possible or terminated if they have a `RestartNever` policy.
```

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests

TODO:
- [x] Adjust effective Node Allocatable to subtract hard eviction thresholds
- [x] Add unit tests
- [x] Complete pending e2e tests
- [x] Manual testing
- [x] Get the proposal merged

@dashpole is working on adding support for evictions for enforcing Node allocatable more gracefully. That work will show up in a subsequent PR for v1.6
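The merged PR's default (`Allocatable = Capacity` when no reservations are given) together with the hard-eviction adjustment from the TODO list implies a simple computation. A sketch under the assumption, stated in this thread, that the ignore flag skips the eviction-threshold subtraction:

```python
def node_allocatable(capacity,
                     kube_reserved=0,
                     system_reserved=0,
                     hard_eviction_threshold=0,
                     ignore_eviction_threshold=False):
    """Allocatable = Capacity - KubeReserved - SystemReserved
    (minus the hard eviction threshold, unless the ignore flag
    added in this PR is set)."""
    alloc = capacity - kube_reserved - system_reserved
    if not ignore_eviction_threshold:
        alloc -= hard_eviction_threshold
    return alloc

# Defaults: no reservations, no thresholds -> Allocatable = Capacity.
assert node_allocatable(100) == 100
assert node_allocatable(100, 5, 5, 2) == 88
assert node_allocatable(100, 5, 5, 2, ignore_eviction_threshold=True) == 90
```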
Automatic merge from submit-queue Eviction Manager Enforces Allocatable Thresholds This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348. This PR should be merged after #41234. cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh ** Why is this a bug/regression** Kubelet uses `oom_score_adj` to enforce QoS policies. But the `oom_score_adj` is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer. Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature `out of Resource` handling.
Phase 2 of node allocatable
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exist two reasons for pods being evicted:
- one is that the whole machine's free memory is too low,
- the other is k8s's own calculation[2], i.e. memory.available[3] is too low.

To resolve this confusion for k8s users, collect and show the k8s global workingset memory to distinguish between these two causes.

Note:
1. Collecting only the k8s global memory stats is enough, because cgroupfs stats are propagated from child to parent, so the parent always notices the change and updates accordingly. And from v1.6 k8s[4], allocatable (/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity (/sys/fs/cgroup/memory/).
2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice')
3. The difference between cgroupv1 and cgroupv2: different field names in the memory.stat file, and memory.currentUsage stored in different files (cgroupv1's memory.usage_in_bytes vs. cgroupv2's memory.current).

[1] https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
[2] kubernetes/kubernetes#43916
[3] memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile
[4] kubernetes/kubernetes#42204 kubernetes/community#348
[5] https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/

Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Reported-by: Teng Hu <huteng.ht@bytedance.com>
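The formulas in footnote [3] of the commit message above translate directly to code. A sketch of just that arithmetic (illustrative byte values; real kubelet reads these from cgroupfs):

```python
def working_set(current_usage: int, inactive_file: int) -> int:
    """memory.workingSet = memory.currentUsage - memory.inactivefile,
    i.e. usage excluding easily reclaimable inactive file cache."""
    return current_usage - inactive_file

def memory_available(allocatable: int, current_usage: int,
                     inactive_file: int) -> int:
    """memory.available = memory.allocatable - memory.workingSet."""
    return allocatable - working_set(current_usage, inactive_file)

# 8Gi allocatable, 6Gi currently used, of which 2Gi is inactive file cache:
# the working set is 4Gi, so 4Gi remains available before MemoryPressure.
GI = 1024 ** 3
assert working_set(6 * GI, 2 * GI) == 4 * GI
assert memory_available(8 * GI, 6 * GI, 2 * GI) == 4 * GI
```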
cc @kubernetes/sig-node-proposals
I plan on implementing the 2nd phase mentioned in this proposal while this PR gets reviewed.
cc @adityakali @Amey-D Take a look at the cgroup configuration mentioned in this proposal. We need to alter COS cgroup hierarchies