Phase 2 of node allocatable#348

Merged
vishh merged 10 commits into kubernetes:master from vishh:node-allocatable
Feb 21, 2017

Conversation

@vishh
Contributor

@vishh vishh commented Feb 9, 2017

cc @kubernetes/sig-node-proposals

I plan on implementing the 2nd phase mentioned in this proposal while this PR gets reviewed.

cc @adityakali @Amey-D Take a look at the cgroup configuration mentioned in this proposal. We need to alter COS cgroup hierarchies

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 9, 2017
@adityakali

/cc @wonderfly

@derekwaynecarr
Member

At a high level, this looks ok for phase 2.

I would like more detail on the following:

  • eviction at kubepods.slice and not just root
  • it's useful if each new cgroup could have output in the summary stats API, especially the one that holds the user workload
  • clarify that cgroup created by kubelet is cgroup driver specific. So /kubepods would become kubepods.slice
  • Agree defaulting to pods only is good for 1.6
  • a blurb on containerized kubelet would probably be appreciated by others

Will do a detailed pass tomorrow.

Signed-off-by: Vishnu Kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Feb 9, 2017

Thanks for the quick review. I have updated the PR to address your comments. PTAL.
Not all these issues are relevant for v1.6, so I request focusing initially on the work items necessary for v1.6.

@vishh
Contributor Author

vishh commented Feb 9, 2017

cc @dashpole @rkouj

reservation grows), or running multiple Kubelets on a single node.
Kubernetes nodes typically run many OS system daemons in addition to kubernetes daemons like kubelet, runtime, etc. and user pods.
Kubernetes assumes that all the compute resources available, referred to as `Capacity`, in a node are available for user pods.
In reality, system daemons use non-trivial amoutn of resources and their availability is critical for the stability of the system.
Contributor

s/amoutn/amount/

Contributor Author

Done

1. Its resource consumption is tied to the number of pods running on a node.

Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individally track their usage.

Contributor

probably meant to put this section below in a code block

Contributor Author

Oops yeah. Fixed it

. . . .tasks(container processes)
. . .
. . +..PodOverhead
. . . .tasks(per-pod processes)
Contributor

seems like this would take docker-level support for putting all pause container processes in a single cgroup, if that is indeed what this PodOverhead cgroup is for.

Contributor

nevermind, i see now that PodGuarenteed is not a literal name, but a placeholder for the pod-level cgroup.

Contributor Author

Not necessary. Each pod overhead process can be in its own cgroup too which might be helpful for tracking and terminating purposes.

. . .
. . ...

`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically.
Contributor

So... the kubelet is creating a cgroup and putting itself in it? Then moving the docker daemon or containerd to that cgroup as well?

Contributor Author

Yeah. Kubelet does this to support older distros. This is legacy behavior.

Member

@derekwaynecarr derekwaynecarr left a comment

i want to push a little harder to see if we can get some way of reporting pressure relative to allocatable capacity.

On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.

## Enforcing Node Allocatable
Member

is this intended for this pr? node allocatable for 1.6 is not local storage aware? i would expect this to be local storage pr.

Member

never mind, i see the reference below.

. .tasks(docker-engine, containerd)
.
.
+..kubepods or kubepods.slice (Node Allocatable enforced here by Kubelet)
Member

it would be nice if we call out that this is dynamically created on kubelet startup whereas the others need to be pre-provisioned in the OS image.

Member

i see the note below that says the same, so ignore above.


### Phase 2 - Enforce Allocatable on Pods

**Status**: Targetted for v1.6
Member

typo: Targeted

. . .
. . ...

`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroup automatically.
Member

we need to check that systemreserved and kubereserved address distinct non-overlapping parts of the cgroup tree. should error if we observe that one covers the other.


In this phase, Kubelet will expose usage metrics for `KubeReserved`, `SystemReserved` and `Allocatable` top level cgroups via Summary metrics API.
`Storage` will also be introduced as a reservable resource in this phase.
Support for evictions based on Allocatable will be introduced in this phase.
Member

i am a little worried that doing this separately will cause problems. basically, things will just get OOM killed and never moved when utilization starts to reach allocatable for memory. it seems like memory pressure not being reported back to the scheduler for allocatable is a step back.

Member

Thinking about this more, users could set their eviction thresholds higher to cover system+kube reserved in the interim.

Contributor Author

it seems like memory pressure not being reported back to the scheduler for allocatable is a step back.

That's true. We need to prevent new pods from being scheduled. But look below.

Thinking about this more, users could set their eviction thresholds higher to cover system+kube reserved in the interim.

Let's assume the following:

  1. user sets their eviction thresholds to say 85%
  2. they enforce SystemReserved & KubeReserved which are set to 5% each.
  3. System and kube components use up to their reservations.
  4. Hence allocatable is 90%.

Pods cannot use more than 75% of the memory. If they use more than that, they will be evicted anyway.

Given that, do we have to do anything extra? Am I missing something?

I feel eviction thresholds limit the available capacity on the nodes for pods. If a node only runs Guaranteed pods, today, they cannot use up to Allocatable if user space evictions are turned on, and that worries me.
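The threshold arithmetic in the comment above can be checked with a quick sketch. All numbers are the hypothetical percentages from the discussion, not kubelet defaults:

```python
# Hypothetical numbers from the comment above, as percentages of node
# memory capacity; none of these are kubelet defaults.
capacity = 100
eviction_threshold = 85   # user-configured eviction threshold (node usage)
system_reserved = 5
kube_reserved = 5

# Allocatable = Capacity - SystemReserved - KubeReserved
allocatable = capacity - system_reserved - kube_reserved

# If system and kube daemons consume their full reservations, pods reach
# the node-usage eviction threshold once their own usage hits:
pod_usable = eviction_threshold - system_reserved - kube_reserved

print(allocatable, pod_usable)  # 90 75
```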

@vishh
Contributor Author

vishh commented Feb 10, 2017

Posted kubernetes/kubernetes#41234 to implement Phase 2 mentioned in this proposal.

Signed-off-by: Vishnu Kannan <vishnuk@google.com>
To improve the reliability of nodes, kubelet evicts pods whenever the node runs out of memory or local storage.
Together, evictions and node allocatable help improve node stability.

As of v1.5, evictions are based on `Capacity` (overall node usage). Kubelet evicts pods based on QoS and user configured eviction thresholds.
Contributor

Not sure what the "(overall node usage)" is trying to clarify.
Possibly rephrase: evictions are based on overall node usage relative to Capacity

Contributor Author

Done

Once Kubelet supports `storage` as an `Allocatable` resource, Kubelet will perform evictions whenever the total storage usage by pods exceeds node allocatable.

The trigger threshold for storage evictions will not be user configurable for the purposes of `Allocatable`.
Kubelet will evict pods once the `storage` usage is greater than or equal to `98%` of `Allocatable`.
Contributor

I realize this is an implementation detail, but what are we trying to protect against by setting the threshold to 98%? Why not 100%, since the overall node is protected by the eviction-hard threshold?

Contributor Author

Yes. That makes sense!

Contributor Author

Updated

Contributor

I also think we should consider reusing the minimum reclaim value here. In theory, the value should be configured to reduce hitting the same threshold in quick succession. In my estimation, this should be the same, regardless of which threshold is being hit. Under what circumstances would anyone want the allocation-min-reclaim to be different from eviction-min-reclaim?

vishh and others added 5 commits February 10, 2017 14:57
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
@calebamiles
Contributor

cc: @ethernetdan

Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Feb 14, 2017

@derekwaynecarr Updated PR based on sig-node discussion.

@vishh
Contributor Author

vishh commented Feb 15, 2017

@derekwaynecarr can I get an approval on this PR?

@derekwaynecarr
Member

derekwaynecarr commented Feb 16, 2017 via email

Member

@derekwaynecarr derekwaynecarr left a comment

I agree with 99% of this proposal. See comments and fix typos.

Kubelet will evict pods until it can reclaim `5%` of `storage Allocatable`, thereby bringing down usage to `95%` of `Allocatable`.
These thresholds apply for both storage `capacity` and `inodes`.

*Note that these values are subject to change based on feedback from production.*
Member

My two cents is that this text is premature pending agreement on the local storage proposal. Can we remove this text and just state in the future storage will be a part of allocatable. I am not able to agree at this time on lack of configuration and it's also not a planned 1.6 feature.

Contributor Author

Ack

By explicitly reserving compute resources, the intention is to avoid overcommitting the node and not have system daemons compete with user pods.
The resources available to system daemons and user pods will be capped based on user specified reservations.

If `Allocatable` is available, the scheduler use that instead of `Capacity`, thereby not overcommitting the node.
Member

Will use that....

together in the `/system` raw container).
together in the `/system` raw container on non-systemd nodes).

## Kubelet Evictions Tresholds
Member

Typo on threshold

Contributor Author

Updated

Kubelet evicts pods based on QoS and user configured eviction thresholds.
More details in [this doc](./kubelet-eviction.md#enforce-node-allocatable)

From v1.6, if `Allocatable` is enforced by default across all pods on a node using cgroups, pods cannot to exceed `Allocatable`.
Member

Cannot exceed allocatable

Contributor Author

Done


If we enforce Node Allocatable (`28.9Gi`) via top level cgroups, then pods can never exceed `28.9Gi` in which case evictions will not be performed unless kernel memory consumption is above `100Mi`.

In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be `Node Allocatable` + `Eviction Hard Tresholds`.
Member

Typo on thresholds

Contributor Author

Done
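The sizing rule quoted above, top-level pod cgroup limit = Node Allocatable + hard eviction threshold, is simple arithmetic; a sketch using the `28.9Gi`/`100Mi` figures from the surrounding example:

```python
# Sketch of the kubepods memory limit described above: setting it to
# Node Allocatable + the hard eviction threshold leaves the kubelet a
# window to evict pods before the memcg OOM killer fires.
GI = 1024 ** 3
MI = 1024 ** 2

node_allocatable = int(28.9 * GI)  # example value from the proposal text
eviction_hard = 100 * MI           # e.g. a memory.available<100Mi hard threshold

kubepods_limit = node_allocatable + eviction_hard
```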

1. A container runtime on Kubernetes nodes is not expected to be used outside of the Kubelet.
1. Its resource consumption is tied to the number of pods running on a node.

Note that the hierarchy below recommends having dedicated cgroups for kubelet and the runtime to individally track their usage.
Member

Typo individually

Contributor Author

Done


```

`systemreserved` & `kubereserved` cgroups are expected to be created by users. If Kubelet is creating cgroups for itself and docker daemon, it will create the `kubereserved` cgroups automatically.
Member

Can you add text clarifying how users communicate to kubelet the cgroups that were preconfigured?

Contributor Author

Yup

* If `--cgroups-per-qos=false`, then this flag has to be set to `""`. Otherwise it's an error and kubelet will fail.
* It is recommended to drain and restart nodes prior to upgrading to v1.6. This is necessary for the `--cgroups-per-qos` feature anyway, which is expected to be turned on by default in `v1.6`.
* Users intending to turn off this feature can set this flag to `""`.
* Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified.
Member

Is the syntax for this flag driver specific?

Contributor Author

@vishh vishh Feb 17, 2017

No. It's expected to be an absolute cgroupfs path.

* It is recommended to drain and restart nodes prior to upgrading to v1.6. This is necessary for the `--cgroups-per-qos` feature anyway, which is expected to be turned on by default in `v1.6`.
* Users intending to turn off this feature can set this flag to `""`.
* Specifying `kube-reserved` value in this flag is invalid if `--kube-reserved-cgroup` flag is not specified.
* Specifying `system-reserved` value in this flag is invalid if `--system-reserved-cgroup` flag is not specified.
Member

Same question here: do I say system.slice or /system?

Member

See clarifying text below.

Contributor Author

Done

2. `--kube-reserved-cgroup=<absolute path to a cgroup>`
* This flag helps kubelet identify the control group managing all kube components like Kubelet & container runtime that fall under the `KubeReserved` reservation.

3. `--system-reserved-cgroup=<absolute path to a cgroup>`
Member

I wish we could do a sensible default value here. /system is almost universally wrong

Contributor Author

Defaulting SGTM. It will help with metrics collection. I'd tackle that in v1.7 once all the distro owners digest this proposal.

@wonderfly

Overall looks good to me.

Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Feb 17, 2017

@derekwaynecarr PTAL

@vishh
Contributor Author

vishh commented Feb 17, 2017

@wonderfly thanks for the quick review

Signed-off-by: Vishnu Kannan <vishnuk@google.com>
@vishh
Contributor Author

vishh commented Feb 20, 2017

I added another flag to help with the rollout of the changes in this PR. Specifically, it helps ignore Hard Eviction thresholds while computing Node Allocatable.

@vishh
Contributor Author

vishh commented Feb 20, 2017

@derekwaynecarr Can I get an LGTM on this PR?

Member

@derekwaynecarr derekwaynecarr left a comment

The new flag makes sense since it reduces allocatable.

* This flag helps kubelet identify the control group managing all OS specific system daemons that fall under the `SystemReserved` reservation.
* Example: `/system.slice`. Note that absolute paths are required and systemd naming scheme isn't supported.

4. `--experimental-node-allocatable-ignore-eviction-threshold`
Member

makes sense
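Pulling the flags from this review together, a hypothetical kubelet invocation might look like the following. Cgroup paths and reservation sizes are illustrative only, not defaults:

```shell
# Illustrative only: cgroup paths and reservation sizes are examples,
# not kubelet defaults.
kubelet \
  --enforce-node-allocatable=pods,kube-reserved,system-reserved \
  --kube-reserved=cpu=500m,memory=1Gi \
  --kube-reserved-cgroup=/kubereserved \
  --system-reserved=cpu=500m,memory=500Mi \
  --system-reserved-cgroup=/system.slice \
  --cgroups-per-qos=true
```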

@derekwaynecarr
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2017
@vishh
Contributor Author

vishh commented Feb 21, 2017

Merging this PR based on LGTM.

@vishh vishh merged commit 7a444fa into kubernetes:master Feb 21, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Feb 28, 2017
Automatic merge from submit-queue

Enforce Node Allocatable via cgroups

This PR enforces node allocatable across all pods using a top level cgroup as described in kubernetes/community#348

This PR also provides an option to enforce `kubeReserved` and `systemReserved` on user specified cgroups. 

This PR will by default make kubelet create top level cgroups even if `kubeReserved` and `systemReserved` are not specified, and hence `Allocatable = Capacity`.

```release-note
New Kubelet flag `--enforce-node-allocatable` with a default value of `pods` is added which will make kubelet create a top level cgroup for all pods to enforce Node Allocatable. Optionally, `system-reserved` & `kube-reserved` values can also be specified separated by comma to enforce node allocatable on cgroups specified via `--system-reserved-cgroup` & `--kube-reserved-cgroup` respectively. Note the default value of the latter flags are "".
This feature requires a **Node Drain** prior to upgrade, failing which pods will be restarted if possible or terminated if they have a `RestartNever` policy.
```

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests 

TODO:

- [x] Adjust effective Node Allocatable to subtract hard eviction thresholds
- [x] Add unit tests
- [x] Complete pending e2e tests
- [x] Manual testing
- [x] Get the proposal merged

@dashpole is working on adding support for evictions for enforcing Node allocatable more gracefully. That work will show up in a subsequent PR for v1.6
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Mar 4, 2017
Automatic merge from submit-queue

Eviction Manager Enforces Allocatable Thresholds

This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348.
This PR should be merged after #41234. 

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh 

** Why is this a bug/regression**

Kubelet uses `oom_score_adj` to enforce QoS policies. But the `oom_score_adj` is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer.
Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature `out of Resource` handling.
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
ShirleyFei added a commit to bytedance/atop that referenced this pull request Mar 13, 2023
The kubelet will terminate end-user pods when the worker node has
'MemoryPressure' according to [1]. But confusingly, there exist two
reasons for pods being evicted:
- one is that the whole machine's free memory is too low,
- the other is k8s's own calculation[2], i.e. memory.available[3]
  is too low.

To resolve such confusion for k8s users, collect and show k8s global
workingset memory to distinguish between these two causes.

Note:
1. Collecting only the k8s global memory stats is enough, because
   cgroupfs stats are propagated from child to parent, so the
   parent always notices the change and updates. And from
   v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more
   convincing than capacity(/sys/fs/cgroup/memory/).
2. There are two cgroup drivers or managers to control resources:
   cgroupfs and systemd[5]. We should take both into account.
   (The 'systemd' cgroup driver always ends with '.slice')
3. The difference between cgroupv1 and cgroupv2: different field names
   for memory.stat file, and memory.currentUsage storing in different
   files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current).

[1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
[2]kubernetes/kubernetes#43916
[3]memory.available = memory.allocatable/capacity - memory.workingSet,
   memory.workingSet = memory.currentUsage - memory.inactivefile
[4]kubernetes/kubernetes#42204
   kubernetes/community#348
[5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/

Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Reported-by: Teng Hu <huteng.ht@bytedance.com>
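The accounting in footnote [3] of the commit message above reduces to two subtractions. A sketch with example byte values rather than live cgroupfs reads (the cgroup v1 file names in the comments come from the commit message):

```python
# Sketch of footnote [3] above: workingSet and memory.available.
# Values are example byte counts, not live cgroupfs reads.

def working_set(current_usage: int, inactive_file: int) -> int:
    # workingSet = currentUsage - inactive_file
    # (cgroup v1: memory.usage_in_bytes minus total_inactive_file from memory.stat)
    return max(current_usage - inactive_file, 0)

def memory_available(allocatable: int, current_usage: int, inactive_file: int) -> int:
    # memory.available = allocatable - workingSet
    return allocatable - working_set(current_usage, inactive_file)

# e.g. 8 GiB allocatable, 3 GiB usage of which 1 GiB is inactive file cache
GI = 1024 ** 3
print(memory_available(8 * GI, 3 * GI, 1 * GI) // GI)  # 6
```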
ShirleyFei added a commit to bytedance/atop that referenced this pull request Mar 14, 2023
ShirleyFei added a commit to bytedance/atop that referenced this pull request Mar 14, 2023
ShirleyFei added a commit to bytedance/atop that referenced this pull request Mar 14, 2023
ShirleyFei added a commit to bytedance/atop that referenced this pull request Mar 14, 2023
liutingjieni pushed a commit to bytedance/atop that referenced this pull request Jul 27, 2023