
[Bug] Running dcgm -r 3, the node is reported as healthy even in the presence of failing checks #68

@mgazz

Description

Summary

When the dcgm check is run with run level 3, Autopilot reports the GPU node as healthy even though the underlying dcgmi command reports failures.

Steps To Reproduce

One of our GPU nodes reports a failure when running the dcgmi diagnostic at run level 3 (Integration -> PCIe):

Defaulted container "nvidia-dcgm-ctr" out of: nvidia-dcgm-ctr, toolkit-validation (init)
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.5                                          |
| Driver Version Detected   | 550.54.15                                      |
| GPU Device IDs Detected   | 20b5,20b5,20b5,20b5                            |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - GPUs: 0, 1, 2                           |
|                           | Fail - GPU: 3                                  |
| Warning                   | GPU 3 GPU 3 is running at PCI link width 8X,   |
|                           | which is below the minimum allowed link gener  |
|                           | ation of 16 (parameter 'min_pci_width') Check  |
|                           |  DCGM and system configuration. This error ma  |
|                           | y be eliminated with an updated configuration  |
|                           | .                                              |
| Info                      | GPU 0 GPU to Host bandwidth:  26.86 GB/s, GPU  |
|                           |  0 Host to GPU bandwidth:  26.16 GB/s, GPU 0   |
|                           | bidirectional bandwidth: 40.53 GB/s, GPU 0 GP  |
|                           | U to Host latency:  1.689 us, GPU 0 Host to G  |
|                           | PU latency:  1.902 us, GPU 0 bidirectional la  |
|                           | tency:  3.316 us                               |
| Info                      | GPU 1 GPU to Host bandwidth:  26.25 GB/s, GPU  |
|                           |  1 Host to GPU bandwidth:  26.77 GB/s, GPU 1   |
|                           | bidirectional bandwidth: 40.24 GB/s, GPU 1 GP  |
|                           | U to Host latency:  1.863 us, GPU 1 Host to G  |
|                           | PU latency:  1.920 us, GPU 1 bidirectional la  |
|                           | tency:  3.757 us                               |
| Info                      | GPU 2 GPU to Host bandwidth:  26.91 GB/s, GPU  |
|                           |  2 Host to GPU bandwidth:  26.52 GB/s, GPU 2   |
|                           | bidirectional bandwidth: 40.34 GB/s, GPU 2 GP  |
|                           | U to Host latency:  1.861 us, GPU 2 Host to G  |
|                           | PU latency:  1.883 us, GPU 2 bidirectional la  |
|                           | tency:  3.747 us                               |
| Info                      | GPU 3 GPU to Host bandwidth:  13.44 GB/s, GPU  |
|                           |  3 Host to GPU bandwidth:  13.28 GB/s, GPU 3   |
|                           | bidirectional bandwidth: 23.45 GB/s, GPU 3 GP  |
|                           | U to Host latency:  1.756 us, GPU 3 Host to G  |
|                           | PU latency:  1.876 us, GPU 3 bidirectional la  |
|                           | tency:  3.556 us                               |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+
command terminated with exit code 226

We installed Autopilot in the cluster and ran the same check via REST, targeting the same node. Below is the output from Autopilot.

❯ curl "http://localhost:3333/status?check=dcgm&r=3&host=adcpu016"
Asking to run on remote node(s) adcpu016 or with node label None

Initiated connection to ['http://10.129.9.28:3333/status?host=adcpu016&check=dcgm&r=3'].

Autopilot Endpoint: 10.129.9.28
Node: adcpu016
url(s): http://10.129.9.28:3333/status?host=adcpu016&check=dcgm&r=3
Response:

[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

[[ DCGM ]] SUCCESS

Node Status: DCGM Failed
-------------------------------------

Node Summary:

{'adcpu016': ['DCGM Failed']}

Below are the logs from the Autopilot pod running on the same node:

I0312 11:33:54.317935       7 healthcheck.go:74] Running health check: dcgm -r 3
I0312 11:41:03.508579       7 healthcheck.go:369] DCGM test completed:
I0312 11:41:03.508622       7 healthcheck.go:376] DCGM cannot be run.
[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

[[ DCGM ]] SUCCESS
I0312 11:41:03.508635       7 healthcheck.go:135] Total time (s) for all checks: 429.190786651
I0312 11:41:03.508652       7 handler.go:78] Errors after running local, on demand health checks: false
I0312 11:41:03.512564       7 functions.go:196] Node adcpu016 label found PASS
I0312 11:41:03.529527       7 functions.go:213] Node patched with label
        {
                "metadata": {
                        "labels": {
                                "autopilot.ibm.com/gpuhealth": "PASS"
                        }
                }
        }

Expected behavior

I would expect the node to be tagged with a label that expresses the presence of errors (e.g. autopilot.ibm.com/gpuhealth: EVICT?).

Evidence

Attached is the JSON output from running the dcgmi command.

Proposed Solution

The problem seems to be related to the way Autopilot parses the JSON generated by the dcgmi command: the function inspects only the first result of each test. In the dcgmi output reported above, the failure happens to be in the second result line of the PCIe test (Fail - GPU: 3).

In https://github.com/mgazz/autopilot/tree/dcgm-run-3-fix we updated the parsing to consider all the results.
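The fix boils down to walking every result entry of every test instead of stopping at the first one. A minimal Go sketch of that idea is below; the struct layout and field names (test_categories, results, gpu_ids, status) are assumptions modeled on the attached dcgmi-3-result.json, not a quote of Autopilot's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal structs mirroring the (assumed) layout of dcgmi's JSON output.
type diagOutput struct {
	Diagnostic struct {
		TestCategories []struct {
			Category string `json:"category"`
			Tests    []struct {
				Name    string `json:"name"`
				Results []struct {
					GPUIDs string `json:"gpu_ids"`
					Status string `json:"status"`
				} `json:"results"`
			} `json:"tests"`
		} `json:"test_categories"`
	} `json:"DCGM GPU Diagnostic"`
}

// anyFailure walks every result of every test, not just the first,
// so a "Fail" on a later entry (e.g. GPU 3) is not missed.
func anyFailure(raw []byte) (bool, []string, error) {
	var out diagOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return false, nil, err
	}
	var failures []string
	for _, cat := range out.Diagnostic.TestCategories {
		for _, test := range cat.Tests {
			for _, res := range test.Results {
				if res.Status == "Fail" {
					failures = append(failures,
						fmt.Sprintf("%s/%s GPUs %s", cat.Category, test.Name, res.GPUIDs))
				}
			}
		}
	}
	return len(failures) > 0, failures, nil
}

func main() {
	// Trimmed-down sample reproducing the PCIe case above:
	// first result passes for GPUs 0-2, second fails for GPU 3.
	sample := []byte(`{"DCGM GPU Diagnostic":{"test_categories":[{"category":"Integration","tests":[{"name":"PCIe","results":[{"gpu_ids":"0,1,2","status":"Pass"},{"gpu_ids":"3","status":"Fail"}]}]}]}}`)
	failed, failures, err := anyFailure(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(failed, failures) // true [Integration/PCIe GPUs 3]
}
```

Checking only results[0] here would return Pass and mislabel the node as healthy, which matches the behavior reported above.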

Here is the output from running the same test with the fix applied:

❯ curl "http://localhost:3333/status?check=dcgm&r=3&host=adcpu016"
Asking to run on remote node(s) adcpu016 or with node label None

Initiated connection to ['http://10.129.9.44:3333/status?host=adcpu016&check=dcgm&r=3'].

Autopilot Endpoint: 10.129.9.44
Node: adcpu016
url(s): http://10.129.9.44:3333/status?host=adcpu016&check=dcgm&r=3
Response:

[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

PCIe : Pass
Host adcpu016
[[ DCGM ]] FAIL

Node Status: DCGM Failed, DCGM Failed
-------------------------------------

Node Summary:

{'adcpu016': ['DCGM Failed', 'DCGM Failed']}

runtime: 432.19736194610596 sec

Here is the corresponding output from the Autopilot pod:

I0312 15:35:31.961798       7 healthcheck.go:74] Running health check: dcgm -r 3
I0312 15:42:44.132503       7 healthcheck.go:369] DCGM test completed:
I0312 15:42:44.132562       7 healthcheck.go:376] DCGM cannot be run.
[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

PCIe : Pass
Host adcpu016
[[ DCGM ]] FAIL
I0312 15:42:44.132577       7 healthcheck.go:135] Total time (s) for all checks: 432.170828371
I0312 15:42:44.132603       7 handler.go:78] Errors after running local, on demand health checks: false
I0312 15:42:44.136562       7 functions.go:196] Node adcpu016 label found EVICT
I0312 15:42:44.136626       7 functions.go:199] Cannot patch node's label, value found: EVICT

Autopilot now tries to patch the node with EVICT; however, there seem to be issues with setting the label.

Additional context

dcgmi-3-result.json

Labels

bug