
[Bug] Running dcgm -r 3, the node is reported as healthy even in the presence of failing checks #68

@mgazz

Description

Summary

When the dcgm check is run with run level 3, Autopilot reports the GPU node as healthy even though the underlying dcgmi command reports failures.

Steps To Reproduce

One of our GPU nodes reports a failure when running the dcgmi diagnostic at run level 3 (Integration -> PCIe):

Defaulted container "nvidia-dcgm-ctr" out of: nvidia-dcgm-ctr, toolkit-validation (init)
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.5                                          |
| Driver Version Detected   | 550.54.15                                      |
| GPU Device IDs Detected   | 20b5,20b5,20b5,20b5                            |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - GPUs: 0, 1, 2                           |
|                           | Fail - GPU: 3                                  |
| Warning                   | GPU 3 GPU 3 is running at PCI link width 8X,   |
|                           | which is below the minimum allowed link gener  |
|                           | ation of 16 (parameter 'min_pci_width') Check  |
|                           |  DCGM and system configuration. This error ma  |
|                           | y be eliminated with an updated configuration  |
|                           | .                                              |
| Info                      | GPU 0 GPU to Host bandwidth:  26.86 GB/s, GPU  |
|                           |  0 Host to GPU bandwidth:  26.16 GB/s, GPU 0   |
|                           | bidirectional bandwidth: 40.53 GB/s, GPU 0 GP  |
|                           | U to Host latency:  1.689 us, GPU 0 Host to G  |
|                           | PU latency:  1.902 us, GPU 0 bidirectional la  |
|                           | tency:  3.316 us                               |
| Info                      | GPU 1 GPU to Host bandwidth:  26.25 GB/s, GPU  |
|                           |  1 Host to GPU bandwidth:  26.77 GB/s, GPU 1   |
|                           | bidirectional bandwidth: 40.24 GB/s, GPU 1 GP  |
|                           | U to Host latency:  1.863 us, GPU 1 Host to G  |
|                           | PU latency:  1.920 us, GPU 1 bidirectional la  |
|                           | tency:  3.757 us                               |
| Info                      | GPU 2 GPU to Host bandwidth:  26.91 GB/s, GPU  |
|                           |  2 Host to GPU bandwidth:  26.52 GB/s, GPU 2   |
|                           | bidirectional bandwidth: 40.34 GB/s, GPU 2 GP  |
|                           | U to Host latency:  1.861 us, GPU 2 Host to G  |
|                           | PU latency:  1.883 us, GPU 2 bidirectional la  |
|                           | tency:  3.747 us                               |
| Info                      | GPU 3 GPU to Host bandwidth:  13.44 GB/s, GPU  |
|                           |  3 Host to GPU bandwidth:  13.28 GB/s, GPU 3   |
|                           | bidirectional bandwidth: 23.45 GB/s, GPU 3 GP  |
|                           | U to Host latency:  1.756 us, GPU 3 Host to G  |
|                           | PU latency:  1.876 us, GPU 3 bidirectional la  |
|                           | tency:  3.556 us                               |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+
command terminated with exit code 226

We installed Autopilot in the cluster and ran the same check via REST, targeting the same node. Below is the output from Autopilot.

❯ curl "http://localhost:3333/status?check=dcgm&r=3&host=adcpu016"
Asking to run on remote node(s) adcpu016 or with node label None

Initiated connection to ['http://10.129.9.28:3333/status?host=adcpu016&check=dcgm&r=3'].

Autopilot Endpoint: 10.129.9.28
Node: adcpu016
url(s): http://10.129.9.28:3333/status?host=adcpu016&check=dcgm&r=3
Response:

[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

[[ DCGM ]] SUCCESS

Node Status: DCGM Failed
-------------------------------------

Node Summary:

{'adcpu016': ['DCGM Failed']}

Below are the logs from the Autopilot pod running on the same node:

I0312 11:33:54.317935       7 healthcheck.go:74] Running health check: dcgm -r 3
I0312 11:41:03.508579       7 healthcheck.go:369] DCGM test completed:
I0312 11:41:03.508622       7 healthcheck.go:376] DCGM cannot be run.
[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

[[ DCGM ]] SUCCESS
I0312 11:41:03.508635       7 healthcheck.go:135] Total time (s) for all checks: 429.190786651
I0312 11:41:03.508652       7 handler.go:78] Errors after running local, on demand health checks: false
I0312 11:41:03.512564       7 functions.go:196] Node adcpu016 label found PASS
I0312 11:41:03.529527       7 functions.go:213] Node patched with label
        {
                "metadata": {
                        "labels": {
                                "autopilot.ibm.com/gpuhealth": "PASS"
                        }
                }
        }

Expected behavior

I would expect the node to be tagged with a label that expresses the presence of errors (e.g. autopilot.ibm.com/gpuhealth: EVICT?).

Evidence

Attached is the JSON output from running the dcgmi command.

Proposed Solution

The problem seems to be related to the way Autopilot parses the JSON generated by the dcgmi command: the function inspects only the first result of each test. In the dcgmi output reported above, the failure happens to be in the second result line of the PCIe test (Fail - GPU: 3).

In https://github.com/mgazz/autopilot/tree/dcgm-run-3-fix we updated the parsing to consider all the results.
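The fix boils down to walking every result entry of every test instead of stopping at the first one. A minimal Go sketch of that idea is below; the struct layout and field names (test_categories, results, gpu_ids, status) are assumptions modeled on the attached dcgmi-3-result.json, not a quote of Autopilot's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal structs mirroring the (assumed) layout of dcgmi's JSON output.
type diagOutput struct {
	Diagnostic struct {
		TestCategories []struct {
			Category string `json:"category"`
			Tests    []struct {
				Name    string `json:"name"`
				Results []struct {
					GPUIDs string `json:"gpu_ids"`
					Status string `json:"status"`
				} `json:"results"`
			} `json:"tests"`
		} `json:"test_categories"`
	} `json:"DCGM GPU Diagnostic"`
}

// anyFailure walks every result of every test, not just the first,
// so a "Fail" on a later entry (e.g. GPU 3) is not missed.
func anyFailure(raw []byte) (bool, []string, error) {
	var out diagOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return false, nil, err
	}
	var failures []string
	for _, cat := range out.Diagnostic.TestCategories {
		for _, test := range cat.Tests {
			for _, res := range test.Results {
				if res.Status == "Fail" {
					failures = append(failures,
						fmt.Sprintf("%s/%s GPUs %s", cat.Category, test.Name, res.GPUIDs))
				}
			}
		}
	}
	return len(failures) > 0, failures, nil
}

func main() {
	// Trimmed-down sample reproducing the PCIe case above:
	// first result passes for GPUs 0-2, second fails for GPU 3.
	sample := []byte(`{"DCGM GPU Diagnostic":{"test_categories":[{"category":"Integration","tests":[{"name":"PCIe","results":[{"gpu_ids":"0,1,2","status":"Pass"},{"gpu_ids":"3","status":"Fail"}]}]}]}}`)
	failed, failures, err := anyFailure(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(failed, failures) // true [Integration/PCIe GPUs 3]
}
```

Checking only results[0] here would return Pass and mislabel the node as healthy, which matches the behavior reported above.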

Here is the output from running the same test with the fix applied:

❯ curl "http://localhost:3333/status?check=dcgm&r=3&host=adcpu016"
Asking to run on remote node(s) adcpu016 or with node label None

Initiated connection to ['http://10.129.9.44:3333/status?host=adcpu016&check=dcgm&r=3'].

Autopilot Endpoint: 10.129.9.44
Node: adcpu016
url(s): http://10.129.9.44:3333/status?host=adcpu016&check=dcgm&r=3
Response:

[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

PCIe : Pass
Host adcpu016
[[ DCGM ]] FAIL

Node Status: DCGM Failed, DCGM Failed
-------------------------------------

Node Summary:

{'adcpu016': ['DCGM Failed', 'DCGM Failed']}

runtime: 432.19736194610596 sec

Here is the corresponding output from the Autopilot pod:

I0312 15:35:31.961798       7 healthcheck.go:74] Running health check: dcgm -r 3
I0312 15:42:44.132503       7 healthcheck.go:369] DCGM test completed:
I0312 15:42:44.132562       7 healthcheck.go:376] DCGM cannot be run.
[[ DCGM ]] Briefings completed. Continue with dcgm evaluation.
[[ DCGM ]] DCGM process terminated with errors. Other processes might be running on GPUs. ABORT
[[ DCGM ]] GPUs currently utilized:
 utilization.gpu [%]
0 %
0 %
0 %
0 %

PCIe : Pass
Host adcpu016
[[ DCGM ]] FAIL
I0312 15:42:44.132577       7 healthcheck.go:135] Total time (s) for all checks: 432.170828371
I0312 15:42:44.132603       7 handler.go:78] Errors after running local, on demand health checks: false
I0312 15:42:44.136562       7 functions.go:196] Node adcpu016 label found EVICT
I0312 15:42:44.136626       7 functions.go:199] Cannot patch node's label, value found: EVICT

Autopilot now tries to patch the node with EVICT; however, there seem to be issues with setting the label.

Additional context

dcgmi-3-result.json

Labels

bug