The goal of this challenge is to enable the OpenShift container platform to become the premier platform to orchestrate the full life cycle of Foundation Model workflows (pre-processing, training, adaptation/distillation, and inference) seamlessly across public, private, and on-prem cloud environments.
From an operations perspective, infrastructure stability is essential. We have observed various errors and anomalous states in GPUs and the network, for instance, so it is crucial to provide tooling that detects, avoids, and handles infrastructure issues while AI training jobs are running.
We provide a collection of tools (named Autopilot) to detect and address these infrastructure issues automatically through pre-flight checks, in-flight checks, and post-flight checks that help learn from issues and improve the detection logic.
Autopilot runs as a DaemonSet on all worker nodes that have GPUs. All results from health checks are exposed through Prometheus and a Grafana dashboard is available in the utility-tools folder.
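To confirm that the DaemonSet pods are running on the GPU worker nodes (assuming the default `autopilot` namespace described in the installation section below), a quick check is:

```bash
kubectl get daemonset,pods -n autopilot -o wide
```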
The toolkit currently provides health checks for the pre-flight and post-flight phases, while in-flight checks will be enabled in the future. In more detail (list subject to change):

- pre-flight checks
  - validate infrastructure before the start of jobs
- in-flight checks
  - workload and system performance is continuously monitored
  - detect anomalies and decide whether to continue or stop the job
  - issue alerts to end users
- post-flight checks
  - validate infrastructure once the job ends
The health checks currently available in Autopilot are:

- GPU PCIe Link Bandwidth: the NVidia PCIe bandwidth test to check host-to-device connection on each node
- GPU Memory: GPU remapped rows evaluation through `nvidia-smi`
- GPU Memory Bandwidth Performance: GPU memory bandwidth evaluation through DAXPY and DGEMM
- GPU Diagnostics: NVidia DCGM (Data Center GPU Manager) diagnostics through `dcgmi diag`
- GPU Power Slowdown: verify whether power throttling is active through `nvidia-smi`
- Network Reachability: `ping` to evaluate host reachability
- Network Bandwidth: `iperf3` to evaluate network bandwidth and host connectivity
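For reference, the underlying vendor tools can also be run manually on a GPU node; the commands below are an illustrative sketch, not Autopilot's exact invocation:

```bash
# Short DCGM diagnostic (run level 1).
dcgmi diag -r 1

# Show clocks throttle reasons (including power slowdown) reported by the driver.
nvidia-smi -q -d PERFORMANCE
```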
All tests except `iperf3` are executed periodically, every hour by default. The time frame can be customized during installation.
Autopilot provides a `/status` handler that can be queried to get the entire system status, meaning that it will run all the tests on all the nodes. Autopilot is reachable by the service name `autopilot-healthchecks.autopilot.svc` in-cluster only, meaning it can be reached from a pod running in the cluster, or through port forwarding (see below).
Health check names are `pciebw`, `dcgm`, `remapped`, `gpumem`, `ping`, and `iperf`.
For example, using port forwarding to localhost and curl:

```bash
curl "http://localhost:3333/status?check=pciebw&host=nodename1"
```

All tests can be tailored by a combination of:

- `host=<hostname1,hostname2,...>`, to run all tests on a specific node or on a comma-separated list of nodes.
- `check=<healthcheck1,healthcheck2,...>`, to run a single test (`pciebw`, `dcgm`, `remapped`, `gpumem`, `ping`, `iperf`, or `all`) or a comma-separated list of tests. When no parameters are specified, only the `pciebw`, `dcgm`, `remapped`, and `ping` tests are run.
- `batch=<#hosts>`, how many hosts to check at a single moment. Requests in a batch are run in parallel asynchronously. Batching is done to avoid running too many requests in parallel when the number of worker nodes increases. Defaults to all nodes.
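For instance, a combined query using all three parameters (the hostnames are placeholders) could look like the following sketch:

```bash
# Run the PCIe bandwidth and ping checks on two specific nodes, one host at a time.
curl "http://localhost:3333/status?check=pciebw,ping&host=nodename1,nodename2&batch=1"
```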
Some health checks provide further customization.
The DCGM check runs `dcgmi diag`, and only the `r` [parameter](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#command-line-options) is supported.
The default is `1`, but it can be customized, for example with `/status?check=dcgm&r=2`.
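Assuming the service is port-forwarded to localhost as in the earlier example, the full request would be:

```bash
curl "http://localhost:3333/status?check=dcgm&r=2"
```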
The iperf check runs from a client node, which:

- issues several RPCs to start remote `iperf3` servers
- launches a certain number of clients towards each of those servers

Both can be customized:

- `serverspernode` can be used to create a certain number of servers on each remote node.
  - if the value is lower than the number of secondary network interfaces, it will create a minimum of 1 server per interface (excluding `eth0` and `lo`). Each server runs on a separate port.
  - otherwise, it will divide that value by the number of network interfaces existing in the cluster.
- `clientsperiface` can be used to launch a desired number of clients against a single remote server.
Another possible customization is choosing which network plane to test. By default, the data plane is tested, that is, what runs on the secondary interfaces.
To test the connection on `eth0`, that is, the management plane (`mgmt`), use the `plane` parameter as follows: `/status?check=iperf&plane=mgmt`.
In this case, only one client is created and a single server runs per node.
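As a hedged sketch, assuming `serverspernode` and `clientsperiface` are passed as query parameters like the other options, a data-plane bandwidth test could be requested as:

```bash
# Hypothetical example: 2 iperf3 servers per remote node, 4 clients per interface.
curl "http://localhost:3333/status?check=iperf&serverspernode=2&clientsperiface=4"
```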
Health checks can be executed through a utility tool provided with a Helm chart, or by querying the Autopilot service.
Results can be visualized by either checking the logs of the utility tool/service query, or by looking at the data in a Prometheus dashboard.
Metrics are exposed through the `autopilot_health_checks` gauge, and individual health checks can be selected through the `health` label and any of the health check names (except `iperf`).
An example is:

```
autopilot_health_checks{health=~"pciebw"}
```
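Since `health` is a regular Prometheus label, standard label matchers apply; for example, a regex matcher can select several checks at once (assuming the label values match the check names above):

```
autopilot_health_checks{health=~"pciebw|remapped|dcgm"}
```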
Alternatively, it is possible to port-forward the `autopilot-healthchecks` Service and curl from localhost:

```bash
kubectl port-forward service/autopilot-healthchecks 3333:3333 -n autopilot
```

This will print the following output:

```
Forwarding from 127.0.0.1:3333 -> 3333
Forwarding from [::1]:3333 -> 3333
```

Then, in another terminal, run the desired curl command. In this example, we check the PCIe bandwidth on all nodes.
In this scenario, one node reports a value lower than 8GB/s, which results in an alert. This error will be exported to the OpenShift web console and to Slack, if enabled by the admins.

```bash
curl "http://127.0.0.1:3333/status?check=pciebw"
```

The output of the command above will be similar to the following (edited to save space):

```
Checking status on all nodes
Autopilot Endpoint: 10.128.6.187
Node: hostname
url(s): http://10.128.6.187:3333/status?host=hostname&check=pciebw
Response:
Checking system status of host hostname (localhost)
[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] FAIL
Host hostname
12.3 12.3 12.3 12.3 5.3 12.3 12.3 12.3
Node Status: PCIE Failed
-------------------------------------
Autopilot Endpoint: 10.131.4.93
Node: hostname2
url(s): http://10.131.4.93:3333/status?host=hostname2&check=pciebw
Response:
Checking system status of host hostname2 (localhost)
[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] SUCCESS
Host hostname2
12.1 12.0 12.3 12.3 11.9 11.5 12.1 12.1
Node Status: Ok
-------------------------------------
Node Summary:
{'hostname': ['PCIE Failed'],
'hostname2': ['Ok']}
runtime: 31.845192193984985 sec
```

In the example below, we create a utility nginx pod from which we can run curl commands against the autopilot-healthchecks service.
We run the PCIe bandwidth test on all nodes, and we can see it is failing on one node.
Create a dummy nginx pod:

```bash
kubectl create job curl-pod --image=nginx -- sleep inf
```

Then run a health check:

```bash
kubectl exec jobs/curl-pod -- curl "http://autopilot-healthchecks.autopilot.svc:3333/status?check=pciebw"
```

Installation: Autopilot can be installed through Helm and needs admin privileges to create objects like services, service accounts, namespaces, and the relevant RBAC.
- The `helm-git` plugin needs to be installed on all hosts:

```bash
helm plugin install https://github.com/aslafy-z/helm-git --version 0.15.1
```

Helm chart values can be found here.
By default, it will create a namespace named autopilot where the components run. User workloads do not run in the autopilot namespace. The creation of the namespace can be disabled by setting `create` to false in the `namespace` block of the `Values.yaml` file.

```yaml
namespace:
  create: true
  name: autopilot
```

If you do not want to create a new namespace and want to use an existing one instead, set `create: false` and specify the namespace name.
On OpenShift, notice that you must label the namespace with `oc label ns <namespace> openshift.io/cluster-monitoring=true` to have Prometheus scrape metrics from Autopilot.
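For example, to reuse an existing namespace (the name below is a placeholder), the block would look like:

```yaml
namespace:
  create: false
  name: my-existing-namespace
```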
- To pull the image from a private registry, the admin needs to add the `imagePullSecret` data in one of the helm charts. It is possible to avoid the creation of the pull secret by setting the value `create` to false in the `pullSecrets` block and setting the name of the secret that will be used (i.e., `autopilot-pull-secret`).

```yaml
pullSecrets:
  create: true
  name: autopilot-pull-secret
  imagePullSecretData: <encoded-key>
```

- Autopilot runs tests periodically. The default is set to every hour, but it can be customized by changing the following:
```yaml
repeat: <hours>
```

- The PCIe bandwidth critical value defaults to 4GB/s. It can be customized by changing the following:

```yaml
PCIeBW: <val>
```

- If secondary NICs are available through, for instance, Multus or Multi-Nic-Operator, they can be enabled in Autopilot by setting the following:

```yaml
annotations:
  k8s.v1.cni.cncf.io/networks: <network-config-name>
```

All these values can be saved in a config.yaml file, which can be passed to helm.
An example (the image repository and tag are set by default to the ones in this example):
```yaml
namespace:
  create: true
  name: autopilot

image:
  repository: your-repo/autopilot/autopilot
  tag: preferred-tag

pullSecrets:
  create: true
  name: autopilot-pull-secret
  imagePullSecretData: <encoded-key>

annotations:
  k8s.v1.cni.cncf.io/networks: multi-nic-config
```

It is possible to build and push the image through:

```bash
make image
```

You will need to change the IMAGE and TAG environment variables to fit your needs.
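For example, assuming the Makefile reads these variables from the environment (the registry and tag below are placeholders):

```bash
# Hypothetical registry and tag; replace with your own.
IMAGE=registry.example.com/autopilot/autopilot TAG=v1.0.0 make image
```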
- Add the autopilot repo (this is where it checks for ssh keys):

```bash
helm repo add autopilot git+https://github.com/IBM/autopilot.git@autopilot-daemon/helm-charts/autopilot?ref=gh-pages
```

or with ssh keys if preferred:

```bash
helm repo add autopilot git+ssh://git@github.com/IBM/autopilot@autopilot-daemon/helm-charts/autopilot?ref=gh-pages
```

- Install autopilot (idempotent command). The config file is for customizing the helm values. The namespace is where the helm chart will live, not the namespace where Autopilot runs:

```bash
helm upgrade autopilot autopilot/autopilot-daemon --install --namespace=<default> -f your-config.yml
```

The controllers should show up in the selected namespace:
```bash
oc get po -n autopilot
```

```
NAME                               READY   STATUS    RESTARTS   AGE
autopilot-daemon-autopilot-g7j6h   1/1     Running   0          70m
autopilot-daemon-autopilot-g822n   1/1     Running   0          70m
autopilot-daemon-autopilot-x6h8d   1/1     Running   0          70m
autopilot-daemon-autopilot-xhntv   1/1     Running   0          70m
```

To uninstall:

```bash
helm uninstall autopilot -n <namespace-where-chart-resides>
```