- The solution is built with security as a priority: the managed NGINX ingress controller performs SSL termination and is fronted by Azure Application Gateway.
- Both the inferencing and embedding APIs are exposed through Azure API Management (APIM), as sketched below.
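As a rough sketch of the two points above (resource names and the OpenAPI spec URL are placeholders, not values from this repo), the managed NGINX ingress add-on can be enabled on the cluster and an API imported into APIM from the Azure CLI:

```bash
# Enable the managed NGINX ingress controller (application routing add-on).
az aks approuting enable \
  --resource-group <your-resource-group-name> \
  --name <your-aks-cluster-name>

# Import the inferencing API into API Management from its OpenAPI spec.
# <your-apim-name> and the spec URL are assumptions for illustration.
az apim api import \
  --resource-group <your-resource-group-name> \
  --service-name <your-apim-name> \
  --api-id inference \
  --path inference \
  --specification-format OpenApi \
  --specification-url https://<your-ingress-host>/openapi.json
```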
NOTE: To deploy N-series (GPU) VMs, your subscription must have quota approved for the N-series VM family.
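You can check the quota currently available in your region before deploying, for example:

```bash
# List compute quota in the target region and filter for NC-family (GPU) SKUs.
az vm list-usage --location <your-location> --output table | grep "NC"
```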
- Automated scaling of the AKS cluster based on load (see the autoscaler sketch after this list)
- Resource management and cost optimization compared to PTUs (provisioned throughput units)
- High availability through self-healing
- Edge computing with inferencing at the edge
- Secure and compliant with data residency requirements
- Streamlined deployment and management
- Observability and monitoring
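For the autoscaling item above, a minimal sketch of enabling the cluster autoscaler on the GPU node pool (the pool name `gpunp` is an assumption):

```bash
# Let AKS scale the GPU node pool between 1 and 3 nodes based on pending pods.
az aks nodepool update \
  --resource-group <your-resource-group-name> \
  --cluster-name <your-aks-cluster-name> \
  --name gpunp \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```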
- Enable mTLS between APIM and the NGINX ingress controller; one way to implement this is sketched below.
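A minimal sketch of the NGINX side, assuming APIM presents a client certificate signed by a CA you control. The namespace `llm`, secret name, and ingress name are placeholders; these are standard ingress-nginx annotations, so confirm the managed add-on honors them in your setup:

```bash
# Store the CA certificate that signed APIM's client certificate.
kubectl create secret generic apim-client-ca --from-file=ca.crt=ca.crt -n llm

# Tell NGINX to require and verify a client certificate against that CA.
kubectl annotate ingress llm-ingress -n llm \
  nginx.ingress.kubernetes.io/auth-tls-secret=llm/apim-client-ca \
  nginx.ingress.kubernetes.io/auth-tls-verify-client=on
```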
- Install the NVIDIA device plugin for Kubernetes (k8s-device-plugin)
- Install KubeRay for distributed inference
- To view the Ray dashboard:

```bash
kubectl port-forward service/${RAYCLUSTER_NAME}-head-svc 8265:8265
```
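Once the port-forward is running, the dashboard is available at http://localhost:8265.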
- Create a resource group:

```bash
az group create --name <your-resource-group-name> --location <your-location>
```
- Create the infrastructure using Bicep:

```bash
az deployment group create --resource-group <your-resource-group-name> --template-file init.bicep
```
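Optionally, preview the changes the template would make before deploying:

```bash
az deployment group what-if --resource-group <your-resource-group-name> --template-file init.bicep
```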
- Connect to the AKS cluster:

```bash
az aks get-credentials --resource-group <your-resource-group-name> --name <your-aks-cluster-name>
```
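To confirm kubectl is pointed at the right cluster, list the nodes; the GPU node pool should show as Ready:

```bash
kubectl get nodes -o wide
```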
- Install the NVIDIA device plugin for Kubernetes:

```bash
kubectl apply -f nvidia-device-plugin.yml
```
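To verify the plugin registered the GPUs, check that each GPU node reports allocatable `nvidia.com/gpu` capacity:

```bash
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```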
- Install KubeRay for distributed inference:

```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both the CRDs and the KubeRay operator v1.3.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0
```
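A quick check that the operator came up (assuming the default release name from the command above):

```bash
# The Helm release creates a deployment named kuberay-operator.
kubectl get deployment kuberay-operator
kubectl get pods
```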
- Deploy the LLM:

```bash
kubectl apply -f raysvc-llama3-8b-A100.yaml
```
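Before testing, you can watch the RayService converge; the head and worker pods should reach Running:

```bash
# Resource names come from raysvc-llama3-8b-A100.yaml.
kubectl get rayservice
kubectl get pods
```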
- Test the deployment:

```bash
kubectl port-forward svc/<NAME> 8000
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Provide a brief sentence describing the Ray open-source project."}
    ],
    "temperature": 0.7
}'
```
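The endpoint is OpenAI-compatible, so a successful response should be a chat completion JSON object with the model's answer in `choices[0].message.content`.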