
Load Data into GPUs fast with Model Streamer in vLLM on Google Cloud

The Run:ai Model Streamer streams model weights from object storage directly into GPU memory, staging them through CPU memory with concurrent reads. It is available as an optional vLLM dependency and greatly accelerates model load times.

Prep a VM for vLLM Testing

To test the model streamer on a GCE instance, create one with a single NVIDIA L4 GPU, then install the NVIDIA drivers and CUDA.

  1. Create a virtual environment and install dependencies

     sudo apt install python3.11-venv
     sudo apt install python3-pip
     mkdir vllm
     cd vllm
     python3 -m venv venv

  2. Activate the environment

     source venv/bin/activate

  3. Transfer data from Hugging Face to GCS

     Use the data-loading/hf-gcs.py script from any machine that can log in to Hugging Face. Change the variables to match your model and folder names:

     repo_id="google/gemma-3-4b-it"
     local_dir="/tmp"
     gcs_bucket_name=""
     gcs_prefix="gemma-3-4b-it"

  4. Install the dependencies needed

     pip3 install google-cloud-storage
     pip3 install huggingface_hub

  5. Log in to Hugging Face

     Get a token from Hugging Face and use it to log in via the CLI:

     hf auth login

  6. Log in to Google Cloud via the CLI if not already logged in

     gcloud init

  7. Run the script to populate the GCS bucket with the model from Hugging Face

     python3 data-loading/hf-gcs.py

  8. Install vLLM with the Run:ai streamer extra (quoted so the shell does not glob the brackets)

     pip3 install "vllm[runai]"
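The transfer script itself is not reproduced in this README; the sketch below shows the kind of logic a hf-gcs.py-style script might contain, assuming the huggingface_hub and google-cloud-storage packages from the dependency step. Function and variable names here are illustrative, not the repo's actual code.

```python
import os

def gcs_object_name(local_dir: str, file_path: str, prefix: str) -> str:
    """Map a downloaded file path to its GCS object name under prefix.

    e.g. /tmp/model/config.json with prefix "gemma-3-4b-it"
    becomes "gemma-3-4b-it/config.json".
    """
    rel = os.path.relpath(file_path, local_dir)
    return f"{prefix}/{rel}".replace(os.sep, "/")

def transfer(repo_id: str, local_dir: str, bucket_name: str, prefix: str) -> None:
    """Download a Hugging Face snapshot and upload every file to GCS."""
    # Deferred imports: only needed when a transfer actually runs.
    from huggingface_hub import snapshot_download
    from google.cloud import storage

    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    bucket = storage.Client().bucket(bucket_name)
    for root, _dirs, files in os.walk(path):
        for name in files:
            src = os.path.join(root, name)
            bucket.blob(gcs_object_name(path, src, prefix)).upload_from_filename(src)

# Usage (requires Hugging Face and Google Cloud credentials), e.g.:
# transfer("google/gemma-3-4b-it", "/tmp", "<your-bucket>", "gemma-3-4b-it")
```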

Running vLLM with a model streamed from GCS

To use the model streamer you just need to add the "--load-format=runai_streamer" flag. Make sure your GCS bucket does not contain subfolders (such as "original") holding duplicate copies of the open-weight model files.

Example vllm command using a model stored at gs://models-usc/gemma-3-4b-it:

vllm serve gs://models-usc/gemma-3-4b-it --load-format=runai_streamer 

GIQ & Kubernetes Manifests

The following steps can help you use the model streamer on a GKE cluster.

  1. Set up Workload Identity

    If you are using an Autopilot GKE cluster, Workload Identity is enabled by default. If you are using a GKE Standard cluster, you will need to enable it if it is not already on.

  2. Create IAM bindings for Workload Identity bucket access

    Create two policy bindings so the service account you will use in your GKE cluster can access the models in Google Cloud Storage:

    export BUCKET=""
    export PROJECT_NUMBER=""
    export PROJECT_ID=""
    export SERVICE_ACCOUNT="gcs-access"
    
    gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.bucketViewer
    
    gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT --role roles/storage.objectUser
  3. Sample deployment.yaml

    Use the sample deployment.yaml file, changing the model location and service account, to run the Run:ai streamer on GKE with Workload Identity. If you are using Autopilot, you will also need to modify deployment.yaml to request the machine type you need.

    kubectl apply -f deployment.yaml

    View the logs to see the model streamer output

    kubectl logs [pod name]
  4. Optional: use the Google Inference Quickstart

    The Google Inference Quickstart can generate Kubernetes manifests for you that include all autoscaling and metrics-collection settings. These manifests are being updated to include the Run:ai model streamer.

    Example command for gemma-3-4b-it on vLLM:

    gcloud container ai profiles manifests create --model=google/gemma-3-4b-it --model-server=vllm --accelerator-type=nvidia-l4 
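The sample deployment.yaml is not reproduced in this README. A minimal sketch of what such a manifest might look like is below, assuming the vllm/vllm-openai container image and the gcs-access Kubernetes service account from step 2; all other names and values are illustrative and should be adapted to the repo's actual file.

```yaml
# Illustrative sketch only -- not the repo's actual deployment.yaml.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gcs-access        # must match the SA in the IAM bindings above
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-streamer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-streamer
  template:
    metadata:
      labels:
        app: vllm-streamer
    spec:
      serviceAccountName: gcs-access
      containers:
      - name: vllm
        # Image is an assumption; it must include the vllm "runai" extra.
        image: vllm/vllm-openai:latest
        args:
        - --model=gs://models-usc/gemma-3-4b-it
        - --load-format=runai_streamer
        resources:
          limits:
            nvidia.com/gpu: "1"
```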

GCS Anywhere Cache for Zonal Caching

GCS Anywhere Cache can reduce load times by as much as 30% once a model is cached in the zone where the GPU is located, which is especially useful for scale-out inference workloads. You can enable the cache with the following commands.

    export ZONE="us-central1-c"

    gcloud storage buckets anywhere-caches create gs://$BUCKET $ZONE

    # Check the status of the cache
    gcloud storage buckets anywhere-caches describe $BUCKET/$ZONE

Creating Secrets in Kubernetes

Example:

kubectl create secret generic gcp-token --from-literal=gcp_api_token=xyz

kubectl create secret generic gcp-secret --from-literal=gcp_api_secret=xyz
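If a pod needs one of these secrets, the standard way to surface it is a secretKeyRef in the container spec. A hypothetical fragment, matching the gcp-token secret created above (the environment variable name is an assumption):

```yaml
# Hypothetical container-spec fragment; env var name is illustrative.
env:
- name: GCP_API_TOKEN
  valueFrom:
    secretKeyRef:
      name: gcp-token
      key: gcp_api_token
```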
