The Run:ai Model Streamer loads models from object storage into GPU memory, streaming the weights through CPU memory. It is available as an optional package for vLLM and can greatly reduce model load times.
To test the model streamer on a GCE instance, create an instance with a single NVIDIA L4 GPU, then install the NVIDIA drivers and CUDA.
- Create a virtual environment and install dependencies
sudo apt install python3.11-venv
sudo apt install python3-pip
mkdir vllm
cd vllm
python3 -m venv venv
- Activate the Environment
source venv/bin/activate
- Transfer Data from Hugging Face to GCS
Run the data-loading/hf-gcs.py script from any machine where you can log in to Hugging Face. Change the variables in the script to match your model, bucket, and folder names:
repo_id="google/gemma-3-4b-it"
local_dir="/tmp"
gcs_bucket_name=""
gcs_prefix="gemma-3-4b-it"
- Install the dependencies needed
pip3 install google-cloud-storage
pip3 install huggingface_hub
- Log in to Hugging Face
Get a token from Hugging Face and use it to log in via the CLI
hf auth login
- Log in to Google Cloud via the CLI if not already logged in
gcloud init
- Run the script to populate the GCS bucket with the model from Hugging Face
python3 data-loading/hf-gcs.py
- Install vLLM with the Run:ai loader
pip3 install "vllm[runai]"
To use the model streamer, add the "--load-format=runai_streamer" flag to the vLLM command. Make sure your GCS bucket does not contain subfolders that hold "original" copies of the OSS model weights.
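One way to check for "original" subfolders before serving is sketched below. It is only an illustration: the helper names `find_original_copies` and `list_bucket_objects` are hypothetical, the sample object names are made up, and the listing step assumes google-cloud-storage is installed and you are already authenticated.

```python
from typing import Iterable, List


def find_original_copies(object_names: Iterable[str], prefix: str) -> List[str]:
    """Return object names that sit in an "original" subfolder under prefix."""
    root = prefix.strip("/")
    flagged = []
    for name in object_names:
        # Ignore objects that are not inside the model folder at all.
        if root and not name.startswith(root + "/"):
            continue
        relative = name[len(root) + 1:] if root else name
        parts = relative.split("/")
        # Any directory component named "original" marks a duplicate copy.
        if "original" in parts[:-1]:
            flagged.append(name)
    return flagged


def list_bucket_objects(bucket_name: str, prefix: str) -> List[str]:
    """List object names under prefix (needs google-cloud-storage and credentials)."""
    from google.cloud import storage  # lazy import so the check above runs offline

    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]


# Hypothetical object names, used only to show the check.
sample = [
    "gemma-3-4b-it/config.json",
    "gemma-3-4b-it/original/consolidated.00.pth",
    "gemma-3-4b-it/model-00001-of-00002.safetensors",
]
print(find_original_copies(sample, "gemma-3-4b-it"))
```

Against a real bucket, you could pass `list_bucket_objects("models-usc", "gemma-3-4b-it")` as the first argument and remove or move anything the check reports.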
Example vLLM command using a model stored at gs://models-usc/gemma-3-4b-it:
vllm serve gs://models-usc/gemma-3-4b-it --load-format=runai_streamer
The following steps can help you use the model streamer on a GKE cluster.
- Set up Workload Identity
If you are using a GKE Autopilot cluster, Workload Identity is enabled by default. If you are using a GKE Standard cluster, you will need to enable it if it is not already on.
- Create IAM rules for Workload Identity bucket access
Create two policy bindings so that the Kubernetes service account you will use in your GKE cluster can access the models in Google Cloud Storage
export BUCKET=""
export PROJECT_NUMBER=""
export PROJECT_ID=""
export SERVICE_ACCOUNT="gcs-access"
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member "principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT" --role roles/storage.bucketViewer
gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member "principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/default/sa/$SERVICE_ACCOUNT" --role roles/storage.objectUser
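The principal string in the two bindings is easy to mistype, so the sketch below shows how it is assembled from its parts. The function names are hypothetical and the project values are placeholders; the output is just the same two gcloud commands as text, which you would still run yourself.

```python
import shlex
from typing import List

# The two roles granted in the step above.
ROLES = ("roles/storage.bucketViewer", "roles/storage.objectUser")


def workload_identity_member(project_number: str, project_id: str,
                             service_account: str, namespace: str = "default") -> str:
    """Build the principal for a Kubernetes service account in a namespace."""
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{project_id}.svc.id.goog"
        f"/subject/ns/{namespace}/sa/{service_account}"
    )


def binding_commands(bucket: str, project_number: str, project_id: str,
                     service_account: str) -> List[str]:
    """Return one add-iam-policy-binding command line per role."""
    member = workload_identity_member(project_number, project_id, service_account)
    return [
        shlex.join([
            "gcloud", "storage", "buckets", "add-iam-policy-binding",
            f"gs://{bucket}", "--member", member, "--role", role,
        ])
        for role in ROLES
    ]


# Placeholder values; substitute your own bucket and project details.
for command in binding_commands("my-bucket", "123456789012", "my-project", "gcs-access"):
    print(command)
```

Note that the principal uses both the project number and the project ID, and that the namespace (`default` here) must match the namespace where the service account lives.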
- Sample deployment.yaml
Use the sample deployment.yaml file, changing the model location and service account, to use the Run:ai streamer on GKE with Workload Identity. If you are using Autopilot, you will also need to modify deployment.yaml to include the machine type needed in Autopilot.
kubectl apply -f deployment.yaml
View the logs to see the model streamer output
kubectl logs [pod name]
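For orientation, the sketch below shows the fields such a deployment would typically need to set. It is not the repository's deployment.yaml: the function name, the deployment name `vllm-streamer`, the `vllm/vllm-openai:latest` image, and the L4 node selector are all assumptions, and it assumes the image ships with the Run:ai streamer extras installed.

```python
import json


def build_deployment(model_uri: str, service_account: str,
                     name: str = "vllm-streamer",
                     image: str = "vllm/vllm-openai:latest") -> dict:
    """Build a Deployment manifest that serves model_uri with the streamer."""
    labels = {"app": name}
    container = {
        "name": "vllm",
        "image": image,
        # Same command as the GCE example, run inside the container.
        "command": ["vllm", "serve", model_uri, "--load-format=runai_streamer"],
        "ports": [{"containerPort": 8000}],
        "resources": {"limits": {"nvidia.com/gpu": "1"}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Service account bound to the bucket via Workload Identity.
                    "serviceAccountName": service_account,
                    # Assumed selector for L4 nodes; adjust for your cluster.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": "nvidia-l4"},
                    "containers": [container],
                },
            },
        },
    }


manifest = build_deployment("gs://models-usc/gemma-3-4b-it", "gcs-access")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be saved to a file and applied with `kubectl apply -f`. The service account named here must already exist in the namespace.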
- Optional: Use Google Inference Quickstart
The Google Inference Quickstart can generate Kubernetes manifests for you that include autoscaling and metrics collection settings. These manifests are being updated to include the Run:ai model streamer.
Example command for gemma-3-4b-it on vLLM
gcloud container ai profiles manifests create --model=google/gemma-3-4b-it --model-server=vllm --accelerator-type=nvidia-l4
GCS Anywhere Cache for Zonal Caching
GCS Anywhere Cache can reduce load times by as much as 30% once a model is cached in the zone where the GPU is located. You can enable the cache with the following commands. This feature is especially useful for scale-out inference workloads.
export ZONE="us-central1-c"
gcloud storage buckets anywhere-caches create gs://$BUCKET $ZONE
## Check the status of the cache
gcloud storage buckets anywhere-caches describe $BUCKET/$ZONE
Example
kubectl create secret generic gcp-token --from-literal=gcp_api_token=xyz
kubectl create secret generic gcp-secret --from-literal=gcp_api_secret=xyz