Environment
- Tekton Operator version: v0.78.1
- Kubernetes version: v1.33
- Platform: Kubernetes (bare-metal)
Description
The `tekton-operator-proxy-webhook` Service uses `name: tekton-operator` as
its pod selector. This label is also present on pods of the main
`tekton-operator` Deployment. As a result, the Service load-balances admission
webhook traffic across both Deployments, even though `tekton-operator`
pods do not listen on port 8443.
This causes roughly 50% of all admission webhook requests to fail with
`connection refused`. Because the MutatingWebhookConfiguration has
`failurePolicy: Fail`, each failed call immediately rejects the creation of
the corresponding TaskRun pod.
Steps to Reproduce
- Deploy Tekton Operator v0.78.1 on a Kubernetes cluster.
- Inspect the endpoints of the `tekton-operator-proxy-webhook` Service:
  ```shell
  kubectl get endpoints tekton-operator-proxy-webhook -n <namespace>
  ```
- Observe that the Endpoints list includes pods from both the
  `tekton-operator` and `tekton-operator-proxy-webhook` Deployments.
- Trigger any Pipeline/TaskRun. Observe that roughly half of new TaskRun pod
  creation attempts fail immediately.
Expected Behavior
The `tekton-operator-proxy-webhook` Service should route traffic only to
`tekton-operator-proxy-webhook` pods (port 8443). The `tekton-operator` pods
should never appear in this Service's Endpoints.
Actual Behavior
The Service Endpoints include pods from both Deployments:
```yaml
# kubectl get endpoints tekton-operator-proxy-webhook -n tekton -o yaml
subsets:
- addresses:
  - ip: 172.26.0.66   # tekton-operator-proxy-webhook pod ✅ serves on 8443
  - ip: 172.26.1.157  # tekton-operator pod ❌ does not serve on 8443
  ports:
  - port: 8443
```
Stress-testing the Service directly (20 requests via its ClusterIP) showed
roughly 50% failing with `connection refused`:
```text
req-1: PASS (HTTP 415)   req-2: FAIL (000)   req-3: FAIL (000)
req-4: FAIL (000)        req-5: FAIL (000)   req-6: PASS (HTTP 415)
req-7: PASS (HTTP 415)   req-8: FAIL (000)   req-9: PASS (HTTP 415)
req-10: FAIL (000)
```
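The stress test above can be reproduced with a loop along these lines (a sketch; it assumes the `tekton` namespace, cluster access via `kubectl`, and network reachability of the Service's ClusterIP — curl reports HTTP code 000 when the TCP connection itself fails):

```shell
# Sketch: POST to the webhook Service N times and classify each result.
# curl yields code 000 when the connection is refused or times out.
classify() {
  if [ "$1" = "000" ]; then echo "FAIL (000)"; else echo "PASS (HTTP $1)"; fi
}

if command -v kubectl >/dev/null 2>&1; then
  SVC=$(kubectl get svc tekton-operator-proxy-webhook -n tekton \
        -o jsonpath='{.spec.clusterIP}')
  for i in $(seq 1 20); do
    code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 2 \
           -X POST "https://${SVC}:443/defaulting" || true)
    echo "req-${i}: $(classify "${code:-000}")"
  done
else
  echo "kubectl not found; run this from a machine with cluster access" >&2
fi
```

With the label collision present, roughly half of the requests land on a `tekton-operator` pod and report `FAIL (000)`; the rest reach the real webhook, which answers (HTTP 415 for this bare POST, since it is not a proper AdmissionReview request).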
The failure manifests as the following error when creating TaskRun pods:
```text
failed to create task run pod "<pod-name>":
Internal error occurred: failed calling webhook "proxy.operator.tekton.dev":
failed to call webhook:
Post "https://tekton-operator-proxy-webhook.<ns>.svc:443/defaulting?timeout=10s":
dial tcp <ClusterIP>:443: connect: connection refused
```
Note: the error appends a misleading hint ("Maybe missing or invalid Task
…") that does not reflect the real cause.
Root Cause
Both Deployments use the same pod template label `name: tekton-operator`.
`tekton-operator` Deployment (config/kubernetes/base/operator.yaml):
```yaml
selector:
  matchLabels:
    name: tekton-operator   # ← same label
template:
  metadata:
    labels:
      name: tekton-operator # ← same label
```
`tekton-operator-proxy-webhook` Deployment
(cmd/kubernetes/operator/kodata/webhook/webhook.yaml):
```yaml
selector:
  matchLabels:
    name: tekton-operator   # ← collision!
template:
  metadata:
    labels:
      name: tekton-operator # ← collision!
```
`tekton-operator-proxy-webhook` Service:
```yaml
selector:
  name: tekton-operator     # ← matches both Deployments!
```
The same issue exists in the OpenShift manifest
(cmd/openshift/operator/kodata/webhook/webhook.yaml).
Proposed Fix
Change the proxy-webhook Deployment's `matchLabels` selector and pod template
label from `name: tekton-operator` to `name: tekton-operator-proxy-webhook`,
and update the Service selector to match. The existing `app: tekton-operator`
label remains unchanged.
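A sketch of the corrected manifests under this proposal (field paths taken from the snippets above; surrounding fields omitted):

```yaml
# cmd/kubernetes/operator/kodata/webhook/webhook.yaml (proxy-webhook Deployment)
selector:
  matchLabels:
    name: tekton-operator-proxy-webhook   # was: tekton-operator
template:
  metadata:
    labels:
      name: tekton-operator-proxy-webhook # was: tekton-operator
---
# tekton-operator-proxy-webhook Service
selector:
  name: tekton-operator-proxy-webhook     # now matches only proxy-webhook pods
```

Note that a Deployment's `spec.selector` is immutable, so an in-place upgrade may need to delete and recreate the proxy-webhook Deployment rather than patch it.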
I have a patch ready and will submit a PR.