tekton-operator-proxy-webhook Service selector matches operator pods, causing ~50% webhook admission failures #3227

@bowling233

Environment

  • Tekton Operator version: v0.78.1
  • Kubernetes version: v1.33
  • Platform: Kubernetes (bare-metal)

Description

The tekton-operator-proxy-webhook Service uses name: tekton-operator as
its pod selector. This label is also present on pods of the main
tekton-operator Deployment. As a result, the Service load-balances admission
webhook traffic across both Deployments, even though tekton-operator
pods do not listen on port 8443.

This causes approximately 50% of admission webhook requests to fail with
connection refused. Because the MutatingWebhookConfiguration has
failurePolicy: Fail, each failed call immediately rejects the creation of
the corresponding TaskRun pod.
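The misrouting is easy to see by listing every pod the Service selector matches (the `tekton` namespace is an assumption; substitute your install namespace):

```shell
# Pods matched by the Service selector name=tekton-operator.
# Both the operator pod and the proxy-webhook pod appear, although
# only the latter listens on port 8443.
kubectl get pods -n tekton -l name=tekton-operator \
  -o custom-columns=NAME:.metadata.name,IP:.status.podIP
```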

Steps to Reproduce

  1. Deploy Tekton Operator v0.78.1 on a Kubernetes cluster.
  2. Inspect the endpoints of the tekton-operator-proxy-webhook Service:
     kubectl get endpoints tekton-operator-proxy-webhook -n <namespace>
  3. Observe that the Endpoints list includes pods from both
    tekton-operator and tekton-operator-proxy-webhook Deployments.
  4. Trigger any Pipeline/TaskRun. Observe that roughly half of new TaskRun pod
    creation attempts fail immediately.

Expected Behavior

The tekton-operator-proxy-webhook Service should only route traffic to
tekton-operator-proxy-webhook pods (port 8443). The tekton-operator pods
should never appear in this Service's Endpoints.

Actual Behavior

The Service Endpoints include pods from both Deployments:

# kubectl get endpoints tekton-operator-proxy-webhook -n tekton -o yaml
subsets:
- addresses:
  - ip: 172.26.0.66   # tekton-operator-proxy-webhook pod  ✅ serves on 8443
  - ip: 172.26.1.157  # tekton-operator pod                ❌ does not serve on 8443
  ports:
  - port: 8443

Stress-testing the Service directly via its ClusterIP (20 requests) showed
roughly 50% failing with connection refused (first 10 requests shown):

req-1: PASS (HTTP 415)  req-2: FAIL (000)  req-3: FAIL (000)
req-4: FAIL (000)       req-5: FAIL (000)  req-6: PASS (HTTP 415)
req-7: PASS (HTTP 415)  req-8: FAIL (000)  req-9: PASS (HTTP 415)
req-10: FAIL (000)
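A loop of roughly this shape reproduces the pattern above (a sketch, not the exact script used: it must run from a pod inside the cluster, and the `tekton` namespace is an assumption):

```shell
# Resolve the webhook Service ClusterIP, then fire 20 POSTs at it.
# A healthy webhook backend answers the empty POST (HTTP 415);
# the operator pod refuses the connection, which curl reports as 000.
SVC_IP=$(kubectl get svc tekton-operator-proxy-webhook -n tekton \
  -o jsonpath='{.spec.clusterIP}')
for i in $(seq 1 20); do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 2 \
    -X POST "https://${SVC_IP}:443/defaulting")
  echo "req-${i}: ${code}"
done
```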

The failure manifests as the following error when creating TaskRun pods:

failed to create task run pod "<pod-name>":
Internal error occurred: failed calling webhook "proxy.operator.tekton.dev":
failed to call webhook:
Post "https://tekton-operator-proxy-webhook.<ns>.svc:443/defaulting?timeout=10s":
dial tcp <ClusterIP>:443: connect: connection refused

Note: the error appends a misleading hint ("Maybe missing or invalid Task
…") that does not reflect the real cause.

Root Cause

Both Deployments use the same pod template label name: tekton-operator:

tekton-operator Deployment (config/kubernetes/base/operator.yaml):

selector:
  matchLabels:
    name: tekton-operator   # ← same label
template:
  metadata:
    labels:
      name: tekton-operator # ← same label

tekton-operator-proxy-webhook Deployment
(cmd/kubernetes/operator/kodata/webhook/webhook.yaml):

selector:
  matchLabels:
    name: tekton-operator   # ← collision!
template:
  metadata:
    labels:
      name: tekton-operator # ← collision!

tekton-operator-proxy-webhook Service:

selector:
  name: tekton-operator     # ← matches both Deployments!

The same issue exists in the OpenShift manifest
(cmd/openshift/operator/kodata/webhook/webhook.yaml).
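The collision can also be confirmed on the live objects (namespace `tekton` assumed) by printing each Deployment's pod-template labels:

```shell
# Both Deployments carry name: tekton-operator in their pod templates,
# so the Service selector matches pods from both.
kubectl get deployments -n tekton \
  -o custom-columns=NAME:.metadata.name,POD-LABELS:.spec.template.metadata.labels
```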

Proposed Fix

Change the proxy-webhook Deployment's matchLabels selector and pod template
label from name: tekton-operator to name: tekton-operator-proxy-webhook,
and update the Service selector to match. The existing app: tekton-operator
label remains unchanged. Note that a Deployment's spec.selector is
immutable, so applying this change requires deleting and recreating the
proxy-webhook Deployment.
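Sketched as a manifest fragment (field layout abbreviated; only the changed labels are shown, the exact surrounding spec is per the files referenced above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tekton-operator-proxy-webhook
spec:
  selector:
    matchLabels:
      name: tekton-operator-proxy-webhook   # was: tekton-operator
  template:
    metadata:
      labels:
        name: tekton-operator-proxy-webhook # was: tekton-operator
---
apiVersion: v1
kind: Service
metadata:
  name: tekton-operator-proxy-webhook
spec:
  selector:
    name: tekton-operator-proxy-webhook     # was: tekton-operator
```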

I have a patch ready and will submit a PR.
