K8s Ops Agent

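An intelligent operations assistant built on LangChain and the K8s API. It automatically collects and analyzes cluster failure information and uses an LLM to diagnose the cause.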

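Features

Automatically watches for abnormal Pod/Node/Event activity
Automatically pulls logs, events, and Prometheus metrics as diagnostic context
Uses an LLM to diagnose the root cause and suggest fixes
Pushes diagnosis results through a Webhook (Feishu, DingTalk, and others)
Event deduplication and rate limiting
Interactive chat (query status, diagnose problems)
Optional RAG knowledge base enhancement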
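
Event deduplication and rate limiting can be sketched as follows. This is a minimal illustration, not the code in event_deduplicator.py: the class names, the 300-second window, and the token-bucket approach are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class EventDeduplicator:
    """Suppress repeats of the same event within a time window (hypothetical class)."""

    def __init__(self, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        self._last_seen: Dict[Tuple[str, str, str, str], float] = {}

    def should_process(self, kind: str, namespace: str, name: str, reason: str) -> bool:
        """Return True the first time a key is seen, or once its window has expired."""
        key = (kind, namespace, name, reason)
        now = self.clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window
        self._last_seen[key] = now
        return True


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the call."""
        now = self.clock()
        elapsed = now - self._updated
        self._tokens = min(float(self.capacity), self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

An event would be diagnosed only when both `should_process(...)` and `allow()` return True, so a crash-looping Pod neither floods the Webhook nor burns LLM quota.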

Project structure

k8s-ops-agent/
├── src/
│   ├── agent.py                 # Main entry point
│   ├── k8s_client.py            # K8s API wrapper
│   ├── prometheus_client.py     # Prometheus integration
│   ├── context_builder.py       # Context building
│   ├── diagnosis_engine.py      # LLM diagnosis
│   ├── rag_engine.py            # RAG knowledge base
│   ├── webhook.py               # Webhook notifications
│   ├── event_deduplicator.py    # Event deduplication
│   ├── health_server.py         # Health checks
│   ├── command_executor.py      # Command execution
│   ├── chat_server.py           # Chat service
│   └── intent_recognition.py    # Intent recognition
├── config/
│   └── config.yaml              # Configuration file
├── prompts/
│   └── diagnosis_prompt.yaml    # Prompt template
├── docs/                        # RAG knowledge base documents
├── k8s-deploy.yaml              # K8s deployment manifest
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── requirements-rag.txt         # RAG dependencies

Deployment

Option 1: Run locally

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set environment variables
export OPENAI_API_KEY="sk-xxx"
export WEBHOOK_URL="https://hooks.xxx.com/xxx"

# 3. Edit the configuration file config/config.yaml

# 4. Run
python -m src.agent -c config/config.yaml

Option 2: Run with Docker

# 1. Build the image
docker build -t k8s-ops-agent:latest .

# 2. Run the container (mount kubeconfig so it can reach K8s)
docker run -d \
  --name k8s-ops-agent \
  -e OPENAI_API_KEY="sk-xxx" \
  -e WEBHOOK_URL="https://hooks.xxx.com/xxx" \
  -v ~/.kube/config:/root/.kube/config:ro \
  k8s-ops-agent:latest

# Or use docker-compose
docker-compose up -d

Option 3: Run inside a K8s cluster

# 1. Build and push the image (tag it with the registry prefix so the push finds it)
docker build -t your-registry/k8s-ops-agent:latest .
docker push your-registry/k8s-ops-agent:latest

# 2. Edit the Secret (API key) in k8s-deploy.yaml

# 3. Deploy
kubectl apply -f k8s-deploy.yaml

# 4. Verify
kubectl get pods -l app=k8s-ops-agent
kubectl get svc k8s-ops-agent

Configuration

Key settings in config/config.yaml:

# K8s settings
k8s:
  kubeconfig_path: ""  # Leave empty to use the in-cluster configuration

# Prometheus settings
prometheus:
  url: "http://prometheus:9090"

# LLM settings
llm:
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"

# Agent settings
agent:
  watch_resources:
    - kind: "Pod"
      namespaces: ["default", "kube-system"]
    - kind: "Node"
      namespaces: []
  event_filters:
    reason: ["Failed", "Unhealthy", "NodeNotReady"]
    type: ["Warning", "Error"]
  enable_command_execution: false  # Command execution is off by default

# Webhook settings
webhook:
  url: "${WEBHOOK_URL}"
  confirmation_required: true
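
The `${OPENAI_API_KEY}` and `${WEBHOOK_URL}` placeholders imply that the values are taken from the environment at load time. How the project actually resolves them is not shown here; the sketch below is one plausible approach, applied to a config that has already been parsed into plain dicts and lists. The function name `expand_env` and the choice to fail on unset variables are assumptions.

```python
import os
import re
from typing import Any, Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${NAME} placeholders in a parsed config (hypothetical helper)."""
    if isinstance(value, str):
        def substitute(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                # Failing fast avoids sending a literal "${...}" as an API key.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(substitute, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # booleans, numbers, and None pass through unchanged


if __name__ == "__main__":
    config = {"llm": {"api_key": "${OPENAI_API_KEY}", "model": "gpt-4"}}
    print(expand_env(config, {"OPENAI_API_KEY": "sk-xxx"}))
```

Raising on a missing variable is a design choice for the sketch; substituting an empty string would also be defensible but tends to hide misconfiguration until the first LLM call.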

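Usage

1. Passive watch mode

Once started, the agent watches cluster events automatically. When it detects an anomaly, it:

Pulls the related logs, events, and Prometheus metrics
Calls the LLM to diagnose the root cause
Pushes the diagnosis result through the Webhook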
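
Before any of these steps run, an incoming event has to be checked against `watch_resources` and `event_filters` from config/config.yaml. The exact matching rules are not documented, so the sketch below makes three assumptions: values are compared by exact match, every non-empty filter field must match, and an empty `namespaces` list means "no namespace restriction". The event is represented as a plain dict with hypothetical keys.

```python
from typing import Any, Dict, List


def is_watched(event: Dict[str, Any], watch_resources: List[Dict[str, Any]]) -> bool:
    """Return True if the event's kind (and namespace, when restricted) is watched."""
    for resource in watch_resources:
        if resource.get("kind") != event.get("kind"):
            continue
        namespaces = resource.get("namespaces") or []
        # Assumption: an empty list means the resource is watched everywhere.
        if not namespaces or event.get("namespace") in namespaces:
            return True
    return False


def matches_filters(event: Dict[str, Any], filters: Dict[str, List[str]]) -> bool:
    """Return True if the event satisfies every non-empty filter field."""
    for field, allowed in filters.items():
        if allowed and event.get(field) not in allowed:
            return False
    return True


def should_diagnose(event: Dict[str, Any], agent_config: Dict[str, Any]) -> bool:
    """Combine both checks using the `agent` section of the config."""
    return (is_watched(event, agent_config.get("watch_resources", []))
            and matches_filters(event, agent_config.get("event_filters", {})))
```

With the sample configuration, a `Warning` event with reason `Unhealthy` on a Pod in `default` would pass, while the same event in an unlisted namespace would be ignored.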

2. Interactive chat mode

The agent exposes an HTTP API for on-demand queries:

# Query a Pod's status (message: "show the redis pod status")
curl -X POST http://localhost:8081/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "查看 redis pod 状态"}'

# List all Pods (message: "pods in the default namespace")
curl -X POST http://localhost:8081/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "default 命名空间的 pods"}'

# Diagnose a Pod (message: "diagnose the nginx pod")
curl -X POST http://localhost:8081/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "诊断 nginx pod"}'

# Execute a command, confirmation required (message: "delete the test pod")
curl -X POST http://localhost:8081/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "删除 test pod"}'
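
The same requests can be issued from Python with only the standard library. This is a small client sketch: the helper names are made up for the example, and because the response schema is not documented here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request


def build_chat_request(message: str,
                       base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build the POST request for /api/chat without sending it (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII messages readable in the request body.
    body = json.dumps({"message": message}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(message: str,
              base_url: str = "http://localhost:8081",
              timeout: float = 30.0) -> str:
    """Send a chat message and return the raw response body as text."""
    request = build_chat_request(message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running agent listening on port 8081.
    print(send_chat("诊断 nginx pod"))
```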

3. Health checks

curl http://localhost:8080/health   # Liveness status
curl http://localhost:8080/ready    # Readiness status (includes K8s/Prometheus connectivity checks)
curl http://localhost:8080/metrics  # Metrics and statistics
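
In a deployment script it can be useful to block until the agent reports ready. The sketch below polls an endpoint such as /ready and assumes it answers with a 2xx status when ready and a non-2xx status or a refused connection otherwise; that status-code contract is an assumption, as are the function names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 30,
                     delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_ok("http://localhost:8080/ready"))
    print("agent ready" if ready else "agent did not become ready in time")
```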

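RAG knowledge base (optional)

To use your company's internal operations knowledge base:

# 1. Install the RAG dependencies
pip install -r requirements-rag.txt

# 2. Place knowledge documents in the docs/ directory
docs/
├── 故障处理指南.md    # troubleshooting guide
├── 部署手册.md        # deployment manual
└── ...

# 3. Enable RAG (edit config/config.yaml)
rag:
  enabled: true
  docs_path: "docs"
  embedding_model: "BAAI/bge-small-zh-v1.5"

On first start the agent builds the vector knowledge base automatically. Subsequent diagnoses retrieve relevant documents from it to assist the LLM.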
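
The retrieval step can be illustrated independently of the embedding model. The sketch below assumes each document chunk has already been turned into a vector (in the real setup that would be the job of the configured `embedding_model`) and ranks chunks by cosine similarity to the query vector. The function names and toy vectors are illustrative only; rag_engine.py may well use a vector store library instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: Sequence[float],
          indexed_chunks: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vector))
              for text, vector in indexed_chunks]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings.
    index = [
        ("OOMKilled troubleshooting", [1.0, 0.0]),
        ("Deployment manual", [0.0, 1.0]),
        ("Network policy notes", [1.0, 1.0]),
    ]
    for text, score in top_k([1.0, 0.1], index, k=2):
        print(f"{score:.3f}  {text}")
```

The retrieved chunk texts would then be appended to the diagnosis prompt alongside the logs, events, and metrics.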

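Security

By default the agent has read-only access to K8s
Remediation commands are not executed by default (dry_run mode)
Sensitive values are injected through environment variables or Secrets
Command execution is restricted to kubectl commands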
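
The last two rules can be pictured as a small guard in front of the executor. The sketch below is not the code in command_executor.py; the function names, the list of rejected characters, and the 60-second timeout are assumptions. It accepts only commands whose first token is `kubectl`, never goes through a shell, and returns the command unexecuted unless `dry_run` is turned off.

```python
import shlex
import subprocess
from typing import List

# Characters that only make sense to a shell; rejected as defense in depth.
_FORBIDDEN = set(";|&$`<>\n")


def parse_kubectl_command(command: str) -> List[str]:
    """Split a command string and verify that it is a plain kubectl invocation."""
    if any(char in _FORBIDDEN for char in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not argv or argv[0] != "kubectl":
        raise ValueError("only kubectl commands are allowed")
    return argv


def run_command(command: str, dry_run: bool = True) -> str:
    """Validate the command; execute it only when dry_run is explicitly disabled."""
    argv = parse_kubectl_command(command)
    if dry_run:
        return "[dry-run] " + shlex.join(argv)
    # argv is passed as a list, so no shell ever interprets the arguments.
    completed = subprocess.run(argv, capture_output=True, text=True,
                               check=False, timeout=60)
    return completed.stdout or completed.stderr


if __name__ == "__main__":
    print(run_command("kubectl delete pod test -n default"))
```

Because the agent's K8s permissions are read-only by default, a mutating command would still be rejected by the API server even if this guard were bypassed; the two layers are complementary.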

Ports

Port	Service	Paths
8080	Health Server	/health, /ready, /metrics
8081	Chat Server	/chat, /api/chat, /api/status
