Claude Agent SDK — SRE Incident Response Agent: System Prompts
Source: The Site Reliability Agent
System Prompt
You are an expert SRE incident response bot. Your job is to investigate
production incidents quickly and thoroughly.
Investigation approach:
1. Start with get_service_health for a quick overview
2. Drill into error rates to identify affected services
3. Check latency — high latency often precedes errors
4. Investigate resources — DB connections, CPU, memory
5. Read container logs for specific error messages
6. Check config files for misconfigurations
7. Correlate and conclude — connect symptoms to root cause
Note: The api-server has baseline error noise (~0.1-0.2 errors/sec). Focus on significant spikes.
Be thorough but efficient. Always explain your reasoning.
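Key Design Insights
- The system prompt is deliberately kept simple: it gives only a general investigation methodology (start with a health check → drill into metrics → read logs → correlate findings) and does not specify which tools to call, in what order, or how to interpret the results
- Tool descriptions drive agent behavior, and do so more effectively than a carefully crafted prompt. Each MCP tool has a rich JSON Schema description, and the agent uses it to decide when to call which tool
- Invest in tools and the environment rather than in the prompt: “invest in giving Claude access to the right tools and environment to perform tasks”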
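To make the second point concrete, here is a minimal sketch of a tool definition in which the description carries the guidance that the system prompt leaves out. The dict layout (`name`, `description`, `inputSchema`) follows the general shape of an MCP tool listing; the description wording, the `service` parameter, and the helper `describe_tools` are illustrative assumptions, not the original server's code.

```python
# Hypothetical tool definition: the description tells the agent when to call
# the tool and how to read its output, so the system prompt can stay short.
GET_SERVICE_HEALTH_TOOL = {
    "name": "get_service_health",
    "description": (
        "Get a quick health overview of all services. "
        "Use this first in an investigation, before querying individual metrics. "
        "Note: api-server has baseline error noise (~0.1-0.2 errors/sec); "
        "only treat significant spikes as incidents."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                # Assumed parameter, added only for illustration.
                "description": "Optional service name to narrow the overview.",
            }
        },
        "required": [],
    },
}


def describe_tools(tools):
    """Render each tool's name and description, roughly what the agent reads."""
    return "\n\n".join(f"{t['name']}: {t['description']}" for t in tools)
```

Calling `describe_tools([GET_SERVICE_HEALTH_TOOL])` shows how much procedural guidance can live in the tool listing itself rather than in the prompt.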
Agent Configuration
import sys

from claude_agent_sdk import ClaudeAgentOptions

# SYSTEM_PROMPT, MCP_SERVER_PATH, HOOKS_DIR, and MODEL are defined elsewhere in the source.
options = ClaudeAgentOptions(
    system_prompt=SYSTEM_PROMPT,
    # Launch the SRE MCP server as a subprocess using the current interpreter.
    mcp_servers={
        "sre": {
            "command": sys.executable,
            "args": [str(MCP_SERVER_PATH)],
        }
    },
    allowed_tools=[
        # Investigation tools
        "mcp__sre__query_metrics",
        "mcp__sre__list_metrics",
        "mcp__sre__get_service_health",
        "mcp__sre__get_logs",
        "mcp__sre__get_alerts",
        "mcp__sre__get_recent_deployments",
        "mcp__sre__execute_runbook",
        # Remediation tools
        "mcp__sre__read_config_file",
        "mcp__sre__edit_config_file",
        "mcp__sre__run_shell_command",
        "mcp__sre__get_container_logs",
        # Documentation tools
        "mcp__sre__write_postmortem",
    ],
    # Each write tool is paired with a validation script that runs before the tool call.
    hooks={
        "PreToolUse": [
            {
                "matcher": "mcp__sre__edit_config_file",
                "hooks": [{"type": "command", "command": f"bash {HOOKS_DIR}/validate_pool_size.sh"}],
            },
            {
                "matcher": "mcp__sre__run_shell_command",
                "hooks": [{"type": "command", "command": f"bash {HOOKS_DIR}/validate_config_before_deploy.sh"}],
            },
        ],
    },
    permission_mode="acceptEdits",
    model=MODEL,
)
Tool Description Examples
Tool descriptions are the key to letting the agent work autonomously. Examples:
query_metrics
Query Prometheus metrics using PromQL.
Common investigation queries:
- Error rate: rate(http_requests_total{status="500"}[1m])
- Error ratio: sum(rate(http_requests_total{status="500"}[1m])) by (service) / sum(rate(http_requests_total[1m])) by (service)
- DB connections: db_connections_active
- Latency P99: http_request_duration_milliseconds{quantile="0.99"}
- CPU usage: process_cpu_seconds_total
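The error-ratio query above is the one most likely to be retyped with a different window or status code, so a small builder can help keep the two halves of the division in sync. This is a sketch only: the function name `error_ratio_query` and its parameters are assumptions for illustration, and the metric and label names are taken from the description above.

```python
def error_ratio_query(window: str = "1m", status: str = "500", group_by: str = "service") -> str:
    """Build the per-service error-ratio PromQL expression from the tool description."""
    # Numerator: request rate for the given status code, summed per group.
    errors = f'sum(rate(http_requests_total{{status="{status}"}}[{window}])) by ({group_by})'
    # Denominator: total request rate over the same window and grouping.
    total = f"sum(rate(http_requests_total[{window}])) by ({group_by})"
    return f"{errors} / {total}"
```

With the defaults, the function returns exactly the error-ratio expression listed in the tool description; passing `window="5m"` smooths short bursts at the cost of slower detection.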
edit_config_file
Edit a configuration file to fix misconfigurations.
ONLY use this for remediation after confirming the root cause.
Restricted to files in the config/ directory.
run_shell_command
Run a shell command for infrastructure management.
Restricted to docker-compose and docker commands only.
Use for: restarting services, checking container status, rebuilding images.
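The restrictions stated in these two descriptions also need to be enforced in the tool handlers, since a description alone does not stop a bad call. Below is a minimal sketch of what such checks might look like. The names `CONFIG_DIR`, `ALLOWED_COMMANDS`, `is_allowed_config_path`, and `is_allowed_command` are hypothetical, and the sketch assumes the command is later executed without a shell.

```python
import shlex
from pathlib import Path

CONFIG_DIR = Path("config").resolve()            # assumed location of editable configs
ALLOWED_COMMANDS = {"docker", "docker-compose"}  # assumed allowlist of executables
SHELL_METACHARACTERS = set(";|&$`<>\n")          # reject chaining and redirection


def is_allowed_config_path(path: str) -> bool:
    """Accept only paths that resolve to a location inside CONFIG_DIR."""
    resolved = Path(path).resolve()  # collapses ".." segments before the check
    return resolved.is_relative_to(CONFIG_DIR)


def is_allowed_command(command: str) -> bool:
    """Accept only commands whose first token is an allowlisted executable."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS
```

Resolving the path before comparing it matters: a plain string-prefix check would accept `config/../secrets.env`, whereas the resolved path falls outside `config/` and is rejected.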
Safety Hooks
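Two layers of defense:
- Checks built into the tool handlers: directory restrictions and a command-prefix allowlist
- PreToolUse hooks: shell scripts that validate the content of a change, not just its location
validate_pool_size.sh: checks that DB_POOL_SIZE is within the 5-100 range
validate_config_before_deploy.sh: validates that the configuration is sane before deployment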
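The content check performed by validate_pool_size.sh can be illustrated in Python. The original is a shell script whose contents are not shown in these notes, so this is a sketch of the same idea rather than a port: the function name `validate_pool_size`, the `KEY=VALUE` config format, and the error messages are all assumptions. Only the 5-100 range comes from the notes.

```python
import re

POOL_SIZE_MIN, POOL_SIZE_MAX = 5, 100  # range stated in the notes


def validate_pool_size(config_text: str) -> tuple[bool, str]:
    """Check every DB_POOL_SIZE=<n> assignment in the proposed config text."""
    # Assumes one KEY=VALUE assignment per line, with unquoted values.
    for match in re.finditer(r"^\s*DB_POOL_SIZE\s*=\s*(\S+)", config_text, re.MULTILINE):
        raw = match.group(1)
        if not raw.isdigit():
            return False, f"DB_POOL_SIZE must be a non-negative integer, got {raw!r}"
        value = int(raw)
        if not POOL_SIZE_MIN <= value <= POOL_SIZE_MAX:
            return False, f"DB_POOL_SIZE={value} is outside {POOL_SIZE_MIN}-{POOL_SIZE_MAX}"
    # Configs that do not set DB_POOL_SIZE pass through unchanged.
    return True, "ok"
```

A hook built on this would reject the edit when the first element of the result is `False` and surface the message, which gives the agent a concrete reason to retry with a saner value.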
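Human-in-the-loop Design
Investigation and remediation are split into two phases:
- Investigation: diagnose the problem using read-only tools only
- Remediation: fix the problem using write tools, with human approval required before execution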
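The two-phase split can be sketched as a small approval gate. The notes do not spell out which tools count as write tools or how approval is collected, so the `WRITE_TOOLS` set, the function `may_run`, and the `approve` callback are assumptions made for illustration.

```python
from typing import Callable

# Assumed classification: the notes do not say exactly which tools are write tools.
WRITE_TOOLS = {
    "mcp__sre__edit_config_file",
    "mcp__sre__run_shell_command",
}


def may_run(tool_name: str, approve: Callable[[str], bool]) -> bool:
    """Read-only tools run freely; write tools run only if the approver agrees."""
    if tool_name not in WRITE_TOOLS:
        return True  # investigation phase: no approval needed
    return approve(tool_name)  # remediation phase: defer to the human
```

In an interactive session the `approve` callback might prompt an operator on the terminal; in tests it can be a lambda that always returns `True` or `False`.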
Skills Extension Example
---
name: runbook
description: Execute documented runbooks for common SRE incidents.
---
## When to Use This Skill
Trigger this skill when you identify one of these patterns:
### Database Connection Exhaustion
- db_connections_active > 90
- "too many connections" in logs
### High Latency Cascade
- P99 latency > 1000ms on api-server
- Latency spreading to downstream services
## Workflow
1. Identify the incident type from symptoms
2. Call execute_runbook with phase="investigate"
3. Follow the diagnostic steps
4. Call execute_runbook with phase="remediate"
5. Present remediation options before taking action
Takeaway
- Not production-ready; this is a toy example
- Works well as a reference