AG-MONITORING
The Monitoring & Observability Specialist designs and implements comprehensive observability solutions. AG-MONITORING ensures production systems are visible, healthy, and ready for incident response.
Capabilities
- Structured Logging: JSON format, request/trace IDs, appropriate log levels
- Metrics Collection: Application, infrastructure, and business metrics
- Dashboard Creation: Grafana, Datadog, CloudWatch dashboards
- Alerting Rules: Threshold-based, anomaly-based, and composite alerts
- SLO Definition: Availability, latency, error rate targets with error budgets
- Distributed Tracing: Request flow tracking through the system
- Health Checks: Endpoint health monitoring and dependency checks
- Incident Response Runbooks: Detection, diagnosis, resolution procedures
- Performance Analysis: Latency breakdown and resource usage patterns
When to Use
Use AG-MONITORING when you need to:
- Design observability architecture for a service
- Implement structured logging
- Set up metrics collection and dashboards
- Configure alerting rules and notification routing
- Define SLOs and error budgets
- Create incident response runbooks
- Monitor application health in production
- Analyze performance and latency patterns
How It Works
- Context Loading: Agent reads expertise file and examines current monitoring setup
- Design: Plans observability architecture (what to log, what to measure, what to alert)
- Implementation: Sets up logging, metrics, dashboards, alerts
- Verification: Tests health checks and alerting in staging
- Documentation: Creates incident runbooks and SLO documentation
- Coordination: Communicates via agent bus about monitoring status
Example
# Via /babysit
/agileflow:babysit
> "Set up monitoring and alerting for the new API service"
# Or directly invoke
/agileflow:observability-setup
# AG-MONITORING will:
# 1. Design logging and metrics strategy
# 2. Set up structured JSON logging
# 3. Configure Prometheus/Grafana for metrics
# 4. Create dashboards
# 5. Configure alerting rules
# 6. Create incident runbooks
Key Behaviors
- Structured logging enforced - No unstructured plaintext logs in production
- Alert on what matters - Reduce noise with intelligent thresholds
- Prepare for failures - Create runbooks before incidents happen
- Never log PII - Security and compliance first (no passwords, tokens, PII)
- Monitor the happy path AND errors - Alert on both success thresholds and failures
- SLOs drive decisions - Error budgets determine when to slow down and fix debt
Observability Pillars
AG-MONITORING implements all four observability pillars:
Metrics (Quantitative)
- Response time (latency)
- Throughput (requests/second)
- Error rate (% failures)
- Resource usage (CPU, memory, disk)
- Business metrics (signups, transactions)
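The metric types above can be sketched in application code. This is a minimal, dependency-free illustration of counters and histograms (a real setup would use a client library such as prom-client; metric names are illustrative):

```javascript
// Dependency-free sketch of two core metric types (names are illustrative).
class Counter {
  constructor(name) { this.name = name; this.value = 0; }
  inc(n = 1) { this.value += n; }
}

class Histogram {
  constructor(name, buckets) {
    this.name = name;
    this.buckets = buckets;          // upper bounds, in seconds
    this.counts = buckets.map(() => 0);
    this.sum = 0;
    this.count = 0;
  }
  observe(v) {
    this.sum += v;
    this.count += 1;
    // cumulative bucket counts, as Prometheus histograms use
    this.buckets.forEach((b, i) => { if (v <= b) this.counts[i] += 1; });
  }
}

const httpRequests = new Counter('http_requests_total');
const httpErrors = new Counter('http_errors_total');
const latency = new Histogram('http_request_duration_seconds', [0.1, 0.2, 0.5, 1]);

// Record one successful request taking 150 ms
httpRequests.inc();
latency.observe(0.15);

// Error rate is derived, not stored: errors / total
const errorRate = httpErrors.value / httpRequests.value;
```

Throughput and error rate are both derived from counters over time windows; only raw counts and latency distributions are stored.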
Logs (Detailed Events)
- Application logs (errors, warnings, info)
- Access logs (HTTP requests)
- Audit logs (who did what)
- Structured format (JSON, easily searchable)
Traces (Request Flow)
- Distributed tracing through the system
- Latency breakdown (where time is spent)
- Error traces and stack traces
- Service dependencies
Alerts (Proactive Notification)
- Threshold-based (metric > limit)
- Anomaly-based (unusual patterns)
- Composite (multiple conditions)
- Smart routing (who to notify)
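The threshold-based and composite alert types above can be expressed as Prometheus alerting rules. A hedged sketch (metric names, thresholds, and severity labels are illustrative, not a prescribed configuration):

```yaml
groups:
  - name: api-alerts
    rules:
      # Threshold-based: error rate above 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 1%"
      # Composite: high p95 latency AND elevated traffic
      - alert: LatencyUnderLoad
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
          and sum(rate(http_requests_total[5m])) > 100
        for: 10m
        labels:
          severity: ticket
```

The `for` clause and severity labels support smart routing: sustained, user-facing conditions page someone, while softer composite signals open a ticket.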
Tools Available
This agent has access to:
- Read: Access application code and logging patterns
- Write: Create logging utilities and monitoring configs
- Edit: Update dashboards and alerting rules
- Bash: Execute monitoring setup commands
- Glob: Find logging patterns in codebase
- Grep: Search for monitoring-related code
Core Responsibilities
- Design observability architecture
- Implement structured logging
- Set up metrics collection
- Create alerting rules
- Build monitoring dashboards
- Define SLOs and error budgets
- Create incident response runbooks
- Monitor application health
- Coordinate with AG-DEVOPS on infrastructure monitoring
- Maintain observability documentation
Quality Standards
Before marking work complete, AG-MONITORING ensures:
- Structured logging implemented with request/trace IDs
- All critical metrics collected and visualized
- Dashboards created and actually useful
- Alerting rules configured with appropriate thresholds
- SLOs defined with error budgets calculated
- Incident runbooks created for common failure scenarios
- Health check endpoint working and responding correctly
- Log retention policy defined and enforced
- No PII/passwords/tokens in logs
- Alert routing tested and working
Log Levels
AG-MONITORING uses these log levels:
- ERROR: Service unavailable, data loss, critical failures
- WARN: Degraded behavior, unexpected conditions, retry scenarios
- INFO: Important state changes, deployments, feature flag changes
- DEBUG: Detailed diagnostic information (development only, not production)
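These levels can be combined with structured JSON output in a small logger. A minimal sketch (field names such as `requestId` are illustrative):

```javascript
// Minimal structured JSON logger sketch (field names are illustrative).
const LEVELS = { ERROR: 0, WARN: 1, INFO: 2, DEBUG: 3 };

// DEBUG is enabled in development only, never in production
const ACTIVE_LEVEL =
  process.env.NODE_ENV === 'production' ? LEVELS.INFO : LEVELS.DEBUG;

function log(level, message, fields = {}) {
  if (LEVELS[level] > ACTIVE_LEVEL) return null; // suppressed at this level
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields // carry requestId/traceId; never log passwords, tokens, or PII
  };
  console.log(JSON.stringify(entry));
  return entry;
}

const entry = log('INFO', 'deployment started', {
  requestId: 'req-123',
  version: '1.4.2'
});
```

Emitting one JSON object per line keeps logs machine-searchable and lets request/trace IDs correlate entries across services.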
SLO Definition
Example SLO targets:
- Availability: 99.9% uptime (~8.76 hours downtime/year)
- Latency: 95% of requests under 200ms
- Error Rate: under 0.1% failed requests
Error budgets determine deployment velocity - when budget is exhausted, focus on stability.
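The arithmetic behind these budgets is simple; a sketch of how the allowed downtime follows from the SLO percentage:

```javascript
// Error-budget arithmetic for an availability SLO.
// budget = (1 - target) * period
function errorBudget(sloTarget, periodHours) {
  return (1 - sloTarget) * periodHours;
}

// 99.9% over a year (8760 hours) allows about 8.76 hours of downtime
const yearlyHours = errorBudget(0.999, 8760);

// Over a 30-day window (720 hours): about 43.2 minutes
const monthlyMinutes = errorBudget(0.999, 720) * 60;
```

Tracking how much of the budget a rolling window has consumed is what turns the SLO into a deployment-velocity signal.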
Related Agents
- AG-DEVOPS - Coordinate on infrastructure monitoring
- AG-API - Monitor endpoint latency and error rates
- AG-DATABASE - Monitor query latency and connection pool
- AG-PERFORMANCE - Collaborate on performance monitoring
Slash Commands
AG-MONITORING can directly invoke these commands:
- /agileflow:research:ask TOPIC=... - Research observability best practices
- /agileflow:ai-code-review - Review monitoring code for best practices
- /agileflow:adr-new - Document monitoring and observability decisions
- /agileflow:status STORY=... STATUS=... - Update monitoring story status
Health Check Endpoint
AG-MONITORING implements health checks like this:
app.get('/health', async (req, res) => {
  const database = await checkDatabase();
  const cache = await checkCache();
  const external = await checkExternalService();
  const healthy = database && cache && external;
  const status = healthy ? 200 : 503;
  res.status(status).json({
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date(),
    checks: { database, cache, external }
  });
});
Returns 200 if healthy, 503 if any dependency is down.
Incident Runbook Format
AG-MONITORING creates runbooks like this:
## [Incident Type]
**Detection**:
- Alert: [which alert fires]
- Symptoms: [what users see]
**Diagnosis**:
1. Check [metric 1]
2. Check [metric 2]
3. Verify [dependency]
**Resolution**:
1. [First step]
2. [Second step]
3. [Verification]
**Post-Incident**:
- Incident report
- Root cause analysis
- Preventive actions
Verification Protocol
AG-MONITORING follows the Session Harness system to prevent breaking functionality:
- Pre-Implementation: Checks baseline monitoring status and environment
- During Work: Tests health checks and alerting in real-time
- Post-Implementation: Verifies monitoring is collecting data before marking complete
- Story Completion: Can ONLY mark "in-review" if monitoring is operational
See the Session Harness Protocol for complete details.