AgileFlow

Monitoring

Monitoring specialist for observability, logging strategies, alerting rules, metrics dashboards, SLOs, and production visibility.

AG-MONITORING

The Monitoring & Observability Specialist designs and implements comprehensive observability solutions. AG-MONITORING ensures production systems are visible, healthy, and ready for incident response.

Capabilities

  • Structured Logging: JSON format, request/trace IDs, appropriate log levels
  • Metrics Collection: Application, infrastructure, and business metrics
  • Dashboard Creation: Grafana, Datadog, CloudWatch dashboards
  • Alerting Rules: Threshold-based, anomaly-based, and composite alerts
  • SLO Definition: Availability, latency, error rate targets with error budgets
  • Distributed Tracing: Request flow tracking across the system
  • Health Checks: Endpoint health monitoring and dependency checks
  • Incident Response Runbooks: Detection, diagnosis, resolution procedures
  • Performance Analysis: Latency breakdown and resource usage patterns

When to Use

Use AG-MONITORING when you need to:

  • Design observability architecture for a service
  • Implement structured logging
  • Set up metrics collection and dashboards
  • Configure alerting rules and notification routing
  • Define SLOs and error budgets
  • Create incident response runbooks
  • Monitor application health in production
  • Analyze performance and latency patterns

How It Works

  1. Context Loading: The agent reads its expertise file and examines the current monitoring setup
  2. Design: Plans observability architecture (what to log, what to measure, what to alert)
  3. Implementation: Sets up logging, metrics, dashboards, alerts
  4. Verification: Tests health checks and alerting in staging
  5. Documentation: Creates incident runbooks and SLO documentation
  6. Coordination: Communicates via agent bus about monitoring status

Example

# Via /babysit
/agileflow:babysit
> "Set up monitoring and alerting for the new API service"
 
# Or directly invoke
/agileflow:observability-setup
 
# AG-MONITORING will:
# 1. Design logging and metrics strategy
# 2. Set up structured JSON logging
# 3. Configure Prometheus/Grafana for metrics
# 4. Create dashboards
# 5. Configure alerting rules
# 6. Create incident runbooks

Key Behaviors

  • Structured logging enforced - No unstructured plaintext logs in production
  • Alert on what matters - Reduce noise with intelligent thresholds
  • Prepare for failures - Create runbooks before incidents happen
  • Never log PII - Security and compliance first (no passwords, tokens, PII)
  • Monitor the happy path AND errors - Alert when success metrics drop below their thresholds as well as when failures occur
  • SLOs drive decisions - Error budgets determine when to slow down and fix debt
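
The "never log PII" rule is commonly enforced with a redaction step that runs before a log entry is serialized. The sketch below shows one minimal way to do that. The `redactSensitive` helper and the `SENSITIVE_KEYS` list are hypothetical names chosen for illustration, not part of AgileFlow, and a real service would need a key list that matches its own data model.

```javascript
// Hypothetical deny-list of field names (assumption); adjust to your data model.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'email', 'ssn']);

// Returns a deep copy of `value` with sensitive fields replaced by a marker.
// Matching is by key name only (case-insensitive), so PII embedded inside
// free-text strings is NOT detected by this sketch.
// Only arrays and plain objects are traversed; other values pass through unchanged.
function redactSensitive(value) {
  if (Array.isArray(value)) {
    return value.map(redactSensitive);
  }
  const isPlainObject =
    value !== null &&
    typeof value === 'object' &&
    Object.getPrototypeOf(value) === Object.prototype;
  if (isPlainObject) {
    const copy = {};
    for (const [key, inner] of Object.entries(value)) {
      copy[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redactSensitive(inner);
    }
    return copy;
  }
  return value;
}

const entry = redactSensitive({
  message: 'login failed',
  user: { id: 7, email: 'a@example.com', password: 'hunter2' },
});
console.log(JSON.stringify(entry));
```

A deny-list is simple but only as good as its key names. An allow-list of fields that may be logged is stricter and may suit compliance-sensitive services better.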

Observability Pillars

AG-MONITORING implements all four observability pillars:

Metrics (Quantitative)

  • Response time (latency)
  • Throughput (requests/second)
  • Error rate (% failures)
  • Resource usage (CPU, memory, disk)
  • Business metrics (signups, transactions)
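
To make the metric definitions concrete, here is a small sketch that derives throughput, error rate, and latency percentiles from raw request samples. The `{ durationMs, ok }` sample shape and the function names are assumptions made for this example. In practice a metrics library or backend usually computes these values.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it. `p` is expected in the range (0, 100].
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes one observation window. Each sample is a hypothetical
// { durationMs, ok } record; `windowSeconds` is the length of the window.
function summarizeRequests(samples, windowSeconds) {
  const durations = samples.map((s) => s.durationMs);
  const failures = samples.filter((s) => !s.ok).length;
  return {
    throughputPerSecond: samples.length / windowSeconds,
    errorRatePercent: samples.length === 0 ? 0 : (failures / samples.length) * 100,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}

const samples = [
  { durationMs: 120, ok: true },
  { durationMs: 80, ok: true },
  { durationMs: 450, ok: false },
  { durationMs: 95, ok: true },
];
console.log(summarizeRequests(samples, 2));
```

Percentiles are used here rather than averages because a handful of slow requests can hide behind a healthy-looking mean.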

Logs (Detailed Events)

  • Application logs (errors, warnings, info)
  • Access logs (HTTP requests)
  • Audit logs (who did what)
  • Structured format (JSON, easily searchable)
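
As an illustration of the structured format, the following sketch emits one JSON object per log line and attaches a request ID to every entry written through a child logger. `createLogger`, its field names, and the sample IDs are hypothetical. The source does not specify which logging library AG-MONITORING sets up.

```javascript
// Minimal structured logger sketch (illustrative, not an AgileFlow API).
// `write` receives one serialized JSON line per log call.
function createLogger(baseFields = {}, write = (line) => console.log(line)) {
  const log = (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      ...baseFields,
      ...fields,
    };
    write(JSON.stringify(entry));
    return entry;
  };
  return {
    info: (message, fields) => log('INFO', message, fields),
    warn: (message, fields) => log('WARN', message, fields),
    error: (message, fields) => log('ERROR', message, fields),
    // A child logger carries extra context (for example a request ID)
    // into every entry it writes.
    child: (extraFields) => createLogger({ ...baseFields, ...extraFields }, write),
  };
}

const logger = createLogger({ service: 'api' });
const requestLogger = logger.child({ requestId: 'req-123' });
requestLogger.info('order created', { orderId: 42, durationMs: 18 });
```

Because every entry is a flat JSON object, a log backend can filter on fields such as `requestId` or `level` without parsing free text.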

Traces (Request Flow)

  • Distributed tracing through system
  • Latency breakdown (where time is spent)
  • Error traces and stack traces
  • Service dependencies
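
Distributed tracing depends on passing a trace identifier from service to service. One widely used convention is the W3C Trace Context `traceparent` header. The source does not say which format AG-MONITORING uses, so treat the sketch below as an assumption-based illustration. The function names and the hard-coded span ID are hypothetical.

```javascript
// Parses a W3C Trace Context `traceparent` header of the form
//   version-traceId-parentId-flags   (2, 32, 16, and 2 lowercase hex digits)
// Returns null when the header is missing or malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero trace or parent IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds the header for an outgoing call: the trace ID stays the same and the
// current span's ID becomes the parent ID seen by the downstream service.
function formatTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// A real service would generate a random 8-byte span ID; this one is fixed for the example.
const outgoing = formatTraceparent(incoming.traceId, 'b7ad6b7169203331', incoming.sampled);
console.log(outgoing);
```

Logging the same `traceId` in every structured log entry lets logs and traces be joined during diagnosis.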

Alerts (Proactive Notification)

  • Threshold-based (metric > limit)
  • Anomaly-based (unusual patterns)
  • Composite (multiple conditions)
  • Smart routing (who to notify)
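
Threshold-based and composite alerts, along with their routing, can be sketched as plain predicate functions over a metrics snapshot. The alert names, limits, and routes below are invented for illustration. Production setups normally express such rules in the alerting tool's own configuration, and anomaly-based alerts need historical baselines that this sketch does not cover.

```javascript
// A threshold rule fires when a metric exceeds its limit.
// A metric missing from the snapshot never fires.
function thresholdRule(metric, limit) {
  return (metrics) => metrics[metric] > limit;
}

// A composite rule fires only when every sub-rule fires.
function allOf(...rules) {
  return (metrics) => rules.every((rule) => rule(metrics));
}

// Hypothetical alert definitions; names, limits, and routes are illustrative.
const alerts = [
  { name: 'HighErrorRate', route: 'pager', fires: thresholdRule('errorRatePercent', 1) },
  {
    name: 'SlowAndBusy',
    route: 'chat',
    fires: allOf(thresholdRule('p95Ms', 500), thresholdRule('requestsPerSecond', 100)),
  },
];

// Returns the alerts that fire for a metrics snapshot, with where to send them.
function evaluateAlerts(alertList, metrics) {
  return alertList
    .filter((alert) => alert.fires(metrics))
    .map(({ name, route }) => ({ name, route }));
}

console.log(evaluateAlerts(alerts, { errorRatePercent: 2.5, p95Ms: 620, requestsPerSecond: 40 }));
```

With this snapshot only `HighErrorRate` fires: latency is above its limit, but traffic is not, so the composite rule stays quiet. This is one way composite conditions reduce noise.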

Tools Available

This agent has access to:

  • Read: Access application code and logging patterns
  • Write: Create logging utilities and monitoring configs
  • Edit: Update dashboards and alerting rules
  • Bash: Execute monitoring setup commands
  • Glob: Find logging patterns in codebase
  • Grep: Search for monitoring-related code

Core Responsibilities

  1. Design observability architecture
  2. Implement structured logging
  3. Set up metrics collection
  4. Create alerting rules
  5. Build monitoring dashboards
  6. Define SLOs and error budgets
  7. Create incident response runbooks
  8. Monitor application health
  9. Coordinate with AG-DEVOPS on infrastructure monitoring
  10. Maintain observability documentation

Quality Standards

Before marking work complete, AG-MONITORING ensures:

  • Structured logging implemented with request/trace IDs
  • All critical metrics collected and displayed on dashboards
  • Dashboards created and useful for diagnosing real issues
  • Alerting rules configured with appropriate thresholds
  • SLOs defined with error budgets calculated
  • Incident runbooks created for common failure scenarios
  • Health check endpoint working and responding correctly
  • Log retention policy defined and enforced
  • No PII/passwords/tokens in logs
  • Alert routing tested and working

Log Levels

AG-MONITORING uses these log levels:

  • ERROR: Service unavailable, data loss, critical failures
  • WARN: Degraded behavior, unexpected conditions, retry scenarios
  • INFO: Important state changes, deployments, feature flag changes
  • DEBUG: Detailed diagnostic information (development only, not production)
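
The "DEBUG in development only" policy amounts to a minimum-severity filter. A minimal sketch follows, assuming the environment is identified by a `NODE_ENV`-style string. The numeric severities and function names are illustrative choices, not values taken from the source.

```javascript
// Severity order for the four levels listed above (higher = more severe).
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

// Assumed policy: DEBUG is emitted everywhere except production.
function minimumLevel(nodeEnv) {
  return nodeEnv === 'production' ? 'INFO' : 'DEBUG';
}

// True when an entry at `level` should be written in the given environment.
// Unknown level names are never written.
function shouldLog(level, nodeEnv) {
  return LEVELS[level] >= LEVELS[minimumLevel(nodeEnv)];
}

console.log(shouldLog('DEBUG', 'production'));  // false
console.log(shouldLog('WARN', 'production'));   // true
console.log(shouldLog('DEBUG', 'development')); // true
```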

SLO Definition

Example SLO targets:

  • Availability: 99.9% uptime (at most about 8.76 hours of downtime per year)
  • Latency: 95% of requests under 200ms
  • Error Rate: under 0.1% failed requests

Error budgets determine deployment velocity - when budget is exhausted, focus on stability.
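
The arithmetic behind these targets is short enough to sketch. The example below computes the downtime a 99.9% availability target allows and a request-based error budget for a window. The function names and the sample request counts are assumptions made for illustration.

```javascript
// Allowed downtime, in minutes, for an availability target over a period.
function allowedDowntimeMinutes(targetPercent, periodDays) {
  return (1 - targetPercent / 100) * periodDays * 24 * 60;
}

// Request-based error budget: how many failed requests the SLO tolerates in a
// window, and what fraction of that allowance has been spent.
// Note: a 100% target leaves no budget, so consumedFraction is not meaningful there.
function errorBudget(targetPercent, totalRequests, failedRequests) {
  const allowedFailures = (1 - targetPercent / 100) * totalRequests;
  return {
    allowedFailures,
    consumedFraction: failedRequests / allowedFailures,
    exhausted: failedRequests >= allowedFailures,
  };
}

console.log(allowedDowntimeMinutes(99.9, 365).toFixed(1)); // "525.6" minutes, about 8.76 hours
console.log(allowedDowntimeMinutes(99.9, 30).toFixed(1));  // "43.2" minutes
console.log(errorBudget(99.9, 1000000, 400));
```

In the last call, roughly 1,000 failures are allowed out of one million requests, so 400 failures consume about 40% of the budget. Results are floating-point values and may need rounding before display.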

Related Agents

  • AG-DEVOPS - Coordinate on infrastructure monitoring
  • AG-API - Monitor endpoint latency and error rates
  • AG-DATABASE - Monitor query latency and connection pool
  • AG-PERFORMANCE - Collaborate on performance monitoring

Slash Commands

AG-MONITORING can directly invoke these commands:

  • /agileflow:research:ask TOPIC=... - Research observability best practices
  • /agileflow:ai-code-review - Review monitoring code for best practices
  • /agileflow:adr-new - Document monitoring and observability decisions
  • /agileflow:status STORY=... STATUS=... - Update monitoring story status

Health Check Endpoint

AG-MONITORING implements health checks like this:

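// checkDatabase, checkCache, and checkExternalService are assumed to be
// application-specific helpers that resolve to a truthy value when healthy.
app.get('/health', async (req, res) => {
  // A check that throws or rejects counts as failed, so the endpoint still
  // responds instead of leaving an unhandled rejection.
  const runCheck = (check) => Promise.resolve().then(check).catch(() => false);

  // Run the dependency checks in parallel to keep the endpoint fast.
  const [database, cache, external] = await Promise.all(
    [checkDatabase, checkCache, checkExternalService].map(runCheck)
  );

  const healthy = database && cache && external;
  const status = healthy ? 200 : 503;

  res.status(status).json({
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    checks: { database, cache, external }
  });
});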

Returns 200 if healthy, 503 if any dependency is down.

Incident Runbook Format

AG-MONITORING creates runbooks like this:

## [Incident Type]
 
**Detection**:
- Alert: [which alert fires]
- Symptoms: [what users see]
 
**Diagnosis**:
1. Check [metric 1]
2. Check [metric 2]
3. Verify [dependency]
 
**Resolution**:
1. [First step]
2. [Second step]
3. [Verification]
 
**Post-Incident**:
- Incident report
- Root cause analysis
- Preventive actions

Verification Protocol

AG-MONITORING follows the Session Harness system to prevent breaking functionality:

  1. Pre-Implementation: Checks baseline monitoring status and environment
  2. During Work: Tests health checks and alerting in real-time
  3. Post-Implementation: Verifies monitoring is collecting data before marking complete
  4. Story Completion: Marks a story "in-review" only if monitoring is operational

See the Session Harness Protocol for complete details.