AG-MONITORING
The Monitoring & Observability Specialist designs and implements comprehensive observability solutions. AG-MONITORING ensures production systems are visible, healthy, and ready for incident response.
Capabilities
- Structured Logging: JSON format, request/trace IDs, appropriate log levels
- Metrics Collection: Application, infrastructure, and business metrics
- Dashboard Creation: Grafana, Datadog, CloudWatch dashboards
- Alerting Rules: Threshold-based, anomaly-based, and composite alerts
- SLO Definition: Availability, latency, error rate targets with error budgets
- Distributed Tracing: Request flow tracking through the system
- Health Checks: Endpoint health monitoring and dependency checks
- Incident Response Runbooks: Detection, diagnosis, resolution procedures
- Performance Analysis: Latency breakdown and resource usage patterns
When to Use
Use AG-MONITORING when you need to:
- Design observability architecture for a service
- Implement structured logging
- Set up metrics collection and dashboards
- Configure alerting rules and notification routing
- Define SLOs and error budgets
- Create incident response runbooks
- Monitor application health in production
- Analyze performance and latency patterns
How It Works
- Context Loading: Agent reads expertise file and examines current monitoring setup
- Design: Plans observability architecture (what to log, what to measure, what to alert)
- Implementation: Sets up logging, metrics, dashboards, alerts
- Verification: Tests health checks and alerting in staging
- Documentation: Creates incident runbooks and SLO documentation
- Coordination: Communicates via agent bus about monitoring status
Example
# Via /babysit
/agileflow:babysit
> "Set up monitoring and alerting for the new API service"
# Or directly invoke
/agileflow:observability-setup
# AG-MONITORING will:
# 1. Design logging and metrics strategy
# 2. Set up structured JSON logging
# 3. Configure Prometheus/Grafana for metrics
# 4. Create dashboards
# 5. Configure alerting rules
# 6. Create incident runbooks
Key Behaviors
- Structured logging enforced - No unstructured plaintext logs in production
- Alert on what matters - Reduce noise with intelligent thresholds
- Prepare for failures - Create runbooks before incidents happen
- Never log PII - Security and compliance first (no passwords, tokens, PII)
- Monitor the happy path AND errors - Alert on both success thresholds and failures
- SLOs drive decisions - Error budgets determine when to slow down and fix debt
Observability Pillars
AG-MONITORING implements all four observability pillars:
Metrics (Quantitative)
- Response time (latency)
- Throughput (requests/second)
- Error rate (% failures)
- Resource usage (CPU, memory, disk)
- Business metrics (signups, transactions)
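The metric types above can be sketched in application code. This is a minimal, dependency-free illustration of counters and histograms (a real setup would use a client library such as prom-client; metric names are illustrative):

```javascript
// Dependency-free sketch of two core metric types (names are illustrative).
class Counter {
  constructor(name) { this.name = name; this.value = 0; }
  inc(n = 1) { this.value += n; }
}

class Histogram {
  constructor(name, buckets) {
    this.name = name;
    this.buckets = buckets;          // upper bounds, in seconds
    this.counts = buckets.map(() => 0);
    this.sum = 0;
    this.count = 0;
  }
  observe(v) {
    this.sum += v;
    this.count += 1;
    // cumulative bucket counts, as Prometheus histograms use
    this.buckets.forEach((b, i) => { if (v <= b) this.counts[i] += 1; });
  }
}

const httpRequests = new Counter('http_requests_total');
const httpErrors = new Counter('http_errors_total');
const latency = new Histogram('http_request_duration_seconds', [0.1, 0.2, 0.5, 1]);

// Record one successful request taking 150 ms
httpRequests.inc();
latency.observe(0.15);

// Error rate is derived, not stored: errors / total
const errorRate = httpErrors.value / httpRequests.value;
```

Throughput and error rate are both derived from counters over time windows; only raw counts and latency distributions are stored.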
Logs (Detailed Events)
- Application logs (errors, warnings, info)
- Access logs (HTTP requests)
- Audit logs (who did what)
- Structured format (JSON, easily searchable)
Traces (Request Flow)
- Distributed tracing through the system
- Latency breakdown (where time is spent)
- Error traces and stack traces
- Service dependencies
Alerts (Proactive Notification)
- Threshold-based (metric > limit)
- Anomaly-based (unusual patterns)
- Composite (multiple conditions)
- Smart routing (who to notify)
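The threshold-based and composite alert types above can be expressed as Prometheus alerting rules. A hedged sketch (metric names, thresholds, and severity labels are illustrative, not a prescribed configuration):

```yaml
groups:
  - name: api-alerts
    rules:
      # Threshold-based: error rate above 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 1%"
      # Composite: high p95 latency AND elevated traffic
      - alert: LatencyUnderLoad
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
          and sum(rate(http_requests_total[5m])) > 100
        for: 10m
        labels:
          severity: ticket
```

The `for` clause and severity labels support smart routing: sustained, user-facing conditions page someone, while softer composite signals open a ticket.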
Tools Available
This agent has access to:
- Read: Access application code and logging patterns
- Write: Create logging utilities and monitoring configs
- Edit: Update dashboards and alerting rules
- Bash: Execute monitoring setup commands
- Glob: Find logging patterns in codebase
- Grep: Search for monitoring-related code
Core Responsibilities
- Design observability architecture
- Implement structured logging
- Set up metrics collection
- Create alerting rules
- Build monitoring dashboards
- Define SLOs and error budgets
- Create incident response runbooks
- Monitor application health
- Coordinate with AG-DEVOPS on infrastructure monitoring
- Maintain observability documentation
Quality Standards
Before marking work complete, AG-MONITORING ensures:
- Structured logging implemented with request/trace IDs
- All critical metrics collected and visualized
- Dashboards created and actually useful
- Alerting rules configured with appropriate thresholds
- SLOs defined with error budgets calculated
- Incident runbooks created for common failure scenarios
- Health check endpoint working and responding correctly
- Log retention policy defined and enforced
- No PII/passwords/tokens in logs
- Alert routing tested and working
Log Levels
AG-MONITORING uses these log levels:
- ERROR: Service unavailable, data loss, critical failures
- WARN: Degraded behavior, unexpected conditions, retry scenarios
- INFO: Important state changes, deployments, feature flag changes
- DEBUG: Detailed diagnostic information (development only, not production)
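These levels can be combined with structured JSON output in a small logger. A minimal sketch (field names such as `requestId` are illustrative):

```javascript
// Minimal structured JSON logger sketch (field names are illustrative).
const LEVELS = { ERROR: 0, WARN: 1, INFO: 2, DEBUG: 3 };

// DEBUG is enabled in development only, never in production
const ACTIVE_LEVEL =
  process.env.NODE_ENV === 'production' ? LEVELS.INFO : LEVELS.DEBUG;

function log(level, message, fields = {}) {
  if (LEVELS[level] > ACTIVE_LEVEL) return null; // suppressed at this level
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields // carry requestId/traceId; never log passwords, tokens, or PII
  };
  console.log(JSON.stringify(entry));
  return entry;
}

const entry = log('INFO', 'deployment started', {
  requestId: 'req-123',
  version: '1.4.2'
});
```

Emitting one JSON object per line keeps logs machine-searchable and lets request/trace IDs correlate entries across services.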
SLO Definition
Example SLO targets:
- Availability: 99.9% uptime (~8.76 hours downtime/year)
- Latency: 95% of requests under 200ms
- Error Rate: under 0.1% failed requests
Error budgets determine deployment velocity - when budget is exhausted, focus on stability.
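The arithmetic behind these budgets is simple; a sketch of how the allowed downtime follows from the SLO percentage:

```javascript
// Error-budget arithmetic for an availability SLO.
// budget = (1 - target) * period
function errorBudget(sloTarget, periodHours) {
  return (1 - sloTarget) * periodHours;
}

// 99.9% over a year (8760 hours) allows about 8.76 hours of downtime
const yearlyHours = errorBudget(0.999, 8760);

// Over a 30-day window (720 hours): about 43.2 minutes
const monthlyMinutes = errorBudget(0.999, 720) * 60;
```

Tracking how much of the budget a rolling window has consumed is what turns the SLO into a deployment-velocity signal.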
Related Agents
- AG-DEVOPS - Coordinate on infrastructure monitoring
- AG-API - Monitor endpoint latency and error rates
- AG-DATABASE - Monitor query latency and connection pool
- AG-PERFORMANCE - Collaborate on performance monitoring
Slash Commands
AG-MONITORING can directly invoke these commands:
- /agileflow:research:ask TOPIC=... - Research observability best practices
- /agileflow:ai-code-review - Review monitoring code for best practices
- /agileflow:adr-new - Document monitoring and observability decisions
- /agileflow:status STORY=... STATUS=... - Update monitoring story status
Health Check Endpoint
AG-MONITORING implements health checks like this:
app.get('/health', async (req, res) => {
  const database = await checkDatabase();
  const cache = await checkCache();
  const external = await checkExternalService();
  const healthy = database && cache && external;
  const status = healthy ? 200 : 503;
  res.status(status).json({
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date(),
    checks: { database, cache, external }
  });
});
Returns 200 if healthy, 503 if any dependency is down.
Incident Runbook Format
AG-MONITORING creates runbooks like this:
## [Incident Type]
**Detection**:
- Alert: [which alert fires]
- Symptoms: [what users see]
**Diagnosis**:
1. Check [metric 1]
2. Check [metric 2]
3. Verify [dependency]
**Resolution**:
1. [First step]
2. [Second step]
3. [Verification]
**Post-Incident**:
- Incident report
- Root cause analysis
- Preventive actions
Verification Protocol
AG-MONITORING follows the Session Harness system to prevent breaking functionality:
- Pre-Implementation: Checks baseline monitoring status and environment
- During Work: Tests health checks and alerting in real-time
- Post-Implementation: Verifies monitoring is collecting data before marking complete
- Story Completion: Can ONLY mark "in-review" if monitoring is operational
See the Session Harness Protocol for complete details.