AWS IoT Automated Incident Management
Production serverless system monitoring ~2,000 RFID devices across 50+ countries. Real-time fleet health, automated incident detection with anti-flapping, and email notifications — all for ~$25/month.
Overview
A fully serverless incident management system monitoring approximately 2,000 RFID readers deployed across 50+ countries for a global postal logistics operator. The system provides three integrated capabilities: real-time fleet health monitoring, automated connectivity incident detection with false-positive prevention, and customer email notifications.
The entire solution runs on AWS serverless services in eu-central-1 (Frankfurt) at an operating cost of ~$25/month, compared to vendor alternatives costing 20–40x more.
The Challenge
Managing a global fleet of RFID devices across 50+ countries presents unique operational challenges. Devices go offline due to network issues, power outages, and hardware failures. Without automated monitoring, the operations team spent significant manual effort checking device status, investigating false alarms, and coordinating with postal operators.
Key problems included:
- No visibility into fleet-wide health across all countries
- False positives from devices that briefly disconnect and reconnect (flapping)
- Manual incident tracking via spreadsheets and emails
- Delayed notifications to postal operators about device issues
- Vendor monitoring solutions costing $500–900/month with less integration
Architecture
The system consists of three interconnected pipelines sharing the AWS IoT Core foundation, orchestrated through event-driven patterns.
Pipeline 1: Fleet Monitoring
Devices publish shadow state updates every 10 minutes. Fleet Indexing and Fleet Metrics aggregate this data into 35 custom CloudWatch metrics covering connectivity status, service health, and hardware issues. A single dashboard with 29 widgets provides complete operational visibility.
Metrics are separated by region (global operations and region-specific operations), with additional hardware health monitoring for antenna detection and NTP synchronization.
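As a rough illustration of how one such metric might be defined, the sketch below builds the arguments for the AWS IoT `CreateFleetMetric` API. The metric name, query string, and aggregation field are assumptions chosen for illustration, not the project's actual definitions; only the 10-minute period comes from the description above.

```python
def fleet_metric_params(metric_name: str, query: str, period_seconds: int = 600) -> dict:
    """Build keyword arguments for boto3's iot.create_fleet_metric.

    All concrete values passed in below are hypothetical examples.
    """
    if period_seconds % 60 != 0:
        raise ValueError("Fleet Metrics periods must be a multiple of 60 seconds")
    return {
        "metricName": metric_name,
        "queryString": query,
        # Count the things that match the query in each period.
        "aggregationType": {"name": "Statistics", "values": ["count"]},
        "aggregationField": "registry.version",
        "period": period_seconds,  # 600 s = the 10-minute interval
    }


params = fleet_metric_params("OfflineDevicesGlobal", "connectivity.connected:false")
# With AWS credentials configured, this could be passed on as:
#   boto3.client("iot", region_name="eu-central-1").create_fleet_metric(**params)
```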
Pipeline 2: Incident Detection
The core of the system is a three-Lambda architecture with built-in anti-flapping logic:
- FlappingDetector runs every 30 minutes, analyzing CloudWatch Logs Insights for rapid connect/disconnect patterns
- GGOfflineScanner runs every 15 minutes, querying a dynamic IoT Thing Group for offline devices and sending them to an SQS delay queue with a 15-minute anti-flap delay
- GGDisconnectionProcessor processes delayed messages, re-checking connectivity before creating incidents — eliminating 90% of false positives
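The flapping check in the first item can be pictured as counting state changes inside a time window. The sketch below is a minimal plain-Python version that works on events already fetched from the logs; the event shape, the 30-minute window, and the threshold of 4 transitions are assumptions, and the real FlappingDetector queries CloudWatch Logs Insights rather than an in-memory list.

```python
from datetime import datetime, timedelta


def count_transitions(events: list) -> int:
    """Count changes between 'connected' and 'disconnected' in time order.

    Each event is a (timestamp, state) tuple.
    """
    ordered = sorted(events, key=lambda e: e[0])
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev[1] != cur[1])


def is_flapping(events: list, now: datetime,
                window: timedelta = timedelta(minutes=30),
                threshold: int = 4) -> bool:
    """True if the device changed state at least `threshold` times in the window."""
    recent = [e for e in events if now - window <= e[0] <= now]
    return count_transitions(recent) >= threshold


now = datetime(2024, 1, 1, 12, 0)
noisy = [
    (datetime(2024, 1, 1, 11, 35), "disconnected"),
    (datetime(2024, 1, 1, 11, 40), "connected"),
    (datetime(2024, 1, 1, 11, 45), "disconnected"),
    (datetime(2024, 1, 1, 11, 50), "connected"),
    (datetime(2024, 1, 1, 11, 55), "disconnected"),
]
stable = [(datetime(2024, 1, 1, 11, 35), "disconnected")]
```

Here `is_flapping(noisy, now)` is true (four transitions in the window), while `is_flapping(stable, now)` is false because a single disconnect has no transitions.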
The processor handles both GGv2 (Fleet Index API) and GGv1 (device shadow) connectivity checks, with smart detection for devices in migration between versions.
Pipeline 3: Email Notifications
Automated customer communications via Amazon SES with intelligent batching:
- New incidents: Batched by country and facility, sent 1 hour after detection
- Weekly reminders: Up to 2 reminders for unresolved incidents
- Recovery notifications: Sent 1 hour after resolution as thread replies
- Escalation: After 2 unanswered reminders, escalates to a different contact
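The batching and escalation rules listed above could be sketched roughly as follows. The field names (`country`, `facility`, `device_id`) and the function names are hypothetical; only the grouping key and the limit of 2 reminders come from the list.

```python
from collections import defaultdict


def batch_incidents(incidents: list) -> dict:
    """Group device IDs by (country, facility) so each group gets one email."""
    batches = defaultdict(list)
    for incident in incidents:
        key = (incident["country"], incident["facility"])
        batches[key].append(incident["device_id"])
    return dict(batches)


def next_action(reminders_sent: int, max_reminders: int = 2) -> str:
    """Send another reminder until the limit is reached, then escalate."""
    return "remind" if reminders_sent < max_reminders else "escalate"


incidents = [
    {"country": "DE", "facility": "Hub-A", "device_id": "reader-001"},
    {"country": "DE", "facility": "Hub-A", "device_id": "reader-002"},
    {"country": "FR", "facility": "Hub-B", "device_id": "reader-101"},
]
batches = batch_incidents(incidents)
```

In this example the two readers at the same facility share one batch and the third gets its own, so two emails would be sent instead of three.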
All emails maintain conversation threading using SES Message-ID and In-Reply-To headers.
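A minimal sketch of the threading headers, using only the Python standard library, is shown below. The addresses and the parent Message-ID are placeholders, and the `References` header is an addition here (many mail clients use it alongside `In-Reply-To`); how the project obtains and stores the original Message-ID is not covered.

```python
from email.message import EmailMessage


def build_reply(subject: str, body: str, sender: str, recipient: str,
                parent_message_id: str) -> EmailMessage:
    """Build a message that mail clients should thread under the parent email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"
    msg["In-Reply-To"] = parent_message_id  # links this reply to the original
    msg["References"] = parent_message_id   # assumption: added for client compatibility
    msg.set_content(body)
    return msg


reply = build_reply(
    subject="Device offline: reader-001",
    body="The device is back online.",
    sender="alerts@example.com",
    recipient="operator@example.com",
    parent_message_id="<abc123@example.com>",
)
# The raw bytes could then be handed to SES, for example:
#   ses.send_raw_email(RawMessage={"Data": reply.as_bytes()})
```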
AWS Services
| Service | Purpose |
|---|---|
| IoT Core | Device connectivity, shadows, MQTT, presence events |
| Fleet Indexing | Device data aggregation with REGISTRY_AND_SHADOW mode |
| Fleet Metrics | 35 automated CloudWatch metrics at 10-minute intervals |
| Lambda | 3 Python functions (256–512MB, 3–5min timeout) |
| EventBridge | 3 scheduled rules coordinating all pipelines |
| SQS | Anti-flap delay queue with 900s per-message delay |
| DynamoDB | Incident records, flapping stats, notification tracking |
| CloudWatch | Metrics, logs, and 29-widget operational dashboard |
| SES | Threaded HTML email notifications with batching |
Cost
The entire solution runs at approximately $25/month across Lambda, DynamoDB, SQS, IoT Core (shadows, Fleet Indexing, rules), EventBridge, CloudWatch (metrics, logs, dashboard), and SES.
Comparable vendor monitoring solutions for a fleet of this size cost $500–900/month — making this custom serverless solution 20–40x more cost-effective.
Key Design Decisions
Anti-flapping with SQS delay: Instead of immediately creating incidents for offline devices, messages sit in SQS for 15 minutes. The processor then re-checks connectivity — if the device recovered, the message is discarded. This single pattern eliminated 90% of false positives.
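The delay-then-recheck pattern can be sketched as two small functions: one that prepares the delayed SQS message and one that decides what to do when it arrives. The payload fields, function names, and the injected `is_connected` callable are assumptions for illustration; the sketch also assumes a standard queue, since per-message delays are not available on FIFO queues.

```python
import json
from typing import Callable, Optional

ANTI_FLAP_DELAY_SECONDS = 900  # 15 minutes, which is also the SQS maximum delay


def delayed_message(queue_url: str, thing_name: str, detected_at: str) -> dict:
    """Keyword arguments for boto3's sqs.send_message with the anti-flap delay."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"thingName": thing_name, "detectedAt": detected_at}),
        "DelaySeconds": ANTI_FLAP_DELAY_SECONDS,
    }


def process_delayed(message_body: str,
                    is_connected: Callable[[str], bool]) -> Optional[dict]:
    """Re-check connectivity after the delay; return an incident only if still offline."""
    payload = json.loads(message_body)
    if is_connected(payload["thingName"]):
        return None  # device recovered during the delay: discard the message
    return {
        "thingName": payload["thingName"],
        "detectedAt": payload["detectedAt"],
        "status": "OPEN",
    }


msg = delayed_message("https://sqs.example/queue", "reader-001", "2024-01-01T12:00:00Z")
recovered = process_delayed(msg["MessageBody"], is_connected=lambda name: True)
still_offline = process_delayed(msg["MessageBody"], is_connected=lambda name: False)
```

Passing the connectivity check in as a callable keeps the decision logic testable without AWS access; in a Lambda it would wrap the Fleet Index or shadow lookup.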
Dynamic Thing Groups for pre-filtering: Rather than scanning all 2,000 devices, the IoT Thing Group uses composite Fleet Index queries to pre-filter candidates, reducing Lambda execution time and API calls.
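To make the pre-filtering idea concrete, the sketch below assembles a composite Fleet Index query string of the kind a dynamic Thing Group could use. `connectivity.connected:false` is standard Fleet Indexing query syntax; the `attributes.platform` attribute and the group name in the comment are hypothetical, and the project's actual queries are not shown in this document.

```python
def composite_query(*clauses: str) -> str:
    """Join Fleet Index query clauses so that all of them must match."""
    if not clauses:
        raise ValueError("at least one clause is required")
    return " AND ".join(clauses)


query = composite_query(
    "connectivity.connected:false",  # only devices currently offline
    "attributes.platform:ggv2",      # hypothetical attribute used as a filter
)
# With AWS credentials configured, the group could be created along these lines:
#   boto3.client("iot").create_dynamic_thing_group(
#       thingGroupName="OfflineCandidates", queryString=query)
```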
DynamoDB TTL for data lifecycle: Incident records auto-expire after 548 days (1.5 years), flapping statistics after 90 days. No manual cleanup required.
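DynamoDB TTL expects a Number attribute holding an expiry time in Unix epoch seconds. A minimal sketch of computing that value for the two retention periods above follows; the constant and function names are illustrative, not taken from the project.

```python
from datetime import datetime, timedelta, timezone

INCIDENT_TTL_DAYS = 548  # about 1.5 years
FLAPPING_TTL_DAYS = 90


def ttl_epoch(created_at: datetime, days: int) -> int:
    """Return the expiry time as Unix epoch seconds for a DynamoDB TTL attribute."""
    return int((created_at + timedelta(days=days)).timestamp())


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
incident_expiry = ttl_epoch(created, INCIDENT_TTL_DAYS)
flapping_expiry = ttl_epoch(created, FLAPPING_TTL_DAYS)
```

The resulting integer would be stored on each item under whichever attribute name is configured as the table's TTL attribute.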
Regional separation: Certain country operations are excluded from the global pipeline for compliance reasons, with dedicated Fleet Metrics providing separate coverage.
Results
| Metric | Value |
|---|---|
| Fleet coverage | ~2,000 devices, 50+ countries |
| Monitoring interval | 10–15 minutes |
| False positive reduction | 90% via anti-flap delay |
| Cost vs vendor | 20–40x cheaper |
| Operational overhead reduction | 90% |