AWS IoT Automated Incident Management

Production serverless system monitoring ~2,000 RFID devices across 50+ countries. Real-time fleet health, automated incident detection with anti-flapping, and email notifications — all for ~$25/month.

Overview

A fully serverless incident management system monitoring approximately 2,000 RFID readers deployed across 50+ countries for a global postal logistics operator. The system provides three integrated capabilities: real-time fleet health monitoring, automated connectivity incident detection with false-positive prevention, and customer email notifications.

The entire solution runs on AWS serverless services in eu-central-1 (Frankfurt) at an operational cost of ~$25/month, compared with vendor alternatives costing 20–40x more.

The Challenge

Managing a global fleet of RFID devices across 50+ countries presents unique operational challenges. Devices go offline due to network issues, power outages, and hardware failures. Without automated monitoring, the operations team spent significant manual effort checking device status, investigating false alarms, and coordinating with postal operators.

Key problems included:

No visibility into fleet-wide health across all countries
False positives from devices that briefly disconnect and reconnect (flapping)
Manual incident tracking via spreadsheets and emails
Delayed notifications to postal operators about device issues
Vendor monitoring solutions costing $500–900/month with less integration

Architecture

The system consists of three interconnected pipelines sharing the AWS IoT Core foundation, orchestrated through event-driven patterns.

Pipeline 1: Fleet Monitoring

Devices publish shadow state updates every 10 minutes. Fleet Indexing aggregates this data, and Fleet Metrics publishes it as 35 custom CloudWatch metrics covering connectivity status, service health, and hardware issues. A single dashboard with 29 widgets provides complete operational visibility.

Metrics are split into global operations and region-specific operations, with additional hardware health metrics for antenna detection and NTP synchronization.
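As a rough illustration of how one such metric could be defined, the sketch below builds the keyword arguments for boto3's iot.create_fleet_metric call. The metric names and the shadow field in the second query are hypothetical, since the 35 actual metrics are not listed here; only the 10-minute period comes from the description above.

```python
from typing import Any

# 10-minute emission interval, as described above.
# Fleet Metrics periods must be a multiple of 60 seconds.
FLEET_METRIC_PERIOD_SECONDS = 600


def fleet_metric_request(name: str, query: str) -> dict[str, Any]:
    """Build keyword arguments for boto3's iot.create_fleet_metric."""
    return {
        "metricName": name,
        "queryString": query,
        # Count the things matching the query in each period.
        "aggregationType": {"name": "Statistics", "values": ["count"]},
        "aggregationField": "registry.version",
        "period": FLEET_METRIC_PERIOD_SECONDS,
    }


# Hypothetical examples: metric names and the shadow field are assumptions.
EXAMPLE_METRICS = [
    fleet_metric_request("OfflineDevices", "connectivity.connected:false"),
    fleet_metric_request("NtpNotSynced", "shadow.reported.ntpSynced:false"),
]
```

Each dictionary could then be passed as boto3.client("iot").create_fleet_metric(**request), assuming Fleet Indexing is already enabled for the account.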

Pipeline 2: Incident Detection

The core of the system is a set of three Lambda functions with built-in anti-flapping logic:

FlappingDetector runs every 30 minutes, analyzing CloudWatch Logs Insights for rapid connect/disconnect patterns
GGOfflineScanner runs every 15 minutes, querying a dynamic IoT Thing Group for offline devices and sending them to an SQS delay queue with a 15-minute anti-flap delay
GGDisconnectionProcessor processes delayed messages, re-checking connectivity before creating incidents — eliminating 90% of false positives

The processor handles both GGv2 (Fleet Index API) and GGv1 (device shadow) connectivity checks, with smart detection for devices in migration between versions.
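The re-check decision might look something like the sketch below. The rule used here (a device counts as online if either source reports it connected, which covers a device that is mid-migration and visible in both) is an assumption for illustration, not the documented logic, and ConnectivityReport is a hypothetical type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectivityReport:
    # True/False when the source has data for the device, None when it has none.
    ggv2_connected: Optional[bool]  # from the Fleet Index (GGv2)
    ggv1_connected: Optional[bool]  # from the device shadow (GGv1)


def is_still_offline(report: ConnectivityReport) -> bool:
    """Return True only if no available source reports the device as connected."""
    sources = [
        state
        for state in (report.ggv2_connected, report.ggv1_connected)
        if state is not None
    ]
    if not sources:
        # Assumption: with no data at all, keep the device as offline
        # so a real outage is not silently dropped.
        return True
    return not any(sources)
```

Under this assumed rule, a device that is offline in the Fleet Index but connected according to its GGv1 shadow would not produce an incident.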

Pipeline 3: Email Notifications

Automated customer communications via Amazon SES with intelligent batching:

New incidents: Batched by country and facility, sent 1 hour after detection
Weekly reminders: Up to 2 reminders for unresolved incidents
Recovery notifications: Sent 1 hour after resolution as thread replies
Escalation: After 2 unanswered reminders, escalates to a different contact
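The batching rule for new incidents can be sketched as follows. The incident field names (country, facility, device_id, detected_at) are hypothetical, and the grouping shown is one plausible reading of "batched by country and facility", not the actual implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BATCH_DELAY = timedelta(hours=1)  # new incidents are sent 1 hour after detection


def due_batches(incidents, now):
    """Group incidents at least BATCH_DELAY old by (country, facility)."""
    batches = defaultdict(list)
    for incident in incidents:
        if now - incident["detected_at"] >= BATCH_DELAY:
            key = (incident["country"], incident["facility"])
            batches[key].append(incident["device_id"])
    return dict(batches)


# Usage with made-up data: two incidents are due, one is still too recent.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incidents = [
    {"country": "DE", "facility": "FRA-1", "device_id": "r1",
     "detected_at": now - timedelta(hours=2)},
    {"country": "DE", "facility": "FRA-1", "device_id": "r2",
     "detected_at": now - timedelta(minutes=90)},
    {"country": "FR", "facility": "CDG-1", "device_id": "r3",
     "detected_at": now - timedelta(minutes=10)},
]
batches = due_batches(incidents, now)
```

Each resulting key would map to one email, so a facility with several offline readers receives a single message rather than one per device.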

All emails maintain conversation threading using SES Message-ID and In-Reply-To headers.
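A minimal sketch of the threading idea, using only Python's standard email library: a follow-up message carries the parent's Message-ID in its In-Reply-To and References headers so mail clients group it into the same conversation. The function name, addresses, and the example Message-ID are hypothetical.

```python
from email.message import EmailMessage
from typing import Optional


def build_notification(
    subject: str,
    body: str,
    sender: str,
    recipient: str,
    parent_message_id: Optional[str] = None,
) -> EmailMessage:
    """Build a message; if parent_message_id is given, thread it as a reply."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if parent_message_id:
        msg["Subject"] = f"Re: {subject}"
        # These two headers are what mail clients use to thread conversations.
        msg["In-Reply-To"] = parent_message_id
        msg["References"] = parent_message_id
    else:
        msg["Subject"] = subject
    msg.set_content(body)
    return msg


# Hypothetical values for illustration only.
parent_id = "<0107example@eu-central-1.amazonses.com>"
reply = build_notification(
    "Reader offline at FRA-1",
    "The reader has recovered.",
    "alerts@example.com",
    "ops@example.com",
    parent_message_id=parent_id,
)
```

The raw bytes from reply.as_bytes() could be handed to the SES send_raw_email operation. The exact form of the Message-ID that SES assigns to the original message (here assumed to be the returned message ID plus a regional amazonses.com suffix) should be verified against a delivered email before relying on it.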

AWS Services

Service	Purpose
IoT Core	Device connectivity, shadows, MQTT, presence events
Fleet Indexing	Device data aggregation with REGISTRY_AND_SHADOW mode
Fleet Metrics	35 automated CloudWatch metrics at 10-minute intervals
Lambda	3 Python functions (256–512 MB memory, 3–5 min timeout)
EventBridge	3 scheduled rules coordinating all pipelines
SQS	Anti-flap delay queue with 900s per-message delay
DynamoDB	Incident records, flapping stats, notification tracking
CloudWatch	Metrics, logs, and 29-widget operational dashboard
SES	Threaded HTML email notifications with batching

Cost

The entire solution runs at approximately $25/month, using: Lambda, DynamoDB, SQS, IoT Core (shadows, Fleet Indexing, rules), EventBridge, CloudWatch (metrics, logs, dashboard), and SES.

Comparable vendor monitoring solutions for a fleet of this size cost $500–900/month — making this custom serverless solution 20–40x more cost-effective.

Key Design Decisions

Anti-flapping with SQS delay: Instead of immediately creating incidents for offline devices, messages sit in SQS for 15 minutes. The processor then re-checks connectivity — if the device recovered, the message is discarded. This single pattern eliminated 90% of false positives.
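The pattern can be sketched in a few lines. The function names, message shape, and the injected still_offline and create_incident callables are hypothetical; the sqs argument is expected to behave like a boto3 SQS client. Note that 900 seconds is also the maximum per-message delay SQS allows.

```python
import json

ANTI_FLAP_DELAY_SECONDS = 900  # 15 minutes; also the SQS per-message maximum


def enqueue_offline_device(sqs, queue_url, thing_name):
    """Scanner side: park the offline candidate in the delay queue."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"thingName": thing_name}),
        DelaySeconds=ANTI_FLAP_DELAY_SECONDS,
    )


def process_record(record, still_offline, create_incident):
    """Processor side: re-check, and create an incident only if still offline.

    `record` is one entry from the SQS event delivered to Lambda.
    Returns True if an incident was created.
    """
    thing_name = json.loads(record["body"])["thingName"]
    if still_offline(thing_name):
        create_incident(thing_name)
        return True
    # The device recovered during the delay, so the message is dropped.
    return False
```

Because the delay lives in the queue rather than in a sleeping function, no Lambda time is billed while the system waits to see whether the device reconnects.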

Dynamic Thing Groups for pre-filtering: Rather than scanning all 2,000 devices, the IoT Thing Group uses composite Fleet Index queries to pre-filter candidates, reducing Lambda execution time and API calls.
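One way such a group could be defined is sketched below as a builder for the arguments of boto3's iot.create_dynamic_thing_group. The group name and the extra attribute filter are hypothetical; the actual composite query is not given here.

```python
def offline_candidates_group(group_name, extra_filters=()):
    """Build keyword arguments for boto3's iot.create_dynamic_thing_group."""
    # Start from the connectivity condition and AND in any further filters.
    clauses = ["connectivity.connected:false", *extra_filters]
    return {
        "thingGroupName": group_name,
        "queryString": " AND ".join(clauses),
    }


# Hypothetical group name and attribute filter, for illustration only.
request = offline_candidates_group(
    "OfflineReaders", ["attributes.deviceType:rfid"]
)
```

IoT Core keeps the group membership up to date as the index changes, so the scanner only needs to list the group's members (for example with list_things_in_thing_group) instead of evaluating every device itself.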

DynamoDB TTL for data lifecycle: Incident records auto-expire after 548 days (1.5 years), flapping statistics after 90 days. No manual cleanup required.
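DynamoDB TTL expects a numeric attribute holding the expiry time in epoch seconds. A minimal sketch of computing that value with the retention periods above follows; the function name and the idea of a dedicated TTL attribute on each item are assumptions about the table design.

```python
from datetime import datetime, timedelta, timezone

INCIDENT_RETENTION = timedelta(days=548)       # 1.5 years
FLAPPING_STATS_RETENTION = timedelta(days=90)


def ttl_epoch(created_at: datetime, retention: timedelta) -> int:
    """Return the expiry time as epoch seconds, the format DynamoDB TTL expects."""
    return int((created_at + retention).timestamp())


# Usage: the value would be stored in the item's TTL attribute at write time.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
incident_expiry = ttl_epoch(created, INCIDENT_RETENTION)
flapping_expiry = ttl_epoch(created, FLAPPING_STATS_RETENTION)
```

Setting the value once when the item is written means expiry needs no scheduled job or extra Lambda.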

Regional separation: Certain country operations are excluded from the global pipeline for compliance reasons, with dedicated Fleet Metrics providing separate coverage.

Results

Metric	Value
Fleet coverage	~2,000 devices, 50+ countries
Monitoring interval	10–15 minutes
False positive reduction	90% via anti-flap delay
Cost vs vendor	20–40x cheaper
Operational overhead reduction	90%