
AWS IoT Automated Incident Management

Production serverless system monitoring ~2,000 RFID devices across 50+ countries. Real-time fleet health, automated incident detection with anti-flapping, and email notifications — all for ~$25/month.

Category: IoT
Region: eu-central-1
Tags (8 services): IoT Core, Lambda, DynamoDB, EventBridge, SQS, CloudWatch, SES, Fleet Indexing

Overview

A fully serverless incident management system monitoring approximately 2,000 RFID readers deployed across 50+ countries for a global postal logistics operator. The system provides three integrated capabilities: real-time fleet health monitoring, automated connectivity incident detection with false-positive prevention, and customer email notifications.

The entire solution runs on AWS serverless services in eu-central-1 (Frankfurt) at an operational cost of ~$25/month, compared to vendor alternatives costing 20–40x more.

The Challenge

Managing a global fleet of RFID devices across 50+ countries presents unique operational challenges. Devices go offline due to network issues, power outages, and hardware failures. Without automated monitoring, the operations team spent significant manual effort checking device status, investigating false alarms, and coordinating with postal operators.

Key problems included:

  • No visibility into fleet-wide health across all countries
  • False positives from devices that briefly disconnect and reconnect (flapping)
  • Manual incident tracking via spreadsheets and emails
  • Delayed notifications to postal operators about device issues
  • Vendor monitoring solutions costing $500–900/month with less integration

Architecture

The system consists of three interconnected pipelines sharing the AWS IoT Core foundation, orchestrated through event-driven patterns.

Pipeline 1: Fleet Monitoring

Devices publish shadow state updates every 10 minutes. Fleet Indexing aggregates this data into 35 custom CloudWatch metrics covering connectivity status, service health, and hardware issues. A single dashboard with 29 widgets provides complete operational visibility.

Metrics are separated by region — global operations and regional-specific operations — plus hardware health monitoring for antenna detection and NTP synchronization.
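A fleet metric of this kind could be defined roughly as follows. This is a minimal sketch, not the system's actual configuration: the helper `build_fleet_metric`, the metric name, and the choice of `registry.version` as the aggregation field are assumptions made for illustration. The 600-second period mirrors the 10-minute interval described above.

```python
from typing import Any


def build_fleet_metric(name: str, query: str, period_seconds: int = 600) -> dict[str, Any]:
    """Build keyword arguments for the AWS IoT CreateFleetMetric API.

    Hypothetical helper: the metric names and queries used by the real
    system are not given in the article.
    """
    # CreateFleetMetric accepts periods of 60..86400 seconds in multiples of 60.
    if period_seconds % 60 != 0 or not 60 <= period_seconds <= 86400:
        raise ValueError("period must be a multiple of 60 between 60 and 86400")
    return {
        "metricName": name,
        "queryString": query,
        # Count the things matching the query at each emission.
        "aggregationType": {"name": "Statistics", "values": ["count"]},
        # Assumption: a numeric field present on every thing.
        "aggregationField": "registry.version",
        "period": period_seconds,
    }


# Example: number of currently disconnected devices, emitted every 10 minutes.
disconnected_metric = build_fleet_metric(
    "DisconnectedDevices", "connectivity.connected:false"
)
```

With boto3, the resulting dictionary would be passed as `boto3.client("iot").create_fleet_metric(**disconnected_metric)`. A region-specific metric would presumably add an attribute filter to the query string.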

Pipeline 2: Incident Detection

The core of the system is a set of three Lambda functions with built-in anti-flapping logic:

  1. FlappingDetector runs every 30 minutes, analyzing CloudWatch Logs Insights for rapid connect/disconnect patterns
  2. GGOfflineScanner runs every 15 minutes, querying a dynamic IoT Thing Group for offline devices and sending them to an SQS delay queue with a 15-minute anti-flap delay
  3. GGDisconnectionProcessor processes delayed messages, re-checking connectivity before creating incidents — eliminating 90% of false positives
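The scanner and processor steps above can be sketched as two small functions. This is an illustrative outline under stated assumptions: the message body format, the function names, and the injected `is_connected` and `create_incident` callables are hypothetical, and a standard (non-FIFO) SQS queue is assumed so that a per-message `DelaySeconds` can be used.

```python
import json
from typing import Any, Callable, Iterable

# 15 minutes, which is also the maximum SQS allows for DelaySeconds.
ANTI_FLAP_DELAY_SECONDS = 900


def enqueue_offline_devices(sqs: Any, queue_url: str, thing_names: Iterable[str]) -> int:
    """Scanner side: queue each offline candidate with a per-message delay."""
    sent = 0
    for thing_name in thing_names:
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"thingName": thing_name}),  # assumed format
            DelaySeconds=ANTI_FLAP_DELAY_SECONDS,
        )
        sent += 1
    return sent


def process_delayed_message(
    body: str,
    is_connected: Callable[[str], bool],
    create_incident: Callable[[str], None],
) -> str:
    """Processor side: re-check connectivity once the delay has elapsed."""
    thing_name = json.loads(body)["thingName"]
    if is_connected(thing_name):
        # The device recovered during the delay, so no incident is created.
        return "discarded"
    create_incident(thing_name)
    return "incident_created"
```

Here `sqs` would be a `boto3.client("sqs")` instance. Passing the connectivity check and the incident writer in as callables keeps the decision logic testable without AWS access.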

The processor handles both GGv2 (Fleet Index API) and GGv1 (device shadow) connectivity checks, with smart detection for devices in migration between versions.
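A connectivity check covering both Greengrass versions might look like the sketch below. Several parts are assumptions rather than details from the original system: the `connected` field in the GGv1 shadow, the `version` argument (how the real processor decides which version a device runs is not described), and the rule that a migrating device counts as online if either source reports it connected.

```python
import json
from typing import Any


def is_connected_ggv2(iot: Any, thing_name: str) -> bool:
    """GGv2: ask the Fleet Index whether the thing is currently connected."""
    response = iot.search_index(
        queryString=f"thingName:{thing_name} AND connectivity.connected:true",
        maxResults=1,
    )
    return len(response.get("things", [])) > 0


def is_connected_ggv1(iot_data: Any, thing_name: str) -> bool:
    """GGv1: read the device shadow and inspect a reported flag."""
    response = iot_data.get_thing_shadow(thingName=thing_name)
    shadow = json.loads(response["payload"].read())
    # Assumption: the device reports a boolean "connected" field.
    return bool(shadow.get("state", {}).get("reported", {}).get("connected", False))


def is_connected(iot: Any, iot_data: Any, thing_name: str, version: str) -> bool:
    """Dispatch on Greengrass version: "v1", "v2", or anything else for migrating."""
    if version == "v2":
        return is_connected_ggv2(iot, thing_name)
    if version == "v1":
        return is_connected_ggv1(iot_data, thing_name)
    # Assumed migration rule: online if either source says so.
    return is_connected_ggv2(iot, thing_name) or is_connected_ggv1(iot_data, thing_name)
```

In this sketch `iot` stands for `boto3.client("iot")` and `iot_data` for `boto3.client("iot-data")`. Thing names containing characters that are special in the Fleet Indexing query syntax would need escaping, which is omitted here.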

Pipeline 3: Email Notifications

Automated customer communications via Amazon SES with intelligent batching:

  • New incidents: Batched by country and facility, sent 1 hour after detection
  • Weekly reminders: Up to 2 reminders for unresolved incidents
  • Recovery notifications: Sent 1 hour after resolution as thread replies
  • Escalation: After 2 unanswered reminders, escalates to a different contact
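The batching and follow-up rules listed above can be expressed in a few lines. This is a sketch only: the incident field names (`thing_name`, `country`, `facility`, `detected_at`) and both function names are hypothetical, and the real system's storage layout is not described.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_REMINDERS = 2  # weekly reminders before escalation


def batch_due_incidents(
    incidents: list[dict],
    now: datetime,
    hold: timedelta = timedelta(hours=1),
) -> dict[tuple[str, str], list[str]]:
    """Group incidents older than the hold period by (country, facility)."""
    batches: dict[tuple[str, str], list[str]] = defaultdict(list)
    for incident in incidents:
        if now - incident["detected_at"] >= hold:
            key = (incident["country"], incident["facility"])
            batches[key].append(incident["thing_name"])
    return dict(batches)


def next_follow_up(reminders_sent: int) -> str:
    """Decide the next step for an unresolved, unanswered incident."""
    return "reminder" if reminders_sent < MAX_REMINDERS else "escalate"
```

Each key in the returned dictionary corresponds to one outgoing email, so a facility with several offline readers receives a single message instead of one per device.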

All emails maintain conversation threading using SES Message-ID and In-Reply-To headers.
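Threading of this kind is typically achieved by pointing a reply's In-Reply-To and References headers at the Message-ID of the original email. The sketch below builds such a reply with Python's standard `email` package. The function name, the plain-text fallback, and the subject handling are assumptions, and the original Message-ID is taken as an input that the caller has stored from the first send.

```python
from email.message import EmailMessage


def build_threaded_reply(
    original_message_id: str,
    sender: str,
    recipient: str,
    subject: str,
    html_body: str,
) -> EmailMessage:
    """Build an HTML reply that mail clients group with the original message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Keep the subject stable so clients that thread by subject also match.
    msg["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"
    # Both headers reference the Message-ID of the email being replied to.
    msg["In-Reply-To"] = original_message_id
    msg["References"] = original_message_id
    msg.set_content("This notification is best viewed in an HTML-capable client.")
    msg.add_alternative(html_body, subtype="html")
    return msg
```

A message built this way could be sent through SES with `boto3.client("ses").send_raw_email(RawMessage={"Data": msg.as_bytes()})`, since custom headers require the raw-email path.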

AWS Services

| Service | Purpose |
| --- | --- |
| IoT Core | Device connectivity, shadows, MQTT, presence events |
| Fleet Indexing | Device data aggregation with REGISTRY_AND_SHADOW mode |
| Fleet Metrics | 35 automated CloudWatch metrics at 10-minute intervals |
| Lambda | 3 Python functions (256–512 MB memory, 3–5 min timeout) |
| EventBridge | 3 scheduled rules coordinating all pipelines |
| SQS | Anti-flap delay queue with 900 s per-message delay |
| DynamoDB | Incident records, flapping stats, notification tracking |
| CloudWatch | Metrics, logs, and 29-widget operational dashboard |
| SES | Threaded HTML email notifications with batching |

Cost

The entire solution runs at approximately $25/month, using: Lambda, DynamoDB, SQS, IoT Core (shadows, Fleet Indexing, rules), EventBridge, CloudWatch (metrics, logs, dashboard), and SES.

Comparable vendor monitoring solutions for a fleet of this size cost $500–900/month — making this custom serverless solution 20–40x more cost-effective.

Key Design Decisions

Anti-flapping with SQS delay: Instead of immediately creating incidents for offline devices, messages sit in SQS for 15 minutes. The processor then re-checks connectivity — if the device recovered, the message is discarded. This single pattern eliminated 90% of false positives.

Dynamic Thing Groups for pre-filtering: Rather than scanning all 2,000 devices, the IoT Thing Group uses composite Fleet Index queries to pre-filter candidates, reducing Lambda execution time and API calls.
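Reading the members of such a group could look like the following sketch. The query string and the `deviceType` attribute are invented for illustration, since the article does not give the actual composite query. The listing function only assumes that the group already exists.

```python
from typing import Any

# Hypothetical composite query; the real one is not given in the article.
OFFLINE_CANDIDATES_QUERY = (
    "connectivity.connected:false AND attributes.deviceType:rfid-reader"
)


def list_offline_candidates(iot: Any, group_name: str) -> list[str]:
    """Return all thing names in a (dynamic) thing group, following pagination."""
    names: list[str] = []
    kwargs: dict[str, Any] = {"thingGroupName": group_name, "maxResults": 250}
    while True:
        page = iot.list_things_in_thing_group(**kwargs)
        names.extend(page.get("things", []))
        token = page.get("nextToken")
        if not token:
            return names
        kwargs["nextToken"] = token
```

The group itself would be created once with `iot.create_dynamic_thing_group(thingGroupName=..., queryString=OFFLINE_CANDIDATES_QUERY)`. After that, AWS IoT keeps membership up to date, so the scanner only pages through devices that already match the query.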

DynamoDB TTL for data lifecycle: Incident records auto-expire after 548 days (1.5 years), flapping statistics after 90 days. No manual cleanup required.
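DynamoDB TTL expects a Number attribute holding the expiry time as a Unix epoch timestamp in seconds. A minimal sketch of how the two retention periods could be computed follows. The attribute name `expires_at` and the item layout are assumptions.

```python
from datetime import datetime, timedelta, timezone

INCIDENT_TTL_DAYS = 548  # about 1.5 years
FLAPPING_TTL_DAYS = 90


def ttl_epoch(created_at: datetime, days: int) -> int:
    """Return the expiry time as epoch seconds, as DynamoDB TTL expects."""
    return int((created_at + timedelta(days=days)).timestamp())


def incident_item(incident_id: str, thing_name: str, created_at: datetime) -> dict:
    """Build an incident record carrying its own expiry (hypothetical layout)."""
    return {
        "incident_id": incident_id,
        "thing_name": thing_name,
        "created_at": created_at.isoformat(),
        "expires_at": ttl_epoch(created_at, INCIDENT_TTL_DAYS),
    }


example = incident_item("inc-001", "reader-1", datetime(2024, 1, 1, tzinfo=timezone.utc))
```

TTL would be enabled once per table on the chosen attribute. DynamoDB then deletes expired items in the background, usually some time after the timestamp passes and not at the exact second.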

Regional separation: Certain country operations are excluded from the global pipeline for compliance reasons, with dedicated Fleet Metrics providing separate coverage.

Results

| Metric | Value |
| --- | --- |
| Fleet coverage | ~2,000 devices, 50+ countries |
| Monitoring interval | 10–15 minutes |
| False positive reduction | 90% via anti-flap delay |
| Cost vs vendor | 20–40x cheaper |
| Operational overhead reduction | 90% |