AWS · IoT · Serverless · Cost Optimization

How We Built a $25/Month IoT Monitoring System for 2,000 Devices

15 March 2025

The Problem

Managing 2,000 RFID readers across 50+ countries means devices go offline constantly — network issues, power outages, hardware failures. Without automation, the operations team was handling manual status checks, false alarms, and email chains with postal operators across dozens of countries.

The vendor offered a monitoring solution. Price tag: $500–900/month. And it wouldn't even integrate with the existing AWS IoT Core infrastructure.

We built something better for ~$25/month.

The Architecture

The system runs entirely on AWS serverless services in eu-central-1 (Frankfurt). Three pipelines handle everything:

Fleet Monitoring

Every 10 minutes, devices publish shadow updates to AWS IoT Core. Fleet Indexing aggregates this into 18 custom CloudWatch metrics — connectivity status, service health, antenna detection, NTP sync. A single dashboard with 29 widgets gives the operations team complete visibility across every country.
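A fleet-level metric is just an aggregate number published per dimension (e.g. per country) rather than per device. A minimal sketch of what one datum might look like — the namespace, metric name, and dimension are illustrative, not the production names:

```python
from datetime import datetime, timezone

# Hypothetical namespace; the real dashboard aggregates 18 such metrics.
FLEET_NAMESPACE = "Fleet/Monitoring"

def fleet_metric(name: str, value: float, country: str) -> dict:
    """Build one CloudWatch PutMetricData datum for a fleet-level metric."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Country", "Value": country}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": value,
        "Unit": "Count",
    }

def publish(metrics: list) -> None:
    """Push a batch of fleet metrics to CloudWatch."""
    import boto3  # deferred so fleet_metric stays testable offline
    boto3.client("cloudwatch").put_metric_data(
        Namespace=FLEET_NAMESPACE, MetricData=metrics
    )
```

Publishing one aggregate per country instead of one per device is what keeps the CloudWatch bill flat as the fleet grows.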

Automated Incident Detection

The naive approach — "device offline = create incident" — generates massive false positives. Devices flap: they disconnect briefly during network blips and come right back. Creating an incident for every transient disconnection is noise, not signal.

Our solution: a dual-Lambda architecture with SQS anti-flap delay.

  1. Every 15 minutes, a scanner checks for offline devices using a Dynamic Thing Group (pre-filtered by Fleet Index queries — no scanning all 2,000 devices)
  2. Offline devices get sent to an SQS queue with a 15-minute per-message delay
  3. After 15 minutes, a processor Lambda re-checks connectivity
  4. If the device is still offline → create incident with full metadata
  5. If it recovered → discard the message silently

This single pattern eliminated 90% of false positives.
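The two halves of the pattern can be sketched in a few lines. The scanner defers the decision with an SQS per-message delay (900 seconds is also SQS's maximum `DelaySeconds`, which is why the window is 15 minutes); the processor re-checks before committing to an incident. Function and field names here are illustrative:

```python
import json

ANTI_FLAP_DELAY_SECONDS = 15 * 60  # 900s is the SQS per-message maximum

def enqueue_offline(sqs, queue_url: str, thing_name: str) -> dict:
    """Scanner side: defer the incident decision by 15 minutes."""
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"thingName": thing_name}),
        DelaySeconds=ANTI_FLAP_DELAY_SECONDS,
    )

def handle_delayed(message_body: str, is_connected) -> str:
    """Processor side: only still-offline devices become incidents."""
    thing = json.loads(message_body)["thingName"]
    if is_connected(thing):
        return "discard"          # device recovered during the delay window
    return "create_incident"      # still offline after 15 minutes
```

Because the delay lives on the message itself, there is no timer infrastructure to operate — the queue *is* the timer.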

The processor also handles mixed Greengrass versions — GGv2 uses Fleet Index API, GGv1 uses device shadow, and devices in migration get both checks with smart fallback.
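The version-aware check might look roughly like this — the lookups are caller-supplied so the routing logic stays independent of the AWS SDK, and the `LookupError` convention for "no data" is an assumption of this sketch:

```python
def device_connected(thing_name: str, gg_version: str, fleet_index, shadow) -> bool:
    """Route the connectivity check by Greengrass version.

    fleet_index(thing) and shadow(thing) return True/False, or raise
    LookupError when no data is available for that device.
    """
    if gg_version == "v2":
        return fleet_index(thing_name)       # GGv2: Fleet Index is authoritative
    if gg_version == "v1":
        return shadow(thing_name)            # GGv1: device shadow only
    # migration: prefer Fleet Index, fall back to shadow, flag disagreements
    try:
        fi = fleet_index(thing_name)
    except LookupError:
        return shadow(thing_name)
    sh = shadow(thing_name)
    if fi != sh:
        print(f"inconsistent state for {thing_name}: index={fi} shadow={sh}")
    return fi
```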

Email Notifications

Automated SES emails are batched by country and facility. New incidents are notified after 1 hour rather than immediately, leaving time for auto-recovery. Unresolved issues get weekly reminders, recovery notifications go out as thread replies, and two unanswered reminders trigger escalation.

All emails maintain conversation threading via SES Message-ID and In-Reply-To headers — operators see one clean email thread per incident group, not a flood of disconnected messages.
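Threading works by giving the first email in an incident group a stable `Message-ID` and pointing every follow-up at it via `In-Reply-To` and `References`. A sketch using the standard library's `EmailMessage` (the domain and ID scheme are hypothetical; SES sends the result via its raw-email API):

```python
from email.message import EmailMessage

def incident_email(subject: str, body: str, incident_id: str,
                   parent_message_id: str = None) -> EmailMessage:
    """Build a MIME message whose headers keep the thread together."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # Stable per-incident Message-ID; domain is illustrative.
    msg["Message-ID"] = f"<incident-{incident_id}@example.com>"
    if parent_message_id:
        # Follow-ups reference the original, so clients stack them.
        msg["In-Reply-To"] = parent_message_id
        msg["References"] = parent_message_id
    msg.set_content(body)
    return msg
```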

The Cost

The entire solution costs approximately $25/month, using: Lambda, DynamoDB, SQS, IoT Core (shadows, Fleet Indexing, rules), EventBridge, CloudWatch (metrics, logs, dashboard), and SES.

Under $300/year — compared to $6,000–$10,800/year for vendor alternatives. That's 20–40x more cost-effective, with deeper integration into the existing IoT infrastructure.

Key Engineering Decisions

Why SQS delay instead of Step Functions wait? Cost and simplicity. SQS per-message delay costs nothing extra — you're already paying for the message. Step Functions would charge per state transition and add operational complexity for what is essentially a timer.

Why DynamoDB over RDS? The access patterns are simple: write incidents, query by status, query by device. DynamoDB's on-demand billing means we pay only for actual operations. TTL auto-expires records after 548 days — zero maintenance.
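TTL-based cleanup means every incident record carries its own expiry: an epoch-seconds attribute that DynamoDB's TTL feature reads and deletes on. A sketch of one record — the key schema and attribute names are assumptions, not the production table design:

```python
import time

TTL_DAYS = 548  # ~18 months of retention, then DynamoDB deletes the item

def incident_item(device_id: str, status: str, now: float = None) -> dict:
    """Shape of one incident record with a TTL expiry attribute."""
    now = time.time() if now is None else now
    return {
        "pk": f"DEVICE#{device_id}",          # partition key: query by device
        "sk": f"INCIDENT#{int(now)}",         # sort key: incidents in time order
        "status": status,                     # e.g. OPEN / RECOVERED
        "ttl": int(now) + TTL_DAYS * 24 * 3600,
    }
```

Since TTL deletion is a built-in table feature, there is no cleanup job to schedule, monitor, or pay for.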

Why fleet-level metrics instead of per-device? At $0.30/metric/month, per-device metrics for 2,000 devices would cost $600/month alone. Fleet aggregate metrics give the operations team what they actually need — overall health and trends — at a fraction of the cost.

What We Learned

Fleet Index queries are powerful but have limits. Composite queries combining connectivity status, shadow fields, and metadata work well for pre-filtering, but the query language has no support for time-based conditions. Flapping detection had to use CloudWatch Logs Insights instead.
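For a flavour of what a composite pre-filter looks like, here is a query string of the kind Fleet Indexing accepts — boolean clauses over connectivity, thing name, and reported shadow fields. The shadow field name here is hypothetical; note there is no clause like "offline for more than N minutes", which is exactly the gap that pushed flapping detection into CloudWatch Logs Insights:

```python
# Composite Fleet Indexing query used as a Dynamic Thing Group filter;
# "serviceStatus" is an illustrative shadow field, not the real schema.
OFFLINE_QUERY = (
    "connectivity.connected:false"
    " AND thingName:reader-*"
    " AND shadow.reported.serviceStatus:*"
)
```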

GGv1 to GGv2 migration creates edge cases. Devices in transition report different connectivity states through different APIs. We built a version-aware checker that tries Fleet Index first, falls back to device shadow, and logs inconsistencies for manual review.

CloudWatch custom metrics pricing is predictable but needs planning. 18 metrics at fleet level is affordable. Designing the metric structure upfront — what to measure and at what granularity — prevents cost surprises later.

Results

  • 90% reduction in false positive incidents
  • 20–40x cost savings compared to vendor alternatives
  • 15-minute end-to-end incident detection latency
  • Zero maintenance — fully automated with DynamoDB TTL cleanup
  • Full observability — 29-widget dashboard covering every country and device type

This architecture was designed and deployed by NG Solutions for a global postal logistics operator. Want to build something similar? Get in touch.