AMR Fleet Data Infrastructure at Scale: Four Problems and How to Solve Them
You deployed ten robots. The system worked. Now you have a hundred, and your engineering team is complaining about things that didn't exist six months ago: storage filling up on edge devices, cloud bills growing faster than the fleet, retraining cycles that take weeks instead of days, and incident investigations that lack the data they need.
This is a predictable consequence of crossing a scaling threshold that almost every AMR operator hits.
The Inflection Point
A single AMR collecting camera frames, LiDAR scans, and control logs generates somewhere between 30 and 100 GB per shift, depending on sensor configuration and logging frequency. At ten robots, that's manageable. A custom script, a shared drive, maybe an S3 bucket — it works well enough that nobody questions it.
At one hundred robots, you're collecting 3–10 TB per shift. The custom script doesn't scale. The shared drive is full. The S3 bucket costs more than expected. And none of these tools were designed for the actual shape of the data: binary sensor payloads that are too large for a time-series database and too time-sensitive for a generic object store.
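The arithmetic behind these figures is easy to rerun for your own fleet. A minimal sketch, using the 30 to 100 GB per robot per shift range quoted above and decimal units (1 TB = 1,000 GB); the function name is illustrative:

```python
def fleet_volume_tb(robots: int, gb_per_robot_shift: float) -> float:
    """Raw data volume per shift for the whole fleet, in TB (1 TB = 1000 GB)."""
    return robots * gb_per_robot_shift / 1000.0


for robots in (10, 100):
    low = fleet_volume_tb(robots, 30)    # light sensor configuration
    high = fleet_volume_tb(robots, 100)  # heavy sensor configuration
    print(f"{robots:>4} robots: {low:g} to {high:g} TB per shift")
```

Substitute your own per-robot figure; the point is that the total grows linearly with fleet size while the tooling around it usually does not.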
The failures start slowly and accumulate over time.
Some teams sidestep this entirely by processing at the edge: extracting metrics, downsampling, and discarding the raw frames before they ever leave the robot. That keeps storage small and transfer costs low. The tradeoff is irreversibility: once the raw record is gone, it cannot be recovered. When a model needs retraining on a different feature representation, when an incident investigation requires the original sensor data, or when a regulator asks for the record that predates the processed output, there is nothing to retrieve.
The infrastructure problem described below is the cost of doing something valuable: keeping the raw data that makes your fleet improvable over time.
Four Problems That Grow With the Fleet
Storage overflow at the edge. Most AMR deployments use some form of local storage on the robot itself to buffer data before it transfers to a central server or cloud. When that buffer fills and there is no intelligent retention policy, robots start dropping records. The newest data overwrites the oldest without any selection logic. By the time an incident occurs and someone asks for the sensor data from the previous three shifts, it is gone.
Cloud transfer cost explosion. Object storage is cheap per GB. What is not cheap is the API call volume that high-frequency sensor data generates. A fleet writing thousands of small records per second to S3 can generate millions of PUT requests per day. Requests are billed per call rather than per byte, so at standard pricing that cost grows with the number of records written, not the volume stored, and it scales with fleet size in a way that storage cost alone does not.
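To see how request volume turns into a bill, here is a rough estimate. It is a sketch under stated assumptions: the price of 0.005 USD per 1,000 PUT requests and the fleet-wide rate of 2,000 records per second are illustrative inputs, not measurements, so check your provider's current price list before relying on the numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: USD per 1,000 PUT requests. Verify against your provider's pricing.
PUT_PRICE_PER_1000 = 0.005


def daily_put_requests(records_per_second: float, records_per_object: int = 1) -> float:
    """PUT requests per day when `records_per_object` records share one upload."""
    return records_per_second * SECONDS_PER_DAY / records_per_object


def daily_put_cost(records_per_second: float, records_per_object: int = 1) -> float:
    """Request charges per day in USD, ignoring storage and egress."""
    requests = daily_put_requests(records_per_second, records_per_object)
    return requests / 1000 * PUT_PRICE_PER_1000


fleet_rate = 2000  # hypothetical: records per second across the whole fleet

for batch in (1, 1000):
    requests = daily_put_requests(fleet_rate, batch)
    cost = daily_put_cost(fleet_rate, batch)
    print(f"{batch:>5} records per object: {requests:,.0f} PUTs/day, ${cost:,.2f}/day")
```

Under these assumptions the unbatched fleet issues about 173 million PUT requests per day. Grouping 1,000 records into each object cuts the request count, and the request charge, by the same factor of 1,000.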
Model drift without a feedback loop. Retraining a perception model requires labelled examples of edge cases: near-misses, unusual lighting, unexpected obstacles, novel environments. Those examples only exist in operational sensor data. If the data pipeline doesn't preserve the right records — specifically, the records around anomalous events — retraining becomes either impossible or biased toward common cases. The model drifts. Failure rates climb slowly enough that the connection to data quality is rarely made until much later.
Audit gaps when incidents happen. Regulatory and insurance requirements for autonomous systems are tightening, particularly in the EU under the Cyber Resilience Act and related frameworks. An incident investigation requires a continuous, tamper-evident record: what the robot saw, what it decided, what happened next. If records were dropped because local storage filled, or if the audit trail has gaps because replication wasn't reliable, that investigation stalls — or produces a conclusion that is hard to defend.
What the Infrastructure Actually Needs to Do
A requirements framework for AMR data infrastructure has three layers, and most off-the-shelf tools cover at most one of them.
Edge durability with intelligent retention. The robot's local storage needs a FIFO-by-volume quota: when the disk fills, the oldest records roll off automatically, keeping the system stable regardless of how long a mission runs or how long connectivity is interrupted. Critically, high-priority records — fault windows, flagged events, anything tagged for retraining — need to be preserved ahead of routine telemetry. This is not what generic object stores do.
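The retention behaviour described above can be sketched in a few lines. This is an in-memory illustration, not how any particular storage engine implements it; `Record`, `EdgeBuffer`, and the `priority` flag are hypothetical names. One design choice is made explicit here: if the quota is exceeded and only priority records remain, the oldest priority record is evicted so the disk never overfills.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Record:
    timestamp_us: int
    payload: bytes
    priority: bool = False  # True for fault windows, flagged events, training candidates


class EdgeBuffer:
    """FIFO-by-volume buffer that evicts routine records before priority ones."""

    def __init__(self, quota_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        # Two queues, each ordered oldest first.
        self._routine = deque()
        self._priority = deque()

    def write(self, record: Record) -> None:
        queue = self._priority if record.priority else self._routine
        queue.append(record)
        self.used_bytes += len(record.payload)
        self._evict()

    def _evict(self) -> None:
        while self.used_bytes > self.quota_bytes and (self._routine or self._priority):
            # Routine telemetry rolls off first; priority records go only as a last resort.
            queue = self._routine if self._routine else self._priority
            oldest = queue.popleft()
            self.used_bytes -= len(oldest.payload)

    def timestamps(self) -> list:
        """Timestamps of all retained records, oldest first."""
        return sorted(r.timestamp_us for r in (*self._routine, *self._priority))


buf = EdgeBuffer(quota_bytes=300)
buf.write(Record(1, b"x" * 100))
buf.write(Record(2, b"x" * 100, priority=True))  # e.g. a fault window
buf.write(Record(3, b"x" * 100))
buf.write(Record(4, b"x" * 100))  # over quota: record 1, the oldest routine one, rolls off
print(buf.timestamps())  # [2, 3, 4]
```

The priority record at timestamp 2 outlives the routine records written around it, which is the property a plain ring buffer does not give you.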
Selective replication, not full sync. Sending everything to the cloud is how you generate the cost explosion described above. The alternative is conditional replication: a filter that runs on labels attached to records at ingest time, and sends only the subset that matches. Fault events go through. Routine navigation frames from uneventful shifts stay on the edge until the FIFO rolls them off. The bandwidth bill reflects the operational value of the data, not its raw volume.
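A label-based replication filter might look like the following. It is a sketch only: the label names (`event`, `training_candidate`), their values, and the rule format are assumptions made for illustration, and a real deployment would use whatever query syntax its storage engine provides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledRecord:
    timestamp_us: int
    payload: bytes
    labels: dict = field(default_factory=dict)  # attached at ingest time


# Hypothetical rule: replicate a record if any listed label has one of the listed values.
REPLICATION_RULE = {
    "event": {"fault", "near_miss"},
    "training_candidate": {"true"},
}


def should_replicate(record: LabeledRecord, rule: dict) -> bool:
    """True if at least one label on the record matches the rule."""
    return any(record.labels.get(name) in accepted for name, accepted in rule.items())


records = [
    LabeledRecord(1, b"a" * 500, {"event": "none"}),
    LabeledRecord(2, b"b" * 500, {"event": "fault"}),
    LabeledRecord(3, b"c" * 500, {"event": "none", "training_candidate": "true"}),
    LabeledRecord(4, b"d" * 500),  # unlabeled routine frame
]

to_cloud = [r for r in records if should_replicate(r, REPLICATION_RULE)]
sent = sum(len(r.payload) for r in to_cloud)
total = sum(len(r.payload) for r in records)
print(f"replicating {len(to_cloud)} of {len(records)} records ({sent} of {total} bytes)")
```

Because the decision depends only on labels, it can run on the robot without decoding the binary payload.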
Time-indexed auditability across tiers. Every record needs a precise timestamp and an immutable chain from the sensor to wherever it ends up. When an incident investigation asks "what did robot 47 see at 14:23:07 on Thursday," the answer should be retrievable in seconds — not reconstructed from fragmented logs across three storage systems.
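One common way to make a record sequence tamper-evident is a hash chain, where each digest covers the previous digest plus the current timestamp and payload. The sketch below uses SHA-256 from Python's standard library; it is one possible technique, not a description of any specific product, and the function names are illustrative. It only helps if the latest digest is also kept somewhere the writer cannot silently rewrite.

```python
import hashlib

GENESIS = b"\x00" * 32  # fixed starting value for the first link


def chain_digest(prev_digest: bytes, timestamp_us: int, payload: bytes) -> bytes:
    """Digest over the previous link, the timestamp, and the payload."""
    h = hashlib.sha256()
    h.update(prev_digest)
    h.update(timestamp_us.to_bytes(8, "big"))
    h.update(payload)
    return h.digest()


def build_chain(records: list) -> list:
    """records: list of (timestamp_us, payload). Returns one digest per record."""
    digests, prev = [], GENESIS
    for timestamp_us, payload in records:
        prev = chain_digest(prev, timestamp_us, payload)
        digests.append(prev)
    return digests


def first_mismatch(records: list, digests: list):
    """Index of the first record that breaks the chain, or None if it is intact."""
    prev = GENESIS
    for i, ((timestamp_us, payload), expected) in enumerate(zip(records, digests)):
        prev = chain_digest(prev, timestamp_us, payload)
        if prev != expected:
            return i
    if len(records) != len(digests):
        return min(len(records), len(digests))  # records were added or truncated
    return None


log = [(1_000_000, b"scan-a"), (1_050_000, b"scan-b"), (1_100_000, b"scan-c")]
digests = build_chain(log)

print(first_mismatch(log, digests))          # None: the record is intact
tampered = [log[0], (1_050_000, b"edited"), log[2]]
print(first_mismatch(tampered, digests))     # 1: the second record was altered
print(first_mismatch([log[0], log[2]], digests))  # 1: a record was dropped
```

Both an edited payload and a dropped record surface as a break at a specific index, which is what lets an investigator state where the record stops being trustworthy.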
A Self-Assessment for Operations Teams
Before evaluating vendors, run these checks against your current setup:
- If a robot loses connectivity for 8 hours, does it continue logging without dropping records? What happens when it reconnects?
- Can you retrieve the sensor data from a specific 30-second window on a specific robot from 6 weeks ago?
- Do you know which records from last month's fleet operation are currently stored in the cloud, and why those ones specifically?
- If a model retraining job requires 500 labelled fault examples from the past 90 days, how long does it take to locate and export them?
- In the event of a fleet incident, what is your maximum gap in the sensor record?
If any of these questions produces a slow or uncertain answer, the infrastructure has a gap. The question is whether that gap is large enough to matter at your current fleet size and whether it will still be acceptable when the fleet doubles again.
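Two of these checks, the 30-second window and the maximum gap, can be measured directly if you can export the record timestamps for one robot. A minimal sketch, assuming a sorted list of timestamps in microseconds; the function names and sample values are illustrative:

```python
from bisect import bisect_left


def window(timestamps_us: list, start_us: int, stop_us: int) -> list:
    """Timestamps in the half-open interval [start_us, stop_us); input must be sorted."""
    lo = bisect_left(timestamps_us, start_us)
    hi = bisect_left(timestamps_us, stop_us)
    return timestamps_us[lo:hi]


def max_gap_us(timestamps_us: list) -> int:
    """Largest interval between consecutive records; 0 if fewer than two records."""
    if len(timestamps_us) < 2:
        return 0
    return max(b - a for a, b in zip(timestamps_us, timestamps_us[1:]))


# Hypothetical export: one record every 100 ms, with records missing between 0.5 s and 2.0 s.
stamps = [t for t in range(0, 3_000_000, 100_000) if not 500_000 < t < 2_000_000]

print(len(window(stamps, 0, 1_000_000)), "records in the first second")
print(max_gap_us(stamps) / 1_000_000, "seconds is the largest gap")
```

Run against real exports, a maximum gap far above the nominal logging interval points to dropped records rather than a quiet sensor.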
If you want to walk through these questions against your current setup, we offer a free architecture review.
What Good Looks Like
A well-designed AMR data stack has the same shape at 10 robots as at 1,000, because the architecture scales rather than requiring a rebuild at each inflection point.
On each robot: a lightweight storage engine with FIFO-by-volume quotas, label-based filtering at ingest, and continuous local buffering. When the robot is connected, it replicates only the records that match a pre-configured conditional query — fault windows, flagged events, training candidates. When it's not connected, it keeps logging.
At the fleet server tier: aggregated data from all robots, with the same query interface. Engineers can pull a specific time window from a specific robot without knowing which physical server holds it.
In the cloud: a cost-efficient long-term store using batched writes to object storage, with lifecycle-aware tiering that moves infrequently accessed data to cold storage automatically. The same query interface applies.
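Batched writes can be sketched as a small accumulator that packs many records into one object before uploading. Everything here is an assumption made for illustration: the `BatchWriter` class, the 12-byte record header, and the object key format are hypothetical, and the `upload` callable stands in for a real object-store call.

```python
import struct
from typing import Callable, Optional

# Per-record header: timestamp in microseconds (8 bytes) and payload length (4 bytes).
HEADER = struct.Struct(">QI")


class BatchWriter:
    """Accumulates records and uploads them as one object per batch."""

    def __init__(self, upload: Callable[[str, bytes], None], max_bytes: int) -> None:
        self._upload = upload        # called as upload(key, body)
        self._max_bytes = max_bytes  # flush once the pending batch reaches this size
        self._parts = []
        self._size = 0
        self._first_ts: Optional[int] = None

    def add(self, timestamp_us: int, payload: bytes) -> None:
        if self._first_ts is None:
            self._first_ts = timestamp_us
        self._parts.append(HEADER.pack(timestamp_us, len(payload)) + payload)
        self._size += HEADER.size + len(payload)
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        """Upload the pending batch, if any, keyed by its first timestamp."""
        if not self._parts:
            return
        self._upload(f"batch-{self._first_ts}.bin", b"".join(self._parts))
        self._parts, self._size, self._first_ts = [], 0, None


store = {}  # a dict stands in for the object store in this sketch
writer = BatchWriter(upload=store.__setitem__, max_bytes=1024)
for ts in range(100):
    writer.add(ts, b"x" * 100)
writer.flush()  # push any partial batch at shutdown
print(len(store), "objects written for 100 records")
```

The batch size is a tradeoff: larger batches mean fewer requests but more data at risk if the process stops before a flush, so the threshold should reflect how much unreplicated data you can tolerate losing.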
Tools like ReductStore are built specifically for this architecture: binary payloads stored with microsecond-precision timestamps and labels, replicated conditionally across tiers, queryable by time window and label combination from any tier. The same model from edge to cloud, with no translation layer between them.
The Board Framing
If you need to explain this problem upward, the frame is straightforward: the data your fleet generates is the feedstock for every improvement you will make to it — model updates, route optimization, incident response, regulatory compliance. Infrastructure that drops records, transfers data indiscriminately, or lacks a continuous audit trail is eroding the value of the fleet as it scales, silently and in advance of any visible failure.
The cost of fixing it before the fleet grows is a fraction of the cost of rebuilding the pipeline after it breaks. And rebuilding under pressure, after an incident, is when mistakes get made.
