What a Decade of ML Infrastructure Taught Me About LLMs

I have spent close to a decade working on ML infrastructure, and by 2018, before MLOps was a job title most people recognized, the work already followed a familiar pattern: migrating pipelines into cloud environments, wiring together GPU clusters, and making autoscaling behave sensibly for workloads that spiked unpredictably. The models were rarely the hard part. Getting them to run reliably, cheaply, and observably in production was.

A few years ago I founded HADI Technology, a consultancy where that work continued across production AI platforms, model serving systems, and agent infrastructure. Over time, engagements shifted increasingly into LLM-based systems: serving open-source models on vLLM, building agent infrastructure, and operating AI-driven platforms at scale. What became clear through that transition is that the problems do not change as much as the current hype suggests. They evolve. They get harder in specific ways. But the years spent wrestling with traditional ML infrastructure turned out to be more relevant than I expected.

| | Traditional ML | With LLMs |
| --- | --- | --- |
| Serving Latency | Predictable for a given input profile, with throughput and queueing behavior that can usually be modeled up front. | Completion length becomes a runtime variable, so requests that look identical can finish on very different timelines. |
| Reproducibility | Code, data, and environment drift are the main causes of silent failures across retraining pipelines. | Prompts, model versions, tool contracts, and context handling drift too, often without throwing an exception. |
| Observability | Latency, errors, and output distribution usually tell you whether the system is healthy. | Infrastructure metrics stay necessary, but output quality and tool correctness need their own instrumentation layer. |
| Cost | Mostly a provisioning and utilization problem, with spend driven by cluster sizing decisions. | Token volume, retry loops, and model choice pull cost governance up into the application layer. |
| Failure Modes | Failures tend to be loud: crashed jobs, bad responses, broken pipelines, and obvious traces. | Failures can propagate quietly through reasoning steps, producing plausible outputs on top of bad intermediate state. |

A compact summary of what carries over from classical ML infrastructure, and what changes once reasoning-heavy LLM systems enter the picture.

Here is what transferred, and where the gaps are.


Serving Latency

In traditional ML, latency was usually predictable enough to profile. For a given model and input size, you could reason about throughput, queue depth, and resource saturation with reasonable confidence. At Buzz Indexes, designing autoscaling ECS clusters for GPU-intensive ML tasks meant developing precise instincts about when to scale out versus when to buffer. That reasoning still holds.

With LLMs, the variable that breaks those instincts is output length. It is determined at runtime, which means a batch of requests that looks uniform at ingress can produce wildly different completion times depending on how the model responds. SLA guarantees and timeout handling become considerably more nuanced than they were for a fixed-output classifier. The habit of tracking p95 and p99 latency separately from p50, built over years of operating ML endpoints, turns out to matter even more, not less.
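To make the percentile habit concrete, here is a minimal sketch of how runtime-determined completion length spreads the latency distribution. The prefill and per-token decode costs are assumed round numbers for illustration, not measurements from any real deployment:

```python
import random

def percentile(samples, q):
    """Nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, round(q / 100 * len(ordered)) - 1))
    return ordered[idx]

# Hypothetical latency model: a fixed prompt-processing cost plus a
# per-token decode cost. Completion length is unknown at ingress, so
# requests that look identical can differ by an order of magnitude.
random.seed(0)
PREFILL_MS = 50       # assumed fixed prefill cost
MS_PER_TOKEN = 20     # assumed per-token decode cost

latencies = []
for _ in range(1000):
    completion_tokens = random.randint(5, 800)  # runtime variable
    latencies.append(PREFILL_MS + completion_tokens * MS_PER_TOKEN)

p50, p95, p99 = (percentile(latencies, q) for q in (50, 95, 99))
print(f"p50={p50}ms  p95={p95}ms  p99={p99}ms")
```

Under this toy model the tail percentiles sit far from the median even though every request arrives with the same input profile, which is exactly why p95 and p99 deserve their own alerts.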

Pipeline Reproducibility

In traditional ML, one of the hardest lessons from retraining pipelines at BorealisAI was that environment drift is among the most common sources of silent failure. A pipeline that ran cleanly three months ago, run again today with subtly different library versions or data distributions, can produce a model that looks fine until it reaches production. Treating pipelines as reproducible artifacts, with pinned environments, versioned data, and logged parameters, was not a best practice. It was survival.

With LLMs, that discipline carries over directly, but the surface area for drift is considerably larger. In a traditional pipeline, the training code and data are the main variables. In an LLM-based system, the prompt, model version, context window handling, tool definitions, and the behavior of any external APIs those tools call are all variables too. Any of them can shift between runs and produce meaningfully different outputs without raising a single exception. The mlops-blueprint was an attempt to codify what end-to-end reproducibility looks like. The same thinking applies directly to LLM pipelines.
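One way to make that larger drift surface visible is to fingerprint everything that can change an output, not just code and data. The manifest fields and values below are illustrative, not a standard format:

```python
import hashlib
import json

def run_fingerprint(manifest: dict) -> str:
    """Stable hash over everything that can change an LLM pipeline's
    output. Two runs with different fingerprints are not comparable."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Illustrative manifest: model pin, prompt template, decoding params,
# tool contracts, and library pins all live in one versioned artifact.
manifest = {
    "model_version": "example-model-v1",        # assumed identifier
    "prompt_template": "Summarize: {doc}",
    "temperature": 0.0,
    "tool_schemas": {"search": {"query": "string"}},
    "library_pins": {"vllm": "0.6.0"},
}
fp_a = run_fingerprint(manifest)

manifest["prompt_template"] = "Summarise: {doc}"  # one-character drift
fp_b = run_fingerprint(manifest)
print(fp_a != fp_b)  # the drift is visible before anything runs
```

Logging the fingerprint alongside every run makes "which version of the system produced this output" an answerable question, the same way a pinned environment did for retraining pipelines.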

Observability

In traditional ML, deploying a federated Prometheus environment at Teranet across OpenShift clusters made one thing clear: observability is not optional infrastructure. It is the difference between knowing your system is healthy and guessing. The metrics were relatively clean to instrument: request rate, latency, prediction distribution, error rate.

With LLMs, all of that is still necessary, plus a category of signals that did not exist before. Output quality signals, such as whether the model is completing tasks, whether tool calls are structured correctly, and whether response length is drifting, cannot be inferred from infrastructure metrics alone. A healthy Prometheus dashboard can coexist with an agent pipeline that has been silently producing malformed outputs for hours because a prompt assumption broke after a model update. The infrastructure skills transfer. The instrumentation targets change.
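A minimal sketch of that extra instrumentation layer: classify every raw tool-call output and emit counters, the way one might feed a Prometheus client. The metric names and required fields here are assumptions for illustration:

```python
import json
from collections import Counter

# Stand-in for a metrics client; in production these would be
# Prometheus counters scraped alongside the infrastructure metrics.
METRICS = Counter()

def record_tool_call(raw_output: str, required_keys=("tool", "arguments")):
    """Classify a model's tool-call output for quality metrics.
    Returns the parsed call on success, None on any quality failure."""
    METRICS["tool_calls_total"] += 1
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        METRICS["tool_calls_malformed_json"] += 1
        return None
    if not all(k in call for k in required_keys):
        METRICS["tool_calls_missing_fields"] += 1
        return None
    METRICS["tool_calls_ok"] += 1
    return call

record_tool_call('{"tool": "search", "arguments": {"q": "drift"}}')
record_tool_call('{"tool": "search"}')         # structurally incomplete
record_tool_call('I think we should search')   # not JSON at all
print(dict(METRICS))
```

The point is the ratio: `tool_calls_ok / tool_calls_total` is a health signal no container metric will ever surface, and an alert on it fires hours before anyone notices downstream.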

Cost Management

In traditional ML, GPU cost was mostly a provisioning problem. You sized clusters, managed utilization, and the cost profile was relatively predictable over a billing cycle. Surprises usually came from infrastructure decisions.

With LLMs, cost is often proportional to token volume per request, which pushes the problem upward into the application layer. A single misbehaving agent that generates unnecessarily long outputs, gets stuck in a retry loop, or calls a tool repeatedly can spike costs in ways that no infrastructure alarm catches, because from the infrastructure’s perspective everything looks normal. Token budgets, output length constraints, and model selection based on task complexity are infrastructure decisions now, not application afterthoughts.
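A token budget can be enforced as a small guard object at the application layer, so a runaway loop exhausts a bounded allowance instead of a billing cycle. The limits and per-token pricing below are made up for illustration:

```python
class TokenBudget:
    """Per-request token allowance with a running cost estimate.
    Numbers are illustrative, not real pricing."""

    def __init__(self, max_tokens: int, usd_per_1k: float):
        self.max_tokens = max_tokens
        self.usd_per_1k = usd_per_1k
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record usage; return False once the budget is exhausted,
        which is the signal to stop a runaway retry loop."""
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True

    @property
    def cost_usd(self) -> float:
        return self.used / 1000 * self.usd_per_1k

budget = TokenBudget(max_tokens=4000, usd_per_1k=0.01)
steps = 0
while budget.charge(900):  # a loop that would otherwise retry forever
    steps += 1
print(steps, round(budget.cost_usd, 4))
```

An infrastructure alarm never fires in this scenario because the cluster is doing exactly what it was asked to do; the guard has to live where the tokens are spent.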

Failure Propagation

In traditional ML, failures are loud. A preprocessing step errors out, a training run crashes, or a serving endpoint returns a 500. The failure propagates through code, producing traces you can actually follow.

With LLMs, a model can generate a plausible-looking but incorrect tool call. The tool executes, returns a result, and the agent continues reasoning on bad data, producing confident output several steps later that is wrong in a way that only makes sense when you trace back through the full execution chain. No exception was raised. Nothing in the infrastructure metrics changed. The failure propagated through reasoning steps, not code steps.

This is the sharpest way LLM systems amplify what was already difficult about traditional ML. The instinct to distrust silent success, built from years of watching pipelines complete without errors but produce wrong results, is exactly right. The layer where that distrust needs to be applied has shifted upward, from data and infrastructure into the model’s reasoning chain. Checkpoints, structured output validation, and human review gates are the equivalent of dead letter queues and retry logic in any serious async pipeline. The icp-collab framework was built with exactly this in mind.
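A structured-output checkpoint can be as simple as validating a tool call against a declared schema before executing it, so bad intermediate state is rejected instead of propagated. The schema format and tool name here are hypothetical:

```python
def validate_tool_call(call: dict, schema: dict) -> list:
    """Reject a structurally plausible but invalid tool call before it
    pollutes downstream reasoning. Returns a list of errors; empty
    means the call is safe to execute."""
    errors = []
    if call.get("tool") not in schema:
        errors.append(f"unknown tool: {call.get('tool')!r}")
        return errors
    expected = schema[call["tool"]]
    args = call.get("arguments", {})
    for name, typ in expected.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], typ):
            errors.append(f"bad type for {name}: {type(args[name]).__name__}")
    return errors

# Hypothetical tool contract: argument names mapped to expected types.
SCHEMA = {"get_balance": {"account_id": str}}

good = {"tool": "get_balance", "arguments": {"account_id": "acct-42"}}
bad = {"tool": "get_balance", "arguments": {"account_id": 42}}

print(validate_tool_call(good, SCHEMA))
print(validate_tool_call(bad, SCHEMA))
```

The second call would execute without error in most tool runtimes and return something, which is precisely how a reasoning chain ends up confidently wrong several steps later; the checkpoint turns a silent failure back into a loud one.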

LLM / Agent Layer (new failure surface)

- Prompt Versioning: Prompts drift silently across model updates and often fail without obvious traces.
- Token Budgets: Spend is driven by runtime behavior, not only by the shape of the infrastructure beneath it.
- Output Validation: Correctness cannot be inferred from uptime, latency, or container health alone.
- Agent State: Failures propagate through reasoning chains and tool interactions rather than through code paths.

Soft failures surface here, often with no exception raised; infrastructure signals can still look healthy even while application-layer correctness drifts.

Infrastructure Layer (classical ML foundation)

- Autoscaling: Queue depth, throughput, and resource saturation still determine whether production traffic behaves.
- Observability: Latency, error rate, and system health remain the baseline instrumentation every deployment needs.
- Reproducibility: Pinned environments, versioned data, and logged parameters remain the survival kit for rerunnable pipelines.
- Cost Governance: Cluster sizing, GPU utilization, and provisioning choices still set the floor for operating cost.

LLM systems do not replace the classical ML infrastructure layer. They add a new failure surface above it, where correctness, cost, and observability become application-level concerns.

What Transfers and What Does Not

Most foundational disciplines transfer cleanly: reproducible pipelines, proper observability, cost-aware capacity planning, async infrastructure designed for failure, and the general disposition that AI systems fail in non-obvious ways.

What does not transfer automatically is the mental model for where failures occur. Traditional ML engineers watch data quality, feature distributions, model drift, and infrastructure health. LLM systems need all of that plus application-layer correctness monitoring that the current tooling is still catching up to.

Engineers coming to LLMs from web development backgrounds are often rediscovering problems the ML community worked through years ago, which is, in its own way, a reminder that fundamentals matter more than the current generation of tooling.

© 2026 Muntazir Fadhel. All rights reserved.