Latency is an Economic Constraint, not a Performance Target
Latency is defined by the product workflow you choose, not by the models or infrastructure underneath. Once a response time is set, latency becomes an economic constraint that determines which cost controls are available and, by extension, your margin ceiling.
| Latency Windows | Hours (Async) | Minutes (Trap) | Seconds (Sync) |
|---|---|---|---|
| Examples | GitHub Workspace, Replit, Deep Research | Midjourney, Suno, NotebookLM, Reasoning | ChatGPT, Perplexity, Customer Support |
| Time Shift Compute (Can this workload happen later?) | Off-peak execution | No meaningful shifting | Impossible. Any delay is failure |
| Batching (Can multiple requests be combined?) | Large batches smooth variance | Possible only by delaying the response | Not viable. Batch size = 1 |
| Workload Bounding (Can I cap how expensive a request can be?) | Complex requests can be deferred, split, or rejected | Soft limits can be implemented. High risk of waste | Not tolerated. The system must resolve |
| Failure Tolerance (Can expensive cases be dropped?) | Expensive or failing cases can be dropped or retried later | Only explicit, user-visible abandonment | Not tolerated. Errors break the workflow |
| Structural Gross Margin Ceiling | Deep Tech Profile ~55-70% (1) | Services Profile ~30-50% (2) | Services Profile ~30-50% (2) |
(1) In async workflows, inference is decoupled from user wait time, so compute can be scheduled. That improves compute economics: costs can be amortized through batching, utilization smoothing, and discount pricing.
(2) In minutes and seconds workflows, inference is purchased on-demand and executed synchronously. Utilization gaps, retries, and abandoned executions create unavoidable waste, capping margins regardless of model efficiency.
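The arithmetic behind notes (1) and (2) can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function `gross_margin`, its parameters, and every input number below are hypothetical, chosen only to show how a scheduling discount and a waste fraction move the margin. The ranges in the table are the only figures taken from the text above.

```python
def gross_margin(price: float, on_demand_cost: float,
                 discount: float = 0.0, waste: float = 0.0) -> float:
    """Gross margin per request, as a fraction of price.

    price           revenue per request (hypothetical)
    on_demand_cost  on-demand inference cost of the useful work (hypothetical)
    discount        fraction saved by schedulable compute: batching, off-peak
    waste           fraction of purchased compute that yields no billable
                    output: idle capacity, retries, abandoned executions
    """
    if not 0.0 <= waste < 1.0:
        raise ValueError("waste must be in [0, 1)")
    # Discounts lower the unit cost; waste inflates it, because the useful
    # work has to pay for the compute that produced nothing.
    effective_cost = on_demand_cost * (1.0 - discount) / (1.0 - waste)
    return (price - effective_cost) / price


# Illustrative inputs only (assumptions, not data from the article).
async_margin = gross_margin(price=1.00, on_demand_cost=0.50,
                            discount=0.30, waste=0.05)
sync_margin = gross_margin(price=1.00, on_demand_cost=0.50,
                           discount=0.00, waste=0.20)

print(f"async: {async_margin:.1%}")
print(f"sync:  {sync_margin:.1%}")
```

With these made-up inputs the async case comes out at roughly 63% and the sync case at roughly 37.5%. The useful work costs the same in both cases. The gap comes entirely from the two levers that latency either grants or removes.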
- Latency Sets the Margin Ceiling
Margin ceilings are determined by response time expectations, not by scale, model quality, or infrastructure. Immediacy removes leverage by eliminating the ability to batch, defer, or average costs.
- The Economic Cliff Is Structural, Not Incremental
There is a sharp break between Hours and Minutes. The difference between Minutes and Seconds is marginal because both require immediate resolution and lose access to background cost controls.
- Workflow Choice Pre-Determines Unit Economics
You cannot tune latency. It is imposed by the product you are building. You cannot optimize a chat interface into an async margin. Unit economics are set when the product is conceived, not when infrastructure decisions are made, and they cannot be optimized afterward.
You cannot fix unit economics with infrastructure after response time expectations are set. Latency windows of hours, minutes, and seconds each define a different business, with a different margin ceiling and an irreversible cost structure.