KotaML

Latency Is an Economic Constraint, Not a Performance Target

Latency is defined by the product workflow you choose, not by the models or infrastructure underneath. Once a response-time expectation is set, latency becomes an economic constraint that determines which cost controls are available and, by extension, your margin ceiling.

Latency Windows
| | Hours (Async) | Minutes (Trap) | Seconds (Sync) |
|---|---|---|---|
| Examples | GitHub Workspace, Replit, Deep Research | Midjourney, Suno, NotebookLM, Reasoning | ChatGPT, Perplexity, Customer Support |
| Time Shift Compute (Can this workload happen later?) | Off-peak execution | No meaningful shifting | Impossible. Any delay is failure. |
| Batching (Can multiple requests be combined?) | Large batching smooths variance | Only exists by delaying resolution of response | Not viable. Batch size = 1 |
| Workload Bounding (Can I cap how expensive a request can be?) | Complex requests can be deferred, split or rejected | Soft limits can be implemented. High risk of waste. | Not tolerated. The system must resolve. |
| Failure Tolerance (Can expensive cases be dropped?) | Expensive or failing cases can be dropped or retried later | Only explicit user-visible abandonment | Not tolerated. Errors break the workflow. |
| Structural Gross Margin Ceiling | Deep Tech Profile ~55-70% (1) | Services Profile ~30-50% (2) | Services Profile ~30-50% (2) |

(1) In async workflows, inference is decoupled from user wait time, so compute can be scheduled rather than served on demand. Schedulable compute has better economics: costs can be amortized through batching, utilization, smoothing, and discount pricing.

(2) In Minutes and Seconds workflows, inference is purchased on demand and executed synchronously. Utilization gaps, retries, and abandoned executions create unavoidable waste, capping margins regardless of model efficiency.
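
The comparison above can also be read as a lookup: given a latency window, which cost controls are still on the table? The sketch below encodes that reading in Python. The three-level grading (FULL, PARTIAL, NONE) and every identifier in it are assumptions made for illustration; the source describes each cell in words and does not grade them.

```python
from enum import Enum


class Availability(Enum):
    """Hypothetical grading of how usable a cost control is."""
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


# One entry per latency window, one key per cost-control row.
# The grades are an interpretation of the cell text, not values from the source.
COST_CONTROLS = {
    "hours": {
        "time_shift": Availability.FULL,           # off-peak execution
        "batching": Availability.FULL,             # large batching smooths variance
        "workload_bounding": Availability.FULL,    # defer, split or reject
        "failure_tolerance": Availability.FULL,    # drop or retry later
    },
    "minutes": {
        "time_shift": Availability.NONE,           # no meaningful shifting
        "batching": Availability.PARTIAL,          # only by delaying the response
        "workload_bounding": Availability.PARTIAL, # soft limits, high risk of waste
        "failure_tolerance": Availability.PARTIAL, # user-visible abandonment only
    },
    "seconds": {
        "time_shift": Availability.NONE,           # any delay is failure
        "batching": Availability.NONE,             # batch size = 1
        "workload_bounding": Availability.NONE,    # the system must resolve
        "failure_tolerance": Availability.NONE,    # errors break the workflow
    },
}


def usable_controls(window: str) -> list[str]:
    """Return the controls that are at least partially available for a window."""
    controls = COST_CONTROLS[window]
    return [name for name, level in controls.items()
            if level is not Availability.NONE]


if __name__ == "__main__":
    for window in COST_CONTROLS:
        print(window, usable_controls(window))
```

Under this grading, the Hours window keeps all four controls, the Minutes window keeps three in weakened form, and the Seconds window keeps none.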

  • Latency Sets the Margin Ceiling
    Margin ceilings are determined by response-time expectations, not by scale, model quality, or infrastructure. Immediacy removes leverage by eliminating the ability to batch, defer, or average costs.
  • The Economic Cliff Is Structural, Not Incremental
    There is a sharp break between Hours and Minutes. The difference between Minutes and Seconds is marginal because both require immediate resolution and lose access to background cost controls.
  • Workflow Choice Pre-Determines Unit Economics
    You cannot tune latency; it is imposed by the product you are building. You cannot optimize a chat interface into an async margin. Unit economics are set at the inception stage, not by later infrastructure decisions, and cannot be optimized after the fact.
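
To make the points above concrete, here is a minimal arithmetic sketch of how the same model, at the same raw compute cost, can land at different gross margins depending on the latency window. The formula is a simplifying assumption rather than the author's model, and every input number is hypothetical, chosen only to illustrate the direction of the effect.

```python
def gross_margin(price, raw_cost, utilization, waste_rate, discount=0.0):
    """Gross margin per request under a simplified cost model (assumption).

    price       -- revenue per request
    raw_cost    -- list-price compute cost per request at 100% utilization
    utilization -- fraction of purchased capacity doing billable work (0..1]
    waste_rate  -- extra compute burned on retries and abandoned runs
    discount    -- price reduction from schedulable or off-peak compute
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    paid = raw_cost * (1 - discount)
    # Idle capacity and wasted executions both inflate the cost of each
    # request that actually earns revenue.
    effective_cost = paid / utilization * (1 + waste_rate)
    return 1 - effective_cost / price


# Hypothetical inputs: identical price and raw cost, different workflow.
ASYNC_WORKFLOW = dict(price=1.00, raw_cost=0.40,
                      utilization=0.90, waste_rate=0.05, discount=0.30)
SYNC_WORKFLOW = dict(price=1.00, raw_cost=0.40,
                     utilization=0.70, waste_rate=0.10, discount=0.0)

async_margin = gross_margin(**ASYNC_WORKFLOW)
sync_margin = gross_margin(**SYNC_WORKFLOW)

if __name__ == "__main__":
    print(f"async: {async_margin:.1%}  sync: {sync_margin:.1%}")
```

In this sketch, the gap between the two results comes entirely from utilization, waste, and discount, which are the levers that immediacy takes away. Model efficiency (raw_cost) is held constant.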

You cannot fix unit economics with infrastructure after response-time expectations are set. Latencies of hours, minutes, and seconds each define a different business, with a different margin ceiling and an irreversible cost structure.