
Latency is an Economic Constraint, not a Performance Target

Latency is defined by the product workflow you choose, not by the models or infrastructure underneath. Once a response time is set, latency becomes an economic constraint that determines which cost controls are available and, by extension, your margin ceiling.

Latency Windows
	Hours (Async)	Minutes (Trap)	Seconds (Sync)
Examples	GitHub Workspace, Replit, Deep Research	Midjourney, Suno, NotebookLM, Reasoning	ChatGPT, Perplexity, Customer Support
Time Shift Compute (Can this workload happen later?)	Off-peak execution	No meaningful shifting	Impossible. Any delay is failure
Batching (Can multiple requests be combined?)	Large batching smooths variance	Possible only by delaying the response	Not viable. Batch size = 1
Workload Bounding (Can I cap how expensive a request can be?)	Complex requests can be deferred, split or rejected	Soft limits can be implemented. High risk of waste.	Not tolerated. The system must resolve every request
Failure Tolerance (Can expensive cases be dropped?)	Expensive or failing cases can be dropped or retried later	Only explicit user-visible abandonment	Not tolerated. Errors break the workflow
Structural Gross Margin Ceiling	Deep Tech Profile ~55-70% (1)	Services Profile ~30-50% (2)	Services Profile ~30-50% (2)

Illustrative ranges, not benchmarked targets.

(1) In async workflows, inference is decoupled from user wait time, so compute can be scheduled. This gives better compute economics: costs can be amortized through batching, utilization smoothing and discount pricing.

(2) In minutes and seconds workflows, inference is purchased on-demand and executed synchronously. Utilization gaps, retries, and abandoned executions create unavoidable waste, capping margins regardless of model efficiency.
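The mechanism in notes (1) and (2) can be made concrete with a small arithmetic sketch. It is a simplified model, not a pricing formula from this framework: the `Workload` fields, the cost formula, and every number below are hypothetical assumptions chosen only to show the direction of the effect. Idle capacity and wasted executions inflate the cost of each delivered request, while schedulable compute can offset that with a discount.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Per-request economics for one latency window. All fields are hypothetical."""
    price: float            # revenue per delivered request
    list_cost: float        # on-demand compute cost of one execution
    utilization: float      # share of paid capacity doing useful work, in (0, 1]
    waste_rate: float       # share of executions retried or abandoned, in [0, 1)
    discount: float = 0.0   # price cut for schedulable compute, in [0, 1)


def effective_cost(w: Workload) -> float:
    """Compute cost per delivered request after idle time and waste."""
    paid_rate = w.list_cost * (1.0 - w.discount)
    # Idle capacity and wasted executions both shrink the useful output
    # that each unit of paid compute produces.
    return paid_rate / (w.utilization * (1.0 - w.waste_rate))


def gross_margin(w: Workload) -> float:
    """Gross margin counting inference cost only (an assumption of this sketch)."""
    return 1.0 - effective_cost(w) / w.price


# Same price and same list cost; only the latency window differs.
# These inputs are invented for illustration, not measured values.
hours_window = Workload(price=1.00, list_cost=0.40,
                        utilization=0.90, waste_rate=0.05, discount=0.50)
seconds_window = Workload(price=1.00, list_cost=0.40,
                          utilization=0.60, waste_rate=0.10)

for name, w in [("hours", hours_window), ("seconds", seconds_window)]:
    print(f"{name}: cost per delivered request {effective_cost(w):.3f}, "
          f"gross margin {gross_margin(w):.1%}")
```

With identical model cost and price, the gap between the two results comes entirely from utilization, waste and discount, which are the levers the latency window grants or removes.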

Latency Sets the Margin Ceiling
Margin ceilings are determined by response time expectations, not by scale, model quality or infrastructure. Immediacy removes leverage by eliminating the ability to batch, defer or average costs.
The Economic Cliff Is Structural, Not Incremental
There is a sharp break between Hours and Minutes. The difference between Minutes and Seconds is marginal because both require immediate resolution and lose access to background cost controls.
Workflow Choice Pre-Determines Unit Economics
You cannot tune latency. It is imposed by the product you are building, and you cannot optimize a chat interface into an async margin. Unit economics are set at inception by the choice of workflow, not by later infrastructure decisions.

You cannot fix unit economics with infrastructure after response time expectations are set. Hours, minutes and seconds of latency each define a different business, with its own margin ceiling and irreversible cost structure.