Token Robin Hood
OpenAI · Apr 22, 2026 · 6 min

OpenAI adds WebSocket mode to the Responses API: faster agent loops are now a runtime advantage

OpenAI's April 22 engineering post matters because it moves the conversation beyond model IQ and token pricing. The company is saying that once inference gets fast enough, agent products win or lose on transport, cached state, and how little redundant work they force through the loop.

What happened: On April 22, 2026, OpenAI said WebSocket mode made Responses API agent loops 40% faster end-to-end by keeping a persistent connection and reusing previous response state.
Why builders care: Repeated validation, tokenization, routing, and history rebuilds are now a visible product tax on coding agents and tool-using workflows.
TRH action: Profile your agent loop by stage and cut repeat work before chasing a bigger model budget.

What actually changed

OpenAI describes the old bottleneck clearly. A Codex-style bug-fix task can require dozens of round trips: decide the next action, call a tool, send the tool result back, then repeat. That overhead was easier to ignore when models generated around 65 tokens per second. It became much harder to hide once OpenAI pushed GPT-5.3-Codex-Spark toward 1,000 tokens per second.

The fix was not a new prompt trick. It was a transport change. OpenAI kept a persistent WebSocket connection alive, cached reusable response state in memory, and let follow-up requests continue through previous_response_id instead of rebuilding the whole conversation every time.
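The arithmetic behind that change is easy to sketch. The toy model below is not the SDK; the two functions are invented for illustration. It treats per-turn work as the number of input items the platform has to reprocess: resending the full history every turn grows quadratically with loop length, while continuing from previous_response_id keeps it linear.

```python
# Toy model of per-turn server work in an agent loop.
# "Work" = number of input items the platform must reprocess on a turn.

def stateless_loop_work(turns: int) -> int:
    # Without cached state, turn k resends and reprocesses all k items
    # of history, so total work is 1 + 2 + ... + turns.
    return sum(range(1, turns + 1))

def cached_loop_work(turns: int) -> int:
    # With a persistent connection and previous_response_id, the server
    # restores cached state and processes only the one new item per turn.
    return turns

# A 40-round-trip bug-fix task: 820 items reprocessed versus 40.
print(stateless_loop_work(40), cached_loop_work(40))
```

The gap widens with every extra round trip, which is why the overhead was easy to ignore at 65 tokens per second and impossible to ignore at 1,000.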

Why this is bigger than one API feature

This is an important builder signal because it makes agent speed a systems problem. OpenAI says the WebSocket version reuses prior input and output items, tool definitions, namespaces, and rendered tokens. It also lets the platform process only new input for some validators and safety checks instead of reprocessing the full history on every turn.
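The "process only new input" idea can be sketched with a per-conversation watermark. Everything here is hypothetical, not OpenAI's implementation: the DeltaValidator name and the string-item check are stand-ins for whatever validation and safety passes actually run.

```python
class DeltaValidator:
    """Validate only items appended since the last checked watermark,
    instead of re-validating the full history on every turn."""

    def __init__(self):
        self.checked = 0  # watermark: number of items already validated

    def validate(self, items: list) -> int:
        # Slice off only the items added since the last call.
        new = items[self.checked:]
        for item in new:
            if not isinstance(item, str) or not item:
                raise ValueError(f"bad item: {item!r}")
        self.checked = len(items)
        return len(new)  # units of validation work done this turn
```

With this shape, a 50-turn loop does 50 units of validation work instead of roughly 1,275, because each turn checks one new item rather than the whole transcript.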

That is exactly where many agent products leak time and money. The visible invoice says "tokens." The hidden bill shows up as repeated context shaping, repeated validation, extra API handshakes, and slow tool-result handoffs. Faster models expose those mistakes.

What the launch results mean

OpenAI says alpha users saw up to 40% workflow improvements and that Codex moved most of its Responses API traffic onto WebSocket mode. The company also says Vercel, Cline, and Cursor reported material latency gains after integrating it. The practical takeaway is simple: runtime plumbing is now part of the competitive surface for coding agents.

For TRH readers, this is the same lesson behind our earlier pieces on why agentic AI feels expensive and on runtime design for production agents. If every tool turn rebuilds too much state, your users will feel the drag before they notice the model got smarter.

What builders should do next

Measure one real agent workflow and split latency into four buckets: model inference, API overhead, client-side tool time, and post-processing. If the same history or tool schema is being revalidated on every turn, fix that first.
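One way to get those four buckets is a small timing wrapper around each stage of the loop. This is a sketch, not a library API; LoopProfiler and the stage names are invented here, and the bucket names match the four categories above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LoopProfiler:
    """Accumulate wall-clock time per stage of an agent loop."""

    def __init__(self):
        self.buckets = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        # Time everything inside the `with` block and credit it to `name`.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.buckets[name] += time.perf_counter() - t0

    def report(self) -> dict:
        # Fraction of total loop time spent in each stage.
        total = sum(self.buckets.values()) or 1.0
        return {k: round(v / total, 3) for k, v in sorted(self.buckets.items())}
```

In a real loop you would wrap the SDK call in `stage("model_inference")`, the request setup in `stage("api_overhead")`, tool execution in `stage("tool_time")`, and result handling in `stage("post_processing")`, then read the report after a full task.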

Then make three architecture checks. Keep conversation state incremental where possible. Separate tool execution latency from model latency in your dashboards. And decide where persistent connections make sense instead of defaulting to stateless request chains for long-running loops.
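The first of those checks, incremental conversation state, can be sketched as a client-side tracker that hands each request only the delta plus a continuation id. The class and method names are hypothetical; previous_response_id is the real Responses API parameter the article describes, but this wrapper is an illustration, not SDK code.

```python
class IncrementalConversation:
    """Track conversation items client-side and build each request from
    only the new items, continuing server-side state by response id."""

    def __init__(self):
        self.items = []
        self.last_response_id = None
        self._sent = 0  # how many items have already been shipped

    def add(self, item):
        self.items.append(item)

    def next_request(self) -> dict:
        # Send only the delta; the server restores the rest from cache.
        delta = self.items[self._sent:]
        self._sent = len(self.items)
        return {"previous_response_id": self.last_response_id, "input": delta}

    def record_response(self, response_id: str):
        self.last_response_id = response_id
```

The design choice worth copying is the `_sent` watermark: the loop never rebuilds or re-serializes history, so adding a tool result costs the same on turn 40 as on turn 1.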

The point is not that every agent needs WebSockets tomorrow. The point is that transport and state reuse now directly shape user-perceived intelligence. When inference accelerates, waste in the loop becomes the product.
