Fix: race condition in update_chat_ctx deletes server-created function calls #4960
StianHanssen wants to merge 4 commits into livekit:main
Conversation
```python
await utils.aio.cancel_and_wait(exe_task)
_notify_fc_processed(function_calls)
```
🟡 Incomplete cleanup of _inflight_fc_ids on interruption leaks stale entries that permanently block deletion
When a speech is interrupted, `_notify_fc_processed(function_calls)` is called at line 2683 to release all in-flight function call IDs. However, `function_calls` is populated by the `_read_fnc_stream` task, which was already cancelled at line 2632 (`await utils.aio.cancel_and_wait(*tasks)`). This means `function_calls` may be incomplete: it might not include function calls whose IDs were already added to `_inflight_fc_ids` by the server's `conversation.item.added` event.
Root Cause: timing gap between server event and stream reader
The _inflight_fc_ids set is populated in _handle_conversion_item_added (realtime_model.py:1510-1511) when the server sends conversation.item.added (step 3 in the event sequence). However, a function call only enters function_calls after response.output_item.done (step 8) pushes it to function_ch, and then _read_fnc_stream reads it from the tee.
Two scenarios cause a leak:

1. **Response cancelled before `response.output_item.done`**: The server creates the `function_call` item (step 3) but cancels the response before step 8. The function call never reaches `function_ch`, so it's never in `function_calls`.
2. **`_read_fnc_stream` cancelled before reading buffered items**: Even if step 8 fires and the item is in the tee buffer, `_read_fnc_stream` is cancelled at agent_activity.py:2632 before reading it.
In both cases, the stale ID remains in _inflight_fc_ids indefinitely. Every future update_chat_ctx call sees the function_call in _remote_chat_ctx but not in local context, yet the _inflight_fc_ids guard (realtime_model.py:1189-1190) prevents deletion. The item permanently occupies space in the server's conversation context window.
Impact: After an interruption during function call generation, orphaned function_call items can accumulate in the remote context and can never be cleaned up by update_chat_ctx (e.g., during summarization), causing gradual context window waste.
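The timing gap above can be reduced to a minimal model. Everything below is an illustrative stand-in for the real session internals — the variable names and handler functions are simplified, not the actual plugin code:

```python
# Minimal model of the leak: the inflight set is filled by the server
# event handler, but the interrupted-path cleanup only sees IDs that
# the (already cancelled) stream reader managed to consume.
inflight_fc_ids: set[str] = set()   # filled on conversation.item.added (step 3)
function_calls: list[str] = []      # filled later by the stream reader (step 8+)

def on_conversation_item_added(item_id: str) -> None:
    # Step 3: server creates the function_call item.
    inflight_fc_ids.add(item_id)

def on_stream_read(item_id: str) -> None:
    # Step 8 + tee read: only now does the ID become visible
    # to the interrupted-path cleanup.
    function_calls.append(item_id)

def notify_fc_processed(ids: list[str]) -> None:
    # Interrupted-path cleanup: releases only the IDs it was given.
    for item_id in ids:
        inflight_fc_ids.discard(item_id)

# Interruption lands between step 3 and step 8:
on_conversation_item_added("fc_123")
# ...the stream reader is cancelled before on_stream_read("fc_123") runs...
notify_fc_processed(function_calls)  # function_calls is still empty
# "fc_123" now lingers in inflight_fc_ids, permanently blocking deletion.
```

Running this confirms the stale entry: after `notify_fc_processed`, `"fc_123"` is still in `inflight_fc_ids`.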
Prompt for agents
In livekit-agents/livekit/agents/voice/agent_activity.py, at lines 2681-2684 (the interrupted return path), the _notify_fc_processed(function_calls) call uses an incomplete list because _read_fnc_stream was already cancelled. To fix this, instead of relying on function_calls (which comes from the cancelled stream reader), the cleanup should directly clear all inflight IDs for this generation. One approach: add a method to the DualChatContextSyncSession protocol (and its implementations in realtime_model.py and realtime_model_beta.py) like clear_all_inflight_fcs() that does self._inflight_fc_ids.clear(), and call it in the interrupted path. Alternatively, track which response's function calls are inflight (keyed by response_id) so only the relevant IDs are cleared. The simplest safe fix is to call a method that clears all inflight IDs when the generation is interrupted, since no other generation should be active concurrently.
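The "clear all inflight IDs" approach from the prompt can be sketched as follows. `FakeRealtimeSession` is a hypothetical stand-in; only the protocol name `DualChatContextSyncSession` and the proposed method name `clear_all_inflight_fcs()` come from the discussion above:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DualChatContextSyncSession(Protocol):
    # Protocol name from this PR; the method is the proposed addition.
    def clear_all_inflight_fcs(self) -> None: ...

class FakeRealtimeSession:
    """Illustrative stand-in for the OpenAI realtime session."""

    def __init__(self) -> None:
        self._inflight_fc_ids: set[str] = set()

    def clear_all_inflight_fcs(self) -> None:
        # Safe in the interrupted path: no other generation runs
        # concurrently, so every inflight ID belongs to the one
        # that was just cancelled.
        self._inflight_fc_ids.clear()

session = FakeRealtimeSession()
session._inflight_fc_ids.add("fc_123")

# The interrupted path would call this instead of relying on the
# incomplete function_calls list from the cancelled stream reader:
if isinstance(session, DualChatContextSyncSession):
    session.clear_all_inflight_fcs()
```

The `runtime_checkable` decorator lets the agent side test for the capability with `isinstance` without importing the plugin's concrete session class.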
Ah, yes, function_calls can be incomplete if _read_fnc_stream was cancelled before consuming all items from the stream. However, this is expected and covered by a secondary clean-up path: _handle_conversion_item_deleted in realtime_model.py calls self._inflight_fc_ids.discard(event.item_id) whenever the server deletes an item. During interruption, the server typically deletes/truncates these items, which triggers the clean-up.
In the worst case (server retains the item), the ID lingers in _inflight_fc_ids, which prevents update_chat_ctx from deleting it, but I believe that is the correct behaviour, since the item legitimately exists on the server and deleting it would cause the cascade corruption this fix prevents.
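The secondary clean-up path described above amounts to one `discard` in the delete handler. A minimal sketch (handler name taken from the thread; the real handler receives an event object, simplified here to the item ID):

```python
class Session:
    """Illustrative stand-in for the OpenAI realtime session."""

    def __init__(self) -> None:
        self._inflight_fc_ids: set[str] = set()

    def _handle_conversion_item_deleted(self, item_id: str) -> None:
        # Server deleted/truncated the item (typical during interruption):
        # drop it from the inflight set so future update_chat_ctx calls
        # are no longer blocked from reconciling it.
        self._inflight_fc_ids.discard(item_id)

session = Session()
session._inflight_fc_ids.add("fc_123")
session._handle_conversion_item_deleted("fc_123")  # server-side delete event
```

Because `discard` is a no-op for missing keys, this path is safe regardless of whether the interrupted-path cleanup already released the ID.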
thanks for the investigation on this issue. the core issue is that callers base their modifications on […]. The flow would be: […]
@longcw Thank you for your thoughts!
In your approach, when updating from v1 to v2, how do you know which items to change to make v2? How do you know if a FunctionCall is inflight or intentionally deleted from v1?
I see your point, that approach only protects function calls that arrived after the snapshot.
Summary
`update_chat_ctx` can delete in-flight `function_call` items from the OpenAI Realtime server, causing cascading `"failed to insert item: previous_item_id not found"` corruption of `_remote_chat_ctx`.

The root cause is a timing gap between two context-tracking structures:

- `_remote_chat_ctx`: updated immediately when the server sends `conversation.item.added`
- `_agent._chat_ctx`: updated later, only when tool execution starts (`_tool_execution_started_cb`)

If `update_chat_ctx` runs during this window (e.g. from summarization), the diff sees the `function_call` in remote but not in local, treats it as intentionally removed, and sends a delete event. The `_is_content_empty` guard only protects `message` items; `function_call` items pass through unconditionally.

I created a unit test gist replicating the exact pipeline that demonstrates how `update_chat_ctx` deletes in-flight `function_call` items. Note the "test" passes when the failure scenario happens.

Fix
Track in-flight function calls with an `_inflight_fc_ids` set on the OpenAI `RealtimeSession`. Items enter the set when the server creates them and leave when the agent framework acknowledges them (tool execution starts, is rejected, or is interrupted). `_create_update_chat_ctx_events` skips deletion for any item still in the set.

A `runtime_checkable` `Protocol` (`DualChatContextSyncSession`) bridges the signalling between `livekit-agents` (where tool execution happens) and `livekit-plugins-openai` (where the set lives), without modifying the abstract `RealtimeSession` base class.

Future consideration
This fix only tracks `function_call` items. `function_call_output` items are currently client-initiated (`manual_function_calls=True`), so they enter `_agent._chat_ctx` before `_remote_chat_ctx` and are not vulnerable to this race. If `auto_tool_reply_generation` is enabled in a future configuration (server-generated outputs), a guard should be added to cover `function_call_output` items as well.
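Should that day come, the guard could generalize to any server-created item type whose local acknowledgement lags behind `conversation.item.added`. A hypothetical sketch — none of these names exist in this PR; the real deletion logic is built from a chat-context diff, condensed here to a predicate:

```python
# Hypothetical generalization: track every guarded item type in one set
# and let the delete path consult it before emitting a delete event.
GUARDED_TYPES = {"function_call", "function_call_output"}

inflight_item_ids: set[str] = set()

def on_item_added(item_id: str, item_type: str) -> None:
    # Mirrors conversation.item.added: remember server-created items
    # the framework has not yet acknowledged locally.
    if item_type in GUARDED_TYPES:
        inflight_item_ids.add(item_id)

def may_delete(item_id: str) -> bool:
    # update_chat_ctx must skip items still awaiting acknowledgement;
    # everything else can be reconciled normally.
    return item_id not in inflight_item_ids

on_item_added("fco_1", "function_call_output")
# "fco_1" is now protected; an unrelated item is not.
```

This keeps the single-set design of the current fix while covering server-generated outputs with no change to the diff logic itself.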