feat: RuntimeState event bus integration with checkpoint/resume #5241
Merged
greysonlalonde merged 74 commits into main on Apr 6, 2026
…resume via kickoff()
…ert TokenProcess to BaseModel
…ider pattern
- Move runtime_state.py to state/runtime.py
- Add acheckpoint async method using aiofiles
- Introduce BaseProvider protocol and JsonProvider for pluggable storage
- Add aiofiles dependency to crewai package
- Use PrivateAttr for provider on RootModel
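The pluggable-storage idea in this commit can be sketched with a structural protocol and one JSON-file implementation. This is a minimal, standard-library-only illustration, not the PR's code: StorageProvider, JsonFileProvider, the save/load method names, and the timestamp format are all assumptions, and the sketch is synchronous, whereas the commit adds an async acheckpoint built on aiofiles.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Protocol


class StorageProvider(Protocol):
    """Hypothetical stand-in for the PR's BaseProvider protocol."""

    def save(self, payload: str, directory: str) -> str:
        """Persist a serialized snapshot and return where it was written."""
        ...

    def load(self, location: str) -> str:
        """Return the serialized snapshot stored at `location`."""
        ...


class JsonFileProvider:
    """Hypothetical stand-in for JsonProvider: one timestamped file per snapshot."""

    def save(self, payload: str, directory: str) -> str:
        target_dir = Path(directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        # A microsecond-resolution UTC timestamp makes name collisions unlikely.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S_%f")
        path = target_dir / f"{stamp}.json"
        path.write_text(payload, encoding="utf-8")
        return str(path)

    def load(self, location: str) -> str:
        return Path(location).read_text(encoding="utf-8")


def checkpoint(state: dict, directory: str, provider: StorageProvider) -> str:
    """Serialize `state` to JSON and hand it to whichever provider is plugged in."""
    return provider.save(json.dumps(state), directory)


def demo_roundtrip() -> dict:
    """Write one snapshot to a temporary directory and read it back."""
    provider = JsonFileProvider()
    with tempfile.TemporaryDirectory() as tmp:
        location = checkpoint({"completed_tasks": ["research"]}, tmp, provider)
        return json.loads(provider.load(location))
```

Because the protocol is structural, any object with matching save and load methods can be swapped in without inheriting from a base class.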
Set task_id and task_name in _set_task_fingerprint so events carry task identity through serialization. Use task_id to find the correct task_started event when restoring the scope stack on checkpoint resume.
- Add Literal type discriminators to all 119 event subclasses
- Add BeforeValidator + PlainSerializer on EventNode.event to deserialize events into the correct subclass using a type registry
- Fall back to BaseEvent for unrecognized or incomplete event dicts
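The registry-plus-fallback behavior can be illustrated without Pydantic. The sketch below swaps in plain dataclasses for the BeforeValidator/PlainSerializer machinery the commit describes; the event classes, their fields, and the event_from_dict helper are hypothetical, and "incomplete" is modeled only as a dict with no type key.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class BaseEvent:
    type: str = "base_event"


@dataclass
class TaskStartedEvent(BaseEvent):
    type: str = "task_started"
    task_id: str = ""


@dataclass
class TaskCompletedEvent(BaseEvent):
    type: str = "task_completed"
    task_id: str = ""
    output: str = ""


# Registry keyed by each subclass's discriminator value.
EVENT_TYPES = {cls.type: cls for cls in (TaskStartedEvent, TaskCompletedEvent)}


def event_from_dict(data: dict) -> BaseEvent:
    """Rebuild the concrete event class named by data["type"].

    A missing or unknown discriminator falls back to BaseEvent.
    """
    cls = EVENT_TYPES.get(data.get("type", ""))
    if cls is None:
        return BaseEvent(type=data.get("type", "base_event"))
    known = {f.name for f in fields(cls)}
    # Ignore keys the target class does not declare.
    return cls(**{key: value for key, value in data.items() if key in known})


# Round trip: serialize a concrete event, then recover the same subclass.
original = TaskCompletedEvent(task_id="t1", output="done")
restored = event_from_dict(asdict(original))
```

Without the discriminator, a deserializer only sees a generic dict and cannot tell which of many event subclasses produced it, which is why every subclass carries its own type value.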
Use BeforeValidator/PlainSerializer with create_model_from_schema to serialize type[BaseModel] as its JSON schema dict and reconstruct a dynamic model on deserialization.
# Conflicts:
# lib/crewai/src/crewai/llms/providers/openai/completion.py
# lib/devtools/pyproject.toml
# uv.lock
BaseTool could not serialize to JSON because args_schema (a class reference) and cache_function (a lambda) are not JSON-serializable. This caused checkpointing to crash for any crew with tools.
- Add PlainSerializer to args_schema so it round-trips via JSON schema
- Replace default cache_function lambda with named _default_cache_function and type it as SerializableCallable so it serializes to a dotted path
- Add computed_field tool_type that stores the fully qualified class name
- Add restore_tool_from_dict to reconstruct the concrete subclass from checkpoint dicts, pre-resolving callback strings to callables
- Update BaseAgent.validate_tools and Task._restore_tools_from_checkpoint to handle dict inputs from checkpoint deserialization
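Why a named function serializes where a lambda cannot is easiest to see in a small dotted-path round trip. This is a sketch of the general technique using only importlib; callable_to_path and path_to_callable are hypothetical helper names, not the PR's SerializableCallable implementation, and the sketch handles top-level functions only.

```python
import importlib
import json
from typing import Any, Callable


def callable_to_path(func: Callable[..., Any]) -> str:
    """Serialize a top-level function as "module.name"."""
    qualname = func.__qualname__
    # Lambdas and nested functions have no importable name to point at.
    if "<lambda>" in qualname or "." in qualname:
        raise ValueError(f"{qualname!r} cannot be referenced by a dotted path")
    return f"{func.__module__}.{qualname}"


def path_to_callable(path: str) -> Callable[..., Any]:
    """Resolve "module.name" back to the function object."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Round trip with a standard-library function.
path = callable_to_path(json.dumps)
restored = path_to_callable(path)
```

A lambda passed to callable_to_path raises ValueError here, which mirrors the motivation for replacing the default cache_function lambda with a named function.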
BaseTool could not serialize to JSON because args_schema (a class reference) and cache_function (a lambda) are not JSON-serializable. This caused checkpointing to crash for any crew with tools.
- Add PlainSerializer to args_schema so it round-trips via JSON schema
- Replace default cache_function lambda with named _default_cache_function and type it as SerializableCallable
- Add computed_field tool_type storing the fully qualified class name
- Add __init_subclass__ registry and __get_pydantic_core_schema__ on BaseTool so any list[BaseTool] field automatically dispatches to the concrete subclass during deserialization via tool_type lookup
- No changes needed to BaseAgent.validate_tools or Task; Pydantic handles it natively through the custom core schema
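The registry half of this approach can be shown in plain Python. The sketch below covers only the __init_subclass__ registration and tool_type lookup; it omits the __get_pydantic_core_schema__ hook entirely, and the to_dict/from_dict methods and the SearchTool class are hypothetical stand-ins for what Pydantic would do during validation.

```python
from typing import Any, ClassVar, Dict


class BaseTool:
    # Maps a fully qualified class name to the class itself.
    _registry: ClassVar[Dict[str, type]] = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Every subclass registers itself at class-definition time.
        BaseTool._registry[f"{cls.__module__}.{cls.__qualname__}"] = cls

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    def tool_type(self) -> str:
        """Fully qualified name of the concrete class."""
        return f"{type(self).__module__}.{type(self).__qualname__}"

    def to_dict(self) -> Dict[str, Any]:
        return {"tool_type": self.tool_type, "name": self.name}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "BaseTool":
        # Dispatch to the registered subclass; unknown types fall back to cls.
        target = cls._registry.get(data.get("tool_type", ""), cls)
        return target(name=data["name"])


class SearchTool(BaseTool):
    """Hypothetical concrete tool used only for this illustration."""

    def run(self, query: str) -> str:
        return f"searching for {query}"


restored_tool = BaseTool.from_dict(SearchTool("web").to_dict())
```

Because registration happens when a subclass is defined, callers that hold a list of base-typed tools need no per-call-site restoration logic, which matches the commit's note that validate_tools and Task no longer need changes.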
tool_type computed field is legitimately required in the schema.
Flow.from_checkpoint deserialized the checkpoint_* fields but never copied them back into the private execution attrs (_completed_methods, _method_outputs, _method_execution_counts, _state). Calling kickoff() after from_checkpoint would restart from scratch. Add _restore_from_checkpoint that copies the checkpoint fields into the private attrs, using the existing _restore_state method for state reconstruction.
Flow.from_checkpoint deserializes as base Flow (entity_type discriminator), losing subclass methods and state type. When called on a subclass like MyFlow.from_checkpoint(), create a cls instance and transfer the checkpoint fields so @start methods, listeners, and structured state are available.
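The two Flow fixes above (copying checkpoint fields into the private attrs, and building an instance of the calling subclass) can be sketched together. This is a simplified illustration: the real from_checkpoint API may take different arguments, the checkpoint_* key names and the steps() helper are assumptions, and only _completed_methods and _state are modeled.

```python
import json
from typing import Any, Dict, List


class Flow:
    def __init__(self) -> None:
        self._completed_methods: List[str] = []
        self._state: Dict[str, Any] = {}

    def steps(self) -> List[str]:
        """Ordered method names to run; subclasses override (hypothetical)."""
        return []

    def _restore_from_checkpoint(self, data: Dict[str, Any]) -> None:
        # Copy the deserialized checkpoint fields into the private attrs,
        # otherwise kickoff() would start again from scratch.
        self._completed_methods = list(data.get("checkpoint_completed_methods", []))
        self._state = dict(data.get("checkpoint_state", {}))

    @classmethod
    def from_checkpoint(cls, raw_json: str) -> "Flow":
        # Instantiating cls (not Flow) keeps subclass methods available.
        flow = cls()
        flow._restore_from_checkpoint(json.loads(raw_json))
        return flow

    def kickoff(self) -> List[str]:
        """Run only the steps that the checkpoint did not mark as completed."""
        ran = []
        for name in self.steps():
            if name in self._completed_methods:
                continue
            getattr(self, name)()
            self._completed_methods.append(name)
            ran.append(name)
        return ran


class MyFlow(Flow):
    def steps(self) -> List[str]:
        return ["fetch", "summarize"]

    def fetch(self) -> None:
        self._state["doc"] = "text"

    def summarize(self) -> None:
        self._state["summary"] = self._state["doc"].upper()


snapshot = json.dumps(
    {"checkpoint_completed_methods": ["fetch"], "checkpoint_state": {"doc": "text"}}
)
resumed = MyFlow.from_checkpoint(snapshot)
ran_steps = resumed.kickoff()
```

Here kickoff() runs only summarize, because fetch is already recorded as completed and its output is available in the restored state.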
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 3be42f9.
iris-clawd (Contributor) approved these changes on Apr 6, 2026
Approved. RuntimeState checkpoint/resume with event record, serializable executors/tools/LLMs, and Flow subclass restoration all look solid after multiple review rounds. Minor follow-ups (EventRecord memory growth for long runs, lazy event type map) can be addressed separately.

Summary
- RuntimeState as optional third arg to event bus handlers
- RuntimeState.checkpoint(dir) writes timestamped JSON snapshots
- Crew.from_checkpoint(path) restores and resumes via kickoff()
- _get_execution_start_index skips tasks with existing output
- CrewStructuredTool, StandardPromptResult, SystemPromptResult, TokenCalcHandler to BaseModel
- CrewAgentExecutorMixin uses Field(exclude=True) for back-references

Test plan
Note
High Risk
High risk because it changes core execution flow (agent executors, task skipping/resume) and event emission semantics by introducing shared RuntimeState recording and passing it into handlers.

Overview
Adds first-class checkpoint/resume by serializing a unified RuntimeState (entities + event record) to timestamped JSON and restoring Crew, Flow, and Agent via new from_checkpoint() APIs that rehydrate runtime links, rebuild event scope, and resume execution from the first incomplete task.

Integrates RuntimeState into the event system: the event bus now records emitted events, auto-registers emitting entities, and optionally passes the current runtime state as a third argument to sync/async handlers while remaining compatible with existing 2-arg handlers.

Refactors executor/state serialization to support checkpointing: introduces BaseAgentExecutor (Pydantic model) as a shared base for CrewAgentExecutor and AgentExecutor, adds resumable message handling, and updates LLM/executor fields to round-trip as structured dicts (with llm_type/executor_type discriminators) across agents/crews.
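
One way an event bus can pass state as an optional third argument while staying compatible with 2-arg handlers is to inspect each handler's signature before calling it. The sketch below shows that technique under stated assumptions: whether the PR uses signature inspection is not confirmed by the description, and this RuntimeState, EventBus, and accepts_state are hypothetical minimal stand-ins covering sync handlers only.

```python
import inspect
from typing import Any, Callable, List


class RuntimeState:
    """Hypothetical minimal stand-in that only records emitted events."""

    def __init__(self) -> None:
        self.event_record: List[Any] = []


def accepts_state(handler: Callable[..., Any]) -> bool:
    """Return True if the handler can take a third positional argument."""
    params = list(inspect.signature(handler).parameters.values())
    if any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params):
        return True
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return len([p for p in params if p.kind in positional_kinds]) >= 3


class EventBus:
    def __init__(self, state: RuntimeState) -> None:
        self.state = state
        self.handlers: List[Callable[..., Any]] = []

    def on(self, handler: Callable[..., Any]) -> Callable[..., Any]:
        self.handlers.append(handler)
        return handler

    def emit(self, source: Any, event: Any) -> None:
        # Record first, so handlers see the event in the state they receive.
        self.state.event_record.append(event)
        for handler in self.handlers:
            if accepts_state(handler):
                handler(source, event, self.state)
            else:
                handler(source, event)


bus = EventBus(RuntimeState())
legacy_calls: List[Any] = []
stateful_calls: List[int] = []


@bus.on
def legacy_handler(source: Any, event: Any) -> None:
    legacy_calls.append(event)


@bus.on
def stateful_handler(source: Any, event: Any, state: RuntimeState) -> None:
    stateful_calls.append(len(state.event_record))


bus.emit("crew", "task_started")
bus.emit("crew", "task_completed")
```

Existing 2-arg handlers keep working unchanged, and handlers that declare a third parameter receive the shared state with the event record already updated.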