Skip to content

Conversation

@frankbria
Copy link
Owner

@frankbria frankbria commented Feb 9, 2026

Summary

Implements #352: Adds context window management to the ReAct loop with token tracking and 3-tier message compaction strategy.

  • Token estimation using len(text) / 4 conservative heuristic
  • Compaction triggered at configurable threshold (default 85% of context window)
  • 3-tier compaction: tool result summarization → intermediate step removal → conversation summary
  • Preserves last 5 tool call/result pairs, error info, architectural decisions, and file paths
  • Configurable via CODEFRAME_REACT_COMPACT_THRESHOLD environment variable (0.5-0.95)

Acceptance Criteria

  • Token estimation for each message in conversation history (conservative: len(text) / 4)
  • Compaction triggered at 85% of context window capacity
  • 3-tier compaction strategy (tool result clearing, intermediate step removal, conversation summary)
  • Preserved during compaction: architectural decisions, unresolved errors, file paths, last 5 tool pairs
  • Discarded during compaction: redundant file reads, passed test output, consumed search results
  • Configurable threshold via CODEFRAME_REACT_COMPACT_THRESHOLD env var
  • 52 tests verifying all behaviors
  • All 1312 existing core tests pass (no regressions)

Test Plan

  • Unit tests for token estimation (text, tool calls, tool results)
  • Compaction trigger threshold tests (including env var and clamping)
  • Tier 1 tests: tool result summarization, error preservation, recent pair preservation
  • Tier 2 tests: redundant read removal, edit-interleaved reads kept, mixed test results preserved
  • Tier 3 tests: summary creation, file path and error preservation
  • Orchestration tests: tier cascading, multiple rounds, stats reporting
  • Integration tests: compaction in react loop, existing behavior unaffected
  • Linting clean (ruff)

Implementation Notes

  • Replaces the old _trim_messages() character-based approach with smart token-aware compaction
  • _trim_messages() kept but marked deprecated (not deleted to avoid breaking any external callers)
  • Added AGENT_COMPACTION event type for observability
  • Code review identified and fixed a false-positive in Tier 2's test result filtering

Closes #352

Summary by CodeRabbit

  • New Features

    • Automatic conversation compaction: summarizes tool results, removes redundant exchanges, and condenses older messages to stay within token limits.
    • Token usage tracking: public API to retrieve detailed token statistics.
    • Configurable token thresholds and context window sizing to tune compaction behavior.
    • Emits compaction events when compaction runs.
  • Tests

    • Comprehensive tests covering compaction tiers, token estimation, config behavior, and integration with the agent loop.

Test User added 5 commits February 9, 2026 15:45
Add module-level constants (DEFAULT_COMPACTION_THRESHOLD,
PRESERVE_RECENT_PAIRS, DEFAULT_CONTEXT_WINDOW) and instance attributes
for token budget tracking. Implement _estimate_message_tokens,
_estimate_conversation_tokens, and _should_compact methods using
len(str)/4 heuristic. Env var CODEFRAME_REACT_COMPACT_THRESHOLD
configures threshold with validation (0.5-0.95 range, clamping).
Tier 1 (_compact_tool_results): Replace verbose tool result content
with short summaries while preserving error results.
Tier 2 (_remove_intermediate_steps): Remove redundant file reads
(same file read without intervening edit) and passed test outputs.
Tier 3 (_summarize_old_messages): Replace oldest messages with a
single summary preserving file paths, errors, and architectural
decisions. All tiers respect PRESERVE_RECENT_PAIRS boundary.
Add compact_conversation orchestrator that runs tiers 1->2->3 in order,
stopping when under threshold. Add get_token_stats for diagnostics.
Replace _trim_messages call in _react_loop with compact_conversation.
Emit AGENT_COMPACTION event and verbose output when compaction occurs.
Add AGENT_COMPACTION to EventType.
Add helper _build_conversation, tier edge case tests (token savings,
unique reads), orchestration tests (tier1 early stop, all tiers,
multiple rounds), and regression guard tests confirming existing
ReactAgent behavior is unaffected. Total: 51 compaction tests.
Tier 2 intermediate step removal was matching any test output containing
"passed", even when failures were also present. Now requires absence of
"failed" and "error" to avoid discarding useful failure context.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

Walkthrough

Added token-budget tracking and a three-tier conversation compaction system to ReactAgent, plus a new AGENT_COMPACTION event type and tests verifying compaction behavior and token-estimation/configuration.

Changes

Cohort / File(s) Summary
Event Type Addition
codeframe/core/events.py
Added new AGENT_COMPACTION = "AGENT_COMPACTION" event type constant.
Token Budgeting & Compaction Logic
codeframe/core/react_agent.py
Added context-window and compaction configuration constants; introduced token-estimation helpers, compaction decision logic, three-tier compaction methods (_compact_tool_results, _remove_intermediate_steps, _summarize_old_messages), orchestration (compact_conversation, _finalize_compaction), counters/attrs (_total_tokens_used, _compaction_count, _context_window_size, _compaction_threshold), replaced prior trimming with compaction in the react loop, and emitted AGENT_COMPACTION events.
Compaction Test Suite
tests/core/test_react_agent_compaction.py
Added comprehensive tests covering constants, env-driven config, token estimation, compaction triggering, each compaction tier, orchestration, preservation rules, non-mutating behavior, and integration with the React loop.

Sequence Diagram(s)

sequenceDiagram
    participant ReactLoop as React Loop
    participant Agent as ReactAgent
    participant Tier1 as Tier 1: ToolResults
    participant Tier2 as Tier 2: IntermediateSteps
    participant Tier3 as Tier 3: Summarizer
    participant Emitter as EventEmitter

    ReactLoop->>Agent: iteration / _react_loop
    Agent->>Agent: _estimate_conversation_tokens(messages)
    Agent->>Agent: _should_compact(messages)
    alt compaction needed
        Agent->>Tier1: _compact_tool_results(messages)
        Tier1-->>Agent: (messages', tokens_saved)
        Agent->>Tier2: _remove_intermediate_steps(messages')
        Tier2-->>Agent: (messages'', tokens_saved)
        Agent->>Tier3: _summarize_old_messages(messages'')
        Tier3-->>Agent: (messages_final, tokens_saved)
        Agent->>Agent: _finalize_compaction(messages_final, stats)
        Agent->>Emitter: emit(AGENT_COMPACTION, stats)
        Emitter-->>ReactLoop: event published
    end
    Agent-->>ReactLoop: continue with compacted messages
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I nibble tokens, tidy threads of chat,
Three gentle trims to keep the context flat,
I stash the noisy outputs into light rhyme,
Preserve the last steps, and save us some time.
Hop—compaction done, the loop runs fine!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: token budget tracking and message compaction for ReactAgent, matching the core additions in the changeset.
Linked Issues check ✅ Passed The PR implements all acceptance criteria from issue #352: token estimation with len(text)/4 heuristic, 85% threshold trigger, 3-tier compaction strategy, preservation rules, discard rules, configurable env var, comprehensive tests, and event emissions.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #352 objectives: new AGENT_COMPACTION event type, token budgeting logic, 3-tier compaction methods, and comprehensive test coverage validating the feature.
Docstring Coverage ✅ Passed Docstring coverage is 97.40% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/issue-352-token-budget-compaction

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
codeframe/core/react_agent.py (1)

966-1049: Consider adding bounds check for multi-tool-call assistant messages.

Tier 2 only examines tool_calls[0] when determining the tool name and file path (lines 999-1001, 1013-1014). If an assistant message contains multiple tool calls (e.g., reading two different files), only the first call is considered. This could miss removal opportunities or incorrectly classify pairs.

This is a minor limitation rather than a bug, but worth noting for future enhancement.


Comment @coderabbitai help to get the list of available commands and usage tips.

@macroscopeapp
Copy link
Contributor

macroscopeapp bot commented Feb 9, 2026

Add token budget tracking and multi‑tier message compaction to ReactAgent runs and emit EventType.AGENT_COMPACTION when conversation usage exceeds a 0.85 threshold within a 200000‑token window

Introduce token estimation and progressive compaction in ReactAgent, add environment‑bounded threshold control, emit a new compaction event, and include a stats accessor. Core logic lives in react_agent.py; a new event constant is added in events.py; tests cover tiers and orchestration in test_react_agent_compaction.py.

📍Where to Start

Start with the ReactAgent.compact_conversation orchestrator and its callers in react_agent.py, then review _should_compact, tier methods, and event emission in run.


Macroscope summarized b7e56db.

@claude
Copy link

claude bot commented Feb 9, 2026

Code Review: Token Budget Tracking and Message Compaction (#352)

Overview

This PR implements a well-designed 3-tier message compaction strategy for the ReactAgent to prevent context window overflow. The implementation is comprehensive with excellent test coverage (52 tests) and follows the project's v2 architectural principles.


✅ Strengths

1. Architecture & Design

  • Clean separation of concerns with dedicated methods for each compaction tier
  • Conservative token estimation (len/4) is appropriate given the heuristic nature
  • Smart preservation strategy (last 5 tool call/result pairs) balances context retention with space savings
  • Graceful tier escalation prevents unnecessary summarization when simpler compaction suffices

2. Test Coverage

  • Exceptional test coverage (52 tests) covering all edge cases
  • Tests are well-organized by functionality (constants, initialization, tiers, orchestration)
  • Integration tests verify the feature works in the actual ReAct loop
  • Good use of fixtures and helper functions (_make_pair) for test data generation

3. Observability

  • New AGENT_COMPACTION event type enables monitoring
  • Stats dictionary provides detailed metrics (tokens_before/after, tiers_used, compaction_number)
  • Verbose output shows compaction activity during execution
  • get_token_stats() allows external inspection of token usage

4. Configuration

  • CODEFRAME_REACT_COMPACT_THRESHOLD env var with sensible defaults (0.85)
  • Input validation with clamping (0.5-0.95) prevents misconfiguration
  • Constants are module-level for easy adjustment

🔍 Issues & Concerns

Critical: Mutating Input Messages

Location: react_agent.py:955

The _compact_tool_results method mutates the input messages list in place. While the calling code in compact_conversation passes list(messages) copies, this creates a shallow copy. The dictionaries themselves are still shared references, so mutating messages[i] can affect the original data structure.

Impact: Potential unintended side effects if the messages list is used elsewhere after compaction.

Fix: Create deep copies of messages being modified, or document that the method expects ownership of the list.


Moderate: Tier 2 Assumption About Message Pairing

Location: react_agent.py:980

The code assumes strict alternating assistant/user message pairing. If the conversation history ever contains consecutive assistant or user messages, this logic will misalign.

Recommendations:

  1. Add assertion or validation to check message roles
  2. Or make it robust to non-paired messages by checking roles before processing

Minor: Token Estimation Edge Cases

Location: react_agent.py:872, 875

Issue 1: JSON serialization of tool_calls and tool_results happens on every token estimation call. For large conversations, this could become expensive.

Issue 2: No handling of messages with neither content nor tool_calls/results. Empty messages return 0 tokens but should probably count for message overhead.

Suggestion: Add a base token count per message (~5 tokens for role/metadata).


📊 Performance Considerations

Positive:

  • Early termination minimizes overhead
  • Token estimation is fast (simple len() operations)
  • Only compacts when needed

Potential Concern:

  • Multiple passes over messages list (estimate → tier1 → tier2 → tier3 → re-estimate)
  • For very large conversation histories, this could add latency
  • Consider adding performance benchmarks for 1000-message conversations

📝 Documentation & Code Quality

Strengths:

  • Excellent docstrings on all new methods
  • Clear comments explaining the 3-tier strategy
  • Consistent naming conventions
  • Type hints on all new code

Minor Issues:

  1. _finalize_compaction method name doesn't clearly indicate it also builds stats
  2. get_token_stats is public but not documented in the class docstring

✅ Testing Quality

Excellent coverage:

  • Constants validated ✓
  • Initialization tested ✓
  • Token estimation edge cases ✓
  • Each tier tested in isolation ✓
  • Orchestration tested ✓
  • Integration with ReAct loop ✓
  • Existing behavior preserved ✓

One gap: No test for multiple compactions within a single run() execution.


📋 Recommendations Summary

Must Fix (Critical):

  • Address message mutation issue in _compact_tool_results (use deep copy or document ownership)

Should Fix (Moderate):

  • Add validation for assistant/user message pairing in Tier 2
  • Add base token overhead to _estimate_message_tokens

Nice to Have (Minor):

  • Optimize JSON serialization in token estimation
  • Improve Tier 3 architectural keyword extraction
  • Add performance benchmark test for large conversations

🎯 Verdict

Overall: Strong implementation

This PR successfully implements the acceptance criteria from #352 with excellent test coverage and thoughtful design. The 3-tier compaction strategy is well-architected and the code is clean.

The critical issue around message mutation should be addressed before merge. The moderate issues are worth fixing but not blockers. The minor suggestions can be addressed in follow-up PRs if desired.

Recommendation: Request changes for the mutation issue, then approve after fix.

Great work on this feature! The token budget management will significantly improve the agent's ability to handle long-running tasks without hitting context limits.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@codeframe/core/react_agent.py`:
- Around line 1054-1103: The compaction logic in the method that builds
summary_msg ignores any existing "[Summary]" user messages in old_messages,
causing previously preserved file_paths/errors/decisions to be dropped on
subsequent compactions; update the code that loops old_messages (the block that
builds file_paths, errors, decisions) to first detect and parse any existing
summary messages (role == "user" and content.startswith("[Summary]")) and
extract the previously saved file paths, error snippets, and decision snippets
(parse patterns like "analyzed N files (...)", "errors encountered (...)", and
"architectural decisions: ...") into the file_paths/errors/decisions lists
before processing tool_calls/tool_results/assistant content so earlier preserved
data is merged rather than overwritten.
- Around line 353-360: The AGENT_COMPACTION event currently emits only
compact_stats; modify the emission so compact_stats includes the current task id
for traceability by adding something like compact_stats["task_id"] =
self.task_id (or the appropriate task identifier property) immediately before
calling self._emit(EventType.AGENT_COMPACTION, compact_stats) inside the
compacting block in react_agent.py (see compact_conversation, compact_stats,
_emit, and EventType.AGENT_COMPACTION).
- Around line 928-953: The summaries currently use the single variable tool_name
(from _extract_tool_name_from_pair) for all tool_results, which mislabels
multi-tool responses; before iterating tool_results, build a mapping of
tool_call_id -> tool name by extracting the tool calls from the corresponding
assistant message (e.g., inspect messages[i - 1] for a tool calls list or
construct a dict from any existing tool-call metadata), then inside the loop use
mapping.get(tr["tool_call_id"], tool_name) to pick the correct name for each tr
(falling back to the old _extract_tool_name_from_pair result if no mapping entry
exists) when composing summary and updating saved.

In `@tests/core/test_react_agent_compaction.py`:
- Around line 825-858: In test_compact_conversation_called_in_loop, after
calling agent.run("task-1") add assertions to verify the MockProvider recorded
LLM calls (e.g. assert provider.call_count >= 1 or assert len(provider.calls) >=
1) so the test not only checks compact_conversation was invoked but also that
the provider was actually called during the react loop; place these assertions
immediately after agent.run and before asserting compact_called.

- Include task_id in AGENT_COMPACTION event payload for SSE correlation
- Map tool results to actual tool names via tool_call_id for multi-tool calls
- Preserve prior [Summary] messages across compaction rounds
- Add defensive copy in compact_conversation to avoid caller mutation
- Add role validation in tier 2 before assuming assistant/user pairing
- Add MockProvider call_count assertion in integration test
@frankbria
Copy link
Owner Author

Responses to Code Review Feedback

Thanks for the thorough review! Here's what was addressed in b7e56db:

Fixed

Critical: Mutating Input Messages
Added a defensive messages = list(messages) at the start of compact_conversation() so the caller's list is never mutated. Added test test_defensive_copy_does_not_mutate_caller_list to verify.

Moderate: Tier 2 Message Pairing Assumption
Added role validation (assistant.get("role") != "assistant" or user.get("role") != "user") before processing each pair — non-standard pairs are now skipped safely. Added test TestTier2RoleValidation::test_skips_non_standard_role_pairs.

Intentionally Skipped

Minor: Base token overhead per message
The len(text)/4 heuristic is already conservative. Adding a ~5 token overhead per message would add complexity without meaningfully improving accuracy for a heuristic that's only used to decide when to compact (not for billing/limits).

Minor: JSON serialization optimization
Token estimation is called infrequently (once per iteration check + during compaction). Premature optimization here would add complexity for negligible benefit. Will revisit if profiling shows it as a bottleneck.

@claude
Copy link

claude bot commented Feb 10, 2026

Follow-Up Review: All Issues Resolved ✅

I've reviewed the latest commit (fix: address PR review feedback for compaction system) and can confirm that all previously identified issues have been properly addressed.


✅ Fixed Issues from Previous Review

1. Critical: Message Mutation Issue - RESOLVED

  • Previous concern: Shallow copy in compact_conversation could cause unintended mutations
  • Fix applied: Line 1145 now includes defensive copy with comment: # Defensive copy — avoid mutating the caller's list
  • Status: ✅ Fixed

2. Moderate: Role Validation in Tier 2 - RESOLVED

  • Previous concern: Assumed strict assistant/user pairing without validation
  • Fix applied: Lines 991-993 now validate roles before processing: if assistant.get("role") != "assistant" or user.get("role") != "user": continue
  • Status: ✅ Fixed

3. Moderate: Tool Name Mapping - RESOLVED

  • Previous concern: Single tool_name variable for multi-tool responses
  • Fix applied: Lines 931-937 build tool_name_by_id mapping from preceding assistant message
  • Status: ✅ Fixed

4. Minor: Prior Summary Preservation - RESOLVED

  • Previous concern: Repeated compactions could lose previously preserved data
  • Fix applied: Lines 1070-1079 extract and preserve existing [Summary] messages
  • Status: ✅ Fixed

5. Minor: Event Traceability - RESOLVED

  • Previous concern: AGENT_COMPACTION event lacked task correlation
  • Fix applied: Line 362 includes task_id in event payload
  • Status: ✅ Fixed

🎯 Final Assessment

Code Quality: Excellent

  • All code follows v2 architectural principles
  • Type hints throughout
  • Clear docstrings
  • Defensive programming practices

Test Coverage: Comprehensive

  • 52 tests covering all tiers and edge cases
  • Integration tests verify ReAct loop behavior
  • No regressions in existing 1312 core tests

Performance: Appropriate

  • Conservative token estimation (len/4) is intentionally simple
  • Early termination minimizes overhead
  • Compaction only triggers when needed

Observability: Strong

  • Detailed stats in AGENT_COMPACTION events
  • Verbose mode provides visibility during execution
  • get_token_stats() enables external monitoring

📋 Recommendation

APPROVE

This PR is ready to merge. The implementation successfully delivers all acceptance criteria from issue #352:

  • ✅ Token estimation with conservative heuristic
  • ✅ Configurable compaction threshold (default 85%, env var supported)
  • ✅ 3-tier compaction strategy with intelligent preservation
  • ✅ Preserves architectural decisions, errors, file paths, last 5 tool pairs
  • ✅ Discards redundant reads, passed test output, consumed search results
  • ✅ Comprehensive test coverage (52 tests)
  • ✅ No regressions

The latest commit addressed all review feedback thoroughly and professionally. Great work on this feature! 🎉

The token budget management will significantly improve the ReactAgent's ability to handle long-running tasks without hitting context limits.


# Build a map of file reads and edits in the compactable zone
# Process pairs: assistant at even index, user at odd index
for i in range(0, cutoff - 1, 2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low

core/react_agent.py:987 Suggestion: _remove_intermediate_steps relies on even/odd indices and a truncated scan, which breaks when tier‑3 adds a user summary and can miss the last assistant step. Iterate by roles instead (find each assistant and its following user) and scan to the true end. Optionally emit the tier‑3 summary as a system message to avoid affecting pairing.

🚀 Want me to fix this? Reply ex: "fix it for me".

🤖 Prompt for AI
In file codeframe/core/react_agent.py around line 987:

Suggestion: `_remove_intermediate_steps` relies on even/odd indices and a truncated scan, which breaks when tier‑3 adds a `user` summary and can miss the last `assistant` step. Iterate by roles instead (find each `assistant` and its following `user`) and scan to the true end. Optionally emit the tier‑3 summary as a `system` message to avoid affecting pairing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Phase 2.5-E] Implement token budget tracking and message compaction

1 participant