[LEADS-218] Improve partial tool match reporting with detailed extra tool information #181
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main
Conversation
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
Walkthrough
Tool evaluation comparison was extended to return and propagate richer match statistics for partial matches (including extra actual tools), new helper functions were added to format mode and stats, and tests were updated to expect the revised messages and details.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
d83ab89 to d69c507
Actionable comments posted: 7
🧹 Nitpick comments (5)
tests/script/test_compare_evaluations.py (1)
194-237: Consider adding defensive checks for statistical test key existence.

Lines 236-237 directly access `result["tests"]["t_test"]["significant"]` and `result["tests"]["mann_whitney_u"]["significant"]` without verifying the keys exist. Other tests in this file (e.g., lines 164-172, 384-391) use conditional checks like `if "t_test" in result["tests"]:` before accessing results. With only 5 samples per group, there's a possibility (depending on implementation) that statistical tests might not be performed or might be skipped, which would cause a `KeyError`.

🔧 Suggested defensive pattern (consistent with other tests in file)
```diff
 # Verify statistical tests show non-significance for this tiny shift
-assert not result["tests"]["t_test"]["significant"]
-assert not result["tests"]["mann_whitney_u"]["significant"]
+assert "t_test" in result["tests"], "Expected t_test to be performed"
+assert not result["tests"]["t_test"]["significant"]
+assert "mann_whitney_u" in result["tests"], "Expected mann_whitney_u to be performed"
+assert not result["tests"]["mann_whitney_u"]["significant"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/script/test_compare_evaluations.py` around lines 194 - 237, The test test_compare_score_distributions_precise_delta should defensively check for the presence of statistical test results before asserting on their "significant" field; update the assertions that read result["tests"]["t_test"]["significant"] and result["tests"]["mann_whitney_u"]["significant"] to first verify the keys exist in result["tests"] (e.g., if "t_test" in result["tests"]:) and only then assert False (or equivalent) on the "significant" value, matching the pattern used elsewhere in the file.

Makefile (1)
85-85: Consider centralizing quality-check paths to prevent drift.

Line 85 repeats the same path list used in Lines 45 and 48 (and Line 82). A shared variable will keep Black/Ruff/Docstyle aligned over time.
♻️ Proposed refactor
```diff
+QUALITY_PATHS := src tests script lsc_agent_eval
+
 black-check:
-	uv run black src tests script lsc_agent_eval --check
+	uv run black $(QUALITY_PATHS) --check
 black-format:
-	uv run black src tests script lsc_agent_eval
+	uv run black $(QUALITY_PATHS)
 docstyle:
-	uv run pydocstyle -v src tests script lsc_agent_eval
+	uv run pydocstyle -v $(QUALITY_PATHS)
 ruff:
-	uv run ruff check src tests script lsc_agent_eval
+	uv run ruff check $(QUALITY_PATHS)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` at line 85, Create a central variable (e.g., QUALITY_PATHS or QC_PATHS) in the Makefile that holds the shared path list "src tests script lsc_agent_eval" and replace the repeated literal lists in the targets that run ruff, black, and docstyle (the lines invoking "uv run ruff check ..." and the similar commands at Lines ~45, ~48, ~82) to reference this variable; update the targets that call "uv run ruff check", the black invocation, and the docstyle invocation to use $(QUALITY_PATHS) so the paths stay synchronized.

tests/unit/core/metrics/custom/test_tool_eval.py (1)
456-479: Add a regression test for empty expected patterns in partial mode.

The new detailed-stat tests are great, but there's still no guard against `full_match=False` mistakenly passing when expected is empty and actual has tools. Adding that case will prevent future regressions.

🧪 Suggested test case
```diff
+    def test_partial_match_empty_expected_fails_when_actual_has_tools(self) -> None:
+        """Empty expected pattern should not pass when actual tool calls exist."""
+        expected = [[]]
+        actual = [[{"tool_name": "tool1", "arguments": {}}]]
+
+        success, details = evaluate_tool_calls(expected, actual, full_match=False)
+
+        assert success is False
+        assert "didn't match any of the" in details or "0/0 expected matched" in details
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/metrics/custom/test_tool_eval.py` around lines 456 - 479, The test suite is missing a regression test ensuring evaluate_tool_calls(full_match=False) fails when expected is empty but actual contains tool calls; add a new unit test in tests/unit/core/metrics/custom/test_tool_eval.py that calls evaluate_tool_calls with expected = [] and actual containing one or more tool entries (e.g., [{"tool_name":"toolX","arguments":{}}]), assert the returned success is False, and that details mention zero matched/expected and list the extra tools (toolX) and keywords like "partial" and "ordered" to mirror the existing detailed-stat checks; place it alongside test_partial_match_failure_reports_detailed_stats and reference evaluate_tool_calls to locate behavior under test.

.github/workflows/e2e_tests.yaml (1)
18-18: Prefer pinning the Lightspeed Stack image to an immutable digest.

Using `:latest` makes E2E results non-reproducible across runs. Pin a versioned tag or `@sha256` digest to stabilize CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/e2e_tests.yaml at line 18, Replace the mutable image reference in the lsc_image_path array (currently "quay.io/lightspeed-core/lightspeed-stack:latest") with an immutable reference — either a specific versioned tag or preferably an `@sha256` digest — so CI uses a stable, reproducible image; update the value inside lsc_image_path accordingly and verify the digest/tag is the intended release.

tests/integration/system-config-query.yaml (1)
4-132: Consider extracting a shared base config to prevent query/streaming drift.

These two integration configs are almost identical; keeping them duplicated will make future edits error-prone. A shared base + small mode override would be easier to maintain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/system-config-query.yaml` around lines 4 - 132, Extract a shared base YAML for the common settings (llm, embedding, output, metrics_metadata, visualization, logging, environment, core) and have the two integration configs inherit/override only the differing API fields (primarily api.endpoint_type and any streaming-specific fields like time_to_first_token/streaming_duration columns); update references to api.provider, api.model, api.timeout, api.cache_dir/cache_enabled and api.api_base in the derived configs so they no longer duplicate the full blocks, and ensure the new base includes the shared csv_columns and summary_config_sections so maintenance edits affect both query and streaming modes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e_tests.yaml:
- Line 4: The workflow uses on: [pull_request_target] and checks out PR head
code while exposing OPENAI_API_KEY to untrusted PRs; change the trigger to
pull_request (or limit pull_request_target steps to only non-PR-code actions),
ensure any steps that run PR code use actions/checkout@v3 without secrets and do
not pass OPENAI_API_KEY, and move secret-requiring steps into a separate job
that runs under pull_request_target only for maintainers (or use a gated
merge/dispatch) so that OPENAI_API_KEY is never available to untrusted PR head
code.
In `@src/lightspeed_evaluation/__init__.py`:
- Around line 13-15: Remove the "# ruff: noqa: F401" and stop importing symbols
directly under TYPE_CHECKING; instead, import the module as an alias and bind
the names from that alias so linters don't see unused imports. Concretely, in
the TYPE_CHECKING block replace "from lightspeed_evaluation.api import evaluate,
evaluate_conversation, evaluate_turn" with an import of the module (e.g., "from
lightspeed_evaluation import api as _api") and then assign the names from _api
(e.g., "evaluate = _api.evaluate", "evaluate_conversation =
_api.evaluate_conversation", "evaluate_turn = _api.evaluate_turn") so evaluate,
evaluate_conversation, and evaluate_turn are referenced without using "# noqa".
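For illustration, the alias pattern described in the prompt above might look like this (a sketch only; the `lightspeed_evaluation.api` module path and symbol names are taken from the prompt, not verified against the repo):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import the module under an alias and bind names from it, so linters
    # see the imports as used without needing a "# ruff: noqa: F401" escape hatch.
    from lightspeed_evaluation import api as _api

    evaluate = _api.evaluate
    evaluate_conversation = _api.evaluate_conversation
    evaluate_turn = _api.evaluate_turn
```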
In `@src/lightspeed_evaluation/core/api/client.py`:
- Around line 121-137: The cache lookup is missing conversation scope which
causes identical queries from different conversations to return wrong cached
responses; update the cache key generation used by _get_cached_response and
_add_response_to_cache (and any helper inside _prepare_request if it composes
keys) to include conversation_id (and attachments if relevant) so the key is
unique per conversation + query; adjust _get_cached_response to compute/accept
the new composite key and _add_response_to_cache to store under that same
composite key so cached responses are scoped correctly.
- Around line 88-90: The client is currently initialized with a hardcoded TLS
verification flag; update the client initialization to use the APIConfig values
instead: replace the hardcoded verify usage with self.config.ssl_verify and, if
self.config.ssl_cert_file is set, pass it as the client's cert parameter (or
equivalent) so TLS behavior is governed by APIConfig; look for the client
construction where base_url, verify (or verify=verify) and timeout are set and
change it to reference self.config.ssl_verify and self.config.ssl_cert_file.
In `@src/lightspeed_evaluation/core/llm/custom.py`:
- Around line 182-195: In the finally block that calls
self._track_tokens(response) (inside the method surrounding lines ~182), wrap
that call in its own try/except to ensure token-tracking errors do not suppress
or alter the original return/exception; catch Exception and log or ignore
safely. In _track_tokens, remove the "# pylint: disable=protected-access" and
stop reading the private _hidden_params; instead detect cache-hits from public
response metadata (e.g., response.headers or response.get("headers", {})) by
checking for Litellm's documented cache indicator such as 'x-litellm-cache-key'
(use getattr(response, "headers", {}) safely), and guard all attribute access
with getattr/dict.get to avoid AttributeError; keep using
TokenTracker.get_active() and only proceed to record tokens when tracker exists
and response is not None. Ensure no use of private fields and no pylint disables
remain.
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 726-733: The handler currently silences non-list rubrics by
setting data["rubrics"] = None; instead, when raw_rubrics is present but not a
list you should fail fast: check raw_rubrics after raw.get("rubrics") and if it
is not a list raise a clear exception (e.g., TypeError or ValueError) indicating
"rubrics must be a list", otherwise proceed to build data["rubrics"] via
GEvalRubricConfig.model_validate and return cls.model_validate(data); update the
code paths around raw_rubrics, data["rubrics"], GEvalRubricConfig.model_validate
and cls.model_validate to enforce this.
In `@src/lightspeed_evaluation/runner/evaluation.py`:
- Around line 139-141: The printed message "⚙️ Initializing Evaluation
Pipeline..." is stale because the flow now calls the API-level evaluate(...);
update the print in evaluation.py (the print call near the top of the
run/evaluation entry) to reflect the new behavior—for example, change to "⚙️
Calling evaluate(...) API..." or "⚙️ Starting evaluation via evaluate(...)"—so
that the message accurately describes that evaluate(...) is being invoked rather
than an evaluation pipeline being initialized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 17f56f64-c136-45b0-b813-414e0c03942e
📒 Files selected for processing (40)
- .github/workflows/e2e_tests.yaml
- .gitignore
- Makefile
- README.md
- config/system.yaml
- docs/EVALUATION_GUIDE.md
- docs/configuration.md
- pyproject.toml
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/api.py
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/custom.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/integration/lightspeed-stack.yaml
- tests/integration/run.yaml
- tests/integration/system-config-query.yaml
- tests/integration/system-config-streaming.yaml
- tests/integration/test_evaluation_data.yaml
- tests/script/test_compare_evaluations.py
- tests/unit/core/api/conftest.py
- tests/unit/core/api/test_client.py
- tests/unit/core/llm/test_custom.py
- tests/unit/core/metrics/custom/test_custom.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/test_geval.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/system/test_validator.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/runner/test_evaluation.py
- tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_evaluation_data.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/core/metrics/custom/test_custom.py
d69c507 to 0b70a0d
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
30-52: ⚠️ Potential issue | 🟠 Major: Failure detail should use the best failed alternative, not the last one.

Line 35 overwrites `last_result` each iteration, so Line 51 can report stats from a weaker later pattern and hide the closest mismatch.

Suggested fix
```diff
-    last_result: dict[str, Any] = {}
+    best_failure_stats: dict[str, Any] | None = None
     for i, expected_set in enumerate(expected):
         result = compare_tool_calls(
             expected_set, actual, ordered=ordered, full_match=full_match
         )
-        last_result = result
+        if not result["success"] and not full_match:
+            stats = result.get("stats")
+            if stats is not None:
+                candidate = (
+                    stats.get("matched", 0),
+                    -stats.get("unmatched", 0),
+                    -len(stats.get("extra_actual_tools", [])),
+                )
+                if best_failure_stats is None:
+                    best_failure_stats = stats
+                else:
+                    current_best = (
+                        best_failure_stats.get("matched", 0),
+                        -best_failure_stats.get("unmatched", 0),
+                        -len(best_failure_stats.get("extra_actual_tools", [])),
+                    )
+                    if candidate > current_best:
+                        best_failure_stats = stats
         if result["success"]:
             return _create_success_message(
@@
-        last_stats=last_result.get("stats"),
+        last_stats=best_failure_stats,
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` around lines 30 - 52, The loop in evaluate logic currently overwrites last_result each iteration, causing the final failure message to use the last tried pattern rather than the closest match; change this by keeping a best_failed_result instead of last_result and update it only when compare_tool_calls (the call to compare_tool_calls(expected_set, actual, ...)) returns a failure that is "closer" than the current best (compare using a concrete metric from result.get("stats") such as higher matched count, lower mismatch/count of missing keys, or lower distance — pick the existing sensible stat keys and handle missing stats defensively). After the loop, pass best_failed_result.get("stats") into _create_failure_message (instead of last_result) so the failure details reflect the best alternative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py`:
- Around line 467-471: The current branch prints "No tool calls made" whenever
expected_set is empty even if match_stats exists (e.g., in partial mode); change
the logic so match_stats takes precedence: update the condition that sets
message to "No tool calls made" to only run when expected_set is empty AND
match_stats is falsy (e.g., if len(expected_set) == 0 and not match_stats), or
alternatively reorder the branches so the match_stats branch (the one calling
_format_match_stats) is evaluated before the empty expected_set check; refer to
the variables expected_set, match_stats and the _format_match_stats call to
locate the code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4f8724ea-104c-47f7-bc94-c7db0f89d153
📒 Files selected for processing (3)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- tests/unit/core/metrics/custom/test_custom.py
- tests/unit/core/metrics/custom/test_tool_eval.py
…tool information

- Add reporting of extra tool names (not just count) in partial match results
- Include detailed statistics in failure messages for partial match scenarios
- Refactor message formatting with reusable helper functions
- Update existing test assertions to match new message format

Example output:
Success: "2/2 expected matched, 0 unmatched, 1 extra in response: [oc_logs]"
Failure: "0/2 expected matched, 2 unmatched, 3 extra in response: [tool_a, tool_b]"

Made-with: Cursor
0b70a0d to ebcfc8e
asamal4 left a comment
Thank you!! This improves a lot. However, we can improve further.
Up to you if you want to handle this in the current PR or a different one.
```python
result = compare_tool_calls(
    expected_set, actual, ordered=ordered, full_match=full_match
)
last_result = result
```
`last_result` is always replaced, so we effectively provide details for the last successful run, or in case of failure, the last alternative checked. This is not detailed reasoning; we are missing information about the previous alternatives.
```python
message = (
    f"Tool calls match expected structure and arguments "
    f"({match_mode}, {order_mode})"
    f"{_get_mode_suffix(ordered, full_match)}"
```
FYI, this is not required, as there is a metadata column which will have this info. But I am okay to keep it.
```python
success, details = evaluate_tool_calls(expected, actual, full_match=False)
assert success is False
assert "1/2 expected matched" in details
assert "1 unmatched" in details
```
Are we not reporting which of the expected tool calls didn't match? The test case shows that we only name the unmatched tools from the actual response, not the unmatched expected ones.
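If the stats also carried the unmatched expected names, the formatter could name both sides. A sketch (this helper only mirrors, and is not, the PR's actual `_format_match_stats`):

```python
def format_match_stats(
    matched: list[str],
    unmatched_expected: list[str],
    extra_actual: list[str],
) -> str:
    """Name unmatched *expected* tools, not just the extras in the response."""
    total = len(matched) + len(unmatched_expected)
    parts = [f"{len(matched)}/{total} expected matched"]
    if unmatched_expected:
        parts.append(
            f"{len(unmatched_expected)} unmatched: [{', '.join(unmatched_expected)}]"
        )
    if extra_actual:
        parts.append(
            f"{len(extra_actual)} extra in response: [{', '.join(extra_actual)}]"
        )
    return ", ".join(parts)
```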
```python
if not expected:
    return (0, 0)
extra_tools = [_get_sequence_tool_names(seq) for seq in actual]
return (0, 0, extra_tools)
```
Are we providing "0/0 matched"? We probably need a better message for an empty alternative.
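A clearer message for the empty-expected alternative might look like this (illustrative only; wording and function name are not from the PR):

```python
def describe_empty_expected(actual_tool_names: list[str]) -> str:
    """Avoid a confusing '0/0 matched' when an alternative expects no tools."""
    if not actual_tool_names:
        return "No tool calls expected and none made"
    return (
        f"No tool calls expected, but {len(actual_tool_names)} made: "
        f"[{', '.join(actual_tool_names)}]"
    )
```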
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Improvements
Tests