
[LEADS-218] Improve partial tool match reporting with detailed extra tool information#181

Open
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main from bsatapat-jpg:dev_biswajit

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Mar 5, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude-4.5-opus-high
  • Generated by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Improvements

    • More detailed tool-evaluation messages show matched vs expected counts, unmatched items and any extra tools, with standardized "(partial/full, ordered/unordered)" mode indicators.
    • Partial-match reporting and logs now include extra-tool details for clearer debugging.
  • Tests

    • Updated and added tests to assert richer partial-match failure messages and presence of specific extra-tool details.

@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 454fe252-eabf-4dd7-8259-293f488c2f2c

📥 Commits

Reviewing files that changed from the base of the PR and between 0b70a0d and ebcfc8e.

📒 Files selected for processing (3)
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_tool_eval.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/metrics/custom/test_custom.py

Walkthrough

Tool evaluation comparison was extended to return and propagate richer match statistics for partial matches (including extra actual tools), new helper functions were added to format mode and stats, and tests were updated to expect the revised messages and details.

Changes

Cohort / File(s) Summary
Core Implementation
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
Updated comparison logic to return (matched, total, extra_actual_tools); partial-match flow now records last_stats and includes extra_actual_tools in logs/messages. Added helpers: _get_sequence_tool_names, _get_mode_suffix, _format_match_stats. Updated signatures for _compare_partial, _create_success_message, and _create_failure_message to accept richer match data.
Tests
tests/unit/core/metrics/custom/test_custom.py, tests/unit/core/metrics/custom/test_tool_eval.py
Adjusted expected message strings to the new format (e.g., "expected matched") and added a test (test_partial_match_failure_reports_detailed_stats) asserting detailed partial-match failure info including extra tool counts and specific tool names.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main change: improving partial tool match reporting with detailed extra tool information, which aligns with the core modifications throughout the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 99.33% which is sufficient. The required threshold is 80.00%.



@bsatapat-jpg bsatapat-jpg changed the title [LEADS-218] Improve partial tool match reporting with detailed extra … [LEADS-218] Improve partial tool match reporting with detailed extra tool information Mar 5, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (5)
tests/script/test_compare_evaluations.py (1)

194-237: Consider adding defensive checks for statistical test key existence.

Lines 236-237 directly access result["tests"]["t_test"]["significant"] and result["tests"]["mann_whitney_u"]["significant"] without verifying the keys exist. Other tests in this file (e.g., lines 164-172, 384-391) use conditional checks like if "t_test" in result["tests"]: before accessing results.

With only 5 samples per group, there's a possibility (depending on implementation) that statistical tests might not be performed or might be skipped, which would cause a KeyError.

🔧 Suggested defensive pattern (consistent with other tests in file)
         # Verify statistical tests show non-significance for this tiny shift
-        assert not result["tests"]["t_test"]["significant"]
-        assert not result["tests"]["mann_whitney_u"]["significant"]
+        assert "t_test" in result["tests"], "Expected t_test to be performed"
+        assert not result["tests"]["t_test"]["significant"]
+        assert "mann_whitney_u" in result["tests"], "Expected mann_whitney_u to be performed"
+        assert not result["tests"]["mann_whitney_u"]["significant"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/script/test_compare_evaluations.py` around lines 194 - 237, The test
test_compare_score_distributions_precise_delta should defensively check for the
presence of statistical test results before asserting on their "significant"
field; update the assertions that read result["tests"]["t_test"]["significant"]
and result["tests"]["mann_whitney_u"]["significant"] to first verify the keys
exist in result["tests"] (e.g., if "t_test" in result["tests"]:) and only then
assert False (or equivalent) on the "significant" value, matching the pattern
used elsewhere in the file.
Makefile (1)

85-85: Consider centralizing quality-check paths to prevent drift.

Line 85 repeats the same path list used in Lines 45 and 48 (and Line 82). A shared variable will keep Black/Ruff/Docstyle aligned over time.

♻️ Proposed refactor
+QUALITY_PATHS := src tests script lsc_agent_eval
+
 black-check:
-	uv run black src tests script lsc_agent_eval --check
+	uv run black $(QUALITY_PATHS) --check

 black-format:
-	uv run black src tests script lsc_agent_eval
+	uv run black $(QUALITY_PATHS)

 docstyle:
-	uv run pydocstyle -v src tests script lsc_agent_eval
+	uv run pydocstyle -v $(QUALITY_PATHS)

 ruff:
-	uv run ruff check src tests script lsc_agent_eval
+	uv run ruff check $(QUALITY_PATHS)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 85, Create a central variable (e.g., QUALITY_PATHS or
QC_PATHS) in the Makefile that holds the shared path list "src tests script
lsc_agent_eval" and replace the repeated literal lists in the targets that run
ruff, black, and docstyle (the lines invoking "uv run ruff check ..." and the
similar commands at Lines ~45, ~48, ~82) to reference this variable; update the
targets that call "uv run ruff check", the black invocation, and the docstyle
invocation to use $(QUALITY_PATHS) so the paths stay synchronized.
tests/unit/core/metrics/custom/test_tool_eval.py (1)

456-479: Add a regression test for empty expected patterns in partial mode.

The new detailed-stat tests are great, but there’s still no guard against full_match=False mistakenly passing when expected is empty and actual has tools. Adding that case will prevent future regressions.

🧪 Suggested test case
+    def test_partial_match_empty_expected_fails_when_actual_has_tools(self) -> None:
+        """Empty expected pattern should not pass when actual tool calls exist."""
+        expected = [[]]
+        actual = [[{"tool_name": "tool1", "arguments": {}}]]
+
+        success, details = evaluate_tool_calls(expected, actual, full_match=False)
+
+        assert success is False
+        assert "didn't match any of the" in details or "0/0 expected matched" in details
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/core/metrics/custom/test_tool_eval.py` around lines 456 - 479, The
test suite is missing a regression test ensuring
evaluate_tool_calls(full_match=False) fails when expected is empty but actual
contains tool calls; add a new unit test in
tests/unit/core/metrics/custom/test_tool_eval.py that calls evaluate_tool_calls
with expected = [] and actual containing one or more tool entries (e.g.,
[{"tool_name":"toolX","arguments":{}}]), assert the returned success is False,
and that details mention zero matched/expected and list the extra tools (toolX)
and keywords like "partial" and "ordered" to mirror the existing detailed-stat
checks; place it alongside test_partial_match_failure_reports_detailed_stats and
reference evaluate_tool_calls to locate behavior under test.
.github/workflows/e2e_tests.yaml (1)

18-18: Prefer pinning the Lightspeed Stack image to an immutable digest.

Using :latest makes E2E results non-reproducible across runs. Pin a versioned tag or @sha256 digest to stabilize CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e_tests.yaml at line 18, Replace the mutable image
reference in the lsc_image_path array (currently
"quay.io/lightspeed-core/lightspeed-stack:latest") with an immutable reference —
either a specific versioned tag or preferably an `@sha256` digest — so CI uses a
stable, reproducible image; update the value inside lsc_image_path accordingly
and verify the digest/tag is the intended release.
tests/integration/system-config-query.yaml (1)

4-132: Consider extracting a shared base config to prevent query/streaming drift.

These two integration configs are almost identical; keeping them duplicated will make future edits error-prone. A shared base + small mode override would be easier to maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/system-config-query.yaml` around lines 4 - 132, Extract a
shared base YAML for the common settings (llm, embedding, output,
metrics_metadata, visualization, logging, environment, core) and have the two
integration configs inherit/override only the differing API fields (primarily
api.endpoint_type and any streaming-specific fields like
time_to_first_token/streaming_duration columns); update references to
api.provider, api.model, api.timeout, api.cache_dir/cache_enabled and
api.api_base in the derived configs so they no longer duplicate the full blocks,
and ensure the new base includes the shared csv_columns and
summary_config_sections so maintenance edits affect both query and streaming
modes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/e2e_tests.yaml:
- Line 4: The workflow uses on: [pull_request_target] and checks out PR head
code while exposing OPENAI_API_KEY to untrusted PRs; change the trigger to
pull_request (or limit pull_request_target steps to only non-PR-code actions),
ensure any steps that run PR code use actions/checkout@v3 without secrets and do
not pass OPENAI_API_KEY, and move secret-requiring steps into a separate job
that runs under pull_request_target only for maintainers (or use a gated
merge/dispatch) so that OPENAI_API_KEY is never available to untrusted PR head
code.

In `@src/lightspeed_evaluation/__init__.py`:
- Around line 13-15: Remove the "# ruff: noqa: F401" and stop importing symbols
directly under TYPE_CHECKING; instead, import the module as an alias and bind
the names from that alias so linters don't see unused imports. Concretely, in
the TYPE_CHECKING block replace "from lightspeed_evaluation.api import evaluate,
evaluate_conversation, evaluate_turn" with an import of the module (e.g., "from
lightspeed_evaluation import api as _api") and then assign the names from _api
(e.g., "evaluate = _api.evaluate", "evaluate_conversation =
_api.evaluate_conversation", "evaluate_turn = _api.evaluate_turn") so evaluate,
evaluate_conversation, and evaluate_turn are referenced without using "# noqa".

In `@src/lightspeed_evaluation/core/api/client.py`:
- Around line 121-137: The cache lookup is missing conversation scope which
causes identical queries from different conversations to return wrong cached
responses; update the cache key generation used by _get_cached_response and
_add_response_to_cache (and any helper inside _prepare_request if it composes
keys) to include conversation_id (and attachments if relevant) so the key is
unique per conversation + query; adjust _get_cached_response to compute/accept
the new composite key and _add_response_to_cache to store under that same
composite key so cached responses are scoped correctly.
- Around line 88-90: The client is currently initialized with a hardcoded TLS
verification flag; update the client initialization to use the APIConfig values
instead: replace the hardcoded verify usage with self.config.ssl_verify and, if
self.config.ssl_cert_file is set, pass it as the client's cert parameter (or
equivalent) so TLS behavior is governed by APIConfig; look for the client
construction where base_url, verify (or verify=verify) and timeout are set and
change it to reference self.config.ssl_verify and self.config.ssl_cert_file.

In `@src/lightspeed_evaluation/core/llm/custom.py`:
- Around line 182-195: In the finally block that calls
self._track_tokens(response) (inside the method surrounding lines ~182), wrap
that call in its own try/except to ensure token-tracking errors do not suppress
or alter the original return/exception; catch Exception and log or ignore
safely. In _track_tokens, remove the "# pylint: disable=protected-access" and
stop reading the private _hidden_params; instead detect cache-hits from public
response metadata (e.g., response.headers or response.get("headers", {})) by
checking for Litellm's documented cache indicator such as 'x-litellm-cache-key'
(use getattr(response, "headers", {}) safely), and guard all attribute access
with getattr/dict.get to avoid AttributeError; keep using
TokenTracker.get_active() and only proceed to record tokens when tracker exists
and response is not None. Ensure no use of private fields and no pylint disables
remain.

In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 726-733: The handler currently silences non-list rubrics by
setting data["rubrics"] = None; instead, when raw_rubrics is present but not a
list you should fail fast: check raw_rubrics after raw.get("rubrics") and if it
is not a list raise a clear exception (e.g., TypeError or ValueError) indicating
"rubrics must be a list", otherwise proceed to build data["rubrics"] via
GEvalRubricConfig.model_validate and return cls.model_validate(data); update the
code paths around raw_rubrics, data["rubrics"], GEvalRubricConfig.model_validate
and cls.model_validate to enforce this.

In `@src/lightspeed_evaluation/runner/evaluation.py`:
- Around line 139-141: The printed message "⚙️ Initializing Evaluation
Pipeline..." is stale because the flow now calls the API-level evaluate(...);
update the print in evaluation.py (the print call near the top of the
run/evaluation entry) to reflect the new behavior—for example, change to "⚙️
Calling evaluate(...) API..." or "⚙️ Starting evaluation via evaluate(...)"—so
that the message accurately describes that evaluate(...) is being invoked rather
than an evaluation pipeline being initialized.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 17f56f64-c136-45b0-b813-414e0c03942e

📥 Commits

Reviewing files that changed from the base of the PR and between fd0ed02 and d69c507.

📒 Files selected for processing (40)
  • .github/workflows/e2e_tests.yaml
  • .gitignore
  • Makefile
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • docs/configuration.md
  • pyproject.toml
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/integration/lightspeed-stack.yaml
  • tests/integration/run.yaml
  • tests/integration/system-config-query.yaml
  • tests/integration/system-config-streaming.yaml
  • tests/integration/test_evaluation_data.yaml
  • tests/script/test_compare_evaluations.py
  • tests/unit/core/api/conftest.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/llm/test_custom.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/metrics/test_geval.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_system.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_evaluation_data.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/metrics/custom/test_custom.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)

30-52: ⚠️ Potential issue | 🟠 Major

Failure detail should use the best failed alternative, not the last one.

Line 35 overwrites last_result each iteration, so Line 51 can report stats from a weaker later pattern and hide the closest mismatch.

Suggested fix
-        last_result: dict[str, Any] = {}
+        best_failure_stats: dict[str, Any] | None = None
         for i, expected_set in enumerate(expected):
             result = compare_tool_calls(
                 expected_set, actual, ordered=ordered, full_match=full_match
             )
-            last_result = result
+            if not result["success"] and not full_match:
+                stats = result.get("stats")
+                if stats is not None:
+                    candidate = (
+                        stats.get("matched", 0),
+                        -stats.get("unmatched", 0),
+                        -len(stats.get("extra_actual_tools", [])),
+                    )
+                    if best_failure_stats is None:
+                        best_failure_stats = stats
+                    else:
+                        current_best = (
+                            best_failure_stats.get("matched", 0),
+                            -best_failure_stats.get("unmatched", 0),
+                            -len(best_failure_stats.get("extra_actual_tools", [])),
+                        )
+                        if candidate > current_best:
+                            best_failure_stats = stats
             if result["success"]:
                 return _create_success_message(
@@
-            last_stats=last_result.get("stats"),
+            last_stats=best_failure_stats,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` around lines 30 -
52, The loop in evaluate logic currently overwrites last_result each iteration,
causing the final failure message to use the last tried pattern rather than the
closest match; change this by keeping a best_failed_result instead of
last_result and update it only when compare_tool_calls (the call to
compare_tool_calls(expected_set, actual, ...)) returns a failure that is
"closer" than the current best (compare using a concrete metric from
result.get("stats") such as higher matched count, lower mismatch/count of
missing keys, or lower distance — pick the existing sensible stat keys and
handle missing stats defensively). After the loop, pass
best_failed_result.get("stats") into _create_failure_message (instead of
last_result) so the failure details reflect the best alternative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py`:
- Around line 467-471: The current branch prints "No tool calls made" whenever
expected_set is empty even if match_stats exists (e.g., in partial mode); change
the logic so match_stats takes precedence: update the condition that sets
message to "No tool calls made" to only run when expected_set is empty AND
match_stats is falsy (e.g., if len(expected_set) == 0 and not match_stats), or
alternatively reorder the branches so the match_stats branch (the one calling
_format_match_stats) is evaluated before the empty expected_set check; refer to
the variables expected_set, match_stats and the _format_match_stats call to
locate the code.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4f8724ea-104c-47f7-bc94-c7db0f89d153

📥 Commits

Reviewing files that changed from the base of the PR and between d69c507 and 0b70a0d.

📒 Files selected for processing (3)
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_tool_eval.py

…tool information

- Add reporting of extra tool names (not just count) in partial match results
- Include detailed statistics in failure messages for partial match scenarios
- Refactor message formatting with reusable helper functions
- Update existing test assertions to match new message format

Example output:
  Success: "2/2 expected matched, 0 unmatched, 1 extra in response: [oc_logs]"
  Failure: "0/2 expected matched, 2 unmatched, 3 extra in response: [tool_a, tool_b]"

Made-with: Cursor
Member

@VladimirKadlec VladimirKadlec left a comment


LGTM, thank you.

Collaborator

@asamal4 asamal4 left a comment


Thank you!! This improves things a lot. However, we can improve it further;
up to you whether you want to handle that in the current PR or a separate one.

result = compare_tool_calls(
expected_set, actual, ordered=ordered, full_match=full_match
)
last_result = result
Collaborator


last_result is always replaced, so we effectively provide details only for the last successful run, or, in case of failure, for the last alternative checked. So this is not detailed reasoning; we are missing information about the previous alternatives.

message = (
f"Tool calls match expected structure and arguments "
f"({match_mode}, {order_mode})"
f"{_get_mode_suffix(ordered, full_match)}"
Collaborator


FYI, this is not required, as there is a metadata column that will carry this info. But I am okay with keeping it.

success, details = evaluate_tool_calls(expected, actual, full_match=False)
assert success is False
assert "1/2 expected matched" in details
assert "1 unmatched" in details
Collaborator


Are we not reporting which of the expected tool calls didn't match? The test case shows that we only report the names of the unmatched tool calls from the actual response.

if not expected:
return (0, 0)
extra_tools = [_get_sequence_tool_names(seq) for seq in actual]
return (0, 0, extra_tools)
Collaborator


Are we reporting 0/0 matched? We probably need a better message for an empty alternative.
