[LEADS-218] Improve partial tool match reporting with detailed extra tool information #181
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main
Conversation
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
Walkthrough
Tool evaluation comparison was extended to return and propagate richer match statistics for partial matches (including extra actual tools), new helper functions were added to format mode and stats, and tests were updated to expect the revised messages and details.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
d83ab89 to d69c507
Actionable comments posted: 7
🧹 Nitpick comments (5)
tests/script/test_compare_evaluations.py (1)
194-237: Consider adding defensive checks for statistical test key existence.

Lines 236-237 directly access `result["tests"]["t_test"]["significant"]` and `result["tests"]["mann_whitney_u"]["significant"]` without verifying the keys exist. Other tests in this file (e.g., lines 164-172, 384-391) use conditional checks like `if "t_test" in result["tests"]:` before accessing results. With only 5 samples per group, there's a possibility (depending on implementation) that statistical tests might not be performed or might be skipped, which would cause a `KeyError`.

🔧 Suggested defensive pattern (consistent with other tests in file)
```diff
 # Verify statistical tests show non-significance for this tiny shift
-assert not result["tests"]["t_test"]["significant"]
-assert not result["tests"]["mann_whitney_u"]["significant"]
+assert "t_test" in result["tests"], "Expected t_test to be performed"
+assert not result["tests"]["t_test"]["significant"]
+assert "mann_whitney_u" in result["tests"], "Expected mann_whitney_u to be performed"
+assert not result["tests"]["mann_whitney_u"]["significant"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/script/test_compare_evaluations.py` around lines 194 - 237, The test test_compare_score_distributions_precise_delta should defensively check for the presence of statistical test results before asserting on their "significant" field; update the assertions that read result["tests"]["t_test"]["significant"] and result["tests"]["mann_whitney_u"]["significant"] to first verify the keys exist in result["tests"] (e.g., if "t_test" in result["tests"]:) and only then assert False (or equivalent) on the "significant" value, matching the pattern used elsewhere in the file.

Makefile (1)
85-85: Consider centralizing quality-check paths to prevent drift.

Line 85 repeats the same path list used in Lines 45 and 48 (and Line 82). A shared variable will keep Black/Ruff/Docstyle aligned over time.
♻️ Proposed refactor
```diff
+QUALITY_PATHS := src tests script lsc_agent_eval
+
 black-check:
-	uv run black src tests script lsc_agent_eval --check
+	uv run black $(QUALITY_PATHS) --check
 black-format:
-	uv run black src tests script lsc_agent_eval
+	uv run black $(QUALITY_PATHS)
 docstyle:
-	uv run pydocstyle -v src tests script lsc_agent_eval
+	uv run pydocstyle -v $(QUALITY_PATHS)
 ruff:
-	uv run ruff check src tests script lsc_agent_eval
+	uv run ruff check $(QUALITY_PATHS)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` at line 85, Create a central variable (e.g., QUALITY_PATHS or QC_PATHS) in the Makefile that holds the shared path list "src tests script lsc_agent_eval" and replace the repeated literal lists in the targets that run ruff, black, and docstyle (the lines invoking "uv run ruff check ..." and the similar commands at Lines ~45, ~48, ~82) to reference this variable; update the targets that call "uv run ruff check", the black invocation, and the docstyle invocation to use $(QUALITY_PATHS) so the paths stay synchronized.

tests/unit/core/metrics/custom/test_tool_eval.py (1)
456-479: Add a regression test for empty expected patterns in partial mode.

The new detailed-stat tests are great, but there's still no guard against `full_match=False` mistakenly passing when expected is empty and actual has tools. Adding that case will prevent future regressions.

🧪 Suggested test case
```diff
+    def test_partial_match_empty_expected_fails_when_actual_has_tools(self) -> None:
+        """Empty expected pattern should not pass when actual tool calls exist."""
+        expected = [[]]
+        actual = [[{"tool_name": "tool1", "arguments": {}}]]
+
+        success, details = evaluate_tool_calls(expected, actual, full_match=False)
+
+        assert success is False
+        assert "didn't match any of the" in details or "0/0 expected matched" in details
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/metrics/custom/test_tool_eval.py` around lines 456 - 479, The test suite is missing a regression test ensuring evaluate_tool_calls(full_match=False) fails when expected is empty but actual contains tool calls; add a new unit test in tests/unit/core/metrics/custom/test_tool_eval.py that calls evaluate_tool_calls with expected = [] and actual containing one or more tool entries (e.g., [{"tool_name":"toolX","arguments":{}}]), assert the returned success is False, and that details mention zero matched/expected and list the extra tools (toolX) and keywords like "partial" and "ordered" to mirror the existing detailed-stat checks; place it alongside test_partial_match_failure_reports_detailed_stats and reference evaluate_tool_calls to locate behavior under test.

.github/workflows/e2e_tests.yaml (1)
18-18: Prefer pinning the Lightspeed Stack image to an immutable digest.

Using `:latest` makes E2E results non-reproducible across runs. Pin a versioned tag or `@sha256` digest to stabilize CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/e2e_tests.yaml at line 18, Replace the mutable image reference in the lsc_image_path array (currently "quay.io/lightspeed-core/lightspeed-stack:latest") with an immutable reference — either a specific versioned tag or preferably an `@sha256` digest — so CI uses a stable, reproducible image; update the value inside lsc_image_path accordingly and verify the digest/tag is the intended release.

tests/integration/system-config-query.yaml (1)
4-132: Consider extracting a shared base config to prevent query/streaming drift.

These two integration configs are almost identical; keeping them duplicated will make future edits error-prone. A shared base + small mode override would be easier to maintain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/system-config-query.yaml` around lines 4 - 132, Extract a shared base YAML for the common settings (llm, embedding, output, metrics_metadata, visualization, logging, environment, core) and have the two integration configs inherit/override only the differing API fields (primarily api.endpoint_type and any streaming-specific fields like time_to_first_token/streaming_duration columns); update references to api.provider, api.model, api.timeout, api.cache_dir/cache_enabled and api.api_base in the derived configs so they no longer duplicate the full blocks, and ensure the new base includes the shared csv_columns and summary_config_sections so maintenance edits affect both query and streaming modes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e_tests.yaml:
- Line 4: The workflow uses on: [pull_request_target] and checks out PR head
code while exposing OPENAI_API_KEY to untrusted PRs; change the trigger to
pull_request (or limit pull_request_target steps to only non-PR-code actions),
ensure any steps that run PR code use actions/checkout@v3 without secrets and do
not pass OPENAI_API_KEY, and move secret-requiring steps into a separate job
that runs under pull_request_target only for maintainers (or use a gated
merge/dispatch) so that OPENAI_API_KEY is never available to untrusted PR head
code.
In `@src/lightspeed_evaluation/__init__.py`:
- Around line 13-15: Remove the "# ruff: noqa: F401" and stop importing symbols
directly under TYPE_CHECKING; instead, import the module as an alias and bind
the names from that alias so linters don't see unused imports. Concretely, in
the TYPE_CHECKING block replace "from lightspeed_evaluation.api import evaluate,
evaluate_conversation, evaluate_turn" with an import of the module (e.g., "from
lightspeed_evaluation import api as _api") and then assign the names from _api
(e.g., "evaluate = _api.evaluate", "evaluate_conversation =
_api.evaluate_conversation", "evaluate_turn = _api.evaluate_turn") so evaluate,
evaluate_conversation, and evaluate_turn are referenced without using "# noqa".
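For illustration, the alias pattern described in the prompt above might look like this (a sketch only; the `lightspeed_evaluation.api` module path and symbol names are taken from the prompt, not verified against the repo):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import the module under an alias and bind names from it, so linters
    # see the imports as used without needing a "# ruff: noqa: F401" escape hatch.
    from lightspeed_evaluation import api as _api

    evaluate = _api.evaluate
    evaluate_conversation = _api.evaluate_conversation
    evaluate_turn = _api.evaluate_turn
```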
In `@src/lightspeed_evaluation/core/api/client.py`:
- Around line 121-137: The cache lookup is missing conversation scope which
causes identical queries from different conversations to return wrong cached
responses; update the cache key generation used by _get_cached_response and
_add_response_to_cache (and any helper inside _prepare_request if it composes
keys) to include conversation_id (and attachments if relevant) so the key is
unique per conversation + query; adjust _get_cached_response to compute/accept
the new composite key and _add_response_to_cache to store under that same
composite key so cached responses are scoped correctly.
- Around line 88-90: The client is currently initialized with a hardcoded TLS
verification flag; update the client initialization to use the APIConfig values
instead: replace the hardcoded verify usage with self.config.ssl_verify and, if
self.config.ssl_cert_file is set, pass it as the client's cert parameter (or
equivalent) so TLS behavior is governed by APIConfig; look for the client
construction where base_url, verify (or verify=verify) and timeout are set and
change it to reference self.config.ssl_verify and self.config.ssl_cert_file.
In `@src/lightspeed_evaluation/core/llm/custom.py`:
- Around line 182-195: In the finally block that calls
self._track_tokens(response) (inside the method surrounding lines ~182), wrap
that call in its own try/except to ensure token-tracking errors do not suppress
or alter the original return/exception; catch Exception and log or ignore
safely. In _track_tokens, remove the "# pylint: disable=protected-access" and
stop reading the private _hidden_params; instead detect cache-hits from public
response metadata (e.g., response.headers or response.get("headers", {})) by
checking for Litellm's documented cache indicator such as 'x-litellm-cache-key'
(use getattr(response, "headers", {}) safely), and guard all attribute access
with getattr/dict.get to avoid AttributeError; keep using
TokenTracker.get_active() and only proceed to record tokens when tracker exists
and response is not None. Ensure no use of private fields and no pylint disables
remain.
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 726-733: The handler currently silences non-list rubrics by
setting data["rubrics"] = None; instead, when raw_rubrics is present but not a
list you should fail fast: check raw_rubrics after raw.get("rubrics") and if it
is not a list raise a clear exception (e.g., TypeError or ValueError) indicating
"rubrics must be a list", otherwise proceed to build data["rubrics"] via
GEvalRubricConfig.model_validate and return cls.model_validate(data); update the
code paths around raw_rubrics, data["rubrics"], GEvalRubricConfig.model_validate
and cls.model_validate to enforce this.
In `@src/lightspeed_evaluation/runner/evaluation.py`:
- Around line 139-141: The printed message "⚙️ Initializing Evaluation
Pipeline..." is stale because the flow now calls the API-level evaluate(...);
update the print in evaluation.py (the print call near the top of the
run/evaluation entry) to reflect the new behavior—for example, change to "⚙️
Calling evaluate(...) API..." or "⚙️ Starting evaluation via evaluate(...)"—so
that the message accurately describes that evaluate(...) is being invoked rather
than an evaluation pipeline being initialized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 17f56f64-c136-45b0-b813-414e0c03942e
📒 Files selected for processing (40)
- .github/workflows/e2e_tests.yaml
- .gitignore
- Makefile
- README.md
- config/system.yaml
- docs/EVALUATION_GUIDE.md
- docs/configuration.md
- pyproject.toml
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/api.py
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/llm/custom.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/loader.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/integration/lightspeed-stack.yaml
- tests/integration/run.yaml
- tests/integration/system-config-query.yaml
- tests/integration/system-config-streaming.yaml
- tests/integration/test_evaluation_data.yaml
- tests/script/test_compare_evaluations.py
- tests/unit/core/api/conftest.py
- tests/unit/core/api/test_client.py
- tests/unit/core/llm/test_custom.py
- tests/unit/core/metrics/custom/test_custom.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/test_geval.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/test_system.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/system/test_validator.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/runner/test_evaluation.py
- tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_evaluation_data.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/core/metrics/custom/test_custom.py
d69c507 to 0b70a0d
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
30-52: ⚠️ Potential issue | 🟠 Major: Failure detail should use the best failed alternative, not the last one.

Line 35 overwrites `last_result` each iteration, so Line 51 can report stats from a weaker later pattern and hide the closest mismatch.

Suggested fix
```diff
-    last_result: dict[str, Any] = {}
+    best_failure_stats: dict[str, Any] | None = None
     for i, expected_set in enumerate(expected):
         result = compare_tool_calls(
             expected_set, actual, ordered=ordered, full_match=full_match
         )
-        last_result = result
+        if not result["success"] and not full_match:
+            stats = result.get("stats")
+            if stats is not None:
+                candidate = (
+                    stats.get("matched", 0),
+                    -stats.get("unmatched", 0),
+                    -len(stats.get("extra_actual_tools", [])),
+                )
+                if best_failure_stats is None:
+                    best_failure_stats = stats
+                else:
+                    current_best = (
+                        best_failure_stats.get("matched", 0),
+                        -best_failure_stats.get("unmatched", 0),
+                        -len(best_failure_stats.get("extra_actual_tools", [])),
+                    )
+                    if candidate > current_best:
+                        best_failure_stats = stats
         if result["success"]:
             return _create_success_message(
@@
-        last_stats=last_result.get("stats"),
+        last_stats=best_failure_stats,
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` around lines 30 - 52, The loop in evaluate logic currently overwrites last_result each iteration, causing the final failure message to use the last tried pattern rather than the closest match; change this by keeping a best_failed_result instead of last_result and update it only when compare_tool_calls (the call to compare_tool_calls(expected_set, actual, ...)) returns a failure that is "closer" than the current best (compare using a concrete metric from result.get("stats") such as higher matched count, lower mismatch/count of missing keys, or lower distance — pick the existing sensible stat keys and handle missing stats defensively). After the loop, pass best_failed_result.get("stats") into _create_failure_message (instead of last_result) so the failure details reflect the best alternative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/tool_eval.py`:
- Around line 467-471: The current branch prints "No tool calls made" whenever
expected_set is empty even if match_stats exists (e.g., in partial mode); change
the logic so match_stats takes precedence: update the condition that sets
message to "No tool calls made" to only run when expected_set is empty AND
match_stats is falsy (e.g., if len(expected_set) == 0 and not match_stats), or
alternatively reorder the branches so the match_stats branch (the one calling
_format_match_stats) is evaluated before the empty expected_set check; refer to
the variables expected_set, match_stats and the _format_match_stats call to
locate the code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4f8724ea-104c-47f7-bc94-c7db0f89d153
📒 Files selected for processing (3)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- tests/unit/core/metrics/custom/test_custom.py
- tests/unit/core/metrics/custom/test_tool_eval.py
…tool information

- Add reporting of extra tool names (not just count) in partial match results
- Include detailed statistics in failure messages for partial match scenarios
- Refactor message formatting with reusable helper functions
- Update existing test assertions to match new message format

Example output:
Success: "2/2 expected matched, 0 unmatched, 1 extra in response: [oc_logs]"
Failure: "0/2 expected matched, 2 unmatched, 3 extra in response: [tool_a, tool_b]"

Made-with: Cursor
0b70a0d to ebcfc8e
asamal4 left a comment
Thank you!! This improves a lot. However, we can improve further.
Up to you if you want to handle this in the current PR or a different one.
```python
result = compare_tool_calls(
    expected_set, actual, ordered=ordered, full_match=full_match
)
last_result = result
```
`last_result` is always replaced, so we effectively provide details for the last successful run, or in case of failure, the last alternative checked. This is not detailed reasoning; we are missing information about the previous alternatives.
```python
message = (
    f"Tool calls match expected structure and arguments "
    f"({match_mode}, {order_mode})"
    f"{_get_mode_suffix(ordered, full_match)}"
```
FYI, this is not required, as there is a metadata column which will have this info. But I am okay to keep it.
```python
success, details = evaluate_tool_calls(expected, actual, full_match=False)
assert success is False
assert "1/2 expected matched" in details
assert "1 unmatched" in details
```
Are we not reporting which of the expected tool calls didn't match? The test case shows that we only name the unmatched tools from the actual response, not the unmatched expected ones.
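If the stats also carried the unmatched expected names, the formatter could name both sides. A sketch (this helper only mirrors, and is not, the PR's actual `_format_match_stats`):

```python
def format_match_stats(
    matched: list[str],
    unmatched_expected: list[str],
    extra_actual: list[str],
) -> str:
    """Name unmatched *expected* tools, not just the extras in the response."""
    total = len(matched) + len(unmatched_expected)
    parts = [f"{len(matched)}/{total} expected matched"]
    if unmatched_expected:
        parts.append(
            f"{len(unmatched_expected)} unmatched: [{', '.join(unmatched_expected)}]"
        )
    if extra_actual:
        parts.append(
            f"{len(extra_actual)} extra in response: [{', '.join(extra_actual)}]"
        )
    return ", ".join(parts)
```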
```python
if not expected:
    return (0, 0)
extra_tools = [_get_sequence_tool_names(seq) for seq in actual]
return (0, 0, extra_tools)
```
Are we providing "0/0 matched"? We probably need a better message for an empty alternative.
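A clearer message for the empty-expected alternative might look like this (illustrative only; wording and function name are not from the PR):

```python
def describe_empty_expected(actual_tool_names: list[str]) -> str:
    """Avoid a confusing '0/0 matched' when an alternative expects no tools."""
    if not actual_tool_names:
        return "No tool calls expected and none made"
    return (
        f"No tool calls expected, but {len(actual_tool_names)} made: "
        f"[{', '.join(actual_tool_names)}]"
    )
```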
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Improvements
Tests