LCORE-1206: add tests for too long question #1232
radofuchs wants to merge 1 commit into lightspeed-core:main from
Conversation
Walkthrough
Unregisters/re-registers Llama Stack shields in server-mode E2E scenarios tagged `@disable-shields`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant Env as environment.py hooks
    participant Shields as llama_stack_shields
    participant API as Llama Stack API
    participant Docker as Docker Engine
    Runner->>Env: start scenario (tag `@disable-shields`)
    Env->>Shields: unregister_shield("llama-guard")
    Shields->>API: list/delete shield
    API-->>Shields: provider IDs or 404/400
    Shields-->>Env: return provider IDs (store on context)
    Runner->>API: execute test (query / streaming)
    alt streaming SSE contains "error" event
        API-->>Runner: SSE stream with error event
        Runner->>Steps: parse stream, attach stream_error
    end
    alt restoration health-check fails or command error
        Env->>Docker: inspect containers / health / logs
        Docker-->>Env: container state + last logs
        Env-->>Runner: print diagnostics
    end
    Runner->>Env: end scenario
    Env->>Shields: register_shield(shield_id, provider_ids)
    Shields->>API: shields.register
    API-->>Shields: success or error
    Shields-->>Env: result
```
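The "parse stream, attach stream_error" step in the diagram scans the SSE body for an error event. A minimal sketch of that kind of parsing, assuming a common `data: {json}` SSE layout rather than this API's documented wire format (the function name mirrors the `_parse_streaming_response` helper the review mentions, but this is not the repo's actual implementation):

```python
import json

def parse_streaming_response(body: str) -> dict:
    """Scan an SSE body for `data:` payloads and surface any error event.

    Assumes each event is serialized as `data: {json}` on its own line;
    the "event"/"data"/"message" keys are illustrative, not a contract.
    """
    events = []
    stream_error = None
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alives, event-name lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # tolerate partial or non-JSON chunks
        events.append(event)
        if event.get("event") == "error":
            stream_error = event.get("data", {}).get("message")
    return {"events": events, "stream_error": stream_error}
```

A step could then assert on `result["stream_error"]` instead of printing the whole body.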
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/e2e/features/steps/llm_query_response.py (1)
193-193: Remove raw streamed body dump from assertion step.
Line 193 prints full streamed content on every run. This can create noisy logs and expose payload content unnecessarily in CI output.
🔧 Suggested fix
```diff
 @then("The streamed response contains error message {message}")
 def check_streamed_response_error_message(context: Context, message: str) -> None:
 @@
     assert context.response is not None, "Request needs to be performed first"
-    print(context.response.text)
     parsed = _parse_streaming_response(context.response.text)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/steps/llm_query_response.py` at line 193, remove the raw dump call print(context.response.text) from the assertion step; either delete it or replace it with a non-sensitive debug/log call that only emits a short, masked summary (e.g., status and length) behind a verbosity flag or logger.debug, ensuring the assertion logic in the step handling LLM responses remains unchanged.

tests/e2e/features/query.feature (1)
220-225: Consider validating token metrics in the new long-query scenario.
Lines 220-225 check status/body only. Adding token metric assertions here would strengthen coverage for quota-related regressions on too-long query failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/query.feature` around lines 220 - 225, add assertions to the "Check if query with shields returns 413 when question is too long for model context" scenario to validate token metrics in the too-long-query response; after the step "When I use \"query\" to ask question with too-long query and authorization header" and the existing status/body checks, assert that the response includes token-related fields (e.g., token_count/token_usage or relevant headers) and that those values reflect the query exceeded the model/context limit (e.g., token_count > model_limit or a flagged usage field). Update the scenario's expectations so the step verifying the 413 and "Prompt is too long" message also checks the presence and correctness of these token metric indicators returned by the query handler.

tests/e2e/features/streaming_query.feature (1)
182-195: Add explicit token-metrics assertions to new long-query streaming scenarios.
Line 182 and Line 190 validate status/error behavior, but they don't assert token metric behavior. Please add metric capture/assertions so these paths also guard against quota-accounting regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/streaming_query.feature` around lines 182 - 195, Add explicit token-metrics capture and assertions around the two long-query streaming scenarios that use the "streaming_query" step (the scenario with shields that returns 413 and the `@disable-shields` scenario that returns 200). Before invoking the "streaming_query" step, record current token-related metrics (e.g., prompt_tokens, completion_tokens, tokens_consumed) from the same metrics endpoint or helper your test suite uses (call the existing getMetrics / metrics helper), then call "streaming_query" and re-fetch metrics and assert the deltas match expected behavior for each path (ensure the 413 path still records prompt token accounting if expected, and the `@disable-shields` path records streamed error token usage); add these metric assertions into the scenarios in tests/e2e/features/streaming_query.feature adjacent to the existing status/body checks.
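The metric-delta pattern these prompts describe (snapshot, act, re-fetch, compare) can be factored into one small helper. This is a sketch; `get_metrics` stands in for whatever metrics endpoint helper the suite actually uses, and the metric names are illustrative:

```python
def assert_token_metric_deltas(get_metrics, action, expected):
    """Capture token counters before/after `action` and check the deltas.

    `get_metrics` is a caller-supplied callable returning a dict of counters
    (e.g. {"prompt_tokens": ..., "completion_tokens": ...}); `expected` maps
    each metric name to the delta that the path under test should produce.
    """
    before = get_metrics()
    action()
    after = get_metrics()
    for name, want in expected.items():
        got = after.get(name, 0) - before.get(name, 0)
        assert got == want, f"{name}: expected delta {want}, got {got}"
```

In a step definition, `action` would be the "streaming_query" call, and `expected` would encode whether the 413 path should still record prompt-token accounting.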
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 176-190: When restoring shields after a `@disable-shields` scenario,
don't unconditionally call register_shield with defaults if unregister_shield
returned no previous provider; change the restore logic to check the saved tuple
from unregister_shield (the values stored on context.llama_guard_provider_id and
context.llama_guard_provider_shield_id) and only call register_shield to
re-create the shield when saved is truthy (i.e., the shield existed before). If
saved is falsy/None, skip register_shield so you don't create a new default
shield and thus preserve scenario isolation; update any code paths that assume
register_shield will always run to use this presence check instead.
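The presence check described above can be isolated in a small restore helper. A sketch under the assumption that the saved provider IDs live on the behave context under the names the review uses; `register_shield` is injected only to keep the example self-contained:

```python
def restore_shield(context, register_shield) -> None:
    """Re-create the shield only if unregister previously saved its identity.

    `context` mimics behave's Context; if nothing was saved, skipping the
    call avoids creating a default shield that did not exist before the
    scenario, preserving scenario isolation.
    """
    provider_id = getattr(context, "llama_guard_provider_id", None)
    provider_shield_id = getattr(context, "llama_guard_provider_shield_id", None)
    if provider_id is None or provider_shield_id is None:
        return  # shield was absent before the scenario; keep it absent
    register_shield(
        "llama-guard",
        provider_id=provider_id,
        provider_shield_id=provider_shield_id,
    )
```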
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tests/e2e/features/environment.py
- tests/e2e/features/query.feature
- tests/e2e/features/steps/llm_query_response.py
- tests/e2e/features/streaming_query.feature
- tests/e2e/utils/llama_stack_shields.py
♻️ Duplicate comments (1)
tests/e2e/features/environment.py (1)
176-195: ⚠️ Potential issue | 🟠 Major
Reset shield restore state before unregister to prevent stale re-registration.
Line 187 can throw before context values are overwritten, and Line 247/Line 248 then read whatever was left from a previous scenario. That can trigger register_shield(...) for a scenario that never successfully unregistered a shield.
🔧 Suggested fix
```diff
 def before_scenario(context: Context, scenario: Scenario) -> None:
 @@
     if "disable-shields" in scenario.effective_tags:
+        context.llama_guard_restore_required = False
+        context.llama_guard_provider_id = None
+        context.llama_guard_provider_shield_id = None
+
         if context.is_library_mode:
             scenario.skip(
                 "Shield unregister/register only applies in server mode (Llama Stack as a "
                 "separate service). In library mode the app's shields cannot be disabled from e2e."
             )
             return
         try:
             saved = unregister_shield("llama-guard")
-            context.llama_guard_provider_id = saved[0] if saved else None
-            context.llama_guard_provider_shield_id = saved[1] if saved else None
-            print("Unregistered shield llama-guard for this scenario")
+            if saved:
+                context.llama_guard_provider_id = saved[0]
+                context.llama_guard_provider_shield_id = saved[1]
+                context.llama_guard_restore_required = True
+                print("Unregistered shield llama-guard for this scenario")
+            else:
+                print("Shield llama-guard was not registered; nothing to restore")
         except Exception as e:  # pylint: disable=broad-exception-caught
             scenario.skip(
                 f"Could not unregister shield (is Llama Stack reachable?): {e}"
             )
             return
 @@
-    if "disable-shields" in scenario.effective_tags:
+    if (
+        "disable-shields" in scenario.effective_tags
+        and not context.is_library_mode
+        and getattr(context, "llama_guard_restore_required", False)
+    ):
         provider_id = getattr(context, "llama_guard_provider_id", None)
         provider_shield_id = getattr(context, "llama_guard_provider_shield_id", None)
-        if provider_id is not None and provider_shield_id is not None:
-            try:
-                register_shield(
-                    "llama-guard",
-                    provider_id=provider_id,
-                    provider_shield_id=provider_shield_id,
-                )
-                print("Re-registered shield llama-guard")
-            except Exception as e:  # pylint: disable=broad-exception-caught
-                print(f"Warning: Could not re-register shield: {e}")
+        try:
+            register_shield(
+                "llama-guard",
+                provider_id=provider_id,
+                provider_shield_id=provider_shield_id,
+            )
+            context.llama_guard_restore_required = False
+            print("Re-registered shield llama-guard")
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            print(f"Warning: Could not re-register shield: {e}")
```
Also applies to: 245-259
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/environment.py` around lines 176 - 195, Reset the shield-related context state before attempting to unregister so stale values from prior scenarios can't be read if unregister_shield("llama-guard") raises; specifically, set context.llama_guard_provider_id = None and context.llama_guard_provider_shield_id = None immediately before calling unregister_shield, then in the try branch assign the saved tuple into those fields (saved[0]/saved[1] or None) and in the except ensure they remain None; apply the same defensive-reset pattern to the corresponding restore/register code that reads these context fields (the register_shield/restore block).
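The defensive-reset pattern this prompt asks for can be sketched as a standalone hook body. The context field names and the injected `unregister_shield` mirror the names in the review, but this is a sketch, not the repo's actual hook:

```python
def begin_disable_shields(context, unregister_shield) -> None:
    """Clear saved shield state before unregistering, so a failure here
    can never leave stale provider IDs from an earlier scenario behind.

    `context` stands in for behave's Context; `unregister_shield` returns
    a (provider_id, provider_shield_id) tuple or None, or raises.
    """
    # Defensive reset: these must not survive from a previous scenario.
    context.llama_guard_provider_id = None
    context.llama_guard_provider_shield_id = None
    try:
        saved = unregister_shield("llama-guard")
    except Exception:  # noqa: BLE001 - mirror the hook's broad catch
        return  # fields stay None; the restore step becomes a no-op
    if saved:
        context.llama_guard_provider_id, context.llama_guard_provider_shield_id = saved
```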
5ddefe5 to 7935999
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e/features/environment.py (1)
309-323: ⚠️ Potential issue | 🟠 Major
Retry loop is bypassed on non-zero health check exit.
When curl exits with a non-zero status (indicating service unhealthy), line 322's check=True raises CalledProcessError. The inner exception handler at line 327 only catches TimeoutExpired, so CalledProcessError propagates to the outer handler at line 339, immediately exiting the retry loop instead of attempting the remaining retries.
Additionally, line 324's if result.returncode == 0: check is unreachable dead code when check=True is set, since non-zero exits raise an exception rather than returning.
🔧 Proposed retry-safe pattern
```diff
             result = subprocess.run(
                 [
                     "docker", "exec", "llama-stack",
                     "curl", "-f",
                     f"http://{context.hostname_llama}:{context.port_llama}/v1/health",
                 ],
                 capture_output=True,
                 timeout=5,
-                check=True,
+                check=False,
             )
             if result.returncode == 0:
                 print("✓ Llama Stack connection restored successfully")
                 break
+            print(
+                f"Health check failed on attempt {attempt + 1}/6 "
+                f"(exit={result.returncode})"
+            )
         except subprocess.TimeoutExpired:
             print(f"⏱ Health check timed out on attempt {attempt + 1}/6")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/environment.py` around lines 309 - 323, The retry loop in tests/e2e/features/environment.py currently uses subprocess.run(..., check=True) inside the "for attempt in range(6)" loop which causes CalledProcessError to escape the inner handler and short-circuit retries; also the "if result.returncode == 0" check is dead when check=True. Change the subprocess.run invocation to use check=False (or remove check=True), so failures return a CompletedProcess you can inspect; then update the inner exception handling to still catch subprocess.TimeoutExpired and after the call inspect result.returncode (use result.returncode == 0 to break/return success, otherwise continue the retry loop and log the failure). Ensure references to subprocess.run, result.returncode, TimeoutExpired (and optionally CalledProcessError if you choose to catch it) are updated accordingly.
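The retry-safe shape the fix aims for — treat a non-zero exit as data to inspect, not an exception that escapes the loop — can be shown with a generic helper. `check` abstracts the docker-exec curl probe so the sketch stays runnable:

```python
import time

def wait_until_healthy(check, attempts=6, delay=0.0):
    """Retry a health check without letting one failed attempt abort the loop.

    `check` returns a process-style exit code (0 = healthy) or raises
    TimeoutError; this mirrors running the curl probe with check=False so
    both timeouts and non-zero exits just advance to the next attempt.
    """
    for attempt in range(attempts):
        try:
            code = check()
        except TimeoutError:
            print(f"health check timed out on attempt {attempt + 1}/{attempts}")
            code = None
        if code == 0:
            return True  # service is back; stop retrying
        time.sleep(delay)
    return False
```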
🧹 Nitpick comments (2)
tests/e2e/features/steps/llm_query_response.py (1)
91-91: Remove debug print() calls from test step paths.
At Line 91 and Line 193, these prints add noisy logs and may expose full streamed payloads without improving assertions.
🧹 Proposed cleanup
```diff
-    print(f"Request: query length={len(long_query)}, model={context.default_model}")
     ask_question_authorized(context, endpoint)
 @@
-    print(context.response.text)
     parsed = _parse_streaming_response(context.response.text)
```
Also applies to: 193-193
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/steps/llm_query_response.py` at line 91, remove the debug print statements that leak full payloads and clutter test logs—specifically the print(f"Request: query length={len(long_query)}, model={context.default_model}") call in the llm_query_response step and the other print at the later step; delete these prints or replace them with a low-verbosity logger.debug call (e.g., using the test suite's logger) that does not print full payloads, keeping only minimal safe info if needed and ensuring no assertions rely on the print output.

tests/e2e/utils/llama_stack_shields.py (1)
55-58: Broaden error handling for shields.delete() responses to cover both 400 and 404 status codes.
At line 57, the code only tolerates status_code == 400 with "not found" text. However, shields.delete() can return either 400 or 404 depending on whether the server exposes the DELETE endpoint, and error message formatting varies across API versions. If the API returns 404 or uses different error text, the exception will be unnecessarily raised.
🔧 Recommended hardening
```diff
         except APIStatusError as e:
             # 400 "not found": shield already absent, scenario can proceed
-            if e.status_code == 400 and "not found" in str(e).lower():
+            err = str(e).lower()
+            if e.status_code in {400, 404} and (
+                "not found" in err or "does not exist" in err
+            ):
                 return None
             raise
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/utils/llama_stack_shields.py` around lines 55 - 58, The except block catching APIStatusError from shields.delete() is too strict: change the condition in the except APIStatusError as e handler so it tolerates both status_code 400 and 404 (e.g., check if e.status_code in (400, 404)) and keep the existing "not found" text check for 400 while allowing 404 to return None even if message differs; locate the except block handling APIStatusError around shields.delete() and update the condition that currently checks `e.status_code == 400 and "not found" in str(e).lower()`.
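The tolerant condition can be captured in a small predicate, which also keeps it testable on its own. The accepted message texts ("not found", "does not exist") are assumptions about the API's error wording, not a documented contract; per the prompt above, a 404 is treated as "already absent" regardless of message:

```python
def shield_already_absent(status_code: int, message: str) -> bool:
    """Decide whether a delete error means the shield was already gone.

    404 is taken as absent unconditionally; 400 counts as absent only when
    the error text says so, since 400 also covers genuine bad requests.
    """
    if status_code == 404:
        return True
    text = message.lower()
    return status_code == 400 and ("not found" in text or "does not exist" in text)
```

The except block would then read `if shield_already_absent(e.status_code, str(e)): return None`.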
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/features/environment.py`:
- Around line 186-195: Reset context.llama_guard_provider_id and
context.llama_guard_provider_shield_id to None before calling unregister_shield
in before_scenario so stale values can't be used later by the restore logic;
specifically, set those context fields to None immediately before the try that
calls unregister_shield (the block using unregister_shield and setting
context.llama_guard_provider_id / context.llama_guard_provider_shield_id), and
apply the same defensive reset in the symmetric block that handles restore (the
other before_scenario/after_scenario block around the restore logic) so both
paths clear prior state on failure.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tests/e2e/features/environment.py
- tests/e2e/features/query.feature
- tests/e2e/features/steps/llm_query_response.py
- tests/e2e/features/streaming_query.feature
- tests/e2e/utils/llama_stack_shields.py
✅ Files skipped from review due to trivial changes (1)
- tests/e2e/features/streaming_query.feature
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit