
LCORE-86: Prioritize BYOK content over built-in content#1208

Open
are-ces wants to merge 7 commits intolightspeed-core:mainfrom
are-ces:chunk-prio

Conversation

@are-ces
Copy link
Contributor

@are-ces are-ces commented Feb 24, 2026

TL;DR

  • Dual RAG Strategies: Added configurable "Always RAG" (pre-query context injection) and "Tool RAG" ("agentic" RAG). Both can be used independently or together.
  • Multi-Source Support: Query BYOK and Solr OKP simultaneously and merge results with BYOK prioritized first.
  • Chunk Prioritization: Implemented per-vector-store score multipliers for Always RAG (BYOK) to weight results from different sources.
  • Configuration Updates: Tool RAG defaults to enabled=True for backward compatibility. Both strategies independently configurable in lightspeed-stack.yaml.
  • Enrichment Script Improvements: Updated to automatically configure Solr in llama-stack and fix bugs in vector store configuration building.
  • Max Chunk Knobs: Added constants defining max_chunks for Tool RAG, BYOK RAG (Always RAG), and Solr RAG (Always RAG).

Motivation

To prioritize BYOK content, a mechanism to tune chunk scoring per vector store is necessary. The alternative was to create a client-side tool to change the behavior of the RAG tool, but that is not optimal for two reasons:

  1. RAG as a tool alone is not the best strategy: it relies on the LLM deciding when to call RAG, but in practice the majority of use cases require RAG on every query (e.g. in OLS).
  2. In Llama Stack 0.2.x the Agent could use a client-side function directly as a tool. In 0.3.x+ the Responses API returns a function tool call that must be intercepted by LCORE to then perform RAG, which increases complexity significantly.

Always RAG with LCORE-side chunk prioritization avoids both issues.

Description

Adds configurable RAG strategies and chunk prioritization: control which documentation sources are searched, how they are searched, and which sources take priority.

How It Works

Always RAG retrieves chunks from configured sources (BYOK and/or Solr) and injects them into the query before sending it to the LLM. The LLM always has documentation context without calling tools.
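The pre-query injection idea can be sketched as follows (a simplified illustration; the function and prompt format here are hypothetical, not the actual LCORE implementation):

```python
def inject_context(query: str, chunks: list[str]) -> str:
    """Prepend retrieved documentation chunks to the user query."""
    if not chunks:
        return query
    context = "\n\n".join(f"[Document {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {query}"
```

The key point is that retrieval happens unconditionally before the LLM call, so no tool invocation is required.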

Tool RAG lets the AI call the file_search tool during generation. This is the original behavior, enabled by default for backward compatibility.

Note: Both can be enabled simultaneously, but Tool RAG is non-deterministic: the LLM decides when to invoke file_search, so behavior varies between requests, and the combination with Always RAG has not been properly tested.

Chunk Prioritization applies a score_multiplier per BYOK source. All sources are queried in parallel, chunk scores are multiplied by their source's weight, then merged and sorted. Top N chunks are selected across all sources. Solr chunks are appended after BYOK chunks without cross-source ranking (TBD, needs discussion / spike).
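The weighting and merge step described above can be sketched like this (a simplified illustration; the real RAGChunk model and field names may differ):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Chunk:
    content: str
    source: str
    score: float


def merge_byok_chunks(
    per_source: dict[str, list[Chunk]],
    multipliers: dict[str, float],
    max_chunks: int,
) -> list[Chunk]:
    """Apply each source's score_multiplier, then merge, sort, and take top N."""
    weighted = [
        replace(c, score=c.score * multipliers.get(src, 1.0))
        for src, chunks in per_source.items()
        for c in chunks
    ]
    weighted.sort(key=lambda c: c.score, reverse=True)
    return weighted[:max_chunks]
```

With a multiplier of 1.5, a chunk scoring 0.8 (weighted to 1.2) outranks an unweighted chunk scoring 0.9, which is how a preferred BYOK source is boosted above the baseline.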

Configuration

In lightspeed-stack.yaml:

rag:
  always:
    solr:
      enabled: true     # Enable/disable Solr pre-query search
      offline: true     # true = Mimir URLs, false = public URLs
    byok:
      enabled: true     # Enable/disable BYOK pre-query search
  tool:
    byok: # for clarity, RAG as a tool is a llama-stack tool and does not support Solr
      enabled: true     # Enable/disable tool-based RAG (default: true)

byok_rag:
  - rag_id: source-a
    score_multiplier: 1.0    # Baseline
  - rag_id: source-b
    score_multiplier: 1.5    # 50% boost
    # ... other config (embedding_model, vector_db_id, db_path)

Chunk limits in src/constants.py:

BYOK_RAG_MAX_CHUNKS = 10   # Total chunks from all BYOK sources
SOLR_RAG_MAX_CHUNKS = 5    # Total chunks from Solr
TOOL_RAG_MAX_CHUNKS = 10   # Total chunks from RAG as a tool
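Taken together, these limits cap each strategy separately: the final Always RAG context is the top BYOK chunks followed by the Solr chunks, without cross-source ranking. A rough sketch of that composition, assuming each list is already sorted by weighted score:

```python
BYOK_RAG_MAX_CHUNKS = 10  # total chunks from all BYOK sources
SOLR_RAG_MAX_CHUNKS = 5   # total chunks from Solr


def assemble_always_rag(byok_sorted: list[str], solr_sorted: list[str]) -> list[str]:
    """BYOK chunks first (already merged and sorted), Solr appended after."""
    return byok_sorted[:BYOK_RAG_MAX_CHUNKS] + solr_sorted[:SOLR_RAG_MAX_CHUNKS]
```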

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude

Related Tickets & Documents

  • Related Issue # LCORE-86
  • Closes # LCORE-86

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Prerequisites: BYOK vector stores (FAISS) created with rag-content tool, OKP Solr instance, Llama Stack 0.4.3+, Lightspeed Stack Providers installed, OpenAI API key.

  1. Configure chunk limits in src/constants.py:

    BYOK_RAG_MAX_CHUNKS = 5    # Total chunks from all BYOK sources
    SOLR_RAG_MAX_CHUNKS = 5    # Total chunks from Solr
  2. Configure Lightspeed Stack - add BYOK RAG sources and RAG strategy to lightspeed-stack.yaml:

    # BYOK (Bring Your Own Knowledge) RAG configuration
    byok_rag:
      - rag_id: openshift-docs-part1
        rag_type: inline::faiss
        embedding_model: sentence-transformers/<path-to-embeddings-model>
        embedding_dimension: 768
        vector_db_id: vs_e4270ee9-0834-422b-83f3-3dca19e2454e
        db_path: <path-to-faiss-store>/os-start-ac/faiss_store.db
        score_multiplier: 1.0
      - rag_id: openshift-docs-part2
        rag_type: inline::faiss
        embedding_model: sentence-transformers/<path-to-embeddings-model>
        embedding_dimension: 768
        vector_db_id: vs_61f66acb-c014-4ffa-b4fb-228b49219ca9
        db_path: <path-to-faiss-store>/os-end-mc/faiss_store.db
        score_multiplier: 1.2
      - rag_id: graham-docs
        rag_type: inline::faiss
        embedding_model: sentence-transformers/all-mpnet-base-v2
        embedding_dimension: 768
        vector_db_id: vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2
        db_path: <path-to-faiss-store>/kv_store.db
        score_multiplier: 1.5
    
    # RAG configuration
    rag:
      always:
        solr:
          enabled: true
          offline: false
        byok:
          enabled: true
      tool:
        byok:
          enabled: false
  3. Run enrichment script - reads lightspeed-stack config and generates an enriched llama-stack run.yaml with BYOK vector stores and Solr provider registered:

    uv run python src/llama_stack_configuration.py \
      --input <path-to-llama-stack>/run_rag_minimal.yaml \
      --output <path-to-llama-stack>/run_enriched.yaml \
      --config lightspeed-stack.yaml
  4. Install Lightspeed Stack Providers:

    uv add git+https://github.com/lightspeed-core/lightspeed-providers.git
  5. Start Llama Stack:

    export EXTERNAL_PROVIDERS_DIR=<path-to-lightspeed-stack>/providers.d
    uv run llama stack run <path-to-llama-stack>/run_enriched.yaml
  6. Start Lightspeed Stack:

    uv run src/lightspeed_stack.py -c lightspeed-stack.yaml
  7. Test query — send to /v1/query or streaming endpoint:

    curl -X POST http://localhost:8080/v1/query \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "provider": "openai",
        "query": "What is the Admission plugin?",
        "system_prompt": "You are a helpful assistant"
      }'

Response:

{
  "conversation_id": "...",
  "response": "Admission plugins in Red Hat OpenShift Container Platform are [...]",
  "rag_chunks": [
    {
      "content": "...",
      "source": "openshift-docs-part1",
      "score": 1.0038,
      "attributes": {
        "doc_url": "https://www.redhat.com/data/architecture/admission-plug-ins.txt",
        "title": "Admission plugins",
        "document_id": "file-7a70ef22c4a646f2a6f657c66961ba2c"
      }
    },
    {
      "content": "...",
      "source": "openshift-docs-part2",
      "score": 0.926,
      "attributes": {
        "doc_url": "https://www.redhat.com/web_console/dynamic-plugin/overview-dynamic-plugin.txt",
        "title": "Overview of dynamic plugins",
        "document_id": "file-b266f575a95a4da19d7ba058fd980f00"
      }
    },
    {
      "content": "...",
      "source": "OKP Solr",
      "score": 63.996,
      "attributes": {
        "document_id": "/documentation/en-us/openshift_container_platform/4.19/html-single/architecture/index"
      }
    }
  ],
  "referenced_documents": [
    {
      "doc_title": "Admission plugins",
      "doc_url": "https://www.redhat.com/data/architecture/admission-plug-ins.txt",
      "source": "openshift-docs-part1"
    },
    {
      "doc_title": "Overview of dynamic plugins",
      "doc_url": "https://www.redhat.com/web_console/dynamic-plugin/overview-dynamic-plugin.txt",
      "source": "openshift-docs-part2"
    },
    {
      "doc_title": null,
      "doc_url": "https://mimir.corp.redhat.com/documentation/en-us/openshift_container_platform/4.19/html-single/architecture/index",
      "source": "OKP Solr"
    }
  ],
  "truncated": false,
  "input_tokens": 3736,
  "output_tokens": 448
}

Summary by CodeRabbit

  • New Features

    • Dual RAG modes: Inline RAG (pre-query context injection) and Tool RAG (on-demand retrieval), usable together.
    • BYOK sources exposed as a file_search tool by default.
  • Configuration

    • New top-level rag block with inline and tool subsections; per-source score_multiplier and example YAML updated.
    • Embedding model can be specified with provider/model entries.
  • Documentation

    • Guides, examples, and OpenAPI updated to document the two-mode RAG strategy and config.

@are-ces are-ces marked this pull request as draft February 24, 2026 12:41
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR replaces the prior Solr-centric RAG approach with a two-mode RAG architecture (Inline RAG and Tool RAG), adds a RagConfiguration surface (inline + tool), implements RAGContext construction (build_rag_context), integrates BYOK and Solr/OKP retrieval with score multipliers, updates endpoints to consume inline RAG context, and updates docs, configs, constants, and tests accordingly.

Changes

Cohort / File(s) Summary
Documentation - RAG Strategy Guides
docs/byok_guide.md, docs/rag_guide.md, docs/config.md
Rewrote docs to present Inline RAG and Tool RAG, updated configuration examples, diagrams, TOC headings, and replaced Solr-centric guidance with dual-mode RAG model.
Configuration Models
src/models/config.py, src/configuration.py
Introduced RagConfiguration (inline + tool), ByokRagConfiguration (with score_multiplier), OkpRagConfiguration; replaced solr accessor with rag and added score_multiplier_mapping in AppConfig.
Configuration Files & Examples
lightspeed-stack.yaml, run.yaml, examples/lightspeed-stack-byok-rag.yaml
Replaced top-level solr block with rag; example BYOK entries updated (vector_db_id values, score_multiplier); run.yaml changes default_embedding_model to explicit provider/model mapping.
Query & Streaming Endpoints
src/app/endpoints/query.py, src/app/endpoints/streaming_query.py
Replaced pre-query vector search with build_rag_context() producing RAGContext; pass inline RAG context into responses; rename pre_rag_documentsinline_rag_documents; updated signatures, merging logic, and logging.
RAG Context & Vector Search Utilities
src/utils/vector_search.py, src/utils/types.py
Added RAGContext, RAGChunk handling, and build_rag_context() that aggregates BYOK and Solr/OKP results; implemented BYOK and Solr fetchers, chunk→document extraction, score_multiplier weighting, and context formatting.
Response Preparation & Tools
src/utils/responses.py, src/utils/query.py
Added get_vector_store_ids() and _build_provider_data_headers; prepare_responses_params() gains inline_rag_context param; prepare_input() now prepends inline RAG context; tool availability gated by configuration.rag.tool.byok.enabled; tool chunk limits use constant.
Constants & Stack Configuration
src/constants.py, src/llama_stack_configuration.py
Added RAG/tool/OKP/Solr constants and defaults; BYOK enrichment refactored to use rag_id-based naming for providers/vector stores/models; added enrich_solr() to inject Solr provider/vector store/embedding model into Llama Stack config.
OpenAPI Specification
docs/openapi.json
Added public schemas for RAG configuration types and ByokRagConfiguration with score_multiplier, removed SolrConfiguration, and added MCP header propagation field.
Tests - Configuration & Models
tests/unit/models/config/*, tests/unit/test_configuration.py
Added tests and validations for score_multiplier, updated dumped configuration expectations to new rag structure, and added score_multiplier_mapping test coverage.
Tests - Vector Search & Endpoints
tests/unit/app/endpoints/*, tests/unit/utils/test_responses.py
Replaced perform_vector_search mocks with build_rag_context/RAGContext; added tests for get_vector_store_ids and tool gating; updated assertions to inline_rag_documents.
Tests - Llama Stack & RAG Chunks
tests/unit/test_llama_stack_configuration.py, tests/unit/models/responses/test_rag_chunk.py
Added enrich_solr() tests, enforced rag_id naming in generated resources, and expanded RAGChunk/RAGContext test coverage.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Request
    participant QueryEp as Query Endpoint
    participant RagBuilder as build_rag_context()
    participant ByokStore as BYOK Vector Store
    participant SolrOkp as Solr/OKP
    participant LLM as LLM
    participant Tools as file_search Tool

    Client->>QueryEp: send query_request

    rect rgba(100,150,200,0.5)
    Note over QueryEp,RagBuilder: Inline RAG (pre-query)
    QueryEp->>RagBuilder: build_rag_context(query_request)
    RagBuilder->>ByokStore: parallel BYOK searches (score_multiplier applied)
    RagBuilder->>SolrOkp: fetch Solr/OKP chunks
    ByokStore-->>RagBuilder: rag_chunks + referenced_documents
    SolrOkp-->>RagBuilder: rag_chunks + referenced_documents
    RagBuilder-->>QueryEp: RAGContext(context_text, rag_chunks, documents)
    end

    QueryEp->>LLM: forward query + inline_rag_context.context_text

    rect rgba(200,100,150,0.5)
    Note over LLM,Tools: Tool RAG (on-demand)
    alt LLM needs more info
      LLM->>Tools: invoke file_search
      Tools->>ByokStore: retrieve docs
      ByokStore-->>Tools: tool results
      Tools-->>LLM: return results
    end
    end

    LLM-->>QueryEp: response (may include used rag_chunks)
    QueryEp->>Client: merge inline + tool RAG chunks and return response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • tisnik
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: introducing prioritization of BYOK content over built-in content in the RAG system through configurable strategies and scoring mechanisms.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@are-ces
Copy link
Contributor Author

are-ces commented Feb 24, 2026

Chunks from Solr did not include title metadata, so I get null values for Solr chunk titles. @Anxhela21 you might know why; I may have missed something in the Solr configuration.

Printout of the Solr chunk structure returned by client.vector_io.query:

Chunk(
    chunk_id = '/documentation/en-us/openshift_container_platform/4.20/html-single/architecture/index_chunk_89',
    content  = '# Chapter 9. Admission plugins\n\n\n\n\nAdmission plugins are used to help regulate how [..]',
    chunk_metadata = ChunkChunkMetadata(
        chunk_embedding_dimension = None,
        chunk_embedding_model     = None,
        chunk_id                  = '/documentation/en-us/openshift_container_platform/4.20/html-single/architecture/index_chunk_89',
        chunk_tokenizer           = None,
        chunk_window              = None,
        content_token_count       = None,
        created_timestamp         = None,
        document_id               = '/documentation/en-us/openshift_container_platform/4.20/html-single/architecture/index',
        metadata_token_count      = None,
        source                    = None,
        updated_timestamp         = None,
    ),
    embedding           = [],
    metadata            = {},
    embedding_model     = 'sentence-transformers/ibm-granite/granite-embedding-30m-english',
    embedding_dimension = 384,
)

@are-ces are-ces requested review from Anxhela21, asimurka and tisnik and removed request for Anxhela21 February 24, 2026 13:04
Copy link
Contributor

@tisnik tisnik left a comment


LGTM

@are-ces are-ces force-pushed the chunk-prio branch 3 times, most recently from 2e57151 to f593d68 Compare February 27, 2026 09:05
- Add configurable RAG strategies: Always RAG, performed on each query (OKP Solr + BYOK), and Tool RAG; they can be used independently or together
- Add chunk prioritization with score multipliers per vector store for always RAG
- Added knobs in config to select the RAG strategy
- Tool RAG defaults to enabled=True for backward compatibility
- Update lightspeed stack configuration enrichment script to build the solr section in llama stack and fix bugs in building the vector stores
- Updated byok and rag documentation
- Updated unit tests
@are-ces are-ces marked this pull request as ready for review February 27, 2026 09:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
tests/unit/utils/test_responses.py (1)

2248-2275: Add an explicit regression test for vector_store_ids=[].

Current cases cover None and non-empty lists, but not an explicit empty list. A dedicated test will prevent accidental “fetch all stores” behavior regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_responses.py` around lines 2248 - 2275, Add a
regression test that verifies passing an explicit empty list to
get_vector_store_ids returns an empty list and does not call the API: create an
AsyncMock client (client_mock), call result = await
get_vector_store_ids(client_mock, []), assert result == [], and assert
client_mock.vector_stores.list.assert_not_called(); name the test clearly (e.g.,
test_returns_empty_for_explicit_empty_list_no_fetch) and place it alongside the
existing tests in the same test module.
tests/unit/test_llama_stack_configuration.py (1)

362-451: Good test coverage for enrich_solr function.

The tests comprehensively cover the main scenarios: disabled state, empty config, adding providers/stores/models, and preserving existing configuration. The duplicate detection tests for providers and vector stores are present.

Consider adding a test for duplicate embedding model detection to match the logic in enrich_solr (lines 470-473 in src/llama_stack_configuration.py) which checks existing_model_ids before adding.

Optional test for duplicate embedding model
def test_enrich_solr_skips_duplicate_embedding_model() -> None:
    """Test enrich_solr does not add duplicate embedding model."""
    ls_config: dict[str, Any] = {
        "registered_resources": {
            "models": [{"model_id": "existing", "provider_model_id": "all-mpnet-base-v2"}]
        }
    }
    enrich_solr(ls_config, {"enabled": True})

    model_ids = [m["provider_model_id"] for m in ls_config["registered_resources"]["models"]]
    assert model_ids.count("all-mpnet-base-v2") == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_llama_stack_configuration.py` around lines 362 - 451, Add a
unit test to cover duplicate embedding model detection in enrich_solr: create
ls_config with "registered_resources": {"models": [{"model_id": "existing",
"provider_model_id": "all-mpnet-base-v2"}]}, call enrich_solr(ls_config,
{"enabled": True}), then assert that the list of provider_model_id values in
ls_config["registered_resources"]["models"] contains "all-mpnet-base-v2" only
once (count == 1); reference enrich_solr and the
existing_model_ids/provider_model_id logic to locate where duplicates are
checked.
src/models/config.py (1)

1730-1740: Minor: Lambda wrappers in default_factory are unnecessary.

The # pylint: disable=unnecessary-lambda comments indicate these lambdas could be simplified. Since these are class constructors with no arguments, you can pass the class directly.

Optional simplification
-    solr: SolrRagConfiguration = Field(
-        default_factory=lambda: SolrRagConfiguration(),  # pylint: disable=unnecessary-lambda
+    solr: SolrRagConfiguration = Field(
+        default_factory=SolrRagConfiguration,
         title="Solr RAG configuration",
         description="Configuration for Solr RAG (pre-query).",
     )

-    byok: ByokRagConfiguration = Field(
-        default_factory=lambda: ByokRagConfiguration(),  # pylint: disable=unnecessary-lambda
+    byok: ByokRagConfiguration = Field(
+        default_factory=ByokRagConfiguration,
         title="BYOK RAG configuration",
         description="Configuration for BYOK RAG (pre-query).",
     )

The same applies to lines 1764-1773. However, line 1750 requires the lambda since it passes enabled=True.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/config.py` around lines 1730 - 1740, Replace the unnecessary
lambda wrappers used as default_factory for SolrRagConfiguration and
ByokRagConfiguration by passing the classes themselves (use
default_factory=SolrRagConfiguration and default_factory=ByokRagConfiguration)
in the Field definitions for the solr and byok attributes; also apply the same
simplification for any other fields that use no-arg constructors (e.g., the
similar fields around the RAG config group), but keep the lambda where a
constructor requires arguments (such as the field that calls a class with
enabled=True).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/rag_guide.md`:
- Around line 335-339: The fenced code block in docs/rag_guide.md is missing a
language specifier causing markdownlint MD040; update the block start to include
a language (e.g., bash) so the snippet begins with ```bash and leave the
contents unchanged—this addresses the lint warning for the curl POST example
used to call /v1/query.

In `@lightspeed-stack.yaml`:
- Around line 46-49: The YAML comment and value disagree for the BYOK flag: the
comment says "Default is true for backward compatibility" while the config sets
tool.byok.enabled: false; either change the comment to indicate the feature is
intentionally disabled (e.g., update the line near the tool/byok block to state
the default is false or that this base config disables BYOK) or change the value
to true to match the comment (set tool.byok.enabled: true); update only the
comment or only the boolean so they are consistent and leave the surrounding
YAML structure unchanged.

In `@src/app/endpoints/query.py`:
- Around line 163-167: The code mutates query_request.query by appending
rag_context.context_text; instead, avoid changing the model: after making the
deep copy (or even without copying), construct a separate retrieval string like
retrieval_query = query_request.query + rag_context.context_text and use
retrieval_query for retrieval/search/embedding calls only (where you currently
rely on the mutated value), leaving query_request.query intact for later
operations such as topic summary generation and storing the original user query;
update calls that currently read query_request.query for retrieval to use
retrieval_query instead and ensure persisted/stored values continue to use the
original query_request.query.

In `@src/utils/responses.py`:
- Around line 205-207: prepare_tools is calling vector_stores.list() via
get_vector_store_ids even when the feature toggle rag.tool.byok.enabled is
false, causing unnecessary failures; change the logic so that when
rag.tool.byok.enabled is False you skip any vector store listing and simply
return the passed-in vector_store_ids (or an empty list) instead of calling
vector_stores.list(); update either prepare_tools to short-circuit before
calling get_vector_store_ids or modify get_vector_store_ids to check the
rag.tool.byok.enabled flag and avoid calling vector_stores.list() when the flag
is disabled (referencing prepare_tools, get_vector_store_ids,
vector_stores.list, and rag.tool.byok.enabled to locate the code).
- Around line 125-126: The current truthiness check "if vector_store_ids:"
treats an explicit empty list as False and causes the code to fall through and
fetch all vector stores; change the condition to test for presence explicitly
(e.g., check for None) so that an empty list is preserved and returned. Locate
the check around the vector_store_ids variable in src/utils/responses.py and
replace the truthy conditional with an explicit None check (use
"vector_store_ids is not None" or equivalent) so the function returns the
provided vector_store_ids even when it's an empty list.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6024a and 2ace88f.

📒 Files selected for processing (22)
  • docs/byok_guide.md
  • docs/config.md
  • docs/rag_guide.md
  • examples/lightspeed-stack-byok-rag.yaml
  • lightspeed-stack.yaml
  • run.yaml
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query.py
  • src/configuration.py
  • src/constants.py
  • src/llama_stack_configuration.py
  • src/models/config.py
  • src/utils/responses.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/models/config/test_byok_rag.py
  • tests/unit/models/config/test_dump_configuration.py
  • tests/unit/models/responses/test_rag_chunk.py
  • tests/unit/test_configuration.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/utils/test_responses.py

Copy link
Contributor

@tisnik tisnik left a comment


In overall LGTM, nice work!

Copy link
Contributor

@asimurka asimurka left a comment


Great work, LGTM overall. I suggested some minor improvements, and there are also things worth a discussion.

"""

always: AlwaysRagConfiguration = Field(
default_factory=lambda: AlwaysRagConfiguration(), # pylint: disable=unnecessary-lambda
Copy link
Contributor


I think this could be done like default_factory=AlwaysRagConfiguration without using lambda

Copy link
Contributor


+1 goes for the other instances too.

Copy link
Contributor Author


Fixed TY!

"""
# Check if Tool RAG is enabled in configuration
if not (configuration and configuration.rag.tool.byok.enabled):
return None
Copy link
Contributor


Worth a discussion: Returning an empty list here would make more sense. I see it does not affect current logic but possibly can in future (responses endpoint). By returning None you signal "overwrite by config" and by an empty list "do not use vector stores".

Copy link
Contributor Author


Agreed, updated PTAL

Copy link
Contributor


LGTM

@romartin
Copy link
Contributor

romartin commented Feb 27, 2026

@are-ces @tisnik Ideally we should try this feature from the Ansible side before it is merged, to detect any incompatibilities or problems early so that fixes can still land as part of this PR.

On the other hand, it may take a while, so if you are in a hurry to merge this, we are also fine with merging, but if any changes turn out to be necessary after testing, you'll have to create new PRs and potentially new minor releases.

I am busy on a top priority customer issue, @TamiTakamiya is also quite busy I think, but I can guarantee that at some point during next week, I can give this new feature a try for our Ansible use-case.

Let's keep in touch!
Thanks!

@TamiTakamiya
Contributor

I am busy on a top priority customer issue, @TamiTakamiya is also quite busy I think, but I can guarantee that at some point during next week, I can give this new feature a try for our Ansible use-case.

I am trying to set up a test environment on my CRC with this, #1179, and the lightspeed-providers#52. Right now I am still unable to make authentication work (it should work with #1179, though)... Hopefully I will be able to find a solution soon.

    turn_summary.referenced_documents = deduplicate_referenced_documents(
        doc_ids_from_chunks + (turn_summary.referenced_documents or [])
    )
# Merge RAG chunks (BYOK + Solr) with tool-based RAG chunks

Merge "always" RAG chunks with "tool" RAG chunks?

Contributor Author

@are-ces are-ces Mar 2, 2026

Yes, these are the RAGChunk objects returned in the query response and saved in the transcript, not the chunks that are sent to the LLM

Contributor Author

Will make the comment clearer about that

    _, _, doc_ids_from_chunks, pre_rag_chunks = await perform_vector_search(
        client, query_request, configuration
    )
# Build RAG context from BYOK and Solr sources

Build RAG context from the configured "always" sources?

Contributor Author

Done

    Args:
        responses_params: The Responses API parameters
        context: The response generator context
        pre_rag_documents: Referenced documents from pre-query RAG (BYOK + Solr)
@syedriko syedriko Mar 1, 2026

always_rag_documents: Referenced documents from "always" RAG?

Contributor Author

@are-ces are-ces Mar 2, 2026

Fixed ty, changed to inline_rag_documents

Contributor

@Anxhela21 Anxhela21 left a comment

Left a few minor comments. Overall looks good, nice work!

/LGTM

"""

    always: AlwaysRagConfiguration = Field(
        default_factory=lambda: AlwaysRagConfiguration(),  # pylint: disable=unnecessary-lambda
Contributor

+1 goes for the other instances too.

        title="BYOK RAG configuration",
        description="BYOK RAG configuration. This configuration can be used to "
        "reconfigure Llama Stack through its run.yaml configuration file",
    )
Contributor

Since we are using the rag RagConfiguration object to define the configuration for all RAGs, do we still need the byok_rag configuration above?

Contributor Author

You have a valid point. My initial thought was to put everything under "rag" and have a section for "dbs" or "indexes"; however, this is a breaking change for teams that use the current config with "byok_rag". That's why I kept the old "byok_rag" section for the configuration of vector stores.

@tisnik wdyt?

```

The configuration above automatically enables the RAG tools. The system will:
Both modes can be enabled simultaneously. Choose based on your latency and control preferences:
Contributor

Would be nice to have some documentation on what the intended behavior is when both modes are enabled (always RAG and tool RAG), but that can be in a separate PR.

Contributor Author

@are-ces are-ces Mar 2, 2026

I feel this would be a spike; we would need to test the joint behavior, e.g. in which cases the tool is called.

Contributor Author

@are-ces are-ces Mar 2, 2026

Should I create a ticket for that @Anxhela21?

Contributor

Yes, let's open a separate ticket for this. I wonder if we want some evaluation outcome included in this as well.

@syedriko

syedriko commented Mar 2, 2026

@are-ces I left a few comments here in case the approach in this PR lives on, but I took this PR, with the help of my mechanical assistant, in a different direction, presented at https://github.com/syedriko/lightspeed-stack/tree/syedriko-lcore-86.
Key points:

  • "always" and "tool" are not quite symmetrical; I went for "inline" and "tool" as the call sites from which the vector store queries are issued
  • Solr is an implementation detail of the OKP; it is imprudent to expose it in the UI of LCORE
  • "inline" and "tool" are two lists of vector stores, with
    • no distinction made for the origin of those vector stores - BYOK, Solr or something else
    • Solr (as a vector store; we need to rename it if there are still references to it left) can be used in either the "inline" or "tool" sections; there is no technical reason it could not be called as a tool.

@are-ces
Contributor Author

are-ces commented Mar 2, 2026

@are-ces I left a few comments here in case the approach in this PR lives on, but I took this PR, with help of my mechanical assistant, in a different direction, presented at https://github.com/syedriko/lightspeed-stack/tree/syedriko-lcore-86. Key points:

* "always" and "tool" are not quite symmetrical, I went for "inline" and "tool" as call sites from where the vector store queries are issued

* Solr is an implementation detail of the OKP, it is imprudent exposing it in UI of LCORE

* "inline" and "tool" are two lists of vector stores, with
  
  * no distinction made for the origin of those vector stores - BYOK, Solr or something else
  * Solr (as a vector store, we need to rename it if there're still references to it left) can be used in either the "inline" or "tool" sections, there's no technical reason not to be able to call it as a tool.

Hey @syedriko, thanks for the review. You raise valid points.
I agree with the following:
always rag -> inline rag
solr in config -> okp

However, when it comes to including OKP Solr vector stores in the RAG tool, IIRC @Anxhela21 tested this and it was not working properly, right?

Anyway I agree, OKP should be included among the vector stores available to the RAG tool.

Finally, having two lists of vector stores defined for inline or tool makes sense, but not in this PR; it would be a breaking change, which should go in a separate PR, since this PR will be included in LCORE 0.4.2.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)

191-205: ⚠️ Potential issue | 🟠 Major

Preserve reference_url when chunk_metadata omits it.

At Line 198 and Line 204, reference_url is reassigned unconditionally. If metadata.reference_url was present and chunk_metadata.reference_url is missing, the URL is lost, which drops downstream document links.

Proposed fix
             if isinstance(chunk_meta, dict):
                 doc_id = chunk_meta.get("doc_id") or chunk_meta.get("document_id")
                 title = title or chunk_meta.get("title")
-                reference_url = chunk_meta.get("reference_url")
+                reference_url = reference_url or chunk_meta.get("reference_url")
             else:
                 doc_id = getattr(chunk_meta, "doc_id", None) or getattr(
                     chunk_meta, "document_id", None
                 )
                 title = title or getattr(chunk_meta, "title", None)
-                reference_url = getattr(chunk_meta, "reference_url", None)
+                reference_url = reference_url or getattr(
+                    chunk_meta, "reference_url", None
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 191 - 205, The code in
src/utils/vector_search.py clobbers an existing reference_url when
chunk.chunk_metadata lacks it; update the assignments inside the chunk_meta
handling so reference_url is only overwritten when the chunk metadata actually
provides a non-null value (e.g., use reference_url =
chunk_meta.get("reference_url") or reference_url for the dict branch and
reference_url = getattr(chunk_meta, "reference_url", None) or reference_url for
the object branch), leaving doc_id and title handling as-is; locate the logic
around chunk_meta, chunk_metadata, doc_id, title, and reference_url to apply
this conditional-preserve change.
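The conditional-preserve the prompt describes, reduced to a standalone sketch (the helper name is illustrative); it also handles the dict-or-object metadata shapes the review mentions:

```python
def merged_reference_url(existing, chunk_meta):
    # Read the candidate URL whether chunk_meta is a dict or an object,
    # then only fall back to it when no URL is known yet.
    if isinstance(chunk_meta, dict):
        candidate = chunk_meta.get("reference_url")
    else:
        candidate = getattr(chunk_meta, "reference_url", None)
    return existing or candidate

assert merged_reference_url("https://docs/a", {}) == "https://docs/a"
assert merged_reference_url(None, {"reference_url": "https://docs/b"}) == "https://docs/b"
```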
♻️ Duplicate comments (1)
src/utils/responses.py (1)

169-170: ⚠️ Potential issue | 🟠 Major

Skip vector-store discovery when Tool RAG is disabled.

prepare_tools() resolves vector stores before checking whether Tool RAG is enabled. If vector_stores.list() fails, the request can fail even when configuration.rag.tool.byok.enabled is false.

Proposed fix
-    # Get vector stores for RAG tools - use specified ones or fetch all
-    vector_store_ids = await get_vector_store_ids(client, vector_store_ids)
-
-    # Add RAG tools if vector stores are available
-    rag_tools = get_rag_tools(vector_store_ids)
+    resolved_vector_store_ids = vector_store_ids or []
+    if configuration.rag.tool.byok.enabled:
+        # Get vector stores for RAG tools - use specified ones or fetch all
+        resolved_vector_store_ids = await get_vector_store_ids(client, vector_store_ids)
+
+    # Add RAG tools if vector stores are available
+    rag_tools = get_rag_tools(resolved_vector_store_ids)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/responses.py` around lines 169 - 170, The code calls
get_vector_store_ids (which triggers vector_stores.list()) before checking
whether Tool RAG is enabled; change prepare_tools() to skip vector-store
discovery when configuration.rag.tool.byok.enabled is false by moving or
guarding the get_vector_store_ids call: first check
configuration.rag.tool.byok.enabled in prepare_tools(), and only invoke
get_vector_store_ids(client, vector_store_ids) if the flag is true (or when
vector_store_ids is explicitly provided), so failures from vector_stores.list()
cannot occur when Tool RAG is disabled.
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query.py (1)

747-756: Reuse the shared referenced-document dedup helper.

This block duplicates dedup logic that already exists in src/utils/responses.py (deduplicate_referenced_documents). Reusing it avoids drift and keeps URL normalization behavior consistent.

Proposed fix
 from utils.responses import (
     build_mcp_tool_call_from_arguments_done,
     build_tool_call_summary,
     build_tool_result_from_mcp_output_item_done,
+    deduplicate_referenced_documents,
     extract_token_usage,
     extract_vector_store_ids_from_tools,
     get_topic_summary,
     parse_referenced_documents,
     prepare_responses_params,
 )
@@
-    if turn_summary.inline_rag_documents:
-        all_documents = turn_summary.inline_rag_documents + tool_based_documents
-        seen = set()
-        deduplicated_documents = []
-        for doc in all_documents:
-            key = (doc.doc_url, doc.doc_title)
-            if key not in seen:
-                seen.add(key)
-                deduplicated_documents.append(doc)
-        turn_summary.referenced_documents = deduplicated_documents
+    if turn_summary.inline_rag_documents:
+        turn_summary.referenced_documents = deduplicate_referenced_documents(
+            turn_summary.inline_rag_documents + tool_based_documents
+        )
     else:
         turn_summary.referenced_documents = tool_based_documents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 747 - 756, Replace the
in-place dedup logic with the shared helper: import and call
deduplicate_referenced_documents (from src.utils.responses) on the combined list
(turn_summary.inline_rag_documents + tool_based_documents) and assign its result
to turn_summary.referenced_documents; keep using the same symbols
(turn_summary.inline_rag_documents, tool_based_documents,
turn_summary.referenced_documents) so URL/title normalization and dedup behavior
remain consistent across the codebase.
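For readers outside the codebase, the shared helper's behavior amounts to first-occurrence-wins keyed on (url, title); a dict-based sketch (the real helper operates on referenced-document objects):

```python
def deduplicate_referenced_documents(docs: list[dict]) -> list[dict]:
    # Keep the first occurrence of each (doc_url, doc_title) pair, in order.
    seen: set[tuple] = set()
    result = []
    for doc in docs:
        key = (doc["doc_url"], doc["doc_title"])
        if key not in seen:
            seen.add(key)
            result.append(doc)
    return result

docs = [
    {"doc_url": "https://docs/a", "doc_title": "A"},
    {"doc_url": "https://docs/a", "doc_title": "A"},  # duplicate
    {"doc_url": "https://docs/b", "doc_title": "B"},
]
assert len(deduplicate_referenced_documents(docs)) == 2
```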
src/models/config.py (1)

1758-1774: Add compatibility mapping for renamed RAG keys.

With extra="forbid", legacy configs using older key names will fail hard on upgrade. Consider a small pre-validation migration (alwaysinline, legacy solr block→rag.inline.okp) with deprecation warnings.

Proposed fix
 class RagConfiguration(ConfigurationBase):
@@
+    `@model_validator`(mode="before")
+    `@classmethod`
+    def migrate_legacy_inline_key(cls, data: Any) -> Any:
+        if isinstance(data, dict) and "always" in data and "inline" not in data:
+            migrated = dict(data)
+            migrated["inline"] = migrated.pop("always")
+            return migrated
+        return data
@@
 class Configuration(ConfigurationBase):
@@
+    `@model_validator`(mode="before")
+    `@classmethod`
+    def migrate_legacy_solr_key(cls, data: Any) -> Any:
+        if not isinstance(data, dict) or "solr" not in data:
+            return data
+
+        migrated = dict(data)
+        rag = dict(migrated.get("rag") or {})
+        inline = dict(rag.get("inline") or {})
+        inline.setdefault("okp", migrated.pop("solr"))
+        rag["inline"] = inline
+        migrated["rag"] = rag
+        return migrated

Also applies to: 1915-1919

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/config.py` around lines 1758 - 1774, RagConfiguration currently
uses strict validation (extra="forbid") so legacy keys will error; add a
pre-validation migration step (e.g., a `@root_validator`(pre=True) on
RagConfiguration or on the top-level ConfigurationBase) that detects and
rewrites legacy keys: map "always" -> {"inline": ...} and move any legacy "solr"
block into "rag" -> "inline" -> "okp" (or "rag.inline.okp"), emit a
DeprecationWarning or logger.warn when a legacy key is found, then return the
normalized dict so the existing fields inline: InlineRagConfiguration and tool:
ToolRagConfiguration validate cleanly. Ensure the migration runs before
pydantic's forbid check so old configs continue to work with warnings.
docs/openapi.json (1)

7378-7385: Constrain propagated header names and deduplicate entries.

Consider validating header names and enforcing uniqueness to prevent invalid config values from silently propagating.

🔧 Proposed hardening
                     "headers": {
                         "items": {
-                            "type": "string"
+                            "type": "string",
+                            "pattern": "^[!#$%&'*+.^_`|~0-9A-Za-z-]+$"
                         },
                         "type": "array",
+                        "uniqueItems": true,
                         "title": "Propagated headers",
                         "description": "List of HTTP header names to automatically forward from the incoming request to this MCP server. Headers listed here are extracted from the original client request and included when calling the MCP server. This is useful when infrastructure components (e.g. API gateways) inject headers that MCP servers need, such as x-rh-identity in HCC. Header matching is case-insensitive. These headers are additive with authorization_headers and MCP-HEADERS."
                     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/openapi.json` around lines 7378 - 7385, The "headers" array schema
should enforce valid header-name syntax and uniqueness: update the "headers"
property (the array and its "items") to add "uniqueItems": true and a regex
"pattern" on each item to match valid HTTP field-names (e.g. token chars per
RFC, e.g. /^[A-Za-z0-9!#$%&'*+\-.^_`|~]+$/) so invalid names are rejected and
duplicates are prevented; also adjust the "description" text if needed to note
case-insensitivity and that values are deduplicated/normalized.
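The same constraint can also be enforced at runtime; a sketch using the token characters from the suggested pattern (the helper name is illustrative):

```python
import re

# HTTP field-name token characters, per the pattern suggested above (RFC 9110 tchar)
HEADER_NAME_RE = re.compile(r"^[!#$%&'*+.^_`|~0-9A-Za-z-]+$")

def clean_header_names(names: list[str]) -> list[str]:
    # Drop syntactically invalid names and deduplicate case-insensitively.
    seen: set[str] = set()
    result = []
    for name in names:
        if HEADER_NAME_RE.fullmatch(name) and name.lower() not in seen:
            seen.add(name.lower())
            result.append(name)
    return result

assert clean_header_names(["x-rh-identity", "X-RH-Identity", "bad header"]) == ["x-rh-identity"]
```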
src/utils/vector_search.py (1)

359-364: Use shared default score multiplier constant.

Line 363 uses a hardcoded 1.0. Prefer the shared constant to keep behavior centralized and avoid drift.

Proposed fix
                 _query_store_for_byok_rag(
                     client,
                     vector_store_id,
                     query,
-                    score_multiplier_mapping.get(vector_store_id, 1.0),
+                    score_multiplier_mapping.get(
+                        vector_store_id, constants.DEFAULT_SCORE_MULTIPLIER
+                    ),
                 )

As per coding guidelines, check constants.py for shared constants before defining new ones.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 359 - 364, Replace the hardcoded 1.0
score multiplier in the call to _query_store_for_byok_rag with the shared
default constant from constants.py (e.g., constants.DEFAULT_SCORE_MULTIPLIER or
the appropriate name found in constants.py); update the top of
src/utils/vector_search.py to import that constant if not already imported, and
change score_multiplier_mapping.get(vector_store_id, 1.0) to
score_multiplier_mapping.get(vector_store_id,
constants.DEFAULT_SCORE_MULTIPLIER) so the default behavior is centralized.
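For readers following the thread, the per-store multiplier mechanic this PR adds amounts to the following (simplified sketch; the real code queries each vector store and merges results, and the constant name is illustrative):

```python
DEFAULT_SCORE_MULTIPLIER = 1.0  # illustrative; the shared constant would live in src/constants.py

def rank_with_multipliers(chunks, multipliers):
    # chunks: (chunk_id, raw_score, vector_store_id) triples.
    # Weight each score by its store's multiplier, then rank best-first.
    weighted = [
        (chunk_id, score * multipliers.get(store_id, DEFAULT_SCORE_MULTIPLIER))
        for chunk_id, score, store_id in chunks
    ]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)

ranked = rank_with_multipliers(
    [("byok-chunk", 0.6, "byok"), ("okp-chunk", 0.7, "okp")],
    {"byok": 1.5},  # boost BYOK so it outranks built-in content
)
assert [chunk_id for chunk_id, _ in ranked] == ["byok-chunk", "okp-chunk"]  # 0.9 vs 0.7
```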
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/byok_guide.md`:
- Around line 41-48: The docs use the term "Solr" in user-facing text and the
mermaid diagram label "Solr OKP", causing inconsistency with the public
abstraction `okp`; update all occurrences of "Solr" to "OKP" (e.g., change the
mermaid node label "Solr OKP" to "OKP" and any inline text/table entries that
say "Solr"), and ensure the docs reference the configuration/endpoint name `okp`
where shown (also apply same replacement for the occurrences noted around lines
314-315).

In `@docs/config.md`:
- Around line 528-532: Update the "Inline RAG" bullet to use OKP terminology
instead of "Solr": replace the phrase that currently reads "context is fetched
from Solr and/or BYOK vector stores" with wording like "context is fetched from
OKP (Solr-backed) and/or BYOK vector stores" so the docs align with the config
surface `okp`; ensure the term "OKP (Solr-backed)" appears exactly in the Inline
RAG description and leave the "Tool RAG" line unchanged.

In `@docs/openapi.json`:
- Around line 7993-8004: The OpenAPI schema exposes the internal field
"inline_rag_context" in the public request model for the /v1/query and
/v1/streaming_query endpoints; remove it from the public request model (or move
it to a separate internal-only schema) so clients cannot set it. Locate
occurrences of "inline_rag_context" in docs/openapi.json (specifically under the
request body schemas for /v1/query and /v1/streaming_query) and either delete
that property from the public request schema or mark it as internal-only (e.g.,
move to an internal schema reference or set readOnly/internal extension) and
update any $ref references accordingly.
- Around line 5507-5513: The schema for the "score_multiplier" property
currently uses "exclusiveMinimum": 0.0 which disallows 0; if the field is
intended to be non-negative change the schema to use "minimum": 0.0 instead of
"exclusiveMinimum": 0.0 so that zero is a valid value; update the
"score_multiplier" definition in the OpenAPI JSON (the property named
score_multiplier) to use minimum: 0.0 and keep the existing type, title,
description, and default.

In `@src/app/endpoints/streaming_query.py`:
- Around line 186-187: Wrap the call to build_rag_context(client, query_request,
configuration) in a try/except that catches APIConnectionError and Exception and
re-raises fastapi.HTTPException with appropriate status codes/messages (e.g.,
502 for APIConnectionError, 500 for generic errors) so the endpoint returns
translated HTTP errors; ensure the caught exception is logged before raising the
HTTPException. Also remove the manual deduplication loop currently implemented
and call deduplicate_referenced_documents(referenced_documents) from
utils.responses instead to centralize logic and avoid duplication; update any
variable names to use the helper's return value where referenced in the
streaming_query handler.

In `@src/llama_stack_configuration.py`:
- Around line 142-147: Don't default missing BYOK identifiers to empty strings:
in the logic that builds backend_name and output (variables rag_id,
backend_name, and the output dict) validate that brag["rag_id"] (and
brag["db_path"] if required) are present and non-empty and raise/throw a clear
exception (or return an error) instead of using get(..., ""); ensure
backend_name uses a validated rag_id so you never produce names like
"byok__storage" or paths like ".llama/.db" and update any surrounding code that
expects a valid rag_id to handle the raised error.

In `@src/models/requests.py`:
- Around line 110-115: The inline_rag_context Field is marked internal but
remains client-writable; add a Pydantic validation that rejects any incoming
payload containing this key. In the request model class that defines
inline_rag_context, implement a `@root_validator`(pre=True) (e.g., def
validate_no_inline_rag_context(cls, values): ...) that checks if
"inline_rag_context" is present in the incoming values and raises a ValueError
(or ValidationError) if so; keep the Field(...) for internal use but ensure
external inputs are refused during model validation.

In `@src/utils/vector_search.py`:
- Around line 592-599: The conversion currently assumes chunk.chunk_metadata
uses attribute access (chunk.chunk_metadata.document_id) and misses document_id
when chunk_metadata is a dict; update the block handling chunk.chunk_metadata to
accept both attribute and dict forms (e.g., try getattr(chunk.chunk_metadata,
"document_id", None) then fallback to chunk.chunk_metadata.get("document_id") if
it's a dict) and then set attributes["document_id"] accordingly; also ensure the
offline URL fallback sets attributes["doc_url"] using
urljoin(constants.MIMIR_DOC_URL, doc_id) when "doc_url" not present and doc_id
exists.
- Around line 511-514: The current logger.debug calls print the full RAG payload
via context_text (sensitive user query and chunk bodies); replace these debug
prints with non-sensitive metadata: log the number of retrieved chunks, total
size/char count of context_text, and optionally a short redacted snippet or a
hash (e.g., first 50 chars masked or sha256 of context_text) instead of the
entire content; update the logger.debug calls that reference context_text to
emit counts/lengths and a redacted/hash summary so no full payload is written to
logs.
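The redaction the last item asks for can be a one-line summary helper (name and format are illustrative):

```python
import hashlib

def summarize_context_for_log(context_text: str, chunk_count: int) -> str:
    # Log counts, size, and a short hash instead of the payload itself.
    digest = hashlib.sha256(context_text.encode("utf-8")).hexdigest()[:12]
    return f"rag context: {chunk_count} chunks, {len(context_text)} chars, sha256={digest}"

line = summarize_context_for_log("user query plus retrieved chunk bodies", 3)
assert "user query" not in line  # no payload content reaches the logs
```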


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ace88f and 22fe72d.

📒 Files selected for processing (18)
  • docs/byok_guide.md
  • docs/config.md
  • docs/openapi.json
  • docs/rag_guide.md
  • examples/lightspeed-stack-byok-rag.yaml
  • lightspeed-stack.yaml
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query.py
  • src/constants.py
  • src/llama_stack_configuration.py
  • src/models/config.py
  • src/models/requests.py
  • src/utils/query.py
  • src/utils/responses.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/models/config/test_dump_configuration.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/utils/types.py
  • examples/lightspeed-stack-byok-rag.yaml
  • docs/rag_guide.md

Comment on lines +7993 to +8004
"inline_rag_context": {
    "anyOf": [
        {
            "type": "string"
        },
        {
            "type": "null"
        }
    ],
    "title": "Inline Rag Context",
    "description": "RAG context injected into the query before sending to the LLM. Populated internally; not part of the external API."
},
Contributor

@coderabbitai coderabbitai bot Mar 2, 2026

⚠️ Potential issue | 🟠 Major

inline_rag_context is marked internal, but it is currently exposed as client input in the public request schema.

This makes an internal field externally settable via /v1/query and /v1/streaming_query. Keep it out of the public OpenAPI request model (or use a separate internal model).

🔧 Proposed fix (schema-level)
-                    "inline_rag_context": {
-                        "anyOf": [
-                            {
-                                "type": "string"
-                            },
-                            {
-                                "type": "null"
-                            }
-                        ],
-                        "title": "Inline Rag Context",
-                        "description": "RAG context injected into the query before sending to the LLM. Populated internally; not part of the external API."
-                    },
-    inline_rag_context: RAG context injected into the query before sending to the LLM.
-        Populated internally; not part of the external API. Defaults to None.

Also applies to: 8216-8216

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/openapi.json` around lines 7993 - 8004, The OpenAPI schema exposes the
internal field "inline_rag_context" in the public request model for the
/v1/query and /v1/streaming_query endpoints; remove it from the public request
model (or move it to a separate internal-only schema) so clients cannot set it.
Locate occurrences of "inline_rag_context" in docs/openapi.json (specifically
under the request body schemas for /v1/query and /v1/streaming_query) and either
delete that property from the public request schema or mark it as internal-only
(e.g., move to an internal schema reference or set readOnly/internal extension)
and update any $ref references accordingly.
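The refusal the prompt describes, minus the pydantic specifics (in the real request model this would be a `mode="before"` validator; the function here is a sketch):

```python
def reject_internal_fields(payload: dict) -> dict:
    # Refuse client payloads that try to set the internal-only field.
    if "inline_rag_context" in payload:
        raise ValueError("inline_rag_context is internal and cannot be set by clients")
    return payload

assert reject_internal_fields({"query": "hello"}) == {"query": "hello"}
try:
    reject_internal_fields({"query": "hello", "inline_rag_context": "injected"})
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```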

Contributor Author

On it...

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Grep occurrences of 'inline_rag_context' across repo =="
rg -n -S --no-heading -C2 'inline_rag_context' || true

echo
echo "== Grep occurrences of title string 'Inline Rag Context' =="
rg -n -S --no-heading -C1 '"Inline Rag Context"' || true

echo
echo "== Show JSON node for components.schemas.QueryRequest.properties.inline_rag_context (if present) =="
if command -v jq >/dev/null 2>&1; then
  if [ -f docs/openapi.json ]; then
    jq -r 'try .components.schemas.QueryRequest.properties.inline_rag_context // empty' docs/openapi.json || true
  else
    echo "docs/openapi.json not found"
  fi
else
  echo "jq not available"
fi

echo
echo "== Confirm Python model marks the field as internal (exclude=True) =="
rg -n -S --no-heading -C2 'inline_rag_context' src/models/requests.py || true

Length of output: 4673


@coderabbit-user Thanks — confirmed via repo scan that inline_rag_context is:

  • Exposed in docs/openapi.json under components.schemas.QueryRequest.properties at Lines 7993–8003
  • Mentioned in the QueryRequest description at Line 8216
  • Set only internally by endpoints and consumed in utils/query.py

Here’s a minimal patch to keep the field out of the public OpenAPI while leaving internal behavior unchanged.

Diff 1/2: remove property from public schema

diff --git a/docs/openapi.json b/docs/openapi.json
@@ -7990,18 +7990,6 @@
                         ]
                     },
-                    "inline_rag_context": {
-                        "anyOf": [
-                            {
-                                "type": "string"
-                            },
-                            {
-                                "type": "null"
-                            }
-                        ],
-                        "title": "Inline Rag Context",
-                        "description": "RAG context injected into the query before sending to the LLM. Populated internally; not part of the external API."
-                    },
                     "conversation_id": {
                         "anyOf": [
                             {

Diff 2/2: remove mention from the description block

diff --git a/docs/openapi.json b/docs/openapi.json
@@ -8214,7 +8214,6 @@
-                "description": "Model representing a request for the LLM (Language Model).\n\nAttributes:\n    query: The query string.\n    inline_rag_context: RAG context injected into the query before sending to the LLM.\n        Populated internally; not part of the external API. Defaults to None.\n    conversation_id: The optional conversation ID (UUID).\n    provider: The optional provider.\n    model: The optional model.\n    system_prompt: The optional system prompt.\n    attachments: The optional attachments.\n    no_tools: Whether to bypass all tools and MCP servers (default: False).\n    generate_topic_summary: Whether to generate topic summary for new conversations.\n    media_type: The optional media type for response format (application/json or text/plain).\n    vector_store_ids: The optional list of specific vector store IDs to query for RAG.\n    shield_ids: The optional list of safety shield IDs to apply.\n\nExample:\n    ```python\n    query_request = QueryRequest(query=\"Tell me about Kubernetes\")\n    ```",
+                "description": "Model representing a request for the LLM (Language Model).\n\nAttributes:\n    query: The query string.\n    conversation_id: The optional conversation ID (UUID).\n    provider: The optional provider.\n    model: The optional model.\n    system_prompt: The optional system prompt.\n    attachments: The optional attachments.\n    no_tools: Whether to bypass all tools and MCP servers (default: False).\n    generate_topic_summary: Whether to generate topic summary for new conversations.\n    media_type: The optional media type for response format (application/json or text/plain).\n    vector_store_ids: The optional list of specific vector store IDs to query for RAG.\n    shield_ids: The optional list of safety shield IDs to apply.\n\nExample:\n    ```python\n    query_request = QueryRequest(query=\"Tell me about Kubernetes\")\n    ```",

Optional follow-up

  • Add a lightweight CI guard to prevent regressions: fail if docs/openapi.json contains "inline_rag_context".
    • Example: rg -q 'inline_rag_context' docs/openapi.json && { echo "inline_rag_context must not be exposed"; exit 1; } || true

Shout if you want me to open a quick follow-up PR or wire the CI check.

Comment on lines +186 to +187
# Build RAG context from Inline RAG sources
rag_context = await build_rag_context(client, query_request, configuration)
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the definition and imports related to build_rag_context
echo "=== Looking for build_rag_context definition ==="
rg -n "def build_rag_context|from.*build_rag_context|import.*build_rag_context" src/ --type=py

echo ""
echo "=== Checking streaming_query.py around lines 186-187 ==="
sed -n '180,200p' src/app/endpoints/streaming_query.py

echo ""
echo "=== Checking dedup logic around lines 747-756 ==="
sed -n '740,760p' src/app/endpoints/streaming_query.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 1967


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where build_rag_context is defined
echo "=== Searching for build_rag_context in utils ==="
fd -e py -x rg -l "build_rag_context" {} \; src/utils/

echo ""
echo "=== Looking for dedup helper in responses.py ==="
cat -n src/utils/responses.py | grep -A5 -B5 "dedup\|duplicate"

Repository: lightspeed-core/lightspeed-stack

Length of output: 487


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check error handling patterns in streaming_query.py context
echo "=== Error handling near build_rag_context call ==="
sed -n '175,210p' src/app/endpoints/streaming_query.py | cat -n

echo ""
echo "=== Check if build_rag_context handles API errors ==="
rg -A10 "async def build_rag_context" src/ --type=py

Repository: lightspeed-core/lightspeed-stack

Length of output: 2367


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full build_rag_context implementation ==="
sed -n '479,550p' src/utils/vector_search.py

echo ""
echo "=== Check for APIConnectionError/APIStatusError handling in build_rag_context ==="
rg -A50 "async def build_rag_context" src/utils/vector_search.py | head -80

Repository: lightspeed-core/lightspeed-stack

Length of output: 4270


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for dedup helper in responses.py ==="
cat src/utils/responses.py | grep -i "dedup\|duplicate" || echo "No dedup/duplicate found"

echo ""
echo "=== Check if responses.py has any document deduplication helper ==="
rg "def.*dedup|def.*duplicate" src/utils/ --type=py

Repository: lightspeed-core/lightspeed-stack

Length of output: 363


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check the entire streaming endpoint function for error handling ==="
rg -B5 "async def streaming_query" src/app/endpoints/streaming_query.py | head -20

echo ""
echo "=== Look for try/except blocks in streaming_query function ==="
awk '/async def streaming_query/,/^async def|^def [a-z]/' src/app/endpoints/streaming_query.py | grep -n "try\|except\|raise HTTPException" | head -20

Repository: lightspeed-core/lightspeed-stack

Length of output: 454


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== deduplicate_referenced_documents helper ==="
rg -A15 "def deduplicate_referenced_documents" src/utils/responses.py

echo ""
echo "=== Check _fetch_byok_rag and _fetch_solr_rag for error handling ==="
rg -B2 -A20 "async def _fetch_byok_rag|async def _fetch_solr_rag" src/utils/vector_search.py | head -80

Repository: lightspeed-core/lightspeed-stack

Length of output: 2160


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check imports and exception handling at top of vector_search.py ==="
head -50 src/utils/vector_search.py | grep -E "^import|^from|APIConnectionError|APIStatusError" || echo "Checking full imports..."

echo ""
echo "=== Check for APIConnectionError/APIStatusError handling in vector_search.py ==="
rg "APIConnectionError|APIStatusError|except.*Error" src/utils/vector_search.py

echo ""
echo "=== Look for the complete streaming_query_endpoint_handler function signature and error handling ==="
sed -n '130,280p' src/app/endpoints/streaming_query.py | grep -n "try\|except\|HTTPException\|async def" | head -20

Repository: lightspeed-core/lightspeed-stack

Length of output: 671


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check error handling structure in _fetch_byok_rag and _fetch_solr_rag implementations ==="
rg -A40 "async def _fetch_byok_rag" src/utils/vector_search.py | head -50

echo ""
echo "=== Check _fetch_solr_rag for try/except ==="
rg -A50 "async def _fetch_solr_rag" src/utils/vector_search.py | head -60

Repository: lightspeed-core/lightspeed-stack

Length of output: 3615


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Verify deduplicate_referenced_documents is not used in streaming_query.py ==="
rg "deduplicate_referenced_documents" src/app/endpoints/streaming_query.py || echo "Not found - confirms duplication"

echo ""
echo "=== Show the manual dedup at lines 747-756 more clearly ==="
sed -n '745,760p' src/app/endpoints/streaming_query.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 848


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check if streaming_query_endpoint_handler has any try/except or error handling ==="
awk '/async def streaming_query_endpoint_handler/,/^async def|^@/' src/app/endpoints/streaming_query.py | grep -c "try\|except" || echo "0"

echo ""
echo "=== Check the direct context around line 186-187 for wrapping ==="
sed -n '150,250p' src/app/endpoints/streaming_query.py | grep -B20 "build_rag_context" | tail -30

Repository: lightspeed-core/lightspeed-stack

Length of output: 1110


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== See the except clause in _fetch_byok_rag ==="
rg -A60 "async def _fetch_byok_rag" src/utils/vector_search.py | grep -A5 "except"

echo ""
echo "=== See the except clause in _fetch_solr_rag ==="
rg -A100 "async def _fetch_solr_rag" src/utils/vector_search.py | grep -A5 "except"

echo ""
echo "=== Check what exceptions are caught ==="
rg "except.*:" src/utils/vector_search.py | head -10

Repository: lightspeed-core/lightspeed-stack

Length of output: 124


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full _fetch_byok_rag with exception handling ==="
rg -A80 "async def _fetch_byok_rag" src/utils/vector_search.py

echo ""
echo "=== Full _fetch_solr_rag with exception handling ==="
rg -A120 "async def _fetch_solr_rag" src/utils/vector_search.py | tail -80

Repository: lightspeed-core/lightspeed-stack

Length of output: 6245


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check _fetch_byok_rag for except clause (full function) ==="
awk '/async def _fetch_byok_rag/,/^async def|^def [a-z_]/' src/utils/vector_search.py | tail -50

Repository: lightspeed-core/lightspeed-stack

Length of output: 166


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Get line numbers for _fetch_byok_rag and _fetch_solr_rag ==="
rg -n "async def _fetch_byok_rag|async def _fetch_solr_rag|async def build_rag_context" src/utils/vector_search.py

echo ""
echo "=== Get line count of vector_search.py ==="
wc -l src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 310


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the lines for _fetch_byok_rag
echo "=== _fetch_byok_rag function ==="
sed -n '320,420p' src/utils/vector_search.py | tail -60

Repository: lightspeed-core/lightspeed-stack

Length of output: 1290


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check exception handling: is there an except in _fetch_byok_rag? ==="
awk '/^async def _fetch_byok_rag/,/^async def _fetch_solr_rag/' src/utils/vector_search.py | grep -n "except\|try" | head -10

Repository: lightspeed-core/lightspeed-stack

Length of output: 232


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== _fetch_byok_rag except block ==="
sed -n '316,404p' src/utils/vector_search.py | tail -10

echo ""
echo "=== Check the full except clause in _fetch_byok_rag ==="
sed -n '395,404p' src/utils/vector_search.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 1029


Add proper error handling for RAG context fetching per guidelines.

The build_rag_context() call at line 186 is not wrapped with error translation. The helper functions internally catch and log errors gracefully, but they do not convert API exceptions to HTTPException as the endpoint error-handling guidelines require. Wrap the call in a try/except that converts caught exceptions (e.g., APIConnectionError, general Exception) to an appropriate HTTPException with a meaningful status code.

Also, replace the manual deduplication logic at lines 747-756 with the existing deduplicate_referenced_documents() helper from utils/responses.py to avoid code duplication.
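The suggested error translation can be sketched as follows. The `HTTPException` and `APIConnectionError` classes below are stand-ins for `fastapi.HTTPException` and the client library's connection error (the real endpoint would import those instead), and `build_rag_context_safely` is a hypothetical wrapper name:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


# Stand-ins for fastapi.HTTPException and the client's APIConnectionError.
class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


class APIConnectionError(Exception):
    pass


async def build_rag_context_safely(build_rag_context, *args):
    """Translate RAG-context failures into HTTP errors, per the review guidance."""
    try:
        return await build_rag_context(*args)
    except APIConnectionError as exc:
        # Upstream vector store / Solr unreachable: surface as a gateway error.
        logger.error("RAG backend unreachable: %s", exc)
        raise HTTPException(status_code=502, detail="RAG backend unreachable") from exc
    except Exception as exc:  # pylint: disable=broad-except
        logger.error("Unexpected error building RAG context: %s", exc)
        raise HTTPException(status_code=500, detail="Failed to build RAG context") from exc


async def _failing(*_args):
    raise APIConnectionError("connection refused")


try:
    asyncio.run(build_rag_context_safely(_failing))
except HTTPException as e:
    status = e.status_code
```

A connection failure is logged once and surfaces as a 502 instead of an unhandled traceback in the streaming handler.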

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 186 - 187, Wrap the call
to build_rag_context(client, query_request, configuration) in a try/except that
catches APIConnectionError and Exception and re-raises fastapi.HTTPException
with appropriate status codes/messages (e.g., 502 for APIConnectionError, 500
for generic errors) so the endpoint returns translated HTTP errors; ensure the
caught exception is logged before raising the HTTPException. Also remove the
manual deduplication loop currently implemented and call
deduplicate_referenced_documents(referenced_documents) from utils.responses
instead to centralize logic and avoid duplication; update any variable names to
use the helper's return value where referenced in the streaming_query handler.
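The dedup semantics being centralized (order-preserving, first occurrence of each `(doc_url, doc_title)` pair wins) can be illustrated over plain dicts. Note the real `deduplicate_referenced_documents()` in `utils/responses.py` also normalizes URLs, which this sketch omits:

```python
from typing import Any


def deduplicate_referenced_documents(docs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first occurrence of each (doc_url, doc_title) pair, preserving order."""
    seen: set[tuple[Any, Any]] = set()
    out: list[dict[str, Any]] = []
    for doc in docs:
        key = (doc.get("doc_url"), doc.get("doc_title"))
        if key not in seen:
            seen.add(key)
            out.append(doc)
    return out


# Inline RAG documents come first, so they win ties against tool-based ones.
inline_docs = [{"doc_url": "https://a", "doc_title": "A"}]
tool_docs = [
    {"doc_url": "https://a", "doc_title": "A"},
    {"doc_url": "https://b", "doc_title": "B"},
]
merged = deduplicate_referenced_documents(inline_docs + tool_docs)
```

Because inline documents are concatenated first, BYOK-sourced references keep priority after deduplication.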

Contributor

@asimurka asimurka left a comment

LGTM, all my comments are resolved.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)

198-199: ⚠️ Potential issue | 🟠 Major

Preserve previously extracted reference_url when chunk metadata lacks it.

Line 198 and Line 204 currently overwrite reference_url with None in some cases, which can drop a valid URL from chunk.metadata.

Proposed fix
-                reference_url = chunk_meta.get("reference_url")
+                reference_url = reference_url or chunk_meta.get("reference_url")
...
-                reference_url = getattr(chunk_meta, "reference_url", None)
+                reference_url = reference_url or getattr(
+                    chunk_meta, "reference_url", None
+                )

Also applies to: 204-204

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 198 - 199, The code is overwriting a
previously extracted reference_url with None when chunk_meta lacks
"reference_url"; update the assignment so it only replaces reference_url if
chunk_meta contains a non-empty value (e.g., check if "reference_url" in
chunk_meta or use chunk_meta.get("reference_url", reference_url)) so existing
reference_url from chunk.metadata is preserved; apply this change to the places
where reference_url is set from chunk_meta (the reference_url variable and the
chunk_meta.get(...) usage) so you never intentionally drop a valid URL.
♻️ Duplicate comments (6)
docs/rag_guide.md (1)

335-339: ⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

This still triggers markdownlint MD040.

🛠️ Proposed fix
-```
+```bash
 curl -sX POST http://localhost:8080/v1/query \
     -H "Content-Type: application/json" \
     -d '{"query" : "how do I secure a nodejs application with keycloak?"}' | jq .
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/rag_guide.md around lines 335 - 339, The fenced code block containing
the curl example is missing a language specifier which triggers markdownlint
MD040; add a language tag (e.g., "bash") to the opening fence of the curl
snippet so the block starts with ```bash to resolve the lint warning and improve
syntax highlighting for the curl POST example.


</details>

</blockquote></details>
<details>
<summary>docs/byok_guide.md (1)</summary><blockquote>

`310-315`: _⚠️ Potential issue_ | _🟡 Minor_

**Clarify combined-mode behavior when Inline + Tool are both enabled.**

Lines 310–315 should explicitly mention that tool invocation remains model-driven (non-deterministic) and that inline source merging is BYOK-first/OKP-after without cross-source reranking.



<details>
<summary>📝 Proposed wording</summary>

```diff
 Both modes can be enabled simultaneously. Choose based on your latency and control preferences:
+
+> [!NOTE]
+> When Inline and Tool are both enabled, `file_search` tool usage is model-driven and non-deterministic.
+> Inline retrieval merges sources by appending OKP chunks after BYOK chunks (no cross-source reranking).
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs/byok_guide.md` around lines 310 - 315, Update the combined-mode
documentation to explicitly state that when both Inline RAG and Tool RAG are
enabled the tool calls remain model-driven (non-deterministic) and that inline
context merging follows a BYOK-first then OKP-after order with no cross-source
reranking; modify the text around the "Inline RAG" and "Tool RAG" table entries
to add a short sentence clarifying these behaviors and noting that tool
invocation timing is decided by the LLM while inline sources are merged
deterministically BYOK-first/OKP-second without reordering across sources.
```

</details>

</blockquote></details>
<details>
<summary>src/app/endpoints/streaming_query.py (1)</summary><blockquote>

`744-753`: _⚠️ Potential issue_ | _🟡 Minor_

**Use the shared referenced-document dedup helper here.**

Lines 746–753 reimplement deduplication locally using raw `(doc.doc_url, doc.doc_title)` keys. Reusing `deduplicate_referenced_documents()` from `src/utils/responses.py` keeps URL normalization behavior consistent across endpoints.



<details>
<summary>🛠️ Proposed fix</summary>

```diff
 from utils.responses import (
     build_mcp_tool_call_from_arguments_done,
     build_tool_call_summary,
     build_tool_result_from_mcp_output_item_done,
+    deduplicate_referenced_documents,
     extract_token_usage,
     extract_vector_store_ids_from_tools,
@@
-    if turn_summary.inline_rag_documents:
-        all_documents = turn_summary.inline_rag_documents + tool_based_documents
-        seen = set()
-        deduplicated_documents = []
-        for doc in all_documents:
-            key = (doc.doc_url, doc.doc_title)
-            if key not in seen:
-                seen.add(key)
-                deduplicated_documents.append(doc)
-        turn_summary.referenced_documents = deduplicated_documents
+    if turn_summary.inline_rag_documents:
+        all_documents = turn_summary.inline_rag_documents + tool_based_documents
+        turn_summary.referenced_documents = deduplicate_referenced_documents(all_documents)
     else:
         turn_summary.referenced_documents = tool_based_documents
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 744 - 753, Replace the
manual deduplication loop with the shared helper
deduplicate_referenced_documents by calling
deduplicate_referenced_documents(all_documents) (where all_documents is
turn_summary.inline_rag_documents + tool_based_documents) and assigning its
result to turn_summary.referenced_documents; also ensure the
deduplicate_referenced_documents symbol from src/utils/responses.py is imported
into streaming_query.py if not already present so URL normalization and dedup
behavior remain consistent across endpoints.
```

</details>

</blockquote></details>
<details>
<summary>src/utils/responses.py (1)</summary><blockquote>

`169-170`: _⚠️ Potential issue_ | _🟠 Major_

**Do not list vector stores when Tool RAG is disabled.**

At Line 170, `get_vector_store_ids()` is called unconditionally. If Tool RAG is off and `vector_stores.list()` fails, the whole request can fail for a disabled feature.



<details>
<summary>🛠️ Proposed fix</summary>

```diff
     toolgroups: list[InputTool] = []
-    # Get vector stores for RAG tools - use specified ones or fetch all
-    vector_store_ids = await get_vector_store_ids(client, vector_store_ids)
-
-    # Add RAG tools if vector stores are available
-    rag_tools = get_rag_tools(vector_store_ids)
-    if rag_tools:
-        toolgroups.extend(rag_tools)
+    # Add RAG tools only when Tool RAG is enabled
+    if configuration.rag.tool.byok.enabled:
+        resolved_vector_store_ids = await get_vector_store_ids(client, vector_store_ids)
+        rag_tools = get_rag_tools(resolved_vector_store_ids)
+        if rag_tools:
+            toolgroups.extend(rag_tools)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@src/utils/responses.py` around lines 169 - 170, Currently
get_vector_store_ids(client, vector_store_ids) is invoked unconditionally which
can cause failures when Tool RAG is disabled; update the flow so that
get_vector_store_ids is only called when the Tool RAG feature flag is enabled
(check the existing flag/utility that indicates Tool RAG is active), otherwise
skip the call and ensure vector_store_ids remains empty or unchanged;
specifically modify the code around the vector_store_ids assignment (the call to
get_vector_store_ids) to guard with the Tool RAG check used elsewhere in the
codebase so list() is never invoked for a disabled feature.
```

</details>

</blockquote></details>
<details>
<summary>src/utils/vector_search.py (2)</summary><blockquote>

`281-281`: _⚠️ Potential issue_ | _🟠 Major_

**Avoid logging raw Solr payload objects.**

Line 281 and Line 446 log full objects that can contain retrieved chunk content/metadata. Prefer IDs/counts to reduce data exposure in debug logs.  
 

<details>
<summary>Proposed fix</summary>

```diff
-        logger.debug("Extract doc ids from chunk: %s", chunk)
+        logger.debug(
+            "Extract doc ids from chunk_id=%s (has_metadata=%s, has_chunk_metadata=%s)",
+            getattr(chunk, "chunk_id", None),
+            bool(getattr(chunk, "metadata", None)),
+            bool(getattr(chunk, "chunk_metadata", None)),
+        )
...
-            logger.debug("Solr query response: %s", query_response)
+            logger.debug(
+                "Solr query returned chunks=%d scores=%d",
+                len(getattr(query_response, "chunks", []) or []),
+                len(getattr(query_response, "scores", []) or []),
+            )
```
</details>


Also applies to: 446-446

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` at line 281, The debug logs currently output full
Solr payload objects (e.g., the logger.debug call that logs "Extract doc ids
from chunk: %s" and the similar debug at the other location), which can expose
chunk content/metadata; change these debug statements to log only safe
identifiers and summary counts (e.g., chunk id(s), document ids list length,
number of fields) instead of the full chunk payload. Locate the logger.debug
calls that include the literal text "Extract doc ids from chunk" and the other
debug at the later occurrence and replace their message/arguments to include
chunk.get('id') or explicit doc id lists and len(...) summaries rather than the
entire chunk object. Ensure both occurrences are updated.
```

</details>

---

`593-599`: _⚠️ Potential issue_ | _🟠 Major_

**Handle dict-shaped `chunk_metadata` in Solr chunk conversion.**

Line 593-599 only supports attribute-style access. When `chunk_metadata` is a `dict`, `document_id` and offline URL fallback are skipped.  
  

<details>
<summary>Proposed fix</summary>

```diff
         if hasattr(chunk, "chunk_metadata") and chunk.chunk_metadata:
-            if hasattr(chunk.chunk_metadata, "document_id"):
-                doc_id = chunk.chunk_metadata.document_id
+            chunk_meta = chunk.chunk_metadata
+            if isinstance(chunk_meta, dict):
+                doc_id = chunk_meta.get("document_id") or chunk_meta.get("doc_id")
+            else:
+                doc_id = getattr(chunk_meta, "document_id", None) or getattr(
+                    chunk_meta, "doc_id", None
+                )
+            if doc_id:
                 attributes["document_id"] = doc_id
                 # Build URL if not already set
                 if "doc_url" not in attributes and offline and doc_id:
                     attributes["doc_url"] = urljoin(constants.MIMIR_DOC_URL, doc_id)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 593 - 599, The current conversion
only handles attribute-style chunk.chunk_metadata and skips dict-shaped
metadata; update the logic around chunk, chunk.chunk_metadata, and attributes to
handle both mapping and object styles: if chunk has chunk_metadata, check if
isinstance(chunk.chunk_metadata, dict) and read doc_id =
chunk.chunk_metadata.get("document_id"), else fall back to
getattr(chunk.chunk_metadata, "document_id", None); then set
attributes["document_id"] when doc_id is present and apply the same offline
doc_url fallback (using urljoin(constants.MIMIR_DOC_URL, doc_id)) when "doc_url"
not in attributes. Ensure you reference the same symbols (chunk, chunk_metadata,
document_id, attributes, offline, urljoin, constants.MIMIR_DOC_URL) so the
change fits into the existing solr chunk conversion code.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (2)</summary><blockquote>

<details>
<summary>lightspeed-stack.yaml (1)</summary><blockquote>

`37-41`: **Consider using OKP wording in the inline RAG comment for consistency.**

The config key is `okp`, so updating the comment text avoids mixed terminology.

<details>
<summary>✏️ Suggested wording</summary>

```diff
-  # Inline RAG (inject context pre-query with RAG from Solr and BYOK vector stores)
+  # Inline RAG (inject context pre-query with RAG from OKP (Solr-backed) and BYOK vector stores)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@lightspeed-stack.yaml` around lines 37 - 41, Update the inline RAG comment to
use "OKP" terminology to match the config key; specifically change the comment
above the inline block that references RAG from "OKP and BYOK" or similar to
explicitly use "OKP" (matching the okp key) so the comment aligns with the
inline.okp setting and avoids mixed terminology.
```

</details>

</blockquote></details>
<details>
<summary>src/utils/responses.py (1)</summary><blockquote>

`343-355`: **Align `get_rag_tools` type/docstring with actual return behavior.**

Lines 343–355 always return a list (`[]` or populated list), but the signature/docstring still describe `Optional`/`None`.



<details>
<summary>♻️ Proposed fix</summary>

```diff
-def get_rag_tools(vector_store_ids: list[str]) -> Optional[list[InputToolFileSearch]]:
+def get_rag_tools(vector_store_ids: list[str]) -> list[InputToolFileSearch]:
@@
-    Returns:
-        List containing file_search tool configuration, or None if RAG as tool is disabled
+    Returns:
+        List containing file_search tool configuration, or empty list if disabled/unavailable
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@src/utils/responses.py` around lines 343 - 355, The get_rag_tools function
currently always returns a list (either [] or a populated list) but its
signature and docstring claim Optional/None; update the function signature to
return list[InputToolFileSearch] (remove Optional) and revise the docstring to
state it always returns a list of file_search tool configurations (empty list
when RAG-as-tool is disabled via configuration.rag.tool.byok.enabled), and
ensure any type imports or annotations referencing InputToolFileSearch remain
correct.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/openapi.json:

  • Around line 5527-5531: The OpenAPI schema is missing the backward-compatible
    default for ToolRagConfiguration.byok (the runtime defaults
    ToolRagConfiguration.byok.enabled to true), so update the schema to declare a
    default for ToolRagConfiguration.byok that sets enabled: true; specifically add
    a default object for the ToolRagConfiguration.byok property mirroring
    ByokRagConfiguration (with "enabled": true) so the OpenAPI JSON reflects the
    actual default behavior of ToolRagConfiguration.byok.enabled.
  • Line 4450: The GET and POST operations for the /a2a path both use the same
    operationId "handle_a2a_jsonrpc_a2a_post", which must be globally unique; update
    the GET operation's operationId to "handle_a2a_jsonrpc_a2a_get" so the two
    operations differ. Locate the /a2a GET operation in the OpenAPI spec where
    operationId is "handle_a2a_jsonrpc_a2a_post" and change it to
    "handle_a2a_jsonrpc_a2a_get", then run your OpenAPI validator and regenerate any
    client code to ensure no lingering duplicate operationId references remain.

In @docs/rag_guide.md:

  • Around line 29-33: Update the Inline RAG description to use OKP terminology
    instead of "Solr and/or BYOK": edit the sentence under "Inline RAG" so it
    references context fetched from the okp vector stores (and/or Solr if still
    applicable) and aligns with the config/API naming (use the term okp where the
    guide currently says "BYOK"); ensure the heading "Inline RAG" and the phrase
    "context is fetched" remain intact while replacing the offending wording.

In @src/app/endpoints/query.py:

  • Line 170: The moderation check is being applied to responses_params.input
    after inline RAG content may have been injected via inline_rag_context, causing
    false positives; change the flow so you preserve the original user query (e.g.,
    store it in a separate variable like original_user_input) and run moderation
    against that original text before any inline_rag_context is attached, then only
    build responses_params.input by combining the already-moderated
    original_user_input with inline_rag_context afterwards; locate usages of
    inline_rag_context and responses_params.input in query.py (and the moderation
    call/function) and update them so moderation consumes the unmodified user input
    variable rather than the post-injection responses_params.input.
  • Around line 157-158: The call to build_rag_context(client, query_request,
    configuration) can raise APIConnectionError and API status exceptions but is not
    wrapped; catch those exceptions around the inline_rag_context assignment and
    handle them the same way retrieve_response(...) does: catch APIConnectionError
    and return/raise a 503-style connectivity response, and catch API status errors
    and propagate/marshal their HTTP status and message accordingly; reference the
    build_rag_context call and mirror the exception handling logic used for
    retrieve_response to ensure consistent HTTP error mapping.

In @src/utils/query.py:

  • Around line 214-215: The code concatenates inline_rag_context directly onto
    input_text (variables inline_rag_context and input_text), which can merge
    prompts; update the concatenation in the block that checks inline_rag_context to
    ensure a fixed separator is inserted—e.g., if inline_rag_context does not start
    with whitespace or newline, prepend a separator like "\n\n" (or always append
    "\n\n" before inline_rag_context) before adding to input_text so the injected
    context is clearly separated.

In @src/utils/vector_search.py:

  • Around line 126-129: The metadata label is incorrect: in
    src/utils/vector_search.py the code appends f"document_id: {chunk.source}" to
    metadata_parts even though chunk.source is a source identifier; update that
    formatted string to use the correct label (e.g., f"source: {chunk.source}" or
    f"source_id: {chunk.source}") so metadata_parts reflects the true meaning of
    chunk.source (update the line that builds metadata_parts where chunk.source is
    referenced).

In @tests/unit/utils/test_responses.py:

  • Around line 2280-2282: Replace the improper construction of APIConnectionError
    using `__new__` with a normally initialized instance so the exception's
    initialization logic runs; specifically change the side_effect assignment on
    client_mock.vector_stores.list to raise an APIConnectionError created via its
    constructor (provide a descriptive message and a minimal/mock request object
    consistent with other tests) instead of
    APIConnectionError.__new__(APIConnectionError), preserving the intent of the
    test that the call raises this error.

Outside diff comments:
In @src/utils/vector_search.py:

  • Around line 198-199: The code is overwriting a previously extracted
    reference_url with None when chunk_meta lacks "reference_url"; update the
    assignment so it only replaces reference_url if chunk_meta contains a non-empty
    value (e.g., check if "reference_url" in chunk_meta or use
    chunk_meta.get("reference_url", reference_url)) so existing reference_url from
    chunk.metadata is preserved; apply this change to the places where reference_url
    is set from chunk_meta (the reference_url variable and the chunk_meta.get(...)
    usage) so you never intentionally drop a valid URL.
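
A minimal sketch of the non-destructive fallback suggested above; a plain dict stands in for the project's chunk metadata:

```python
def merge_reference_url(reference_url, chunk_meta):
    """Only override a previously extracted URL when chunk_meta carries a non-empty one."""
    # dict.get returns None when the key is missing; `or` also treats an
    # empty string as "no value", falling back to the existing URL.
    return chunk_meta.get("reference_url") or reference_url
```

This keeps a valid URL from `chunk.metadata` intact whenever `chunk_meta` has nothing better to offer.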

Duplicate comments:
In @docs/byok_guide.md:

  • Around line 310-315: Update the combined-mode documentation to explicitly
    state that when both Inline RAG and Tool RAG are enabled the tool calls remain
    model-driven (non-deterministic) and that inline context merging follows a
    BYOK-first then OKP-after order with no cross-source reranking; modify the text
    around the "Inline RAG" and "Tool RAG" table entries to add a short sentence
    clarifying these behaviors and noting that tool invocation timing is decided by
    the LLM while inline sources are merged deterministically BYOK-first/OKP-second
    without reordering across sources.

In @docs/rag_guide.md:

  • Around line 335-339: The fenced code block containing the curl example is
    missing a language specifier which triggers markdownlint MD040; add a language
    tag (e.g., "bash") to the opening fence of the curl snippet so the block starts
    with ```bash to resolve the lint warning and improve syntax highlighting for the
    curl POST example.

In @src/app/endpoints/streaming_query.py:

  • Around line 744-753: Replace the manual deduplication loop with the shared
    helper deduplicate_referenced_documents by calling
    deduplicate_referenced_documents(all_documents) (where all_documents is
    turn_summary.inline_rag_documents + tool_based_documents) and assigning its
    result to turn_summary.referenced_documents; also ensure the
    deduplicate_referenced_documents symbol from src/utils/responses.py is imported
    into streaming_query.py if not already present so URL normalization and dedup
    behavior remain consistent across endpoints.
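
The shared helper itself is not reproduced in this thread; below is a simplified sketch of first-wins deduplication keyed on a normalized URL, using plain dicts and an assumed `doc_url` field rather than the project's document model:

```python
def deduplicate_referenced_documents(documents):
    """Keep the first occurrence of each document, keyed on a normalized URL."""
    seen = set()
    unique = []
    for doc in documents:
        # Crude normalization: drop a trailing slash and lowercase the URL.
        key = (doc.get("doc_url") or "").rstrip("/").lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Centralizing this in one helper, as the comment suggests, keeps URL normalization identical across the streaming and non-streaming endpoints.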

In @src/utils/responses.py:

  • Around line 169-170: Currently get_vector_store_ids(client, vector_store_ids)
    is invoked unconditionally which can cause failures when Tool RAG is disabled;
    update the flow so that get_vector_store_ids is only called when the Tool RAG
    feature flag is enabled (check the existing flag/utility that indicates Tool RAG
    is active), otherwise skip the call and ensure vector_store_ids remains empty or
    unchanged; specifically modify the code around the vector_store_ids assignment
    (the call to get_vector_store_ids) to guard with the Tool RAG check used
    elsewhere in the codebase so list() is never invoked for a disabled feature.

In @src/utils/vector_search.py:

  • Line 281: The debug logs currently output full Solr payload objects (e.g., the
    logger.debug call that logs "Extract doc ids from chunk: %s" and the similar
    debug at the other location), which can expose chunk content/metadata; change
    these debug statements to log only safe identifiers and summary counts (e.g.,
    chunk id(s), document ids list length, number of fields) instead of the full
    chunk payload. Locate the logger.debug calls that include the literal text
    "Extract doc ids from chunk" and the other debug at the later occurrence and
    replace their message/arguments to include chunk.get('id') or explicit doc id
    lists and len(...) summaries rather than the entire chunk object. Ensure both
    occurrences are updated.
  • Around line 593-599: The current conversion only handles attribute-style
    chunk.chunk_metadata and skips dict-shaped metadata; update the logic around
    chunk, chunk.chunk_metadata, and attributes to handle both mapping and object
    styles: if chunk has chunk_metadata, check if isinstance(chunk.chunk_metadata,
    dict) and read doc_id = chunk.chunk_metadata.get("document_id"), else fall back
    to getattr(chunk.chunk_metadata, "document_id", None); then set
    attributes["document_id"] when doc_id is present and apply the same offline
    doc_url fallback (using urljoin(constants.MIMIR_DOC_URL, doc_id)) when "doc_url"
    not in attributes. Ensure you reference the same symbols (chunk, chunk_metadata,
    document_id, attributes, offline, urljoin, constants.MIMIR_DOC_URL) so the
    change fits into the existing solr chunk conversion code.
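
For the debug-log hardening described above, one pattern is to build the log message from identifiers and counts only; the functions below are a hypothetical sketch assuming dict-shaped Solr chunks, not the project's actual helpers:

```python
import logging

logger = logging.getLogger(__name__)


def summarize_chunk_for_log(chunk: dict, doc_ids: list) -> str:
    """Build a log line containing only safe identifiers and counts.

    Chunk content and metadata values never appear in the message, so
    debug logs cannot leak document text.
    """
    return f"chunk id={chunk.get('id')} doc_ids={len(doc_ids)} fields={len(chunk)}"


def log_doc_id_extraction(chunk: dict, doc_ids: list) -> None:
    logger.debug("Extract doc ids from chunk: %s", summarize_chunk_for_log(chunk, doc_ids))
```

Passing the summary string (rather than the chunk object) to `logger.debug` makes it impossible for a verbose log level to expose payload content.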

Nitpick comments:
In @lightspeed-stack.yaml:

  • Around line 37-41: Update the inline RAG comment to use "OKP" terminology to
    match the config key; specifically change the comment above the inline block
    that references RAG from "OKP and BYOK" or similar to explicitly use "OKP"
    (matching the okp key) so the comment aligns with the inline.okp setting and
    avoids mixed terminology.

In @src/utils/responses.py:

  • Around line 343-355: The get_rag_tools function currently always returns a
    list (either [] or a populated list) but its signature and docstring claim
    Optional/None; update the function signature to return list[InputToolFileSearch]
    (remove Optional) and revise the docstring to state it always returns a list of
    file_search tool configurations (empty list when RAG-as-tool is disabled via
    configuration.rag.tool.byok.enabled), and ensure any type imports or annotations
    referencing InputToolFileSearch remain correct.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 22fe72da916cb1e8723d45f28384876cd57eee4a and a4075c6d6083b7247ae0c0890c59e2c8c0c5a944.

</details>

<details>
<summary>📒 Files selected for processing (19)</summary>

* `docs/byok_guide.md`
* `docs/config.md`
* `docs/openapi.json`
* `docs/rag_guide.md`
* `examples/lightspeed-stack-byok-rag.yaml`
* `lightspeed-stack.yaml`
* `src/app/endpoints/query.py`
* `src/app/endpoints/streaming_query.py`
* `src/constants.py`
* `src/llama_stack_configuration.py`
* `src/models/config.py`
* `src/utils/query.py`
* `src/utils/responses.py`
* `src/utils/types.py`
* `src/utils/vector_search.py`
* `tests/unit/app/endpoints/test_query.py`
* `tests/unit/models/config/test_dump_configuration.py`
* `tests/unit/test_llama_stack_configuration.py`
* `tests/unit/utils/test_responses.py`

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* examples/lightspeed-stack-byok-rag.yaml
* tests/unit/app/endpoints/test_query.py
* src/llama_stack_configuration.py

</details>

</details>


"summary": "Handle A2A Jsonrpc",
"description": "Handle A2A JSON-RPC requests following the A2A protocol specification.\n\nThis endpoint uses the DefaultRequestHandler from the A2A SDK to handle\nall JSON-RPC requests including message/send, message/stream, etc.\n\nThe A2A SDK application is created per-request to include authentication\ncontext while still leveraging FastAPI's authorization middleware.\n\nAutomatically detects streaming requests (message/stream JSON-RPC method)\nand returns a StreamingResponse to enable real-time chunk delivery.\n\nArgs:\n request: FastAPI request object\n auth: Authentication tuple\n mcp_headers: MCP headers for context propagation\n\nReturns:\n JSON-RPC response or streaming response",
"operationId": "handle_a2a_jsonrpc_a2a_get",
"operationId": "handle_a2a_jsonrpc_a2a_post",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the actual content around lines 4450 and 4468
echo "=== Content around line 4450 ==="
sed -n '4445,4455p' docs/openapi.json

echo ""
echo "=== Content around line 4468 ==="
sed -n '4463,4473p' docs/openapi.json

Repository: lightspeed-core/lightspeed-stack

Length of output: 2378


🏁 Script executed:

#!/bin/bash
# Run the duplicate operationId detection
python - <<'PY'
import json, collections, pathlib

spec = json.loads(pathlib.Path("docs/openapi.json").read_text())
ids = collections.defaultdict(list)

for path, methods in spec.get("paths", {}).items():
    for method, op in methods.items():
        if isinstance(op, dict):
            oid = op.get("operationId")
            if oid:
                ids[oid].append(f"{method.upper()} {path}")

dups = {k: v for k, v in ids.items() if len(v) > 1}
if dups:
    print("=== Duplicate operationIds found ===")
    for oid, locations in sorted(dups.items()):
        print(f"{oid}:")
        for loc in sorted(locations):
            print(f"  - {loc}")
else:
    print("No duplicate operationIds found")
PY

Repository: lightspeed-core/lightspeed-stack

Length of output: 168


🏁 Script executed:

#!/bin/bash
# Find the HTTP method declarations for lines 4450 and 4468
# Look backwards from each line to find the method keyword

echo "=== Context for line 4450 (looking back for method) ==="
sed -n '4420,4450p' docs/openapi.json | tail -40

echo ""
echo "=== Context for line 4468 (looking back for method) ==="
sed -n '4440,4468p' docs/openapi.json | tail -40

Repository: lightspeed-core/lightspeed-stack

Length of output: 4465


Duplicate operationId values on /a2a GET and POST endpoints.

Line 4450 (GET) and line 4468 (POST) both define operationId: "handle_a2a_jsonrpc_a2a_post". This violates the OpenAPI specification requirement that operation IDs be globally unique and will break API client code generation and method routing.

Change the GET operation ID at line 4450 to "handle_a2a_jsonrpc_a2a_get":

Proposed fix
 "/a2a": {
     "get": {
         "operationId": "handle_a2a_jsonrpc_a2a_get",
     },
     "post": {
         "operationId": "handle_a2a_jsonrpc_a2a_post",
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/openapi.json` at line 4450, The GET and POST operations for the /a2a
path both use the same operationId "handle_a2a_jsonrpc_a2a_post", which must be
globally unique; update the GET operation's operationId to
"handle_a2a_jsonrpc_a2a_get" so the two operations differ. Locate the /a2a GET
operation in the OpenAPI spec where operationId is "handle_a2a_jsonrpc_a2a_post"
and change it to "handle_a2a_jsonrpc_a2a_get", then run your OpenAPI validator
and regenerate any client code to ensure no lingering duplicate operationId
references remain.

Comment on lines +5527 to +5531
"enabled": {
"type": "boolean",
"title": "BYOK RAG enabled",
"description": "When True, queries BYOK vector stores for RAG context.",
"default": false

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import json, pathlib
spec = json.loads(pathlib.Path("docs/openapi.json").read_text())
schemas = spec["components"]["schemas"]

print("=== Schema Defaults ===")
print("ByokRagConfiguration.enabled.default =",
      schemas["ByokRagConfiguration"]["properties"]["enabled"].get("default"))
print("ToolRagConfiguration.byok.default =",
      schemas["ToolRagConfiguration"]["properties"]["byok"].get("default"))

print("\n=== Full ByokRagConfiguration ===")
print(json.dumps(schemas["ByokRagConfiguration"], indent=2))

print("\n=== Full ToolRagConfiguration.byok property ===")
print(json.dumps(schemas["ToolRagConfiguration"]["properties"]["byok"], indent=2))
PY

echo
echo "=== Checking model defaults in source ==="
fd 'config\.py$' src --type f 2>/dev/null | while read cfg_file; do
  echo "Found: $cfg_file"
  echo "--- Content ---"
  rg -n -A5 'class (ByokRagConfiguration|ToolRagConfiguration)' "$cfg_file" || true
done

Repository: lightspeed-core/lightspeed-stack

Length of output: 1303


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get full ToolRagConfiguration definition
echo "=== Full ToolRagConfiguration definition ==="
sed -n '1743,1800p' src/models/config.py

echo
echo "=== Check for backward-compatibility context ==="
rg -i 'backward|compat|tool.*rag.*default' src --type py -B2 -A2 | head -50

echo
echo "=== Check git history for recent changes to RAG defaults ==="
git log --oneline --all -- src/models/config.py docs/openapi.json 2>/dev/null | head -20 || echo "Git history unavailable"

Repository: lightspeed-core/lightspeed-stack

Length of output: 5499


Add backward-compatible default to ToolRagConfiguration.byok in OpenAPI schema.

The source code (src/models/config.py:1748) explicitly defaults ToolRagConfiguration.byok.enabled to true for backward compatibility, but the OpenAPI schema does not capture this. The schema currently shows only ByokRagConfiguration.enabled defaulting to false, leaving ToolRagConfiguration.byok without a default, which contradicts the actual API behavior.

Proposed fix
             "ToolRagConfiguration": {
                 "properties": {
                     "byok": {
                         "$ref": "#/components/schemas/ByokRagConfiguration",
+                        "default": {
+                            "enabled": true
+                        },
                         "title": "BYOK RAG configuration",
                         "description": "Configuration for BYOK RAG as a tool."
                     }
                 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/openapi.json` around lines 5527 - 5531, The OpenAPI schema is missing
the backward-compatible default for ToolRagConfiguration.byok (the runtime
defaults ToolRagConfiguration.byok.enabled to true), so update the schema to
declare a default for ToolRagConfiguration.byok that sets enabled: true;
specifically add a default object for the ToolRagConfiguration.byok property
mirroring ByokRagConfiguration (with "enabled": true) so the OpenAPI JSON
reflects the actual default behavior of ToolRagConfiguration.byok.enabled.

Comment on lines +29 to +33
Lightspeed Core Stack (LCS) supports two complementary RAG strategies:

The system operates a chain of command. The **Agent** is the orchestrator, using the LLM as its reasoning engine. When a plan requires external information, the Agent queries the **Vector Store**. This is your database of indexed knowledge, which you are responsible for creating before running the stack. The **Embedding Model** is used to convert the queries to vectors.
- **Inline RAG**: context is fetched from Solr and/or BYOK vector stores and injected into every query before the LLM responds. No tool calls are required.
- **Tool RAG**: the LLM can call the `file_search` tool during generation to retrieve context on demand from BYOK vector stores.


⚠️ Potential issue | 🟡 Minor

Use OKP terminology in the high-level Inline RAG definition.

The guide currently says “Solr and/or BYOK,” but the config/API surface uses okp; aligning wording avoids confusion.

✏️ Suggested wording
-- **Inline RAG**: context is fetched from Solr and/or BYOK vector stores and injected into every query before the LLM responds. No tool calls are required.
+- **Inline RAG**: context is fetched from OKP (Solr-backed) and/or BYOK vector stores and injected into every query before the LLM responds. No tool calls are required.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Lightspeed Core Stack (LCS) supports two complementary RAG strategies:
The system operates a chain of command. The **Agent** is the orchestrator, using the LLM as its reasoning engine. When a plan requires external information, the Agent queries the **Vector Store**. This is your database of indexed knowledge, which you are responsible for creating before running the stack. The **Embedding Model** is used to convert the queries to vectors.
- **Inline RAG**: context is fetched from Solr and/or BYOK vector stores and injected into every query before the LLM responds. No tool calls are required.
- **Tool RAG**: the LLM can call the `file_search` tool during generation to retrieve context on demand from BYOK vector stores.
Lightspeed Core Stack (LCS) supports two complementary RAG strategies:
- **Inline RAG**: context is fetched from OKP (Solr-backed) and/or BYOK vector stores and injected into every query before the LLM responds. No tool calls are required.
- **Tool RAG**: the LLM can call the `file_search` tool during generation to retrieve context on demand from BYOK vector stores.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/rag_guide.md` around lines 29 - 33, Update the Inline RAG description to
use OKP terminology: edit the sentence under "Inline RAG" so that "Solr and/or
BYOK" becomes "OKP (Solr-backed) and/or BYOK", aligning with the config/API
naming (use the term `okp` where the guide currently says "Solr"); ensure the
heading "Inline RAG" and the phrase "context is fetched" remain intact while
replacing the offending wording.

Comment on lines +157 to +158
# Build RAG context from Inline RAG sources
inline_rag_context = await build_rag_context(client, query_request, configuration)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Examine the query.py file at lines 157-158 and surrounding context
cat -n src/app/endpoints/query.py | sed -n '150,170p'

Repository: lightspeed-core/lightspeed-stack

Length of output: 909


🏁 Script executed:

#!/bin/bash
# Check for error handling imports and patterns in query.py
head -50 src/app/endpoints/query.py | cat -n

Repository: lightspeed-core/lightspeed-stack

Length of output: 2032


🏁 Script executed:

#!/bin/bash
# Search for APIConnectionError handling and build_rag_context function
rg -n 'APIConnectionError|build_rag_context|ServiceUnavailableResponse|handle_known_apistatus_errors' \
  src/app/endpoints/query.py src/utils/vector_search.py 2>/dev/null || echo "Searching broader scope..."

Repository: lightspeed-core/lightspeed-stack

Length of output: 864


🏁 Script executed:

#!/bin/bash
# Find files with error handling utilities
fd -e py -t f | xargs grep -l 'ServiceUnavailableResponse\|handle_known_apistatus_errors' 2>/dev/null

Repository: lightspeed-core/lightspeed-stack

Length of output: 829


🏁 Script executed:

#!/bin/bash
# Check existing error handling pattern at lines 311-318
cat -n src/app/endpoints/query.py | sed -n '305,325p'

Repository: lightspeed-core/lightspeed-stack

Length of output: 1049


🏁 Script executed:

#!/bin/bash
# Check if build_rag_context handles these errors internally
cat -n src/utils/vector_search.py | sed -n '479,530p'

Repository: lightspeed-core/lightspeed-stack

Length of output: 2083


🏁 Script executed:

#!/bin/bash
# Check if there's any try-except wrapping the entire query operation
cat -n src/app/endpoints/query.py | sed -n '100,200p'

Repository: lightspeed-core/lightspeed-stack

Length of output: 4813


Add error handling for inline RAG context building.

build_rag_context(...) can raise APIConnectionError and API status errors without being caught, causing Llama Stack connectivity failures to propagate as generic 500 responses. This is inconsistent with the error handling applied to retrieve_response(...) at line 187, which properly maps these exceptions to appropriate HTTP errors (503 for connection failures, consistent status responses for API errors).

🛠️ Suggested fix
-    inline_rag_context = await build_rag_context(client, query_request, configuration)
+    try:
+        inline_rag_context = await build_rag_context(
+            client, query_request, configuration
+        )
+    except APIConnectionError as e:
+        error_response = ServiceUnavailableResponse(
+            backend_name="Llama Stack",
+            cause=str(e),
+        )
+        raise HTTPException(**error_response.model_dump()) from e
+    except (LLSApiStatusError, OpenAIAPIStatusError) as e:
+        error_response = handle_known_apistatus_errors(e, query_request.model or "unknown")
+        raise HTTPException(**error_response.model_dump()) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 157 - 158, The call to
build_rag_context(client, query_request, configuration) can raise
APIConnectionError and API status exceptions but is not wrapped; catch those
exceptions around the inline_rag_context assignment and handle them the same way
retrieve_response(...) does: catch APIConnectionError and return/raise a
503-style connectivity response, and catch API status errors and
propagate/marshal their HTTP status and message accordingly; reference the
build_rag_context call and mirror the exception handling logic used for
retrieve_response to ensure consistent HTTP error mapping.

Comment on lines +214 to +215
if inline_rag_context:
input_text += inline_rag_context

⚠️ Potential issue | 🟡 Minor

Add a fixed separator before injected inline context.

At Line 215, inline_rag_context is concatenated directly to the query text. If the context string does not start with whitespace/newlines, prompt text can merge unexpectedly.

🛠️ Proposed fix
-    if inline_rag_context:
-        input_text += inline_rag_context
+    if inline_rag_context:
+        input_text += f"\n\n{inline_rag_context.lstrip()}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/query.py` around lines 214 - 215, The code concatenates
inline_rag_context directly onto input_text (variables inline_rag_context and
input_text), which can merge prompts; update the concatenation in the block that
checks inline_rag_context to ensure a fixed separator is inserted—e.g., if
inline_rag_context does not start with whitespace or newline, prepend a
separator like "\n\n" (or always append "\n\n" before inline_rag_context) before
adding to input_text so the injected context is clearly separated.

Comment on lines +126 to +129
if chunk.source:
metadata_parts.append(f"document_id: {chunk.source}")
if chunk.score is not None:
metadata_parts.append(f"score: {chunk.score:.4f}")

⚠️ Potential issue | 🟡 Minor

Use correct metadata label in formatted context.

Line 127 labels chunk.source as document_id, but it’s a source identifier. This mislabels metadata passed to the model.

Proposed fix
-        if chunk.source:
-            metadata_parts.append(f"document_id: {chunk.source}")
+        if chunk.source:
+            metadata_parts.append(f"source: {chunk.source}")
+        if chunk.attributes and chunk.attributes.get("document_id"):
+            metadata_parts.append(f"document_id: {chunk.attributes['document_id']}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 126 - 129, The metadata label is
incorrect: in src/utils/vector_search.py the code appends f"document_id:
{chunk.source}" to metadata_parts even though chunk.source is a source
identifier; update that formatted string to use the correct label (e.g.,
f"source: {chunk.source}" or f"source_id: {chunk.source}") so metadata_parts
reflects the true meaning of chunk.source (update the line that builds
metadata_parts where chunk.source is referenced).

Comment on lines +2280 to +2282
client_mock.vector_stores.list.side_effect = APIConnectionError.__new__(
APIConnectionError
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate the test file and verify the code snippet
fd -t f test_responses.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 153


🏁 Script executed:

# Search for APIConnectionError definition and imports
rg -n "class APIConnectionError" --type py

Repository: lightspeed-core/lightspeed-stack

Length of output: 58


🏁 Script executed:

# Check how APIConnectionError is imported in the test file
rg -n "APIConnectionError|from.*llama" tests/unit/utils/test_responses.py | head -20

Repository: lightspeed-core/lightspeed-stack

Length of output: 1007


🏁 Script executed:

# Look at the specific lines mentioned in the review (2280-2282)
rg -n -A5 -B5 "APIConnectionError.__new__" tests/unit/utils/test_responses.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 586


🏁 Script executed:

# Search for other uses of APIConnectionError in tests to understand patterns
rg -n "APIConnectionError" tests/ --type py -A2 | head -40

Repository: lightspeed-core/lightspeed-stack

Length of output: 2731


Use proper initialization for APIConnectionError instead of __new__ to match test patterns.

This is the only place in the test suite where APIConnectionError is constructed via __new__. All other uses (lines 865, 951, 1090, 1115 in this file and across other test files) instantiate it normally with message and request parameters. This inconsistency bypasses initialization logic.

🛠️ Proposed fix
-        client_mock.vector_stores.list.side_effect = APIConnectionError.__new__(
-            APIConnectionError
-        )
+        client_mock.vector_stores.list.side_effect = APIConnectionError(
+            message="Connection failed", request=mocker.Mock()
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
client_mock.vector_stores.list.side_effect = APIConnectionError.__new__(
APIConnectionError
)
client_mock.vector_stores.list.side_effect = APIConnectionError(
message="Connection failed", request=mocker.Mock()
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_responses.py` around lines 2280 - 2282, Replace the
improper construction of APIConnectionError using __new__ with a normal
initialized instance so the exception's initialization logic runs; specifically
change the side_effect assignment on client_mock.vector_stores.list to raise an
APIConnectionError created via its constructor (provide a descriptive message
and a minimal/mock request object consistent with other tests) instead of
APIConnectionError.__new__(APIConnectionError), preserving the intent of the
test that the call raises this error.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/llama_stack_configuration.py (1)

1-6: ⚠️ Potential issue | 🟡 Minor

Fix Black formatting to pass CI.

The pipeline indicates Black would reformat this file. Run black src/llama_stack_configuration.py to fix formatting issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 1 - 6, This file fails Black
formatting; run Black on the module (e.g., black
src/llama_stack_configuration.py) or reformat the file to match Black's style so
CI passes; ensure the module docstring and any definitions such as
generate_configuration are formatted according to Black (consistent line
breaks/quotes/spacing) before committing.
src/utils/vector_search.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Black check is failing for this file.

CI reports src/utils/vector_search.py would be reformatted. Please run Black on this file before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` at line 1, CI reports src/utils/vector_search.py
is not formatted with Black; run the Black formatter (e.g., black
src/utils/vector_search.py) or apply the project's Black config to reformat the
file so it matches CI; ensure the file-level docstring and all code in the
vector_search.py module are reformatted accordingly and re-run linters/CI to
confirm the change.
src/app/endpoints/query.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Black check is failing for this file.

CI reports src/app/endpoints/query.py would be reformatted. Please run Black on this file before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` at line 1, The file src/app/endpoints/query.py
fails Black formatting; run Black (e.g., python -m black
src/app/endpoints/query.py) to auto-reformat the file and commit the changes so
CI passes; ensure the pylint disable header ("# pylint:
disable=too-many-locals,too-many-branches,too-many-nested-blocks") remains
intact and that any imports/functions within query.py (e.g., endpoint handlers
or helper functions defined there) are updated to match Black's formatting
before pushing the commit.
♻️ Duplicate comments (3)
src/utils/vector_search.py (2)

593-600: ⚠️ Potential issue | 🟠 Major

Support dict-based chunk_metadata when extracting Solr document IDs.

The current conversion only handles attribute access (chunk.chunk_metadata.document_id). If chunk_metadata is a dict, document_id (and offline URL fallback) is dropped.

Proposed fix
-        if hasattr(chunk, "chunk_metadata") and chunk.chunk_metadata:
-            if hasattr(chunk.chunk_metadata, "document_id"):
-                doc_id = chunk.chunk_metadata.document_id
-                attributes["document_id"] = doc_id
-                # Build URL if not already set
-                if "doc_url" not in attributes and offline and doc_id:
-                    attributes["doc_url"] = urljoin(constants.MIMIR_DOC_URL, doc_id)
+        if hasattr(chunk, "chunk_metadata") and chunk.chunk_metadata:
+            chunk_meta = chunk.chunk_metadata
+            if isinstance(chunk_meta, dict):
+                doc_id = chunk_meta.get("document_id") or chunk_meta.get("doc_id")
+            else:
+                doc_id = getattr(chunk_meta, "document_id", None) or getattr(
+                    chunk_meta, "doc_id", None
+                )
+            if doc_id:
+                attributes["document_id"] = doc_id
+                # Build URL if not already set
+                if "doc_url" not in attributes and offline:
+                    attributes["doc_url"] = urljoin(constants.MIMIR_DOC_URL, doc_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 593 - 600, The code only reads
chunk.chunk_metadata.document_id via attribute access and ignores dict-style
metadata; update the extraction logic in the block that handles chunk and
chunk_metadata (referencing chunk, chunk_metadata, attributes, doc_id) to handle
both attribute and mapping access: if chunk.chunk_metadata is a dict, read
document_id = chunk.chunk_metadata.get("document_id"), otherwise use
getattr(chunk.chunk_metadata, "document_id", None); then set
attributes["document_id"]=doc_id if present and keep the existing offline URL
fallback using urljoin(constants.MIMIR_DOC_URL, doc_id) when "doc_url" not in
attributes and offline is true.

125-127: ⚠️ Potential issue | 🟡 Minor

Use the correct metadata label for chunk.source in inline context.

chunk.source is a source identifier, not a document ID. This mislabels metadata sent to the model.

Proposed fix
-        if chunk.source:
-            metadata_parts.append(f"document_id: {chunk.source}")
+        if chunk.source:
+            metadata_parts.append(f"source: {chunk.source}")
+        if chunk.attributes and chunk.attributes.get("document_id"):
+            metadata_parts.append(f"document_id: {chunk.attributes['document_id']}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 125 - 127, The metadata label for
chunk.source is incorrect: in the metadata assembly (the metadata_parts list in
src/utils/vector_search.py) replace the mislabel "document_id: {chunk.source}"
with the correct inline label "source: {chunk.source}" so the model receives the
proper source identifier; update the conditional that appends to metadata_parts
(the block referencing chunk.source and metadata_parts) to use "source" rather
than "document_id".
src/app/endpoints/query.py (1)

172-172: ⚠️ Potential issue | 🟠 Major

Moderate the original user query, not the inline RAG-augmented payload.

responses_params.input can include injected inline context, which can cause false moderation blocks unrelated to user intent. Keep moderation input separate from LLM-augmented input.

Proposed fix
-    turn_summary = await retrieve_response(
+    turn_summary = await retrieve_response(
         client,
         responses_params,
         query_request.shield_ids,
         vector_store_ids,
         rag_id_mapping,
+        moderation_input=query_request.query,
     )
 async def retrieve_response(  # pylint: disable=too-many-locals
     client: AsyncLlamaStackClient,
     responses_params: ResponsesApiParams,
     shield_ids: Optional[list[str]] = None,
     vector_store_ids: Optional[list[str]] = None,
     rag_id_mapping: Optional[dict[str, str]] = None,
+    moderation_input: Optional[str] = None,
 ) -> TurnSummary:
@@
-        moderation_result = await run_shield_moderation(
-            client, cast(str, responses_params.input), shield_ids
-        )
+        moderation_payload = (
+            moderation_input
+            if moderation_input is not None
+            else cast(str, responses_params.input)
+        )
+        moderation_result = await run_shield_moderation(
+            client, moderation_payload, shield_ids
+        )

Also applies to: 290-292

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` at line 172, The moderation call is currently
using the LLM-augmented payload (e.g., inline_rag_context.context_text or
responses_params.input) which may contain injected context; change the
moderation input to the original user query (use the raw user input
variable—e.g., responses_params.input_raw or the original_request_input variable
that holds the user message before augmentation) so only the unmodified user
query is moderated; update the two places noted (the inline_rag_context usage
around inline_rag_context.context_text and the duplicate occurrence at the later
block around lines 290-292) to pass the original user input variable to the
moderation function instead of responses_params.input or
inline_rag_context.context_text.
🧹 Nitpick comments (2)
src/llama_stack_configuration.py (2)

250-253: Consider extracting duplicate prefix-stripping logic into a helper function.

This same logic for stripping the sentence-transformers/ prefix appears here and in enrich_solr (lines 476-478). A small helper would reduce duplication.

♻️ Suggested helper function
def strip_sentence_transformers_prefix(model_id: str) -> str:
    """Strip 'sentence-transformers/' prefix from model ID if present."""
    prefix = "sentence-transformers/"
    if model_id.startswith(prefix):
        return model_id[len(prefix):]
    return model_id

Then use:

-        provider_model_id = embedding_model
-        if provider_model_id.startswith("sentence-transformers/"):
-            provider_model_id = provider_model_id[len("sentence-transformers/") :]
+        provider_model_id = strip_sentence_transformers_prefix(embedding_model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 250 - 253, Extract the
duplicated prefix-stripping logic into a small helper, e.g.,
strip_sentence_transformers_prefix(model_id: str), and replace the inline logic
in the block using provider_model_id/embedding_model and the corresponding code
in enrich_solr with calls to that helper; ensure the helper returns the original
string when the "sentence-transformers/" prefix is not present and update
references to use the helper to avoid duplication.

379-438: LGTM with optional suggestion to modularize.

The Solr provider enrichment logic is correct - it properly checks for existing providers before adding and uses environment variable placeholders with sensible defaults from constants.

For improved maintainability, consider breaking this ~120-line function into smaller helpers (e.g., _add_solr_provider, _add_solr_vector_store, _add_solr_embedding_model), but this is optional given the clear sectional comments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 379 - 438, The function
enrich_solr is a bit large and would benefit from modularization; extract
logical sections into small helper functions to improve readability and
testability. Specifically, create helpers like _add_solr_provider(ls_config,
solr_config) to encapsulate the provider construction and append logic
(currently the block that builds solr_url_env, collection_env, vector_field_env,
content_field_env, embedding_model_env, embedding_dim_env and the dict appended
to ls_config["providers"]["vector_io"]), and optionally
_ensure_vector_io_provider_container(ls_config) to initialize
ls_config["providers"]["vector_io"] if missing; then update enrich_solr to call
these helpers while keeping the same behavior and constants
(constants.SOLR_PROVIDER_ID, SOLR_DEFAULT_*), ensuring no change to existing
condition checks (e.g., if not solr_config or not solr_config.get("enabled") and
the existing_providers check).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 202-203: The code currently runs moderation against
responses_params.input after inline RAG has been injected; change the flow so
moderation runs on the raw user query before any inline RAG injection: locate
where inline_rag_context is built and where moderation is invoked (references:
inline_rag_context and responses_params.input), capture the original user query
variable (the pre-injection input on responses_params) and call the moderation
routine on that raw value, then only after moderation passes attach
inline_rag_context=context_text or None and proceed; apply the same change at
the other occurrence mentioned (the block around lines 282-285) so both
streaming paths moderate the raw query prior to injection.

---

Outside diff comments:
In `@src/app/endpoints/query.py`:
- Line 1: The file src/app/endpoints/query.py fails Black formatting; run Black
(e.g., python -m black src/app/endpoints/query.py) to auto-reformat the file and
commit the changes so CI passes; ensure the pylint disable header ("# pylint:
disable=too-many-locals,too-many-branches,too-many-nested-blocks") remains
intact and that any imports/functions within query.py (e.g., endpoint handlers
or helper functions defined there) are updated to match Black's formatting
before pushing the commit.

In `@src/llama_stack_configuration.py`:
- Around line 1-6: This file fails Black formatting; run Black on the module
(e.g., black src/llama_stack_configuration.py) or reformat the file to match
Black's style so CI passes; ensure the module docstring and any definitions such
as generate_configuration are formatted according to Black (consistent line
breaks/quotes/spacing) before committing.

In `@src/utils/vector_search.py`:
- Line 1: CI reports src/utils/vector_search.py is not formatted with Black; run
the Black formatter (e.g., black src/utils/vector_search.py) or apply the
project's Black config to reformat the file so it matches CI; ensure the
file-level docstring and all code in the vector_search.py module are reformatted
accordingly and re-run linters/CI to confirm the change.

---

Duplicate comments:
In `@src/app/endpoints/query.py`:
- Line 172: The moderation call is currently using the LLM-augmented payload
(e.g., inline_rag_context.context_text or responses_params.input) which may
contain injected context; change the moderation input to the original user query
(use the raw user input variable—e.g., responses_params.input_raw or the
original_request_input variable that holds the user message before augmentation)
so only the unmodified user query is moderated; update the two places noted (the
inline_rag_context usage around inline_rag_context.context_text and the
duplicate occurrence at the later block around lines 290-292) to pass the
original user input variable to the moderation function instead of
responses_params.input or inline_rag_context.context_text.

In `@src/utils/vector_search.py`:
- Around line 593-600: The code only reads chunk.chunk_metadata.document_id via
attribute access and ignores dict-style metadata; update the extraction logic in
the block that handles chunk and chunk_metadata (referencing chunk,
chunk_metadata, attributes, doc_id) to handle both attribute and mapping access:
if chunk.chunk_metadata is a dict, read document_id =
chunk.chunk_metadata.get("document_id"), otherwise use
getattr(chunk.chunk_metadata, "document_id", None); then set
attributes["document_id"]=doc_id if present and keep the existing offline URL
fallback using urljoin(constants.MIMIR_DOC_URL, doc_id) when "doc_url" not in
attributes and offline is true.
- Around line 125-127: The metadata label for chunk.source is incorrect: in the
metadata assembly (the metadata_parts list in src/utils/vector_search.py)
replace the mislabel "document_id: {chunk.source}" with the correct inline label
"source: {chunk.source}" so the model receives the proper source identifier;
update the conditional that appends to metadata_parts (the block referencing
chunk.source and metadata_parts) to use "source" rather than "document_id".

---

Nitpick comments:
In `@src/llama_stack_configuration.py`:
- Around line 250-253: Extract the duplicated prefix-stripping logic into a
small helper, e.g., strip_sentence_transformers_prefix(model_id: str), and
replace the inline logic in the block using provider_model_id/embedding_model
and the corresponding code in enrich_solr with calls to that helper; ensure the
helper returns the original string when the "sentence-transformers/" prefix is
not present and update references to use the helper to avoid duplication.
- Around line 379-438: The function enrich_solr is a bit large and would benefit
from modularization; extract logical sections into small helper functions to
improve readability and testability. Specifically, create helpers like
_add_solr_provider(ls_config, solr_config) to encapsulate the provider
construction and append logic (currently the block that builds solr_url_env,
collection_env, vector_field_env, content_field_env, embedding_model_env,
embedding_dim_env and the dict appended to ls_config["providers"]["vector_io"]),
and optionally _ensure_vector_io_provider_container(ls_config) to initialize
ls_config["providers"]["vector_io"] if missing; then update enrich_solr to call
these helpers while keeping the same behavior and constants
(constants.SOLR_PROVIDER_ID, SOLR_DEFAULT_*), ensuring no change to existing
condition checks (e.g., if not solr_config or not solr_config.get("enabled") and
the existing_providers check).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4075c6 and 87708d6.

📒 Files selected for processing (4)
  • src/app/endpoints/query.py
  • src/app/endpoints/streaming_query.py
  • src/llama_stack_configuration.py
  • src/utils/vector_search.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llama_stack_configuration.py (1)

310-327: ⚠️ Potential issue | 🟠 Major

Prevent duplicate provider_id entries in providers.vector_io.

This appends providers unconditionally. Duplicate rag_id values (or re-enrichment over existing config) can produce duplicate provider_ids and ambiguous provider resolution.

🛠️ Proposed fix
 def construct_vector_io_providers_section(
     ls_config: dict[str, Any], byok_rag: list[dict[str, Any]]
 ) -> list[dict[str, Any]]:
@@
-    # append new vector_io entries
+    # append new vector_io entries
+    existing_provider_ids = {p.get("provider_id") for p in output}
     for brag in byok_rag:
         if not brag.get("rag_id"):
             raise ValueError(f"BYOK RAG entry is missing required 'rag_id': {brag}")
         rag_id = brag["rag_id"]
         backend_name = f"byok_{rag_id}_storage"
         provider_id = f"byok_{rag_id}"
+        if provider_id in existing_provider_ids:
+            logger.debug("Skipping duplicate vector_io provider_id: %s", provider_id)
+            continue
+        existing_provider_ids.add(provider_id)
         output.append(
             {
                 "provider_id": provider_id,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 310 - 327, The loop over
byok_rag unconditionally appends provider entries and can create duplicate
provider_id values (provider_id = f"byok_{rag_id}") causing ambiguous
resolution; modify the loop in llama_stack_configuration.py (the byok_rag
iteration where provider_id and backend_name are built) to first check existing
provider_ids in the accumulating output (or providers.vector_io collection) and
skip or update the entry if provider_id already exists (e.g., compute
existing_ids = {p["provider_id"] for p in output} and only append when
provider_id not in existing_ids), ensuring duplicates are not added.
♻️ Duplicate comments (3)
src/app/endpoints/query.py (1)

157-173: ⚠️ Potential issue | 🟠 Major

Run moderation against the original user query, not the inline-injected prompt.

Because moderation consumes responses_params.input, inline RAG content can cause false blocks unrelated to user intent.

🛠️ Suggested direction
-    turn_summary = await retrieve_response(
+    turn_summary = await retrieve_response(
         client,
         responses_params,
         query_request.shield_ids,
         vector_store_ids,
         rag_id_mapping,
+        moderation_input=query_request.query,
     )
 async def retrieve_response(
@@
     rag_id_mapping: Optional[dict[str, str]] = None,
+    moderation_input: Optional[str] = None,
 ) -> TurnSummary:
@@
-        moderation_result = await run_shield_moderation(
-            client, cast(str, responses_params.input), shield_ids
-        )
+        moderation_payload = (
+            moderation_input
+            if moderation_input is not None
+            else cast(str, responses_params.input)
+        )
+        moderation_result = await run_shield_moderation(
+            client, moderation_payload, shield_ids
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 157 - 173, The moderation is being
run on responses_params.input after you inject inline RAG content; instead run
moderation against the original user query (query_request.query) before or
separately from building/injecting inline_rag_context so
inline_rag_context.context_text cannot affect moderation. Update the flow around
build_rag_context and prepare_responses_params so that prepare_responses_params
is given the original query (or a separate field like user_input) for moderation
and only after moderation passes attach inline_rag_context.context_text to the
request; adjust code paths that reference responses_params.input to use the
original query value for moderation checks (referencing build_rag_context,
prepare_responses_params, responses_params.input, and
inline_rag_context.context_text).
src/utils/vector_search.py (2)

125-127: ⚠️ Potential issue | 🟡 Minor

Use the correct metadata label for chunk.source.

chunk.source is a source identifier, not a document ID. The current label can mislead the injected context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 125 - 127, The metadata label for
chunk.source is incorrect; locate the code that appends to metadata_parts (the
conditional using chunk.source) and replace the string "document_id:
{chunk.source}" with a clear source label such as "source: {chunk.source}" (or
"source_id: {chunk.source}") so the injected metadata correctly reflects that
chunk.source is a source identifier rather than a document ID.

588-596: ⚠️ Potential issue | 🟠 Major

Handle Solr chunk_metadata as dict/object and propagate title/url metadata into RAGChunk.attributes.

The conversion still relies on attribute-only chunk_metadata.document_id, and doesn’t carry title/reference_url consistently into chunk attributes. This is likely why Solr chunk metadata remains sparse.

🛠️ Proposed fix
-        # For Solr chunks, also extract from chunk_metadata
-        if hasattr(chunk, "chunk_metadata") and chunk.chunk_metadata:
-            if hasattr(chunk.chunk_metadata, "document_id"):
-                doc_id = chunk.chunk_metadata.document_id
-                attributes["document_id"] = doc_id
-                # Build URL if not already set
-                if "doc_url" not in attributes and offline and doc_id:
-                    attributes["doc_url"] = urljoin(constants.MIMIR_DOC_URL, doc_id)
+        # Extract metadata from both dict-style and object-style payloads
+        doc_id, title, reference_url = _extract_solr_document_metadata(chunk)
+        if doc_id:
+            attributes["document_id"] = doc_id
+        if title:
+            attributes["title"] = title
+        if reference_url and "doc_url" not in attributes and not offline:
+            attributes["doc_url"] = reference_url
+        if "doc_url" not in attributes and offline and doc_id:
+            attributes["doc_url"] = urljoin(constants.MIMIR_DOC_URL, doc_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 588 - 596, The Solr chunk handling
only reads chunk.chunk_metadata.document_id via attribute access and never
copies title/reference_url into RAGChunk.attributes; update the conversion logic
around chunk, chunk_metadata and attributes to accept chunk.chunk_metadata as
either a dict or object (check both dict keys and object attributes), extract
document_id, title and reference_url (or equivalent keys) into
attributes["document_id"], attributes["title"], and attributes["reference_url"],
and only build attributes["doc_url"] when missing using offline and document_id
or fallback to reference_url via urljoin(constants.MIMIR_DOC_URL, doc_id) or the
reference_url value; ensure you reference the same symbols chunk,
chunk_metadata, attributes, offline, urljoin and constants.MIMIR_DOC_URL in your
changes.
🧹 Nitpick comments (1)
src/llama_stack_configuration.py (1)

381-387: Complete enrich_solr docstring to match repository docstring standard.

Please include explicit Returns/Raises sections (even if Returns is None) to keep new function docs consistent with the project convention.

As per coding guidelines "All functions must have complete docstrings with brief descriptions" and "Follow Google Python docstring conventions for modules, classes, and functions with Parameters, Returns, Raises, and Attributes sections".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 381 - 387, Update the
enrich_solr function docstring to follow the project's Google-style standard:
after the existing short description and Args section for ls_config and
solr_config, add a Returns section (explicitly state "None") and a Raises
section documenting any exceptions the function may raise (e.g., ValueError or
KeyError if you validate inputs, or RuntimeError for Solr connection errors) or
state "None" if it does not raise; ensure parameter names match ls_config and
solr_config and keep the docstring concise and consistent with other functions
in the repo.
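For reference, a Google-style docstring matching that convention could look like the following; the parameter descriptions and behavior notes are illustrative, not a claim about what `enrich_solr` actually does.

```python
def enrich_solr(ls_config: dict, solr_config: dict) -> None:
    """Enrich the llama-stack configuration with a Solr vector_io provider.

    Args:
        ls_config: Mutable llama-stack configuration, modified in place.
        solr_config: Solr section of lightspeed-stack.yaml; the function is a
            no-op when this is missing or ``enabled`` is false.

    Returns:
        None. The configuration is modified in place.

    Raises:
        None.
    """
```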
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/vector_search.py`:
- Around line 280-281: The logger.debug call that logs the full Solr chunk
object (logger.debug("Extract doc ids from chunk: %s", chunk)) risks leaking
sensitive content; update the logging in the code that handles chunk objects
(the place where the variable chunk is processed, e.g., the Extract doc ids
logic) to only emit a non-sensitive summary such as chunk id, score/rank, and
safe metadata keys (e.g., chunk.get("id"), chunk.get("score"), chunk.get("meta",
{}).get("source")) and explicitly omit or avoid fields like "text", "content",
"payload", or any raw document fields; replace the raw chunk log with a debug
message that formats only these allowed summary fields.
- Around line 347-351: The comprehension for vector_store_ids_to_query currently
includes every non-Solr store returned by get_vector_store_ids, which can pull
from unintended stores; update it to restrict to only the configured BYOK stores
by intersecting the returned IDs with the provided vector_store_ids parameter
and excluding constants.SOLR_DEFAULT_VECTOR_STORE_ID. Locate the code building
vector_store_ids_to_query (uses get_vector_store_ids and
constants.SOLR_DEFAULT_VECTOR_STORE_ID) and change the filter to: result_ids =
await get_vector_store_ids(client, vector_store_ids); then set
vector_store_ids_to_query to the intersection of result_ids and the original
vector_store_ids (minus SOLR_DEFAULT_VECTOR_STORE_ID) so only explicitly
configured BYOK stores are queried.

---

Outside diff comments:
In `@src/llama_stack_configuration.py`:
- Around line 310-327: The loop over byok_rag unconditionally appends provider
entries and can create duplicate provider_id values (provider_id =
f"byok_{rag_id}") causing ambiguous resolution; modify the loop in
llama_stack_configuration.py (the byok_rag iteration where provider_id and
backend_name are built) to first check existing provider_ids in the accumulating
output (or providers.vector_io collection) and skip or update the entry if
provider_id already exists (e.g., compute existing_ids = {p["provider_id"] for p
in output} and only append when provider_id not in existing_ids), ensuring
duplicates are not added.

---

Duplicate comments:
In `@src/app/endpoints/query.py`:
- Around line 157-173: The moderation is being run on responses_params.input
after you inject inline RAG content; instead run moderation against the original
user query (query_request.query) before or separately from building/injecting
inline_rag_context so inline_rag_context.context_text cannot affect moderation.
Update the flow around build_rag_context and prepare_responses_params so that
prepare_responses_params is given the original query (or a separate field like
user_input) for moderation and only after moderation passes attach
inline_rag_context.context_text to the request; adjust code paths that reference
responses_params.input to use the original query value for moderation checks
(referencing build_rag_context, prepare_responses_params,
responses_params.input, and inline_rag_context.context_text).

In `@src/utils/vector_search.py`:
- Around line 125-127: The metadata label for chunk.source is incorrect; locate
the code that appends to metadata_parts (the conditional using chunk.source) and
replace the string "document_id: {chunk.source}" with a clear source label such
as "source: {chunk.source}" (or "source_id: {chunk.source}") so the injected
metadata correctly reflects that chunk.source is a source identifier rather than
a document ID.
- Around line 588-596: The Solr chunk handling only reads
chunk.chunk_metadata.document_id via attribute access and never copies
title/reference_url into RAGChunk.attributes; update the conversion logic around
chunk, chunk_metadata and attributes to accept chunk.chunk_metadata as either a
dict or object (check both dict keys and object attributes), extract
document_id, title and reference_url (or equivalent keys) into
attributes["document_id"], attributes["title"], and attributes["reference_url"],
and only build attributes["doc_url"] when missing using offline and document_id
or fallback to reference_url via urljoin(constants.MIMIR_DOC_URL, doc_id) or the
reference_url value; ensure you reference the same symbols chunk,
chunk_metadata, attributes, offline, urljoin and constants.MIMIR_DOC_URL in your
changes.

---

Nitpick comments:
In `@src/llama_stack_configuration.py`:
- Around line 381-387: Update the enrich_solr function docstring to follow the
project's Google-style standard: after the existing short description and Args
section for ls_config and solr_config, add a Returns section (explicitly state
"None") and a Raises section documenting any exceptions the function may raise
(e.g., ValueError or KeyError if you validate inputs, or RuntimeError for Solr
connection errors) or state "None" if it does not raise; ensure parameter names
match ls_config and solr_config and keep the docstring concise and consistent
with other functions in the repo.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87708d6 and 4342d9b.

📒 Files selected for processing (3)
  • src/app/endpoints/query.py
  • src/llama_stack_configuration.py
  • src/utils/vector_search.py

Comment on lines +347 to +351
vector_store_ids_to_query = [
vs_id
for vs_id in await get_vector_store_ids(client, vector_store_ids)
if vs_id != constants.SOLR_DEFAULT_VECTOR_STORE_ID
]

⚠️ Potential issue | 🟠 Major

Limit inline BYOK retrieval to configured BYOK vector stores.

This currently queries every non-Solr store returned by Llama Stack, which can pull chunks from unintended stores and violate configured-source behavior.

🛠️ Proposed fix
-        vector_store_ids_to_query = [
+        configured_byok_store_ids = set(configuration.rag_id_mapping.keys())
+        vector_store_ids_to_query = [
             vs_id
             for vs_id in await get_vector_store_ids(client, vector_store_ids)
-            if vs_id != constants.SOLR_DEFAULT_VECTOR_STORE_ID
+            if vs_id != constants.SOLR_DEFAULT_VECTOR_STORE_ID
+            and vs_id in configured_byok_store_ids
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 347 - 351, The comprehension for
vector_store_ids_to_query currently includes every non-Solr store returned by
get_vector_store_ids, which can pull from unintended stores; update it to
restrict to only the configured BYOK stores by intersecting the returned IDs
with the provided vector_store_ids parameter and excluding
constants.SOLR_DEFAULT_VECTOR_STORE_ID. Locate the code building
vector_store_ids_to_query (uses get_vector_store_ids and
constants.SOLR_DEFAULT_VECTOR_STORE_ID) and change the filter to: result_ids =
await get_vector_store_ids(client, vector_store_ids); then set
vector_store_ids_to_query to the intersection of result_ids and the original
vector_store_ids (minus SOLR_DEFAULT_VECTOR_STORE_ID) so only explicitly
configured BYOK stores are queried.

@are-ces are-ces marked this pull request as draft March 2, 2026 15:12
@are-ces are-ces marked this pull request as ready for review March 2, 2026 15:22