[BUG] OpenAI embedding provider silently falls back to random vectors while reporting success #68

@gudzenkov

Summary

When codegraph is built with OpenAI embedding support and run with CODEGRAPH_EMBEDDING_PROVIDER=openai, indexing logs a CRITICAL message about falling back to random hash-based embeddings, yet the final indexing summary still reports provider openai as if the OpenAI embeddings had succeeded.

This makes a degraded index look valid.

Clean Reproduction

Fresh clone:

git clone --depth 1 https://github.com/Jakedismo/codegraph-rust codegraph-rust-clean-openai
cd codegraph-rust-clean-openai
git rev-parse HEAD
# ce5bf27 Merge pull request #59 from Jakedismo/wip-indexing-analyzers-doc

Build:

cargo build --release -p codegraph-mcp-server --bin codegraph \
  --no-default-features \
  --features ai-enhanced,embeddings-openai,codegraph-ai/openai-llm

Index command:

CODEGRAPH_SURREALDB_URL=ws://127.0.0.1:51068 \
CODEGRAPH_SURREALDB_NAMESPACE=codegraph \
CODEGRAPH_SURREALDB_DATABASE=main \
CODEGRAPH_SURREALDB_USERNAME=root \
CODEGRAPH_SURREALDB_PASSWORD=root \
CODEGRAPH_EMBEDDING_PROVIDER=openai \
CODEGRAPH_EMBEDDING_MODEL=text-embedding-3-small \
CODEGRAPH_EMBEDDING_DIMENSION=1536 \
CODEGRAPH_LLM_PROVIDER=openai \
CODEGRAPH_MODEL=gpt-5.4-mini \
CODEGRAPH_CONTEXT_WINDOW=200000 \
RUST_LOG=info \
target/release/codegraph --config /path/to/codegraph-rust.toml \
  index src --recursive --index-tier fast --force --batch-size 64 --max-concurrent 2

Observed during indexing:

CRITICAL: Falling back to random hash-based embeddings

Observed final summary:

Embeddings: chunks 2157 | dim 1536 | provider openai

Database state:

chunks: 2157
embedding_1536 populated: 2157

Expected

One of:

  • Real OpenAI embeddings are generated.
  • Indexing fails hard with an actionable OpenAI provider error.
  • The final summary clearly reports fallback/random embeddings and marks the index unsuitable for semantic retrieval.
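The second and third options could be sketched roughly as follows. This is only an illustration of the idea: `EmbeddingSource`, `summary_line`, and `require_real_embeddings` are hypothetical names and are not taken from the codegraph codebase. The point is that the summary is derived from the path that actually ran, not from the configured provider name.

```rust
/// Hypothetical record of which embedding path actually ran.
/// None of these names come from the codegraph codebase.
#[derive(Debug, Clone, PartialEq)]
enum EmbeddingSource {
    /// The configured provider produced the vectors.
    Provider(String),
    /// The configured provider failed and hash-based vectors were substituted.
    Fallback { requested: String, reason: String },
}

/// Formats the final summary line from what actually happened,
/// rather than from the configured provider name.
fn summary_line(chunks: usize, dim: usize, source: &EmbeddingSource) -> String {
    match source {
        EmbeddingSource::Provider(name) => {
            format!("Embeddings: chunks {chunks} | dim {dim} | provider {name}")
        }
        EmbeddingSource::Fallback { requested, reason } => format!(
            "Embeddings: chunks {chunks} | dim {dim} | provider FALLBACK \
             (requested {requested}: {reason}) | NOT suitable for semantic retrieval"
        ),
    }
}

/// Strict mode: turn a fallback into a hard error instead of a degraded index.
fn require_real_embeddings(source: &EmbeddingSource) -> Result<(), String> {
    match source {
        EmbeddingSource::Provider(_) => Ok(()),
        EmbeddingSource::Fallback { requested, reason } => Err(format!(
            "embedding provider '{requested}' unavailable: {reason}"
        )),
    }
}

fn main() {
    let ok = EmbeddingSource::Provider("openai".to_string());
    let degraded = EmbeddingSource::Fallback {
        requested: "openai".to_string(),
        // Placeholder reason; the real cause is not known from this report.
        reason: "provider initialization failed".to_string(),
    };
    println!("{}", summary_line(2157, 1536, &ok));
    println!("{}", summary_line(2157, 1536, &degraded));
    println!("{:?}", require_real_embeddings(&degraded));
}
```

Either behavior would be acceptable: the strict check fails the run, while the summary variant lets the run finish but makes the degraded state impossible to miss.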

Actual

The index completes and reports OpenAI embeddings even though the log says random hash-based embeddings were used.

Impact

Downstream semantic search depends on embedding quality. With a silent random fallback, vector similarity no longer reflects code similarity, so retrieval results are effectively meaningless while the CLI summary says the OpenAI embedding path succeeded.
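To illustrate why this matters, here is a small self-contained sketch. It assumes the fallback behaves like a generic hash-seeded pseudo-random vector; `hash_embedding` is a hypothetical stand-in, not codegraph's actual fallback code. Under that assumption, two nearly identical snippets get unrelated vectors, so their cosine similarity sits near zero instead of near one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative stand-in for a hash-seeded fallback embedding.
/// This is NOT codegraph's implementation; it only mimics the general idea.
fn hash_embedding(text: &str, dim: usize) -> Vec<f32> {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    let mut state = hasher.finish();
    (0..dim)
        .map(|_| {
            // splitmix64 step: a small deterministic pseudo-random generator.
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            // Map the top 24 bits to a value in [-1, 1).
            ((z >> 40) as f32 / (1u64 << 23) as f32) - 1.0
        })
        .collect()
}

/// Cosine similarity between two vectors of equal length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Two snippets that a real embedding model would place close together.
    let a = hash_embedding("fn parse_config(path: &str) -> Config", 1536);
    let b = hash_embedding("fn parse_config(path: &Path) -> Config", 1536);
    println!("identical text:    {:.3}", cosine(&a, &a));
    println!("near-identical:    {:.3}", cosine(&a, &b));
}
```

A spot check along these lines (comparing vectors of obviously related chunks) is one way to tell a degraded index from a real one, since the populated row count alone cannot.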
