Code hallucination by adaamko · Pull Request #38 · KRLabsOrg/LettuceDetect

adaamko · 2026-06-01T10:08:22Z

Hallucination detection for agentic coding workflows

Builds a span-level hallucination-detection benchmark for agentic coding workflows: given the grounded artifacts an assistant sees at inference (repository source, tool output, retrieved docs), localize the unsupported spans in its answer. Spans across every source map into one unified taxonomy, and two public prose datasets are folded in as a complementary collection.

What's here

Unified taxonomy (lettucedetect/datasets/taxonomy.py)

Three injectable categories (contradiction, unsupported_addition, fabricated_reference), plus supported and a document-level omission label, with 13 subtypes in total. Every source maps into this one label space.
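
As a rough illustration of that label space, the top-level categories could be modelled as an enum. This is a minimal sketch, not the contents of taxonomy.py: the class name `Category`, the `INJECTABLE` set, and the `is_injectable` helper are hypothetical, and the 13 subtypes are left out because the description does not enumerate them.

```python
from enum import Enum


class Category(str, Enum):
    """Top-level labels named in the PR description (layout is an assumption)."""

    CONTRADICTION = "contradiction"
    UNSUPPORTED_ADDITION = "unsupported_addition"
    FABRICATED_REFERENCE = "fabricated_reference"
    SUPPORTED = "supported"
    OMISSION = "omission"  # document-level label, not a span label


# The three categories an injector can introduce into a correct answer.
INJECTABLE = frozenset(
    {
        Category.CONTRADICTION,
        Category.UNSUPPORTED_ADDITION,
        Category.FABRICATED_REFERENCE,
    }
)


def is_injectable(label: str) -> bool:
    """Return True if `label` names one of the injectable categories."""
    try:
        return Category(label) in INJECTABLE
    except ValueError:
        # Unknown label strings are simply not injectable.
        return False
```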

Shared generation pipeline (lettucedetect/generation/)

Grounded correct-answer generation, edit-based taxonomy injection (exact character spans), typed question generation, a batched/resumable runner, and a document-source pipeline.
classify.py: an LLM that types an already-annotated span into the taxonomy — for sources that ship untyped spans (the inverse of the injector).

Datasets (KRLabsOrg/lettucedetect-code-hallucination, 79,591 samples)

Code (SWE-bench, 23,830) · tool output (11,365) · ACL papers (5,355) · GitHub READMEs (13,803) · Wikipedia (25,238).
Prose collection KRLabsOrg/lettucedetect-prose-hallucination (87,834): PsiloQA (natural, 14 languages) + RAGTruth, classified into the same taxonomy.

Prompt format

Each sample exposes context and question separately, and the prompt places the request first (User request: {question}\n\n{context}) so it is never lost when a long context is truncated. Backfilled existing data in place and updated all adapters.

Tooling

scripts/build_hf_dataset.py: reusable assembler that merges the data/v2 sources into a DatasetDict and pushes it (the metadata dict is serialized to a JSON string).

Repo hygiene

Removed 7 superseded prototype and one-off scripts (Groq/Kimi monoliths and helpers replaced by the modular pipeline); none were referenced by code, docs, or CI.
De-duplicated validation helpers: validator.py now imports the canonical span/coverage helpers from the injector instead of keeping drifted copies.
Fixed an output-dir redirect bug in config.set_output_dir (missing global INJECTION_FAILURES_PATH).
Ruff lint clean on the CI scope.

Not in scope yet

Model training (encoder-first English, then multilingual), omission detection (question-side spans), and the difficulty gate are tracked separately.

Remove GLiNER2 schema-head detail from docs/taxonomy.md (too implementation-specific for now); keep the modality-unification and typed-output rationale. Ruff formatting on sample_assembler.

Introduce lettucedetect/generation/ with composable, source-agnostic primitives for building hallucination-detection datasets:
- injection.py: universal, taxonomy-driven injector (sync + async) that corrupts a correct answer into a hallucinated one with exact character spans, modality-aware (code/tool_output/markdown/prose) across all 13 subtypes.
- answers.py: grounded correct-answer generation (sync + async).
- runner.py: batched (asyncio.gather), resumable, failure-logging orchestration reused by every source adapter.

The code-hallucination Phase 6 injector now delegates edit application, span location, and validation to this engine, keeping its own code prompt and native labels. Verified byte-identical against the released dataset (2000/2000 entries reproduced through the shared path).

Add the squeez tool-output adapter (scripts/generate_squeez_hallucinations.py) and taxonomy category/subtype definitions. Document the taxonomy and generation pipeline in the site nav.
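
The batched, resumable, failure-logging orchestration attributed to runner.py could be sketched roughly as follows. This is an illustration under assumptions, not the actual module: the function name `run_batched`, its parameters, and the JSONL layout are hypothetical, and the sketch assumes the same `record_key` callable works on both input items and written records.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable, Iterable


async def run_batched(
    items: Iterable[dict],
    process: Callable[[dict], Awaitable[dict]],
    record_key: Callable[[dict], str],
    out_path: Path,
    failures_path: Path,
    batch_size: int = 8,
) -> int:
    """Process items in concurrent batches, skipping keys already in out_path."""
    # Resume: collect the keys of records written by a previous run.
    done = set()
    if out_path.exists():
        with out_path.open() as f:
            done = {record_key(json.loads(line)) for line in f if line.strip()}

    todo = [item for item in items if record_key(item) not in done]
    written = 0
    with out_path.open("a") as out, failures_path.open("a") as fail:
        for i in range(0, len(todo), batch_size):
            batch = todo[i : i + batch_size]
            # return_exceptions=True keeps one failure from cancelling the batch.
            results = await asyncio.gather(
                *(process(item) for item in batch), return_exceptions=True
            )
            for item, result in zip(batch, results):
                if isinstance(result, Exception):
                    record = {"key": record_key(item), "error": repr(result)}
                    fail.write(json.dumps(record) + "\n")
                else:
                    out.write(json.dumps(result) + "\n")
                    written += 1
    return written
```

Failed items are not written to the output file, so a rerun retries them while successful records are skipped.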

The batched runner no longer writes a synthetic key field into output records — resumability keys are derived from each record via a record_key callable, so samples are written verbatim in the final schema. Add lettucedetect/generation/assembly.py with balance_hallucination_ratio, a reusable primitive that trims clean samples to a target hallucinated ratio at assembly/upload time (no per-source munging scripts).
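
The trimming arithmetic behind a ratio balancer is simple enough to sketch. The function below is a hypothetical stand-in (its name, signature, and seeded sampling are assumptions, not the real balance_hallucination_ratio): with H hallucinated and C clean samples, H / (H + C) >= r holds exactly when C <= H * (1 - r) / r.

```python
import math
import random


def trim_clean_to_ratio(samples, target_ratio, is_hallucinated, seed=0):
    """Drop clean samples until hallucinated / total is at least target_ratio."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    hallucinated = [s for s in samples if is_hallucinated(s)]
    clean = [s for s in samples if not is_hallucinated(s)]
    # Largest clean count that still satisfies the ratio; the small epsilon
    # guards against float results such as 6.999999999 for an exact 7.
    bound = len(hallucinated) * (1 - target_ratio) / target_ratio
    max_clean = math.floor(bound + 1e-9)
    if len(clean) > max_clean:
        # Seeded sampling keeps the trimmed set reproducible across runs.
        clean = random.Random(seed).sample(clean, max_clean)
    return hallucinated + clean
```

Hallucinated samples are never dropped here; if there are already too few clean samples, the input is returned unchanged in content.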

Add lettucedetect/generation/questions.py — generates typed, self-contained questions answerable from a document, driven by an 18-type taxonomy (adapted from the acl-verbatim QA generator). Multi-part types are flagged as omission candidates. Needed by the doc-only markdown sources (wiki, READMEs). Extract the chat-completion-with-retries loop into generation/_completion.py (sync + async, with a transform callback) and route answers, injection, and questions through it, removing the duplicated retry boilerplate. Injection output is unchanged (verified byte-identical against the released dataset).

Add the ACL adapter (scripts/generate_acl_hallucinations.py): group acl-verbatim-spans by question, take the top-5 retrieved/gold chunks as markdown context, generate a grounded answer, and inject a paper-specific hallucination detectable against the excerpts.

Shared additions, reused by future markdown sources:
- per-edit hallucination types in apply_changes_to_answer (each edit labelled with its own type; falls back to the passed type, so existing sources are byte-identical)
- inject_menu / inject_menu_async: menu-mode injection where the model picks the fitting types, mapped to the taxonomy per source
- PAPER_MAP in taxonomy.py (NUMERICAL/ENTITY/RELATIONAL/METHODOLOGICAL/CITATIONAL -> unified categories)
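
To make the per-edit typing concrete, here is a minimal sketch of edit-based injection with exact character spans. It is not the real apply_changes_to_answer: the function name `apply_edits`, the edit keys (`original`, `replacement`, `type`), and the span dict layout are all assumptions for illustration.

```python
def apply_edits(answer, edits, default_type):
    """Apply (original -> replacement) edits; return new text and labelled spans."""
    # Locate every edit in the untouched answer first so offsets are unambiguous.
    located = []
    for edit in edits:
        start = answer.find(edit["original"])
        if start == -1:
            raise ValueError(f"edit target not found: {edit['original']!r}")
        located.append((start, edit))
    located.sort(key=lambda pair: pair[0])

    parts, spans = [], []
    cursor = 0   # position in the original answer
    out_len = 0  # length of the text emitted so far
    for start, edit in located:
        if start < cursor:
            raise ValueError("overlapping edits")
        parts.append(answer[cursor:start])
        out_len += start - cursor
        replacement = edit["replacement"]
        spans.append(
            {
                "start": out_len,
                "end": out_len + len(replacement),
                # Each edit carries its own type; fall back to the passed one.
                "type": edit.get("type", default_type),
            }
        )
        parts.append(replacement)
        out_len += len(replacement)
        cursor = start + len(edit["original"])
    parts.append(answer[cursor:])
    return "".join(parts), spans
```

Because spans are computed while the output is assembled, each one indexes the hallucinated answer directly, even when replacements differ in length from the text they replace.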

Reusable tool that searches popular repos across languages via the GitHub REST API, fetches and filters their READMEs (substantial, structured), and writes a resumable JSONL corpus for the README markdown source. Needs GITHUB_TOKEN for a usable rate limit; skips repos already collected.

Add generation/doc_source.py: the shared document-based flow (chunk by heading -> typed question -> grounded answer -> menu injection -> assemble), batched and resumable, with a generic factual markdown injection prompt as the default. READMEs and (upcoming) Wikipedia are thin configs over it. Add the README adapter (reads the collected README corpus, repo-level train/dev/test split, developer-style question subset). The generic factual injection suits heterogeneous README content far better than a dev-doc schema. Repurpose MARKDOWN_MAP to the generic factual types.
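
The first step of that flow, chunking by heading, could look roughly like this. A minimal sketch, assuming ATX-style (`#`) headings; the function names and the `min_chars` packing threshold are hypothetical, and the greedy packing mirrors the later change that merges short heading sections into larger chunks.

```python
import re

# An ATX heading: one to six '#' characters at line start, then whitespace.
HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)


def split_by_heading(markdown):
    """Split markdown into sections, each starting at a heading."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def pack_sections(sections, min_chars):
    """Greedily merge consecutive sections until each chunk reaches min_chars."""
    chunks, current = [], ""
    for section in sections:
        current = f"{current}\n\n{section}" if current else section
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:
        # Attach a short tail to the previous chunk rather than emit a tiny one.
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{current}"
        else:
            chunks.append(current)
    return chunks
```

One known limitation of this sketch: a `#` comment line inside a fenced code block would be treated as a heading, so a real README chunker would need to track fences.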

Stream the English open-wikipedia-markdown parquet shards (the dataset script is broken, so load parquet directly), sample substantial articles, and run the shared doc-source pipeline with factual question types and the generic factual injection. Add a shared hash_split helper for document-level train/dev/test splitting (README now uses it too).

The source acl-verbatim test config has only ~17 answerable questions, so the ACL test split was tiny (3 hallucinated). Pool all source questions and assign train/dev/test by hashing the paper id (paper-separated, no leakage), giving a real test set (~440 questions / ~117 hallucinated).
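
Hash-based assignment of this kind can be sketched in a few lines. This is not the shared hash_split helper itself: the signature, the choice of SHA-256, and the default fractions are assumptions. The point it illustrates is that the split depends only on the document id, so every question from one paper lands in the same split.

```python
import hashlib


def hash_split(doc_id, dev_frac=0.1, test_frac=0.1):
    """Map a document id to 'train', 'dev', or 'test' deterministically."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"
```

Unlike Python's built-in `hash()`, which is salted per process for strings, a cryptographic digest gives the same assignment on every run and machine.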

Update the generation pipeline doc to list the five sources now built (code, tool-output, ACL, README, Wikipedia) with their modality, question source, and injection mode.

Add classify.py: an LLM that types an already-annotated span into the unified taxonomy, for sources that ship untyped spans (inverse of the injector). Use it to fold PsiloQA (natural, multilingual hallucinations) into the taxonomy via classify_psiloqa_spans.py; RAGTruth maps mechanically. Add build_hf_dataset.py, a reusable assembler that merges data/v2 sources into a DatasetDict and pushes it (dev->validation, metadata dict serialized to a JSON string).
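
The two assembly details mentioned for build_hf_dataset.py (dev renamed to validation, metadata stored as a JSON string) can be illustrated without the `datasets` dependency. A minimal sketch with hypothetical names (`SPLIT_RENAMES`, `to_hub_row`, `assemble`); building and pushing the actual DatasetDict is omitted. Serializing metadata is presumably useful because per-source metadata dicts have different keys, while a string column has one uniform type.

```python
import json

# Assumed mapping: local split names -> names used in the published dataset.
SPLIT_RENAMES = {"dev": "validation"}


def to_hub_row(sample):
    """Copy a sample, replacing its metadata dict with a JSON string."""
    row = dict(sample)
    row["metadata"] = json.dumps(
        sample.get("metadata", {}), sort_keys=True, ensure_ascii=False
    )
    return row


def assemble(splits):
    """Rename splits and serialize metadata for every row."""
    return {
        SPLIT_RENAMES.get(name, name): [to_hub_row(s) for s in rows]
        for name, rows in splits.items()
    }
```

Consumers recover the original dict with `json.loads(row["metadata"])`.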

The baked prompt put the user request last ("...User request: {q}"), where truncation=only_first clips it on long inputs, and never exposed context or question separately. Add context+question fields to HallucinationSample, a shared format_prompt() that builds the prompt question-first (truncation-safe), and wire every adapter to emit context/question. Add canonicalize_prompts.py to backfill existing data/v2 in place (idempotent, no LLM). Pack Wikipedia heading sections into larger chunks so contexts are no longer too short.
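
The effect of the ordering can be shown with a small sketch. The question-first template follows the PR description (`User request: {question}\n\n{context}`); the function signatures, the request-last variant, and the character-level `truncate` stand-in for tokenizer truncation are assumptions for illustration.

```python
def format_prompt(context, question):
    """Question-first prompt: the request survives head-keeping truncation."""
    return f"User request: {question}\n\n{context}"


def format_prompt_request_last(context, question):
    """Approximation of the earlier layout, with the request at the end."""
    return f"{context}\n\nUser request: {question}"


def truncate(text, max_chars):
    """Character-level stand-in for truncation that keeps only the head."""
    return text[:max_chars]


context = "def add(a, b):\n    return a + b\n" * 200  # deliberately long
question = "What does add() return for strings?"

kept = truncate(format_prompt(context, question), 512)
lost = truncate(format_prompt_request_last(context, question), 512)
print(question in kept)  # the request is still present
print(question in lost)  # the request was clipped
```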

These were early monolithic prototypes (Groq/Kimi) and one-off helpers superseded by the modular scripts/code_hallucination/ pipeline and lettucedetect/generation/. None were referenced by code, docs, or CI.

…op assert)

- validator.py now imports _extract_code_regions / _span_is_in_code / _max_allowed_coverage from the canonical injector instead of keeping drifted copies (behavior preserved: long-answer cap stays 0.30)
- config.set_output_dir was missing 'global INJECTION_FAILURES_PATH', so redirecting the output dir silently left that one path at the default
- drop a dead local variable in the injector's sequential path
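
The set_output_dir bug is an instance of a common Python pitfall: assigning to a module-level name inside a function without a `global` declaration creates a local variable instead. A minimal reproduction, assuming a config layout that is not taken from the repository (`OUTPUT_DIR`, the file name, and the `failures_path` accessor are hypothetical; only `INJECTION_FAILURES_PATH` is named in the commit):

```python
from pathlib import Path

OUTPUT_DIR = Path("data")
INJECTION_FAILURES_PATH = OUTPUT_DIR / "injection_failures.jsonl"


def set_output_dir_buggy(path):
    global OUTPUT_DIR
    OUTPUT_DIR = Path(path)
    # No `global` for this name, so the assignment creates a local variable
    # and the module-level path keeps pointing at the default directory.
    INJECTION_FAILURES_PATH = OUTPUT_DIR / "injection_failures.jsonl"


def set_output_dir_fixed(path):
    global OUTPUT_DIR, INJECTION_FAILURES_PATH
    OUTPUT_DIR = Path(path)
    INJECTION_FAILURES_PATH = OUTPUT_DIR / "injection_failures.jsonl"


def failures_path():
    """Read the module-level value, as other pipeline code would."""
    return INJECTION_FAILURES_PATH
```

No exception is raised in the buggy variant, which is why the failure stays silent: failure logs simply keep landing in the default directory.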

adaamko added 17 commits May 20, 2026 22:04

Add taxonomy

eb5f028

Handling context functions

ff51a5d

Ruff Linting

bb24c5b

Trim taxonomy doc, lint sample_assembler

d8ea3af

Document the five built data sources

b068263

Drop user-facing v2 references from taxonomy doc

6fc85fe

Fix lint

1080d52

adaamko force-pushed the code_hallucination branch from a265c45 to 1080d52 on June 2, 2026 07:46

adaamko self-assigned this Jun 2, 2026

adaamko added 3 commits June 2, 2026 10:18

Remove superseded code-hallucination prototypes and one-off scripts

6a081f4

Fix ruff lint in apply_taxonomy (docstrings, raw module docstring, dr…

f95dbc8

…op assert)

adaamko wants to merge 20 commits into main from code_hallucination
