
Code hallucination #38

Open
adaamko wants to merge 20 commits into main from code_hallucination

adaamko commented Jun 1, 2026

Hallucination detection for agentic coding workflows

Builds a span-level hallucination-detection benchmark for agentic coding workflows: given the grounded artifacts an assistant sees at inference (repository source, tool output, retrieved docs), localize the unsupported spans in its answer. Spans across every source map into one unified taxonomy, and two public prose datasets are folded in as a complementary collection.

What's here

Unified taxonomy (lettucedetect/datasets/taxonomy.py)

  • 3 injectable categories (contradiction, unsupported_addition, fabricated_reference) + supported + document-level omission; 13 subtypes. Every source maps into one label space.
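As a rough illustration of the label space described above, the categories could be modelled as an enum plus a set marking which ones are injectable. The class name `Category`, the `INJECTABLE` set, and the `is_injectable` helper are assumptions made for this sketch; the real contents of `taxonomy.py` (including the 13 subtypes, which are not listed here) may look different.

```python
from enum import Enum


class Category(str, Enum):
    """Top-level labels named in the PR description (hypothetical class name)."""

    CONTRADICTION = "contradiction"
    UNSUPPORTED_ADDITION = "unsupported_addition"
    FABRICATED_REFERENCE = "fabricated_reference"
    SUPPORTED = "supported"
    OMISSION = "omission"  # document-level, so not attached to an answer span


# The three categories that can be injected as span-level corruptions.
INJECTABLE = frozenset(
    {
        Category.CONTRADICTION,
        Category.UNSUPPORTED_ADDITION,
        Category.FABRICATED_REFERENCE,
    }
)


def is_injectable(label: str) -> bool:
    """Return True if `label` names an injectable category.

    Raises ValueError for a label outside the taxonomy.
    """
    return Category(label) in INJECTABLE
```

Rejecting unknown labels with an exception, rather than silently mapping them to `supported`, is one way to keep every source inside a single label space.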

Shared generation pipeline (lettucedetect/generation/)

  • Grounded correct-answer generation, edit-based taxonomy injection (exact character spans), typed question generation, a batched/resumable runner, and a document-source pipeline.
  • classify.py: an LLM that types an already-annotated span into the taxonomy — for sources that ship untyped spans (the inverse of the injector).
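To make "edit-based injection with exact character spans" concrete, here is a minimal sketch. It assumes the model proposes edits as (original, replacement) pairs listed in the order they occur in the answer. The `Edit` dataclass and `apply_edits` function are hypothetical names for illustration, not the pipeline's actual API.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    original: str     # exact substring of the correct answer
    replacement: str  # hallucinated text that takes its place
    label: str        # taxonomy category for this edit


def apply_edits(answer: str, edits: list[Edit]) -> tuple[str, list[dict]]:
    """Apply edits left to right; return the corrupted text and its spans.

    Spans are character offsets into the *corrupted* text, so
    text[start:end] is always exactly the replacement string.
    """
    pieces: list[str] = []
    spans: list[dict] = []
    read_pos = 0   # position in the original answer
    write_pos = 0  # position in the corrupted answer being built

    for edit in edits:
        start = answer.find(edit.original, read_pos)
        if start == -1:
            raise ValueError(f"edit target not found: {edit.original!r}")
        # Copy the untouched text before this edit.
        untouched = answer[read_pos:start]
        pieces.append(untouched)
        write_pos += len(untouched)
        # Record the span, then write the replacement.
        spans.append(
            {
                "start": write_pos,
                "end": write_pos + len(edit.replacement),
                "label": edit.label,
            }
        )
        pieces.append(edit.replacement)
        write_pos += len(edit.replacement)
        read_pos = start + len(edit.original)

    pieces.append(answer[read_pos:])
    return "".join(pieces), spans
```

Because the offsets are computed while the text is rebuilt, rather than by searching the corrupted answer afterwards, a replacement that happens to appear elsewhere in the answer cannot produce a wrong span.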

Datasets (KRLabsOrg/lettucedetect-code-hallucination, 79,591 samples)

  • Code (SWE-bench, 23,830) · tool output (11,365) · ACL papers (5,355) · GitHub READMEs (13,803) · Wikipedia (25,238).
  • Prose collection KRLabsOrg/lettucedetect-prose-hallucination (87,834): PsiloQA (natural, 14 languages) + RAGTruth, classified into the same taxonomy.

Prompt format

  • Each sample exposes context and question separately, and the prompt places the request first (User request: {question}\n\n{context}) so it is never lost when a long context is truncated. Backfilled existing data in place and updated all adapters.
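The effect of putting the request first can be sketched as follows. The template string comes from the description above; the function signature, the `legacy_prompt` layout (context first, request last), and the character-based `truncate` stand-in for tokenizer truncation are assumptions for illustration.

```python
def format_prompt(question: str, context: str) -> str:
    """Build the prompt question-first, using the template from the PR."""
    return f"User request: {question}\n\n{context}"


def legacy_prompt(question: str, context: str) -> str:
    """Approximation of the old layout, with the request at the end."""
    return f"{context}\n\nUser request: {question}"


def truncate(text: str, max_chars: int) -> str:
    """Crude stand-in for tokenizer truncation: keep the first max_chars."""
    return text[:max_chars]


question = "Which exception does load_config raise?"
context = "def load_config(path): ...\n" * 500  # long grounded context

# The request survives truncation only when it comes first.
kept = question in truncate(format_prompt(question, context), 512)
lost = question not in truncate(legacy_prompt(question, context), 512)
```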

Tooling

  • scripts/build_hf_dataset.py: reusable assembler that merges data/v2 sources into a DatasetDict and pushes (metadata serialized to JSON string).
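A simplified, standard-library-only sketch of the assembly step is shown below. The record layout (a `split` field and a `metadata` dict), the added `source` field, and the function name `assemble` are assumptions; the commit notes state that `dev` maps to `validation` and that the metadata dict is serialized to a JSON string.

```python
import json

# Map local split names to the names used in the pushed dataset.
SPLIT_NAMES = {"train": "train", "dev": "validation", "test": "test"}


def assemble(sources: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Merge per-source records into splits with string-typed metadata."""
    splits: dict[str, list[dict]] = {name: [] for name in SPLIT_NAMES.values()}
    for source_name, records in sources.items():
        for record in records:
            row = dict(record)  # copy so the input is not mutated
            split = SPLIT_NAMES[row.pop("split")]
            row["source"] = source_name
            # Each source carries different metadata keys; a JSON string
            # gives every row the same column type.
            row["metadata"] = json.dumps(row.get("metadata", {}), sort_keys=True)
            splits[split].append(row)
    return splits
```

Serializing metadata this way presumably sidesteps schema conflicts between sources whose metadata keys differ. The per-split row lists could then be handed to a dataset library before pushing.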

Repo hygiene

  • Removed 7 superseded prototype and one-off scripts (Groq/Kimi monoliths and helpers replaced by the modular pipeline); none were referenced by code, docs, or CI.
  • De-duplicated validation helpers: validator.py now imports the canonical span/coverage helpers from the injector instead of keeping drifted copies.
  • Fixed an output-dir redirect bug in config.set_output_dir (missing global INJECTION_FAILURES_PATH).
  • Ruff lint clean on the CI scope.
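The output-dir bug is an instance of a common Python pitfall: assigning to a module-level name inside a function without declaring it `global` creates a local variable instead. The sketch below reproduces the pattern in isolation. `OUTPUT_DIR`, `SPANS_PATH`, the file names, and `current_paths` are hypothetical; only `INJECTION_FAILURES_PATH` and `set_output_dir` are named in this PR.

```python
OUTPUT_DIR = "data"
SPANS_PATH = "data/spans.jsonl"  # hypothetical second path
INJECTION_FAILURES_PATH = "data/injection_failures.jsonl"


def set_output_dir_buggy(path: str) -> None:
    global OUTPUT_DIR, SPANS_PATH  # INJECTION_FAILURES_PATH is missing
    OUTPUT_DIR = path
    SPANS_PATH = f"{path}/spans.jsonl"
    # Without the global declaration this binds a local name that is
    # discarded on return, so the module-level path keeps its default.
    INJECTION_FAILURES_PATH = f"{path}/injection_failures.jsonl"


def set_output_dir_fixed(path: str) -> None:
    global OUTPUT_DIR, SPANS_PATH, INJECTION_FAILURES_PATH
    OUTPUT_DIR = path
    SPANS_PATH = f"{path}/spans.jsonl"
    INJECTION_FAILURES_PATH = f"{path}/injection_failures.jsonl"


def current_paths() -> dict:
    """Read the module-level values at call time."""
    return {
        "output_dir": OUTPUT_DIR,
        "spans": SPANS_PATH,
        "failures": INJECTION_FAILURES_PATH,
    }
```

Calling the buggy variant redirects two of the three paths and silently leaves the failures log at its default location, which matches the symptom described in the fix.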

Not in scope yet

  • Model training (encoder-first English, then multilingual), omission detection (question-side spans), and the difficulty gate are tracked separately.

adaamko added 17 commits May 20, 2026 22:04
Remove GLiNER2 schema-head detail from docs/taxonomy.md (too
implementation-specific for now); keep the modality-unification and
typed-output rationale. Ruff formatting on sample_assembler.

Introduce lettucedetect/generation/ with composable, source-agnostic
primitives for building hallucination-detection datasets:

- injection.py: universal, taxonomy-driven injector (sync + async) that
  corrupts a correct answer into a hallucinated one with exact character
  spans, modality-aware (code/tool_output/markdown/prose) across all 13
  subtypes.
- answers.py: grounded correct-answer generation (sync + async).
- runner.py: batched (asyncio.gather), resumable, failure-logging
  orchestration reused by every source adapter.

The code-hallucination Phase 6 injector now delegates edit application,
span location, and validation to this engine, keeping its own code prompt
and native labels. Verified byte-identical against the released dataset
(2000/2000 entries reproduced through the shared path).

Add the squeez tool-output adapter (scripts/generate_squeez_hallucinations.py)
and taxonomy category/subtype definitions. Document the taxonomy and
generation pipeline in the site nav.

The batched runner no longer writes a synthetic key field into output
records — resumability keys are derived from each record via a record_key
callable, so samples are written verbatim in the final schema.
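A minimal sketch of a batched, resumable runner built around a `record_key` callable might look like this. It assumes the same key can be derived from an input item and from its output record (here both carry an `id`). The function names and signatures are illustrative, and the failure logging mentioned in the commit is reduced to a comment.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable


def load_done_keys(path: Path, record_key: Callable[[dict], str]) -> set[str]:
    """Derive the keys of already-written records from the output file."""
    if not path.exists():
        return set()
    with path.open() as fh:
        return {record_key(json.loads(line)) for line in fh if line.strip()}


async def run_batched(
    items: list[dict],
    process: Callable[[dict], Awaitable[dict]],
    record_key: Callable[[dict], str],
    out_path: Path,
    batch_size: int = 8,
) -> int:
    """Process pending items in concurrent batches; return records written."""
    done = load_done_keys(out_path, record_key)
    pending = [item for item in items if record_key(item) not in done]
    written = 0
    with out_path.open("a") as fh:
        for i in range(0, len(pending), batch_size):
            batch = pending[i : i + batch_size]
            results = await asyncio.gather(
                *(process(item) for item in batch), return_exceptions=True
            )
            for result in results:
                if isinstance(result, Exception):
                    continue  # a real runner would log the failure here
                # Written verbatim: no synthetic key field is added.
                fh.write(json.dumps(result) + "\n")
                written += 1
            fh.flush()  # keep progress on disk after every batch
    return written
```

Since failed items never reach the output file, a rerun retries them automatically while skipping everything that already succeeded.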

Add lettucedetect/generation/assembly.py with balance_hallucination_ratio,
a reusable primitive that trims clean samples to a target hallucinated
ratio at assembly/upload time (no per-source munging scripts).
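The trimming idea can be sketched as below. Only the function name comes from the commit; the signature, the `labels` field used to tell hallucinated from clean samples, and the seeded random subsampling are assumptions.

```python
import random


def balance_hallucination_ratio(
    samples: list[dict], target_ratio: float, seed: int = 0
) -> list[dict]:
    """Drop clean samples until hallucinated / total reaches target_ratio.

    Hallucinated samples are never dropped, so a set that already meets
    the target keeps all of its samples.
    """
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    hallucinated = [s for s in samples if s["labels"]]
    clean = [s for s in samples if not s["labels"]]

    # From H / (H + C) >= r it follows that C <= H * (1 - r) / r.
    # The small epsilon guards against float rounding just below an integer.
    max_clean = int(len(hallucinated) * (1 - target_ratio) / target_ratio + 1e-9)

    rng = random.Random(seed)
    if len(clean) > max_clean:
        clean = rng.sample(clean, max_clean)
    balanced = hallucinated + clean
    rng.shuffle(balanced)
    return balanced
```

For example, with 10 hallucinated and 30 clean samples and a target of 0.5, the bound allows 10 clean samples, so 20 samples remain.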
Add lettucedetect/generation/questions.py — generates typed, self-contained
questions answerable from a document, driven by an 18-type taxonomy (adapted
from the acl-verbatim QA generator). Multi-part types are flagged as omission
candidates. Needed by the doc-only markdown sources (wiki, READMEs).

Extract the chat-completion-with-retries loop into generation/_completion.py
(sync + async, with a transform callback) and route answers, injection, and
questions through it, removing the duplicated retry boilerplate. Injection
output is unchanged (verified byte-identical against the released dataset).
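A retry loop with a transform callback could be sketched as follows (synchronous variant only). The name `complete_with_retries`, its parameters, and the backoff policy are assumptions; `call` stands in for whatever issues the chat-completion request, so no particular client library is implied.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def complete_with_retries(
    call: Callable[[], str],
    transform: Callable[[str], T],
    max_retries: int = 3,
    backoff_s: float = 0.0,
) -> T:
    """Call the model, then parse the reply; retry if either step raises."""
    last_error = None
    for attempt in range(max_retries):
        try:
            # A reply that fails to parse is retried like a failed request.
            return transform(call())
        except Exception as exc:
            last_error = exc
            if backoff_s and attempt < max_retries - 1:
                time.sleep(backoff_s * 2**attempt)  # exponential backoff
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

Running the transform inside the retry scope means callers such as answer generation, injection, and question generation differ only in the parser they pass in.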
Add the ACL adapter (scripts/generate_acl_hallucinations.py): group
acl-verbatim-spans by question, take the top-5 retrieved/gold chunks as
markdown context, generate a grounded answer, and inject a paper-specific
hallucination detectable against the excerpts.

Shared additions, reused by future markdown sources:
- per-edit hallucination types in apply_changes_to_answer (each edit labelled
  with its own type; falls back to the passed type, so existing sources are
  byte-identical)
- inject_menu / inject_menu_async: menu-mode injection where the model picks
  the fitting types, mapped to the taxonomy per source
- PAPER_MAP in taxonomy.py (NUMERICAL/ENTITY/RELATIONAL/METHODOLOGICAL/
  CITATIONAL -> unified categories)

Reusable tool that searches popular repos across languages via the GitHub
REST API, fetches and filters their READMEs (substantial, structured), and
writes a resumable JSONL corpus for the README markdown source. Needs
GITHUB_TOKEN for a usable rate limit; skips repos already collected.

Add generation/doc_source.py: the shared document-based flow (chunk by
heading -> typed question -> grounded answer -> menu injection -> assemble),
batched and resumable, with a generic factual markdown injection prompt as
the default. READMEs and (upcoming) Wikipedia are thin configs over it.
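The first step of that flow, chunking by heading, might be sketched like this. The function name, the returned dict layout, and the handling of fenced code blocks are assumptions for illustration rather than the actual `doc_source.py` behavior.

```python
import re

# ATX-style markdown headings: one to six '#' followed by whitespace.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split a markdown document into one chunk per heading section."""
    chunks: list[dict] = []
    heading = ""  # text before the first heading gets an empty heading
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        if heading or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are comments, not headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Tracking fence state matters for READMEs in particular, where shell snippets often contain `#` comments that would otherwise split a section in the middle of a code block.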

Add the README adapter (reads the collected README corpus, repo-level
train/dev/test split, developer-style question subset). The generic factual
injection suits heterogeneous README content far better than a dev-doc schema.
Repurpose MARKDOWN_MAP to the generic factual types.

Stream the English open-wikipedia-markdown parquet shards (the dataset script
is broken, so load parquet directly), sample substantial articles, and run the
shared doc-source pipeline with factual question types and the generic factual
injection. Add a shared hash_split helper for document-level train/dev/test
splitting (README now uses it too).
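Document-level splitting by hash can be sketched as follows. Only the helper name `hash_split` comes from the commit; the signature, the split fractions, and the choice of SHA-256 are assumptions.

```python
import hashlib


def hash_split(doc_id: str, dev_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Assign a document to train/dev/test from a stable hash of its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"
```

Keying on the document id (or the repo or paper id) means every chunk and question derived from one document lands in the same split, which is what prevents leakage. A cryptographic hash is used instead of Python's built-in `hash()` because string hashing is salted per process by default and would not be reproducible across runs.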
The source acl-verbatim test config has only ~17 answerable questions, so the
ACL test split was tiny (3 hallucinated). Pool all source questions and assign
train/dev/test by hashing the paper id (paper-separated, no leakage), giving a
real test set (~440 questions / ~117 hallucinated).

Update the generation pipeline doc to list the five sources now built
(code, tool-output, ACL, README, Wikipedia) with their modality, question
source, and injection mode.

Add classify.py: an LLM that types an already-annotated span into the unified
taxonomy, for sources that ship untyped spans (inverse of the injector). Use it
to fold PsiloQA (natural, multilingual hallucinations) into the taxonomy via
classify_psiloqa_spans.py; RAGTruth maps mechanically. Add build_hf_dataset.py,
a reusable assembler that merges data/v2 sources into a DatasetDict and pushes
it (dev->validation, metadata dict serialized to a JSON string).

The baked prompt put the user request last ("...User request: {q}"), where
truncation=only_first clips it on long inputs, and never exposed context or
question separately. Add context+question fields to HallucinationSample, a shared
format_prompt() that builds the prompt question-first (truncation-safe), and wire
every adapter to emit context/question. Add canonicalize_prompts.py to backfill
existing data/v2 in place (idempotent, no LLM). Pack Wikipedia heading sections
into larger chunks so contexts are no longer too short.

adaamko force-pushed the code_hallucination branch from a265c45 to 1080d52 on June 2, 2026 07:46
adaamko self-assigned this Jun 2, 2026
adaamko added 3 commits June 2, 2026 10:18
These were early monolithic prototypes (Groq/Kimi) and one-off helpers
superseded by the modular scripts/code_hallucination/ pipeline and
lettucedetect/generation/. None were referenced by code, docs, or CI.

- validator.py now imports _extract_code_regions / _span_is_in_code /
  _max_allowed_coverage from the canonical injector instead of keeping drifted
  copies (behavior preserved: long-answer cap stays 0.30)
- config.set_output_dir was missing 'global INJECTION_FAILURES_PATH', so
  redirecting the output dir silently left that one path at the default
- drop a dead local variable in the injector's sequential path