(retriever) Add VLM image captioning via vLLM #1660

Merged
jperez999 merged 26 commits into NVIDIA:main from edknv:edwardk/retriever-image-caption on Mar 24, 2026

@edknv edknv commented Mar 19, 2026

Description

  • Add a .caption() pipeline stage to both batch and in-process ingestors that generates text descriptions for extracted images using a VLM (Nemotron Nano 12B v2 VL via vLLM locally, or a remote NIM endpoint).
  • Use nv-ingest-api's extract_image_like_objects_from_pdfium_page during PDF extraction to detect, merge, and crop image-like objects (images, shapes, forms) from each page into the images column.
  • The caption stage filters out small images (< 32px), sends the remaining images to the VLM, and writes the captions back as images[i]["text"]. It can optionally prepend surrounding page text to the VLM prompt, controlled by context_text_max_chars.
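
The caption stage's filter, prompt, and write-back flow can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `caption_images`, `build_prompt`, `caption_fn`, the `width`/`height` keys, and the default prompt are all hypothetical names, and it assumes the 32px threshold applies to the smaller image dimension and that `context_text_max_chars` truncates the page text before it is prepended.

```python
from typing import Any, Callable

# Threshold taken from the PR description; applying it to the smaller
# dimension is an assumption made for this sketch.
MIN_IMAGE_DIM_PX = 32


def build_prompt(base_prompt: str, page_text: str, context_text_max_chars: int) -> str:
    """Prepend up to `context_text_max_chars` of page text to the prompt."""
    if context_text_max_chars <= 0 or not page_text:
        return base_prompt
    context = page_text[:context_text_max_chars]
    return f"{context}\n\n{base_prompt}"


def caption_images(
    images: list[dict[str, Any]],
    page_text: str,
    caption_fn: Callable[[dict[str, Any], str], str],
    base_prompt: str = "Describe this image.",  # hypothetical default prompt
    context_text_max_chars: int = 0,
) -> list[dict[str, Any]]:
    """Caption every sufficiently large image and store the result in image["text"]."""
    prompt = build_prompt(base_prompt, page_text, context_text_max_chars)
    for image in images:
        # Skip images that are too small to be worth sending to the VLM.
        if min(image["width"], image["height"]) < MIN_IMAGE_DIM_PX:
            continue
        # caption_fn stands in for the VLM call (local vLLM or a remote endpoint).
        image["text"] = caption_fn(image, prompt)
    return images
```

Passing the model call in as `caption_fn` keeps the sketch independent of whether the VLM runs locally or behind a remote endpoint; a test can substitute a simple stub.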

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@edknv edknv requested a review from jperez999 March 24, 2026 05:03
@edknv edknv marked this pull request as ready for review March 24, 2026 05:03
@edknv edknv requested review from a team as code owners March 24, 2026 05:03
@jperez999 jperez999 merged commit e93a04f into NVIDIA:main Mar 24, 2026
4 of 7 checks passed