(retriever) Add VLM image captioning via vLLM by edknv · Pull Request #1660 · NVIDIA/NeMo-Retriever

edknv · 2026-03-19T16:57:03Z

Description

Add a .caption() pipeline stage to both batch and in-process ingestors that generates text descriptions for extracted images using a VLM (Nemotron Nano 12B v2 VL via vLLM locally, or a remote NIM endpoint).
Use nv-ingest-api's extract_image_like_objects_from_pdfium_page during PDF extraction to detect, merge, and crop image-like objects (images, shapes, forms) from each page into the images column.
The caption stage filters out small images (< 32px), sends the remaining images to the VLM, and writes the captions back as images[i]["text"]. It can optionally prepend surrounding page text to the VLM prompt, capped at context_text_max_chars characters.
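The caption flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `caption_images`, `build_prompt`, the `captioner` callable, the `width`/`height` keys, and the prompt wording are all hypothetical, and the sketch assumes an image counts as "small" when either side is under 32 px. Only the 32 px threshold, the `images[i]["text"]` write-back, and the `context_text_max_chars` name come from the description.

```python
from typing import Any, Callable, Dict, List

# Threshold taken from the PR description; how it is applied is an assumption.
MIN_IMAGE_SIDE_PX = 32


def build_prompt(base_prompt: str, page_text: str, context_text_max_chars: int) -> str:
    """Optionally prepend truncated page text to the VLM prompt (hypothetical helper)."""
    if not page_text or context_text_max_chars <= 0:
        return base_prompt
    context = page_text[:context_text_max_chars]
    return f"Page context:\n{context}\n\n{base_prompt}"


def caption_images(
    images: List[Dict[str, Any]],
    captioner: Callable[[Dict[str, Any], str], str],
    page_text: str = "",
    context_text_max_chars: int = 0,
    base_prompt: str = "Describe this image.",
) -> List[Dict[str, Any]]:
    """Caption every image that is large enough and store the result under "text".

    `captioner` stands in for the VLM call (local vLLM or a remote endpoint).
    """
    prompt = build_prompt(base_prompt, page_text, context_text_max_chars)
    for image in images:
        # Assumption: skip the image if either dimension is below the threshold.
        if min(image["width"], image["height"]) < MIN_IMAGE_SIDE_PX:
            continue
        image["text"] = captioner(image, prompt)
    return images
```

Passing the VLM call in as a plain callable keeps the filtering and prompt-building logic testable without a GPU or a running endpoint.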

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

edknv and others added 24 commits March 19, 2026 09:56

(retriever) Add VLM image captioning via vLLM

74b7125

Merge branch 'main' into edwardk/retriever-image-caption

db95dc5

revert fix pyproject.toml

4c99cca

add batch mode

c601e7f

build endpoint working

cca5001

add context window

1384c6f

update readme

8ba2c81

Merge branch 'main' into edwardk/retriever-image-caption

d09f014

install vllm wheels for cu130 support

06e5d8e

pin vllm to exact match

58fe381

cache model globally

f90de97

set gpu memory utilization

2a3df58

set caption batch size

1306f2b

remove batch size arg

858f7ca

skip loading ocr

5a2e0fd

use fractional gpu

8309207

filter out small images

564d72c

updates

a921382

updates

b0b4475

fix tests

e6cb852

simplify

ae08679

Merge branch 'main' into edwardk/retriever-image-caption

3450347

consistent default gpu mem util

92779f8

Merge branch 'main' into edwardk/retriever-image-caption

13615df

edknv requested a review from jperez999 March 24, 2026 05:03

edknv marked this pull request as ready for review March 24, 2026 05:03

edknv requested review from a team as code owners March 24, 2026 05:03

edknv added 2 commits March 24, 2026 09:34

lint

472ce22

Merge branch 'edwardk/retriever-image-caption' of github.com:edknv/nv…

b7f10fa

…-ingest into edwardk/retriever-image-caption

jperez999 approved these changes Mar 24, 2026

jperez999 merged commit e93a04f into NVIDIA:main Mar 24, 2026
4 of 7 checks passed

jperez999 merged 26 commits into NVIDIA:main from edknv:edwardk/retriever-image-caption
