
server: support image+text input for embeddings (Qwen3-VL-Embedding)#18665

Closed
ngxson wants to merge 1 commit into ggml-org:master from
ngxson:xsn/qwen3_vl_embd


@ngxson
Contributor

@ngxson ngxson commented Jan 7, 2026

Target support: https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B

Important

The original Qwen3-VL-Embedding model is missing 1_Pooling, so it currently cannot be converted to GGUF correctly. I don't think it is actually ready to be used until the Qwen team fixes this (I have already reached out to them, but got no response).

This PR aims to support mixed text+image input (and possibly audio input, for models that support it) using an OAI-compatible, content-style schema:

{
    "input": [
        {
            "type": "text",
            "text": "mixed text and image input"
        },
        {
            "type": "image",
            "image_url": {
                "url": "https://huggingface.co/ggml-org/tinygemma3-GGUF/resolve/main/test/11_truck.png"
            }
        }
    ]
}
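
As a minimal sketch of how a client might assemble a request body in this schema: the helper name `build_embedding_request` is hypothetical and only for illustration, and whether a given server build accepts this body depends on the changes proposed here being available.

```python
import json


def build_embedding_request(text: str, image_url: str) -> dict:
    """Build a request body using the content-style schema shown above.

    Hypothetical helper: the name and signature are not part of llama.cpp.
    """
    return {
        "input": [
            {"type": "text", "text": text},
            {"type": "image", "image_url": {"url": image_url}},
        ]
    }


body = build_embedding_request(
    "mixed text and image input",
    "https://huggingface.co/ggml-org/tinygemma3-GGUF/resolve/main/test/11_truck.png",
)

# Serialize to JSON, ready to be sent as the body of an HTTP POST.
payload = json.dumps(body, indent=4)
print(payload)
```

The resulting string could then be POSTed to the server's embeddings endpoint with any HTTP client. The exact route and the server's handling of image parts are assumptions here, not something this sketch verifies.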

@ggerganov
Member

When you convert the model, try to add --sentence-transformers-dense-modules:

llama.cpp/convert_hf_to_gguf.py

Lines 10974 to 10981 in 294b2b4

    parser.add_argument(
        "--sentence-transformers-dense-modules", action="store_true",
        help=("Whether to include sentence-transformers dense modules."
              "It can be used for sentence-transformers models, like google/embeddinggemma-300m"
              "Default these modules are not included.")
    )
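
For readers unfamiliar with `store_true` flags, here is a small standalone sketch of how an option declared this way behaves. It uses a throwaway parser rather than the real `convert_hf_to_gguf.py`, and the shortened help string is an illustration only.

```python
import argparse

# Throwaway parser that mirrors only the flag quoted above.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--sentence-transformers-dense-modules", action="store_true",
    help="Whether to include sentence-transformers dense modules.",
)

# Without the flag the value defaults to False; passing it sets True.
off = parser.parse_args([])
on = parser.parse_args(["--sentence-transformers-dense-modules"])

# argparse converts dashes to underscores for the attribute name.
print(off.sentence_transformers_dense_modules)
print(on.sentence_transformers_dense_modules)
```

In other words, the dense modules are skipped unless the flag is given explicitly on the conversion command line.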

@CISC
Member

CISC commented Jan 7, 2026

IIRC they initially forgot to add 1_Pooling on other embedding models too. Since this one is not public yet, maybe ask them about it?

@ngxson ngxson changed the title server: support image+text input for embeddings (Qwen3-VL-Embedding) server: support image+text input for embeddings Jan 7, 2026
@ngxson
Contributor Author

ngxson commented Jan 7, 2026

Oh sorry, I didn't notice that it's private 😅 Temporarily closing this to keep it under the radar.

@ngxson ngxson closed this Jan 7, 2026
@ngxson ngxson reopened this Jan 8, 2026
@ngxson ngxson changed the title server: support image+text input for embeddings server: support image+text input for embeddings (Qwen3-VL-Embedding) Jan 8, 2026
@Tokimorphling

I've fixed the Qwen3-VL-Embedding issues in llama.cpp and verified the fix with regression tests. Check out the code here: https://github.com/Tokimorphling/qwen3-vl-embedding

The implementation of the Qwen3-VL series in llama.cpp seems to be problematic.

@ngxson ngxson closed this Apr 22, 2026