Conversation

dhdaines commented Jan 4, 2026

Applies on top of #2108, which has the necessary changes to MTMD.

This adds a chat format to support https://huggingface.co/ggml-org/granite-docling-258M-GGUF and derivatives. It should work with SmolVLM and SmolDocling as well.

To use these models effectively, special tokens must be enabled in the chat completion output, so this also adds a `special` flag to all of the chat completion functions, matching what `--special` does in llama-cli (it is enabled by default in llama-mtmd-cli).
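
A minimal usage sketch, assuming the llama-cpp-python high-level API; the `special` keyword name follows the description above, and the model filename is a hypothetical local copy of the GGUF repo:

```python
from llama_cpp import Llama

# Hypothetical local filename downloaded from ggml-org/granite-docling-258M-GGUF.
llm = Llama(model_path="granite-docling-258M-Q8_0.gguf")

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Convert this page to DocTags."}],
    special=True,  # assumed keyword: emit special tokens, like --special in llama-cli
)
print(out["choices"][0]["message"]["content"])
```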

dhdaines marked this pull request as ready for review January 4, 2026 18:05
dhdaines (Author) commented Jan 4, 2026

Ready for review!

dhdaines force-pushed the granite-docling branch 3 times, most recently from 3a04e21 to 8790ce6 on January 6, 2026 19:55
Ralf Waldukat and others added 4 commits January 13, 2026 09:47
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field
- Fix (critical): kv_cache_seq_rm must preserve seq_id=-1 semantics (all sequences)
  * The wrapper was incorrectly converting -1 to 0, breaking context rewind
  * This caused 'discontinuity' errors on multi-turn conversations
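
A hedged sketch of the corrected behavior; the binding symbols (`llama_get_memory`, `llama_memory_seq_rm`) are assumptions based on the memory API mentioned below, and the point is simply that -1 must be passed through rather than clamped to 0:

```python
import llama_cpp  # low-level bindings; exact symbol names are assumed

def kv_cache_seq_rm(ctx, seq_id: int, p0: int, p1: int) -> bool:
    # seq_id == -1 means "all sequences"; p0 == -1 / p1 == -1 mean an
    # open-ended position range. Clamping seq_id to 0 silently turned
    # "remove from every sequence" into "remove from sequence 0", which
    # broke context rewind and caused 'discontinuity' errors on
    # multi-turn conversations.
    return llama_cpp.llama_memory_seq_rm(
        llama_cpp.llama_get_memory(ctx), seq_id, p0, p1
    )
```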

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)
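
For illustration, a small sketch of how the old bool flag maps onto the new enum. The values come from the list above; the constant names are assumptions mirroring llama.cpp's LLAMA_FLASH_ATTN_TYPE_* enum:

```python
from enum import IntEnum
from typing import Optional

class LlamaFlashAttnType(IntEnum):
    AUTO = -1      # let the backend decide (new server default: flash_attn=None)
    DISABLED = 0   # old flash_attn=False
    ENABLED = 1    # old flash_attn=True

def flash_attn_to_enum(flash_attn: Optional[bool]) -> LlamaFlashAttnType:
    """Conceptual backward-compat mapping: bool (or None) -> flash_attn_type."""
    if flash_attn is None:
        return LlamaFlashAttnType.AUTO
    return LlamaFlashAttnType.ENABLED if flash_attn else LlamaFlashAttnType.DISABLED
```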

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params