chimera: a single-executable / library which combines llama.cpp, whisper.cpp, and stable-diffusion.cpp #1543

shakfu · 2026-05-22T04:57:52Z

chimera is a single statically-linked C++ executable that bundles llama.cpp, whisper.cpp, stable-diffusion.cpp, SQLite, and sqlite-vec into a busybox-style multitool. The same binary handles text generation, interactive chat with persistent history, speech-to-text, text-to-image, a personal RAG / vector store, and an OpenAI-compatible HTTP server exposing all three inference capabilities at once — all sharing a single ggml backend set and one SQLite database.
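To make "busybox-style multitool" concrete, here is a minimal sketch of subcommand dispatch in one binary. It is not chimera's actual code: the `Handler` type, the `run_gen` / `run_info` placeholders, the `dispatch` function, and the exit code 2 for usage errors are all assumptions made for illustration. Only the subcommand names `gen` and `info` come from the post.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical handler signature: receives the arguments that follow the
// subcommand name and returns a process exit code.
using Handler = std::function<int(const std::vector<std::string>&)>;

// Placeholder handlers. A real multitool would call into the matching
// engine here; these only report what they were asked to do.
static int run_gen(const std::vector<std::string>& args) {
    std::cout << "gen: received " << args.size() << " argument(s)\n";
    return 0;
}

static int run_info(const std::vector<std::string>&) {
    std::cout << "info: would report the linked ggml backend\n";
    return 0;
}

// One table maps subcommand names to handlers, so adding a tool means
// adding one entry rather than shipping another executable.
static const std::map<std::string, Handler>& commands() {
    static const std::map<std::string, Handler> table = {
        {"gen", run_gen},
        {"info", run_info},
    };
    return table;
}

// Looks up args[1] in the table and forwards the remaining arguments.
// Returns the handler's exit code, or 2 (an arbitrary choice for this
// sketch) when the subcommand is missing or unknown.
int dispatch(const std::vector<std::string>& args) {
    if (args.size() < 2) {
        std::cerr << "usage: <binary> <subcommand> [args...]\n";
        return 2;
    }
    const auto it = commands().find(args[1]);
    if (it == commands().end()) {
        std::cerr << "unknown subcommand: " << args[1] << "\n";
        return 2;
    }
    assert(it->second && "every table entry needs a callable handler");
    const std::vector<std::string> rest(args.begin() + 2, args.end());
    return it->second(rest);
}
```

A `main` would copy `argv` into a `std::vector<std::string>` and return `dispatch(args)`. Because every handler lives in the same process, shared state such as a backend set or a database handle can be initialized once and passed to whichever handler runs.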

If you want the same capabilities from Python instead of a native binary, see cyllama — chimera's sibling project, which exposes llama.cpp / whisper.cpp / stable-diffusion.cpp as Cython bindings with a high-level Python API.

The same build also produces libchimera.a, a redistributable static library that hosts the engines and the OpenAI-compatible HTTP server. Other C++ projects can link it directly to embed text generation, embeddings, transcription, image generation, RAG, and the HTTP server inside their own process — without forking chimera or shelling out to the chimera binary.

Who it's for

chimera targets CLI-first users who run more than one ggml-backed modality (text + audio + image) and want them sharing one process, one ggml backend set, one SQLite database, and one OpenAI-compatible HTTP surface — rather than running, configuring, and gluing together three separate servers. It is most useful when:

You want faithful upstream flag coverage (gen, chat, embed expose most llama.cpp sampler / RoPE / YaRN / multi-GPU / cache / adapter flags directly), not a curated subset.
You distribute a single static binary across machines and don't want a Python or Node runtime on the target host.
You build against multiple ggml backends (CPU, CUDA, ROCm, SYCL, Vulkan, Metal) from the same source tree, and want to verify the linked backend with chimera info rather than runtime probing.
You're building on top of the HTTP server and need text, audio, image, embeddings, RAG, and chat-history routes in one origin.
You're a C++ embedder who wants to drive llama.cpp / whisper.cpp / sd.cpp from your own process without reimplementing the load-and-run scaffolding. Linking libchimera.a (and optionally #include "chimera.hpp" for the persistent-handle OOP layer) gives you the same model lifecycle, sampler wiring, and HTTP-server code paths the chimera binary uses.

martinbu69 · 2026-06-12T18:21:49Z

Just realized this is similar to a proposal I submitted a second ago:
#1642

0 replies

shakfu · 2026-06-12T22:41:45Z

Author

Thanks @martinbu69

I got to chimera after working on cyllama, a Cython wrapper of the .cpp libraries, where I discovered that all three used ggml versions that were quite close. As a possible packaging-size optimization, I had the other two libraries drop their own ggml copies and link against llama.cpp's ggml library, and it worked. Once it became clear that this was not a one-off, chimera was the next logical step.

3 replies

martinbu69 Jun 13, 2026

Nice. I've made an ollama-interface-compatible proxy with a model-selection / intent-based matrix (and a router model to decide on the intent), plus a Python/Jupyter interface based on stable-diffusion-wrapper, to be able to experiment with some "story-telling" photobook/video generator agents. Then I realized that I had to patch a couple of things to make it all work, and that there are many redundancies in sd.cpp that are already solved in llama.cpp. This made me wonder whether those efforts could be fused so that it's all more modular. I don't want to go the ComfyUI route for multiple reasons.

shakfu Jun 13, 2026
Author

Interesting journey to the same place. 100% agree it's better to keep things as minimal as possible. The challenge is keeping them working together even though that is not a priority for each individual project. It's great when it works, but you have to be patient for upstream fixes when things go out of sync, especially with the platform/GPU variants.

martinbu69 Jun 13, 2026

Unfortunately, it's a bit asymmetric: llama.cpp can always live without sd.cpp, but sd.cpp cannot live without text encoding.
For me, a single source of GGUF-based processing with the ability to build my own pipes, embeddings, and so on would be really compelling, and should also be in the interest of Hugging Face and llama.cpp.
