feat(coglet): IPC bridge enhancements for file outputs, large payloads, and upload URL#2746

Open
tempusfrangit wants to merge 18 commits into main from feat/coglet-bridge-enhancements

Conversation

tempusfrangit (Member) commented on Feb 19, 2026

Summary

This branch addresses several interrelated issues in the coglet IPC bridge that caused file-type prediction outputs to be silently lost, large outputs to poison worker slots, and upload URLs to be ignored.

Key changes

  • File output routing through IPC: Generator yields of cog.Path and io.IOBase are now detected in the worker, written to disk, and sent as FileOutput messages over the bridge. The orchestrator reads them back, either uploading to a signed endpoint (when --upload-url is set) or base64-encoding as data URIs.

  • Large output spilling: Prediction outputs exceeding the 6 MiB IPC limit (chosen to stay safely under the 8 MiB frame limit) are automatically spilled to disk and sent as FileOutput(Oversized) references. The orchestrator reconstructs the original JSON value transparently. This prevents the "frame size too big" panic that previously poisoned slots.

  • Output decoupled from Done: The worker now sends output as a separate SlotResponse::Output message before Done, rather than embedding it in the Done payload. The orchestrator builds the final output from accumulated per-yield values. This fixes generators returning empty output arrays and removes the frame size constraint from Done messages entirely.

  • Upload URL wiring: --upload-url is accepted by cog serve, threaded through the Python entrypoint to coglet's orchestrator, and used for async PUT uploads of file outputs. Each upload runs in a spawned task; the Done handler awaits all pending uploads before marking the prediction as succeeded.

  • Log size enforcement: Worker logs are truncated at 4 MiB to stay under the 8 MiB bridge frame limit.

  • Stub generation and venv setup: Fixed mise tasks for Python type stub generation, venv creation, and editable coglet installs.
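The spill decision described above can be sketched roughly as follows. This is a minimal Python illustration only; the constant and function names are ours, not coglet's actual Rust identifiers:

```python
import json
import os
import tempfile

IPC_LIMIT = 6 * 1024 * 1024  # spill threshold, kept below the 8 MiB frame limit

def build_output_message(value, out_dir):
    """Return an inline Output message, or spill to disk and return a
    FileOutput(Oversized) reference when the JSON payload is too large."""
    payload = json.dumps(value)
    if len(payload.encode("utf-8")) <= IPC_LIMIT:
        return {"type": "Output", "value": value}
    # Too big for a single IPC frame: spill the JSON to disk and send
    # a path reference instead of the inline value.
    fd, path = tempfile.mkstemp(dir=out_dir, suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return {"type": "FileOutput", "kind": "Oversized", "path": path}
```

The key property is that the size check happens on the serialized payload, so the receiving side can deserialize the spilled file and recover the exact original value.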

Integration tests

Three new integration tests verify the end-to-end behavior:

| Test | What it covers |
| --- | --- |
| `coglet_large_output` | 9 MiB string output that would poison the slot without spilling |
| `coglet_iterator_upload_url` | Generator yielding `Path` objects uploaded per-yield to `--upload-url` |
| `coglet_iterator_path_output` | Generator yielding `Path` objects returned as base64 data URIs |

Files changed (highlights)

| Area | Files | Description |
| --- | --- | --- |
| Bridge protocol | `bridge/protocol.rs` | `FileOutput` variant with `Oversized` / `FileType` kinds, optional `mime_type` |
| Worker | `worker.rs` | `SlotSender` file output methods, `build_output_message` spill logic, output sent separately from `Done` |
| Orchestrator | `orchestrator.rs` | `FileOutput` handling (upload or base64), upload barrier for `Done`, accumulated output reconstruction |
| Prediction | `prediction.rs` | `take_outputs()` for draining accumulated per-yield values |
| Python bindings | `predictor.rs` | `send_output_item` / `send_file_output` for `Path` and `IOBase` detection in generators |
| CLI | `serve.go` | `--upload-url` flag with `host.docker.internal` networking |
| Docker | `command.go`, `docker.go` | `ExtraHosts` support in `RunOptions` |
| Test harness | `harness.go` | Mock upload server with `upload-server-start` / `upload-server-count` commands |

Test plan

  • `mise run test:rust` -- 109 tests pass
  • `mise run test:integration coglet_large_output` -- 9 MiB output spills and reconstructs correctly
  • `mise run test:integration coglet_iterator_upload_url` -- per-yield uploads to mock server
  • `mise run lint:rust` -- clean
  • `golangci-lint run ./...` (v2.8.0) -- clean

closes: #2739

This commit truncates log lines at 4 MiB of data and appends a
truncation notice to the line. This protects against the panic/slot
poisoning that occurs when a frame exceeds the codec-configured
(8 MiB) size. Any log line over 1 MiB borders on useless; to preserve
even the most extreme log lines without disruption, we settled on a
4 MiB limit.
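A sketch of that truncation in Python (the constant and notice text are illustrative, not coglet's actual values):

```python
MAX_LOG_BYTES = 4 * 1024 * 1024  # stay well under the 8 MiB bridge frame limit
NOTICE = " [truncated]"          # hypothetical notice text

def truncate_log_line(line: str) -> str:
    """Cap a log line at MAX_LOG_BYTES, appending a truncation notice."""
    data = line.encode("utf-8")
    if len(data) <= MAX_LOG_BYTES:
        return line
    keep = data[: MAX_LOG_BYTES - len(NOTICE)]
    # errors="ignore" drops any partial multi-byte character at the cut point
    return keep.decode("utf-8", errors="ignore") + NOTICE
```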
Outputs over 6 MiB are spilled to disk and sent as FileOutput path
references over IPC instead of inline, avoiding the 8 MiB
LengthDelimitedCodec frame limit. Coglet creates per-prediction
output dirs at /tmp/coglet/outputs/{prediction_id}/ and passes
the path to the worker via SlotRequest::Predict.

When the worker spills output to disk (FileOutput), the orchestrator
reads it back and integrates it into the prediction via append_output.
This handles both the Oversized (JSON spill) and FileType (path
reference) variants.

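The directory layout and read-back step can be illustrated with a small Python sketch (function names and the root path default are ours; only the /tmp/coglet/outputs/{prediction_id}/ convention comes from the text above):

```python
import json
import os

def output_dir(prediction_id, root="/tmp/coglet/outputs"):
    """Per-prediction spill directory, created before Predict is dispatched."""
    path = os.path.join(root, prediction_id)
    os.makedirs(path, exist_ok=True)
    return path

def read_back(file_output):
    """Orchestrator side: rehydrate an Oversized spill into its JSON value.
    FileType references are handled by the upload/base64 path instead."""
    if file_output["kind"] == "Oversized":
        with open(file_output["path"]) as f:
            return json.load(f)
    return file_output["path"]
```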
Generator outputs were collected into a Vec and bundled into a single
Done frame, which could exceed the 8MiB IPC limit. Now each yield
is sent immediately via slot_sender.send_output(), streaming through
SlotResponse::Output frames. The Done frame carries an empty array
for generators since outputs were already streamed.

Plumbs slot_sender from PythonPredictHandler::predict through to
process_generator_output and process_async_result for both predict
and train paths.
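The per-yield streaming shape described above, sketched in Python (the frame dicts stand in for the Rust SlotResponse variants):

```python
def stream_outputs(gen, send):
    """Send each yield as its own Output frame; Done carries an empty
    array because the per-yield values were already streamed."""
    count = 0
    for value in gen:
        send({"type": "Output", "value": value})
        count += 1
    send({"type": "Done", "outputs": []})
    return count
```

Because each yield becomes its own frame, no single frame grows with the total number of outputs, which is what removes the 8 MiB constraint from generators.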
File-type outputs (os.PathLike, io.IOBase) are now detected in the
worker and written to the per-prediction output dir instead of being
base64-encoded in-process. This keeps network I/O out of the worker
subprocess — the parent process handles uploads.

- Add SlotSender::write_file_output(bytes, ext) for language-agnostic
  file writing from any FFI worker (Python, Node, etc.)
- Add send_output_item() helper that routes by type: PathLike sends
  existing path, IOBase reads bytes and writes via SlotSender,
  everything else goes through the existing JSON serialization path
- Update process_single_output to also detect and route file types
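The type-based routing in send_output_item can be sketched as follows (a loose Python analogue; the tuple tags are ours, and the real implementation writes through SlotSender rather than returning values):

```python
import io
import json
import os

def route_output_item(item):
    """Classify a yielded value the way the routing above describes:
    PathLike -> existing file path, IOBase -> raw bytes to be written
    out as a file, anything else -> the JSON serialization path."""
    if isinstance(item, os.PathLike):
        return ("path", os.fspath(item))
    if isinstance(item, io.IOBase):
        return ("bytes", item.read())
    return ("json", json.dumps(item))
```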
The orchestrator's FileOutput handler now distinguishes between
FileOutputKind::Oversized (JSON spill, deserialize as before) and
FileOutputKind::FileType (binary file, base64-encode as data URI).

This is the fallback path when no upload_url is configured. File
outputs from the worker are read from disk and encoded as
data:{mime};base64,{data} URIs using mime_guess for content type
detection.
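A minimal Python sketch of that fallback encoding, with the stdlib mimetypes module standing in for the Rust mime_guess crate:

```python
import base64
import mimetypes

def to_data_uri(data, filename, mime_type=None):
    """Encode file bytes as data:{mime};base64,{data}, guessing the MIME
    type from the extension when none was supplied."""
    mime = mime_type or mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")
```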
Add mime_type: Option<String> to SlotResponse::FileOutput so FFI
workers can pass an explicit MIME type. When None, the orchestrator
falls back to mime_guess from the file extension (matching old cog
behavior). Plumbing is in place for future use — all current callers
pass None.
Thread --upload-url CLI arg through coglet.server.serve() to the
orchestrator event loop. When set, file outputs are PUT to the signed
endpoint per-yield as received (matching old Python cog behavior).

- upload_file() does PUT with Content-Type, extracts Location header,
  strips query params, follows redirects via reqwest
- Uploads are spawned as tokio tasks so they don't block the event loop
- Done handler awaits pending uploads before finalizing the prediction
- Failed/Cancelled/Error handlers abort pending uploads immediately
- Falls back to base64 data URI encoding when no upload_url is set
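The Location-header normalization mentioned above (strip query params before recording the output URL) can be sketched in Python; the function name is ours:

```python
from urllib.parse import urlsplit, urlunsplit

def final_output_url(location):
    """Drop query params (e.g. signing tokens) from an upload Location
    header, keeping only scheme, host, and path."""
    parts = urlsplit(location)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```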
- Root .venv is now created by _setup_venv task (used by build:coglet,
  build:coglet:wheel, build:sdk, generate:stubs, stub:check)
- _.python.venv uses REPO_ROOT so all tasks share the same venv
- mise clean:python removes .venv, coglet-python/.venv, and *.so
- stub_gen.rs: write coglet/_impl.pyi (not __init__.pyi) so the native
  module stubs don't get overwritten by mypy stubgen
- module_variable! declarations for __version__, __build__, server so
  ty can resolve members imported from coglet._impl
- stub:check depends on generate:stubs to avoid duplicating logic
- #[allow(clippy::too_many_arguments)] on serve_impl (8 args after upload_url)
…verflow

Output is now always sent as a separate message before Done, with automatic
spill-to-disk for values exceeding the 6MiB IPC frame limit. The orchestrator
reconstructs the final output from accumulated per-yield values (generators,
file uploads) rather than from the Done message payload.

This fixes three issues:
- Large single outputs (>8MiB) caused "frame size too big" panics that
  poisoned slots, since the entire output was embedded in the Done message
- Generator/iterator outputs were lost because per-yield values accumulated
  in outputs() were ignored by the Done handler in favor of the empty
  Stream(vec![]) from the worker
- File upload outputs (Path, IOBase) yielded from generators were silently
  dropped for the same reason

Also adds --upload-url flag to `cog serve`, mock upload server to the
integration test harness, and three new integration tests:
- coglet_large_output: 9MiB string output (would poison slot without spill)
- coglet_iterator_upload_url: generator Path yields uploaded to --upload-url
- coglet_iterator_path_output: generator Path yields as base64 data URIs
tempusfrangit force-pushed the feat/coglet-bridge-enhancements branch from 9e39dc4 to c394d6d on February 19, 2026 at 00:16
…tting

cog predict writes file outputs to disk, not as base64 JSON to stdout.
Fix assertion to check stderr for "Written output to" messages.
tempusfrangit force-pushed the feat/coglet-bridge-enhancements branch from c394d6d to 1390b59 on February 19, 2026 at 00:33
tempusfrangit (Member, Author) commented:

The large output spill is deadlocking. Digging into the specifics so we can get this landed.

The Done handler unconditionally used tokio::spawn to wait for pending
uploads before calling set_succeeded(). This introduced a race with
service.rs: Notify::notified() could miss the wakeup if the spawned
task called notify_waiters() before the service registered its waiter,
causing the prediction to hang indefinitely.

Split the Done handler into two paths:
- No uploads: call set_succeeded() synchronously in the event loop
- Has uploads: spawn a task to await them (preserves existing behavior)

This restores the synchronous notification path for the common case
and fixes the CI hang in the coglet_large_output integration test.
notify_waiters() only wakes currently-registered waiters. In the
predict flow, service.rs checks is_terminal() then awaits on a
separate Notify — if the orchestrator fires notify_waiters() between
those two steps, the notification is lost and the prediction hangs.

notify_one() stores a permit that a future .notified().await consumes
immediately, closing the race window entirely. There is exactly one
waiter per prediction so notify_one is semantically correct.
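The permit-vs-broadcast distinction can be illustrated with a loose asyncio analogy. asyncio.Event is sticky rather than a single consumable permit, so this is not a faithful model of tokio::sync::Notify, but it shows the property that matters: a signal fired before the waiter registers must not be lost.

```python
import asyncio

async def demo():
    # Latch-like signaling: setting the event before the waiter registers
    # is not lost, analogous to notify_one() storing a permit that the
    # next .notified().await consumes immediately.
    done = asyncio.Event()
    done.set()                                        # signaler fires first
    await asyncio.wait_for(done.wait(), timeout=1.0)  # waiter still wakes
    return True
```

With a broadcast-style wakeup (notify_waiters), the same "signal first, wait second" ordering would hang, which is exactly the race described above.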
…A_CERT

- build:cog: add sources/outputs so mise skips rebuild when Go sources
  are unchanged
- build:coglet:wheel:linux-x64: fix output glob to match maturin's
  actual manylinux filename (was never matching, always rebuilding)
- build:coglet:wheel:linux-arm64: same glob fix
- test:integration: depend on linux-x64 wheel instead of native macOS
  wheel (only linux wheel is needed for Docker tests)
- test:integration: pass extra args through to go test (e.g. -count=4)
- test:integration: propagate COG_CA_CERT for custom CA certs (WARP)
- Helper tasks (_setup_dist, _setup_venv, _clean_dist): use silent
  instead of quiet to suppress timing output
- AGENTS.md: correct outdated references to embedded Python wheel
  (wheels are resolved from dist/ at Docker build time)
@tempusfrangit tempusfrangit added this to the 0.17.0 Release milestone Feb 19, 2026


Development

Merging this pull request may close the following issue:

bug: worker crash
