feat(coglet): IPC bridge enhancements for file outputs, large payloads, and upload URL#2746

Open
tempusfrangit wants to merge 18 commits into main from feat/coglet-bridge-enhancements

Conversation

tempusfrangit (Member) commented on Feb 19, 2026

Summary

This branch addresses several interrelated issues in the coglet IPC bridge that caused file-type prediction outputs to be silently lost, large outputs to poison worker slots, and upload URLs to be ignored.

Key changes

  • File output routing through IPC: Generator yields of cog.Path and io.IOBase are now detected in the worker, written to disk, and sent as FileOutput messages over the bridge. The orchestrator reads them back, either uploading to a signed endpoint (when --upload-url is set) or base64-encoding as data URIs.

  • Large output spilling: Prediction outputs exceeding the 6 MiB IPC limit (chosen to stay safely under the 8 MiB frame limit) are automatically spilled to disk and sent as FileOutput(Oversized) references. The orchestrator reconstructs the original JSON value transparently. This prevents the "frame size too big" panic that previously poisoned slots.

  • Output decoupled from Done: The worker now sends output as a separate SlotResponse::Output message before Done, rather than embedding it in the Done payload. The orchestrator builds the final output from accumulated per-yield values. This fixes generators returning empty output arrays and removes the frame size constraint from Done messages entirely.

  • Upload URL wiring: --upload-url is accepted by cog serve, threaded through the Python entrypoint to coglet's orchestrator, and used for async PUT uploads of file outputs. Each upload runs in a spawned task; the Done handler awaits all pending uploads before marking the prediction as succeeded.

  • Log size enforcement: Worker logs are truncated at 4 MiB to stay under the 8 MiB bridge frame limit.

  • Stub generation and venv setup: Fixed mise tasks for Python type stub generation, venv creation, and editable coglet installs.
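The spill decision described above can be sketched roughly as follows. This is a minimal Python illustration only; the constant and function names are ours, not coglet's actual Rust identifiers:

```python
import json
import os
import tempfile

IPC_LIMIT = 6 * 1024 * 1024  # spill threshold, kept below the 8 MiB frame limit

def build_output_message(value, out_dir):
    """Return an inline Output message, or spill to disk and return a
    FileOutput(Oversized) reference when the JSON payload is too large."""
    payload = json.dumps(value)
    if len(payload.encode("utf-8")) <= IPC_LIMIT:
        return {"type": "Output", "value": value}
    # Too big for a single IPC frame: spill the JSON to disk and send
    # a path reference instead of the inline value.
    fd, path = tempfile.mkstemp(dir=out_dir, suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return {"type": "FileOutput", "kind": "Oversized", "path": path}
```

The key property is that the size check happens on the serialized payload, so the receiving side can deserialize the spilled file and recover the exact original value.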

Integration tests

Three new integration tests verify the end-to-end behavior:

| Test | What it covers |
| --- | --- |
| `coglet_large_output` | 9 MiB string output that would poison the slot without spilling |
| `coglet_iterator_upload_url` | Generator yielding `Path` objects uploaded per-yield to `--upload-url` |
| `coglet_iterator_path_output` | Generator yielding `Path` objects returned as base64 data URIs |

Files changed (highlights)

| Area | Files | Description |
| --- | --- | --- |
| Bridge protocol | `bridge/protocol.rs` | `FileOutput` variant with `Oversized` / `FileType` kinds, optional `mime_type` |
| Worker | `worker.rs` | `SlotSender` file output methods, `build_output_message` spill logic, output sent separately from `Done` |
| Orchestrator | `orchestrator.rs` | `FileOutput` handling (upload or base64), upload barrier for `Done`, accumulated output reconstruction |
| Prediction | `prediction.rs` | `take_outputs()` for draining accumulated per-yield values |
| Python bindings | `predictor.rs` | `send_output_item` / `send_file_output` for `Path` and `IOBase` detection in generators |
| CLI | `serve.go` | `--upload-url` flag with `host.docker.internal` networking |
| Docker | `command.go`, `docker.go` | `ExtraHosts` support in `RunOptions` |
| Test harness | `harness.go` | Mock upload server with `upload-server-start` / `upload-server-count` commands |

Test plan

  • `mise run test:rust` -- 109 tests pass
  • `mise run test:integration coglet_large_output` -- 9 MiB output spills and reconstructs correctly
  • `mise run test:integration coglet_iterator_upload_url` -- per-yield uploads to mock server
  • `mise run lint:rust` -- clean
  • `golangci-lint run ./...` (v2.8.0) -- clean

closes: #2739

This commit truncates log lines at 4 MiB of data and appends a
truncation notice to the line. This protects against the panic/slot
poisoning that occurs when a frame exceeds the codec-configured
(8 MiB) size. Any log line over 1 MiB borders on useless; to preserve
even the most extreme log lines without disruption, we settled on a
4 MiB limit.
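A sketch of that truncation in Python (the constant and notice text are illustrative, not coglet's actual values):

```python
MAX_LOG_BYTES = 4 * 1024 * 1024  # stay well under the 8 MiB bridge frame limit
NOTICE = " [truncated]"          # hypothetical notice text

def truncate_log_line(line: str) -> str:
    """Cap a log line at MAX_LOG_BYTES, appending a truncation notice."""
    data = line.encode("utf-8")
    if len(data) <= MAX_LOG_BYTES:
        return line
    keep = data[: MAX_LOG_BYTES - len(NOTICE)]
    # errors="ignore" drops any partial multi-byte character at the cut point
    return keep.decode("utf-8", errors="ignore") + NOTICE
```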
Outputs over 6 MiB are spilled to disk and sent as FileOutput path
references over IPC instead of inline, avoiding the 8 MiB
LengthDelimitedCodec frame limit. Coglet creates per-prediction
output dirs at /tmp/coglet/outputs/{prediction_id}/ and passes
the path to the worker via SlotRequest::Predict.

When the worker spills output to disk (FileOutput), the orchestrator
reads it back and integrates it into the prediction via append_output.
This handles both the Oversized (JSON spill) and FileType (path
reference) variants.

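The directory layout and read-back step can be illustrated with a small Python sketch (function names and the root path default are ours; only the /tmp/coglet/outputs/{prediction_id}/ convention comes from the text above):

```python
import json
import os

def output_dir(prediction_id, root="/tmp/coglet/outputs"):
    """Per-prediction spill directory, created before Predict is dispatched."""
    path = os.path.join(root, prediction_id)
    os.makedirs(path, exist_ok=True)
    return path

def read_back(file_output):
    """Orchestrator side: rehydrate an Oversized spill into its JSON value.
    FileType references are handled by the upload/base64 path instead."""
    if file_output["kind"] == "Oversized":
        with open(file_output["path"]) as f:
            return json.load(f)
    return file_output["path"]
```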
Generator outputs were collected into a Vec and bundled into a single
Done frame, which could exceed the 8MiB IPC limit. Now each yield
is sent immediately via slot_sender.send_output(), streaming through
SlotResponse::Output frames. The Done frame carries an empty array
for generators since outputs were already streamed.

Plumbs slot_sender from PythonPredictHandler::predict through to
process_generator_output and process_async_result for both predict
and train paths.
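The per-yield streaming shape described above, sketched in Python (the frame dicts stand in for the Rust SlotResponse variants):

```python
def stream_outputs(gen, send):
    """Send each yield as its own Output frame; Done carries an empty
    array because the per-yield values were already streamed."""
    count = 0
    for value in gen:
        send({"type": "Output", "value": value})
        count += 1
    send({"type": "Done", "outputs": []})
    return count
```

Because each yield becomes its own frame, no single frame grows with the total number of outputs, which is what removes the 8 MiB constraint from generators.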
File-type outputs (os.PathLike, io.IOBase) are now detected in the
worker and written to the per-prediction output dir instead of being
base64-encoded in-process. This keeps network I/O out of the worker
subprocess — the parent process handles uploads.

- Add SlotSender::write_file_output(bytes, ext) for language-agnostic
  file writing from any FFI worker (Python, Node, etc.)
- Add send_output_item() helper that routes by type: PathLike sends
  existing path, IOBase reads bytes and writes via SlotSender,
  everything else goes through the existing JSON serialization path
- Update process_single_output to also detect and route file types
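The type-based routing in send_output_item can be sketched as follows (a loose Python analogue; the tuple tags are ours, and the real implementation writes through SlotSender rather than returning values):

```python
import io
import json
import os

def route_output_item(item):
    """Classify a yielded value the way the routing above describes:
    PathLike -> existing file path, IOBase -> raw bytes to be written
    out as a file, anything else -> the JSON serialization path."""
    if isinstance(item, os.PathLike):
        return ("path", os.fspath(item))
    if isinstance(item, io.IOBase):
        return ("bytes", item.read())
    return ("json", json.dumps(item))
```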
The orchestrator's FileOutput handler now distinguishes between
FileOutputKind::Oversized (JSON spill, deserialize as before) and
FileOutputKind::FileType (binary file, base64-encode as data URI).

This is the fallback path when no upload_url is configured. File
outputs from the worker are read from disk and encoded as
data:{mime};base64,{data} URIs using mime_guess for content type
detection.
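A minimal Python sketch of that fallback encoding, with the stdlib mimetypes module standing in for the Rust mime_guess crate:

```python
import base64
import mimetypes

def to_data_uri(data, filename, mime_type=None):
    """Encode file bytes as data:{mime};base64,{data}, guessing the MIME
    type from the extension when none was supplied."""
    mime = mime_type or mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")
```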
Add mime_type: Option<String> to SlotResponse::FileOutput so FFI
workers can pass an explicit MIME type. When None, the orchestrator
falls back to mime_guess from the file extension (matching old cog
behavior). Plumbing is in place for future use — all current callers
pass None.
Thread --upload-url CLI arg through coglet.server.serve() to the
orchestrator event loop. When set, file outputs are PUT to the signed
endpoint per-yield as received (matching old Python cog behavior).

- upload_file() does PUT with Content-Type, extracts Location header,
  strips query params, follows redirects via reqwest
- Uploads are spawned as tokio tasks so they don't block the event loop
- Done handler awaits pending uploads before finalizing the prediction
- Failed/Cancelled/Error handlers abort pending uploads immediately
- Falls back to base64 data URI encoding when no upload_url is set
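The Location-header normalization mentioned above (strip query params before recording the output URL) can be sketched in Python; the function name is ours:

```python
from urllib.parse import urlsplit, urlunsplit

def final_output_url(location):
    """Drop query params (e.g. signing tokens) from an upload Location
    header, keeping only scheme, host, and path."""
    parts = urlsplit(location)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```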
- Root .venv is now created by _setup_venv task (used by build:coglet,
  build:coglet:wheel, build:sdk, generate:stubs, stub:check)
- _.python.venv uses REPO_ROOT so all tasks share the same venv
- mise clean:python removes .venv, coglet-python/.venv, and *.so
- stub_gen.rs: write coglet/_impl.pyi (not __init__.pyi) so the native
  module stubs don't get overwritten by mypy stubgen
- module_variable! declarations for __version__, __build__, server so
  ty can resolve members imported from coglet._impl
- stub:check depends on generate:stubs to avoid duplicating logic
- #[allow(clippy::too_many_arguments)] on serve_impl (8 args after upload_url)
…verflow

Output is now always sent as a separate message before Done, with automatic
spill-to-disk for values exceeding the 6MiB IPC frame limit. The orchestrator
reconstructs the final output from accumulated per-yield values (generators,
file uploads) rather than from the Done message payload.

This fixes three issues:
- Large single outputs (>8MiB) caused "frame size too big" panics that
  poisoned slots, since the entire output was embedded in the Done message
- Generator/iterator outputs were lost because per-yield values accumulated
  in outputs() were ignored by the Done handler in favor of the empty
  Stream(vec![]) from the worker
- File upload outputs (Path, IOBase) yielded from generators were silently
  dropped for the same reason

Also adds --upload-url flag to `cog serve`, mock upload server to the
integration test harness, and three new integration tests:
- coglet_large_output: 9MiB string output (would poison slot without spill)
- coglet_iterator_upload_url: generator Path yields uploaded to --upload-url
- coglet_iterator_path_output: generator Path yields as base64 data URIs
tempusfrangit force-pushed the feat/coglet-bridge-enhancements branch from 9e39dc4 to c394d6d on February 19, 2026 at 00:16
…tting

cog predict writes file outputs to disk, not as base64 JSON to stdout.
Fix assertion to check stderr for "Written output to" messages.
tempusfrangit force-pushed the feat/coglet-bridge-enhancements branch from c394d6d to 1390b59 on February 19, 2026 at 00:33
tempusfrangit (Member, Author) commented:

The large output spill is deadlocking. Digging into the specifics so we can get this landed.

The Done handler unconditionally used tokio::spawn to wait for pending
uploads before calling set_succeeded(). This introduced a race with
service.rs: Notify::notified() could miss the wakeup if the spawned
task called notify_waiters() before the service registered its waiter,
causing the prediction to hang indefinitely.

Split the Done handler into two paths:
- No uploads: call set_succeeded() synchronously in the event loop
- Has uploads: spawn a task to await them (preserves existing behavior)

This restores the synchronous notification path for the common case
and fixes the CI hang in the coglet_large_output integration test.
notify_waiters() only wakes currently-registered waiters. In the
predict flow, service.rs checks is_terminal() then awaits on a
separate Notify — if the orchestrator fires notify_waiters() between
those two steps, the notification is lost and the prediction hangs.

notify_one() stores a permit that a future .notified().await consumes
immediately, closing the race window entirely. There is exactly one
waiter per prediction so notify_one is semantically correct.
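The permit-vs-broadcast distinction can be illustrated with a loose asyncio analogy. asyncio.Event is sticky rather than a single consumable permit, so this is not a faithful model of tokio::sync::Notify, but it shows the property that matters: a signal fired before the waiter registers must not be lost.

```python
import asyncio

async def demo():
    # Latch-like signaling: setting the event before the waiter registers
    # is not lost, analogous to notify_one() storing a permit that the
    # next .notified().await consumes immediately.
    done = asyncio.Event()
    done.set()                                        # signaler fires first
    await asyncio.wait_for(done.wait(), timeout=1.0)  # waiter still wakes
    return True
```

With a broadcast-style wakeup (notify_waiters), the same "signal first, wait second" ordering would hang, which is exactly the race described above.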
…A_CERT

- build:cog: add sources/outputs so mise skips rebuild when Go sources
  are unchanged
- build:coglet:wheel:linux-x64: fix output glob to match maturin's
  actual manylinux filename (was never matching, always rebuilding)
- build:coglet:wheel:linux-arm64: same glob fix
- test:integration: depend on linux-x64 wheel instead of native macOS
  wheel (only linux wheel is needed for Docker tests)
- test:integration: pass extra args through to go test (e.g. -count=4)
- test:integration: propagate COG_CA_CERT for custom CA certs (WARP)
- Helper tasks (_setup_dist, _setup_venv, _clean_dist): use silent
  instead of quiet to suppress timing output
- AGENTS.md: correct outdated references to embedded Python wheel
  (wheels are resolved from dist/ at Docker build time)
@tempusfrangit tempusfrangit added this to the 0.17.0 Release milestone Feb 19, 2026


Development

Merging this pull request may close the following issue:

bug: worker crash
