feat(taskbroker): Add Push Mode to Taskbroker #573
james-mcnulty wants to merge 20 commits into main from
Conversation
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: gRPC server address ignores configurable `grpc_addr` field
  - Updated gRPC bind address construction to use `config.grpc_addr` together with `config.grpc_port` instead of hardcoding `0.0.0.0`.
- ✅ Fixed: Callback URL missing protocol scheme for worker callbacks
  - Changed push callback URL formatting to include the `http://` scheme so workers receive a valid URI.
- ✅ Fixed: `push_threads=0` causes deadlock, unlike guarded `fetch_threads`
  - Aligned push worker spawning with fetch behavior by using `self.config.push_threads.max(1)` to guarantee at least one consumer.
Or push these changes by commenting:
@cursor push effe733488
Preview (effe733488)
```diff
diff --git a/src/main.rs b/src/main.rs
--- a/src/main.rs
+++ b/src/main.rs
@@ -196,7 +196,7 @@
         let config = config.clone();
         async move {
-            let addr = format!("0.0.0.0:{}", config.grpc_port)
+            let addr = format!("{}:{}", config.grpc_addr, config.grpc_port)
                 .parse()
                 .expect("Failed to parse address");
diff --git a/src/push.rs b/src/push.rs
--- a/src/push.rs
+++ b/src/push.rs
@@ -57,11 +57,11 @@
     pub async fn start(&self) -> Result<()> {
         let mut handles = vec![];
-        for _ in 0..self.config.push_threads {
+        for _ in 0..self.config.push_threads.max(1) {
             let endpoint = self.config.worker_endpoint.clone();
             let callback_url = format!(
-                "{}:{}",
+                "http://{}:{}",
                 self.config.callback_addr, self.config.callback_port
             );
```
evanh left a comment
There is now a potential deadlock scenario where the push channels get full and the fetch activations block trying to send. Is that tracked somewhere with a metric?
src/fetch.rs (outdated)

```rust
debug!("Fetching next pending activation...");

match store.get_pending_activation(None, None).await {
```
This will need to use the namespace and application parameters passed into it in order to work correctly.
This is actually something I wanted to bring up. Right now, namespace and application come from the get_task request body. In push mode, there is no easy way to know what values to use here. Should they be provided in the configuration?
If a broker is handling multiple applications (like in local development, and in smaller environments) we'll need different worker pools to push to. Perhaps we need a mapping between application -> worker pools?
> Should they be provided in the configuration?

Yes, this is how it will have to work.
Done! Config now takes an optional application and an optional list of namespaces.
src/fetch.rs (outdated)

```rust
Err(e) => {
    error!("Failed to fetch pending activation - {:?}", e);
    sleep(Duration::from_millis(100)).await;
```
There is no need for a sleep here.
I had a sleep there because if this fails, it's either because the store is having issues (e.g. due to scaling AlloyDB up) or because the push queue is full. In both cases it makes sense to wait a little, no?
Can we tell the difference between the two? I would handle those two scenarios differently. For the queue being full I would wait, but for an actual error we might want to take other actions (e.g. crash the entire producer).
Yes, definitely. Any idea what should happen if it's the queue being full versus a store error? For now, I can just distinguish between the two and simply log which one it was without doing anything else until we decide for sure.
```bash
# Run unit/integration tests
make test
make unit-test
```
It's `make unit-test`, not `make test` - that doesn't do anything right now.
Or rather, there is no test target in the Makefile for doing both unit and integration tests, which seemed to be the intention here.
```rust
fetch_threads: 1,
push_threads: 1,
push_queue_size: 1,
worker_endpoint: "http://127.0.0.1:50052".into(),
```
Is this the port that we'll be using for the worker in self-hosted and local dev? Ideally local development 'just works' and doesn't require additional configuration.
fpacifici left a comment
Will do a more in-depth review later.
Though please consider refactoring the config first, to separate the push attributes from the pull ones.
That would require its own PR.
src/config.rs (outdated)

```rust
/// Run the taskbroker in push mode (as opposed to pull mode).
pub push_mode: bool,
```
Please change the `push_mode` boolean into a `delivery_mode`.
Now I think this configuration will be very hard to set. We have more than ten push-specific config elements and more than ten pull-specific ones. We need some way to make it more intuitive.
One option would be to give it a structure, though I do not know whether we rely on these fields being simple fields. @evanh may know better.
Otherwise, a common pattern for these scenarios is to prefix the poll-specific parameters with `poll_` and the push-specific ones with `push_`.
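A `delivery_mode` along these lines could be sketched roughly as follows. This is a minimal illustration of the suggestion, not the actual taskbroker config; the names (`DeliveryMode`, `from_env_value`, the default to pull mode) are hypothetical:

```rust
// Hypothetical sketch of replacing `push_mode: bool` with a
// `delivery_mode` enum, as suggested in the review above.
#[derive(Debug, PartialEq)]
enum DeliveryMode {
    Pull,
    Push,
}

impl DeliveryMode {
    // Parse something like a TASKBROKER_DELIVERY_MODE value,
    // defaulting to pull mode for any unrecognized input.
    fn from_env_value(value: &str) -> Self {
        match value.to_ascii_lowercase().as_str() {
            "push" => DeliveryMode::Push,
            _ => DeliveryMode::Pull,
        }
    }
}

fn main() {
    assert_eq!(DeliveryMode::from_env_value("push"), DeliveryMode::Push);
    assert_eq!(DeliveryMode::from_env_value("PULL"), DeliveryMode::Pull);
}
```

An enum also gives the prefix convention a natural home: the push-only fields can live in a struct carried by the `Push` variant, so they cannot even be set in pull mode.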
Please avoid a helper module. It risks becoming a god object without a clear responsibility. If we need a function to spawn pools, have a tokio module for this.
Created a tokio module as suggested.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
src/fetch/mod.rs (outdated)

```rust
/// - `Ok(true)` if an activation was found
/// - `Ok(false)` if none pending
/// - `Err` if fetching failed.
pub async fn fetch_activation<T: TaskPusher>(
```
Why is this a standalone function instead of part of `TaskPusher`? That would cut down on the number of `clone` calls (which are effectively memcpy operations).
I separated it before so I could test it. I removed the function and moved the logic back into the loop, because you're right about the clones.
src/fetch/mod.rs (outdated)

```rust
// Instead of returning when `fetch_activation` fails, we just try again
match fetch_activation(store.clone(), pusher.clone(), config.clone()).await {
    Ok(false) | Err(_) => {
```
We should separate these cases. I would move any error handling (logging etc.) up to this function. That way we also handle any unexpected errors.
Added a new PushError enum so we can handle errors better.
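The thread above does not show the actual `PushError` definition, but an enum separating the two cases discussed earlier (a full push queue versus a store failure) might look like this. Variant names and the `handle` helper are illustrative assumptions:

```rust
use std::fmt;

// Hypothetical sketch of a PushError enum that distinguishes a full
// push queue (retryable, worth a short wait) from a store error
// (potentially fatal). Not the PR's actual definition.
#[derive(Debug)]
enum PushError {
    QueueFull,
    Store(String),
}

impl fmt::Display for PushError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PushError::QueueFull => write!(f, "push queue is full"),
            PushError::Store(e) => write!(f, "store error: {e}"),
        }
    }
}

// The fetch loop can then branch on the variant: back off when the
// queue is full, escalate (e.g. crash the producer) on a store error.
fn handle(err: PushError) -> &'static str {
    match err {
        PushError::QueueFull => "backoff",
        PushError::Store(_) => "escalate",
    }
}

fn main() {
    assert_eq!(handle(PushError::QueueFull), "backoff");
    assert_eq!(handle(PushError::Store("timeout".into())), "escalate");
}
```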
```rust
async move {
    let mut worker = match WorkerServiceClient::connect(endpoint).await {
        Ok(w) => w,
        Err(e) => {
            error!("Failed to connect to worker - {:?}", e);
            return;
        }
    };
```
Bug: If a push worker fails to connect on startup, it exits silently. The PushPool then reports a successful start, leading to task loss as no tasks can be pushed.
Severity: HIGH
Suggested Fix
Modify the push worker's connection logic to propagate connection errors instead of returning a unit type (). The PushPool::start() method should then check for these errors from its worker threads and return an error if any of them failed to connect, preventing the system from entering a state where tasks are processed but cannot be pushed.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/push/mod.rs#L80-L87
Potential issue: In push mode, if a worker service is unreachable on startup, the corresponding push thread will fail to connect and exit silently without propagating an error. The `PushPool::start()` method incorrectly interprets this as a successful completion. Consequently, the main application continues, and fetch threads begin marking tasks as `Processing`. When these tasks are sent to the push pool, the operation fails because the receiving end of the channel has been dropped. After several failed retry attempts, the tasks are marked as `Failure` and discarded, leading to data loss, while the application logs indicate a successful push operation.
Linear
Completes STREAM-820
Description
Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.
This PR allows users to run the taskbroker in push mode, which can be adjusted using several new configuration parameters:

| Parameter | Type | Default |
| --- | --- | --- |
| `TASKBROKER_PUSH_MODE` | `bool` | `false` |
| `TASKBROKER_FETCH_THREADS` | `usize` | `1` |
| `TASKBROKER_PUSH_THREADS` | `usize` | `1` |
| `TASKBROKER_PUSH_QUEUE_SIZE` | `usize` | `1` |
| `TASKBROKER_WORKER_ENDPOINT` | `String` | `http://127.0.0.1:50052` |
| `TASKBROKER_CALLBACK_ADDR` | `String` | `0.0.0.0` |
| `TASKBROKER_CALLBACK_PORT` | `usize` | `50051` |
| `TASKBROKER_FETCH_WAIT_MS` | `u64` | `100` |
| `TASKBROKER_PUSH_TIMEOUT_MS` | `u64` | `5000` |

Push Threads
On startup, the taskbroker now creates a "push pool," which is a pool of push threads. All of them wait to receive activations from the same MPMC channel provided by the `flume` crate. When a push thread receives an activation, it sends it to the worker service. Note that each push thread has its own connection to the worker service. Push threads are grouped together by the `PushPool` data structure, which exposes a `start` method to actually spawn the threads and a `submit` method to receive activations.

Fetch Threads
On startup, the taskbroker also creates a "fetch pool," which is a pool of fetch threads. Each one retrieves a pending activation from the store, passes it to the push pool (waiting until it accepts), and repeats.
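The hand-off between the two pools can be sketched with standard-library primitives only. The real implementation uses `flume`'s clonable async receiver and Tokio tasks; this blocking-thread version approximates the MPMC channel with an `Arc<Mutex<Receiver>>`, and everything except the `PushPool`/`start`/`submit` names is illustrative:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Std-only sketch of the push pool described above (not the PR's code).
struct PushPool {
    tx: mpsc::SyncSender<String>,
    handles: Vec<thread::JoinHandle<()>>,
}

impl PushPool {
    fn start(push_threads: usize, queue_size: usize) -> Self {
        // Bounded queue: `submit` blocks when it is full, which is how
        // fetch threads "wait until the pool accepts" an activation.
        let (tx, rx) = mpsc::sync_channel::<String>(queue_size);
        let rx = Arc::new(Mutex::new(rx));
        let handles = (0..push_threads.max(1)) // guard against 0, as in the autofix
            .map(|i| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Hold the lock while waiting for the next activation
                    // (a simplification; flume needs no lock at all).
                    match rx.lock().unwrap().recv() {
                        // Real code: push the activation to the worker
                        // service over this thread's own connection.
                        Ok(activation) => println!("push thread {i}: {activation}"),
                        // Channel closed: all senders dropped, shut down.
                        Err(_) => break,
                    }
                })
            })
            .collect();
        PushPool { tx, handles }
    }

    fn submit(&self, activation: String) {
        self.tx.send(activation).expect("push pool stopped");
    }
}

fn main() {
    let pool = PushPool::start(2, 4);
    for n in 0..4 {
        pool.submit(format!("activation-{n}"));
    }
    drop(pool.tx); // close the channel so the workers can exit
    for h in pool.handles {
        h.join().unwrap();
    }
}
```

In the sketch, a fetch thread is just any caller of `submit`; backpressure falls out of the bounded channel rather than from explicit coordination.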
Notes on Naming
Fetch threads and push threads are actually asynchronous tasks provided by the Tokio crate. They are not real threads.
Details
Dependencies
- Added `flume` 0.12.0 as a dependency (I didn't want to add any dependencies, but Tokio does not provide an asynchronous MPMC queue - only MPSC)
- Bumped `sentry-protos` from 0.4.11 to 0.8.5 (to use the new worker service schema)
- Bumped `tonic`, `tonic-health`, `prost`, and `prost-types` to 0.14 (to match the version used by `sentry-protos`)

Additions

- `FetchPool` abstraction in `src/fetch.rs`
- `PushPool` abstraction in `src/push.rs`
- `src/main.rs`

Modifications

- `get_task` when operating in push mode

Future Changes