bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh by michael-chou359 · Pull Request #338 · scaleapi/scale-agentex

michael-chou359 · 2026-06-24T16:59:35Z

Issue:

There is a legacy cron job that restarts the agentex pod every 30 minutes in order to refresh the OIDC token for GCP. Because this cron job restarts the entire prod, while the pod is down, Compass workflows can lose connectivity and record polling errors.

Fix:

We will instead rebuild only the mongo client in process every 45 min instead of the chron job. Originally, the chron job was created as a legacy process for refreshing oidc token, but killing the whole pod was overkill. The overall process is kept alive by refreshing only the mongo client which houses the token

We have a drain and close strategy. After we create our new mongo client, new requests are redirected there and we let in flight queries in the old client finish naturally so nothing gets dropped.

Greptile Summary

This PR replaces the MongoDB OIDC pod-restart workaround with an in-process client refresh. The main changes are:

Shared MongoDB client construction for startup and refresh paths.
OIDC-only background refresh loop with a configurable interval.
New client validation before swapping the global MongoDB client.
Delayed close of the superseded client to allow in-flight work to drain.
Unit and integration tests for refresh gating, swap behavior, and loop lifecycle.

Confidence Score: 3/5

The refresh approach reduces pod restarts but still has merge-blocking lifecycle issues around stale Mongo handles and failed candidate cleanup.

The changed code is well-scoped and tested, but the client swap does not update startup-cached collections and failed validation leaves candidate clients open, both of which affect runtime reliability.

agentex/src/config/dependencies.py

T-Rex Logs

What T-Rex did

Ran a focused pytest harness to reproduce how old clients remain bound to a cached Mongo collection when refresh_mongodb_client() is called.
Ran a focused pytest harness that forces refresh candidates to fail and verifies they are not closed, illustrating leaked candidate clients when refresh attempts fail.
Compared before and after probe executions to verify updated counters and flags, using the mongodb_oidc_refresh_probe.py script as the shared probe source.

_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/config/dependencies.py:280-286
**Old clients still used**

This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

### Issue 2 of 2
agentex/src/config/dependencies.py:268-271
**Failed refresh leaks client**

When the fresh client fails validation, the exception leaves `refresh_mongodb_client()` before `new_client` is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed `AsyncMongoClient` instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when `admin.command("ping")` fails.

_{Reviews (1): Last reviewed commit: "refresh" | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

greptile-apps · 2026-06-24T17:04:46Z

+        if old_client is not None and old_client is not new_client:
+            task = asyncio.create_task(
+                self._close_mongodb_client_after_delay(old_client)
+            )
+            # Keep a strong reference until done so the task is not GC'd mid-flight.
+            self._mongodb_close_tasks.add(task)
+            task.add_done_callback(self._mongodb_close_tasks.discard)


Old clients still used

This refresh swaps only GlobalDependencies.mongodb_database, but long-lived consumers can keep using AsyncDatabase and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and MongoDBCRUDRepository.__init__ stores self.collection = db[collection_name]. After the first refresh, those activities still use collections bound to old_client; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

Artifacts

Repro: pytest harness for cached Mongo collection across OIDC refresh

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing cached collection uses the closed superseded client

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: agentex/src/config/dependencies.py Line: 280-286 Comment: **Old clients still used** This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-24T17:04:47Z

+        new_client = self._build_mongodb_client(mongodb_uri)
+        # Force fresh OIDC auth and validate the new client before trusting it.
+        # If this raises, we keep using the existing (working) client.
+        await new_client.admin.command("ping")


Failed refresh leaks client

When the fresh client fails validation, the exception leaves refresh_mongodb_client() before new_client is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed AsyncMongoClient instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when admin.command("ping") fails.

Artifacts

Repro: pytest harness that forces failed MongoDB refresh candidates and asserts they are not closed

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing two failed ping attempts with candidate_close_awaits=0

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: agentex/src/config/dependencies.py Line: 268-271 Comment: **Failed refresh leaks client** When the fresh client fails validation, the exception leaves `refresh_mongodb_client()` before `new_client` is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed `AsyncMongoClient` instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when `admin.command("ping")` fails. How can I resolve this? If you propose a fix, please make it concise.

refresh

9a9103f

michael-chou359 requested a review from a team as a code owner June 24, 2026 16:59

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338
michael-chou359 wants to merge 1 commit into
mainfrom
mc/oidc-refresh

michael-chou359 commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

michael-chou359 commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

T-Rex Logs

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

michael-chou359 commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading