Skip to content

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338

Open
michael-chou359 wants to merge 1 commit into
mainfrom
mc/oidc-refresh
Open

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338
michael-chou359 wants to merge 1 commit into
mainfrom
mc/oidc-refresh

Conversation

@michael-chou359

@michael-chou359 michael-chou359 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Issue:

There is a legacy cron job that restarts the agentex pod every 30 minutes in order to refresh the OIDC token for GCP. Because this cron job restarts the entire prod, while the pod is down, Compass workflows can lose connectivity and record polling errors.

Fix:

We will instead rebuild only the mongo client in process every 45 min instead of the chron job. Originally, the chron job was created as a legacy process for refreshing oidc token, but killing the whole pod was overkill. The overall process is kept alive by refreshing only the mongo client which houses the token

We have a drain and close strategy. After we create our new mongo client, new requests are redirected there and we let in flight queries in the old client finish naturally so nothing gets dropped.

Greptile Summary

This PR replaces the MongoDB OIDC pod-restart workaround with an in-process client refresh. The main changes are:

  • Shared MongoDB client construction for startup and refresh paths.
  • OIDC-only background refresh loop with a configurable interval.
  • New client validation before swapping the global MongoDB client.
  • Delayed close of the superseded client to allow in-flight work to drain.
  • Unit and integration tests for refresh gating, swap behavior, and loop lifecycle.

Confidence Score: 3/5

The refresh approach reduces pod restarts but still has merge-blocking lifecycle issues around stale Mongo handles and failed candidate cleanup.

The changed code is well-scoped and tested, but the client swap does not update startup-cached collections and failed validation leaves candidate clients open, both of which affect runtime reliability.

agentex/src/config/dependencies.py

T-Rex T-Rex Logs

What T-Rex did

  • Ran a focused pytest harness to reproduce how old clients remain bound to a cached Mongo collection when refresh_mongodb_client() is called.
  • Ran a focused pytest harness that forces refresh candidates to fail and verifies they are not closed, illustrating leaked candidate clients when refresh attempts fail.
  • Compared before and after probe executions to verify updated counters and flags, using the mongodb_oidc_refresh_probe.py script as the shared probe source.

View all artifacts

T-Rex Ran code and verified through T-Rex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/config/dependencies.py:280-286
**Old clients still used**

This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

### Issue 2 of 2
agentex/src/config/dependencies.py:268-271
**Failed refresh leaks client**

When the fresh client fails validation, the exception leaves `refresh_mongodb_client()` before `new_client` is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed `AsyncMongoClient` instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when `admin.command("ping")` fails.

Reviews (1): Last reviewed commit: "refresh" | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

@michael-chou359 michael-chou359 requested a review from a team as a code owner June 24, 2026 16:59
Comment on lines +280 to +286
if old_client is not None and old_client is not new_client:
task = asyncio.create_task(
self._close_mongodb_client_after_delay(old_client)
)
# Keep a strong reference until done so the task is not GC'd mid-flight.
self._mongodb_close_tasks.add(task)
task.add_done_callback(self._mongodb_close_tasks.discard)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Old clients still used

This refresh swaps only GlobalDependencies.mongodb_database, but long-lived consumers can keep using AsyncDatabase and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and MongoDBCRUDRepository.__init__ stores self.collection = db[collection_name]. After the first refresh, those activities still use collections bound to old_client; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

Artifacts

Repro: pytest harness for cached Mongo collection across OIDC refresh

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing cached collection uses the closed superseded client

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 280-286

Comment:
**Old clients still used**

This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +268 to +271
new_client = self._build_mongodb_client(mongodb_uri)
# Force fresh OIDC auth and validate the new client before trusting it.
# If this raises, we keep using the existing (working) client.
await new_client.admin.command("ping")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Failed refresh leaks client

When the fresh client fails validation, the exception leaves refresh_mongodb_client() before new_client is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed AsyncMongoClient instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when admin.command("ping") fails.

Artifacts

Repro: pytest harness that forces failed MongoDB refresh candidates and asserts they are not closed

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing two failed ping attempts with candidate_close_awaits=0

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 268-271

Comment:
**Failed refresh leaks client**

When the fresh client fails validation, the exception leaves `refresh_mongodb_client()` before `new_client` is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed `AsyncMongoClient` instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when `admin.command("ping")` fails.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant