bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338
bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338michael-chou359 wants to merge 1 commit into
Conversation
| if old_client is not None and old_client is not new_client: | ||
| task = asyncio.create_task( | ||
| self._close_mongodb_client_after_delay(old_client) | ||
| ) | ||
| # Keep a strong reference until done so the task is not GC'd mid-flight. | ||
| self._mongodb_close_tasks.add(task) | ||
| task.add_done_callback(self._mongodb_close_tasks.discard) |
There was a problem hiding this comment.
This refresh swaps only GlobalDependencies.mongodb_database, but long-lived consumers can keep using AsyncDatabase and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and MongoDBCRUDRepository.__init__ stores self.collection = db[collection_name]. After the first refresh, those activities still use collections bound to old_client; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.
Artifacts
Repro: pytest harness for cached Mongo collection across OIDC refresh
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: verbose pytest output showing cached collection uses the closed superseded client
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 280-286
Comment:
**Old clients still used**
This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.
How can I resolve this? If you propose a fix, please make it concise.| new_client = self._build_mongodb_client(mongodb_uri) | ||
| # Force fresh OIDC auth and validate the new client before trusting it. | ||
| # If this raises, we keep using the existing (working) client. | ||
| await new_client.admin.command("ping") |
There was a problem hiding this comment.
When the fresh client fails validation, the exception leaves refresh_mongodb_client() before new_client is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed AsyncMongoClient instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when admin.command("ping") fails.
Artifacts
Repro: pytest harness that forces failed MongoDB refresh candidates and asserts they are not closed
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: verbose pytest output showing two failed ping attempts with candidate_close_awaits=0
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 268-271
Comment:
**Failed refresh leaks client**
When the fresh client fails validation, the exception leaves `refresh_mongodb_client()` before `new_client` is closed. The refresh loop catches the error and retries on the next interval, so an auth or network outage can accumulate failed `AsyncMongoClient` instances and their underlying resources for the lifetime of the process. Close the candidate client before re-raising when `admin.command("ping")` fails.
How can I resolve this? If you propose a fix, please make it concise.
Issue:
There is a legacy cron job that restarts the agentex pod every 30 minutes in order to refresh the OIDC token for GCP. Because this cron job restarts the entire prod, while the pod is down, Compass workflows can lose connectivity and record polling errors.
Fix:
We will instead rebuild only the mongo client in process every 45 min instead of the chron job. Originally, the chron job was created as a legacy process for refreshing oidc token, but killing the whole pod was overkill. The overall process is kept alive by refreshing only the mongo client which houses the token
We have a drain and close strategy. After we create our new mongo client, new requests are redirected there and we let in flight queries in the old client finish naturally so nothing gets dropped.
Greptile Summary
This PR replaces the MongoDB OIDC pod-restart workaround with an in-process client refresh. The main changes are:
Confidence Score: 3/5
The refresh approach reduces pod restarts but still has merge-blocking lifecycle issues around stale Mongo handles and failed candidate cleanup.
The changed code is well-scoped and tested, but the client swap does not update startup-cached collections and failed validation leaves candidate clients open, both of which affect runtime reliability.
agentex/src/config/dependencies.py
What T-Rex did
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "refresh" | Re-trigger Greptile