Fix CloudWatch remote logging for ephemeral lifecycle executor#68779
Fix CloudWatch remote logging for ephemeral lifecycle executor#68779jason810496 wants to merge 3 commits into
Conversation
|
cc @sarvesh371, @seanghaeli Could you verify this patch for your setup when you have a moment? Since it likely #66633 won't catch the 3.3 release (we're close to dev freeze for 3.3), so we might release this provider-only patch first. Thanks. |
The streaming CloudWatch handler is rebuilt whenever it reports
shutting_down, so logs survive configure_logging() closing it. But
shutting_down alone cannot tell a mid-task close apart from genuine
teardown, so a record arriving after teardown would spin up an orphan
handler and its background queue thread that nobody flushes or closes.
The supervisor lifecycle makes the two cases distinguishable in time:
1. configure_logging() builds the handler via remote.processors
(processors does `_ = self.handler`), registering it in
logging._handlerList.
2. The same call then runs dictConfig, whose non-incremental reset
closes that handler -> watchtower sets shutting_down=True.
3. Child log records stream through proc -> self.handler, which sees
shutting_down and rebuilds. This is the case we must keep working.
4. At the last possible moment _upload_logs() -> upload() -> close()
flushes; nothing logs after this.
shutting_down is watchtower's flag set by dictConfig (step 2); the new
_closed flag is ours, set only by close() (step 4). dictConfig never
touches _closed, so the rebuild in step 3 still fires, while a late
record after step 4 keeps the closed handler instead of orphaning a new
one. close() on the outer CloudwatchTaskHandler now closes the handler
the IO is currently using rather than the reference captured in
set_context(), which dictConfig may have closed and the IO since rebuilt.
a92793f to
f28925a
Compare
ferruzzi
left a comment
There was a problem hiding this comment.
I haven't played with CloudWatch much, but I left some style nitpicks. Also, Claude loves to over-comment code, you may want to clean some of those up/
| # The handler MUST be initted here, before the processor is actually used to log anything. | ||
| # Otherwise, logging that occurs during the creation of the handler can create infinite loops. | ||
| _handler = self.handler | ||
| _ = self.handler |
There was a problem hiding this comment.
Do we need to keep this line?
There was a problem hiding this comment.
This forces the handler property to be created and cached when called, often called vivification. Yes needed as per the comment
There was a problem hiding this comment.
A comment was even there to explain, my bad!
Why
While trying to setup cloudwatch remote logging in #68709 in order to persist the logs in real time. I encounter the same errors as above listed issues.
The root cause I found is same as #66475 (comment) pointed out. The
configure_logging -> dictConfig -> _clearExistingHandlerscall chain shutdown the watchtower handler.How
I went to the another than #66633, instead of configuring the
processorsafter thedictConfigcall. We could make the cloudwatch remote logger itself self-healing by creating the fresh instance if previous instance wasshutdownby thedictConfigcall but also ensure the.closesemantic by guarding with the_closestate.What
Fix the lifecycle issue of cloudwatch remote logging and verify with
breeze k8ssystem test with provider only changes without touching the Task-SDK changes.