feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105
feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105darklight3it wants to merge 1 commit intomainfrom
Conversation
b17ead3 to
0120ef0
Compare
| First(#[pin] F, Arc<Client>), | ||
| Second(#[pin] BoxFuture<'static, Result<http::Response<Incoming>, BoxError>>), | ||
| #[pin_project(project = RuntimeApiClientFutureStateProj)] | ||
| pub enum RuntimeApiClientFutureState<F> { |
There was a problem hiding this comment.
This will be a breaking change. It is an internal type, that probably nobody is using, but it is part of our public API.
If we did want to go this route, I'd suggest making a new one, and marking the old one as deprecated. And cutting over all usage.
Given that probably nobody is using it, the new type is fairly painless. I don't mind that option.
That would only require a minor version bump (the deprecated old enum).
There was a problem hiding this comment.
It wouldn't be breaking if we had tagged it #[non_exhaustive], but we missed this one.
Definitely if we make a new type, we should tag it non exhaustive :)
There was a problem hiding this comment.
sorry I was already fixing that. It's still RuntimeApiClientFutureState. The CR was still a draft.
There was a problem hiding this comment.
we changed the internals ase we said in the other CR. Cannot make it pub (crate) from api_client but still it isn't exposed in mod.rs
There was a problem hiding this comment.
I discussed this with you and rcoh@ offline -
Technically this is a breaking change, whether we change the innards of RuntimeApiClientFuture or substitute it for a different enum. The type is accessible as an associated type on the RuntimeApiClientService which is publicly visible.
None of our internal machinery relies on this specific type surface - we are just polling the service that the runtime api client exposes. I'm pretty sure whoever implemented this just didn't realize that pub enum exposed all of its variants and field contents.
It would generally be risky to try to deploy something in the runtime api client, that the lambda-runtime needs to be upgraded in lockstep to, else they will have compile errors. And then that's fragile to do except across a breaking version boundary.
BUT - we aren't in that situation, the only people that would be impacted would be people that built their own higher level runtime on top of our own, or otherwise are doing things with the innards of the runtime client.
I've confirmed via code search that the only cases of RuntimeApiClientFuture being used are forks of the api client.
So, with that in mind, I think this is probably a fairly low risk change to make.
There was a problem hiding this comment.
And then, the lowest risk change available would be to just keep the current future name, since then anybody that is naming the future as part of a tower service stack, wouldn't be broken.
And with the future, it is safer to add a new field to the existing variant, rather than add a new variant. Since that won't break match statements.
Unfortunately, that doesn't get us out of this mess for next time, but it is the least risky.
There was a problem hiding this comment.
That said - what field to add - well, we already have the contents we want inside the RuntimeApiClientFuture, the issue is that we can't poll it to completion from inside the runtime (specifically, the body field of the http::Response future stored inside Second)
For that, we really would prefer a third state transition which is riskier than a new field .
I guess, we could also just change the second variant's contents to an enum (which is clearly sillier than a third variant, but less breaking-prone maybe?)
There was a problem hiding this comment.
Maybe instead we could do something hacky like, check the status code, and bubble up the inner future inside of the BoxError, and downcast the BoxError from inside our managing runtime to decide whether to finish polling the body?
And then, another option we discussed was just storing a client current thread runtime or something to drive the future, but that's pretty awful... but we don't currently require that this be run inside a tokio runtime, so i don't think we can rely on tokio::spawn.
I guess we could also do something with a sync thread and block_on, but that is pretty terrible :)
There was a problem hiding this comment.
I went with the tokio::spawn idea, the bubbling up of the error required a changing in then RuntimeApiClientFuture signature that while respecting the current enum and variant structure would have changed the signature constituting a breaking change.
This seems to be the only way for this to work without any signature change. The integration tests are working and the test I added are working too.
0120ef0 to
ef4ded0
Compare
ef4ded0 to
0b88c4b
Compare
…useful for the customer for timeouts
0b88c4b to
8e6dc81
Compare
jlizen
left a comment
There was a problem hiding this comment.
I think we have something workable here. I'll ping my colleague to take a quick peek as well for landmine avoidance.
Some changes needed (assuming he doesn't suggest a different route):
- we need to pass the tokio runtime handle through to the the future so that it doesn't panic on being awaited in non-tokio (one option is a static oncelock holding it)
- probably we should reflect against the logged output in our test
| Err(_) => { | ||
| // Fallback: build a minimal current-thread runtime and keep it alive alongside | ||
| // the service. This should only happen outside of a tokio context (e.g. some tests). | ||
| let rt = Arc::new( |
There was a problem hiding this comment.
Is there a reason that the rt needs to be inside an arc? It is stored inside a struct that doesn't have Clone anyway
| let status = resp.status(); | ||
| // Spawn a background task to read and log the error body without | ||
| // blocking the current poll or requiring a new enum variant. | ||
| Handle::current().spawn(async move { |
There was a problem hiding this comment.
This will panic if the future is awaited outside of a tokio runtime. It isn't actually using the handle we store on the client.
That said, there isn't a GREAT way to pass through the runtime handle to the future without changing the public signature, since we are stuck with an enum with wide open variants and inner fields. We would essentially need to add the handle to both the first and second, and have first plumb it through to second. IE add a field to both variants.
The main non breaking way I can think of would be to store the runtime in a static OnceLock (which you initialize inside RuntimeApiClientService::new(), and it stores the runtime and returns the handle via get_or_init. So nothing stored on the client itself anymore.
At that point, you won't need to carry any data in the future. And the runtime is something of a global resource, so it's not too strange..?
There was a problem hiding this comment.
One thing we might need to do would be, add a check to see if a caller-provided tokio runtime is still alive, since it might shut down early and cause a panic? Not sure how much we need to guard against that... Will get a second opinion
There was a problem hiding this comment.
is the only thing you need to do here log the message? Rather than doing it this way, can you send the resp over a channel to someone else to read it out and print it?
One "dumb" option is to use the response.extensions_mut() to get the channel (or runtime handle) in there.
That avoids the hazard of attempting to read it here from the thread local.
There was a problem hiding this comment.
Yes we only need to log the message.
There was a problem hiding this comment.
One concern is, the place we would want to store a runtime handle or channel receiver in order to write into response extensions, is the underlying runtime api client, which is also wide open as pub / non-exhaustive.
So it wouldn't necessarily sidestep the need for static state.
A nicer option might be, instead use a tokio::task_local! at the point where we are spawning the tokio tasks, to inject a mpsc::Sender. And then, we can listen on it in a tokio::select! loop alongside its work loop.
That reduces the hazard of of the thread local reads, right @rcoh ? It also has the side benefit of allowing us to enrich the logs with per-worker context, since we have a worker id span instrumented there.
And then, inside the RuntimeApiClientFuture, where we read the response from second, we can (perhaps behind cfg(feature="concurrency-tokio") check if LocalKey::try_get returns something, and if yes, write the response to the channel to let the worker-managed channel receive loop await the body and write out its log?
There was a problem hiding this comment.
I guess we could also just use tokio::runtime::Handle::current() inside the lambda_runtime_api_client gated behind tokio_concurrency and drill that feature down? IE use .map_ok on the request future, get the handle, and shove it in response extensions, and then have the RuntimeApiClientFuture check extensions and spawn the task.
Feels more infectious though than just task_locals, and it still has some weird edge cases if anybody is instantiating the lambda_runtime_api_client::Client directly.
| } | ||
| } | ||
| }); | ||
| break Ok(()); |
There was a problem hiding this comment.
why are these breaks needed? i would have expected we can just have eg Ok(_) => Ok(())
| break Ok(()); | ||
| } | ||
| Ok(_) => break Ok(()), | ||
| Err(err) => break Err(err), |
There was a problem hiding this comment.
just to confirm, we are already logging the connection related error higher up? It looked like the C++ impl was logging them all at the same site (though it's fine if we handle differently, with the same ultimate logs)
| pub use tower::{self, service_fn, Service}; | ||
|
|
||
| /// Logs using tracing `error!` if a dispatcher is set, otherwise falls back to `eprintln!`. | ||
| macro_rules! log_or_print { |
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_invoke_timeout_410_is_logged_and_succeeds() { |
There was a problem hiding this comment.
This isn't actually asserting that anything is logged, we probably want to use #[traced_test] to reflect against log output
Summary
When the Lambda Runtime API returns a non-2xx response (e.g. HTTP 410 with
{"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}), the previous implementation silently discarded the response body. Customers had no visibility into why their handler response was rejected.Solution
Read and log the response body for non-2xx responses from the Lambda Runtime API.
The main challenge is that
RuntimeApiClientFutureis a pub enum without#[non_exhaustive], so adding a new variant would be a breaking change for any downstream code that exhaustively matches on it. To avoid this, we instead spawn a background task viatokio::task::spawnto read the body asynchronously without modifying the enum's public surface.Changes
RuntimeApiClientServicenow captures atokio::runtime::Handleat construction time viaHandle::try_current(), with a fallback minimal current-thread runtime for non-tokio contexts (e.g. tests)log_or_print!macro inlib.rsthat dispatches to tracing::error! when a tracing dispatcher is set, and falls back toeprintln!otherwise — consistent with the existing codebase patternhttpmockExample log output