Skip to content

Comments

feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105

Open
darklight3it wants to merge 1 commit intomainfrom
dmelfi/timeout-fixup
Open

feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105
darklight3it wants to merge 1 commit intomainfrom
dmelfi/timeout-fixup

Conversation

@darklight3it
Copy link

@darklight3it darklight3it commented Feb 23, 2026

Summary

When the Lambda Runtime API returns a non-2xx response (e.g. HTTP 410 with {"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}), the previous implementation silently discarded the response body. Customers had no visibility into why their handler response was rejected.

Solution

Read and log the response body for non-2xx responses from the Lambda Runtime API.

The main challenge is that RuntimeApiClientFuture is a pub enum without #[non_exhaustive], so adding a new variant would be a breaking change for any downstream code that exhaustively matches on it. To avoid this, we instead spawn a background task via tokio::task::spawn to read the body asynchronously without modifying the enum's public surface.

Changes

  • RuntimeApiClientService now captures a tokio::runtime::Handle at construction time via Handle::try_current(), with a fallback minimal current-thread runtime for non-tokio contexts (e.g. tests)
  • On non-2xx responses, a background task is spawned to collect and log the response body including the HTTP status code
  • Added a log_or_print! macro in lib.rs that dispatches to tracing::error! when a tracing dispatcher is set, and falls back to eprintln! otherwise — consistent with the existing codebase pattern
  • Added unit tests covering 2xx success, 410 InvokeTimeout, and inner service errors, using httpmock

Example log output

ERROR lambda_runtime::layers::api_client: Lambda Runtime API returned non-200 response status=410 body={"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}

First(#[pin] F, Arc<Client>),
Second(#[pin] BoxFuture<'static, Result<http::Response<Incoming>, BoxError>>),
#[pin_project(project = RuntimeApiClientFutureStateProj)]
pub enum RuntimeApiClientFutureState<F> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a breaking change. It is an internal type, that probably nobody is using, but it is part of our public API.

If we did want to go this route, I'd suggest making a new one, and marking the old one as deprecated. And cutting over all usage.

Given that probably nobody is using it, the new type is fairly painless. I don't mind that option.

That would only require a minor version bump (the deprecated old enum).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be breaking if we had tagged it #[non_exhaustive], but we missed this one.

Definitely if we make a new type, we should tag it non exhaustive :)

Copy link
Author

@darklight3it darklight3it Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I was already fixing that. It's still RuntimeApiClientFutureState. The CR was still a draft.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we changed the internals ase we said in the other CR. Cannot make it pub (crate) from api_client but still it isn't exposed in mod.rs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with you and rcoh@ offline -

Technically this is a breaking change, whether we change the innards of RuntimeApiClientFuture or substitute it for a different enum. The type is accessible as an associated type on the RuntimeApiClientService which is publicly visible.

None of our internal machinery relies on this specific type surface - we are just polling the service that the runtime api client exposes. I'm pretty sure whoever implemented this just didn't realize that pub enum exposed all of its variants and field contents.

It would generally be risky to try to deploy something in the runtime api client, that the lambda-runtime needs to be upgraded in lockstep to, else they will have compile errors. And then that's fragile to do except across a breaking version boundary.

BUT - we aren't in that situation, the only people that would be impacted would be people that built their own higher level runtime on top of our own, or otherwise are doing things with the innards of the runtime client.

I've confirmed via code search that the only cases of RuntimeApiClientFuture being used are forks of the api client.

So, with that in mind, I think this is probably a fairly low risk change to make.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then, the lowest risk change available would be to just keep the current future name, since then anybody that is naming the future as part of a tower service stack, wouldn't be broken.

And with the future, it is safer to add a new field to the existing variant, rather than add a new variant. Since that won't break match statements.

Unfortunately, that doesn't get us out of this mess for next time, but it is the least risky.

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said - what field to add - well, we already have the contents we want inside the RuntimeApiClientFuture, the issue is that we can't poll it to completion from inside the runtime (specifically, the body field of the http::Response future stored inside Second)

For that, we really would prefer a third state transition which is riskier than a new field .

I guess, we could also just change the second variant's contents to an enum (which is clearly sillier than a third variant, but less breaking-prone maybe?)

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead we could do something hacky like, check the status code, and bubble up the inner future inside of the BoxError, and downcast the BoxError from inside our managing runtime to decide whether to finish polling the body?

And then, another option we discussed was just storing a client current thread runtime or something to drive the future, but that's pretty awful... but we don't currently require that this be run inside a tokio runtime, so i don't think we can rely on tokio::spawn.

I guess we could also do something with a sync thread and block_on, but that is pretty terrible :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the tokio::spawn idea, the bubbling up of the error required a changing in then RuntimeApiClientFuture signature that while respecting the current enum and variant structure would have changed the signature constituting a breaking change.

This seems to be the only way for this to work without any signature change. The integration tests are working and the test I added are working too.

@darklight3it darklight3it marked this pull request as ready for review February 24, 2026 12:29
Copy link
Collaborator

@jlizen jlizen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have something workable here. I'll ping my colleague to take a quick peek as well for landmine avoidance.

Some changes needed (assuming he doesn't suggest a different route):

  • we need to pass the tokio runtime handle through to the the future so that it doesn't panic on being awaited in non-tokio (one option is a static oncelock holding it)
  • probably we should reflect against the logged output in our test

Err(_) => {
// Fallback: build a minimal current-thread runtime and keep it alive alongside
// the service. This should only happen outside of a tokio context (e.g. some tests).
let rt = Arc::new(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that the rt needs to be inside an arc? It is stored inside a struct that doesn't have Clone anyway

let status = resp.status();
// Spawn a background task to read and log the error body without
// blocking the current poll or requiring a new enum variant.
Handle::current().spawn(async move {
Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic if the future is awaited outside of a tokio runtime. It isn't actually using the handle we store on the client.

That said, there isn't a GREAT way to pass through the runtime handle to the future without changing the public signature, since we are stuck with an enum with wide open variants and inner fields. We would essentially need to add the handle to both the first and second, and have first plumb it through to second. IE add a field to both variants.

The main non breaking way I can think of would be to store the runtime in a static OnceLock (which you initialize inside RuntimeApiClientService::new(), and it stores the runtime and returns the handle via get_or_init. So nothing stored on the client itself anymore.

At that point, you won't need to carry any data in the future. And the runtime is something of a global resource, so it's not too strange..?

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we might need to do would be, add a check to see if a caller-provided tokio runtime is still alive, since it might shut down early and cause a panic? Not sure how much we need to guard against that... Will get a second opinion

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the only thing you need to do here log the message? Rather than doing it this way, can you send the resp over a channel to someone else to read it out and print it?

One "dumb" option is to use the response.extensions_mut() to get the channel (or runtime handle) in there.

That avoids the hazard of attempting to read it here from the thread local.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we only need to log the message.

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern is, the place we would want to store a runtime handle or channel receiver in order to write into response extensions, is the underlying runtime api client, which is also wide open as pub / non-exhaustive.

So it wouldn't necessarily sidestep the need for static state.

A nicer option might be, instead use a tokio::task_local! at the point where we are spawning the tokio tasks, to inject a mpsc::Sender. And then, we can listen on it in a tokio::select! loop alongside its work loop.

That reduces the hazard of of the thread local reads, right @rcoh ? It also has the side benefit of allowing us to enrich the logs with per-worker context, since we have a worker id span instrumented there.

And then, inside the RuntimeApiClientFuture, where we read the response from second, we can (perhaps behind cfg(feature="concurrency-tokio") check if LocalKey::try_get returns something, and if yes, write the response to the channel to let the worker-managed channel receive loop await the body and write out its log?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also just use tokio::runtime::Handle::current() inside the lambda_runtime_api_client gated behind tokio_concurrency and drill that feature down? IE use .map_ok on the request future, get the handle, and shove it in response extensions, and then have the RuntimeApiClientFuture check extensions and spawn the task.

Feels more infectious though than just task_locals, and it still has some weird edge cases if anybody is instantiating the lambda_runtime_api_client::Client directly.

}
}
});
break Ok(());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these breaks needed? i would have expected we can just have eg Ok(_) => Ok(())

break Ok(());
}
Ok(_) => break Ok(()),
Err(err) => break Err(err),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to confirm, we are already logging the connection related error higher up? It looked like the C++ impl was logging them all at the same site (though it's fine if we handle differently, with the same ultimate logs)

pub use tower::{self, service_fn, Service};

/// Logs using tracing `error!` if a dispatcher is set, otherwise falls back to `eprintln!`.
macro_rules! log_or_print {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cute, can we cut over this usage as well?

}

#[tokio::test]
async fn test_invoke_timeout_410_is_logged_and_succeeds() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually asserting that anything is logged, we probably want to use #[traced_test] to reflect against log output

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. I can add this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants