
feat: add support for running on AWS Lambda managed instance types#2083

Open
herin049 wants to merge 8 commits into open-telemetry:main from herin049:feat/managed-instances

Conversation

@herin049
Contributor

Adds support for running on AWS Lambda managed instances.

Lambda managed instances differ from standard Lambda functions in several areas, but the differences most relevant to the OpenTelemetry collector layer are:

  • For managed instances, the Extension API does not allow subscribing to the Invoke event type.
  • For managed instances, the Telemetry API does not report platform.runtimeDone events.
  • A managed Lambda instance is never frozen, which removes the need for the decouple processor.
  • Multiple Lambda processes can be created within a single execution environment (this is particularly relevant for ensuring that auto-instrumentation works with the bootstrap script).
  • Multiple Lambda function invocations may be in flight simultaneously within a given execution environment.

For more information see: https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html

In order to accommodate the differences above, the following changes have been made to the layer when the extension determines, via the AWS_LAMBDA_INITIALIZATION_TYPE environment variable, that the initialization type is lambda-managed-instances (a sketch of this detection follows the list):

  • The extension no longer subscribes to the Invoke event type, and it no longer subscribes to the Telemetry API to listen for the platform.runtimeDone event.
  • The decouple processor is no longer added to any pipelines.
  • The FunctionInvoked() and FunctionFinished() lifecycle methods are no longer invoked on lifecycle listeners.
  • The wrapper script now applies instrumentation inside the script itself to ensure instrumentation is properly applied to newly created Python processes (this applies to all initialization types).
  • A few changes have been made to the Telemetry API receiver to accommodate differences in the events reported by the Telemetry API.
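As a rough illustration of the detection mentioned above, here is a minimal Go sketch modeled on the lambdalifecycle helpers this PR introduces; the OnDemand fallback name and the exact layout are assumptions rather than the PR's actual code:

package lambdalifecycle

import "os"

// InitType identifies the Lambda initialization type reported by the platform.
type InitType int

const (
	// OnDemand is an assumed name for the default (non-managed) initialization type.
	OnDemand InitType = iota
	// LambdaManagedInstances corresponds to the "lambda-managed-instances" value.
	LambdaManagedInstances
)

// InitTypeEnvVar is the environment variable Lambda uses to report the initialization type.
const InitTypeEnvVar = "AWS_LAMBDA_INITIALIZATION_TYPE"

// InitTypeFromEnv maps the value of the given environment variable to an InitType.
func InitTypeFromEnv(envVar string) InitType {
	if os.Getenv(envVar) == "lambda-managed-instances" {
		return LambdaManagedInstances
	}
	return OnDemand
}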

I have added relevant unit tests and have manually verified the implementation using the following repo: https://github.com/herin049/aws-lambda-managed, which configures the layer to export signals to Grafana Cloud.

@herin049 herin049 requested a review from a team as a code owner December 20, 2025 04:12
@serkan-ozal
Contributor

Hi @herin049,

For the Python SDK related changes, as far as I understood from your explanation, AWS_LAMBDA_EXEC_WRAPPER is not applied to the spawned Lambda processes (only to the main Lambda runtime process) on Lambda managed instances. Is that correct? Is there any official AWS documentation mentioning or explaining this behavior?

Otherwise (if AWS_LAMBDA_EXEC_WRAPPER could still be used for the spawned Lambda processes), the opentelemetry-instrument CLI would instrument the spawned Lambda processes.

@herin049
Contributor Author

herin049 commented Dec 23, 2025

Hi @serkan-ozal, thanks for the review.

Yes, your understanding is correct. To reiterate, here is what I assume happens internally for AWS Lambda managed instances:

  1. During initialization of the EC2 VM, Lambda first executes the AWS_LAMBDA_EXEC_WRAPPER script as usual.
  2. After this initialization phase completes, Lambda spawns N child Python processes and imports the handler module directly in each child process by calling importlib.import_module. This is not an issue for most wrapper scripts, since they typically just set environment variables or manipulate the file system in some manner, and those side effects are visible to each child process. It is an issue for the auto-instrumentation library, however, because the child processes are fresh interpreter processes and therefore no longer have the patching applied by the auto-instrumentation libraries.

I can't find any documentation on this beyond the docs stating that managed Lambda instances can serve many requests concurrently, and the only way to do that with true parallelism in Python is to use multiple processes.

Regardless, this change will always be safe to make because even for standard Lambda instances the behavior will be identical. That is, running opentelemetry-instrument python main.py is nearly identical to running python main.py and calling auto_instrumentation.initialize() at the top of the file. These changes also make PR #2069 irrelevant. Essentially, even if the assumptions I am making are incorrect, regular Lambda functions will still be instrumented properly, which is why this is the most straightforward and safest approach to take.

@serkan-ozal
Contributor

@herin049

Yes, I know that using a wrapper handler which delegates to the user handler is a very common approach. But when asking to verify this behavioral change with the AWS_LAMBDA_EXEC_WRAPPER env var, I was mostly thinking about the other runtimes. Except for the Ruby runtime, auto-instrumentation for the other runtimes works without a wrapper handler (NODE_OPTIONS for Node.js and JAVA_TOOL_OPTIONS for the Java agent).

However, even though the AWS_LAMBDA_EXEC_WRAPPER env var is not applied to spawned processes, this should not be an issue for the Node.js and Java runtimes, as their wrappers only set some env vars to configure/activate OTel instrumentation, and these env vars should be inherited by the spawned worker/child processes, unless the main process filters the env vars it passes on (that is the point we need to check). One more point on this: loading the user handler in a wrapper handler is a little more complex in Node.js, as we need to take care of some more cases (paths, CJS vs. ESM, etc.). Another point is that we may also need to be sure the OTel SDK is not instrumenting the main process itself, because otherwise, depending on the implementation of the main process, the OTel SDK might report spans that are not related to the user code.

In addition to the points above, for Python, instead of a wrapper handler, one other approach would be using a sitecustomize.py that is available under PYTHONPATH. Basically, you should be able to initialize the OpenTelemetry SDK automatically at Python startup by placing your OTel initialization code in a sitecustomize.py file and ensuring that its directory is included in PYTHONPATH. Python imports sitecustomize on interpreter startup, allowing OTel to be configured before any application code runs.

@herin049
Contributor Author

Thanks @serkan-ozal

I see where you are coming from now. To limit the scope of these changes, I've focused on the collector-level changes and on adding support for Python for now. If you'd like, I can create an issue for verifying and making changes for all of the supported Lambda runtimes on Lambda managed instances. I am not as familiar with how auto-instrumentation works for the other runtimes, but I can certainly look into this more and make any required changes based on my findings in order to support managed instances. In either case, the collector changes I have made in this PR will not change even if there are substantial changes to the auto-instrumentation logic in some runtimes.

With regard to your point about instrumenting the original parent process, I don't think this is a concern. I have not observed any irregular spans being reported, even with all of the auto-instrumentation libraries enabled.

From what I have found so far, it seems the auto-instrumentation wrapper command is not working properly for the worker Python processes because sitecustomize.py is somehow not being loaded for the worker processes (the auto-instrumentation wrapper command already adds a custom sitecustomize.py directory to the PYTHONPATH environment variable). There are two possible reasons for this that I can think of: either the updated PYTHONPATH environment variable is not being propagated to the other Python processes correctly, or sitecustomize.py is not being loaded for the new Python processes. I think the latter scenario is more likely, because I know the environment variables set in the wrapper script are propagated to the worker processes correctly; otherwise the modifications to the _HANDLER and ORIG_HANDLER environment variables would not take effect and auto-instrumentation would not be applied at all, yet my tests do show telemetry being reported properly in all cases. So we could try switching to an approach where we create our own sitecustomize.py file and have the wrapper script add it to the PYTHONPATH, but I am guessing we would run into the same issues as with the wrapper command.

The reason I made the changes to the wrapper script the way I did is that they are relatively minimal and backwards compatible, ensuring that the behavior matches the previous wrapper script.

@serkan-ozal
Contributor

@herin049 I think it is better to limit the scope of this PR to only the changes related to the collector. For the SDK related changes, first I would like to understand the behavior of the main process and the spawned processes (I will be looking into it too) when an AWS Lambda managed instance is used.

So, my take is: please create issue(s) for the SDK related changes and remove the Python related changes from this PR.
@tylerbenson @wpessers @pragmaticivan @maxday WDYT?

@herin049
Contributor Author

@herin049 I think it is better to limit the scope of this PR to only the changes related to the collector. For the SDK related changes, first I would like to understand the behavior of the main process and the spawned processes (I will be looking into it too) when an AWS Lambda managed instance is used.

So, my take is: please create issue(s) for the SDK related changes and remove the Python related changes from this PR. @tylerbenson @wpessers @pragmaticivan @maxday WDYT?

Sounds good to me, @serkan-ozal. I have reverted the Python SDK related changes in this PR. I can work on a follow-up PR to update all of the SDKs where necessary to support managed instance types and do some additional research myself.

@wpessers
Contributor

So, my take is: please create issue(s) for the SDK related changes and remove the Python related changes from this PR. @tylerbenson @wpessers @pragmaticivan @maxday WDYT?

Yes I agree!

@serkan-ozal
Contributor

I am OK with the changes here. @herin049, can you please resolve the conflicts so we can then merge this one?

@herin049
Contributor Author

I am OK with the changes here. @herin049, can you please resolve the conflicts so we can then merge this one?

Awesome, will resolve the conflicts shortly.

@herin049 herin049 force-pushed the feat/managed-instances branch from 114e673 to 904b61f on February 12, 2026 18:38
@herin049
Contributor Author

@wpessers and @serkan-ozal: I made a trivial cleanup change after reviewing the code again; this should be good to go.


Copilot AI left a comment


Pull request overview

This PR updates the Lambda layer/collector to support AWS Lambda managed instance types by branching behavior based on AWS_LAMBDA_INITIALIZATION_TYPE (notably: no Invoke subscription, no platform.runtimeDone wait path, and no decouple processor insertion), and adjusts telemetry parsing to tolerate differences in managed-instance telemetry events.

Changes:

  • Add collector/lambdalifecycle module to parse AWS_LAMBDA_INITIALIZATION_TYPE into a typed InitType.
  • Update lifecycle manager + extension client registration to subscribe only to SHUTDOWN (and skip Telemetry API listener) for managed instances.
  • Update telemetry API receiver to better handle missing/partial report fields and request-id association, and adjust tests accordingly.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Summary per file:

  • collector/receiver/telemetryapireceiver/receiver_test.go: Updates expected platform report log formatting behavior.
  • collector/receiver/telemetryapireceiver/receiver.go: Adds init-type awareness, improves report formatting tolerance, and adjusts request-id handling for managed instances.
  • collector/receiver/telemetryapireceiver/go.mod: Adds dependency/replace wiring for the new lambdalifecycle module.
  • collector/lambdalifecycle/types_test.go: Adds unit tests for InitType parsing/string/env behavior.
  • collector/lambdalifecycle/types.go: Defines the InitType enum and parsing helpers.
  • collector/lambdalifecycle/go.sum: Adds sums for lambdalifecycle module dependencies.
  • collector/lambdalifecycle/go.mod: Declares the lambdalifecycle submodule and its test dependency.
  • collector/lambdalifecycle/constants.go: Defines the AWS_LAMBDA_INITIALIZATION_TYPE env var constant.
  • collector/internal/lifecycle/manager_test.go: Updates tests to use the new extension client constructor signature.
  • collector/internal/lifecycle/manager.go: Branches extension event subscriptions and Telemetry API listener startup based on init type.
  • collector/internal/lifecycle/constants.go: Centralizes the AWS_LAMBDA_RUNTIME_API env var name.
  • collector/internal/extensionapi/client.go: Extends NewClient/Register to accept configurable subscribed event types.
  • collector/internal/confmap/converter/decoupleafterbatchconverter/converter_test.go: Adds a test ensuring decouple isn't appended for managed instances.
  • collector/internal/confmap/converter/decoupleafterbatchconverter/converter.go: Skips decouple insertion when init type is managed instances.
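The converter change in the last row is roughly of the following shape. This is a hedged sketch rather than the PR's exact code: the converter type, the import path for the lambdalifecycle module, and the elided append logic are assumptions.

package decoupleafterbatchconverter

import (
	"context"

	"go.opentelemetry.io/collector/confmap"

	"github.com/open-telemetry/opentelemetry-lambda/collector/lambdalifecycle" // assumed import path
)

type converter struct{}

// Convert leaves the pipelines untouched on managed instances, since the execution
// environment is never frozen there and the decouple processor is unnecessary.
// For other init types it falls through to the existing logic that appends the
// decouple processor after batch processors (elided here).
func (converter) Convert(_ context.Context, _ *confmap.Conf) error {
	if lambdalifecycle.InitTypeFromEnv(lambdalifecycle.InitTypeEnvVar) == lambdalifecycle.LambdaManagedInstances {
		return nil
	}
	// ... existing append-decouple-after-batch logic ...
	return nil
}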
Comments suppressed due to low confidence (1)

collector/receiver/telemetryapireceiver/receiver.go:384

  • In createLogs(), the current request ID is updated twice for platform.start: first via updateCurrentRequestId(requestId) and then again via direct assignment to r.currentFaasInvocationID. The direct assignment bypasses the LambdaManagedInstances guard in updateCurrentRequestId and is redundant for other init types; remove the direct assignment and rely on updateCurrentRequestId (or route all writes through the helper) so managed-instance behavior stays consistent.
			if requestId != "" {
				logRecord.Attributes().PutStr(string(semconv.FaaSInvocationIDKey), requestId)

				// If this is the first event in the invocation with a request id (i.e. the "platform.start" event),
				// set the current invocation id to this request id.
				if el.Type == string(telemetryapi.PlatformStart) {
					r.currentFaasInvocationID = requestId
				}
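A minimal sketch of what routing all writes through the helper, as the suppressed comment suggests, might look like; the receiver type name and the initType field are assumed, since they are not shown in this excerpt:

package telemetryapireceiver

import (
	"github.com/open-telemetry/opentelemetry-lambda/collector/lambdalifecycle" // assumed import path
)

// Assumed receiver shape; only the fields needed for this sketch are shown.
type telemetryAPIReceiver struct {
	currentFaasInvocationID string
	initType                lambdalifecycle.InitType
}

// updateCurrentRequestId is the single write path for the current invocation id,
// so the managed-instances guard applies to every caller, including platform.start.
func (r *telemetryAPIReceiver) updateCurrentRequestId(requestId string) {
	// Multiple invocations can be in flight at once on managed instances,
	// so a single "current" request id would be misleading there.
	if r.initType == lambdalifecycle.LambdaManagedInstances {
		return
	}
	r.currentFaasInvocationID = requestId
}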


if r.lastPlatformStartTime != "" && el.Time != "" {
	r.lastPlatformEndTime = el.Time
	r.logger.Info(fmt.Sprintf("Init end: %s", r.lastPlatformEndTime), zap.Any("event", el))
}

Copilot AI Feb 16, 2026


The PlatformInitReport branch updates lastPlatformEndTime but never creates the init span or clears lastPlatformStartTime/lastPlatformEndTime. If managed instances emit platform.initReport instead of platform.initRuntimeDone, this will prevent coldstart/init spans from being produced and can leave stale init timestamps in the receiver state. Consider mirroring the PlatformInitRuntimeDone handling here (createPlatformInitSpan when start/end are present, then reset the timestamps).

Suggested change
			}
		}
		if len(r.lastPlatformStartTime) > 0 && len(r.lastPlatformEndTime) > 0 {
			if record, ok := el.Record.(map[string]any); ok {
				if td, err := r.createPlatformInitSpan(record, r.lastPlatformStartTime, r.lastPlatformEndTime); err == nil {
					err := r.nextTraces.ConsumeTraces(context.Background(), td)
					if err == nil {
						r.lastPlatformEndTime = ""
						r.lastPlatformStartTime = ""
					} else {
						r.logger.Error("error receiving traces", zap.Error(err))
					}
				}
			}
		}

Comment on lines 50 to 54
	listener           *telemetryapi.Listener
	wg                 sync.WaitGroup
	lifecycleListeners []lambdalifecycle.Listener
	initType           lambdalifecycle.InitType
}

Copilot AI Feb 16, 2026


The new manager.initType field is assigned in NewManager but never read anywhere, so it adds state without affecting behavior. Either remove the field or use it in processEvents()/other methods so it serves a purpose (e.g., for logging, guarding listener usage, or future branching).

Comment on lines +67 to +73
var extensionEvents []extensionapi.EventType
initType := lambdalifecycle.InitTypeFromEnv(lambdalifecycle.InitTypeEnvVar)
if initType == lambdalifecycle.LambdaManagedInstances {
	extensionEvents = []extensionapi.EventType{extensionapi.Shutdown}
} else {
	extensionEvents = []extensionapi.EventType{extensionapi.Invoke, extensionapi.Shutdown}
}

Copilot AI Feb 16, 2026


The Lambda managed-instances branching in NewManager (event subscription selection + skipping Telemetry API listener startup) is new behavior but isn't covered by unit tests in this package. Add a test that sets AWS_LAMBDA_INITIALIZATION_TYPE to lambda-managed-instances and verifies the extension client is registered with only SHUTDOWN and that listener/Wait + FunctionInvoked/Finished paths are not used.
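As a starting point, the env-var half of such a test could look like the sketch below; the import path for the lambdalifecycle module is assumed, and the manager construction and registration assertions are omitted because the NewManager signature is not visible in this excerpt:

package lifecycle

import (
	"testing"

	"github.com/open-telemetry/opentelemetry-lambda/collector/lambdalifecycle" // assumed import path
)

func TestInitTypeLambdaManagedInstances(t *testing.T) {
	// t.Setenv restores the previous value automatically when the test finishes.
	t.Setenv(lambdalifecycle.InitTypeEnvVar, "lambda-managed-instances")

	got := lambdalifecycle.InitTypeFromEnv(lambdalifecycle.InitTypeEnvVar)
	if got != lambdalifecycle.LambdaManagedInstances {
		t.Fatalf("expected LambdaManagedInstances, got %v", got)
	}
}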
