Expected Behavior
When a checkpoint payload exceeds the 262,144-byte limit, the SDK should classify the error as non-retriable (i.e., CheckpointErrorCategory.INVOCATION). Lambda should not be retried, since the exact same workflow invocation will produce the exact same oversized payload on every subsequent attempt.
Actual Behavior
The SDK classifies the InvalidParameterValueException / "STEP output payload size must be less than or equal to 262144 bytes" error as CheckpointErrorCategory.EXECUTION. This causes handle_checkpoint_error() to re-raise the error (raise error from None), which Lambda interprets as an unhandled failure and retries the invocation. Since the payload size is deterministic, the invocation retries indefinitely until the Lambda retry limit is exhausted, wasting compute and delaying any error surfacing.
Steps to Reproduce
- Write a durable execution workflow that produces a checkpoint output payload larger than 262,144 bytes (e.g., return a large dict or list from a
@step).
- Deploy the Lambda with the SDK handler wrapping the workflow.
- Invoke the Lambda.
- Observe that the Lambda is retried repeatedly (up to the configured retry limit) rather than failing immediately with a FAILED status.
- Confirm in the logs that each invocation fails with InvalidParameterValueException: STEP output payload size must be less than or equal to 262144 bytes — identical on every retry.
SDK Version
1.*
Python Version
3.13
Is this a regression?
No
Last Working Version
No response
Additional Context
No response
Expected Behavior
When a checkpoint payload exceeds the 262,144-byte limit, the SDK should classify the error as non-retriable (i.e.,
CheckpointErrorCategory.INVOCATION). Lambda should not be retried, since the exact same workflow invocation will produce the exact same oversized payload on every subsequent attempt.Actual Behavior
The SDK classifies the InvalidParameterValueException / "STEP output payload size must be less than or equal to 262144 bytes" error as
CheckpointErrorCategory.EXECUTION. This causeshandle_checkpoint_error()to re-raise the error (raise error from None), which Lambda interprets as an unhandled failure and retries the invocation. Since the payload size is deterministic, the invocation retries indefinitely until the Lambda retry limit is exhausted, wasting compute and delaying any error surfacing.Steps to Reproduce
@step).SDK Version
1.*
Python Version
3.13
Is this a regression?
No
Last Working Version
No response
Additional Context
No response