Skip to content

fix: exceeds payload size limit as invocation error#341

Open
andreas-baakind-adsk wants to merge 3 commits intoaws:mainfrom
andreas-baakind-adsk:fix/checkpoint-payload-size-handling
Open

fix: exceeds payload size limit as invocation error#341
andreas-baakind-adsk wants to merge 3 commits intoaws:mainfrom
andreas-baakind-adsk:fix/checkpoint-payload-size-handling

Conversation

@andreas-baakind-adsk
Copy link
Copy Markdown

@andreas-baakind-adsk andreas-baakind-adsk commented Apr 9, 2026

Issue #, if available:
#342

Description of changes:
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.

The payload size exceeded constraint is deterministic: the exact same Lambda invocation, producing the exact same output, will fail with the exact same error on every retry. No amount of retrying resolves it — the data is too large by definition. AWS's own documentation classifies the equivalent Step Functions error (States.DataLimitExceeded) as terminal and permanent.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Throw CheckpointError with error_category as INVOCATION for 4xx
InvalidParameterValueException errors related to payload size limit
exceeded, as these errors are not retryable.
Copy link
Copy Markdown
Contributor

@yaythomas yaythomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you so much for your contribution @andreas-baakind-adsk!

I've got a suggestion for streamlining this. It will result in behaviour change as follows:

4xx non-429 (other than Invalid Token) previously would have been EXECUTION errs that retried. They will now be EXECUTION errs that do NOT retry.

So any 4xx non-429 checkpoint error that previously caused Lambda to retry will now return FAILED immediately.

That includes:

  • AccessDeniedException (403)
  • ResourceNotFoundException (404)
  • Any other InvalidParameterValueException with a different message
  • Payload-too-large (the bug being fixed)

For 403 and 404 retrying was never going to help anyway, so permanent fail is more correct.

The only case where the old retry behaviour was possibly useful was payload-too-large with a non-deterministic callable, but in this case the expectation in the reference TS implementation is that the step is responsible for dealing with the result size before it gets to checkpointing.

CheckpointErrorCategory is not in the public API (init.py), so no public API break.

However, tests asserting on error_category values and is_retriable() will need updating:

  1. exceptions_test.py
  2. execution_test.py.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, this logic is the wrong way round...

not your fault @andreas-baakind-adsk, you just got here 😁

this should be

def is_retriable(self):
    return self.error_category == CheckpointErrorCategory.INVOCATION

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see this, great catch :)

# is not Output Payload Too Large => Execution
(error.get("Message") or "").startswith(
"STEP output payload size must be less than or equal to"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to be brittle. in conjunction with the Invocation/Execution logic being the wrong way round with the is_retriable (see below), if we take this opportunity to change both we can simplify all this.

(taking inspiration from the TypeScript reference implementation here: https://github.com/aws/aws-durable-execution-sdk-js/blob/61896cb2480a4216683194d192a1d763e12bc25b/packages/aws-durable-execution-sdk-js/src/utils/checkpoint/checkpoint-manager.ts#L227-L256)

    @classmethod
    def from_exception(cls, exception: Exception) -> CheckpointError:
        base = BotoClientError.from_exception(exception)
        metadata: AwsErrorMetadata | None = base.response_metadata
        error: AwsErrorObj | None = base.error
        error_category: CheckpointErrorCategory = CheckpointErrorCategory.INVOCATION

        # 4xx errors (except 429) are permanent failures (EXECUTION), unless it's an
        # InvalidParameterValueException with "Invalid Checkpoint Token" which is retriable (INVOCATION).
        # 5xx, 429, and network errors are retriable (INVOCATION).
        status_code: int | None = (metadata and metadata.get("HTTPStatusCode")) or None
        if (
            status_code
            and BAD_REQUEST_ERROR <= status_code < SERVICE_ERROR
            and status_code != TOO_MANY_REQUESTS_ERROR
            and error
            and not (
                (error.get("Code") or "") == "InvalidParameterValueException"
                and (error.get("Message") or "").startswith("Invalid Checkpoint Token")
            )
        ):
            error_category = CheckpointErrorCategory.EXECUTION
        return CheckpointError(str(exception), error_category, error, metadata)

    def is_retriable(self):
        return self.error_category == CheckpointErrorCategory.INVOCATION

the result of this:

  1. 4xx non-429 + InvalidParameterValueException + "Invalid Checkpoint Token" -> INVOCATION (retry)
  2. 4xx non-429 (anything else) -> EXECUTION (permanent)
  3. everything else (5xx, 429, network) -> INVOCATION (retry)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that makes a lot more sense. I was a bit unsure of the string matching as you pointed out, it is brittle.

@yaythomas yaythomas moved this from Backlog to In review in aws-durable-execution Apr 10, 2026
@yaythomas yaythomas self-assigned this Apr 10, 2026
@yaythomas yaythomas added bug Something isn't working and removed bug Something isn't working labels Apr 10, 2026
- Fix inverted is_retriable(): INVOCATION errors are retriable,
  EXECUTION errors are permanent failures (logic was backwards)
- Classify payload size exceeded as EXECUTION (permanent), not
  INVOCATION — exceeding a size limit is not a transient failure
- Simplify CheckpointError.from_exception() conditional logic
- Update error-handling docs to reflect correct retry behavior
- Update tests to match corrected classification semantics
@yaythomas
Copy link
Copy Markdown
Contributor

thanks for the update @andreas-baakind-adsk, looking good!

just some minor formatting/linting...

tip: see https://github.com/aws/aws-durable-execution-sdk-python/blob/main/CONTRIBUTING.md#developer-workflow

you can run the CI lint checks locally so you know it'll work by the time it gets to the CI :-)

ops/ci-checks.sh

@andreas-baakind-adsk
Copy link
Copy Markdown
Author

Thank you for the great feeedback, @yaythomas. I will for sure run the ci-checks locally next time to make sure there are no surprises after pushing my changes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

2 participants