fix: exceeds payload size limit as invocation error#341
fix: exceeds payload size limit as invocation error#341andreas-baakind-adsk wants to merge 3 commits intoaws:mainfrom
Conversation
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.
yaythomas
left a comment
There was a problem hiding this comment.
thank you so much for your contribution @andreas-baakind-adsk!
I've got a suggestion for streamlining this. It will result in behaviour change as follows:
4xx non-429 (other than Invalid Token) previously would have been EXECUTION errs that retried. They will now be EXECUTION errs that do NOT retry.
So any 4xx non-429 checkpoint error that previously caused Lambda to retry will now return FAILED immediately.
That includes:
- AccessDeniedException (403)
- ResourceNotFoundException (404)
- Any other InvalidParameterValueException with a different message
- Payload-too-large (the bug being fixed)
For 403 and 404 retrying was never going to help anyway, so permanent fail is more correct.
The only case where the old retry behaviour was possibly useful was payload-too-large with a non-deterministic callable, but in this case the expectation in the reference TS implementation is that the step is responsible for dealing with the result size before it gets to checkpointing.
CheckpointErrorCategory is not in the public API (init.py), so no public API break.
However, tests asserting on error_category values and is_retriable() will need updating:
- exceptions_test.py
- execution_test.py.
There was a problem hiding this comment.
oof, this logic is the wrong way round...
not your fault @andreas-baakind-adsk, you just got here 😁
this should be
def is_retriable(self):
return self.error_category == CheckpointErrorCategory.INVOCATION
There was a problem hiding this comment.
I did not see this, great catch :)
| # is not Output Payload Too Large => Execution | ||
| (error.get("Message") or "").startswith( | ||
| "STEP output payload size must be less than or equal to" | ||
| ) |
There was a problem hiding this comment.
this is going to be brittle. in conjunction with the Invocation/Execution logic being the wrong way round with the is_retriable (see below), if we take this opportunity to change both we can simplify all this.
(taking inspiration from the TypeScript reference implementation here: https://github.com/aws/aws-durable-execution-sdk-js/blob/61896cb2480a4216683194d192a1d763e12bc25b/packages/aws-durable-execution-sdk-js/src/utils/checkpoint/checkpoint-manager.ts#L227-L256)
@classmethod
def from_exception(cls, exception: Exception) -> CheckpointError:
base = BotoClientError.from_exception(exception)
metadata: AwsErrorMetadata | None = base.response_metadata
error: AwsErrorObj | None = base.error
error_category: CheckpointErrorCategory = CheckpointErrorCategory.INVOCATION
# 4xx errors (except 429) are permanent failures (EXECUTION), unless it's an
# InvalidParameterValueException with "Invalid Checkpoint Token" which is retriable (INVOCATION).
# 5xx, 429, and network errors are retriable (INVOCATION).
status_code: int | None = (metadata and metadata.get("HTTPStatusCode")) or None
if (
status_code
and BAD_REQUEST_ERROR <= status_code < SERVICE_ERROR
and status_code != TOO_MANY_REQUESTS_ERROR
and error
and not (
(error.get("Code") or "") == "InvalidParameterValueException"
and (error.get("Message") or "").startswith("Invalid Checkpoint Token")
)
):
error_category = CheckpointErrorCategory.EXECUTION
return CheckpointError(str(exception), error_category, error, metadata)
def is_retriable(self):
return self.error_category == CheckpointErrorCategory.INVOCATION
the result of this:
- 4xx non-429 + InvalidParameterValueException + "Invalid Checkpoint Token" -> INVOCATION (retry)
- 4xx non-429 (anything else) -> EXECUTION (permanent)
- everything else (5xx, 429, network) -> INVOCATION (retry)
There was a problem hiding this comment.
I agree, that makes a lot more sense. I was a bit unsure of the string matching as you pointed out, it is brittle.
- Fix inverted is_retriable(): INVOCATION errors are retriable, EXECUTION errors are permanent failures (logic was backwards) - Classify payload size exceeded as EXECUTION (permanent), not INVOCATION — exceeding a size limit is not a transient failure - Simplify CheckpointError.from_exception() conditional logic - Update error-handling docs to reflect correct retry behavior - Update tests to match corrected classification semantics
|
thanks for the update @andreas-baakind-adsk, looking good! just some minor formatting/linting... tip: see https://github.com/aws/aws-durable-execution-sdk-python/blob/main/CONTRIBUTING.md#developer-workflow you can run the CI lint checks locally so you know it'll work by the time it gets to the CI :-) |
|
Thank you for the great feeedback, @yaythomas. I will for sure run the ci-checks locally next time to make sure there are no surprises after pushing my changes :) |
Issue #, if available:
#342
Description of changes:
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.
The payload size exceeded constraint is deterministic: the exact same Lambda invocation, producing the exact same output, will fail with the exact same error on every retry. No amount of retrying resolves it — the data is too large by definition. AWS's own documentation classifies the equivalent Step Functions error (States.DataLimitExceeded) as terminal and permanent.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.