
[SPARK-55639][K8S] Support Recovery-mode K8s Executors#54558

Closed
dongjoon-hyun wants to merge 5 commits into apache:master from dongjoon-hyun:SPARK-55639-2

Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 28, 2026

What changes were proposed in this pull request?

This PR aims to support Recovery-mode K8s Executors.

Why are the changes needed?

Frequently, Spark jobs have skewed data and correspondingly skewed tasks. In the worst case, a single resource-hungry task becomes a serial killer of executors. Homogeneous executors are unable to handle this kind of situation.

So, this PR introduces a recovery-mode executor, created when the Spark driver detects an OOM situation from failed executors. A recovery-mode executor aims to serve only a single task. In other words, after an OOM, a new executor is created with the same pod spec (memory and cores), except that the ENV_EXECUTOR_CORES environment variable is set to spark.task.cpus so that only a single task runs per executor JVM.

EXAMPLE: OOM happens on Executor 1 and Executor 3 is created in recovery mode

[Screenshot: 2026-02-28 at 06:38:39]
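The pod-spec override described above can be sketched as follows. This is an illustrative Python sketch, not Spark's actual Scala implementation; the dictionary shape and the SPARK_EXECUTOR_CORES key are simplifying assumptions standing in for the real executor pod builder.

```python
# Illustrative sketch (NOT the actual Spark code): a recovery-mode executor
# keeps the same pod resources (memory, cores) as the failed executor, but
# the core count reported to the executor JVM is lowered to spark.task.cpus
# so the JVM schedules only one task at a time.

def recovery_spec(base_spec: dict, task_cpus: int) -> dict:
    """Return a copy of base_spec with the executor-cores env var overridden."""
    spec = dict(base_spec)                        # identical memory/core requests
    env = dict(spec.get("env", {}))
    env["SPARK_EXECUTOR_CORES"] = str(task_cpus)  # a single task slot per JVM
    spec["env"] = env
    return spec

base = {"memory": "4g", "cores": 4, "env": {"SPARK_EXECUTOR_CORES": "4"}}
rec = recovery_spec(base, task_cpus=1)
print(rec["memory"], rec["cores"], rec["env"]["SPARK_EXECUTOR_CORES"])  # 4g 4 1
```

Note that the pod still requests 4 cores and 4g of memory; only the JVM-visible task parallelism drops, so the single surviving task gets the whole heap to itself.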

Does this PR introduce any user-facing change?

  • There is no behavior change for a healthy job with no executor loss.
  • There is no behavior change for jobs with non-OOM executor loss.
  • For a job failing due to OOM, this PR only tries to mitigate the situation by helping the job finish.
    • If the OOM happens even on the recovery-mode executor (a single task per executor JVM), the job will fail again.
    • If the recovery-mode executor can handle the skewed tasks, the job will succeed.
  • For a faulty job with one or two OOM-driven executor losses, this PR only affects the newly created pods (one or two, as assumed). So, technically, the job will succeed in a very similar way because most of the executors are unchanged.
  • spark.kubernetes.allocation.recoveryMode.enabled=false will disable the whole feature.
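The driver-side gating described in these bullets can be sketched as below. This is a hypothetical Python sketch, not Spark's Scala code; the exit codes treated as OOM and the default value of the config are assumptions for illustration only.

```python
# Hypothetical sketch of the driver-side decision: request a recovery-mode
# executor only when the feature is enabled and the executor loss looks like
# an OOM. The exit codes below are assumptions (JVM OOM / OOMKilled-style),
# not Spark's actual classification logic.

OOM_EXIT_CODES = {52, 137}  # assumed: JVM OnOutOfMemoryError kill / OOMKilled

def should_use_recovery_mode(conf: dict, exit_code: int) -> bool:
    key = "spark.kubernetes.allocation.recoveryMode.enabled"
    enabled = conf.get(key, "true") == "true"     # assumed default: enabled
    return enabled and exit_code in OOM_EXIT_CODES

print(should_use_recovery_mode({}, 137))  # True: OOM loss, feature enabled
disabled = {"spark.kubernetes.allocation.recoveryMode.enabled": "false"}
print(should_use_recovery_mode(disabled, 137))  # False: feature disabled
print(should_use_recovery_mode({}, 1))          # False: non-OOM loss
```

This matches the bullets above: non-OOM losses and explicitly disabled configurations behave exactly as before.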

How was this patch tested?

Pass the CIs with newly added test cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Gemini 3.1 Pro (High) on Antigravity

@dongjoon-hyun
Member Author

Could you review this simplified version of Recovery-mode K8s Executors, @peter-toth ?

@peter-toth
Contributor

This feature makes sense to me but I have 2 questions:

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

  • What happens if spark.task.cpus is set to >1?

@dongjoon-hyun
Member Author

Thank you for review, @peter-toth .

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

Here, a successful recovery should be considered as job completion. If OOM kills executors one by one consecutively due to retries, the job eventually fails without moving to the next stage. And, if we consider only a single stage, yes: the set of executors will not change further if there is no executor loss.

  • What happens if spark.task.cpus is set to >1?

Yes, that's a valid corner case. Let me disable this feature for that configuration.

@dongjoon-hyun
Member Author

I addressed your comment and revised the PR title, @peter-toth .

@peter-toth
Contributor

peter-toth commented Feb 28, 2026

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

Here, a successful recovery should be considered as job completion. If OOM kills executors one by one consecutively due to retries, the job eventually fails without moving to the next stage. And, if we consider only a single stage, yes: the set of executors will not change further if there is no executor loss.

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

  • What happens if spark.task.cpus is set to >1?

Yes, that's a valid corner case. Let me disable this feature for that configuration.

Would it make sense to set the cores to spark.task.cpus in recovery mode and not disable the feature if it is >1?

@dongjoon-hyun
Member Author

Would it make sense to set the cores to spark.task.cpus in recovery mode and not disable the feature if it is >1?

Of course, that sounds better to me because it's the theoretical minimum. We simply need to revise the PR description according to this change.

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

I understand your fail-fast requirement. Technically, you want to give users the option to disable the whole feature, right?

@peter-toth
Contributor

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

I understand your fail-fast requirement. Technically, you want to give users the option to disable the whole feature, right?

I believe the new spark.kubernetes.allocation.recoveryMode.enabled config is a good way to enable/disable the feature, but its description, If true, enables the recovery mode during executor allocation., might need a better explanation of what it exactly means; and maybe some additional notes that the recovery-mode 1 core executor can stick around for a while and might affect the remaining stages of a job.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 28, 2026

The PR is updated.

  • Setting spark.kubernetes.allocation.recoveryMode.enabled=false explicitly will disable the feature completely.
  • Use kubernetesConf.get("spark.task.cpus") instead of 1.
  • Revise the config description.
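The second bullet above (spark.task.cpus instead of a hard-coded 1) can be sketched like this. Again, this is an illustrative Python sketch of the behaviour, not the actual Scala change; the flat config dictionary stands in for Spark's KubernetesConf.

```python
# Sketch of the revised behaviour: instead of hard-coding 1 core, the
# recovery-mode executor uses spark.task.cpus, the theoretical minimum
# number of cores needed to run a single task.

def recovery_cores(conf: dict) -> int:
    """Cores for a recovery-mode executor: spark.task.cpus (default 1)."""
    return int(conf.get("spark.task.cpus", "1"))

print(recovery_cores({}))                        # 1 (default spark.task.cpus)
print(recovery_cores({"spark.task.cpus": "2"}))  # 2, instead of disabling
```

With this change, jobs with spark.task.cpus > 1 still benefit from the feature rather than having it silently disabled.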

Please note that SPARK-55553 (Document heterogeneous K8s executors) is planned to revisit all config tables and create a new document for this whole feature, @peter-toth .

@dongjoon-hyun
Member Author

Thank you, @peter-toth . I'll test more before merging this.

@dongjoon-hyun
Member Author

Updated the screenshot. BTW, I observed a funny situation where the recovery-mode executor is much faster than the normal executor in the following case. It's a little counter-intuitive, but it might be due to contention overhead in this simple SparkPi case. 😄

#!/bin/bash
K8S_MASTER=$(kubectl config view --minify | yq '.clusters[0].cluster.server')

bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  -c spark.executor.cores=4 \
  -c spark.executor.memory=4g \
  -c spark.kubernetes.container.image=apache/spark:SPARK-55639 \
  -c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  -c spark.kubernetes.driver.pod.name=pi \
  -c spark.kubernetes.executor.podNamePrefix=pi \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar 300000

@dongjoon-hyun
Member Author

Merged to master.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-55639-2 branch February 28, 2026 14:46