
[SPARK-55639][K8S] Support Recovery-mode K8s Executors#54558

Closed
dongjoon-hyun wants to merge 5 commits into apache:master from dongjoon-hyun:SPARK-55639-2

Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 28, 2026

What changes were proposed in this pull request?

This PR aims to support Recovery-mode K8s Executors.

Why are the changes needed?

Frequently, Spark jobs have skewed data and correspondingly skewed tasks. In the worst case, a single resource-hungry task becomes a serial killer of executors. Homogeneous executors are unable to handle this kind of situation.

So, this PR introduces a recovery-mode executor, created when the Spark driver detects an OOM situation from failed executors. A recovery-mode executor aims to serve only a single task. In other words, after an OOM, a new executor is created with the same pod spec (memory and cores), except that the ENV_EXECUTOR_CORES environment variable is set to spark.task.cpus so that only a single task runs per executor JVM.

EXAMPLE: OOM happens on Executor 1 and Executor 3 is created in recovery mode

[Screenshot: 2026-02-28 at 06:38:39]
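The pod-spec override described above can be sketched as follows. This is an illustrative Python sketch, not Spark's actual Scala implementation; the dictionary shape and the SPARK_EXECUTOR_CORES key are simplifying assumptions standing in for the real executor pod builder.

```python
# Illustrative sketch (NOT the actual Spark code): a recovery-mode executor
# keeps the same pod resources (memory, cores) as the failed executor, but
# the core count reported to the executor JVM is lowered to spark.task.cpus
# so the JVM schedules only one task at a time.

def recovery_spec(base_spec: dict, task_cpus: int) -> dict:
    """Return a copy of base_spec with the executor-cores env var overridden."""
    spec = dict(base_spec)                        # identical memory/core requests
    env = dict(spec.get("env", {}))
    env["SPARK_EXECUTOR_CORES"] = str(task_cpus)  # a single task slot per JVM
    spec["env"] = env
    return spec

base = {"memory": "4g", "cores": 4, "env": {"SPARK_EXECUTOR_CORES": "4"}}
rec = recovery_spec(base, task_cpus=1)
print(rec["memory"], rec["cores"], rec["env"]["SPARK_EXECUTOR_CORES"])  # 4g 4 1
```

Note that the pod still requests 4 cores and 4g of memory; only the JVM-visible task parallelism drops, so the single surviving task gets the whole heap to itself.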

Does this PR introduce any user-facing change?

  • There is no behavior change for a healthy job with no executor loss.
  • There is no behavior change for jobs with non-OOM executor loss.
  • For a job failing due to OOM, this PR only tries to mitigate the situation by helping the job finish.
    • If the OOM happens even on the recovery-mode executor (a single task per executor JVM), the job will fail again.
    • If the recovery-mode executor can handle the skewed tasks, the job will succeed.
  • For a faulty job with one or two OOM-driven executor losses, this PR only affects the newly created pods (one or two, as assumed). So, technically, the job will succeed in a very similar way because most of the executors are unchanged.
  • spark.kubernetes.allocation.recoveryMode.enabled=false will disable the whole feature.
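The driver-side gating described in these bullets can be sketched as below. This is a hypothetical Python sketch, not Spark's Scala code; the exit codes treated as OOM and the default value of the config are assumptions for illustration only.

```python
# Hypothetical sketch of the driver-side decision: request a recovery-mode
# executor only when the feature is enabled and the executor loss looks like
# an OOM. The exit codes below are assumptions (JVM OOM / OOMKilled-style),
# not Spark's actual classification logic.

OOM_EXIT_CODES = {52, 137}  # assumed: JVM OnOutOfMemoryError kill / OOMKilled

def should_use_recovery_mode(conf: dict, exit_code: int) -> bool:
    key = "spark.kubernetes.allocation.recoveryMode.enabled"
    enabled = conf.get(key, "true") == "true"     # assumed default: enabled
    return enabled and exit_code in OOM_EXIT_CODES

print(should_use_recovery_mode({}, 137))  # True: OOM loss, feature enabled
disabled = {"spark.kubernetes.allocation.recoveryMode.enabled": "false"}
print(should_use_recovery_mode(disabled, 137))  # False: feature disabled
print(should_use_recovery_mode({}, 1))          # False: non-OOM loss
```

This matches the bullets above: non-OOM losses and explicitly disabled configurations behave exactly as before.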

How was this patch tested?

Pass the CIs with newly added test cases.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Gemini 3.1 Pro (High) on Antigravity

@dongjoon-hyun
Member Author

Could you review this simplified version of Recovery-mode K8s Executors, @peter-toth ?

@peter-toth
Contributor

This feature makes sense to me but I have 2 questions:

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

  • What happens if spark.task.cpus is set to >1?

@dongjoon-hyun
Member Author

Thank you for review, @peter-toth .

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

Here, a successful recovery should be considered as job completion. If OOM kills executors one by one consecutively due to retries, the job eventually fails without moving to the next stage. And, if we consider only a single stage, yes: the set of executors will not change further if there is no executor loss.

  • What happens if spark.task.cpus is set to >1?

Yes, that's a valid corner case. Let me disable this feature for that configuration.

@dongjoon-hyun
Member Author

I addressed your comment and revised the PR title, @peter-toth .

@peter-toth
Contributor

peter-toth commented Feb 28, 2026

  • What happens after a successful recovery? Will the remaining stages/tasks use the 1 core executor?

Here, a successful recovery should be considered as job completion. If OOM kills executors one by one consecutively due to retries, the job eventually fails without moving to the next stage. And, if we consider only a single stage, yes: the set of executors will not change further if there is no executor loss.

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

  • What happens if spark.task.cpus is set to >1?

Yes, that's a valid corner case. Let me disable this feature for that configuration.

Would it make sense to set the cores to spark.task.cpus in recovery mode and not disable the feature if it is >1?

@dongjoon-hyun
Member Author

Would it make sense to set the cores to spark.task.cpus in recovery mode and not disable the feature if it is >1?

Of course, that sounds better to me because it's the theoretical minimum. We simply need to revise the PR description according to this change.

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

I understand your fail-fast requirement. Technically, you want to give users the option to disable the whole feature, right?

@peter-toth
Contributor

Yes, I agree that recovering from an OOM is a huge win.
My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1 core executor? If yes, then I still consider the PR a nice improvement, but we probably need to call out this behaviour in our documentation or in the config description so that users can decide whether they want their jobs to fail fast or complete with possibly increased runtime.

I understand your fail-fast requirement. Technically, you want to give users the option to disable the whole feature, right?

I believe the new spark.kubernetes.allocation.recoveryMode.enabled config is a good way to enable/disable the feature, but its description, If true, enables the recovery mode during executor allocation., might need a better explanation of what it exactly means; and maybe some additional notes that the recovery-mode 1 core executor can stick around for a while and might affect the remaining stages of a job.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Feb 28, 2026

The PR is updated.

  • Setting spark.kubernetes.allocation.recoveryMode.enabled=false explicitly will disable the feature completely.
  • Use kubernetesConf.get("spark.task.cpus") instead of 1.
  • Revise the config description.
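The second bullet above (spark.task.cpus instead of a hard-coded 1) can be sketched like this. Again, this is an illustrative Python sketch of the behaviour, not the actual Scala change; the flat config dictionary stands in for Spark's KubernetesConf.

```python
# Sketch of the revised behaviour: instead of hard-coding 1 core, the
# recovery-mode executor uses spark.task.cpus, the theoretical minimum
# number of cores needed to run a single task.

def recovery_cores(conf: dict) -> int:
    """Cores for a recovery-mode executor: spark.task.cpus (default 1)."""
    return int(conf.get("spark.task.cpus", "1"))

print(recovery_cores({}))                        # 1 (default spark.task.cpus)
print(recovery_cores({"spark.task.cpus": "2"}))  # 2, instead of disabling
```

With this change, jobs with spark.task.cpus > 1 still benefit from the feature rather than having it silently disabled.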

Please note that SPARK-55553 (Document heterogeneous K8s executors) is planned to revisit all config tables and create a new document for this whole feature, @peter-toth .

@dongjoon-hyun
Member Author

Thank you, @peter-toth . I'll test more before merging this.

@dongjoon-hyun
Member Author

Updated the screenshot. BTW, I observed a funny situation where the recovery-mode executor is much faster than the normal executor in the following case. It's a little counter-intuitive, but it might be due to contention overhead in this simple SparkPi case. 😄

#!/bin/bash
K8S_MASTER=$(kubectl config view --minify | yq '.clusters[0].cluster.server')

bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  -c spark.executor.cores=4 \
  -c spark.executor.memory=4g \
  -c spark.kubernetes.container.image=apache/spark:SPARK-55639 \
  -c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  -c spark.kubernetes.driver.pod.name=pi \
  -c spark.kubernetes.executor.podNamePrefix=pi \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar 300000

@dongjoon-hyun
Member Author

Merged to master.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-55639-2 branch February 28, 2026 14:46