Fix KubernetesExecutor scheduler crash when pod_override is queued in-cluster by vatsrahul1001 · Pull Request #68819 · apache/airflow

vatsrahul1001 · 2026-06-22T04:26:18Z

Problem

When the scheduler runs in-cluster, the kubernetes Python client attaches an in-cluster Configuration to every V1Pod, and that Configuration.refresh_api_key_hook is a local closure (InClusterConfigLoader._set_config.<locals>._refresh_api_key). pickle cannot serialize a local closure.

KubernetesExecutor puts each task's KubernetesJob (which embeds the task's pod_override V1Pod) onto a multiprocessing.Manager queue, whose .put() pickles synchronously in the scheduler. So any in-cluster deployment where a task sets a V1Pod pod_override crashes the scheduler in a loop:

_pickle.PicklingError: Can't pickle local object
  <function InClusterConfigLoader._set_config.<locals>._refresh_api_key at ...>
when serializing kubernetes.client.configuration.Configuration ...
when serializing dict item 'pod_override'
when serializing ... ExecuteTask ...
  File ".../executors/kubernetes_executor.py", line 230, in execute_async
    self.task_queue.put(KubernetesJob(key, command, kube_executor_config, pod_template_file))

Fix

Serialize the pod_override to a plain dict (dropping the Configuration) before it is queued, using the existing PodGenerator.serialize_pod, and rebuild the V1Pod worker-side in run_next via PodGenerator.deserialize_model_dict. The worker already reconstructs its own kube client, so nothing is lost. The worker uses kube_executor_config for the pod override; the same V1Pod is also referenced by the workload's executor_config, so that copy is sanitized too (it is otherwise pickled as part of the workload).

This keeps pod_override working regardless of the kubernetes client version, rather than depending on a client version whose Configuration happens to be picklable. The worker side already sanitizes pods with sanitize_for_serialization (kubernetes_executor_utils.py:492) — this closes the same gap on the scheduler→queue boundary.

Tests

New regression test test_execute_async_queues_picklable_pod_override: builds a pod_override whose Configuration carries an unpicklable local-closure hook, asserts the raw pod is unpicklable (reproduces the crash), then asserts the queued KubernetesJob pickles cleanly and the serialized dict round-trips back to an equivalent V1Pod.
Updated test_pod_template_file_override_in_executor_config to reflect the queued kube_executor_config now being a dict; its existing run_next round-trip assertion (the reconstructed pod) is unchanged and still passes.
Full test_kubernetes_executor.py suite passes (111 passed, 1 skipped AF<3.2-only).

Was generative AI tooling used to co-author this PR?

Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

vatsrahul1001 · 2026-06-22T08:30:14Z

Thanks @amoghrajesh — agreed on the "wrong layer" point, switched away from serialize/deserialize. I went with your in-place approach but with one correction.

Nulling local_vars_configuration breaks worker-side reconcile. When I tested local_vars_configuration = None on a real in-cluster deployment, run_next → PodGenerator.construct_pod → reconcile_specs failed:

File ".../kubernetes/client/models/v1_pod_spec.py", line 349, in containers
    if self.local_vars_configuration.client_side_validation and containers is None:
AttributeError: 'NoneType' object has no attribute 'client_side_validation'

The model setters hit during reconcile_pods dereference self.local_vars_configuration.client_side_validation, so None raises and the task fails with PodReconciliationError.

Fix: reset it to a fresh Configuration() instead of None — exactly what v35 model constructors produced. That's picklable (no in-cluster refresh_api_key_hook) and keeps client_side_validation, so reconcile works. Implemented as _reset_local_vars_configuration (recursive), called on kube_executor_config before task_queue.put. As you noted, from_obj returns the same object referenced by the workload's executor_config, so the single call covers both; the pod keeps its V1Pod type through the queue, so run_next is unchanged.

Validated on the deployment (no crash, task runs/finishes) and added a regression test that asserts both picklability and that run_next/reconcile_pods succeeds (the assertion the None approach fails). Full executor suite green.

Root cause is the v36 model-constructor change (Configuration() → Configuration.get_default_copy(), kubernetes-client/python#2532 / OpenAPITools/openapi-generator#8500); tracked in #68827.

Drafted-by: Claude Code (Opus 4.8); reviewed by @vatsrahul1001 before posting

amoghrajesh

Some nits and comments, otherwise LGTM.

…-cluster When the scheduler runs in-cluster, kubernetes-python-client v36 changed model constructors to use Configuration.get_default_copy() instead of Configuration() (kubernetes-client/python#2532, OpenAPI Generator v6.6.0). So every V1Pod built after load_incluster_config() captures the global in-cluster Configuration, whose refresh_api_key_hook is a local closure (InClusterConfigLoader._set_config.<locals>._refresh_api_key). pickle cannot serialize a local closure, so putting a task's pod_override V1Pod on the executor's multiprocessing queue raises PicklingError and crashes the scheduler in a loop. This affects any in-cluster KubernetesExecutor deployment where a task sets a V1Pod pod_override, independent of the Airflow version; pinning the client below 36 is not viable because 35.x has a separate no_proxy regression. Reset local_vars_configuration to a fresh Configuration() on the pod_override (recursively) before queuing -- exactly what v35 model constructors produced. It carries no in-cluster auth hook so the pod is picklable, while keeping client_side_validation so the worker-side reconcile_pods setters still work. (Setting it to None instead breaks reconcile: model setters dereference self.local_vars_configuration.client_side_validation.) The pod keeps its V1Pod type through the queue, so run_next is unchanged.

vatsrahul1001 · 2026-06-22T08:48:55Z

Thanks for the review @amoghrajesh! Addressed the nits:

Trimmed the docstring to your suggested wording (kept one line on why fresh Configuration() over None — it preserves client_side_validation so worker-side reconcile doesn't break).
Removed the redundant call-site comment.
Moved the test imports (pickle, Configuration, _reset_local_vars_configuration) to top level.

Drafted-by: Claude Code (Opus 4.8); reviewed by @vatsrahul1001 before posting

Lee-W · 2026-06-22T09:26:14Z

    )


+def _reset_local_vars_configuration(obj: Any) -> None:


Suggested change

def _reset_local_vars_configuration(obj: Any) -> None:

def _reset_local_vars_configuration(node: Any) -> None:

IMO, this is more readable. I don't quite understand what obj is in the original context.

Also, should we type it? looks like we know the potential type.

Lee-W · 2026-06-22T09:28:59Z

+            _reset_local_vars_configuration(v)
+    elif isinstance(obj, (list, tuple)):
+        for item in obj:
+            _reset_local_vars_configuration(item)


What should happen if we go into the else block? Is it expected to do nothing? or not expcted to happen

Lee-W · 2026-06-22T09:29:50Z

+            ),
+        )
+        self._install_unpicklable_incluster_hook(pod)
+        with pytest.raises((pickle.PicklingError, AttributeError, TypeError)):


Would like to check when will AttributeError, TypeError happen in different versions?

vatsrahul1001 · 2026-06-22T11:05:26Z

Closing in favour of #68831

ephraimbuddy

I've opened #68831, which fixes the same crash one layer up — at the airflow-core deserialize path where that Configuration actually gets attached. The pod_override the scheduler pickles onto the queue is produced by BaseSerialization.deserialize / ensure_pod_is_valid_after_unpickling (via ExecutorConfigType), not by the executor itself; deserializing there through a fresh Configuration() keeps the pod (and every nested model) picklable for every consumer of a deserialized V1Pod, not just the KubernetesExecutor queue. It also gives PodGenerator.deserialize_model_dict the same treatment, so the provider's worker-side reconstruction is covered too.

Since that resolves the crash at the source and makes the executor-level reset redundant, I think we can consolidate on #68831 and close this one. Let me know if that's reasonable?

ephraimbuddy · 2026-06-22T11:08:10Z

Ah, thanks for closing

vatsrahul1001 requested review from hussein-awala, jedcunningham and jscheffl as code owners June 22, 2026 04:26

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jun 22, 2026

phanikumv requested a review from ephraimbuddy June 22, 2026 06:39

vatsrahul1001 mentioned this pull request Jun 22, 2026

KubernetesExecutor scheduler crashes with PicklingError on pod_override with kubernetes client 36.x #68827

Open

2 tasks

amoghrajesh reviewed Jun 22, 2026

View reviewed changes

Comment thread ...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py Outdated

Comment thread ...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py Outdated

vatsrahul1001 force-pushed the fix-k8s-executor-pod-override-pickle branch from fb24244 to 42916df Compare June 22, 2026 08:30

amoghrajesh approved these changes Jun 22, 2026

View reviewed changes

vatsrahul1001 force-pushed the fix-k8s-executor-pod-override-pickle branch from 42916df to 7cf53dc Compare June 22, 2026 08:48

Lee-W reviewed Jun 22, 2026

View reviewed changes

vatsrahul1001 closed this Jun 22, 2026

ephraimbuddy reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix KubernetesExecutor scheduler crash when pod_override is queued in-cluster#68819

Fix KubernetesExecutor scheduler crash when pod_override is queued in-cluster#68819
vatsrahul1001 wants to merge 1 commit into
mainfrom
fix-k8s-executor-pod-override-pickle

vatsrahul1001 commented Jun 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

vatsrahul1001 commented Jun 22, 2026

Uh oh!

amoghrajesh left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vatsrahul1001 commented Jun 22, 2026 •

edited

Loading

Uh oh!

Lee-W Jun 22, 2026 •

edited

Loading

Uh oh!

Lee-W Jun 22, 2026

Uh oh!

Lee-W Jun 22, 2026

Uh oh!

vatsrahul1001 commented Jun 22, 2026

Uh oh!

ephraimbuddy left a comment

Uh oh!

ephraimbuddy commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	def _reset_local_vars_configuration(obj: Any) -> None:
	def _reset_local_vars_configuration(node: Any) -> None:

Conversation

vatsrahul1001 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Was generative AI tooling used to co-author this PR?

Uh oh!

Uh oh!

Uh oh!

vatsrahul1001 commented Jun 22, 2026

Uh oh!

amoghrajesh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vatsrahul1001 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lee-W Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lee-W Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Lee-W Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

vatsrahul1001 commented Jun 22, 2026

Uh oh!

ephraimbuddy left a comment

Choose a reason for hiding this comment

Uh oh!

ephraimbuddy commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vatsrahul1001 commented Jun 22, 2026 •

edited

Loading

vatsrahul1001 commented Jun 22, 2026 •

edited

Loading

Lee-W Jun 22, 2026 •

edited

Loading