
Fix KubernetesExecutor scheduler crash when pod_override is queued in-cluster #68819

Closed
vatsrahul1001 wants to merge 1 commit into main from fix-k8s-executor-pod-override-pickle

@vatsrahul1001 vatsrahul1001 commented Jun 22, 2026

Contributor

Problem

When the scheduler runs in-cluster, the kubernetes Python client attaches an in-cluster Configuration to every V1Pod, and that Configuration.refresh_api_key_hook is a local closure (InClusterConfigLoader._set_config.<locals>._refresh_api_key). pickle cannot serialize a local closure.

KubernetesExecutor puts each task's KubernetesJob (which embeds the task's pod_override V1Pod) onto a multiprocessing.Manager queue, whose .put() pickles synchronously in the scheduler. So any in-cluster deployment where a task sets a V1Pod pod_override crashes the scheduler in a loop:

_pickle.PicklingError: Can't pickle local object
  <function InClusterConfigLoader._set_config.<locals>._refresh_api_key at ...>
when serializing kubernetes.client.configuration.Configuration ...
when serializing dict item 'pod_override'
when serializing ... ExecuteTask ...
  File ".../executors/kubernetes_executor.py", line 230, in execute_async
    self.task_queue.put(KubernetesJob(key, command, kube_executor_config, pod_template_file))
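
The failure mode can be reproduced without a cluster. The sketch below is a minimal illustration, not the kubernetes client's code: `stand_in_incluster_configuration` and `is_picklable` are hypothetical names, and `types.SimpleNamespace` stands in for both `Configuration` and `V1Pod`. The exception type raised for a locally defined function can vary between Python versions, so the helper catches several.

```python
import pickle
from types import SimpleNamespace


def stand_in_incluster_configuration():
    """Hypothetical stand-in for a Configuration set up by an in-cluster loader."""
    config = SimpleNamespace(client_side_validation=True)

    def _refresh_api_key(cfg):
        # Locally defined function: pickle cannot look it up by an importable name.
        return None

    config.refresh_api_key_hook = _refresh_api_key
    return config


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A stand-in pod that carries the in-cluster configuration, as a V1Pod would.
pod = SimpleNamespace(
    local_vars_configuration=stand_in_incluster_configuration(),
    name="override",
)

# The same pod data without the attached configuration.
plain = {"name": pod.name}

print(is_picklable(pod), is_picklable(plain))
```

Only the attached hook makes the stand-in pod unpicklable; the pod data itself pickles fine, which is why the fix targets the configuration rather than the pod.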

Fix

Serialize the pod_override to a plain dict (dropping the Configuration) before it is queued, using the existing PodGenerator.serialize_pod, and rebuild the V1Pod worker-side in run_next via PodGenerator.deserialize_model_dict. The worker already reconstructs its own kube client, so nothing is lost. The worker uses kube_executor_config for the pod override; the same V1Pod is also referenced by the workload's executor_config, so that copy is sanitized too (it is otherwise pickled as part of the workload).

This keeps pod_override working regardless of the kubernetes client version, rather than depending on a client version whose Configuration happens to be picklable. The worker side already sanitizes pods with sanitize_for_serialization (kubernetes_executor_utils.py:492) — this closes the same gap on the scheduler→queue boundary.
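
The serialize-then-rebuild idea described above can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of `PodGenerator.serialize_pod` or `PodGenerator.deserialize_model_dict`: `pod_to_plain_dict`, `pod_from_plain_dict`, and `through_queue` are hypothetical helpers, and `types.SimpleNamespace` stands in for the kubernetes models and for `Configuration`.

```python
import pickle
from types import SimpleNamespace


def pod_to_plain_dict(pod):
    """Copy only plain data out of the stand-in pod; the configuration is left behind."""
    return {
        "name": pod.name,
        "labels": dict(pod.labels),
        "containers": [{"name": c.name, "image": c.image} for c in pod.containers],
    }


def pod_from_plain_dict(data, configuration):
    """Rebuild a stand-in pod on the worker side with the worker's own configuration."""
    return SimpleNamespace(
        local_vars_configuration=configuration,
        name=data["name"],
        labels=dict(data["labels"]),
        containers=[
            SimpleNamespace(local_vars_configuration=configuration, **c)
            for c in data["containers"]
        ],
    )


def through_queue(payload):
    """Simulate the queue boundary: pickle on put, unpickle on get."""
    return pickle.loads(pickle.dumps(payload))


# Scheduler side: the pod carries a configuration with an unpicklable hook.
scheduler_config = SimpleNamespace(refresh_api_key_hook=lambda cfg: None)
pod = SimpleNamespace(
    local_vars_configuration=scheduler_config,
    name="override",
    labels={"team": "data"},
    containers=[
        SimpleNamespace(
            local_vars_configuration=scheduler_config,
            name="base",
            image="example:latest",
        )
    ],
)

# Worker side: rebuild from the plain dict with a configuration of its own.
worker_config = SimpleNamespace(client_side_validation=True)
rebuilt = pod_from_plain_dict(through_queue(pod_to_plain_dict(pod)), worker_config)
```

The trade-off is that the pod changes type while it sits on the queue, so every consumer of the queued value has to know it is a dict.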

Tests

  • New regression test test_execute_async_queues_picklable_pod_override: builds a pod_override whose Configuration carries an unpicklable local-closure hook, asserts the raw pod is unpicklable (reproduces the crash), then asserts the queued KubernetesJob pickles cleanly and the serialized dict round-trips back to an equivalent V1Pod.
  • Updated test_pod_template_file_override_in_executor_config to reflect the queued kube_executor_config now being a dict; its existing run_next round-trip assertion (the reconstructed pod) is unchanged and still passes.
  • Full test_kubernetes_executor.py suite passes (111 passed, 1 skipped AF<3.2-only).

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

@vatsrahul1001 vatsrahul1001 force-pushed the fix-k8s-executor-pod-override-pickle branch from fb24244 to 42916df Compare June 22, 2026 08:30
@vatsrahul1001

Contributor Author

Thanks @amoghrajesh — agreed on the "wrong layer" point, switched away from serialize/deserialize. I went with your in-place approach but with one correction.

Nulling local_vars_configuration breaks worker-side reconcile. When I tested local_vars_configuration = None on a real in-cluster deployment, run_next → PodGenerator.construct_pod → reconcile_specs failed:

File ".../kubernetes/client/models/v1_pod_spec.py", line 349, in containers
    if self.local_vars_configuration.client_side_validation and containers is None:
AttributeError: 'NoneType' object has no attribute 'client_side_validation'

The model setters hit during reconcile_pods dereference self.local_vars_configuration.client_side_validation, so None raises and the task fails with PodReconciliationError.

Fix: reset it to a fresh Configuration() instead of None — exactly what v35 model constructors produced. That's picklable (no in-cluster refresh_api_key_hook) and keeps client_side_validation, so reconcile works. Implemented as _reset_local_vars_configuration (recursive), called on kube_executor_config before task_queue.put. As you noted, from_obj returns the same object referenced by the workload's executor_config, so the single call covers both; the pod keeps its V1Pod type through the queue, so run_next is unchanged.
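
A rough sketch of the recursive reset, and of why None is not a safe replacement, might look like this. It is not the PR's `_reset_local_vars_configuration` (its full body is not shown in this thread): every name below is hypothetical, `types.SimpleNamespace` stands in for the kubernetes models and `Configuration`, and treating unknown values as no-op leaves is an assumption.

```python
import pickle
from types import SimpleNamespace


def fresh_configuration():
    """Hypothetical stand-in for a default Configuration(): no auth hook attached."""
    return SimpleNamespace(client_side_validation=True, refresh_api_key_hook=None)


def incluster_configuration():
    """Hypothetical stand-in for a configuration carrying a locally defined hook."""
    config = fresh_configuration()

    def _refresh_api_key(cfg):
        return None

    config.refresh_api_key_hook = _refresh_api_key
    return config


def reset_local_vars_configuration(node):
    """Recursively swap every attached configuration for a fresh one, in place."""
    if hasattr(node, "local_vars_configuration"):
        node.local_vars_configuration = fresh_configuration()
        for value in vars(node).values():
            reset_local_vars_configuration(value)
    elif isinstance(node, dict):
        for value in node.values():
            reset_local_vars_configuration(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            reset_local_vars_configuration(item)
    # Any other value (str, int, None, ...) is a leaf and is left untouched.


def set_containers(spec, containers):
    """Mimics a generated model setter that consults client_side_validation."""
    if spec.local_vars_configuration.client_side_validation and containers is None:
        raise ValueError("containers must not be None")
    spec.containers = containers


shared = incluster_configuration()
container = SimpleNamespace(local_vars_configuration=shared, name="base")
spec = SimpleNamespace(local_vars_configuration=shared, containers=[container])
pod = SimpleNamespace(local_vars_configuration=shared, spec=spec)

reset_local_vars_configuration(pod)
pickle.dumps(pod)  # no longer raises: no local hook is reachable from the pod
set_containers(pod.spec, [container])  # the setter still finds client_side_validation
```

With `local_vars_configuration = None` the same `set_containers` call would raise AttributeError, which mirrors the reconcile failure in the traceback above.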

Validated on the deployment (no crash, task runs/finishes) and added a regression test that asserts both picklability and that run_next/reconcile_pods succeeds (the assertion the None approach fails). Full executor suite green.

Root cause is the v36 model-constructor change (Configuration() → Configuration.get_default_copy(), kubernetes-client/python#2532 / OpenAPITools/openapi-generator#8500); tracked in #68827.


Drafted-by: Claude Code (Opus 4.8); reviewed by @vatsrahul1001 before posting

@amoghrajesh amoghrajesh left a comment

Contributor

Some nits and comments, otherwise LGTM.

…-cluster

When the scheduler runs in-cluster, kubernetes-python-client v36 changed model
constructors to use Configuration.get_default_copy() instead of Configuration()
(kubernetes-client/python#2532, OpenAPI Generator v6.6.0). So every V1Pod built
after load_incluster_config() captures the global in-cluster Configuration,
whose refresh_api_key_hook is a local closure
(InClusterConfigLoader._set_config.<locals>._refresh_api_key). pickle cannot
serialize a local closure, so putting a task's pod_override V1Pod on the
executor's multiprocessing queue raises PicklingError and crashes the scheduler
in a loop. This affects any in-cluster KubernetesExecutor deployment where a task
sets a V1Pod pod_override, independent of the Airflow version; pinning the client
below 36 is not viable because 35.x has a separate no_proxy regression.

Reset local_vars_configuration to a fresh Configuration() on the pod_override
(recursively) before queuing -- exactly what v35 model constructors produced. It
carries no in-cluster auth hook so the pod is picklable, while keeping
client_side_validation so the worker-side reconcile_pods setters still work.
(Setting it to None instead breaks reconcile: model setters dereference
self.local_vars_configuration.client_side_validation.) The pod keeps its V1Pod
type through the queue, so run_next is unchanged.
@vatsrahul1001 vatsrahul1001 force-pushed the fix-k8s-executor-pod-override-pickle branch from 42916df to 7cf53dc Compare June 22, 2026 08:48
@vatsrahul1001

vatsrahul1001 commented Jun 22, 2026

Contributor Author

Thanks for the review @amoghrajesh! Addressed the nits:

  • Trimmed the docstring to your suggested wording (kept one line on why fresh Configuration() over None — it preserves client_side_validation so worker-side reconcile doesn't break).
  • Removed the redundant call-site comment.
  • Moved the test imports (pickle, Configuration, _reset_local_vars_configuration) to top level.

Drafted-by: Claude Code (Opus 4.8); reviewed by @vatsrahul1001 before posting

)


def _reset_local_vars_configuration(obj: Any) -> None:

@Lee-W Lee-W Jun 22, 2026

Member

Suggested change
- def _reset_local_vars_configuration(obj: Any) -> None:
+ def _reset_local_vars_configuration(node: Any) -> None:

IMO, this is more readable. I don't quite understand what obj is in the original context.

Also, should we type it? It looks like we know the potential types.

            _reset_local_vars_configuration(v)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            _reset_local_vars_configuration(item)

Member

What should happen if we go into the else block? Is it expected to do nothing, or is it not expected to happen?

),
)
self._install_unpicklable_incluster_hook(pod)
with pytest.raises((pickle.PicklingError, AttributeError, TypeError)):

Member

I'd like to check: when would AttributeError or TypeError be raised, and in which versions?

@vatsrahul1001

Contributor Author

Closing in favour of #68831

@ephraimbuddy ephraimbuddy left a comment

Contributor

I've opened #68831, which fixes the same crash one layer up — at the airflow-core deserialize path where that Configuration actually gets attached. The pod_override the scheduler pickles onto the queue is produced by BaseSerialization.deserialize / ensure_pod_is_valid_after_unpickling (via ExecutorConfigType), not by the executor itself; deserializing there through a fresh Configuration() keeps the pod (and every nested model) picklable for every consumer of a deserialized V1Pod, not just the KubernetesExecutor queue. It also gives PodGenerator.deserialize_model_dict the same treatment, so the provider's worker-side reconstruction is covered too.

Since that resolves the crash at the source and makes the executor-level reset redundant, I think we can consolidate on #68831 and close this one. Let me know if that's reasonable?

@ephraimbuddy

Contributor

Ah, thanks for closing

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

4 participants