Skip to content

Fix KubernetesExecutor scheduler crash from unpicklable pod_override#68831

Open
ephraimbuddy wants to merge 2 commits into
apache:mainfrom
astronomer:fix-v1pod-unpicklable-config-in-cluster
Open

Fix KubernetesExecutor scheduler crash from unpicklable pod_override#68831
ephraimbuddy wants to merge 2 commits into
apache:mainfrom
astronomer:fix-v1pod-unpicklable-config-in-cluster

Conversation

@ephraimbuddy

@ephraimbuddy ephraimbuddy commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

When the scheduler runs in-cluster, the kubernetes client installs a process-global default Configuration whose refresh_api_key_hook is a local closure. Under kubernetes-client v36, deserializing a V1Pod copies that Configuration onto the pod and every nested model. A task that sets a V1Pod pod_override therefore produces a pod that pickle cannot serialize, and the KubernetesExecutor pickles each task onto a multiprocessing queue synchronously in the scheduler loop, so the scheduler crashes in a loop and no task can be launched.

Deserialize pods through an ApiClient built with a fresh Configuration so that neither the pod nor any nested model captures the in-cluster global. This is applied at the point the config gets attached, so every consumer of a deserialized pod stays picklable: airflow-core's serialization and unpickling repair paths, and the cncf provider's worker-side reconstruction.

Provider counter part of this PR: #68848


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Claude Opus 4.8

@phanikumv phanikumv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same ApiClient() deserialize code still lives in the below two paths.

  - providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py — L381–L382 (api_client = ApiClient() → _ApiClient__deserialize_model(job, k8s.V1Job))
  - providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py — L39–L40 (api_client = ApiClient() → _ApiClient__deserialize_model(obj, new_class))

Neither is on the path where a deserialized kubernetes model gets pickle-serialized, however worth fixing them as a follow up in another PR..?

@ephraimbuddy

Copy link
Copy Markdown
Contributor Author

The same ApiClient() deserialize code still lives in the below two paths.

  - providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py — L381–L382 (api_client = ApiClient() → _ApiClient__deserialize_model(job, k8s.V1Job))
  - providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py — L39–L40 (api_client = ApiClient() → _ApiClient__deserialize_model(obj, new_class))

Neither is on the path where a deserialized kubernetes model gets pickle-serialized, however worth fixing them as a follow up in another PR..?

Yeah, we can do a follow up since those are not on the pickle path

@ephraimbuddy ephraimbuddy force-pushed the fix-v1pod-unpicklable-config-in-cluster branch from 736cbee to f029434 Compare June 22, 2026 12:31
phanikumv
phanikumv previously approved these changes Jun 22, 2026
@ephraimbuddy ephraimbuddy force-pushed the fix-v1pod-unpicklable-config-in-cluster branch from f029434 to 902fbf3 Compare June 22, 2026 15:25

@eladkal eladkal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this PR to core and providers?

@ephraimbuddy

ephraimbuddy commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Can we split this PR to core and providers?

The core change is what's important. Provider change can be released much later. i don't think we should split.

Splitting

@ephraimbuddy

Copy link
Copy Markdown
Contributor Author

Can we split this PR to core and providers?

The core change is what's important. Provider change can be released much later. i don't think we should split.

Splitting

PR: #68848

@ephraimbuddy ephraimbuddy force-pushed the fix-v1pod-unpicklable-config-in-cluster branch 2 times, most recently from 153aedd to 5490346 Compare June 22, 2026 16:01

@phanikumv phanikumv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the PRs are currently missing the airflow-core scheduler fix. Please restore yesterday's serialized_objects.py + sqlalchemy.py changes

@phanikumv phanikumv dismissed their stale review June 23, 2026 04:04

Code changed post approval.

@amoghrajesh amoghrajesh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is fine for pods deserialized from yaml / dicst but misses directly instantiated V1Pod objects via the executor_config.

When a user constructs a V1Pod directly in their dag, it never goes through `ApiClient.__deserialize_model. Requesting changes to avoid accidental merge.

@ephraimbuddy

Copy link
Copy Markdown
Contributor Author

The fix is fine for pods deserialized from yaml / dicst but misses directly instantiated V1Pod objects via the executor_config.

When a user constructs a V1Pod directly in their dag, it never goes through `ApiClient.__deserialize_model. Requesting changes to avoid accidental merge.

Yeah, there was a mix up. I have pushed the actual change.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents scheduler crashes in KubernetesExecutor when a task sets pod_override to a kubernetes.client.models.V1Pod that becomes unpicklable in-cluster (due to kubernetes-client v36 attaching the process-global in-cluster Configuration with an unpicklable refresh_api_key_hook closure). It centralizes pod dict deserialization through an ApiClient constructed with a fresh Configuration, ensuring deserialized pods (and nested models) remain picklable across core serialization and the “repair after unpickling” path.

Changes:

  • Add deserialize_pod_dict() that deserializes pod dicts via ApiClient(configuration=Configuration()) to avoid capturing the in-cluster global Configuration.
  • Update both BaseSerialization.deserialize() (POD branch) and ensure_pod_is_valid_after_unpickling() to use the new safe deserialization helper.
  • Add regression tests covering both the serialization/deserialization path and the unpickling repair path for picklability in-cluster.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
airflow-core/src/airflow/utils/sqlalchemy.py Introduces safe pod dict deserialization with a fresh Kubernetes Configuration and reuses it in the unpickling repair path.
airflow-core/src/airflow/serialization/serialized_objects.py Switches POD deserialization to the safe helper to keep deserialized pods picklable.
airflow-core/tests/unit/utils/test_sqlalchemy.py Adds a test ensuring the unpickling repair path produces a picklable pod even with a dirty in-cluster default Configuration.
airflow-core/tests/unit/serialization/test_serialized_objects.py Adds a test ensuring serialized/deserialized pods do not capture an unpicklable in-cluster default Configuration.

@phanikumv phanikumv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now after the last commit

@phanikumv phanikumv requested a review from amoghrajesh June 23, 2026 09:03

@amoghrajesh amoghrajesh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, otherwise LGTM!

Comment thread airflow-core/tests/unit/serialization/test_serialized_objects.py Outdated
the pod it must round-trip through a fresh ``Configuration`` so it stays picklable onto the
KubernetesExecutor queue.
"""
from kubernetes.client import Configuration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here (but this one can stay actually)

assert decoded.local_vars_configuration.refresh_api_key_hook is None
assert decoded.spec.containers[0].local_vars_configuration.refresh_api_key_hook is None

_has_kubernetes.cache_clear()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? I guess its bleeding from other tests but yeah just flagging

Comment thread airflow-core/src/airflow/serialization/serialized_objects.py Outdated
When the scheduler runs in-cluster, the kubernetes client installs a
process-global default Configuration whose refresh_api_key_hook is a local
closure. Under kubernetes-client v36, deserializing a V1Pod copies that
Configuration onto the pod and every nested model. A task that sets a V1Pod
pod_override therefore produces a pod that pickle cannot serialize, and the
KubernetesExecutor pickles each task onto a multiprocessing queue
synchronously in the scheduler loop, so the scheduler crashes in a loop and no
task can be launched.

Deserialize pods through an ApiClient built with a fresh Configuration so that
neither the pod nor any nested model captures the in-cluster global. This is
applied where the config gets attached -- airflow-core's serialization and
unpickling repair paths -- so every consumer of a deserialized pod stays
picklable.
@ephraimbuddy ephraimbuddy force-pushed the fix-v1pod-unpicklable-config-in-cluster branch from f8c5f27 to fb76af2 Compare June 23, 2026 09:58
Move the deserialize_pod_dict import to module top-level, hoist pickle and
Configuration to top-level imports in the serialization test, and drop the
redundant _has_kubernetes cache_clear calls that were copied from a sibling
test.
@ephraimbuddy ephraimbuddy force-pushed the fix-v1pod-unpicklable-config-in-cluster branch from fb76af2 to 154a011 Compare June 23, 2026 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants