[AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks #4197

Open
j1wonpark wants to merge 1 commit into apache:master from j1wonpark:optimizer-graceful-shutdown

@j1wonpark j1wonpark commented Apr 30, 2026

Why are the changes needed?

Close #4198.

When an optimizer receives SIGTERM, in-progress tasks are silently dropped and
AMS re-schedules them — doubling work and potentially causing duplicate commits.

Brief change log

  • Optimizer.stopOptimizing(): join executor threads up to --shutdown-timeout-ms
    (default 10 min); keep toucher alive during drain so AMS heartbeats continue
  • OptimizerExecutor.completeTask(): best-effort direct call after shutdown so
    results are not silently dropped
  • OptimizerToucher.stop(): interrupt runner thread to wake it from sleep immediately
  • AbstractOptimizerOperator.waitAShortTime(): preserve interrupt flag
  • OptimizerConfig: new -st / --shutdown-timeout-ms option
  • StandaloneOptimizer / SparkOptimizer: register graceful shutdown hook on
    Hadoop's ShutdownHookManager above FS_CACHE priority with explicit timeout
  • KubernetesOptimizerContainer: exec prefix in container command; derive
    terminationGracePeriodSeconds from shutdown-timeout-ms + 30s buffer
  • optimizer.sh start-foreground: exec $CMDS so Java receives SIGTERM directly
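
For review purposes, the bounded drain described for Optimizer.stopOptimizing() can be pictured with a minimal sketch. This is not the code in the patch: `DrainSketch`, `joinWithDeadline`, and `demo` are hypothetical names, and the toucher and AMS calls are left out. It assumes the shape is: join every executor thread against one shared deadline, interrupt only the threads still alive after the deadline, and restore the interrupt flag if the joining thread is itself interrupted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; class and method names are not from the patch.
public class DrainSketch {

  /**
   * Waits for all threads against one shared deadline, then interrupts
   * whatever is still running. Returns how many threads were interrupted.
   */
  static int joinWithDeadline(List<Thread> executors, long timeoutMs) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    try {
      for (Thread t : executors) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
        if (remainingMs <= 0) {
          break; // join(0) would wait forever, so stop once the budget is spent
        }
        t.join(remainingMs);
      }
    } catch (InterruptedException e) {
      // Preserve the interrupt flag so callers can still observe it.
      Thread.currentThread().interrupt();
    }
    int forced = 0;
    for (Thread t : executors) {
      if (t.isAlive()) {
        t.interrupt(); // force-interrupt only the stragglers
        forced++;
      }
    }
    return forced;
  }

  /** Runs one thread that finishes at once and one that outlives the deadline. */
  static int demo() {
    Thread quick = new Thread(() -> {});
    Thread stuck = new Thread(() -> {
      try {
        Thread.sleep(60_000L);
      } catch (InterruptedException e) {
        // Exit promptly when force-interrupted.
      }
    });
    quick.start();
    stuck.start();
    return joinWithDeadline(Arrays.asList(quick, stuck), 200L);
  }

  public static void main(String[] args) {
    System.out.println("forced interrupts: " + demo());
  }
}
```

Sharing one deadline across all joins keeps the total wait bounded by the timeout, rather than by the timeout multiplied by the number of executor threads.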

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs / option usage string

@j1wonpark j1wonpark changed the title [AMORO][optimizer] Support graceful shutdown for in-progress tasks [AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks Apr 30, 2026

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 59.09091% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 30.24%. Comparing base (99fcc08) to head (2a6df8f).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...a/org/apache/amoro/optimizer/common/Optimizer.java | 43.58% | 18 Missing and 4 partials ⚠️ |
| ...o/server/manager/KubernetesOptimizerContainer.java | 77.27% | 3 Missing and 2 partials ⚠️ |
| ...pache/amoro/optimizer/common/OptimizerToucher.java | 69.23% | 2 Missing and 2 partials ⚠️ |
| ...apache/amoro/optimizer/common/OptimizerConfig.java | 40.00% | 3 Missing ⚠️ |
| ...ache/amoro/optimizer/common/OptimizerExecutor.java | 75.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4197      +/-   ##
============================================
+ Coverage     23.09%   30.24%   +7.14%     
- Complexity     2706     4390    +1684     
============================================
  Files           463      680     +217     
  Lines         42826    55337   +12511     
  Branches       6044     7102    +1058     
============================================
+ Hits           9891    16735    +6844     
- Misses        32076    37337    +5261     
- Partials        859     1265     +406     

| Flag | Coverage Δ |
|---|---|
| core | 30.24% <59.09%> (?) |
| trino | ? |

On SIGTERM the optimizer flips its stopped flag and returns immediately,
so in-flight task results are silently dropped (completeTask is gated by
isStarted). On K8s this is compounded by `sh -c` swallowing SIGTERM, a
30s default grace period, and Hadoop's FileSystem cache cleanup racing
JVM shutdown hooks.

- Optimizer.stopOptimizing: join executors with a deadline, force
  interrupt only on timeout; keep toucher alive so AMS heartbeats
  continue while tasks drain.
- OptimizerExecutor.completeTask: best-effort direct call after stop so
  the in-flight result still reaches AMS.
- SparkOptimizer / StandaloneOptimizer: register on Hadoop
  ShutdownHookManager (priority above FS_CACHE / SparkContext) with an
  explicit per-hook timeout.
- OptimizerConfig: new -st / --shutdown-timeout-ms (default 600s).
- KubernetesOptimizerContainer: `sh -c 'exec <args>'` and an explicit
  terminationGracePeriodSeconds derived from -st + 30s buffer; user
  podTemplate values are respected.
- optimizer.sh start-foreground: exec $CMDS so java gets PID 1.
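
The grace-period derivation above is simple arithmetic, sketched here for reference. `GracePeriodSketch` and its members are hypothetical names, not the patch's code. Rounding the millisecond timeout up to whole seconds is an assumption, since the description only says "-st + 30s buffer", and the handling of user podTemplate values is not modelled.

```java
// Hypothetical illustration only; names and the round-up rule are assumptions.
public class GracePeriodSketch {

  /** Extra seconds granted on top of the optimizer's own shutdown timeout. */
  static final long BUFFER_SECONDS = 30L;

  /** Converts a shutdown timeout in milliseconds to a pod grace period in seconds. */
  static long terminationGracePeriodSeconds(long shutdownTimeoutMs) {
    if (shutdownTimeoutMs < 0) {
      throw new IllegalArgumentException("shutdownTimeoutMs must be >= 0");
    }
    // Assumed: round up so the pod is never given less time than the optimizer.
    long timeoutSeconds = (shutdownTimeoutMs + 999L) / 1000L;
    return timeoutSeconds + BUFFER_SECONDS;
  }

  public static void main(String[] args) {
    System.out.println(terminationGracePeriodSeconds(600_000L));
  }
}
```

With the default of 600000 ms this yields 630 seconds, so the kubelet's SIGKILL would arrive only after the optimizer's own drain deadline has passed.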

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark j1wonpark force-pushed the optimizer-graceful-shutdown branch from 3607baa to 2a6df8f Compare May 31, 2026 00:50
@j1wonpark
Contributor Author

Gentle ping for review 🙏 @zhoujinsong @czy006 — this builds on the master-slave optimizer work you reviewed in #4174 / #3937, and you both merged master in earlier. Just rebased onto latest master as a single clean commit. Key bits: heartbeat/shutdown ordering in Optimizer.stopOptimizing, the best-effort completeTask path after stop, and the ShutdownHookManager priorities. Thanks!

Development

Successfully merging this pull request may close these issues.

[Improvement]: Support graceful shutdown for in-progress optimizer tasks