Skip to content

fix(visualserver): contain visual worker failures#1347

Open
sufubao wants to merge 1 commit into
ModelTC:mainfrom
sufubao:backport-visual-worker-failures
Open

fix(visualserver): contain visual worker failures#1347
sufubao wants to merge 1 commit into
ModelTC:mainfrom
sufubao:backport-visual-worker-failures

Conversation

@sufubao

@sufubao sufubao commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Backport 2b195128 visualserver failure containment to current main.
  • Add per-image visual inference result signaling so worker failures/timeouts abort requests instead of forwarding missing embeddings.
  • Bound visual infer waits and embed-cache RPyC calls, keep visual workers alive after per-batch failures, and expose visual worker death through /health watchdog sentinels.
  • Adapted for main: excluded the qwen35 Dockerfile/dependency update and qwen35-only visual token-budget flags.

Verification

  • git diff --check origin/main..HEAD
  • python -m py_compile lightllm/models/qwen3_vl/qwen3_visual.py lightllm/server/api_cli.py lightllm/server/core/objs/start_args_type.py lightllm/server/visualserver/manager.py lightllm/server/visualserver/model_infer/model_rpc.py lightllm/server/visualserver/model_infer/model_rpc_client.py lightllm/server/visualserver/objs.py lightllm/server/visualserver/proxy_manager.py lightllm/server/visualserver/visual_only_manager.py lightllm/utils/health_check.py
  • rg -n "<<<<<<<|>>>>>>>|\|\|\|\|\|\|\||visual_batch_max_tokens|visual_image_max_tokens|Dockerfile\.qwen35" . returned no matches

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request hardens the visual server to prevent hangs and resource exhaustion by introducing timeouts for visual inference and cache operations, handling per-image preprocessing failures gracefully, and adding a periodic watchdog thread to monitor worker liveness. The review feedback highlights critical thread-safety issues with concurrent RPyC calls on shared connections (such as get_items_embed and set_items_embed in the cache executor, and remote_infer_images on remote connections) and suggests using locks to serialize these operations. Additionally, the reviewer identifies a potential bug in the watchdog's sentinel cleanup where calling os.kill with a non-positive PID could inadvertently signal entire process groups.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +101 to +103
self._cache_executor = concurrent.futures.ThreadPoolExecutor(
max_workers=2, thread_name_prefix="visual_cache_rpc"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

RPyC connections are not thread-safe. Since self.cache_client is called from threads in self._cache_executor (which has max_workers=2), concurrent requests can call get_items_embed concurrently, leading to protocol corruption or connection hangs. We should initialize a lock to serialize all RPyC calls on self.cache_client.

Suggested change
self._cache_executor = concurrent.futures.ThreadPoolExecutor(
max_workers=2, thread_name_prefix="visual_cache_rpc"
)
self._cache_executor = concurrent.futures.ThreadPoolExecutor(
max_workers=2, thread_name_prefix='visual_cache_rpc'
)
self._cache_lock = threading.Lock()

Comment on lines 145 to +151
def get_need_infer_images(self, group_req_indexes: GroupReqIndexes) -> List[ImageItem]:
"""同步路径: 检查 shm req 是否 aborted, 必要时调用 embed cache 询问哪些 image 已就绪。

``cache_client.root.get_items_embed`` 是同步 RPyC 调用; 调用方需要包一层
``asyncio.to_thread`` 并加超时, 避免 embed cache 卡死时阻塞 asyncio 事件循环
(2026-05-09 incident)。
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To ensure thread safety when calling self.cache_client.root.get_items_embed concurrently from multiple executor threads, please wrap the RPyC call inside get_need_infer_images with the newly introduced self._cache_lock:

with self._cache_lock:
    ready_image = obtain(self.cache_client.root.get_items_embed(img_uuids))

Comment on lines +104 to 105
# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底
self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

RPyC connections are not thread-safe. Since _load_to_cpu_cache runs in self._cache_executor (which has max_workers=2), concurrent requests can call set_items_embed concurrently on self.cache_client, leading to protocol corruption. We should wrap this call with self._cache_lock to serialize access.

Suggested change
# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底
self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])
# set_items_embed 也走 cache executor + 受 cache_client 的 sync_request_timeout 兜底
with self._cache_lock:
self.cache_client.root.set_items_embed([image.uuid for image in images_need_infer])

if self.args.detail_log:
start = time.time()
logger.info(f"Start to remote infer images {[image.md5 for image in images]}")
conn.root.remote_infer_images(images, event)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

RPyC connections are not thread-safe. If multiple concurrent requests select the same remote visual server connection from self.id_to_rpyc_conn, they will call conn.root.remote_infer_images concurrently on the same connection, which can corrupt the RPyC protocol. We should serialize calls to remote_infer_images on the same connection, for example by using a lock associated with the connection or a global lock.

Comment on lines +279 to +282
stale = False
if pid is None:
# Pre-PID file format from an older crash — treat as stale.
stale = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _sweep_stale_sentinels, if pid is parsed as 0 or a negative integer (which can happen if the sentinel file is corrupted or has invalid content), calling os.kill(pid, 0) can have unintended side effects. Specifically, in Unix systems, os.kill(0, 0) sends the signal to all processes in the current process group, and negative PIDs signal process groups. We should explicitly check that pid > 0 before calling os.kill to avoid signaling process groups or raising unexpected errors.

Suggested change
stale = False
if pid is None:
# Pre-PID file format from an older crash — treat as stale.
stale = True
stale = False
if pid is None or pid <= 0:
# Pre-PID or invalid file format from an older crash — treat as stale.
stale = True

Backport visualserver failure containment from upstream/qwen35. Keep worker loops alive after per-batch failures, propagate per-image failure status back to the manager, bound visual waits and cache RPCs, harden proxy/visual paths, and expose worker death through watchdog sentinels checked by /health.

Adapted for main: dropped the qwen35 image/dependency update and qwen35-only visual token-budget flags.

(cherry picked from commit 2b19512)
@sufubao sufubao force-pushed the backport-visual-worker-failures branch from c69bb61 to 50d6a0a Compare June 13, 2026 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant