Commit b6cc4cd

Merge branch 'release/2.5' into release/2.5
2 parents: d409eda + 5307838

16 files changed: 242 additions & 138 deletions


docs/features/weight_update.md

Lines changed: 4 additions & 8 deletions
```diff
@@ -50,7 +50,7 @@ In FastDeploy >= 2.6, the underlying control-signal communication path is optimized
 | `/v1/is_paused` | `GET` | none | Return `{"is_paused": bool}`. |
 | `/v1/sleep` | `POST` | `?tags=weight,kv_cache` | Offload selected GPU memory objects. Supported tags are `weight` and `kv_cache`. If omitted, both are used. |
 | `/v1/wakeup` | `POST` | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, the engine resumes automatically. |
-| `/v1/update_weights` | `POST` | JSON `{"version":"...", "rsync_config": {...}}` | Refresh weights in place through the worker control path. This API is intended for remote versioned updates, especially `load_strategy=rsync`. |
+| `/v1/update_weights` | `POST` | JSON `{"version":"...", "verify_checksum": false}` | Refresh weights in place through the worker control path. This API is intended for remote versioned updates, especially `load_strategy=rsync`. |

 ### Compatibility Notes
```

```diff
@@ -114,7 +114,7 @@ After `wakeup` succeeds, FastDeploy automatically calls `resume`.
 Current request fields:

 - `version`: optional string. Used to choose a target checkpoint version.
-- `rsync_config`: optional dictionary. Must contain `etcd_server` when provided.
+- `verify_checksum`: optional boolean. Defaults to `false`. Set to `true` to verify data integrity during weight synchronization.

 Important semantics:
```

````diff
@@ -186,9 +186,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
   -H "Content-Type: application/json" \
   -d '{
     "version": "global_step_1200",
-    "rsync_config": {
-      "etcd_server": "127.0.0.1:2379"
-    }
+    "verify_checksum": false
   }'
 ```
````

```diff
@@ -261,9 +259,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
   -H "Content-Type: application/json" \
   -d '{
     "version": "global_step_1200",
-    "rsync_config": {
-      "etcd_server": "127.0.0.1:2379"
-    }
+    "verify_checksum": false
   }'

 # Resume the service after the update completes
```
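The curl examples above translate directly to Python. A client-side sketch using only the standard library (the helper name is ours; the host and port mirror the examples):

```python
import json
import urllib.request

def build_update_weights_request(base_url, version=None, verify_checksum=None):
    """Assemble a POST /v1/update_weights request with the documented fields."""
    body = {}
    if version is not None:
        body["version"] = version
    if verify_checksum is not None:
        if not isinstance(verify_checksum, bool):
            raise TypeError("verify_checksum must be a boolean")
        body["verify_checksum"] = verify_checksum
    return urllib.request.Request(
        f"{base_url}/v1/update_weights",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_update_weights_request(
    "http://127.0.0.1:8000", version="global_step_1200", verify_checksum=False
)
# Not sent here; pass req to urllib.request.urlopen(...) against a running server.
```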

docs/zh/features/weight_update.md

Lines changed: 4 additions & 8 deletions
```diff
@@ -50,7 +50,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 | `/v1/is_paused` | `GET` | none | Returns `{"is_paused": bool}`. |
 | `/v1/sleep` | `POST` | `?tags=weight,kv_cache` | Offload the specified GPU memory objects. Supports `weight` and `kv_cache`; if omitted, both are handled by default. |
 | `/v1/wakeup` | `POST` | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. `resume` is called automatically on success. |
-| `/v1/update_weights` | `POST` | JSON `{"version":"...", "rsync_config": {...}}` | Refresh model weights in place through the worker control path. This API mainly targets remote versioned updates with `load_strategy=rsync`. |
+| `/v1/update_weights` | `POST` | JSON `{"version":"...", "verify_checksum": false}` | Refresh model weights in place through the worker control path. This API mainly targets remote versioned updates with `load_strategy=rsync`. |

 ### Compatibility Notes
```

```diff
@@ -113,7 +113,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 Currently supported request fields:

 - `version`: optional string used to specify the target checkpoint version.
-- `rsync_config`: optional dictionary; when provided, it must contain `etcd_server`.
+- `verify_checksum`: optional boolean; defaults to `false`. When set to `true`, data integrity is verified during weight synchronization.

 Key semantics:
```

````diff
@@ -185,9 +185,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
   -H "Content-Type: application/json" \
   -d '{
     "version": "global_step_1200",
-    "rsync_config": {
-      "etcd_server": "127.0.0.1:2379"
-    }
+    "verify_checksum": false
   }'
 ```
````

```diff
@@ -260,9 +258,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
   -H "Content-Type: application/json" \
   -d '{
     "version": "global_step_1200",
-    "rsync_config": {
-      "etcd_server": "127.0.0.1:2379"
-    }
+    "verify_checksum": false
   }'

 # Resume the service after the update completes
```

fastdeploy/__init__.py

Lines changed: 0 additions & 8 deletions
```diff
@@ -15,19 +15,11 @@
 """

 import os
-import uuid

 # suppress warning log from paddlepaddle
 os.environ["GLOG_minloglevel"] = "2"
 # suppress log from aistudio
 os.environ["AISTUDIO_LOG"] = "critical"
-# set prometheus dir
-if os.getenv("PROMETHEUS_MULTIPROC_DIR", "") == "":
-    prom_dir = f"/tmp/fd_prom_{str(uuid.uuid4())}"
-    os.environ["PROMETHEUS_MULTIPROC_DIR"] = prom_dir
-    if os.path.exists(prom_dir):
-        os.rmdir(prom_dir)
-    os.mkdir(prom_dir)

 import typing
```

fastdeploy/config.py

Lines changed: 34 additions & 3 deletions
```diff
@@ -1215,6 +1215,37 @@ def update_enable_early_stop(self, argument: bool):
         argument = self.enable_early_stop


+class DeployModality(str, Enum):
+    """Modality mode for the serving engine deployment.
+
+    Determines which input modalities the serving engine should handle:
+    - TEXT: Text-only deployment. The engine only processes text inputs,
+      skipping multimodal preprocessing (e.g., vision encoder, audio
+      encoder). This reduces GPU memory usage and startup time when
+      multimodal capabilities are not needed.
+    - MIXED: Multimodal deployment (default). The engine handles mixed-modality
+      inputs including text, images, audio, and video. All modality-specific
+      encoders and preprocessing pipelines are initialized at startup.
+
+    Usage:
+        --deploy-modality text   # text-only, lower resource footprint
+        --deploy-modality mixed  # full multimodal support (default)
+    """
+
+    TEXT = "text"
+    MIXED = "mixed"
+
+    @classmethod
+    def from_str(cls, value: str) -> "DeployModality":
+        """Parse a string into a DeployModality enum, with validation."""
+        value = value.strip().lower()
+        try:
+            return cls(value)
+        except ValueError:
+            valid = ", ".join(f"'{m.value}'" for m in cls)
+            raise ValueError(f"Invalid deploy_modality '{value}'. Must be one of: {valid}")
+
+
 class LoadChoices(str, Enum):
     """LoadChoices"""
```

```diff
@@ -1697,6 +1728,7 @@ def __init__(
         tool_parser: str = None,
         test_mode=False,
         routing_replay_config: Optional[RoutingReplayConfig] = None,
+        deploy_modality: "DeployModality" = None,
     ):
         self.model_config: ModelConfig = model_config  # type: ignore
         self.cache_config: CacheConfig = cache_config  # type: ignore
```
```diff
@@ -1713,8 +1745,7 @@ def __init__(
         self.structured_outputs_config: StructuredOutputsConfig = structured_outputs_config
         self.router_config: RouterConfig = router_config
         self.routing_replay_config = routing_replay_config
-
-        # Initialize cuda graph capture list
+        self.deploy_modality: DeployModality = deploy_modality if deploy_modality is not None else DeployModality.MIXED
         max_capture_shape = self.scheduler_config.max_num_seqs
         if self.speculative_config is not None and self.speculative_config.method in ["mtp", "suffix"]:
             max_capture_shape = self.scheduler_config.max_num_seqs * (
```
```diff
@@ -2209,7 +2240,7 @@ def get_max_chunk_tokens(self, mm_max_tokens_per_item=None):
             num_tokens = self.scheduler_config.max_num_seqs
         else:
             num_tokens = self.scheduler_config.max_num_batched_tokens
-        if mm_max_tokens_per_item is not None:
+        if mm_max_tokens_per_item is not None and self.deploy_modality != DeployModality.TEXT:
             max_mm_tokens = max(
                 mm_max_tokens_per_item.get("image", 0),
                 mm_max_tokens_per_item.get("video", 0),
```
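The `from_str` helper added in this file normalizes case and whitespace before validating against the enum values. Extracted into a standalone sketch (mirroring the code above, runnable outside FastDeploy):

```python
from enum import Enum

class DeployModality(str, Enum):
    """Mirror of the enum added in fastdeploy/config.py (sketch for illustration)."""

    TEXT = "text"
    MIXED = "mixed"

    @classmethod
    def from_str(cls, value: str) -> "DeployModality":
        # Normalize before validating, so " Text " and "MIXED" both parse
        value = value.strip().lower()
        try:
            return cls(value)
        except ValueError:
            valid = ", ".join(f"'{m.value}'" for m in cls)
            raise ValueError(f"Invalid deploy_modality '{value}'. Must be one of: {valid}")

m = DeployModality.from_str("  Text ")  # returns DeployModality.TEXT
```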

fastdeploy/entrypoints/openai/api_server.py

Lines changed: 5 additions & 10 deletions
```diff
@@ -459,19 +459,14 @@ async def update_weights(request: Request) -> Response:
         )
     args["version"] = request_data["version"]

-    # Validate and extract rsync_config parameter
-    if "rsync_config" in request_data and request_data["rsync_config"] is not None:
-        if not isinstance(request_data["rsync_config"], dict):
+    # Validate and extract verify_checksum parameter
+    if "verify_checksum" in request_data and request_data["verify_checksum"] is not None:
+        if not isinstance(request_data["verify_checksum"], bool):
             return JSONResponse(
                 status_code=400,
-                content={"error": "Invalid parameter type", "message": "rsync_config must be a dictionary"},
+                content={"error": "Invalid parameter type", "message": "verify_checksum must be a boolean"},
             )
-        if "etcd_server" not in request_data["rsync_config"]:
-            return JSONResponse(
-                status_code=400,
-                content={"error": "Invalid parameter type", "message": "rsync_config must contain etcd_server"},
-            )
-        args["rsync_config"] = request_data["rsync_config"]
+        args["verify_checksum"] = request_data["verify_checksum"]

     control_request = ControlRequest(request_id, "update_weights", args)
     control_response = await app.state.engine_client.run_control_method(control_request)
```
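The boolean check above is stricter than it may look: since Python's `bool` is a subclass of `int` but not vice versa, JSON values like `0` or `1` fail `isinstance(..., bool)` and are rejected rather than coerced. A hypothetical standalone version of the same validation (not FastDeploy's actual helper):

```python
def extract_verify_checksum(request_data: dict):
    """Validate the optional verify_checksum field; return (value, error_message)."""
    value = request_data.get("verify_checksum")
    if value is None:
        # Field absent or explicitly null: nothing to forward
        return None, None
    if not isinstance(value, bool):
        # isinstance(1, bool) is False, so numeric "truthy" JSON values are rejected
        return None, "verify_checksum must be a boolean"
    return value, None
```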

fastdeploy/entrypoints/openai/multi_api_server.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -107,6 +107,15 @@ def start_servers(
         env = os.environ.copy()
         env["FD_ENABLE_MULTI_API_SERVER"] = "1"
         env["FD_LOG_DIR"] = env.get("FD_LOG_DIR", "log") + f"/log_{i}"
+        if "PROMETHEUS_MULTIPROC_DIR" in env:
+            prom_dir = env.get("PROMETHEUS_MULTIPROC_DIR")
+            prom_dir_i = os.path.join(os.path.dirname(prom_dir), os.path.basename(prom_dir) + f"_dp{i}")
+            # Create the directory if it doesn't exist
+            if not os.path.exists(prom_dir_i):
+                os.makedirs(prom_dir_i, exist_ok=True)
+            env["PROMETHEUS_MULTIPROC_DIR"] = prom_dir_i
+            logger.info(f"Set PROMETHEUS_MULTIPROC_DIR for DP {i}: {prom_dir_i}")

         cmd = [
             sys.executable,
             "-m",
```
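The path manipulation above gives each data-parallel rank its own Prometheus multiprocess directory by suffixing `_dp{i}` onto the configured path. The derivation in isolation (the helper name is ours; directory creation omitted):

```python
import os

def per_dp_prom_dir(prom_dir: str, i: int) -> str:
    """Derive the per-rank directory, e.g. /tmp/fd_prom -> /tmp/fd_prom_dp0."""
    return os.path.join(os.path.dirname(prom_dir), os.path.basename(prom_dir) + f"_dp{i}")

print(per_dp_prom_dir("/tmp/fd_prom", 0))  # /tmp/fd_prom_dp0
```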

fastdeploy/metrics/metrics.py

Lines changed: 41 additions & 13 deletions
```diff
@@ -65,7 +65,7 @@ def collect(self):
             Metric: Prometheus Metric objects that are not excluded.
         """
         for metric in self.base_registry.collect():
-            if not any(name.startswith(metric.name) for name in self.exclude_names):
+            if not any(metric.name.startswith(name) for name in self.exclude_names):
                 yield metric
```
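The one-line change in this hunk swaps the arguments of `startswith`: the filter must ask whether the collected metric's name begins with an exclude prefix, not whether the (shorter) prefix begins with the metric name. A minimal demonstration of the two directions, with an illustrative prefix:

```python
exclude_names = {"fastdeploy:num_requests"}      # illustrative exclude prefix
metric_name = "fastdeploy:num_requests_running"  # a collected metric's name

# Buggy direction (before the fix): the short prefix never starts with the long name
buggy = any(name.startswith(metric_name) for name in exclude_names)

# Fixed direction: the metric name does start with the exclude prefix
fixed = any(metric_name.startswith(name) for name in exclude_names)
```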

```diff
@@ -83,11 +83,15 @@ def get_filtered_metrics() -> str:
     multiprocess.MultiProcessCollector(base_registry)

     filtered_registry = CollectorRegistry()
-    # Register a new collector to filter gauge metrics
-    filtered_registry.register(SimpleCollector(base_registry, EXCLUDE_LABELS))
+    # Dynamically fetch the list of gauge metrics to exclude
+    exclude_labels = main_process_metrics.get_excluded_metrics()
+    # Register a new collector to filter gauge metrics
+    filtered_registry.register(SimpleCollector(base_registry, exclude_labels))

     # Re-register gauge metrics into filtered_registry, reading values from memory
     main_process_metrics.re_register_gauge(filtered_registry)
+    # Also re-register the gauge metrics from speculative decoding
+    main_process_metrics.re_register_speculative_gauge(filtered_registry)

     return generate_latest(filtered_registry).decode("utf-8")
```

```diff
@@ -195,7 +199,7 @@ class MetricsManager:
         "type": Gauge,
         "name": "fastdeploy:num_requests_running",
         "description": "Number of requests currently running",
-        "kwargs": {"multiprocess_mode": "sum"},
+        "kwargs": {},
     },
     "num_requests_waiting": {
         "type": Gauge,
```
```diff
@@ -625,19 +629,22 @@ def __init__(self):
         # Set the Prometheus environment variables before metrics are registered at module load
         setup_multiprocess_prometheus()

-        # Dynamically create all metrics
+        # Dynamically create all non-gauge metrics
         for metric_name, config in self.METRICS.items():
             setattr(
                 self,
                 metric_name,
                 config["type"](config["name"], config["description"], **config["kwargs"]),
             )
-        # Dynamically create all metrics
+        # Dynamically create all gauge metrics, defaulting multiprocess_mode to livesum
         for metric_name, config in self.GAUGE_METRICS.items():
+            kwargs = config["kwargs"].copy()
+            if "multiprocess_mode" not in kwargs:
+                kwargs["multiprocess_mode"] = "livesum"
             setattr(
                 self,
                 metric_name,
-                config["type"](config["name"], config["description"], **config["kwargs"]),
+                config["type"](config["name"], config["description"], **kwargs),
             )
         # Dynamically create server metrics
         for metric_name, config in self.SERVER_METRICS.items():
```
```diff
@@ -695,17 +702,22 @@ def _init_speculative_metrics(self, speculative_method, num_speculative_tokens):
                     Gauge(
                         f"{config['name']}_{i}",
                         f"{config['description']} (head {i})",
+                        multiprocess_mode="livesum",
                     )
                 )
             setattr(self, metric_name, gauges)
         else:
+            # For Gauge metrics, automatically add multiprocess_mode="livesum"
+            kwargs = config["kwargs"].copy()
+            if config["type"] == Gauge and "multiprocess_mode" not in kwargs:
+                kwargs["multiprocess_mode"] = "livesum"
             setattr(
                 self,
                 metric_name,
                 config["type"](
                     config["name"],
                     config["description"],
-                    **config["kwargs"],
+                    **kwargs,
                 ),
             )
```

```diff
@@ -766,6 +778,19 @@ def register_speculative_metrics(self, registry: CollectorRegistry):
         else:
             registry.register(getattr(self, metric_name))

+    def re_register_speculative_gauge(self, registry: CollectorRegistry):
+        """Re-register gauge metrics from SPECULATIVE_METRICS to the specified registry"""
+        # Check if SPECULATIVE_METRICS was initialized in this process
+        # (it's an instance attribute set by _init_speculative_metrics, not the class-level empty dict)
+        if not hasattr(self, "spec_decode_draft_acceptance_rate"):
+            return
+        for metric_name, config in self.SPECULATIVE_METRICS.items():
+            if metric_name == "spec_decode_draft_single_head_acceptance_rate":
+                for gauge in getattr(self, metric_name):
+                    registry.register(gauge)
+            elif config["type"] == Gauge:
+                registry.register(getattr(self, metric_name))
+
     def re_register_gauge(self, registry: CollectorRegistry):
         """Re-register gauge to the specified registry"""
         for metric_name in self.GAUGE_METRICS:
```
```diff
@@ -789,16 +814,19 @@ def register_all(self, registry: CollectorRegistry):
         if hasattr(main_process_metrics, "spec_decode_draft_acceptance_rate"):
             self.register_speculative_metrics(registry)

-    @classmethod
-    def get_excluded_metrics(cls) -> Set[str]:
+    def get_excluded_metrics(self) -> Set[str]:
         """Get the set of indicator names that need to be excluded"""
-        return {config["name"] for config in cls.GAUGE_METRICS.values()}
+        excluded = {config["name"] for config in self.GAUGE_METRICS.values()}
+        # Also add gauge metrics from SPECULATIVE_METRICS (if initialized)
+        if hasattr(self, "SPECULATIVE_METRICS"):
+            for config in self.SPECULATIVE_METRICS.values():
+                if config["type"] == Gauge or config["type"] == list[Gauge]:
+                    excluded.add(config["name"])
+        return excluded


 main_process_metrics = MetricsManager()

 # Recording zmq metrics is expensive, so it is disabled by default; enable it via the DEBUG flag
 if envs.FD_DEBUG:
     main_process_metrics.init_zmq_metrics()
-
-EXCLUDE_LABELS = MetricsManager.get_excluded_metrics()
```
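The gauge-construction changes in this file all follow the same pattern: copy the per-metric kwargs and fill in `multiprocess_mode="livesum"` only when the config did not choose a mode itself, leaving explicit settings untouched. The defaulting logic in isolation:

```python
def gauge_kwargs(config: dict) -> dict:
    """Copy a metric config's kwargs, defaulting multiprocess_mode to livesum."""
    kwargs = config["kwargs"].copy()
    if "multiprocess_mode" not in kwargs:
        kwargs["multiprocess_mode"] = "livesum"
    return kwargs

explicit = {"kwargs": {"multiprocess_mode": "max"}}  # keeps its own mode
empty = {"kwargs": {}}                               # gets the livesum default
```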

fastdeploy/model_executor/models/qwen3_vl/qwen3_vl.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -381,6 +381,10 @@ def forward(

         return hidden_states

+    def clear_grpah_opt_backend(self):
+        """Clear graph optimization backend, the captured cuda graph will be cleaned"""
+        self.model.clear_grpah_opt_backend(fd_config=self.fd_config)
+

 class Qwen3VLPretrainedModel(PretrainedModel):
     """Utilities for tensor-parallel weight splitting."""
```
