Summary
convert_deepseek_family_unscanned_ckpt --model_size kimi-k2.6-text accumulates every dequantized bf16 tensor for all 64 HF safetensors shards in a single in-memory chkpt_vars dict before assembling and writing Orbax. For Kimi-K2.6 that's ~2.3 TB of CPU RAM at peak. There's no streaming/low-memory mode, and the documented runbook doesn't mention a host-size requirement.
Environment
- maxtext==0.2.2 installed via pip install maxtext[tpu] (PyPI), Python 3.12, CPU-only torch==2.12.0+cpu. (Note: maxtext[tpu] pulls torch==2.12.0+cu130, whose torch._dynamo segfaults at import on a CPU-only host; a forced reinstall with --index-url https://download.pytorch.org/whl/cpu fixes it.)
- Host: GCP n2-standard-32 (32 vCPU, 128 GB RAM), 2 TB pd-balanced disk, Ubuntu 22.04.
- Zone: us-east5-a (chosen to be local to v5e TPU quota in this project; us-east5 has no m1/m2/m3-ultramem SKUs).
- HF source: moonshotai/Kimi-K2.6 (96 files, 64 *.safetensors shards, 555 GB on disk).
Repro
python -m maxtext.checkpoint_conversion.standalone_scripts.convert_deepseek_family_unscanned_ckpt \
--model_size kimi-k2.6-text \
--base_model_path /home/$USER/kimi-k2.6-hf \
--maxtext_model_path gs://<bucket>/k2.6-unscanned
Memory growth observed (RSS via free -g)
| After loading shard | RSS used |
| --- | --- |
| 2 | 16 GB |
| 3 | 53 GB |
| 4 | 90 GB |
| 5 | 124 GB |
| 6 | SIGKILL (rc=137) |
~35–37 GB per shard, linear. Extrapolating across all 64 shards → ~2.3 TB RSS at peak.
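The extrapolation can be reproduced from the table with a few lines of arithmetic. This is only a sketch: it assumes growth stays linear for the remaining shards, which the data above cannot confirm since the process was killed at shard 6. The consecutive differences in the table work out to 37, 37, and 34 GB.

```python
# Memory use (GB) after each shard, copied from the table above.
observed_gb = {2: 16, 3: 53, 4: 90, 5: 124}

shards = sorted(observed_gb)
# Growth between consecutive measurements.
deltas = [observed_gb[b] - observed_gb[a] for a, b in zip(shards, shards[1:])]
mean_delta_gb = sum(deltas) / len(deltas)

TOTAL_SHARDS = 64  # number of *.safetensors shards in the HF repo
# Assumption: every shard costs about the same as the ones measured.
projected_peak_gb = mean_delta_gb * TOTAL_SHARDS

print(f"per-shard growth: {deltas} GB (mean {mean_delta_gb:.1f} GB)")
print(f"projected peak: {projected_peak_gb:.0f} GB (~{projected_peak_gb / 1000:.1f} TB)")
```

With a mean of 36 GB per shard, the projection is about 2,304 GB, which matches the ~2.3 TB figure and is roughly 18x the RAM of the n2-standard-32 host.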
Root cause (from reading the script)
In convert_deepseek_family_unscanned_ckpt.py, the outer loop over ckpt_paths populates a single chkpt_vars = {} with every dequantize_pack_quantized_int4(...) result before the assembly phase begins. Nothing is freed; nothing is written until the entire model is resident.
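For contrast, here is a minimal sketch of what a per-layer streaming loop could look like. It is not the MaxText implementation: `stream_convert`, `load_shard`, `dequantize`, and `write_layer` are hypothetical names, the toy "tensors" are plain integers, and it assumes a layer can be written as soon as all of its tensors have been seen. The only real-world convention it leans on is a tensor-name to shard-file mapping like the `weight_map` in a sharded HF checkpoint's index file. Whether Orbax output can actually be emitted layer by layer in this converter is an open question.

```python
import re
from collections import defaultdict

LAYER_RE = re.compile(r"layers\.(\d+)\.")

def layer_of(name):
    """Return the layer index in a tensor name, or -1 for non-layer tensors."""
    match = LAYER_RE.search(name)
    return int(match.group(1)) if match else -1

def stream_convert(weight_map, load_shard, dequantize, write_layer):
    """Dequantize shard by shard, flushing each layer once it is complete.

    weight_map:  tensor name -> shard file name
    load_shard:  hypothetical callable, shard file name -> {name: raw tensor}
    dequantize:  hypothetical stand-in for the int4 -> bf16 step
    write_layer: hypothetical callable that persists one finished layer
    Returns the peak number of dequantized tensors held in memory at once.
    """
    # Count how many tensors each layer expects, so we know when it is done.
    remaining = defaultdict(int)
    for name in weight_map:
        remaining[layer_of(name)] += 1

    pending = defaultdict(dict)  # layer -> {name: dequantized tensor}
    peak = 0
    for shard in sorted(set(weight_map.values())):
        for name, raw in load_shard(shard).items():
            layer = layer_of(name)
            pending[layer][name] = dequantize(raw)
            remaining[layer] -= 1
            peak = max(peak, sum(len(v) for v in pending.values()))
            if remaining[layer] == 0:
                # Layer complete: write it and drop it from memory.
                write_layer(layer, pending.pop(layer))
    return peak

# Toy data: three layers spread over two shards; layer 1 straddles both.
toy_shards = {
    "s1": {"model.layers.0.a": 1, "model.layers.0.b": 2, "model.layers.1.a": 3},
    "s2": {"model.layers.1.b": 4, "model.layers.2.a": 5, "model.layers.2.b": 6},
}
toy_weight_map = {n: s for s, tensors in toy_shards.items() for n in tensors}

written = []
peak_held = stream_convert(
    toy_weight_map,
    load_shard=toy_shards.__getitem__,
    dequantize=lambda raw: raw * 2,
    write_layer=lambda layer, tensors: written.append((layer, sorted(tensors))),
)
print(written, peak_held)
```

In the toy run at most two dequantized tensors are resident at any time, versus all six with the accumulate-then-write pattern. In a real patch the peak would be bounded by the largest layer plus whatever partially seen layers straddle shard boundaries.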
Asks
- Is there a supported low-memory / streaming mode for the K2.6 (or any K2-Thinking/K2.5) converter? The K2.6 runbook doesn't mention host-memory requirements.
- If not, what host SKU does the MaxText team use to convert K2.6? m1-ultramem-160 (3.75 TB) appears necessary but is unavailable in zones where v5e Lite TPU quota lives (us-east5, europe-west4).
- Would a per-layer streaming patch be welcome upstream? Happy to PR it if there's interest.
Workarounds considered
- Cross-region: convert in us-central1 (has m1-ultramem-160), pay ~$80 GCS egress to land the Orbax in us-east5.
- Patch the script to dequant + emit per layer, freeing after each.
- Use the scanned converter then unscan at decode time — does this avoid the same accumulation?