
docs(runpod): fix pod setup for Gemma-4 on unsloth:latest image #5

Open
FloLey wants to merge 43 commits into master from claude/find-grpo-training-script-zTUFX

Conversation


@FloLey FloLey commented Apr 12, 2026

Summary

Four follow-up doc fixes discovered while actually running the pipeline on a RunPod pod with unsloth/unsloth:latest. All changes are confined to training/RUNPOD_SETUP.md.

  • Pin transformers==5.5.0 — the stock image ships 4.57.1 which predates the gemma4 architecture id (KeyError: 'gemma4' at load), and pip install -U transformers overshoots to 5.5.3 which breaks unsloth-zoo (caps at <=5.5.0).
  • Replace huggingface-cli login with hf auth login — the old CLI now prints a deprecation warning and no longer works on the current image.
  • Install hf_transfer — unsloth-zoo uses it when available for faster model downloads.
  • Drop the earlier huggingface_hub<1.0 pin — superseded by the transformers upgrade (5.5.0 accepts hf_hub ≥ 1.0).

No code changes. The script as merged on master is working; these are pod-environment fixes.
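The version window implied by the pins above (new enough to know the `gemma4` architecture id, but capped at 5.5.0 by unsloth-zoo) can be expressed as a quick stdlib-only sanity check — a hypothetical helper, not part of the repo:

```python
def transformers_version_ok(version: str,
                            minimum=(5, 0, 0),
                            maximum=(5, 5, 0)) -> bool:
    """True when `version` both knows the gemma4 architecture id
    (>= 5.0 in this scenario) and satisfies unsloth-zoo's <=5.5.0 cap."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return minimum <= parts <= maximum
```

The stock image's 4.57.1 fails the lower bound, and `pip install -U`'s 5.5.3 fails the cap — only the exact 5.5.0 pin lands inside the window.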

Test plan

  • pip install -U wandb hf_transfer succeeds on unsloth/unsloth:latest.
  • pip install 'transformers==5.5.0' succeeds (vLLM conflict warning is harmless, vLLM unused).
  • python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('unsloth/gemma-4-E4B-it').model_type)" prints gemma4.
  • hf auth login + hf auth whoami works end-to-end.
  • Full python connect4_train.py --model e4b-8bit --stage grpo --hf-repo Betha/connect4-agent-e4b-8bit launches past the model-load step (to be confirmed by the user on their running pod).

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY

claude added 4 commits April 12, 2026 17:52
The unsloth/unsloth:latest RunPod image ships transformers 4.57, which
requires huggingface_hub<1.0. `pip install -U huggingface_hub` upgrades
to 1.10 and breaks transformers. Also add hf_transfer for faster model
downloads (unsloth_zoo uses it when available).

The unsloth/unsloth:latest image ships huggingface_hub with the new `hf`
CLI; `huggingface-cli login` prints a deprecation warning and no longer
works.

unsloth/unsloth:latest ships transformers 4.57.1, which doesn't know
the `gemma4` architecture id; loading unsloth/gemma-4-E*B-it fails with
KeyError: 'gemma4'. The newer transformers release fixes that and also
accepts huggingface_hub >= 1.0, so we can drop the earlier <1.0 pin.

unsloth-zoo 2026.4.6 requires transformers<=5.5.0 and `pip install -U
transformers` overshoots to 5.5.3. Pin exactly to 5.5.0, which has
Gemma 4 support and satisfies unsloth-zoo.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the RunPod setup documentation, specifically focusing on the installation of dependencies and the Hugging Face login process. The review feedback identifies several issues: the specified version of transformers (5.5.0) and the Gemma 4 model do not exist, the hf_transfer utility requires an environment variable to be activated, and the suggested Hugging Face login command is non-standard.

Comment thread training/RUNPOD_SETUP.md
Comment on lines +39 to +45
# - unsloth:latest ships transformers 4.57, which predates Gemma 4's
# `gemma4` architecture id (KeyError: 'gemma4' at load time).
# - unsloth-zoo caps transformers at <=5.5.0, so pin to 5.5.0 exactly
# (latest `pip install -U transformers` overshoots to 5.5.3+).
# Upgrading transformers pulls in huggingface_hub >= 1.0, which is fine.
pip install -U wandb hf_transfer
pip install 'transformers==5.5.0'

medium

The version numbers and model names mentioned here (transformers 4.57/5.5.0, huggingface_hub 1.0, Gemma 4) do not correspond to currently released software or models. transformers is currently in the 4.x range (e.g., 4.49.0) and huggingface_hub is in the 0.x range (e.g., 0.28.0). Additionally, 'Gemma 4' is not a released model (Gemma 2 is the current version). Attempting to install transformers==5.5.0 will fail with a 'No matching distribution found' error. Please verify the correct versions and model names for your environment.

Owner Author


These versions and models do exist — the reviewer's knowledge is out of date. Verified live on unsloth/unsloth:latest today:

$ pip install 'transformers==5.5.0'
Successfully installed transformers-5.5.0

$ python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('unsloth/gemma-4-E4B-it').model_type)"
gemma4

transformers 5.x is the current major line, huggingface_hub is at 1.x, and Gemma 4 is Google's current flagship release with Unsloth-maintained weights at https://huggingface.co/unsloth/gemma-4-E4B-it.


Generated by Claude Code

Comment thread training/RUNPOD_SETUP.md Outdated
# - unsloth-zoo caps transformers at <=5.5.0, so pin to 5.5.0 exactly
# (latest `pip install -U transformers` overshoots to 5.5.3+).
# Upgrading transformers pulls in huggingface_hub >= 1.0, which is fine.
pip install -U wandb hf_transfer

medium

Installing hf_transfer is not sufficient to enable its functionality. You must also set the HF_HUB_ENABLE_HF_TRANSFER environment variable to 1 to activate the faster download logic.

Suggested change
pip install -U wandb hf_transfer
pip install -U wandb hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

Owner Author


Good catch — applied in 0b9006c. Added export HF_HUB_ENABLE_HF_TRANSFER=1 right after the install.


Generated by Claude Code

Comment thread training/RUNPOD_SETUP.md
huggingface-cli login
# Login to HF (paste write token when prompted) — required for auto-push.
# The new CLI is `hf`; `huggingface-cli` is deprecated.
hf auth login

medium

huggingface-cli login is the standard command provided by the huggingface-hub package and is not deprecated. Additionally, the subcommand for the newer hf tool is typically hf login rather than hf auth login. It is recommended to use the standard huggingface-cli login for better compatibility across environments.

Suggested change
hf auth login
huggingface-cli login

Owner Author


huggingface-cli login is deprecated on current huggingface_hub releases. Verified on unsloth/unsloth:latest:

$ huggingface-cli login
Warning: `huggingface-cli` is deprecated and no longer works. Use `hf` instead.
Hint: `hf` is already installed! Use it directly.
Examples:
  hf auth login

The new CLI uses subcommand groups — hf auth login, hf auth whoami, etc. hf login is not a valid command on current releases. Keeping hf auth login.


Generated by Claude Code

claude added 25 commits April 12, 2026 18:04
…sfer

Installing the package isn't enough — huggingface_hub only uses the
accelerated backend when the env var is exported.

Addresses review feedback on #5.

The trl shipped with unsloth:latest has a bug in trl/extras/vllm_client.py
that unconditionally imports vllm_ascend (Huawei NPU pkg) at module load,
so `from trl import GRPOTrainer` fails on any non-NPU host — even when
use_vllm=False. `pip install -U trl` pulls a release where that import
is conditional.

Previously --stage defaulted to grpo, launching GSPO directly against
base Gemma 4 with no format warmup. reward_format returns -10 for any
output that isn't exactly a bare digit 0-6 (after stripping <think>),
so the first few hundred GSPO steps burned budget teaching format
instead of optimizing move quality.

The sft stage already does a 50-step format warmup (~5-10 min) on
1000 examples, checks compliance, and auto-chains into GSPO when
compliance is >=80%. Making it the default means users get the right
behavior out of the box; --stage grpo is still available explicitly
for resuming after pod death.

Also update the top-of-file docstring usage and the four RunPod launch
commands to match (dropping the now-redundant --stage grpo).

…onfig

After Unsloth patches Gemma 4, model.config ends up with a non-JSON-
serializable function object, so transformers' config.to_json_file
blows up when plain save_pretrained tries to write config.json. Unsloth
ships save_pretrained_merged which knows how to strip/serialize these
patched fields correctly.

Swap the vanilla merge + save for save_pretrained_merged(save_method=
"merged_16bit"). Also reorder: save BEFORE merge_and_unload since the
Unsloth helper is attached to the PEFT wrapper, which merge_and_unload
consumes.

…save

Two-layer defense against Gemma 4 + Unsloth leaving a non-JSON-
serializable function object on model.config, which crashes
config.to_json_string at every merged-save path (vanilla save_pretrained,
save_pretrained_merged, save_pretrained_gguf):

1. run_sft now saves only the LoRA adapter (PEFT's adapter save writes
   adapter_config.json + adapter_model.safetensors without touching the
   base-model config, so it's immune to this bug). run_grpo then loads
   base + SFT adapter and merges in memory before attaching fresh GSPO
   LoRA — no merged model ever hits disk during SFT->GSPO handoff.

2. Add _sanitize_config_for_save helper that walks a PretrainedConfig
   (and its text_config / vision_config / audio_config sub-configs) and
   deletes attributes whose value is a function / method / lambda /
   builtin. Call it right before every remaining save path
   (export_model's merge + save, and as a defensive no-op in run_sft).

Final export flow (merge_and_unload -> save_pretrained -> save_pretrained_gguf)
still goes through transformers' serializer, so the sanitize helper is
what keeps that unblocked.
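The shape of that sanitize helper, roughly — a hypothetical reconstruction from this description (the real `_sanitize_config_for_save` lives in connect4_train.py and may differ):

```python
import inspect

def sanitize_config_for_save(config,
                             sub_names=("text_config", "vision_config",
                                        "audio_config")):
    """Delete attributes whose value is a function/method/lambda/builtin
    so JSON serialization no longer chokes, then recurse into sub-configs."""
    for name, value in list(vars(config).items()):
        if (inspect.isfunction(value) or inspect.ismethod(value)
                or inspect.isbuiltin(value)):
            delattr(config, name)
    for sub in sub_names:
        child = getattr(config, sub, None)
        if child is not None:
            sanitize_config_for_save(child)
    return config
```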

… path

Gemma 4's tokenizer is actually a multimodal Processor. When called with
tokenize=True it runs visual-content extraction that assumes each
message's "content" is a list of {"type": ...} blocks; our plain-string
content trips `TypeError: string indices must be integers, not 'str'`
at transformers/processing_utils.py:1807.

prepare_sft_dataset and prepare_grpo_dataset got away with it by using
tokenize=False, which keeps the renderer on the text path. The format
compliance check in run_sft and the full loop in run_eval both used
tokenize=True and blew up.

Switch those two sites to: render chat template to string first with
tokenize=False, then pipe the string through the tokenizer in a separate
call. Same result, bypasses the multimodal branch.

Unsloth's patched Gemma4Processor.__call__ has signature
(self, images=None, text=None, videos=None, **kwargs). A positional
first arg (tokenizer(my_text, ...)) lands in `images`, leaves `text`
as None, and the inner Gemma4Processor then crashes on text[0]:

    File ".../processing_gemma4.py", line 130, in __call__
        elif not isinstance(text, list) and not isinstance(text[0], str):
    TypeError: 'NoneType' object is not subscriptable

Fix = always pass text as explicit keyword: tokenizer(text=..., ...).
Confirmed as the correct pattern in Unsloth's Gemma 4 docs.

Applied in both sites that take the tokenize=True path — run_sft's
format-compliance verification loop and run_eval's full generation loop.

…fits

After SFT, run_sft held refs to: the 8-bit base model, the dequantized
bf16 merged model used for compliance checking, the Trainer + optimizer
state, and the tokenized dataset. When run_grpo then called
FastLanguageModel.from_pretrained(..., load_in_8bit=True) on the 24 GB
4090, accelerate saw insufficient free VRAM and tried to push some
layers to CPU; bnb-8bit refuses that and throws "Some modules are
dispatched on the CPU or the disk".

Delete merged / model / trainer / dataset (guarded with NameError in
case a path never created them), double gc.collect(), empty the CUDA
cache, and synchronize before returning. This runs on both the "passed
compliance" and "failed compliance" paths.
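The cleanup pattern, sketched with hypothetical names (the real code dels local variables guarded by `NameError`; here a dict stands in for the local namespace):

```python
import gc

def free_sft_memory(namespace: dict) -> None:
    # Drop the big references; pop(..., None) mirrors the NameError
    # guard for code paths that never created a given object.
    for name in ("merged", "model", "trainer", "dataset"):
        namespace.pop(name, None)
    gc.collect()
    gc.collect()  # second pass frees objects released by the first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        pass
```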

…k peft

Stock PeftModel.from_pretrained hits Gemma 4's custom Gemma4ClippableLinear
layer during adapter re-injection and crashes:

    ValueError: Target module Gemma4ClippableLinear(...) is not supported.
    Currently, only the following modules are supported: torch.nn.Linear, ...

apply_lora worked during SFT because it went through Unsloth's
FastLanguageModel.get_peft_model, which knows about Gemma4ClippableLinear.
Loading a saved adapter via stock peft doesn't get those patches.

Unsloth's FastLanguageModel.from_pretrained accepts a LoRA adapter path
(reads adapter_config.json, pulls the matching base, applies the adapter)
and returns an Unsloth-patched PEFT model whose merge_and_unload handles
the custom linear correctly.

Route run_grpo's "SFT adapter exists" path through that call instead.
The "no SFT adapter" path still uses plain load_model_and_tokenizer.

…erge

Previous approach merged the SFT LoRA into the 8-bit base then applied a
fresh GSPO LoRA on top. Peft's own warning was the giveaway: "Merge lora
module to 8-bit linear may get different generations due to rounding
errors" — the merge into 8-bit quantizes the delta back to int8 and the
SFT learning doesn't survive. Telltale symptom: GSPO started with
completion_length=1.25 and reward_format/mean=-7.25, identical to a
cold-start from base Gemma 4 (SFT's 100% format compliance evaporated).

Drop the merge. Let Unsloth's FastLanguageModel load base + SFT adapter;
the adapter comes back attached and trainable, so GSPO just continues
training it. No fresh apply_lora call on top (that would stack a second
adapter and trigger the "Already found a peft_config" warning).

…ndtrip

Save → reload of the SFT LoRA adapter was silently losing the SFT
weights. Symptoms: SFT trained to 100% format compliance in verification,
then GSPO's first step logged completion_length=1.25 and
reward_format/mean=-7.25 — identical to cold start from base Gemma 4.
The LoRA delta survives the save but is effectively wiped when the 8-bit
base is re-quantized on reload (peft even warns: "Merge lora module to
8-bit linear may get different generations due to rounding errors").
We tried three different save/reload variants, all had the same result.

Restructure so run_sft returns the live PEFT-wrapped, SFT-trained model
(and tokenizer) and main() passes both directly into run_grpo. Zero
save/reload between stages — the SFT weights are the exact same object
GSPO starts optimizing.

run_grpo gains optional model/tokenizer kwargs. Existing --stage grpo
resume-from-disk path is kept as a best-effort fallback (still lossy,
but the user is told the in-memory SFT->GSPO flow is preferred).

run_sft now returns (model, tokenizer) on success or None on failure,
and no longer deletes model/tokenizer in its cleanup block (trainer,
dataset, and the dequantized merged model are still freed).

End-to-end audit turned up the real handoff bug. run_sft ended with:

    merged = model.merge_and_unload()
    ...  # verification using merged
    return model, tokenizer

peft's merge_and_unload MUTATES the original model — adapter layers are
removed and weights are baked into the base. After the call, `model` is
no longer a PeftModel; it's the plain base class (Gemma4ForCausalLM).
So main() handed GSPO a non-PEFT merged model. GRPOTrainer then either
OOMs (8 B trainable params) or sees 0 trainable params, and when routed
through the disk-reload fallback the SFT delta had also been lossily
re-quantised to int8 — which explains why every prior SFT run produced
GSPO metrics identical to cold-start (completion_length 1.25,
reward_format -7.25) regardless of how the reload was done.

Fix: skip the merge. Verify format compliance on the PEFT-wrapped model
directly — PEFT's forward pass already runs as base + adapter·delta
dynamically, so model.eval()/generate works fine and leaves the adapter
intact. model.train() before returning restores dropout for GSPO.

Refs: peft issues #1836, #2105, #868 (merge_and_unload mutation
semantics). Adjacent known issue unsloth#3069 (SFT→GRPO matmul shape
mismatch on Gemma) noted — fix is pip install 'trl==0.19.1' if it bites.

…teps

Connect 4 with single-digit outputs is a degenerate case for GRPO —
after SFT the model becomes confident enough that within-group
completions collapse to identical tokens, killing advantage variance.
Previous run showed frac_reward_zero_std averaging 0.75-1.0 (most
groups had zero reward std → zero gradient signal).
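The diagnostic quoted above amounts to counting groups whose rewards are all identical — an illustrative sketch (TRL logs its own version of this metric):

```python
def frac_reward_zero_std(reward_groups):
    """Fraction of generation groups whose rewards are all identical.
    Such groups have zero advantage variance and contribute no gradient."""
    zero = sum(1 for g in reward_groups if max(g) == min(g))
    return zero / len(reward_groups)
```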

Bump num_generations from 3-4 to 6-8 across variants so each prompt
gets more shots at producing disagreement. Keep batch_size == num_gen
so TRL's divisibility requirement is met; drop grad_accum to 1 (or 2
on e4b-bf16 where VRAM is tighter).

Also add debug output:
- RewardCalculator prints the best completion (with board + oracle
  scores + reward) every 5 reward-func calls so we can see what the
  model is actually generating.
- run_grpo reports prompt token lengths on the first 5 examples +
  current temperature and max_completion_length at launch, so config
  sanity is visible at a glance.

difficulty_score previously returned -std and sort_by_difficulty used
reverse=True. Descending order on a negative key puts the least-negative
value (= lowest std = HARDEST position, where all 7 column scores agree
and nothing can be learned) at index 0. GaussianCurriculumSampler's
bucket 0 therefore held unlearnable endgame positions, starving early
GRPO of gradient signal.

Debug output from last run confirmed this: sample step 5 printed an
18-move-deep position with all oracle scores ≈ -0.99 — the worst
possible teaching example.

Return +std as the key; keep reverse=True so the sort is descending by
std. Bucket 0 now holds high-std positions where one move is clearly
best or clearly worst — exactly the kind of example that produces
within-group reward variance and real gradient.
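A minimal sketch of the corrected key (field names are hypothetical; the repo's actual data layout may differ):

```python
import statistics

def difficulty_score(oracle_scores):
    # Higher std = the 7 column scores disagree = one move is clearly
    # better or worse — exactly what produces within-group reward variance.
    return statistics.pstdev(oracle_scores)

def sort_by_difficulty(positions):
    # reverse=True on a POSITIVE key: descending std, learnable first.
    return sorted(positions,
                  key=lambda p: difficulty_score(p["scores"]),
                  reverse=True)
```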

Previous debug printed only the single best completion every 5 steps.
Switch to printing ALL N completions in the group, every training step,
with each one's parsed column and reward. All entries in a
reward_move_quality call belong to the same group (our config sets
per_device_train_batch_size == num_generations), so they share the
board and oracle row — one prompt header, N generation lines.

Format per line:
  [k] col=<parsed> reward=<±X.XX>  '<completion text>'

Long completions truncated to 150 chars; embedded newlines escaped so
each generation stays on one line.
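The per-line format could be produced by something like this (hypothetical helper mirroring the description):

```python
def format_generation_line(k, col, reward, completion, max_chars=150):
    # Escape newlines so each generation stays on one line, then truncate.
    text = completion.replace("\n", "\\n")
    if len(text) > max_chars:
        text = text[:max_chars]
    return f"[{k}] col={col} reward={reward:+.2f}  '{text}'"
```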

…think

Three coordinated changes to make the model actually use <think>...</think>:

1. FILLER_THOUGHTS: replace the short repetitive phrases with 15 longer
   reasoning-style templates (15-25 tokens each). Short filler gave SFT
   no reason to learn the think wrapper — the high-entropy random content
   meant the wrapper was the only predictable signal, and even that got
   skipped. Longer, substantive thoughts make the wrapper meaningful and
   consistent to predict.

2. SFT budget 4x: 1000 → 5000 examples, 50 → 200 steps. ~8 min instead
   of ~2. Locks in the think-then-digit template reliably enough that
   GRPO generations contain think blocks from step 1, giving the new
   reward_thinking function non-zero gradient signal.

3. reward_thinking reward function: +1.0 for a <think>...</think> block
   with ≥10 chars of content, −0.5 if tag present but empty, −2.0 if no
   tag at all. Scale is ~1/10 of reward_move_quality (±10) so it nudges
   behavior without dominating. Registered alongside reward_format and
   reward_move_quality in run_grpo's reward_funcs.

Together: SFT establishes a think-block prior, GSPO reward_thinking
maintains it during RL while reward_move_quality drives toward correct
columns.
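The thresholds above (+1.0 / −0.5 / −2.0) as a per-completion sketch — hypothetical; TRL reward functions actually take batched completion lists:

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reward_thinking(completion: str) -> float:
    m = _THINK.search(completion)
    if m is None:
        return -2.0   # no tag at all
    if len(m.group(1).strip()) < 10:
        return -0.5   # tag present but (nearly) empty
    return 1.0        # substantive think block
```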

…hinking

Last run confirmed SFT was learning the continuation of a think block but
not reliably starting one: reward_thinking stuck at -2 on every step with
completions of exactly 2 tokens (digit + EOS). Cause is exposure bias —
teacher-forced SFT drives predictable mid-sequence tokens to near-zero
loss (digit, </think>, \n, EOS) and leaves most of the residual loss on
the single first-token decision <think> vs digit, which base Gemma 4's
prior easily wins.

Fix: append `<think>` to every GRPO prompt after apply_chat_template. The
model's first generated token is now INSIDE the think block; it only has
to continue thinking and close the block, which SFT taught well.

Support in parsing helpers:
- strip_thinking now also strips the closing-only pattern `...</think>`
  at the start of text (completions that start mid-think because of the
  prepended tag).
- reward_thinking accepts both `<think>...</think>` (spontaneous) and
  `...</think>` (prefix-driven) forms; content < 10 chars → -0.5, no tag
  at all → -2.0, substantive → +1.0.
- is_clean_output and extract_column_from_response pick up the change
  for free (they route through strip_thinking).

Same trick is used in DeepSeek-R1 and the gpt-oss GRPO notebook.
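The two-form stripping logic, sketched (a plausible reconstruction, not the repo's exact code):

```python
import re

def strip_thinking(text: str) -> str:
    # Spontaneous form: a full <think>...</think> block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Prefix-driven form: the prompt ended with <think>, so the
    # completion starts mid-think and carries only the closing tag.
    text = re.sub(r"^.*?</think>", "", text, flags=re.DOTALL)
    return text.strip()
```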

…2.5 Pro

Replace the generic FILLER_THOUGHTS (random 15-token sentences) with
per-position, board-grounded reasoning produced by Gemini 2.5 Pro. SFT
then learns real "look at threats / control / forced wins → pick this
column" reasoning instead of "emit generic filler then the digit."

New file: training/generate_thoughts.py
- Reads GOOGLE_API_KEY from ./.env, training/.env, backend/.env, or ../.env
  (loader searches in that order, falls back to os.environ).
- Uses the google-genai SDK (pip install google-genai).
- Teacher prompt tells Gemini the board + per-column oracle scores +
  best column, asks for 2-4 sentences of first-person thinking under
  80 words that leads to that column. System prompt forbids revealing
  the scores or saying "the best column is X".
- Concurrency via ThreadPoolExecutor. Resume-safe: rerunning skips
  move_sequences already in the output JSONL.
- --dry-run prints the first rendered prompt without API calls.
- Default model is gemini-2.5-pro (~$7.50 for 5000 traces). Pass
  --model gemini-2.5-flash for ~15x cheaper draft-quality traces.

Output JSONL: {"move_sequence": "...", "best_col": N, "thought": "..."}
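The resume-safe rerun boils down to reading the keys already present in the output file before scheduling work — a sketch against the JSONL schema shown above (hypothetical helper):

```python
import json
import os

def completed_move_sequences(out_path: str) -> set:
    """Keys already written; rerunning skips these, so an interrupted
    generation run resumes instead of re-spending API budget."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["move_sequence"])
    return done
```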

Plug into SFT (training/connect4_train.py):
- prepare_sft_dataset now loads connect4_thoughts.jsonl at dataset build
  time (via _load_thoughts_jsonl) and uses the per-position thought
  when available. Falls back to the generic FILLER_THOUGHTS for any
  position not covered by the JSONL, and prints how many fell back.

Generated by training/generate_thoughts.py — can be ~1-5 MB and is
easier to manage via scp or a HuggingFace dataset repo than to commit
into the main repo.

…e logging

Gemini 2.5 Pro does internal "thinking" before producing output text, and
those thinking tokens count against max_output_tokens. With the previous
default of 220 the model was burning the entire budget on thinking and
returning empty text — reported as "empty response" with no diagnostics.

Changes:
- max_output_tokens default 220 -> 2048.
- New --thinking-budget (default 512) wired through ThinkingConfig so
  we cap how much goes to thinking and guarantee room for the actual
  response. Set 0 to disable thinking on supported models.
- _diagnose_empty() inspects the response on empty-text failures and
  reports prompt_feedback.block_reason, candidates[0].finish_reason,
  safety_ratings, and usage_metadata (incl. thoughts_token_count) so we
  can tell whether a failure is a safety block, token exhaustion, or
  something else.
- New --verbose flag prints every retry attempt reason + a preview of
  each successful trace as it lands.
- First 3 traces always previewed regardless of --verbose, as a
  sanity check at the start of every run.
- Output file now opened with buffering=1 (line-buffered), and after
  each write we flush() + os.fsync() so `tail -f` sees lines in real
  time and Ctrl+C can't lose a completed trace.
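The write path from the last bullet, sketched stdlib-only (hypothetical function name):

```python
import json
import os

def append_trace(path: str, record: dict) -> None:
    # buffering=1 = line-buffered; flush + fsync push each completed
    # trace to disk so `tail -f` sees it and Ctrl+C cannot lose it.
    with open(path, "a", buffering=1) as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())
```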

…a HF

GSPO at temperature=1.0 + max_completion_length=512 was generating
512-token base-Gemma-style board narration ("The current board state
is: Row 0: ...") that never closed </think> and never produced a digit.
Three reward functions all returned worst-case (-10, -10, -2) because
the completion never reached the answer.

Changes:
- temperature 1.0 -> 0.7. Keeps sampling close to the SFT-learned
  teacher-style distribution. Standard for distilled reasoning models.
- max_completion_length 512 -> 256. Forces concision and bounds the
  rollout cost when the model does drift.

User-asked debug + workflow ergonomics:
- run_sft compliance check (greedy, do_sample=False) now keeps every
  generation and prints all 20 raw responses verbatim after the % stat,
  so we can see whether the model is producing real
  <think>...</think>\n<digit> or just <digit>. Uses max_new_tokens=256
  so a full think block can be observed.
- Per-step GSPO debug prints only the BEST of N completions, full text
  (no truncation), with the prompt-prefixed "<think>" prepended for
  readability — instead of dumping all 8 truncated previews.
- run_sft pushes its LoRA adapter to <hf_repo>-sft as a private repo
  after saving locally. So the adapter survives pod restarts and a
  future fresh pod can resume into GSPO without re-running SFT.
- run_grpo's resume path (when called with model=None) now tries
  snapshot_download from <hf_repo>-sft if the local sft_dir is missing,
  before falling back to cold start.

…gnostics

Reverts the three hyperparam guesses from 061fe15 that were made without
analysis:

  temperature            0.7 -> 1.0     (GSPO default; was changed speculatively)
  max_completion_length  256 -> 512     (was changed speculatively)
  compliance max_new_tokens 256 -> 128  (was changed speculatively)

Keeps the genuinely useful bits of 061fe15:
- 20 compliance-check generations dumped verbatim after SFT (this is the
  data we need to see — is_clean_output strips <think> tags before
  checking for a digit, so the 20/20 compliance metric cannot distinguish
  between a full <think>{prose}</think>\n<digit> and just <digit>).
- Best-of-N debug print in GRPO rollouts, full text with prepended <think>.
- SFT adapter push to <hf_repo>-sft, and GRPO pull from same on resume.

Rationale: we need real data from the compliance dump before making
any more changes. The next fix is chosen after we see what SFT greedy
actually produces.

… align

Two pipelines silently disagreed on which "first 5000" meant:

- generate_thoughts.py: shuffle(seed=42) then take first 5000
- connect4_train.split_data: shuffle(seed=42) then
  sort_by_difficulty then first 5000

Same seed, same shuffle, but one sorts and the other doesn't. Result:
SFT's lookup-by-move_sequence found thoughts for ~125/5000 positions
(coincidental overlap); 4875/5000 silently fell back to the generic
FILLER_THOUGHTS list. "Loaded 5000 teacher thoughts" was true but
misleading — the bulk of SFT was still training on the generic filler.

Fix 1 (connect4_train.py::prepare_sft_dataset): DRIVE the dataset from
the thoughts dict. For every (move_sequence, thought) pair in the JSONL,
look up the corresponding entry in `data` (the sorted train split) and
build the SFT example. Report how many thoughts had no matching entry
(orphans). Falls back to the old filler-based path only when the JSONL
is empty.

Fix 2 (generate_thoughts.py): apply sort_by_difficulty after the shuffle
so future regenerations pick the same first-5000 as SFT. No regen
needed right now — Fix 1 already recovers the existing thoughts.
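The underlying invariant — both pipelines must apply identical shuffle and sort steps before slicing — can be illustrated with a toy selector (hypothetical; the real code sorts by oracle-score std, not by the row itself):

```python
import random

def select_first_k(rows, k, seed=42, sort_key=None):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    if sort_key is not None:
        # Sorting AFTER the shuffle changes which rows land in the
        # first k — both pipelines must agree on this step.
        rows.sort(key=sort_key, reverse=True)
    return rows[:k]
```

Two runs with the same seed and the same steps pick the same rows; add a sort to only one of them and the selections diverge, which is exactly the ~125/5000 overlap bug.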

…ance

Two diagnostics + one strict format check:

1) _dump_tokenizer_special_tokens prints the tokenizer's special-tokens
   map and probes our tags (`<think>`, `</think>`, `<|think|>`,
   `<|channel>`, `<|channel|>`) showing their token IDs and re-decoded
   form. Decisive for whether Gemma 4 treats `<think>` as a single
   special token (in which case skip_special_tokens=True at decode time
   strips it from the visible response) or as ordinary text.

2) _dump_first_example prints the FIRST rendered SFT training string
   AND the FIRST rendered GRPO prompt verbatim. We can compare them
   byte-for-byte to confirm the assistant turn header + prepended
   `<think>` in GRPO actually picks up where SFT's `<start_of_turn>model`
   left off.

3) is_thinking_output: STRICT format check. Requires a
   <think>{>=5 chars}</think> followed by a clean digit 0-6 (accepts
   either `<think>...</think>` spontaneous form OR `...</think>`
   when the prompt prefixed `<think>`). The SFT compliance check now
   uses this instead of the lenient is_clean_output. "20/20 passed" no
   longer hides the case where the model output a bare digit; it now
   only counts genuine think+digit completions. Also bump compliance
   max_new_tokens to 256 so a real think block has room to appear.

is_clean_output (lenient) is unchanged — still used by reward_format
during GRPO so partial-credit gradients flow when only a digit is emitted.
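A sketch of the strict check as described — the regex form is my own assumption; the repo's implementation may differ:

```python
import re

def is_thinking_output(text: str) -> bool:
    # Requires >=5 chars of think content AND a clean trailing digit 0-6.
    # Accepts both <think>...</think> (spontaneous) and the closing-only
    # ...</think> form produced when the prompt already prefixed <think>.
    text = text.strip()
    m = re.match(r"<think>(.*?)</think>\s*([0-6])$", text, re.DOTALL)
    if m is None:
        m = re.match(r"(.*?)</think>\s*([0-6])$", text, re.DOTALL)
    return m is not None and len(m.group(1).strip()) >= 5
```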

The 125/5000 thought mismatch wasn't a lookup bug, it was an
architectural smell: SFT was building a move_sequence -> entry lookup
against the 190k training CSV just to find each thought's best_col.
But the HF thoughts dataset already stores move_sequence, best_col,
AND thought per row — SFT needs nothing else from the CSV.

Rewrite prepare_sft_dataset to:
  1. Load Betha/connect4-thoughts from HF directly (datasets.load_dataset),
     with fallback to local connect4_thoughts.jsonl if HF is unreachable.
  2. Iterate the dataset rows — each row is self-contained. Build SFT
     examples without any CSV cross-reference.
  3. Fall back to FILLER_THOUGHTS on the first 1000 CSV rows only when
     no thoughts can be loaded at all (not the expected path).

Drop the train_data argument from prepare_sft_dataset; revert the
previous train_data/train_data[:5000] confusion entirely. The 190k CSV
remains loaded for GRPO (reward_move_quality score lookup + curriculum
sort_by_difficulty); SFT is decoupled.
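The load-with-fallback pattern in step 1 looks roughly like this. The loader is injected here so the fallback path is testable offline; the real script presumably calls `datasets.load_dataset("Betha/connect4-thoughts")` directly, and the local filename is taken from the commit text.

```python
import json


def load_thoughts(hf_loader, local_path="connect4_thoughts.jsonl"):
    """Load thoughts rows from HF; fall back to a local JSONL if unreachable.

    Each row is self-contained (move_sequence, best_col, thought), so the
    caller can build SFT examples without any CSV cross-reference.
    """
    try:
        return list(hf_loader())
    except Exception:
        with open(local_path) as f:
            return [json.loads(line) for line in f if line.strip()]
```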

Also extend the tokenizer diagnostics:
- Probe additional Gemma 4 candidate tags: <|/think|>, <channel|>,
  <|/channel|>, <|turn>, <turn|>. Lets us see exact token IDs for the
  closing tags (docs hinted <channel|> but that string wasn't probed).
- New _dump_native_thinking_render renders a toy conversation via
  apply_chat_template with several enable_thinking / chat_template
  variants, so we can read the canonical native-thinking format
  directly from the tokenizer instead of guessing from docs. Output
  of this probe drives the next commit's tag swap.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
claude added 14 commits April 13, 2026 19:05
Empirical tokenizer probe from commit f5a304d confirmed:
  <|think|>    = id 98   (special, 1 token) — enables thinking via system
  <|channel>   = id 100  (special, 1 token) — OPEN tag (pipe LEFT of word)
  <channel|>   = id 101  (special, 1 token) — CLOSE tag (pipe RIGHT of word)
  <|turn>      = id 105  (special, 1 token)
  <turn|>      = id 106  (special, 1 token)
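A probe along these lines produces the table above. This is a generic sketch that works against any HF-style tokenizer exposing `convert_tokens_to_ids` and `tokenize`; the printed IDs came from the actual Gemma 4 tokenizer, and the helper name is assumed.

```python
def probe_tags(tokenizer, tags):
    """For each candidate tag, report its ID and whether it is one token."""
    report = {}
    for tag in tags:
        tid = tokenizer.convert_tokens_to_ids(tag)
        n_pieces = len(tokenizer.tokenize(tag))
        report[tag] = {"id": tid, "single_token": n_pieces == 1}
    return report
```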

Our previous literal <think>/</think> tags were 3 ordinary tokens each
with no pretrained prior — SFT had to teach the format from scratch,
which it never locked in (greedy compliance "20/20" was a bare digit).
Gemma 4's native channel tokens have strong pretrained priors; SFT only
nudges.

Also verified empirically that apply_chat_template(enable_thinking=True)
prepends `<|think|>\n` to the system content but does NOT auto-wrap the
assistant turn — we do the channel wrapping manually.

Changes:
- prepare_sft_dataset assistant content:
    was: `<think>{thought}</think>\n{digit}`
    now: `<|channel>thought\n{thought}<channel|>{digit}`
  + enable_thinking=True on apply_chat_template so `<|think|>` lands
    in the system turn.
- prepare_grpo_dataset prompt prefix:
    was: prompt + "<think>"
    now: prompt + "<|channel>thought\n"
  + enable_thinking=True on the chat template call.
- strip_thinking, is_thinking_output, reward_thinking regexes updated
  to match `<|channel>(.*?)<channel|>` and the closing-only
  `^(.*?)<channel|>`. Legacy `<think>...</think>` patterns kept as
  fallback for any stray old completions still in flight.
- Reward magnitudes and thresholds unchanged.
- Debug print prepends `<|channel>thought\n` for readability.
- Minor: two "teaching the model to output ..." log lines updated to
  show the native tag format.

Same Gemini 5000 thoughts on HF — no regeneration needed, same prose,
just wrapped in native tags instead of literal text.
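The updated extraction, with the legacy fallback, can be sketched in one helper (names and the single-function shape are assumptions — the script keeps separate strip_thinking / is_thinking_output / reward_thinking callers):

```python
import re

# Native open+close, closing-only (prompt prefixed the opener), legacy tags.
# Tried in that order so a full native match wins over the closing-only form.
_NATIVE = re.compile(r"<\|channel>(.*?)<channel\|>", re.DOTALL)
_CLOSE_ONLY = re.compile(r"^(.*?)<channel\|>", re.DOTALL)
_LEGACY = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (thought, remainder-after-close), or (None, text) if no match."""
    for pat in (_NATIVE, _CLOSE_ONLY, _LEGACY):
        m = pat.search(text)
        if m:
            return m.group(1), text[m.end():]
    return None, text
```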

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
The SFT compliance check and run_eval were still calling
apply_chat_template without enable_thinking=True and without prefixing
`<|channel>thought\n`. Result: the SFT-trained model was being tested
on an out-of-distribution prompt format — the strict compliance count
would fail even when SFT had actually learned the native format.

Both sites now match the training/inference format exactly:
  apply_chat_template(..., enable_thinking=True)  # puts <|think|> in system
  rendered = rendered + "<|channel>thought\n"     # prefill thinking channel

Also bumped run_eval's max_new_tokens from 128 to 256 so a full
<|channel>thought\n{reasoning}<channel|>{digit} sequence has room.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…seness rewards

Last SFT+GSPO run confirmed: Gemma 4's pretrained thinking mode (activated
by <|think|> + <|channel>) overrides SFT's teacher-short style. 200 SFT
steps can't move an 8B model's pretrained CoT behavior. The model reasons
verbosely, gets truncated at max_completion_length, and never closes
<channel|>.

Switch strategy per user decision: trust the pretrained thinking mode,
give it room to finish, and use reward shaping (not SFT) to enforce the
closing + digit-only answer.

Changes:
- --stage default: sft -> grpo. SFT is now opt-in.
- MODEL_CONFIGS: all four variants now num_gen=4, batch=4, grad_accum=2.
  Bigger completions cost more compute per rollout; fewer generations
  keep per-step wall clock sane. Effective batch size unchanged (8).
- max_completion_length 512 -> 1536 everywhere (GRPOConfig + log lines).
  Gemma's native thinking can run 500-1500 tokens before closing.
- New reward_closes_and_answers: STRICT format reward.
    +5  if `<channel|>` followed by exactly a single digit 0-6
    -10 if `<channel|>` present but answer malformed ("column 3" etc.)
    -20 if `<channel|>` never appears (rambled past budget)
  Largest magnitude so closing behavior is learned first.
- New reward_conciseness: BONUS for CORRECT answers with short thinking.
    Linear decay +3 at <=200 chars -> 0 at >=1500 chars; 0 for wrong
    answers regardless of length. Only pushes the model to be briefer
    when it would otherwise be right — never rewards being briefly wrong.
- Dropped reward_thinking (redundant with reward_closes_and_answers now).
  The reward_format helper stays in the class for manual experimentation
  but is no longer registered.
- reward_funcs list now: closes_and_answers, move_quality, conciseness.

Total reward envelope per completion:
  worst:   -20 (no close) + -10 (no valid col)  + 0           = -30
  typical:  +5 (closed)   + oracle * 10         + 0..3        = -7 .. +18

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
Two targeted upgrades to the GRPO reward set:

1) reward_conciseness — continuous quality factor instead of binary gate.
   Before: bonus fired only when col == best_col (binary).
   After:  reward = 3.0 * max(0, oracle_score) * length_factor.
   Near-best moves now share the bonus in proportion to their oracle
   score. A move scoring +0.8 gets 80% of what a +1.0 move gets; a
   losing or neutral move (oracle <= 0) still gets 0. Length factor
   unchanged (linear decay 200 -> 1500 chars). Smoother gradient
   surface in positions with multiple near-equal good moves.
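The continuous bonus is small enough to state directly. A sketch with the thresholds from this commit (a later commit moves them to 600/5000 chars); the function name matches the message, the exact clamping is an assumption:

```python
def reward_conciseness(oracle_score: float, n_chars: int) -> float:
    """3.0 * max(0, oracle_score) * linear length factor.

    Length factor is 1.0 at <= 200 chars, decaying linearly to 0.0 at
    >= 1500 chars. Oracle scores <= 0 (neutral/losing moves) earn nothing.
    """
    length_factor = max(0.0, min(1.0, (1500 - n_chars) / (1500 - 200)))
    return 3.0 * max(0.0, oracle_score) * length_factor
```

A move scoring +0.5 at 200 chars earns 1.5 — half the +3.0 a best move gets — which is the proportional sharing described above.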

2) reward_closes_and_answers — accept light trailing punctuation.
   New RewardCalculator._DIGIT_ONLY regex:
       ^\s*([0-6])\s*[.!?,;:\s]*$
   Accepts "3", "3.", " 3 ", "3\n", "3 !" as +5. Rejects "33", "3 4",
   "column 3", "The answer is 3" as -10 (prose). Reduces noise from
   trivial format variations early in training without letting prose
   sneak through.

No structural changes to the reward list or magnitudes (+5 / -10 / -20
for closes_and_answers, +-10 for move_quality, [0, 3] for conciseness).

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…old)

Previous behavior: --stage grpo on a fresh pod would auto-pull the SFT
adapter from <hf_repo>-sft and start GSPO on top of it. That made the
resume-after-pod-death flow work, but it also silently pulled stale SFT
adapters (from earlier literal-<think>-tag experiments) into runs
intended to cold-start with Gemma 4's native pretrained thinking.

New behavior:
  - `python connect4_train.py --model ... --hf-repo ...` → cold start
    from base Gemma 4. Gemma's pretrained thinking mode handles format;
    GSPO rewards teach column selection.
  - `python connect4_train.py --model ... --hf-repo ... --sft` → resume
    from <hf_repo>-sft (pulled if not local). Use when you've run SFT
    recently and want to start GSPO from a known-good warmup.

The --sft gate wraps the existing lookup logic verbatim, so nothing
changes about how the adapter is loaded once opted in.
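The gate reduces to an `argparse` flag plus one lookup decision. Flag names come from the commit; the parser shape and helper are a sketch, not the script's full CLI:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--hf-repo", required=True)
parser.add_argument("--stage", default="grpo")
parser.add_argument("--sft", action="store_true",
                    help="resume GSPO from <hf_repo>-sft instead of cold-starting")


def sft_adapter_repo(args):
    """Repo to pull the SFT adapter from, or None for a cold start."""
    return f"{args.hf_repo}-sft" if args.sft else None
```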

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…atch

Step 2 of the last run proved native thinking works — the model reached
the correct conclusion internally ("This is the winning move" for col 1,
matching oracle best_col=1) but hit the 1536-token cap before emitting
the formal `<channel|>1` close. All 8 completions failed for the same
reason → reward_std=0 → zero gradient → no learning.

Changes:
- max_completion_length 1536 -> 4096. The reasoning fits in ~1200-1500
  tokens; +1500-2000 more for the conclusion + formal close. Generous
  budget that lets the model close naturally.
- reward_conciseness thresholds 200/1500 -> 600/5000 chars to match
  the new budget. 600 chars ≈ 150 tokens (full bonus for very concise
  closings); 5000 chars ≈ 1250 tokens (no bonus but no penalty — the
  model still gets its move-quality and format rewards).
- SYSTEM_TEMPLATE: append an explicit closing instruction reminding
  the model to emit `<channel|>` + column number at the end. Gemma
  already knows `<channel|>` as a pretrained special token; this just
  biases it toward using it after reaching a decision.

Expected: first few steps still mostly unclosed (reward floor), but
the budget increase should let ≥1 of 8 close within 20-30 steps,
restoring within-group reward variance so the GRPO gradient starts
flowing. The conciseness bonus then teaches "close earlier" over
hundreds of steps.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
Current r=16 gave 42M trainable params vs 8B frozen (0.5%). That's
designed for quick stylistic adaptations, not for overriding strong
pretrained priors like Gemma 4's verbose thinking-mode behavior. The
SFT failures and the GRPO zero-variance both trace back in part to
insufficient capacity to reshape pretrained behavior.

r=128 gives ~340M trainable (~4% of base) — standard rank for
reasoning fine-tunes (DeepSeek-R1 distillation, Unsloth reasoning
notebooks). Four things improve:
- GRPO learns "close earlier, prefer this column" behaviors faster
  because the LoRA delta has more degrees of freedom.
- If the user ever runs --stage sft again, that path also gets 8x
  more capacity and can actually move the pretrained distribution.
- reward_conciseness gradient has more room to carve behavior.
- Exported merged model will have richer behavioral changes baked in.

lora_alpha tracks r automatically via get_config (alpha = r * 2, so
now 256). No other config change needed.

VRAM budget (e4b-8bit, 4096 completion, batch=4, num_gen=4):
  +1-1.5 GB optimizer state (adamw_8bit). Comfortable within 24 GB.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…stripped

DISCOVERED by inspecting the first rendered SFT training example at
r=128: Gemma 4's chat template was emitting `<|turn>model\n3<turn|>`
with NO thinking content. The entire `<|channel>thought\n{thought}<channel|>`
block we put into the assistant `content` field was being deleted by
the template.

Cause: Gemma 4's chat template is designed to HIDE historical assistant
thinking in multi-turn conversations (documented behaviour). When given
an assistant turn with `<|channel>...<channel|>` in it, the template
treats it as a historical turn and strips the thinking block before
rendering. Only the post-close answer (the digit) survives.

Every SFT run we've done has trained on `<|turn>model\n<digit><turn|>`
— nothing about format, nothing about thinking. That's why:
  - SFT loss bottomed at 0.2 instead of <0.05 (model already knew
    "after user asks Connect 4 question, emit a digit" — there was
    nothing new to learn).
  - Strict compliance was 0/20 — the model was never taught to
    produce `<|channel>...<channel|>` format.
  - Behavior at inference was pretrained-thinking-mode default, with
    SFT contributing essentially nothing.

Fix in prepare_sft_dataset: render just system+user with
add_generation_prompt=True (template ends at "<|turn>model\n"), then
append the assistant turn manually:
    completion_text = f"<|channel>thought\n{thought}<channel|>{digit}<turn|>"
    full_text = prompt_text + completion_text
This bypasses the template's history-stripping entirely. The
`<|channel>...<channel|>` block now survives into the SFT training text.

Applied to both the teacher-thoughts path and the FILLER_THOUGHTS
fallback.
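End to end, the bypass looks like this. A sketch assuming an HF-style tokenizer; the f-string is the one from the fix above, the function name is hypothetical:

```python
def render_sft_text(tokenizer, messages, thought, digit):
    """Render system+user via the template, append the assistant turn manually.

    add_generation_prompt=True makes the template end at "<|turn>model\n",
    so the manual completion picks up exactly there and the
    <|channel>...<channel|> block survives into the SFT training text.
    """
    prompt_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=True)
    completion_text = f"<|channel>thought\n{thought}<channel|>{digit}<turn|>"
    return prompt_text + completion_text
```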

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
Root cause of today's 0/20 strict compliance (SFT actually worked):
`<|channel>` (id 100) and `<channel|>` (id 101) are SPECIAL tokens in
Gemma 4's tokenizer. When the compliance check decoded with
skip_special_tokens=True, `<channel|>` was stripped from the visible
text, leaving `"...reasoning.3"` — the digit there but the delimiter
invisible. Our `is_thinking_output` regex looked for literal
`<|channel>...<channel|>` and never matched, so every correct
completion failed the check.

Same problem for GRPO rewards: TRL's GRPOTrainer hard-codes
skip_special_tokens=True when decoding completions for reward
functions. Our `reward_closes_and_answers` checked `"<channel|>" in c`,
always got False, always returned -20. That's why zero-variance /
zero-gradient kept happening.

Fix strategy: don't rely on seeing the special delimiter at all. After
stripping, a correct completion looks like "{reasoning prose}{digit}".
Parse with a _TRAILING_DIGIT regex that anchors to the end of the
string, and use a _strip_trailing_specials helper to handle either
decode mode (with or without special tokens preserved).

Changes:
- New module-level _TRAILING_DIGIT = r'([0-6])\s*[.!?,;:\s]*$'
- New _strip_trailing_specials helper that pops `<turn|>`,
  `<end_of_turn>`, `<|endoftext|>`, `<channel|>`, `<|channel>`,
  `<|think|>` from the tail before parsing.
- extract_column_from_response: strip specials, find trailing digit.
- is_clean_output (lenient): True iff extract_column_from_response works.
- is_thinking_output (strict): True iff trailing digit AND >= 5 chars of
  content before it.
- strip_thinking: now a no-op (kept for backward-compat callers).
- Removed unused RewardCalculator._DIGIT_ONLY regex.
- reward_closes_and_answers: scored on post-strip text. +5 for
  reasoning (>=20 chars) + trailing digit, -5 for minimal reasoning
  + digit, -10 for bare digit, -20 for no trailing digit.
- Compliance decode in run_sft: skip_special_tokens=False so the 20
  dumped responses show `<|channel>...<channel|>3<turn|>` explicitly.
- run_eval decode: skip_special_tokens=False for parity.
- Compliance print-line updated to reflect the new strict rule.
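The decode-mode-agnostic parse above can be sketched as (names from the commit; exact tail-stripping loop is an assumption):

```python
import re

# Digit at end of string, tolerating light trailing punctuation/whitespace.
_TRAILING_DIGIT = re.compile(r"([0-6])\s*[.!?,;:\s]*$")
_SPECIAL_TAILS = ("<turn|>", "<end_of_turn>", "<|endoftext|>",
                  "<channel|>", "<|channel>", "<|think|>")


def _strip_trailing_specials(text: str) -> str:
    """Pop special-token tails so both decode modes parse identically."""
    text = text.rstrip()
    changed = True
    while changed:
        changed = False
        for tok in _SPECIAL_TAILS:
            if text.endswith(tok):
                text = text[:-len(tok)].rstrip()
                changed = True
    return text


def extract_column_from_response(text):
    m = _TRAILING_DIGIT.search(_strip_trailing_specials(text))
    return int(m.group(1)) if m else None
```

With skip_special_tokens=True a correct completion decodes to `"...reasoning.3"`; with specials preserved it ends `"...<channel|>3<turn|>"` — both yield the same column.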

With this landed:
- The previous SFT run's outputs would score 20/20 strict (all have
  reasoning + trailing digit).
- GRPO rewards will no longer floor at -20 because text contains no
  `<channel|>` — they score the actual output.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…display

Step 1 of GSPO ran beautifully: 89-char Gemini-style completions,
reward +9 on average, reward_std > 0, step time 80s. SFT+GSPO are
working end to end. But step 2 crashed with CUDA OOM in
efficient_log_softmax allocating ((seq_chunk, vocab=262144), fp16).

Root cause: max_completion_length=4096 makes TRL/Unsloth pre-allocate
the log-softmax buffer for the worst-case sequence length, even though
SFT-trained completions are ~30 tokens in practice. At r=128 + 4-gen
rollouts + 8-bit base + optimizer state, the pre-alloc tips past 24 GB.

Drop max_completion_length 4096 -> 1024 everywhere (GRPOConfig + log
line). Still 30x headroom vs observed output length, and it cuts the
log-softmax buffer by 4x — eliminates OOM. No behavioral change; the
rewards measure the actual output length, not the budget.

Also: fix the per-step debug print to reconstruct the `<channel|>`
separator at the trailing-digit position. Before: the display showed
`...plays.2` with no visible separator, because TRL strips special
tokens before passing completions to reward functions. We already
prepended `<|channel>thought\n` at the start for readability; now we
also insert `<channel|>` before the trailing digit so the display
matches what the model actually generated:

    | <|channel>thought
    | ...future plays.<channel|>2

Purely cosmetic, no effect on training.

Recommend setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
before relaunching to help with fragmentation in the tight VRAM budget.
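Concretely, on the pod before relaunch (the allocator setting is the only assumption-free part; the relaunch command is the one from this PR's test plan):

```shell
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
# then relaunch, e.g.:
# python connect4_train.py --model e4b-8bit --stage grpo --hf-repo Betha/connect4-agent-e4b-8bit
```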

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…tokens

TRL's GRPOTrainer hardcodes skip_special_tokens=True when decoding
completions for reward functions. That was stripping our own format
markers (<|channel>, <channel|>, <turn|>) before our rewards ever saw
them. We worked around it with a trailing-digit heuristic, but that's
fragile for edge cases like "(5,3) moves" in the middle of reasoning.

Real fix: flip `<|channel>` (id 100) and `<channel|>` (id 101) from
special → non-special on the tokenizer immediately after load. Same
single-token IDs, same pretrained priors during training, but
skip_special_tokens=True at decode no longer strips them. Rewards
now receive the literal `<channel|>` marker and can parse structurally.

<|think|> (98) stays special — it's only in system prompts, never in
completions. <|turn>/<turn|> stay special — we actively want them
stripped from reward text.

Simplifications enabled by this:
- reward_closes_and_answers: split on `<channel|>`, check if the
  text after is a clean digit. +5/-10/-20 on three structural cases.
  No more trailing-digit heuristic; "(5,3) wins" style prose that
  happens to end in a digit no longer earns false credit.
- extract_column_from_response: split on `<channel|>`, int(after).
- is_thinking_output: same split + >= 5 chars of reasoning before
  the marker.
- Debug print: no reconstruction — `<channel|>` is already in the text
  we received. Only the opener `<|channel>thought\n` is still
  prepended manually (because it was in the prompt, not completion).
- _SPECIAL_TAIL_TOKENS renamed to _TURN_END_TAILS (only `<turn|>` etc.
  still need stripping).

Training behavior is identical (token IDs and priors unchanged).
Only decoding semantics change.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…l|> tokens

TRL's GRPOTrainer hardcodes `batch_decode(completion_ids, skip_special_tokens=True)`
(line ~597 in grpo_trainer.py). There's an open feature request to
expose this as a config knob (huggingface/trl#2897, #3026) but nothing
has landed. For reasoning training where special tokens ARE the format
contract, the hardcoded stripping is a footgun.

Earlier attempt (add4611, reverted) tried to flip AddedToken.special=False
to make `<|channel>`/`<channel|>` non-special at the tokenizer level.
That's brittle across HF tokenizers versions + Unsloth patches —
silently didn't work (rewards saw -20 every step).

This commit uses a sturdier approach: monkey-patch
tokenizer.batch_decode on the instance to ALWAYS pass
skip_special_tokens=False, ignoring whatever the caller requested.
Applied in _force_batch_decode_keep_specials from
load_model_and_tokenizer. Simpler to reason about, no fast-tokenizer
Rust-state dance.

Effect: TRL's reward decode path receives completions with
`<|channel>thought\n{reasoning}<channel|>{digit}<turn|>` intact. Reward
and parse functions become structural:

- extract_column_from_response: split on `<channel|>`, parse the digit
  after (with turn-tails stripped). None if marker absent or answer
  malformed. No more trailing-digit regex heuristic.
- is_thinking_output: require `<channel|>` + valid digit after + >=5
  chars of content before. Robust to models self-opening a second
  `<|channel>` block.
- reward_closes_and_answers: three structural cases only:
    +5  : `<channel|>` present AND answer is a bare digit 0-6
    -10 : `<channel|>` present but answer is prose
    -20 : no `<channel|>` at all
  No more middle tier for "minimal content" — text length heuristic is
  gone.
- Debug print: `<channel|>` is in best_completion verbatim, no more
  reconstruction. We still prepend `<|channel>thought\n` for context
  since that opener was in the prompt, not the completion.

strip_thinking, is_clean_output, reward_conciseness, reward_move_quality
all work unchanged — they consume extract_column_from_response, which
is structurally cleaner but behaves the same on well-formed output.

_TRAILING_DIGIT regex removed (no longer needed).
_SPECIAL_TAIL_TOKENS renamed to _TURN_END_TAILS (only `<turn|>` etc.
still need stripping from the answer portion now).
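With specials intact, the structural parser needs no heuristics. A sketch of the split-based extraction described above (names from the commit, details assumed):

```python
import re

_TURN_END_TAILS = ("<turn|>", "<end_of_turn>", "<|endoftext|>")


def extract_column_from_response(text):
    """Split on the now-visible close marker; None if absent or malformed."""
    if "<channel|>" not in text:
        return None
    after = text.rsplit("<channel|>", 1)[1]
    for tok in _TURN_END_TAILS:
        after = after.replace(tok, "")
    after = after.strip()
    # Exactly one digit 0-6 — "(5,3) wins" style prose earns nothing.
    return int(after) if re.fullmatch(r"[0-6]", after) else None
```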

Training behavior is IDENTICAL (token IDs and pretraining priors
unchanged). Only the reward-parsing side changes.

https://claude.ai/code/session_0155BCSv9dyCuio44HLbEWaY
…onflict

vLLM 0.16-0.19 pin transformers<5, but Gemma 4 requires transformers>=5.5.0
(vllm-project/vllm#39216). Unsloth's official guidance for Gemma 4 + GRPO
is to disable fast_inference so Unsloth's native generation path is used
instead of vLLM.

With vllm uninstalled + fast_inference=False, trl's is_vllm_available()
returns False, short-circuiting the broken vllm_ascend import branch —
no file patches needed.