[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed and make simulation dataset always-fresh#47249

Closed
aprilk-ms wants to merge 1 commit into main from users/aprilk/sample-fixes-22

@aprilk-ms
Member

Three small fixes uncovered while bug-bashing the public 2.2.0 samples against a live Foundry project.

1. Unicode print crashes on Windows cp1252 (16 sample files)

The "completed / failed" summary print statements at the end of 16 evaluation samples used U+2713 ✓, U+2717 ✗, U+2705 ✅, and U+274C ❌. On a fresh Windows shell (stdout = cp1252, no PYTHONIOENCODING=utf-8), printing any of these raises UnicodeEncodeError, so the sample exits non-zero even though the run already succeeded on the service. Three of my multi-turn samples failed this way before I forced UTF-8 in my shell.

Replaced all 36 occurrences with ASCII [OK] / [FAIL] tokens. This is a purely mechanical change.
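The failure mode can be reproduced without a Windows shell. The following is a minimal sketch, not the code from this PR: `can_encode` and `status_line` are hypothetical helper names, and the check encodes to cp1252 directly instead of writing to stdout.

```python
# Hypothetical helpers for illustration; not taken from the samples.
GLYPHS = ["\u2713", "\u2717", "\u2705", "\u274c"]  # the four glyphs named above


def can_encode(text: str, encoding: str = "cp1252") -> bool:
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def status_line(ok: bool, message: str) -> str:
    """Build a summary line that uses an ASCII-only status token."""
    token = "[OK]" if ok else "[FAIL]"
    return f"{token} {message}"


if __name__ == "__main__":
    for glyph in GLYPHS:
        print(f"U+{ord(glyph):04X} encodable in cp1252: {can_encode(glyph)}")
    print(status_line(True, "Evaluation run completed successfully!"))
```

Because the tokens are plain ASCII, the summary line encodes under cp1252 as well as UTF-8, so the samples no longer depend on the shell's code page.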

2. Finetuning seed document was too small (sample_dataset_generation_job_simpleqna_for_finetuning.py)

The embedded 1.2 KB SEED_REFERENCE_DOCUMENT passes the eval scenario but the SUPERVISED_FINETUNING scenario rejects it with:

File content lacks sufficient context to generate quality questions.

Confirmed empirically: an identical script with a ~10 KB version of the same widgets/gizmos/sprockets reference doc generates the requested 15 QnA pairs cleanly (96 s run). Expanded the seed to that ~10 KB version (same domain, much richer prose) and updated the surrounding comment to explain the requirement.

3. Simulation sample silently fell back to a stale cached dataset (sample_multiturn_conversation_simulation.py)

The previous code uploaded simulation-scenarios:v1. On the second run the upload inevitably failed with "dataset already exists"; the sample swallowed that exception and pointed the eval run at the cached server-side dataset URI. That made the sample non-reproducible: anyone editing the local JSONL would never see their changes, and the Expected: {3 * 2} print became wrong as soon as a prior session had uploaded a scenario file of a different size (in my bug-bash session it printed Expected: 6 but actually ran 10 conversations).

Now the dataset name is suffixed with a per-run id (same pattern as the datagen samples) so every invocation uploads fresh data, and the "expected" count is computed from the actual JSONL row count.
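A rough sketch of the two ideas in this fix, under stated assumptions: the helper names are hypothetical, the id format (8 hex characters from a UUID) is a guess and not necessarily what the datagen samples use, and `per_scenario` is an assumed multiplier standing in for the hard-coded factor in the old `3 * 2` expression.

```python
import uuid


def per_run_dataset_name(base: str) -> str:
    """Append a short random id so each run uploads under a new name.

    Assumption: 8 hex characters; the real samples may use another scheme.
    """
    return f"{base}-{uuid.uuid4().hex[:8]}"


def count_jsonl_rows(path) -> int:
    """Count non-blank lines in a JSONL file (one scenario per line)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def expected_conversations(path, per_scenario: int = 1) -> int:
    """Derive the expected count from the file instead of a constant.

    Assumption: `per_scenario` is a hypothetical per-row multiplier.
    """
    return count_jsonl_rows(path) * per_scenario
```

Deriving the count from the file keeps the printed expectation in step with whatever JSONL is actually uploaded.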

Verification

  • All 17 modified files pass py_compile cleanly.
  • Live end-to-end run of sample_multiturn_conversation_evaluation.py on a default Windows shell now prints [OK] Evaluation run completed successfully! and exits 0 (previously: UnicodeEncodeError).
  • All 14 attempted samples passed in my bug-bash with these fixes (rerun matrix in commit description).
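
The py_compile check in the first bullet can be scripted. This is a minimal sketch using the standard-library `py_compile` module; `find_compile_failures` is a hypothetical helper name and the file list is whatever the caller passes in.

```python
import py_compile


def find_compile_failures(paths):
    """Byte-compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = str(exc)
    return failures
```

An empty result means every file compiled. Note that this only catches syntax-level errors; it does not execute the samples.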

Follow-up (not in this PR)

A few of the trace-based samples (sample_dataset_generation_job_traces_for_*, sample_multiturn_trace_evaluation_*) require pre-existing agent traces in the project to be runnable. I will open a separate issue/PR to make those self-contained by creating the agent and seeding traces inline.

…etuning seed; make simulation dataset always-fresh

Three small fixes uncovered while bug-bashing the 2.2.0 samples against a live
Foundry project:

1. Unicode print crashes on Windows cp1252 (16 sample files, 36 print lines).
   Final summary prints used U+2713/U+2717/U+2705/U+274C glyphs which crash
   the script when stdout is the default Windows code page (cp1252) and the
   user has not exported PYTHONIOENCODING=utf-8. Replace the glyphs with
   ASCII [OK] / [FAIL] tokens so the samples succeed out of the box on a
   fresh Windows shell.

2. Finetuning seed document was too small for SUPERVISED_FINETUNING QnA
   generation. The 1.2 KB embedded reference doc passed the eval scenario
   but the finetuning scenario rejected it with "File content lacks
   sufficient context to generate quality questions." Expanded the seed to
   a ~10 KB widgets/gizmos/sprockets reference (same domain, much richer
   prose) which lets the service synthesize the requested 15 QnA pairs.
   Also updated the surrounding comment to explain the size requirement.

3. Simulation sample silently fell back to a stale cached dataset on
   re-runs. The previous code uploaded simulation-scenarios:v1 and, on
   the inevitable "already exists" failure for the second run, swallowed
   the exception and pointed the run at the cached server-side dataset.
   That made the sample non-reproducible: anyone editing the local JSONL
   would not see their changes. Now the dataset name is suffixed with a
   per-run id so every invocation uploads fresh data, and the "expected
   conversation count" line is computed from the actual JSONL row count
   rather than a hard-coded constant.

Verified locally: all modified files compile and a live end-to-end run of
sample_multiturn_conversation_evaluation.py prints "[OK] Evaluation run
completed successfully!" cleanly on a default Windows shell.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aprilk-ms aprilk-ms changed the title from "[ai/azure-ai-projects] fix sample print crashes on cp1252; expand finetuning seed; make simulation dataset always-fresh" to "[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed; make simulation dataset always-fresh" May 31, 2026
@aprilk-ms aprilk-ms changed the title from "[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed; make simulation dataset always-fresh" to "[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed and make simulation dataset always-fresh" May 31, 2026
@aprilk-ms
Member Author

Split into three focused PRs (rebased onto main) for easier review:

@aprilk-ms aprilk-ms closed this May 31, 2026