[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed and make simulation dataset always-fresh by aprilk-ms · Pull Request #47249 · Azure/azure-sdk-for-python

aprilk-ms · 2026-05-31T02:19:23Z

Three small fixes uncovered while bug-bashing the public 2.2.0 samples against a live Foundry project.

1. Unicode print crashes on Windows cp1252 (16 sample files)

The "completed / failed" summary print statements at the end of 16 evaluation samples used U+2713 ✓, U+2717 ✗, U+2705 ✅, and U+274C ❌. On a fresh Windows shell (stdout = cp1252, no PYTHONIOENCODING=utf-8), none of these glyphs can be encoded, so print() raises UnicodeEncodeError and the sample exits non-zero even though the run already succeeded on the service. Three of my multi-turn samples failed this way before I forced UTF-8 in my shell.

Replaced all 36 occurrences with ASCII [OK] / [FAIL] tokens; this is a purely mechanical change.
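For reviewers without a Windows box, the failure can be reproduced on any platform by wrapping a byte buffer in a cp1252 text stream. This is a minimal sketch, not the code from the samples: `status_token` and `encodable` are hypothetical helper names, and the in-memory `console` only stands in for a real cp1252 stdout.

```python
import io


def status_token(ok: bool) -> str:
    """Return an ASCII-only status marker that any console encoding can print."""
    return "[OK]" if ok else "[FAIL]"


def encodable(text: str, encoding: str) -> bool:
    """Report whether `text` can be encoded with `encoding` without error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Stand-in for a Windows console: a text stream that encodes to cp1252.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp1252")

# The old-style summary line: U+2713 has no cp1252 mapping, so print() raises.
try:
    print("\u2713 Evaluation run completed successfully!", file=console)
    console.flush()
    glyph_crashed = False
except UnicodeEncodeError:
    glyph_crashed = True

# The ASCII replacement is written without error on the same stream.
print(status_token(True), "Evaluation run completed successfully!", file=console)
console.flush()
```

The same `encodable` check returns False for all four glyphs listed above under cp1252, which is why an ASCII token is the simplest fix that needs no environment setup.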

2. Finetuning seed document was too small (`sample_dataset_generation_job_simpleqna_for_finetuning.py`)

The embedded 1.2 KB SEED_REFERENCE_DOCUMENT passes the eval scenario but the SUPERVISED_FINETUNING scenario rejects it with:

File content lacks sufficient context to generate quality questions.

Confirmed empirically: an identical script with a ~10 KB version of the same widgets/gizmos/sprockets reference doc generates the requested 15 QnA pairs cleanly (96 s run). Expanded the seed to that ~10 KB version (same domain, much richer prose) and updated the surrounding comment to explain the requirement.
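A local size check could surface this before a job is submitted. The sketch below is only an illustration: `ASSUMED_MIN_SEED_BYTES` and `seed_looks_sufficient` are hypothetical names, and the 10,000-byte figure is simply the size that worked in this bug-bash, not a threshold documented by the service.

```python
# Assumption: roughly 10 KB worked for SUPERVISED_FINETUNING in this bug-bash.
# The service's real acceptance criteria are not known from this PR.
ASSUMED_MIN_SEED_BYTES = 10_000


def seed_size_bytes(document: str) -> int:
    """Return the UTF-8 encoded size of the seed document in bytes."""
    return len(document.encode("utf-8"))


def seed_looks_sufficient(document: str, minimum: int = ASSUMED_MIN_SEED_BYTES) -> bool:
    """Heuristic only: True when the seed is at least `minimum` bytes long."""
    return seed_size_bytes(document) >= minimum


# A seed near the old 1.2 KB size fails the heuristic; a ~10 KB one passes.
small_seed = "x" * 1_200
large_seed = "x" * 10_240
small_ok = seed_looks_sufficient(small_seed)
large_ok = seed_looks_sufficient(large_seed)
```

Size is only a proxy here; the service error message talks about context, so a long but repetitive document may still be rejected.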

3. Simulation sample silently fell back to a stale cached dataset (`sample_multiturn_conversation_simulation.py`)

The previous code uploaded simulation-scenarios:v1. On the second run the upload inevitably failed with "dataset already exists"; the sample swallowed the exception and pointed the eval run at the cached server-side dataset URI. That made the sample non-reproducible: anyone editing the local JSONL would never see their changes, and the Expected: {3 * 2} print became wrong as soon as the cached dataset came from a different-sized scenario file uploaded in a prior session (in my bug-bash session it printed Expected: 6 but actually ran 10 conversations).

Now the dataset name is suffixed with a per-run id (same pattern as the datagen samples) so every invocation uploads fresh data, and the "expected" count is computed from the actual JSONL row count.
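The two ideas can be sketched without touching the service. This is an illustration under stated assumptions, not the sample's code: `fresh_dataset_name` and `count_jsonl_rows` are hypothetical helpers, the 8-character uuid suffix is one possible per-run id, and the per-row multiplier is a guess at what the old `3 * 2` expression encoded.

```python
import json
import tempfile
import uuid
from pathlib import Path


def fresh_dataset_name(base: str) -> str:
    """Append a short random per-run id so each upload gets a new dataset name."""
    return f"{base}-{uuid.uuid4().hex[:8]}"


def count_jsonl_rows(path: Path) -> int:
    """Count non-blank lines in a JSONL file, one scenario per line."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Build a throwaway scenario file with five rows and a trailing blank line.
tmp_dir = Path(tempfile.mkdtemp())
scenarios_path = tmp_dir / "scenarios.jsonl"
rows = [{"scenario": f"case-{i}"} for i in range(5)]
scenarios_path.write_text(
    "\n".join(json.dumps(row) for row in rows) + "\n\n", encoding="utf-8"
)

dataset_name = fresh_dataset_name("simulation-scenarios")
conversations_per_scenario = 2  # hypothetical multiplier, not taken from the sample
expected = count_jsonl_rows(scenarios_path) * conversations_per_scenario
print(f"Uploading {dataset_name}; Expected: {expected}")
```

Deriving the count from the file means the printed expectation tracks local edits. The trade-off of per-run names is that each invocation leaves a new dataset behind in the project.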

Verification

All 17 modified files compile cleanly with py_compile.
Live end-to-end run of sample_multiturn_conversation_evaluation.py on a default Windows shell now prints [OK] Evaluation run completed successfully! and exits 0 (previously: UnicodeEncodeError).
All 14 attempted samples passed in my bug-bash with these fixes (rerun matrix in commit description).

Follow-up (not in this PR)

A few of the trace-based samples (sample_dataset_generation_job_traces_for_*, sample_multiturn_trace_evaluation_*) require pre-existing agent traces in the project to be runnable. I will open a separate issue/PR to make those self-contained by creating the agent and seeding traces inline.

…etuning seed; make simulation dataset always-fresh

Three small fixes uncovered while bug-bashing the 2.2.0 samples against a live Foundry project:

1. Unicode print crashes on Windows cp1252 (16 sample files, 36 print lines). Final summary prints used U+2713/U+2717/U+2705/U+274C glyphs which crash the script when stdout is the default Windows code page (cp1252) and the user has not exported PYTHONIOENCODING=utf-8. Replace the glyphs with ASCII [OK] / [FAIL] tokens so the samples succeed out of the box on a fresh Windows shell.

2. Finetuning seed document was too small for SUPERVISED_FINETUNING QnA generation. The 1.2 KB embedded reference doc passed the eval scenario but the finetuning scenario rejected it with "File content lacks sufficient context to generate quality questions." Expanded the seed to a ~10 KB widgets/gizmos/sprockets reference (same domain, much richer prose) which lets the service synthesize the requested 15 QnA pairs. Also updated the surrounding comment to explain the size requirement.

3. Simulation sample silently fell back to a stale cached dataset on re-runs. The previous code uploaded simulation-scenarios:v1 and, on the inevitable "already exists" failure for the second run, swallowed the exception and pointed the run at the cached server-side dataset. That made the sample non-reproducible: anyone editing the local JSONL would not see their changes. Now the dataset name is suffixed with a per-run id so every invocation uploads fresh data, and the "expected conversation count" line is computed from the actual JSONL row count rather than a hard-coded constant.

Verified locally: all modified files compile and a live end-to-end run of sample_multiturn_conversation_evaluation.py prints "[OK] Evaluation run completed successfully!" cleanly on a default Windows shell.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

aprilk-ms · 2026-05-31T17:46:48Z

Split into three focused PRs (rebased onto main) for easier review:

[ai/azure-ai-projects] fix evaluation sample print crashes on cp1252 consoles #47253 — cp1252 print crash fix (16 evaluation samples, mechanical)
[ai/azure-ai-projects] expand finetuning seed conversations so simpleqna sample succeeds #47254 — finetuning seed expansion (one sample, substantive)
[ai/azure-ai-projects] make multiturn simulation sample's dataset always-fresh #47255 — multiturn simulation always-fresh dataset (one sample, substantive)

github-actions Bot added the AI Projects label May 31, 2026

aprilk-ms changed the title ~~[ai/azure-ai-projects] fix sample print crashes on cp1252; expand finetuning seed; make simulation dataset always-fresh~~ [ai/azure-ai-projects] Eval sample fixes: expand finetuning seed; make simulation dataset always-fresh May 31, 2026

aprilk-ms changed the title ~~[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed; make simulation dataset always-fresh~~ [ai/azure-ai-projects] Eval sample fixes: expand finetuning seed and make simulation dataset always-fresh May 31, 2026

aprilk-ms closed this May 31, 2026

aprilk-ms wants to merge 1 commit into main from users/aprilk/sample-fixes-22
