[ai/azure-ai-projects] Eval sample fixes: expand finetuning seed and make simulation dataset always-fresh#47249
Closed
aprilk-ms wants to merge 1 commit into
…etuning seed; make simulation dataset always-fresh

Three small fixes uncovered while bug-bashing the 2.2.0 samples against a live Foundry project:

1. Unicode print crashes on Windows cp1252 (16 sample files, 36 print lines). Final summary prints used U+2713/U+2717/U+2705/U+274C glyphs, which crash the script when stdout is the default Windows code page (cp1252) and the user has not exported PYTHONIOENCODING=utf-8. Replace the glyphs with ASCII [OK] / [FAIL] tokens so the samples succeed out of the box on a fresh Windows shell.

2. Finetuning seed document was too small for SUPERVISED_FINETUNING QnA generation. The 1.2 KB embedded reference doc passed the eval scenario, but the finetuning scenario rejected it with "File content lacks sufficient context to generate quality questions." Expanded the seed to a ~10 KB widgets/gizmos/sprockets reference (same domain, much richer prose), which lets the service synthesize the requested 15 QnA pairs. Also updated the surrounding comment to explain the size requirement.

3. Simulation sample silently fell back to a stale cached dataset on re-runs. The previous code uploaded simulation-scenarios:v1 and, on the inevitable "already exists" failure for the second run, swallowed the exception and pointed the run at the cached server-side dataset. That made the sample non-reproducible: anyone editing the local JSONL would not see their changes. Now the dataset name is suffixed with a per-run id so every invocation uploads fresh data, and the "expected conversation count" line is computed from the actual JSONL row count rather than a hard-coded constant.

Verified locally: all modified files compile, and a live end-to-end run of sample_multiturn_conversation_evaluation.py prints "[OK] Evaluation run completed successfully!" cleanly on a default Windows shell.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced May 31, 2026
Split into three focused PRs (rebased onto
Three small fixes uncovered while bug-bashing the public 2.2.0 samples against a live Foundry project.
1. Unicode print crashes on Windows cp1252 (16 sample files)

The "completed / failed" summary print statements at the end of 16 evaluation samples used U+2713 ✓, U+2717 ✗, U+2705 ✅, and U+274C ❌. On a fresh Windows shell (stdout = cp1252, no PYTHONIOENCODING=utf-8), Python's stdio writer crashes with UnicodeEncodeError and the sample exits non-zero after the run already succeeded on the service. Three of my multi-turn samples failed this way before I forced UTF-8 in my shell.

Replaced all 36 occurrences with ASCII [OK] / [FAIL] tokens, a purely mechanical change.

2. Finetuning seed document was too small (sample_dataset_generation_job_simpleqna_for_finetuning.py)

The embedded 1.2 KB SEED_REFERENCE_DOCUMENT passes the eval scenario, but the SUPERVISED_FINETUNING scenario rejects it with:

"File content lacks sufficient context to generate quality questions."

Confirmed empirically: an identical script with a ~10 KB version of the same widgets/gizmos/sprockets reference doc generates the requested 15 QnA pairs cleanly (96 s run). Expanded the seed to that ~10 KB version (same domain, much richer prose) and updated the surrounding comment to explain the requirement.
3. Simulation sample silently fell back to a stale cached dataset (sample_multiturn_conversation_simulation.py)

The previous code uploaded simulation-scenarios:v1, and on the inevitable "dataset already exists" failure for the second run it swallowed the exception and pointed the eval run at the cached server-side dataset URI. That made the sample non-reproducible: anyone editing the local JSONL would never see their changes, and the Expected: {3 * 2} print was wrong whenever a prior session had uploaded a different-sized scenario file (in my bug-bash session it printed Expected: 6 but actually ran 10 conversations).

Now the dataset name is suffixed with a per-run id (same pattern as the datagen samples) so every invocation uploads fresh data, and the "expected" count is computed from the actual JSONL row count.
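The two changes in fix 3 can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: fresh_dataset_name and count_jsonl_rows are hypothetical helper names, the uuid-based suffix is an assumed choice of per-run id, and the upload call itself is omitted.

```python
import json
import tempfile
import uuid
from pathlib import Path


def fresh_dataset_name(base: str = "simulation-scenarios") -> str:
    """Return a dataset name with a short random per-run suffix (hypothetical helper)."""
    return f"{base}-{uuid.uuid4().hex[:8]}"


def count_jsonl_rows(path: Path) -> int:
    """Count the non-blank lines of a JSONL file, checking that each one parses as JSON."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed row
                count += 1
    return count


# Demo with a throwaway scenario file; a real sample would point at its own JSONL.
with tempfile.TemporaryDirectory() as tmp:
    scenarios = Path(tmp) / "scenarios.jsonl"
    scenarios.write_text('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n', encoding="utf-8")
    row_count = count_jsonl_rows(scenarios)

dataset_name = fresh_dataset_name()
print(f"Uploading {row_count} scenarios as {dataset_name}")
```

Because the name changes on every run, the "already exists" branch is never reached, and the printed count always reflects the file that was actually read.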
Verification

All modified files: py_compile clean.
sample_multiturn_conversation_evaluation.py on a default Windows shell now prints [OK] Evaluation run completed successfully! and exits 0 (previously: UnicodeEncodeError).

Follow-up (not in this PR)

A few of the trace-based samples (sample_dataset_generation_job_traces_for_*, sample_multiturn_trace_evaluation_*) require pre-existing agent traces in the project to be runnable. I will open a separate issue/PR to make those self-contained by creating the agent and seeding traces inline.
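Reviewer note on fix 1: the crash can be reproduced on any platform by printing to a text stream whose encoding is cp1252, without needing a Windows shell. The sketch below is an illustration only; can_print is a hypothetical helper and is not part of the samples.

```python
import io

# The four glyphs the summary prints used: U+2713, U+2717, U+2705, U+274C.
GLYPHS = ["\u2713", "\u2717", "\u2705", "\u274c"]


def can_print(text: str, encoding: str = "cp1252") -> bool:
    """Return True if print() succeeds on a strict text stream with the given encoding."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)  # errors defaults to strict
    try:
        print(text, file=stream)
        stream.flush()
        return True
    except UnicodeEncodeError:
        return False


for glyph in GLYPHS:
    print(f"U+{ord(glyph):04X} printable on cp1252: {can_print(glyph)}")
print(f"[OK] printable on cp1252: {can_print('[OK] Evaluation run completed successfully!')}")
```

None of the four glyphs has a cp1252 mapping, which is why the ASCII tokens are the safer default for sample output.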