[ai/azure-ai-projects] make traces-for-evaluation sample self-contained (pilot)#47250
aprilk-ms wants to merge 8 commits into
Self-reviewed before requesting review. One finding addressed (f9d1248): dataset and job cleanup lived inside the try block, so a failure before the success-path deletes could leak the job (and possibly the dataset); both deletes now run in finally.
Re-live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 5 resources cleaned up via finally, exit 0.
Other suggestions considered and dismissed:
Ready for your review.
Follow-up readability pass (2b0191b): dropped the bring-your-own-agent mode and trimmed supporting plumbing so the sample is easier to follow for a first-time reader.
Before / after
What changed
Re-live-tested against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.
Rewrite samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so it works without prior setup.
By default, the sample now creates a temporary Foundry agent, runs three multi-turn Widgets & Gizmos conversations against it with GenAI content tracing enabled (configure_azure_monitor + AIProjectInstrumentor with enable_content_recording=True), waits for App Insights ingestion, then submits the existing data-generation job over a window that exactly brackets the seeded traces. The temporary agent, seeded conversations, generated dataset, and data-generation job are all cleaned up in a best-effort finally block.
Users who already have an agent with traces can opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; in that mode the sample skips agent creation, trace seeding, and ingestion wait and uses the existing FOUNDRY_TRACES_WINDOW_DAYS look-back window (default 7 days).
New seeding knobs (TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS, TRACE_INGESTION_WAIT_SECONDS) make timing tunable per environment.
Validated end-to-end against the build26-bug-bash project on gpt-5.1: self-contained run produced 15 generated samples and cleaned up all temporary resources successfully.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
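The "window that brackets the seeded traces" idea can be sketched as follows. This is a minimal illustration, not the sample's actual code: `seed_and_bracket`, `seed_fn`, and the `padding` margin are hypothetical names introduced here, and the padding is an assumption added to tolerate small clock skew.

```python
from datetime import datetime, timedelta, timezone


def seed_and_bracket(seed_fn, padding=timedelta(seconds=30)):
    """Run seed_fn and return a (start, end) UTC window around everything it did.

    seed_fn is a hypothetical callable that sends the seeded conversations.
    padding is an assumed safety margin, not a value taken from the sample.
    """
    start = datetime.now(timezone.utc) - padding
    seed_fn()  # every trace emitted here falls between start and end
    end = datetime.now(timezone.utc) + padding
    return start, end
```

The returned pair would then be passed as the trace time range of the data-generation job, in place of a fixed look-back such as FOUNDRY_TRACES_WINDOW_DAYS.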
Track submitted_job_id and created_dataset before the try block and move the dataset and job deletes into finally, alongside the existing conversation and agent cleanup.
Previously these two deletes lived inside try, so a polling failure, dataset-get failure, or any exception between job creation and the success-path deletes would leak the data-generation job (and possibly the dataset) on the unhappy path. Each step is now wrapped in its own best-effort try/except so a failure in one does not skip the others, and so cleanup never masks the real exception.
Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 5 resources cleaned up via finally, exit 0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
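The cleanup shape this commit describes can be sketched roughly as below. The variable names submitted_job_id and created_dataset come from the commit message; `run_with_cleanup` and its four callables are hypothetical stand-ins for the real client calls, which are not shown in this thread.

```python
def run_with_cleanup(create_job, poll_job, delete_job, delete_dataset):
    """Track created resources before try so finally can always delete them."""
    submitted_job_id = None
    created_dataset = None
    try:
        submitted_job_id = create_job()
        created_dataset = poll_job(submitted_job_id)  # may raise
        return created_dataset
    finally:
        # Each delete is independently best-effort: a failure is printed,
        # never raised, so it cannot skip later steps or mask the real error.
        if created_dataset is not None:
            try:
                delete_dataset(created_dataset)
            except Exception as exc:
                print(f"Dataset cleanup failed: {exc}")
        if submitted_job_id is not None:
            try:
                delete_job(submitted_job_id)
            except Exception as exc:
                print(f"Job cleanup failed: {exc}")
```

If poll_job raises, the job is still deleted and the original exception propagates to the caller unchanged.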
Drops the bring-your-own-agent mode and trims supporting plumbing so the sample is easier for a first-time reader to follow. Was 439 lines; now 282 (sister sample sample_dataset_generation_job_simpleqna_with_agent_source.py is 180).
Changes:
- Self-contained mode only. The if seed_traces / else split, the BYO env vars (FOUNDRY_AGENT_NAME, FOUNDRY_TRACES_WINDOW_DAYS, TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS), and the importlib.import_module dance for optional telemetry deps are gone. A 4-line note in the docstring tells BYO users which block to replace.
- Imports azure.monitor.opentelemetry and azure.ai.projects.telemetry directly at the top of the file.
- Shrinks AGENT_INSTRUCTIONS from ~40 lines to ~15 with only the policies the seeded prompts actually ask about.
- Drops the _safe_console helper and the per-turn preview prints. Cleanup output is still printed.
- Drops the opentelemetry force_flush try/except; the 180s ingestion wait covers exporter batching too.
- Replaces the four per-resource try/except cleanup blocks with a small _try_delete(label, fn, *args, **kwargs) helper.
Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trim verbose section banners and docstring prose; replace multi-line comments with single-line equivalents. No behavior change. 282 -> 257 lines (-9%).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the _try_delete helper with one inline try/except per resource. Each cleanup now reads top-to-bottom at the call site (preferred for sample readability) and drops a layer of indirection.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Persona collapsed to 4 inline declarations (14 -> 5 lines). The nested SEEDING_CONVERSATIONS list-of-lists is replaced by a flat SEED_PROMPTS list plus NUM_CONVERSATIONS constant; the seeding loop cycles each conversation through the same prompts. Behavior unchanged - still 3 x 5 = 15 turns and 15 generated samples (live-tested).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
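The flattened seeding loop can be illustrated as below. SEED_PROMPTS and NUM_CONVERSATIONS are the names from the commit message; the prompt texts, `seed_conversations`, and the `send_turn` callback are hypothetical placeholders, since the real prompts and client calls are not shown here.

```python
# Hypothetical prompt texts; the sample's actual prompts are not in this thread.
SEED_PROMPTS = [
    "What is your return policy?",
    "How long does standard shipping take?",
    "Do widgets come with a warranty?",
    "Can I change my order after placing it?",
    "Which payment methods do you accept?",
]
NUM_CONVERSATIONS = 3


def seed_conversations(send_turn):
    """Run every conversation through the same prompts and return the turn count."""
    turns = 0
    for conversation_index in range(NUM_CONVERSATIONS):
        for prompt in SEED_PROMPTS:
            send_turn(conversation_index, prompt)  # one turn in one conversation
            turns += 1
    return turns
```

With three conversations and five prompts this yields the 3 x 5 = 15 turns mentioned above.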
max_samples is a cap on generated samples, not a floor on input traces.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prompt agents emit server-side traces to the project's connected App Insights, so client-side AIProjectInstrumentor + configure_azure_monitor are not required. Hardcode poll/wait constants and dataset name (still uniqueified via run id). Verified live: PASS in 231s, 1 sample generated, clean teardown.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
981d6f0 to 416fdd8
What
Pilot: refactor samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so a new user can run it end-to-end without any prior setup (no pre-existing agent, no pre-seeded traces).
By default, the sample now runs in self-contained mode:
- […] sample_dataset_generation_job_simpleqna_with_agent_source.py).
- […] AIProjectInstrumentor(enable_content_recording=True) so calls to responses.create emit GenAI semantic spans (with message content) to Application Insights; the data-generation service reads gen_ai.agent.name and conversation content from these spans.
- […] DataGenerationJob over a window that exactly brackets the seeded traces.
- […] finally block.
Users who already have an agent with traces opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; the sample then skips creation, seeding, ingestion wait, and the new telemetry deps, and uses FOUNDRY_TRACES_WINDOW_DAYS (default 7d) as before.
New tuning knobs (all optional):
- TRACE_SEEDING_CONVERSATIONS (default 3)
- TRACE_SEEDING_TURNS (default 5)
- TRACE_INGESTION_WAIT_SECONDS (default 180)
Why
Bug bash feedback: the trace-based samples required users to first stand up an agent and run conversations through it before the sample worked. Most reviewers either skipped them or asked the team to provide a shared agent. Making the sample self-contained removes that gate while keeping the BYO path available for anyone who wants to point at a real agent.
This is a pilot for one sample. Pending review, I'll apply the same pattern to the other trace-dependent samples (sample_dataset_generation_job_traces_for_finetuning.py, sample_multiturn_trace_evaluation_*.py, sample_agent_trace_evaluation_smart_filter.py) in follow-ups.
Validation
Live test against build26-bug-bash project on gpt-5.1:
- […] DataGenerationJob SUCCEEDED → Generated samples: 15 → dataset, job, 3 conversations, and agent all deleted cleanly. Total wall-clock ≈ 6 min.
- […] test-agent (15 samples).
Notes for reviewers
- […] azure-monitor-opentelemetry, azure-core-tracing-opentelemetry); BYO users don't need them. They're loaded via importlib.import_module with a friendly ImportError if missing.
- AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING=true is forced (not setdefault) before AIProjectInstrumentor().instrument() because the instrumentor checks this env var at instrument-time and silently does nothing if it isn't set.
- _safe_console() wraps response previews because some Windows consoles default to cp1252 and the model frequently emits chars (smart quotes, non-breaking hyphens) that crash print() there. App Insights spans are unaffected.
Stacked-PR note
This PR is stacked on top of #47249 (sample print/finetuning/simulation fixes). It targets users/aprilk/sample-fixes-22; will retarget to main once #47249 merges.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
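On the "forced (not setdefault)" point in the notes for reviewers, the difference between the two assignments can be shown with a plain dictionary standing in for os.environ. The helper names `force_flag` and `default_flag` are hypothetical; only the variable name AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING comes from the description.

```python
FLAG = "AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING"


def force_flag(environ):
    """Overwrite any existing value, so the flag is always 'true' afterwards."""
    environ[FLAG] = "true"


def default_flag(environ):
    """Set the flag only if it is absent; a pre-existing value is left alone."""
    environ.setdefault(FLAG, "true")


# A shell that already exported the flag as "false" shows the difference.
preset = {FLAG: "false"}
default_flag(preset)
value_after_default = preset[FLAG]  # still "false"
force_flag(preset)
value_after_force = preset[FLAG]  # now "true"
```

With setdefault, a stale value inherited from the user's shell would survive, which is the case the forced assignment guards against.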