[ai/azure-ai-projects] make traces-for-evaluation sample self-contained (pilot) by aprilk-ms · Pull Request #47250 · Azure/azure-sdk-for-python

aprilk-ms · 2026-05-31T03:56:10Z

What

Pilot: refactor samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so a new user can run it end-to-end without any prior setup (no pre-existing agent, no pre-seeded traces).

By default, the sample now runs in self-contained mode:

Creates a temporary Foundry agent (Widgets & Gizmos persona, reused from sample_dataset_generation_job_simpleqna_with_agent_source.py).
Wires up Azure Monitor + AIProjectInstrumentor(enable_content_recording=True) so calls to responses.create emit GenAI semantic spans (with message content) to Application Insights — the data-generation service reads gen_ai.agent.name and conversation content from these spans.
Runs three multi-turn conversations (3 × 5 turns = 15 GenAI spans) against the agent, force-flushes the tracer provider, and waits 180s for App Insights ingestion.
Submits the existing DataGenerationJob over a window that exactly brackets the seeded traces.
Cleans up dataset, job, conversations, and temporary agent in a best-effort finally block.
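The "window that exactly brackets the seeded traces" can be pictured with a small sketch. This is not the sample's code: `bracket_window` and its `padding` parameter are hypothetical names, and it assumes the window is simply the pair of UTC timestamps captured before and after seeding.

```python
from datetime import datetime, timedelta, timezone


def bracket_window(seed_start, seed_end, padding=timedelta(0)):
    """Return a (start, end) pair that encloses the seeding interval.

    `padding` is a hypothetical allowance for clock skew; the default of
    zero gives a window that exactly brackets the seeded traces.
    """
    if seed_end < seed_start:
        raise ValueError("seed_end must not precede seed_start")
    return seed_start - padding, seed_end + padding


# Timestamps would normally come from datetime.now(timezone.utc) taken
# immediately before and after the seeding loop; fixed values are used here.
seed_start = datetime(2026, 5, 31, 3, 0, 0, tzinfo=timezone.utc)
seed_end = seed_start + timedelta(seconds=90)
window_start, window_end = bracket_window(seed_start, seed_end)
```

Using timezone-aware UTC timestamps avoids ambiguity when the window is compared against ingestion times recorded by another service.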

Users who already have an agent with traces opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; the sample then skips creation, seeding, ingestion wait, and the new telemetry deps, and uses FOUNDRY_TRACES_WINDOW_DAYS (default 7d) as before.

New tuning knobs (all optional):

TRACE_SEEDING_CONVERSATIONS (default 3)
TRACE_SEEDING_TURNS (default 5)
TRACE_INGESTION_WAIT_SECONDS (default 180)
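One plausible way to read these optional knobs, sketched under the assumption that each is a non-negative integer environment variable. The helper name `int_from_env` is hypothetical and is not taken from the sample.

```python
import os


def int_from_env(name, default, env=None):
    """Read an optional integer setting, falling back to `default` when unset or blank."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


# Defaults mirror the values listed in the PR description.
conversations = int_from_env("TRACE_SEEDING_CONVERSATIONS", 3, env={})
turns = int_from_env("TRACE_SEEDING_TURNS", 5, env={"TRACE_SEEDING_TURNS": "2"})
wait_seconds = int_from_env("TRACE_INGESTION_WAIT_SECONDS", 180, env={})
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.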

Why

Bug bash feedback: the trace-based samples required users to first stand up an agent and run conversations through it before the sample worked. Most reviewers either skipped them or asked the team to provide a shared agent. Making the sample self-contained removes that gate while keeping the BYO path available for anyone who wants to point at a real agent.

This is a pilot for one sample. Pending review, I'll apply the same pattern to the other trace-dependent samples (sample_dataset_generation_job_traces_for_finetuning.py, sample_multiturn_trace_evaluation_*.py, sample_agent_trace_evaluation_smart_filter.py) in follow-ups.

Validation

Live test against build26-bug-bash project on gpt-5.1:

Self-contained mode: agent created → 15 seeded turns → 180s ingestion wait → DataGenerationJob SUCCEEDED → Generated samples: 15 → dataset, job, 3 conversations, and agent all deleted cleanly. Total wall-clock ≈ 6 min.
BYO mode: still works against existing test-agent (15 samples).

Notes for reviewers

Two new client deps are pip-installed only when in self-contained mode (azure-monitor-opentelemetry, azure-core-tracing-opentelemetry); BYO users don't need them. They're loaded via importlib.import_module with a friendly ImportError if missing.
AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING=true is forced (not setdefault) before AIProjectInstrumentor().instrument() because the instrumentor checks this env var at instrument-time and silently does nothing if it isn't set.
_safe_console() wraps response previews because some Windows consoles default to cp1252 and the model frequently emits chars (smart quotes, non-breaking hyphens) that crash print() there. App Insights spans are unaffected.
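The three notes above can be illustrated with a short sketch. The helper names and bodies below are assumptions made for illustration (the PR text names only `importlib.import_module`, the environment variable, and `_safe_console`); none of this is the sample's actual implementation.

```python
import importlib


def require_module(module_name, pip_name):
    """Import an optional dependency, re-raising with an install hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Self-contained mode needs '{pip_name}'. "
            f"Install it with: pip install {pip_name}"
        ) from exc


def force_genai_tracing(env):
    """Overwrite the flag rather than using setdefault, so a stale 'false' cannot win."""
    env["AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING"] = "true"


def safe_console(text, encoding="cp1252"):
    """Replace characters the console encoding cannot represent instead of raising."""
    return text.encode(encoding, errors="replace").decode(encoding)


# setdefault leaves an existing value untouched; direct assignment does not.
env = {"AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING": "false"}
env.setdefault("AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING", "true")
value_after_setdefault = env["AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING"]
force_genai_tracing(env)
value_after_force = env["AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING"]

# U+2011 (non-breaking hyphen) has no cp1252 mapping, so it becomes "?".
preview = safe_console("non\u2011breaking")
```

In a real script `force_genai_tracing` would be called with `os.environ` before the instrumentor is invoked; a plain dict is used here so the example has no side effects.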

Stacked-PR note

This PR is stacked on top of #47249 (sample print/finetuning/simulation fixes). It targets users/aprilk/sample-fixes-22; will retarget to main once #47249 merges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

aprilk-ms · 2026-05-31T04:10:07Z

Self-reviewed before requesting review.

One finding addressed (f9d1248): dataset and job cleanup lived inside the try block, so a polling failure, dataset-get failure, or any exception between job creation and the success-path deletes would leak the data-generation job (and possibly the dataset) on the unhappy path. Moved both deletes into finally alongside the existing conversation and agent cleanup. Each step is its own best-effort try / except so a failure in one does not skip the others, and so cleanup never masks the real exception. Order is outputs → producers: dataset → job → conversations → agent.
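A minimal sketch of the cleanup shape described above, with stand-in callables instead of real SDK calls. `run_with_cleanup` and the step labels are hypothetical; the point is only that each delete is isolated and the original exception still surfaces.

```python
def run_with_cleanup(work, cleanup_steps, log):
    """Run `work`, then attempt every cleanup step even if `work` or a step fails."""
    try:
        work()
    finally:
        for label, delete in cleanup_steps:
            try:
                delete()
                log.append(f"deleted {label}")
            except Exception as exc:  # best-effort: record and keep going
                log.append(f"could not delete {label}: {exc}")


def failing_work():
    raise RuntimeError("polling failed")


def failing_delete():
    raise OSError("job delete failed")


log = []
# Order is outputs first, producers last: dataset, job, conversations, agent.
steps = [
    ("dataset", lambda: None),
    ("job", failing_delete),
    ("conversations", lambda: None),
    ("agent", lambda: None),
]
try:
    run_with_cleanup(failing_work, steps, log)
except RuntimeError as exc:
    surfaced = str(exc)
```

Because each cleanup failure is handled inside the loop, the exception that reaches the caller is the one raised by `work`, not one raised during teardown.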

Re-live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 5 resources cleaned up via finally, exit 0.

Other suggestions considered and dismissed:

Agent name length check — no known service limit; reviewer self-resolved at low confidence.
Modulo on TRACE_SEEDING_TURNS > arc length — arc[ti % len(arc)] already handles arbitrary turn counts safely.
uninstrument OTEL on exit — sample is a one-shot script; process exits after teardown, no state pollution risk.
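The modulo point is easy to check in isolation. In this sketch `arc` holds placeholder strings rather than the sample's prompts, and `prompt_for_turn` is a hypothetical wrapper around the `arc[ti % len(arc)]` expression quoted above.

```python
def prompt_for_turn(arc, ti):
    """Pick the prompt for turn index `ti`, wrapping around when ti >= len(arc)."""
    if not arc:
        raise ValueError("arc must contain at least one prompt")
    return arc[ti % len(arc)]


arc = ["greeting", "question", "follow-up"]  # placeholder prompts
turns = [prompt_for_turn(arc, ti) for ti in range(5)]
```

The only input that would break the bare expression is an empty `arc`, which is why the sketch guards against it explicitly.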

Ready for your review.

aprilk-ms · 2026-05-31T04:39:41Z

Follow-up readability pass (2b0191b): dropped the bring-your-own-agent mode and trimmed supporting plumbing so the sample is easier to follow for a first-time reader.

Before / after

Was 439 lines / 24 KB. Now 282 lines / 14 KB. Sister sample sample_dataset_generation_job_simpleqna_with_agent_source.py is 180 lines / 11 KB; the remaining ~100-line gap is intrinsic trace-mode overhead (persona text, instrumentor wiring, conversation cleanup).

What changed

Self-contained mode only. The if seed_traces / else split, the BYO env vars, and the importlib.import_module dance for "optional" telemetry deps are gone. A 4-line note in the docstring tells BYO users which block to replace.
Telemetry imports are now direct: from azure.monitor.opentelemetry import configure_azure_monitor and from azure.ai.projects.telemetry import AIProjectInstrumentor sit at the top of the file.
AGENT_INSTRUCTIONS shrunk from ~40 lines to ~15, scoped to just what the seeded prompts ask about.
Dropped _safe_console and the per-turn preview prints. Cleanup output is still printed.
Dropped the OpenTelemetry force_flush try/except; the 180s ingestion wait covers exporter batching too.
Four per-resource try/except cleanup blocks collapsed behind a small _try_delete(label, fn, *args, **kwargs) helper.
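For readers who have not seen the helper, here is one way a function with the signature `_try_delete(label, fn, *args, **kwargs)` might look. Only the signature comes from the PR text; the body, the printed messages, and the boolean return value are assumptions.

```python
def _try_delete(label, fn, *args, **kwargs):
    """Call a delete function, reporting failure instead of raising (best-effort cleanup)."""
    try:
        fn(*args, **kwargs)
        print(f"Deleted {label}")
        return True
    except Exception as exc:
        print(f"Could not delete {label}: {exc}")
        return False


# Stand-ins for SDK delete calls; the real sample would pass client methods.
deleted = []


def fake_delete(name, version=None):
    deleted.append((name, version))


def broken_delete(name):
    raise RuntimeError(f"{name} not found")


ok = _try_delete("dataset", fake_delete, "my-dataset", version="1")
failed = _try_delete("job", broken_delete, "my-job")
```

The trade-off is the one the later commit names: the helper removes repetition but adds a layer of indirection that a first-time reader has to follow.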

Re-live-tested against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.

Rewrite samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so it works without prior setup.

By default, the sample now creates a temporary Foundry agent, runs three multi-turn Widgets & Gizmos conversations against it with GenAI content tracing enabled (configure_azure_monitor + AIProjectInstrumentor with enable_content_recording=True), waits for App Insights ingestion, then submits the existing data-generation job over a window that exactly brackets the seeded traces. The temporary agent, seeded conversations, generated dataset, and data-generation job are all cleaned up in a best-effort finally block.

Users who already have an agent with traces can opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; in that mode the sample skips agent creation, trace seeding, and ingestion wait and uses the existing FOUNDRY_TRACES_WINDOW_DAYS look-back window (default 7 days). New seeding knobs (TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS, TRACE_INGESTION_WAIT_SECONDS) make timing tunable per environment.

Validated end-to-end against the build26-bug-bash project on gpt-5.1: self-contained run produced 15 generated samples and cleaned up all temporary resources successfully.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Track submitted_job_id and created_dataset before the try block and move the dataset and job deletes into finally, alongside the existing conversation and agent cleanup. Previously these two deletes lived inside try, so a polling failure, dataset-get failure, or any exception between job creation and the success-path deletes would leak the data-generation job (and possibly the dataset) on the unhappy path. Each step is now wrapped in its own best-effort try/except so a failure in one does not skip the others, and so cleanup never masks the real exception. Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 5 resources cleaned up via finally, exit 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Drops the bring-your-own-agent mode and trims supporting plumbing so the sample is easier for a first-time reader to follow. Was 439 lines; now 282 (sister sample sample_dataset_generation_job_simpleqna_with_agent_source.py is 180).

Changes:
- Self-contained mode only. The if seed_traces / else split, the BYO env vars (FOUNDRY_AGENT_NAME, FOUNDRY_TRACES_WINDOW_DAYS, TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS), and the importlib.import_module dance for optional telemetry deps are gone. A 4-line note in the docstring tells BYO users which block to replace.
- Imports azure.monitor.opentelemetry and azure.ai.projects.telemetry directly at the top of the file.
- Shrinks AGENT_INSTRUCTIONS from ~40 lines to ~15 with only the policies the seeded prompts actually ask about.
- Drops the _safe_console helper and the per-turn preview prints. Cleanup output is still printed.
- Drops the opentelemetry force_flush try/except; the 180s ingestion wait covers exporter batching too.
- Replaces the four per-resource try/except cleanup blocks with a small _try_delete(label, fn, *args, **kwargs) helper.

Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Trim verbose section banners and docstring prose; replace multi-line comments with single-line equivalents. No behavior change. 282 -> 257 lines (-9%). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace the _try_delete helper with one inline try/except per resource. Each cleanup now reads top-to-bottom at the call site (preferred for sample readability) and drops a layer of indirection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Persona collapsed to 4 inline declarations (14 -> 5 lines). The nested SEEDING_CONVERSATIONS list-of-lists is replaced by a flat SEED_PROMPTS list plus NUM_CONVERSATIONS constant; the seeding loop cycles each conversation through the same prompts. Behavior unchanged - still 3 x 5 = 15 turns and 15 generated samples (live-tested). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
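The flattened seeding structure can be sketched as follows. The names `SEED_PROMPTS` and `NUM_CONVERSATIONS` come from the commit message; the prompt strings and the `seed_conversations` function with its `send_turn` callback are placeholders standing in for the real agent calls.

```python
# Placeholder prompts; the sample's actual Widgets & Gizmos prompts are not shown in this PR text.
SEED_PROMPTS = [
    "prompt 1",
    "prompt 2",
    "prompt 3",
    "prompt 4",
    "prompt 5",
]
NUM_CONVERSATIONS = 3


def seed_conversations(send_turn):
    """Send every prompt once per conversation; return the total number of turns sent."""
    total = 0
    for conversation_index in range(NUM_CONVERSATIONS):
        for prompt in SEED_PROMPTS:
            send_turn(conversation_index, prompt)
            total += 1
    return total


sent = []
total_turns = seed_conversations(lambda index, prompt: sent.append((index, prompt)))
```

With five prompts and three conversations this yields the same 3 x 5 = 15 turns as the nested list-of-lists it replaces, since every conversation cycles through the same prompts.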

max_samples is a cap on generated samples, not a floor on input traces. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Prompt agents emit server-side traces to the project's connected App Insights, so client-side AIProjectInstrumentor + configure_azure_monitor are not required. Hardcode poll/wait constants and dataset name (still uniqueified via run id). Verified live: PASS in 231s, 1 sample generated, clean teardown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the AI Projects label May 31, 2026

aprilk-ms and others added 8 commits May 31, 2026 10:42

Tighten comments in self-contained traces sample

6a255e8

Trim verbose section banners and docstring prose; replace multi-line comments with single-line equivalents. No behavior change. 282 -> 257 lines (-9%). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Minimize self-contained traces sample to 1 seeded turn

d641405

max_samples is a cap on generated samples, not a floor on input traces. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

aprilk-ms force-pushed the users/aprilk/sample-self-contained-traces branch from 981d6f0 to 416fdd8 Compare May 31, 2026 17:46

aprilk-ms wants to merge 8 commits into users/aprilk/sample-fixes-22 from users/aprilk/sample-self-contained-traces
