[ai/azure-ai-projects] make traces-for-evaluation sample self-contained (pilot) #47250

Draft
aprilk-ms wants to merge 8 commits into users/aprilk/sample-fixes-22 from users/aprilk/sample-self-contained-traces

@aprilk-ms
Member

What

Pilot: refactor samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so a new user can run it end-to-end without any prior setup (no pre-existing agent, no pre-seeded traces).

By default, the sample now runs in self-contained mode:

  1. Creates a temporary Foundry agent (Widgets & Gizmos persona, reused from sample_dataset_generation_job_simpleqna_with_agent_source.py).
  2. Wires up Azure Monitor + AIProjectInstrumentor(enable_content_recording=True) so calls to responses.create emit GenAI semantic spans (with message content) to Application Insights — the data-generation service reads gen_ai.agent.name and conversation content from these spans.
  3. Runs three multi-turn conversations (3 × 5 turns = 15 GenAI spans) against the agent, force-flushes the tracer provider, and waits 180s for App Insights ingestion.
  4. Submits the existing DataGenerationJob over a window that exactly brackets the seeded traces.
  5. Cleans up dataset, job, conversations, and temporary agent in a best-effort finally block.
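
The window bracketing in steps 3 and 4 can be sketched roughly as follows. This is an illustration under assumptions, not the sample's code: `bracket_window`, `seed_fn`, and the 60-second padding are hypothetical names and values, and how the bounds are handed to the data-generation job is not shown.

```python
import time
from datetime import datetime, timedelta, timezone


def bracket_window(seed_fn, ingestion_wait_seconds=180,
                   pad=timedelta(seconds=60), sleep=time.sleep):
    """Run seed_fn and return a (start, end) UTC window around its traces."""
    # Open the window slightly before seeding starts (padding is an assumption).
    start = datetime.now(timezone.utc) - pad
    # Hypothetical callable that runs the seeded conversations.
    seed_fn()
    # Close the window slightly after the last turn completes.
    end = datetime.now(timezone.utc) + pad
    # Wait so the spans have time to land in App Insights before the job runs.
    sleep(ingestion_wait_seconds)
    return start, end
```

Taking `sleep` as a parameter keeps the helper easy to exercise without actually waiting 180 seconds.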

Users who already have an agent with traces opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; the sample then skips creation, seeding, ingestion wait, and the new telemetry deps, and uses FOUNDRY_TRACES_WINDOW_DAYS (default 7d) as before.

New tuning knobs (all optional):

  • TRACE_SEEDING_CONVERSATIONS (default 3)
  • TRACE_SEEDING_TURNS (default 5)
  • TRACE_INGESTION_WAIT_SECONDS (default 180)
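
One way to read these knobs with their defaults is sketched below. The helper name `read_int_knob` and the rejection of negative values are assumptions for illustration; only the variable names and defaults come from the list above.

```python
import os


def read_int_knob(name, default, environ=os.environ):
    """Return an optional integer environment variable, or its default."""
    raw = environ.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


conversations = read_int_knob("TRACE_SEEDING_CONVERSATIONS", 3)
turns = read_int_knob("TRACE_SEEDING_TURNS", 5)
ingestion_wait_seconds = read_int_knob("TRACE_INGESTION_WAIT_SECONDS", 180)
```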

Why

Bug bash feedback: the trace-based samples required users to first stand up an agent and run conversations through it before the sample worked. Most reviewers either skipped them or asked the team to provide a shared agent. Making the sample self-contained removes that gate while keeping the BYO path available for anyone who wants to point at a real agent.

This is a pilot for one sample. Pending review, I'll apply the same pattern to the other trace-dependent samples (sample_dataset_generation_job_traces_for_finetuning.py, sample_multiturn_trace_evaluation_*.py, sample_agent_trace_evaluation_smart_filter.py) in follow-ups.

Validation

Live test against build26-bug-bash project on gpt-5.1:

  • Self-contained mode: agent created → 15 seeded turns → 180s ingestion wait → DataGenerationJob SUCCEEDED → Generated samples: 15 → dataset, job, 3 conversations, and agent all deleted cleanly. Total wall-clock ≈ 6 min.
  • BYO mode: still works against existing test-agent (15 samples).

Notes for reviewers

  • Two new client deps (azure-monitor-opentelemetry, azure-core-tracing-opentelemetry) need to be pip-installed only for self-contained mode; BYO users don't need them. They're loaded via importlib.import_module, with a friendly ImportError if missing.
  • AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING=true is forced (not setdefault) before AIProjectInstrumentor().instrument() because the instrumentor checks this env var at instrument-time and silently does nothing if it isn't set.
  • _safe_console() wraps response previews because some Windows consoles default to cp1252 and the model frequently emits chars (smart quotes, non-breaking hyphens) that crash print() there. App Insights spans are unaffected.
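
The three notes above can be sketched as small helpers. The bodies are assumptions written for illustration (the names `require_module`, `force_genai_tracing`, and `safe_console` are hypothetical); only the environment variable name and the use of importlib.import_module come from the notes.

```python
import importlib
import os
import sys


def require_module(module_name, pip_name):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Self-contained mode needs '{pip_name}'. "
            f"Install it with: pip install {pip_name}"
        ) from exc


def force_genai_tracing():
    """Overwrite the flag; setdefault would keep an existing 'false' value."""
    os.environ["AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING"] = "true"


def safe_console(text, encoding=None):
    """Replace characters that the console encoding cannot represent."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

For example, `safe_console("a\u2011b", "cp1252")` swaps the non-breaking hyphen for `?` instead of letting print() raise UnicodeEncodeError.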

Stacked-PR note

This PR is stacked on top of #47249 (sample print/finetuning/simulation fixes). It targets users/aprilk/sample-fixes-22; will retarget to main once #47249 merges.


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@aprilk-ms
Member Author

Self-reviewed before requesting review.

One finding addressed (f9d1248): dataset and job cleanup lived inside the try block, so a polling failure, dataset-get failure, or any exception between job creation and the success-path deletes would leak the data-generation job (and possibly the dataset) on the unhappy path. Moved both deletes into finally alongside the existing conversation and agent cleanup. Each step is its own best-effort try / except so a failure in one does not skip the others, and so cleanup never masks the real exception. Order is outputs → producers: dataset → job → conversations → agent.
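
The cleanup shape described here looks roughly like the sketch below. It uses a hypothetical `run_with_cleanup` wrapper and placeholder callables rather than the real client calls, so treat it as an illustration of the pattern only.

```python
def run_with_cleanup(work, cleanups):
    """Run work(), then every (label, delete_fn) pair in order, even on failure."""
    try:
        return work()
    finally:
        for label, delete_fn in cleanups:
            try:
                delete_fn()
                print(f"Deleted {label}")
            except Exception as exc:
                # Best effort: report and continue, so one failed delete
                # neither skips the rest nor replaces the original exception.
                print(f"Could not delete {label}: {exc}")
```

Passing the cleanups in the order dataset, job, conversations, agent reproduces the outputs-before-producers ordering. Because nothing inside finally re-raises, an exception from `work()` still propagates to the caller unchanged.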

Re-live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources (dataset, job, 3 conversations, agent) cleaned up via finally, exit 0.

Other suggestions considered and dismissed:

  • Agent name length check — no known service limit; reviewer self-resolved at low confidence.
  • Modulo on TRACE_SEEDING_TURNS > arc length: arc[ti % len(arc)] already handles arbitrary turn counts safely.
  • uninstrument OTEL on exit — sample is a one-shot script; process exits after teardown, no state pollution risk.
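
For the modulo point, a minimal sketch of the indexing is shown below. `prompts_for_turns` is a hypothetical helper; only the `arc[ti % len(arc)]` expression comes from the sample, and the empty-arc guard is an added assumption.

```python
def prompts_for_turns(arc, turns):
    """Return `turns` prompts, wrapping around when turns exceeds len(arc)."""
    if not arc:
        raise ValueError("arc must contain at least one prompt")
    return [arc[ti % len(arc)] for ti in range(turns)]
```

With a three-prompt arc and five turns, the fourth and fifth turns reuse the first and second prompts.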

Ready for your review.

@aprilk-ms
Member Author

Follow-up readability pass (2b0191b): dropped the bring-your-own-agent mode and trimmed supporting plumbing so the sample is easier to follow for a first-time reader.

Before / after

  • Was 439 lines / 24 KB. Now 282 lines / 14 KB. Sister sample sample_dataset_generation_job_simpleqna_with_agent_source.py is 180 / 11 KB; the remaining ~100-line gap is intrinsic trace-mode overhead (persona text, instrumentor wiring, conversation cleanup).

What changed

  • Self-contained mode only. The if seed_traces / else split, the BYO env vars, and the importlib.import_module dance for "optional" telemetry deps are gone. A 4-line note in the docstring tells BYO users which block to replace.
  • Direct from azure.monitor.opentelemetry import configure_azure_monitor and from azure.ai.projects.telemetry import AIProjectInstrumentor at the top of the file.
  • AGENT_INSTRUCTIONS shrunk from ~40 lines to ~15, scoped to just what the seeded prompts ask about.
  • Dropped _safe_console and the per-turn preview prints. Cleanup output is still printed.
  • Dropped the OpenTelemetry force_flush try/except; the 180s ingestion wait covers exporter batching too.
  • Four per-resource try/except cleanup blocks collapsed behind a small _try_delete(label, fn, *args, **kwargs) helper.

Re-live-tested against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.

aprilk-ms and others added 8 commits May 31, 2026 10:42
Rewrite samples/datasets/sample_dataset_generation_job_traces_for_evaluation.py so it works without prior setup. By default, the sample now creates a temporary Foundry agent, runs three multi-turn Widgets & Gizmos conversations against it with GenAI content tracing enabled (configure_azure_monitor + AIProjectInstrumentor with enable_content_recording=True), waits for App Insights ingestion, then submits the existing data-generation job over a window that exactly brackets the seeded traces. The temporary agent, seeded conversations, generated dataset, and data-generation job are all cleaned up in a best-effort finally block.

Users who already have an agent with traces can opt into bring-your-own-agent mode by setting FOUNDRY_AGENT_NAME; in that mode the sample skips agent creation, trace seeding, and ingestion wait and uses the existing FOUNDRY_TRACES_WINDOW_DAYS look-back window (default 7 days). New seeding knobs (TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS, TRACE_INGESTION_WAIT_SECONDS) make timing tunable per environment.

Validated end-to-end against the build26-bug-bash project on gpt-5.1: self-contained run produced 15 generated samples and cleaned up all temporary resources successfully.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Track submitted_job_id and created_dataset before the try block and move the dataset and job deletes into finally, alongside the existing conversation and agent cleanup. Previously these two deletes lived inside try, so a polling failure, dataset-get failure, or any exception between job creation and the success-path deletes would leak the data-generation job (and possibly the dataset) on the unhappy path. Each step is now wrapped in its own best-effort try/except so a failure in one does not skip the others, and so cleanup never masks the real exception. Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources (dataset, job, 3 conversations, agent) cleaned up via finally, exit 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drops the bring-your-own-agent mode and trims supporting plumbing so the sample is easier for a first-time reader to follow. Was 439 lines; now 282 (sister sample sample_dataset_generation_job_simpleqna_with_agent_source.py is 180).

Changes:

- Self-contained mode only. The if seed_traces / else split, the BYO env vars (FOUNDRY_AGENT_NAME, FOUNDRY_TRACES_WINDOW_DAYS, TRACE_SEEDING_CONVERSATIONS, TRACE_SEEDING_TURNS), and the importlib.import_module dance for optional telemetry deps are gone. A 4-line note in the docstring tells BYO users which block to replace.

- Imports azure.monitor.opentelemetry and azure.ai.projects.telemetry directly at the top of the file.

- Shrinks AGENT_INSTRUCTIONS from ~40 lines to ~15 with only the policies the seeded prompts actually ask about.

- Drops the _safe_console helper and the per-turn preview prints. Cleanup output is still printed.

- Drops the opentelemetry force_flush try/except; the 180s ingestion wait covers exporter batching too.

- Replaces the four per-resource try/except cleanup blocks with a small _try_delete(label, fn, *args, **kwargs) helper.

Live-tested happy path against build26-bug-bash on gpt-5.1: 15 generated samples, all 6 resources cleaned up via finally, exit 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trim verbose section banners and docstring prose; replace multi-line
comments with single-line equivalents. No behavior change.

282 -> 257 lines (-9%).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the _try_delete helper with one inline try/except per resource.
Each cleanup now reads top-to-bottom at the call site (preferred for
sample readability) and drops a layer of indirection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Persona collapsed to 4 inline declarations (14 -> 5 lines). The nested
SEEDING_CONVERSATIONS list-of-lists is replaced by a flat SEED_PROMPTS
list plus NUM_CONVERSATIONS constant; the seeding loop cycles each
conversation through the same prompts. Behavior unchanged - still 3 x 5
= 15 turns and 15 generated samples (live-tested).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
max_samples is a cap on generated samples, not a floor on input traces.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prompt agents emit server-side traces to the project's connected App Insights, so client-side AIProjectInstrumentor + configure_azure_monitor are not required. Hardcode poll/wait constants and dataset name (still uniqueified via run id). Verified live: PASS in 231s, 1 sample generated, clean teardown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aprilk-ms aprilk-ms force-pushed the users/aprilk/sample-self-contained-traces branch from 981d6f0 to 416fdd8 Compare May 31, 2026 17:46