
Feat: Add evals harness + improve skills #54

Open
nieblara wants to merge 9 commits into main from add-evals-harness

Conversation

@nieblara (Contributor)

No description provided.

# Conflicts:
#	skills.json
#	skills/ai-configs/aiconfig-create/SKILL.md
#	skills/ai-configs/aiconfig-tools/SKILL.md
#	skills/ai-configs/aiconfig-update/SKILL.md
#	skills/ai-configs/aiconfig-variations/SKILL.md
@nieblara requested a review from a team as a code owner on April 30, 2026, 07:29
@ari-launchdarkly left a comment (Contributor)

I'm still reviewing this. Just submitting it for now since the kids are awake and I've gotta make breakfast. I'd love to get someone from AIC to review the skill changes. So far so good - just some questions and asks to flesh some parts out. My thinking (and I could be wrong) is that non-devs will likely be more and more involved in this process, so we should orient the language toward them? Maybe that's the wrong way of looking at it, since an agent might be directing them, but if that's the case, maybe we need to provide as much context as possible for that agent?

Comment on lines +22 to +23
# 09:17 UTC daily - off the hour to avoid lining up with API rate limits.
- cron: "17 9 * * *"

Is the rationale here (besides the rate limits) that we'd be available and not after-hours if this were to fire and trigger an alert? It's such a random time (I'm used to seeing cron jobs running in the dead hours).

Comment on lines +26 to +27
run_all:
  description: "Re-run every suite regardless of diff"

Are we sure about this? I think we'd only want to re-run these (since the cost can add up) when:

  1. The temperature changes
  2. A model changes
  3. The prompt changes

There should be a way for us to detect that. If we want, it can be a fast-follow.
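Purely as an illustration of that detection idea (not something in this PR), a minimal Node sketch that fingerprints only the settings that affect results and compares against the last recorded value. The config path, config shape, and fingerprint file are all assumptions:

```js
// Hypothetical sketch: re-run a suite only when the settings that affect
// results (model, temperature, prompt) change. Paths and the config shape
// are invented for illustration.
const crypto = require("crypto");
const fs = require("fs");

function settingsFingerprint(configPath) {
  const config = JSON.parse(fs.readFileSync(configPath, "utf8"));
  const relevant = {
    model: config.model,
    temperature: config.temperature,
    prompt: config.prompt,
  };
  return crypto.createHash("sha256").update(JSON.stringify(relevant)).digest("hex");
}

const current = settingsFingerprint("evals/my-suite/config.json");
const previous = fs.existsSync(".eval-fingerprint")
  ? fs.readFileSync(".eval-fingerprint", "utf8").trim()
  : null;

if (current !== previous) {
  fs.writeFileSync(".eval-fingerprint", current);
  console.log("relevant settings changed - re-run the suite");
} else {
  console.log("no relevant changes - skip");
}
```

CI could cache the fingerprint file between runs so only suites whose settings changed get re-run.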

Oh wait, I'm just realizing that this is to be run on the cron. Nevermind

Comment on lines +104 to +106
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AGENT_MODEL: ${{ vars.AGENT_MODEL || 'claude-sonnet-4-20250514' }}
RUBRIC_MODEL: ${{ vars.RUBRIC_MODEL || 'anthropic:messages:claude-haiku-4-5-20251001' }}

This could just be me having limited permissions, but I don't see the OPENAI_API_KEY set in the repo secrets or the variables.

with:
  name: results-${{ matrix.suite }}
  path: evals/${{ matrix.suite }}/results.json
  retention-days: 14

Do we care about outliers found in the process? I'm thinking from a data-ingest perspective: if we accumulate failures over a period of time, could that give an agent context on how to write better skills and evaluations as we iterate?
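One way the retained artifacts could feed that loop, sketched under the assumption that past runs are downloaded into a `runs/<run-id>/<suite>/results.json` layout and that each result has `pass` and `reason` fields (the real schema may differ):

```js
// Hypothetical sketch: fold downloaded results.json artifacts from several
// runs into a per-suite failure summary an agent could read. The directory
// layout and result schema are assumptions, not what this PR produces.
const fs = require("fs");
const path = require("path");

function summarizeFailures(runsDir) {
  const summary = {}; // suite -> { runs, failures, reasons }
  for (const runId of fs.readdirSync(runsDir)) {
    const runDir = path.join(runsDir, runId);
    for (const suite of fs.readdirSync(runDir)) {
      const file = path.join(runDir, suite, "results.json");
      if (!fs.existsSync(file)) continue;
      const results = JSON.parse(fs.readFileSync(file, "utf8")).results || [];
      const entry = (summary[suite] ||= { runs: 0, failures: 0, reasons: [] });
      entry.runs += 1;
      for (const r of results) {
        if (!r.pass) {
          entry.failures += 1;
          if (r.reason) entry.reasons.push(r.reason);
        }
      }
    }
  }
  return summary;
}

console.log(JSON.stringify(summarizeFailures("runs"), null, 2));
```

The output is a per-suite failure count plus the collected reasons, which is the kind of context an agent could be handed before proposing skill or eval changes.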

Comment on lines +28 to +36
codebase_context: >
  The codebase uses the LaunchDarkly Python AI SDK. AI Configs are evaluated
  using ldclient with create_chat(). Config keys use kebab-case.
assert:
  - type: javascript
    value: |
      const tools = output.tools_called || [];
      const pass = tools.includes('setup-ai-config');
      return { pass, score: pass ? 1 : 0, reason: tools.length ? 'Tools called: ' + tools.join(' -> ') : 'No tools called' };

I'm a little confused by this. Should the type be javascript if we're setting up ldclient in a Python app?
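Assuming the harness here is promptfoo, the `javascript` type describes the language the grading snippet runs in (it inspects the agent's output), not the language of the codebase under test, so a Python app graded by a JS assertion is expected. For longer checks, the same logic could live in a standalone file referenced from the YAML; a hypothetical example:

```js
// Hypothetical standalone assertion, e.g. referenced from the YAML as
// `value: file://shared/asserts/calls-setup.js` (path invented). Mirrors the
// inline snippet above: pass if the setup-ai-config tool was called.
module.exports = (output) => {
  const tools = output.tools_called || [];
  const pass = tools.includes("setup-ai-config");
  return {
    pass,
    score: pass ? 1 : 0,
    reason: tools.length
      ? "Tools called: " + tools.join(" -> ")
      : "No tools called",
  };
};
```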

Comment thread docs/evals.md
The aggregator + CI pick up the new suite automatically once it's in
`_manifest.js`.

## Open questions and known limitations

An additional thing to mention (I may have missed it): there's a lot of intent management in agent-skills. What I mean by that is we seem to be creating a "voice" or identity for our agentic experience that the evaluations should capture.
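On the quoted docs line about `_manifest.js`: a purely hypothetical illustration of what registering a suite there might look like. The real manifest in this PR may use a different shape entirely; the point is only that adding the entry is what makes the aggregator and CI pick the suite up.

```js
// _manifest.js (hypothetical shape - field names and paths invented).
module.exports = [
  // ...existing suites...
  {
    suite: "aiconfig-create",
    config: "evals/aiconfig-create/promptfooconfig.yaml",
  },
];
```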

Comment thread evals/.env.example
# grader unless you switch the grader to a non-Anthropic provider via
# RUBRIC_MODEL below.
ANTHROPIC_API_KEY=

I have a feeling we'll be asked where to get this value from. We should link to that here.

Comment thread evals/.env.example
Comment on lines +11 to +22
# AGENT_MODEL=claude-sonnet-4-20250514

# REQUIRED: the rubric grader for `llm-rubric` assertions. Wired into
# shared/defaults.yaml as defaultTest.options.provider. Pick a cheaper model
# than AGENT_MODEL since this only judges agent output and runs once per
# rubric assertion.
#
# Examples:
# anthropic:messages:claude-haiku-4-5-20251001
# openai:gpt-5-mini
# openai:chat:gpt-4.1-mini
RUBRIC_MODEL=anthropic:messages:claude-haiku-4-5-20251001

Same here

Comment thread evals/README.md
| `AGENT_MODEL` | the provider (system under test) | `claude-sonnet-4-20250514` | Stays on Claude because that's representative of what users actually run. |
| `RUBRIC_MODEL` | `defaultTest.options.provider` (rubric grader) | `anthropic:messages:claude-haiku-4-5-20251001` | Cheaper grader cuts cost roughly 10x without changing what's measured. |

`EVAL_MODEL` (the legacy variable) is still honoured as a fallback for `AGENT_MODEL` so existing `.env` files keep working.

Should we support the legacy value or just remove it?
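For context on what keeping the fallback involves, a sketch of how the resolution could look, assuming the harness reads its models from the environment in a small JS helper (the helper and its location are hypothetical; the default mirrors the table above):

```js
// Hypothetical helper: resolve the agent model, honouring the legacy
// EVAL_MODEL variable so existing .env files keep working.
function resolveAgentModel(env = process.env) {
  return (
    env.AGENT_MODEL ||
    env.EVAL_MODEL || // legacy name, kept only as a fallback
    "claude-sonnet-4-20250514"
  );
}
```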

**Execute all three steps in a single pass without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.

**Execute all three steps without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.
**Step 3 (the `get-ai-config` call) is mandatory regardless of how convincing the create response looks.** The two write tools may return what looks like a complete object, but only `get-ai-config` confirms the config was actually persisted with both the shell and variation linked. Skipping this step is a workflow violation — make the call even when you "feel" the previous responses already showed everything.

I'd love to get someone from AIC to verify these changes
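For readers skimming the quoted skill text, a purely illustrative sketch of the three-step flow it describes, written around a hypothetical callTool() helper; apart from get-ai-config, the tool names and parameters here are placeholders, not the skill's real identifiers.

```js
// Illustrative only - placeholder tool names except get-ai-config.
async function createAiConfigFlow(callTool, request) {
  // Step 1: create the config shell.
  const config = await callTool("create-ai-config", { key: request.configKey });

  // Step 2: create the variation, inferring defaults from the request
  // (key "default", name "Default", model from the user's ask).
  const variation = await callTool("create-variation", {
    configKey: request.configKey,
    key: "default",
    name: "Default",
    model: request.model,
  });

  // Step 3 is mandatory: only get-ai-config confirms the config was
  // persisted with the shell and variation linked, however complete the
  // two write responses looked.
  const persisted = await callTool("get-ai-config", { key: request.configKey });
  return { config, variation, persisted };
}
```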

@ari-launchdarkly left a comment (Contributor)

OK. My other concerns are keeping things in sync (a problem we can resolve later) and adding tests to some of the utility methods. They do a lot, and it would be nice to see that we've worked out any potential edge cases here.

}
}
}
}

How do we ensure there isn't drift between these mocks and the real APIs?
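One possible guard, offered only as a sketch and not something this PR does: validate the mock fixtures against a schema derived from the real API (for example, exported from an OpenAPI spec) so drift fails CI. The library choice (ajv), the paths, and the existence of such a schema are all assumptions:

```js
// Hypothetical drift guard: validate each mock fixture against a JSON Schema
// derived from the real API. Paths and the schema source are assumptions.
const fs = require("fs");
const path = require("path");
const Ajv = require("ajv"); // assumed dev dependency

const ajv = new Ajv({ allErrors: true });
const schema = JSON.parse(fs.readFileSync("schemas/ai-config.schema.json", "utf8"));
const validate = ajv.compile(schema);

const mocksDir = "evals/shared/mocks";
for (const file of fs.readdirSync(mocksDir).filter((f) => f.endsWith(".json"))) {
  const mock = JSON.parse(fs.readFileSync(path.join(mocksDir, file), "utf8"));
  if (!validate(mock)) {
    console.error(file, validate.errors);
    process.exitCode = 1; // fail CI when a mock no longer matches the schema
  }
}
```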

4. Did it hand off cleanly without trying to do both at once?
Score 1.0 if all four are met, deduct 0.25 for each missed.
metric: precedence_quality
weight: 2

this file is awesome
