Feat: Add evals harness + improve skills #54
Conversation
supporting approvals
…' into update-agent-skills
# Conflicts:
#   skills.json
#   skills/ai-configs/aiconfig-create/SKILL.md
#   skills/ai-configs/aiconfig-tools/SKILL.md
#   skills/ai-configs/aiconfig-update/SKILL.md
#   skills/ai-configs/aiconfig-variations/SKILL.md
…, and added a poc for onboarding routing
ari-launchdarkly left a comment
I'm still reviewing this. Just submitting it for now since the kids are awake and I've gotta make breakfast. I'd love to get someone from AIC to review the skill changes. So far so good: just some questions and asks to flesh some parts out. My thinking (and I could be wrong) is that non-devs will likely be more and more involved in this process, so we should orient the language toward them. Maybe that's the wrong way of looking at it since an agent might be directing them, but if that's the case, maybe we need to provide the most context possible for that agent?
# 09:17 UTC daily - off the hour to avoid lining up with API rate limits.
- cron: "17 9 * * *"
Is the rationale here (besides the rate limits) that we'd be available, not after-hours, if this were to fire and trigger an alert? It's such a random time (I'm used to seeing cron jobs run in the dead hours).
run_all:
  description: "Re-run every suite regardless of diff"
Are we sure about this? I think we'd only want to re-run these (since the cost can add up) when:
- The temperature changes
- A model changes
- The prompt changes

There should be a way for us to detect that. If we want, it can be a fast-follow.
Oh wait, I'm just realizing that this is to be run on the cron. Never mind.
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AGENT_MODEL: ${{ vars.AGENT_MODEL || 'claude-sonnet-4-20250514' }}
RUBRIC_MODEL: ${{ vars.RUBRIC_MODEL || 'anthropic:messages:claude-haiku-4-5-20251001' }}
with:
  name: results-${{ matrix.suite }}
  path: evals/${{ matrix.suite }}/results.json
  retention-days: 14
Do we care about outliers found in the process? I'm thinking from a data-ingest perspective: if we have failures over a period of time, could that provide an agent context on how to write better skills and evaluations as we iterate?
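If we did ingest the retained artifacts, a rough sketch of the fold might look like this. The `{ suite, results: [{ id, pass }] }` shape is assumed for illustration, not the harness's real results.json schema:

```javascript
// Fold a series of downloaded results.json artifacts into per-assertion
// failure counts, so repeated failures stand out over time and could be
// fed back to an agent as context for improving skills and evals.
function failureCounts(runs) {
  const counts = {};
  for (const run of runs) {
    for (const result of run.results) {
      if (!result.pass) {
        counts[result.id] = (counts[result.id] || 0) + 1;
      }
    }
  }
  return counts;
}
```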
codebase_context: >
  The codebase uses the LaunchDarkly Python AI SDK. AI Configs are evaluated
  using ldclient with create_chat(). Config keys use kebab-case.
assert:
  - type: javascript
    value: |
      const tools = output.tools_called || [];
      const pass = tools.includes('setup-ai-config');
      return { pass, score: pass ? 1 : 0, reason: tools.length ? 'Tools called: ' + tools.join(' -> ') : 'No tools called' };
I'm a little confused by this. Should the type be javascript if we're setting up ldclient in a Python app?
The aggregator + CI pick up the new suite automatically once it's in
`_manifest.js`.

## Open questions and known limitations
An additional thing to mention (I may have missed it): there's a lot of intent management in agent-skills. What I mean by that is we seem to be creating a "voice" or identity for our agentic experience, and the evaluations should capture that.
# grader unless you switch the grader to a non-Anthropic provider via
# RUBRIC_MODEL below.
ANTHROPIC_API_KEY=
I have a feeling we'll be asked where to get this value from. We should link that here.
# AGENT_MODEL=claude-sonnet-4-20250514

# REQUIRED: the rubric grader for `llm-rubric` assertions. Wired into
# shared/defaults.yaml as defaultTest.options.provider. Pick a cheaper model
# than AGENT_MODEL since this only judges agent output and runs once per
# rubric assertion.
#
# Examples:
#   anthropic:messages:claude-haiku-4-5-20251001
#   openai:gpt-5-mini
#   openai:chat:gpt-4.1-mini
RUBRIC_MODEL=anthropic:messages:claude-haiku-4-5-20251001
| Variable | Wired into | Default | Why |
| --- | --- | --- | --- |
| `AGENT_MODEL` | the provider (system under test) | `claude-sonnet-4-20250514` | Stays on Claude because that's representative of what users actually run. |
| `RUBRIC_MODEL` | `defaultTest.options.provider` (rubric grader) | `anthropic:messages:claude-haiku-4-5-20251001` | Cheaper grader cuts cost roughly 10x without changing what's measured. |

`EVAL_MODEL` (the legacy variable) is still honoured as a fallback for `AGENT_MODEL` so existing `.env` files keep working.
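That fallback amounts to roughly the following. The default is the one documented above; the function name is just for illustration:

```javascript
// Documented precedence: AGENT_MODEL wins, the legacy EVAL_MODEL is
// honoured as a fallback, then the hard-coded default applies.
function resolveAgentModel(env) {
  return env.AGENT_MODEL || env.EVAL_MODEL || 'claude-sonnet-4-20250514';
}
```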
Should we support the legacy value, or just remove it?
**Execute all three steps in a single pass without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.

**Execute all three steps without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.
**Step 3 (the `get-ai-config` call) is mandatory regardless of how convincing the create response looks.** The two write tools may return what looks like a complete object, but only `get-ai-config` confirms the config was actually persisted with both the shell and variation linked. Skipping this step is a workflow violation — make the call even when you "feel" the previous responses already showed everything.
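A sketch of that mandated flow in plain code: the tool names come from the skill text, but the call signatures and response shape here are stand-ins, not the real tool API.

```javascript
// Run the write step, then ALWAYS do the get-ai-config read-back (step 3),
// even if the write response already looked complete. The read-back is the
// only confirmation that both the shell and variation were persisted.
function createConfigWithVerification(tools, params) {
  tools['setup-ai-config'](params); // write: create shell + variation

  const persisted = tools['get-ai-config'](params.key); // mandatory read-back
  if (!persisted || !(persisted.variations || []).length) {
    throw new Error('Config not fully persisted: shell or variation missing');
  }
  return persisted;
}
```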
I'd love to get someone from AIC to verify these changes.
ari-launchdarkly left a comment
Ok. My other concerns are keeping things in sync (a problem we can resolve later) and adding tests to some of the utility methods. They do a lot, and it would be nice to see that we've worked out any potential edge cases here.
      }
    }
  }
}
How do we ensure there isn't drift between these mocks and the real APIs?
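One cheap guard, as a sketch: compare the mock's key set against a recorded real response and fail the suite on mismatch. Recording and refreshing the real shape (e.g. via a periodic live smoke test) would still need its own process; this only catches drift once a fresh recording exists.

```javascript
// Shallow shape check: a mock "matches" a recorded real response when both
// have exactly the same top-level keys. Deliberately ignores values, since
// mocks fake those on purpose.
function sameShape(mock, recorded) {
  const mockKeys = Object.keys(mock).sort();
  const realKeys = Object.keys(recorded).sort();
  return JSON.stringify(mockKeys) === JSON.stringify(realKeys);
}
```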
4. Did it hand off cleanly without trying to do both at once?
Score 1.0 if all four are met, deduct 0.25 for each missed.
metric: precedence_quality
weight: 2
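For reference, the deduction scheme in that rubric reduces to a simple formula (the function name is illustrative):

```javascript
// Score 1.0 minus 0.25 per missed criterion, floored at 0.
function precedenceScore(missedCount) {
  return Math.max(0, 1 - 0.25 * missedCount);
}
```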
this file is awesome


No description provided.