
Feat: Add evals harness + improve skills #54

Open
nieblara wants to merge 9 commits into main from add-evals-harness

Conversation

@nieblara (Contributor)

No description provided.

# Conflicts:
#	skills.json
#	skills/ai-configs/aiconfig-create/SKILL.md
#	skills/ai-configs/aiconfig-tools/SKILL.md
#	skills/ai-configs/aiconfig-update/SKILL.md
#	skills/ai-configs/aiconfig-variations/SKILL.md
@nieblara requested a review from a team as a code owner on April 30, 2026, 07:29
@ari-launchdarkly left a comment (Contributor)

I'm still reviewing this. Just submitting it for now since the kids are awake and I've gotta make breakfast. I'd love to get someone from AIC to review the skill changes. So far so good - just some questions and asks to flesh some parts out. My thinking (and I could be wrong) is that non-devs will likely be more and more involved in this process, so we should orient the language toward them? Maybe that's the wrong way of looking at it, since an agent might be directing them, but if that's the case, maybe we need to provide as much context as possible for that agent?

Comment on lines +22 to +23
# 09:17 UTC daily - off the hour to avoid lining up with API rate limits.
- cron: "17 9 * * *"

Is the rationale here (besides the rate limits) that we'd be available and not after-hours if this were to fire and trigger an alert? It's such a random time (I'm used to seeing cron jobs running in the dead hours).

Comment on lines +26 to +27
run_all:
  description: "Re-run every suite regardless of diff"

Are we sure about this? I think we'd only want to re-run these (since the cost can add up) when:

  1. The temperature changes
  2. A model changes
  3. The prompt changes

There should be a way for us to detect that. If we want, it can be a fast-follow.
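Purely as an illustration of that detection idea (not something in this PR), a minimal Node sketch that fingerprints only the settings that affect results and compares against the last recorded value. The config path, config shape, and fingerprint file are all assumptions:

```js
// Hypothetical sketch: re-run a suite only when the settings that affect
// results (model, temperature, prompt) change. Paths and the config shape
// are invented for illustration.
const crypto = require("crypto");
const fs = require("fs");

function settingsFingerprint(configPath) {
  const config = JSON.parse(fs.readFileSync(configPath, "utf8"));
  const relevant = {
    model: config.model,
    temperature: config.temperature,
    prompt: config.prompt,
  };
  return crypto.createHash("sha256").update(JSON.stringify(relevant)).digest("hex");
}

const current = settingsFingerprint("evals/my-suite/config.json");
const previous = fs.existsSync(".eval-fingerprint")
  ? fs.readFileSync(".eval-fingerprint", "utf8").trim()
  : null;

if (current !== previous) {
  fs.writeFileSync(".eval-fingerprint", current);
  console.log("relevant settings changed - re-run the suite");
} else {
  console.log("no relevant changes - skip");
}
```

CI could cache the fingerprint file between runs so only suites whose settings changed get re-run.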

Oh wait, I'm just realizing that this is to be run on the cron. Nevermind

Comment on lines +104 to +106
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AGENT_MODEL: ${{ vars.AGENT_MODEL || 'claude-sonnet-4-20250514' }}
RUBRIC_MODEL: ${{ vars.RUBRIC_MODEL || 'anthropic:messages:claude-haiku-4-5-20251001' }}

This could just be me having limited permissions, but I don't see the OPENAI_API_KEY set in the repo secrets or the variables.

with:
  name: results-${{ matrix.suite }}
  path: evals/${{ matrix.suite }}/results.json
  retention-days: 14

Do we care about outliers found in the process? I'm thinking from a data-ingest perspective: if we accumulate failures over a period of time, could that give an agent context on how to write better skills and evaluations as we iterate?
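One way the retained artifacts could feed that loop, sketched under the assumption that past runs are downloaded into a `runs/<run-id>/<suite>/results.json` layout and that each result has `pass` and `reason` fields (the real schema may differ):

```js
// Hypothetical sketch: fold downloaded results.json artifacts from several
// runs into a per-suite failure summary an agent could read. The directory
// layout and result schema are assumptions, not what this PR produces.
const fs = require("fs");
const path = require("path");

function summarizeFailures(runsDir) {
  const summary = {}; // suite -> { runs, failures, reasons }
  for (const runId of fs.readdirSync(runsDir)) {
    const runDir = path.join(runsDir, runId);
    for (const suite of fs.readdirSync(runDir)) {
      const file = path.join(runDir, suite, "results.json");
      if (!fs.existsSync(file)) continue;
      const results = JSON.parse(fs.readFileSync(file, "utf8")).results || [];
      const entry = (summary[suite] ||= { runs: 0, failures: 0, reasons: [] });
      entry.runs += 1;
      for (const r of results) {
        if (!r.pass) {
          entry.failures += 1;
          if (r.reason) entry.reasons.push(r.reason);
        }
      }
    }
  }
  return summary;
}

console.log(JSON.stringify(summarizeFailures("runs"), null, 2));
```

The output is a per-suite failure count plus the collected reasons, which is the kind of context an agent could be handed before proposing skill or eval changes.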

Comment on lines +28 to +36
codebase_context: >
  The codebase uses the LaunchDarkly Python AI SDK. AI Configs are evaluated
  using ldclient with create_chat(). Config keys use kebab-case.
assert:
  - type: javascript
    value: |
      const tools = output.tools_called || [];
      const pass = tools.includes('setup-ai-config');
      return { pass, score: pass ? 1 : 0, reason: tools.length ? 'Tools called: ' + tools.join(' -> ') : 'No tools called' };

I'm a little confused by this. Should the type be javascript if we're setting up ldclient in a Python app?
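Assuming the harness here is promptfoo, the `javascript` type describes the language the grading snippet runs in (it inspects the agent's output), not the language of the codebase under test, so a Python app graded by a JS assertion is expected. For longer checks, the same logic could live in a standalone file referenced from the YAML; a hypothetical example:

```js
// Hypothetical standalone assertion, e.g. referenced from the YAML as
// `value: file://shared/asserts/calls-setup.js` (path invented). Mirrors the
// inline snippet above: pass if the setup-ai-config tool was called.
module.exports = (output) => {
  const tools = output.tools_called || [];
  const pass = tools.includes("setup-ai-config");
  return {
    pass,
    score: pass ? 1 : 0,
    reason: tools.length
      ? "Tools called: " + tools.join(" -> ")
      : "No tools called",
  };
};
```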

Comment thread docs/evals.md
The aggregator + CI pick up the new suite automatically once it's in
`_manifest.js`.

## Open questions and known limitations

An additional thing to mention (I may have missed it): there's a lot of intent management in agent-skills. What I mean by that is we seem to be creating a "voice" or identity for our agentic experience that the evaluations should capture.
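On the quoted docs line about `_manifest.js`: a purely hypothetical illustration of what registering a suite there might look like. The real manifest in this PR may use a different shape entirely; the point is only that adding the entry is what makes the aggregator and CI pick the suite up.

```js
// _manifest.js (hypothetical shape - field names and paths invented).
module.exports = [
  // ...existing suites...
  {
    suite: "aiconfig-create",
    config: "evals/aiconfig-create/promptfooconfig.yaml",
  },
];
```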

Comment thread evals/.env.example
# grader unless you switch the grader to a non-Anthropic provider via
# RUBRIC_MODEL below.
ANTHROPIC_API_KEY=

I have a feeling we'll be asked where to get this value from. We should link to that here.

Comment thread evals/.env.example
Comment on lines +11 to +22
# AGENT_MODEL=claude-sonnet-4-20250514

# REQUIRED: the rubric grader for `llm-rubric` assertions. Wired into
# shared/defaults.yaml as defaultTest.options.provider. Pick a cheaper model
# than AGENT_MODEL since this only judges agent output and runs once per
# rubric assertion.
#
# Examples:
# anthropic:messages:claude-haiku-4-5-20251001
# openai:gpt-5-mini
# openai:chat:gpt-4.1-mini
RUBRIC_MODEL=anthropic:messages:claude-haiku-4-5-20251001

Same here

Comment thread evals/README.md
| `AGENT_MODEL` | the provider (system under test) | `claude-sonnet-4-20250514` | Stays on Claude because that's representative of what users actually run. |
| `RUBRIC_MODEL` | `defaultTest.options.provider` (rubric grader) | `anthropic:messages:claude-haiku-4-5-20251001` | Cheaper grader cuts cost roughly 10x without changing what's measured. |

`EVAL_MODEL` (the legacy variable) is still honoured as a fallback for `AGENT_MODEL` so existing `.env` files keep working.

Should we support the legacy value or just remove it?
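For context on what keeping the fallback involves, a sketch of how the resolution could look, assuming the harness reads its models from the environment in a small JS helper (the helper and its location are hypothetical; the default mirrors the table above):

```js
// Hypothetical helper: resolve the agent model, honouring the legacy
// EVAL_MODEL variable so existing .env files keep working.
function resolveAgentModel(env = process.env) {
  return (
    env.AGENT_MODEL ||
    env.EVAL_MODEL || // legacy name, kept only as a fallback
    "claude-sonnet-4-20250514"
  );
}
```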

**Execute all three steps in a single pass without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.

**Execute all three steps without stopping to ask for details.** Infer the variation key (`default`), name (`Default`), instructions/messages, and model from the user's request context. If the user asked for GPT-4o agent mode, you have enough to complete the entire flow. Only ask clarifying questions if the mode or model is truly ambiguous.
**Step 3 (the `get-ai-config` call) is mandatory regardless of how convincing the create response looks.** The two write tools may return what looks like a complete object, but only `get-ai-config` confirms the config was actually persisted with both the shell and variation linked. Skipping this step is a workflow violation — make the call even when you "feel" the previous responses already showed everything.

I'd love to get someone from AIC to verify these changes
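For readers skimming the quoted skill text, a purely illustrative sketch of the three-step flow it describes, written around a hypothetical callTool() helper; apart from get-ai-config, the tool names and parameters here are placeholders, not the skill's real identifiers.

```js
// Illustrative only - placeholder tool names except get-ai-config.
async function createAiConfigFlow(callTool, request) {
  // Step 1: create the config shell.
  const config = await callTool("create-ai-config", { key: request.configKey });

  // Step 2: create the variation, inferring defaults from the request
  // (key "default", name "Default", model from the user's ask).
  const variation = await callTool("create-variation", {
    configKey: request.configKey,
    key: "default",
    name: "Default",
    model: request.model,
  });

  // Step 3 is mandatory: only get-ai-config confirms the config was
  // persisted with the shell and variation linked, however complete the
  // two write responses looked.
  const persisted = await callTool("get-ai-config", { key: request.configKey });
  return { config, variation, persisted };
}
```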

@ari-launchdarkly left a comment (Contributor)

OK. My other concerns are keeping things in sync (a problem we can resolve later) and adding tests to some of the utility methods. They do a lot, and it would be nice to see that we've worked out any potential edge cases here.

}
}
}
}

How do we ensure there isn't drift between these mocks and the real APIs?
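One possible guard, offered only as a sketch and not something this PR does: validate the mock fixtures against a schema derived from the real API (for example, exported from an OpenAPI spec) so drift fails CI. The library choice (ajv), the paths, and the existence of such a schema are all assumptions:

```js
// Hypothetical drift guard: validate each mock fixture against a JSON Schema
// derived from the real API. Paths and the schema source are assumptions.
const fs = require("fs");
const path = require("path");
const Ajv = require("ajv"); // assumed dev dependency

const ajv = new Ajv({ allErrors: true });
const schema = JSON.parse(fs.readFileSync("schemas/ai-config.schema.json", "utf8"));
const validate = ajv.compile(schema);

const mocksDir = "evals/shared/mocks";
for (const file of fs.readdirSync(mocksDir).filter((f) => f.endsWith(".json"))) {
  const mock = JSON.parse(fs.readFileSync(path.join(mocksDir, file), "utf8"));
  if (!validate(mock)) {
    console.error(file, validate.errors);
    process.exitCode = 1; // fail CI when a mock no longer matches the schema
  }
}
```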

4. Did it hand off cleanly without trying to do both at once?
Score 1.0 if all four are met, deduct 0.25 for each missed.
metric: precedence_quality
weight: 2

this file is awesome
