Commit 84166f3

committed
base-deep-evals
1 parent 442b299 commit 84166f3

3 files changed

+37
-21
lines changed

agents/base2/base-deep-evals.ts

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+import { createBaseDeep } from './base-deep'
+
+const definition = {
+  ...createBaseDeep({ noAskUser: true }),
+  id: 'base-deep-evals',
+  displayName: 'Buffy the Codex Evals Orchestrator',
+}
+export default definition
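
The new file relies on object spread order: properties listed after the spread override those returned by the factory, and `id` is supplied by the caller because the factory no longer returns one. The following is a minimal sketch of that pattern only. `AgentSketch` and `createBaseSketch` are hypothetical stand-ins invented for illustration, not the real `SecretAgentDefinition` type or `createBaseDeep` function.

```typescript
// Hypothetical, simplified stand-in for SecretAgentDefinition (illustration only).
type AgentSketch = {
  id: string
  displayName: string
  toolNames: string[]
}

// Hypothetical stand-in for createBaseDeep: returns everything except `id`,
// so each variant has to supply its own identifier.
function createBaseSketch(options?: {
  noAskUser?: boolean
}): Omit<AgentSketch, 'id'> {
  const { noAskUser = false } = options ?? {}
  return {
    displayName: 'Buffy the Codex Orchestrator',
    toolNames: noAskUser ? ['read_files'] : ['read_files', 'ask_user'],
  }
}

// Properties written after the spread win, so displayName is overridden here.
const evalsVariant: AgentSketch = {
  ...createBaseSketch({ noAskUser: true }),
  id: 'base-deep-evals',
  displayName: 'Buffy the Codex Evals Orchestrator',
}

// The default variant keeps the factory's displayName and only adds an id.
const defaultVariant: AgentSketch = { ...createBaseSketch(), id: 'base-deep' }
```

Returning `Omit<..., 'id'>` from the factory means a variant that forgets to set `id` fails type checking when assigned to the full definition type, rather than silently sharing an identifier with another variant.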

agents/base2/base-deep.ts

Lines changed: 28 additions & 20 deletions
@@ -1,10 +1,13 @@
+import { buildArray } from '@codebuff/common/util/array'
+
 import { publisher } from '../constants'
 import {
   PLACEHOLDER,
   type SecretAgentDefinition,
 } from '../types/secret-agent-definition'
 
-const SYSTEM_PROMPT = `You are Buffy, a strategic assistant that orchestrates complex coding tasks through specialized sub-agents. You are the AI agent behind the product, Codebuff, a CLI tool where users can chat with you to code with AI.
+function buildDeepSystemPrompt(noAskUser: boolean): string {
+  return `You are Buffy, a strategic assistant that orchestrates complex coding tasks through specialized sub-agents. You are the AI agent behind the product, Codebuff, a CLI tool where users can chat with you to code with AI.
 
 # Core Mandates
 
@@ -14,8 +17,8 @@ const SYSTEM_PROMPT = `You are Buffy, a strategic assistant that orchestrates co
 - **Spawn mentioned agents:** If the user uses "@AgentName" in their message, you must spawn that agent.
 - **Validate assumptions:** Use researchers, file pickers, and the read_files tool to verify assumptions about libraries and APIs before implementing.
 - **Proactiveness:** Fulfill the user's request thoroughly, including reasonable, directly implied follow-up actions.
-- **Confirm Ambiguity/Expansion:** Do not take significant actions beyond the clear scope of the request without confirming with the user. If asked *how* to do something, explain first, don't just do it.
-- **Ask the user about important decisions or guidance using the ask_user tool:** You should feel free to stop and ask the user for guidance if there's a an important decision to make or you need an important clarification or you're stuck and don't know what to try next. Use the ask_user tool to collaborate with the user to acheive the best possible result! Prefer to gather context first before asking questions in case you end up answering your own question.
+- **Confirm Ambiguity/Expansion:** Do not take significant actions beyond the clear scope of the request without confirming with the user. If asked *how* to do something, explain first, don't just do it.${noAskUser ? '' : `
+- **Ask the user about important decisions or guidance using the ask_user tool:** You should feel free to stop and ask the user for guidance if there's a an important decision to make or you need an important clarification or you're stuck and don't know what to try next. Use the ask_user tool to collaborate with the user to acheive the best possible result! Prefer to gather context first before asking questions in case you end up answering your own question.`}
 - **Be careful about terminal commands:** Be careful about instructing subagents to run terminal commands that could be destructive or have effects that are hard to undo (e.g. git push, git commit, running any scripts -- especially ones that could alter production environments (!), installing packages globally, etc). Don't run any of these effectful commands unless the user explicitly asks you to.
 - **Do what the user asks:** If the user asks you to do something, even running a risky terminal command, do it.
 
@@ -96,8 +99,10 @@ The following is the state of the git repository at the start of the conversatio
 
 ${PLACEHOLDER.GIT_CHANGES_PROMPT}
 `
+}
 
-const INSTRUCTIONS_PROMPT = `Act as a helpful assistant and freely respond to the user's request however would be most helpful to the user. Use your judgement to orchestrate the completion of the user's request using your specialized sub-agents and tools as needed. Take your time and be comprehensive. Don't surprise the user. For example, don't modify files if the user has not asked you to do so at least implicitly.
+function buildDeepInstructionsPrompt(noAskUser: boolean): string {
+  return `Act as a helpful assistant and freely respond to the user's request however would be most helpful to the user. Use your judgement to orchestrate the completion of the user's request using your specialized sub-agents and tools as needed. Take your time and be comprehensive. Don't surprise the user. For example, don't modify files if the user has not asked you to do so at least implicitly.
 
 Follow this 7-phase workflow for implementation tasks. For simple questions or explanations, answer directly without going through all phases.
 
@@ -138,7 +143,7 @@ Draft a spec first, then refine it with the user:
 - **Technical Approach**: How the implementation will work at a high level
 - **Files to Create/Modify**: List of files that will be touched
 - **Out of Scope**: Anything explicitly excluded
-- The spec defines WHAT to build and WHY — it should NOT include detailed implementation steps or a plan. That belongs in Phase 3.
+- The spec defines WHAT to build and WHY — it should NOT include detailed implementation steps or a plan. That belongs in Phase 3.${noAskUser ? '' : `
 3. Use the ask_user tool iteratively over MULTIPLE ROUNDS to refine the spec and clarify all aspects of the request. Ask ~2-5 focused questions per round. Continue until you have clarity on:
 - The exact scope and boundaries of the task
 - Key requirements and acceptance criteria
@@ -148,13 +153,13 @@ Draft a spec first, then refine it with the user:
 - Any constraints or preferences on implementation approach
 4. Between rounds, update SPEC.md with new information and gather additional codebase context as needed.
 5. **Do NOT ask obvious questions.** If you are >80% confident you know what the user would choose, just make that choice and move on. Only ask questions where the user's input would genuinely change the outcome.
-6. As the LAST question before finishing this phase, ask one open-ended question giving the user a chance to share any final feedback, concerns, or changes to the spec. For example: "Before I finalize the spec, is there anything else you'd like to add, change, or flag about the requirements?"
-7. Iteratively critique the spec:
+6. As the LAST question before finishing this phase, ask one open-ended question giving the user a chance to share any final feedback, concerns, or changes to the spec. For example: "Before I finalize the spec, is there anything else you'd like to add, change, or flag about the requirements?"`}
+${noAskUser ? '3' : '7'}. Iteratively critique the spec:
 a. Spawn thinker-codex to critique the spec — ask it to identify missing requirements, ambiguities, contradictions, overlooked edge cases, or technical approach issues.
 b. If the thinker raises valid critiques, update SPEC.md to address them.
 c. After updating, you MUST spawn thinker-codex again to re-critique the revised spec.
 d. Repeat until the thinker finds no new substantive critiques. Do NOT skip the re-critique — every revision must be verified.
-8. Do NOT proceed until you are confident the spec captures the full picture.
+${noAskUser ? '4' : '8'}. Do NOT proceed until you are confident the spec captures the full picture.
 
 ## Phase 3 — Plan
 
@@ -231,19 +236,22 @@ Capture learnings for future sessions:
 a. Spawn thinker-codex to critique your LESSONS.md and skill file edits — ask it to identify missing insights, improvements to existing entries, and brainstorm additional skills that could be created or updated based on the work done in this session.
 b. If the thinker suggests valid improvements or new skill ideas, update the relevant files accordingly.
 c. After updating, you MUST spawn thinker-codex again to re-critique and brainstorm further.
-d. Repeat until the thinker finds no new substantive improvements or skill ideas. Do NOT skip the re-critique — every revision must be verified.
-4. Use suggest_followups to suggest ~3 next steps the user might want to take.
+d. Repeat until the thinker finds no new substantive improvements or skill ideas. Do NOT skip the re-critique — every revision must be verified.${noAskUser ? '' : `
+4. Use suggest_followups to suggest ~3 next steps the user might want to take.`}
 
 Make sure to narrate to the user what you are doing and why you are doing it as you go along. Give a very short summary of what you accomplished at the end of your turn.
 
 ## Followup Requests
 
 If the full 7-phase workflow has already been completed in this conversation and the user is asking for a followup change (e.g. "also add X" or "tweak Y"), you do NOT need to repeat the entire workflow. Use your judgement to run only the phases that are relevant — for example, directly make the requested changes (Phase 4), do a light review (Phase 5), and run validation (Phase 6). Skip the spec, and plan phases if the request is a straightforward extension of the work already done. Still update LESSONS.md and skills if you learn anything new.
 `
+}
 
-export function createBaseDeep(): SecretAgentDefinition {
+export function createBaseDeep(options?: {
+  noAskUser?: boolean
+}): Omit<SecretAgentDefinition, 'id'> {
+  const { noAskUser = false } = options ?? {}
   return {
-    id: 'base-deep',
     publisher,
     model: 'openai/gpt-5.3-codex',
     displayName: 'Buffy the Codex Orchestrator',
@@ -266,18 +274,18 @@ export function createBaseDeep(): SecretAgentDefinition {
     },
     outputMode: 'last_message',
     includeMessageHistory: true,
-    toolNames: [
+    toolNames: buildArray(
       'spawn_agents',
       'read_files',
       'read_subtree',
-      'suggest_followups',
+      !noAskUser && 'suggest_followups',
       'apply_patch',
       'write_file',
       'write_todos',
-      'ask_user',
+      !noAskUser && 'ask_user',
       'skill',
       'set_output',
-    ],
+    ),
     spawnableAgents: [
       'file-picker',
       'code-searcher',
@@ -291,13 +299,13 @@ export function createBaseDeep(): SecretAgentDefinition {
       'gpt-5-agent',
       'context-pruner',
     ],
-    systemPrompt: SYSTEM_PROMPT,
-    instructionsPrompt: INSTRUCTIONS_PROMPT,
+    systemPrompt: buildDeepSystemPrompt(noAskUser),
+    instructionsPrompt: buildDeepInstructionsPrompt(noAskUser),
     stepPrompt: `Workflow phases reminder (7 phases):
 
 **Planning todos** (write at start): Phase 1 → Phase 2 → Phase 3
 1. Context & Research — file-pickers + code-searchers + researchers in parallel, read results
-2. Spec — draft SPEC.md, iterative ask_user to refine (skip obvious Qs), open-ended final Q, thinker-codex critique loop
+2. Spec — draft SPEC.md, ${noAskUser ? '' : 'iterative ask_user to refine (skip obvious Qs), open-ended final Q, '}thinker-codex critique loop
 3. Plan — write PLAN.md, thinker-codex critique loop
 
 **Implementation todos** (write after Plan): one todo per plan step + phases 5-7
@@ -326,5 +334,5 @@ export function createBaseDeep(): SecretAgentDefinition {
   }
 }
 
-const definition = createBaseDeep()
+const definition = { ...createBaseDeep(), id: 'base-deep' }
 export default definition
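
The `toolNames` change swaps an array literal for `buildArray(...)` with `!noAskUser && 'tool'` entries, which only works if the helper drops falsy arguments. The implementation in `@codebuff/common/util/array` is not part of this diff, so the sketch below is an assumption about its behavior. `buildArraySketch` and `toolNamesFor` are hypothetical names, and the tool list is shortened.

```typescript
// Hypothetical stand-in for buildArray; the real helper is not shown in this
// commit. Assumed behavior: keep arguments in order and drop false, null and
// undefined entries.
function buildArraySketch<T>(
  ...items: Array<T | false | null | undefined>
): T[] {
  const result: T[] = []
  for (const item of items) {
    if (item !== false && item !== null && item !== undefined) {
      result.push(item)
    }
  }
  return result
}

// Shortened version of the toolNames list from the diff. When noAskUser is
// true, `!noAskUser && '...'` evaluates to false and the entry is dropped.
function toolNamesFor(noAskUser: boolean): string[] {
  return buildArraySketch<string>(
    'spawn_agents',
    'read_files',
    !noAskUser && 'suggest_followups',
    'apply_patch',
    !noAskUser && 'ask_user',
    'set_output',
  )
}

console.log(toolNamesFor(true).join(','))
console.log(toolNamesFor(false).join(','))
```

Under that assumption, the evals variant ends up with a tool list that omits both interactive tools while keeping the remaining entries in their original order.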

evals/buffbench/main.ts

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ async function main() {
   // Use 'external:codex' for OpenAI Codex CLI
   await runBuffBench({
     evalDataPaths: [path.join(__dirname, 'eval-codebuff.json')],
-    agents: ['base-deep'],
+    agents: ['base-deep-evals'],
     taskConcurrency: 5,
   })
 
