
[OAI] Allow forcing Responses API for non-gpt-5 model names#190

Open
Kenny Wong (wong-codaio) wants to merge 3 commits into
braintrustdata:mainfrom
wong-codaio:wong/oai/force-responses-api

Conversation

Contributor

@wong-codaio Kenny Wong (wong-codaio) commented May 29, 2026

Summary

[OAI] Allow forcing Responses API for non-gpt-5 model names

  • per-call use_responses_api (py) / useResponsesApi (js) flag forces the Responses API. routing becomes isGPT5Model(model) || useResponsesApi; flag is stripped before the request.
  • motivation: internal proxies may rewrite the model name for routing (e.g. a service-tier prefix), so a model that requires the Responses API can arrive under a name that doesn't start with gpt-5. the name check then sends it to Chat Completions and it fails, with no way to override. this flag lets such a model work regardless of its name.
  • per-call, not global: the model is chosen per call, so a global switch can't say "this model yes, that model no". keeps it next to model, like temperature/maxTokens.
  • also fixes a Responses-API bug found while testing: reasoning_effort was sent top-level (the API wants reasoning.effort), so any reasoning call routed to Responses 400'd.
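The routing rule in the first bullet can be sketched as follows. This is only an illustration, not the code in this PR: `is_gpt5_model` and `route_request` are hypothetical names, the prefix check is an assumption based on the description above, and the real implementation may differ.

```python
from typing import Any


def is_gpt5_model(model: str) -> bool:
    """Assumed name-based check: only names starting with "gpt-5" match."""
    return model.startswith("gpt-5")


def route_request(params: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Pick an endpoint and return the payload with the routing flag removed."""
    payload = dict(params)  # copy so the caller's dict is not mutated
    # The flag only steers routing; strip it so it is never sent to the API.
    force = bool(payload.pop("use_responses_api", False))
    use_responses = is_gpt5_model(payload["model"]) or force
    endpoint = "/responses" if use_responses else "/chat/completions"
    return endpoint, payload


# A renamed model (hypothetical "tier/" prefix) no longer matches the name
# check, so only the per-call flag can send it to the Responses API.
print(route_request({"model": "tier/gpt-5.4"})[0])
print(route_request({"model": "tier/gpt-5.4", "use_responses_api": True})[0])
```

Because the flag travels with the other per-call parameters, two scorers in the same process can route differently without any shared state.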

PTAL:
FYI:

Test plan

  • unit tests (js + py, incl. built-in named scorers and reasoning.effort)
  • manual smoke test — scratch scripts below, each runs a scorer 3 ways and prints the endpoint hit:
OPENAI_API_KEY=sk-... [OPENAI_BASE_URL=https://us.api.openai.com/v1] python test.py
OPENAI_API_KEY=sk-... [OPENAI_BASE_URL=https://us.api.openai.com/v1] node test.mjs   # after `pnpm run build`
test.py
"""Scratch check: gpt-4.1 supports both Chat Completions and Responses APIs.
Run with OPENAI_API_KEY set. The request hook prints which endpoint each call hits.
If your org is region-pinned, also set OPENAI_BASE_URL (e.g. https://us.api.openai.com/v1):
  OPENAI_API_KEY=sk-... OPENAI_BASE_URL=https://us.api.openai.com/v1 python test.py
"""

import os

import httpx
from openai import OpenAI

from autoevals import Factuality, LLMClassifier, init

init(
    OpenAI(
        base_url=os.environ.get("OPENAI_BASE_URL"),  # None → SDK default (api.openai.com)
        http_client=httpx.Client(event_hooks={"request": [lambda r: print("  request →", r.url.path)]}),
    )
)

data = dict(output="6", expected="6", input="Add the numbers 1, 2, 3")

print("gpt-4.1 (default → expect /chat/completions):")
print("  score =", Factuality(model="gpt-4.1").eval(**data).score)

print("gpt-4.1 + use_responses_api=True (→ expect /responses):")
print("  score =", Factuality(model="gpt-4.1", use_responses_api=True).eval(**data).score)

# Built-in named scorers don't forward reasoning_effort yet, so use LLMClassifier here.
print("gpt-5.4 + medium reasoning (gpt-5 family → expect /responses):")
clf = LLMClassifier(
    name="match",
    prompt_template="Is the submission {{output}} equal to {{expected}}? Answer Y or N.",
    choice_scores={"Y": 1, "N": 0},
    model="gpt-5.4",
    reasoning_effort="medium",
)
print("  score =", clf.eval(**data).score)
test.mjs
// Scratch check: gpt-4.1 supports both Chat Completions and Responses APIs.
// Run with OPENAI_API_KEY set. The fetch wrapper prints which endpoint each call hits.
// If your org is region-pinned, also set OPENAI_BASE_URL (e.g. https://us.api.openai.com/v1):
//   OPENAI_API_KEY=sk-... OPENAI_BASE_URL=https://us.api.openai.com/v1 node test.mjs
import { OpenAI } from "openai";
import { Factuality, LLMClassifierFromTemplate, init } from "./jsdist/index.mjs";

const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL, // undefined → SDK default (api.openai.com)
  fetch: (url, opts) => {
    const u = url instanceof Request ? url.url : url; // accepts a string, URL, or Request
    console.log("  request →", new URL(u).pathname);
    return fetch(url, opts);
  },
});
init({ client });

const data = { output: "6", expected: "6", input: "Add the numbers 1, 2, 3" };

console.log("gpt-4.1 (default → expect /chat/completions):");
console.log("  score =", (await Factuality({ ...data, model: "gpt-4.1" })).score);

console.log("gpt-4.1 + useResponsesApi:true (→ expect /responses):");
console.log(
  "  score =",
  (await Factuality({ ...data, model: "gpt-4.1", useResponsesApi: true })).score,
);

// Built-in named scorers don't forward reasoningEffort yet, so use LLMClassifierFromTemplate here.
console.log("gpt-5.4 + medium reasoning (gpt-5 family → expect /responses):");
const clf = LLMClassifierFromTemplate({
  name: "match",
  promptTemplate: "Is the submission {{output}} equal to {{expected}}? Answer Y or N.",
  choiceScores: { Y: 1, N: 0 },
  model: "gpt-5.4",
  reasoningEffort: "medium",
});
console.log("  score =", (await clf({ ...data })).score);

Proxy/internal setups can serve a GPT-5 model under a name that doesn't
start with "gpt-5", so the name-based isGPT5Model() check alone can't route
them to the Responses API. Add a per-call use_responses_api / useResponsesApi
flag (camelCase at the scorer layer, snake_case in CachedLLMParams) so callers
can force it; the flag is stripped before the request is sent.

SpecFileClassifier.__new__ has a fixed kwarg list, so Factuality(use_responses_api=True)
and the other named scorers raised TypeError. Forward the flag like the other model knobs.

The Responses API rejects a top-level reasoning_effort param ("moved to
reasoning.effort"), so reasoning calls routed to it 400'd. Nest it correctly
in both languages.
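
The reasoning fix described in the last commit message amounts to moving one key. A minimal sketch, assuming the request parameters are a plain dict; `to_responses_params` is a hypothetical helper name and not a function from this repository:

```python
from typing import Any


def to_responses_params(params: dict[str, Any]) -> dict[str, Any]:
    """Move a top-level reasoning_effort under reasoning.effort."""
    out = dict(params)  # shallow copy; the input dict is left untouched
    effort = out.pop("reasoning_effort", None)
    if effort is not None:
        # Preserve any reasoning settings that are already present.
        reasoning = dict(out.get("reasoning") or {})
        reasoning["effort"] = effort
        out["reasoning"] = reasoning
    return out


print(to_responses_params({"model": "gpt-5.4", "reasoning_effort": "medium"}))
```

Calls that do not set reasoning_effort pass through unchanged, so a conversion like this is only needed on the Responses path.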
@wong-codaio Kenny Wong (wong-codaio) marked this pull request as ready for review May 29, 2026 12:28