51 commits
a114ab3
initial design
jp-agenta Apr 24, 2026
747502d
initial implementation
jp-agenta Apr 26, 2026
2220e65
initial review
jp-agenta Apr 26, 2026
0c559d8
fixed basic evals
jp-agenta Apr 26, 2026
3862dfe
Add evaluation parallelization
jp-agenta Apr 27, 2026
b958574
Parallelization checks
jp-agenta Apr 27, 2026
5b647ee
quick engine fix
jp-agenta Apr 27, 2026
c4d8368
ongoing debug
jp-agenta May 5, 2026
e049ef5
intermediate design extensions
jp-agenta May 15, 2026
50d1d8e
Merge release/v0.99.9 and evals<>queues work
jp-agenta May 15, 2026
603820f
evals<>queues implementation
jp-agenta May 15, 2026
3ebe721
latest findings
jp-agenta May 20, 2026
f68e42a
Merge release/v0.100.1
jp-agenta May 20, 2026
9b388bd
Add missing sdk files
jp-agenta May 20, 2026
7aedca9
Fix Dependency Injection in EventsDAO
jp-agenta May 20, 2026
ce1258a
bump py deps
jp-agenta May 20, 2026
4cc2f0e
extra findings
jp-agenta May 20, 2026
6bc3301
fixing findings
jp-agenta May 20, 2026
dd211e9
Merge
jp-agenta May 20, 2026
ef1d228
fixing tests and dependency injection
jp-agenta May 21, 2026
08eb3ea
clean up dependencies
jp-agenta May 21, 2026
f1cb077
deep clean up
jp-agenta May 21, 2026
d73fe65
Merge branch 'release/v0.100.1' into feat/unified-eval-loops
junaway May 21, 2026
59ed5e2
docs(eval-loops): consolidate legacy eval-loops docs into unified-eva…
jp-agenta May 21, 2026
a9d28b3
reconciliation added in edit + some fixing
jp-agenta May 21, 2026
0f91ea3
fix simple queue creation and eval processing
jp-agenta May 21, 2026
f64d299
fix services tests
jp-agenta May 21, 2026
3b3ac15
Fix domain exceptions and run (un)archival
jp-agenta May 21, 2026
c5acdbe
feat(api): mock_v0 test workflow + fix batch run-status finalization …
jp-agenta May 21, 2026
7eaf519
test(api): eval flow + flag-matrix tests; close UEL-012/016/028, file…
jp-agenta May 21, 2026
9628538
fix(api): finalize batch query→evaluator runs (UEL-029)
jp-agenta May 21, 2026
71aa2cd
fix(api): make the closed-run lock actually return 409 (UEL-031)
jp-agenta May 21, 2026
c009f0b
test(api): default-queue policy + lifecycle coverage (UEL-011); file …
jp-agenta May 21, 2026
f4f6cc0
fix(API): enforce one active default eval queue per run (UEL-030)
jp-agenta May 21, 2026
c5bdccd
test(API): cover refresh_metrics dispatch branches
jp-agenta May 21, 2026
0bf8d12
docs(eval): move closed UEL-030 into the Closed Findings section
jp-agenta May 21, 2026
8f9059e
refactor(API/SDK): exact source-family rule, resolver/over-count hard…
jp-agenta May 22, 2026
13648d4
deep tests/findings cleanup
jp-agenta May 22, 2026
461e397
fix(API): stamp updated_at on archive_queue; drop inert server_onupdate
jp-agenta May 22, 2026
98d3057
fix(API): allow default-queue is_queue sync on a closed run
jp-agenta May 22, 2026
1d3c3b5
fix(API): stamp updated_at/updated_by_id on secret + org-provider/dom…
jp-agenta May 22, 2026
a27c21a
test(SDK): unit-cover evaluate() spec parsing and normalization
jp-agenta May 22, 2026
5aa9296
test(SDK): integration + acceptance coverage for evaluate(); fix save…
jp-agenta May 22, 2026
76a9833
clean up logs and auto/custom/human
jp-agenta May 22, 2026
026547e
docs(eval): origin execution model — today (human/auto/custom) and fu…
jp-agenta May 22, 2026
e9ff86d
bump py deps
jp-agenta May 22, 2026
423800e
fixing tensor
jp-agenta May 22, 2026
30f53c8
split refresh metrics from process slice
jp-agenta May 22, 2026
92d733e
add live evaluation tests
jp-agenta May 22, 2026
ae5fc5a
Merge release/v0.100.2
jp-agenta May 22, 2026
85f4b68
merge v0.100.2
jp-agenta May 26, 2026
22 changes: 22 additions & 0 deletions AGENTS.md
Expand Up @@ -173,6 +173,28 @@ Concrete examples:
- Legacy app storage marker (`WORKFLOW_MARKER_KEY`): `api/oss/src/core/applications/service.py`
- Legacy dedup key normalization (`__dedup_id__` <-> `testcase_dedup_id`): `api/oss/src/apis/fastapi/testsets/router.py`

### Alembic migration chains (OSS + EE)

Migrations live in two separate, parallel chains that must each resolve to a
single head:
- OSS: `api/oss/databases/postgres/migrations/core/versions/`
- EE: `api/ee/databases/postgres/migrations/core/versions/`

Rules:
- After adding/editing/renaming any migration, verify each chain has exactly ONE
head with the bundled tool: from each `.../migrations/` directory run
`python3 find_head.py core` and confirm the `Heads:` list has a single entry.
Run it for BOTH OSS and EE.
- New migrations chain linearly after the existing head — never fork off an older
node (a fork produces two heads; alembic then can't resolve a linear upgrade).
- Revision ids must be unique within a chain. A duplicate id makes
Alembic silently skip one file (the migration never runs).
- The EE chain extends past the shared OSS head with EE-only migrations, so an
OSS migration chains after the OSS head while its EE copy chains after the EE
head — same revision id, different `down_revision`.
- `evaluation_runs.data` / `evaluation_queues.data` are `json` columns (not
`jsonb`); cast `data::jsonb` before using jsonb operators / `jsonb_array_elements`.
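
The single-head, no-fork, and no-duplicate rules above can be sketched as a small check. This is an illustrative stand-in, not the bundled `find_head.py`: the `Migration` tuple, the function names, and the `rev_*` ids are all hypothetical, and a real check would read `revision` / `down_revision` from the files under `versions/`.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical in-memory stand-in for a versions/ directory:
# one (revision id, down_revision) pair per migration file.
Migration = Tuple[str, Optional[str]]


def find_heads(migrations: List[Migration]) -> List[str]:
    """Return revisions that no other migration names as its down_revision."""
    parents = {down for _, down in migrations if down is not None}
    return sorted({rev for rev, _ in migrations} - parents)


def find_duplicates(migrations: List[Migration]) -> List[str]:
    """Return revision ids that appear in more than one migration file."""
    counts = Counter(rev for rev, _ in migrations)
    return sorted(rev for rev, n in counts.items() if n > 1)


# Made-up revision ids, used only to illustrate the three cases.
linear = [("rev_a", None), ("rev_b", "rev_a"), ("rev_c", "rev_b")]
forked = linear + [("rev_d", "rev_a")]      # forks off an older node
duplicated = linear + [("rev_b", "rev_a")]  # reuses an existing id
```

In this sketch `find_heads(linear)` yields one head, `find_heads(forked)` yields two, and `find_duplicates(duplicated)` reports the reused id, which are the situations the rules are meant to catch.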

### Router and function style conventions

Router style:
Expand Up @@ -254,9 +254,11 @@ def check_url_safety(cls, v: Any) -> Any: # noqa: N805
return v

from oss.src.dbs.postgres.git.mappings import map_dto_to_dbe
-from oss.src.dbs.postgres.shared.engine import engine as db_engine
+from oss.src.dbs.postgres.shared.engine import get_transactions_engine
from datetime import datetime, timezone
+
+db_engine = get_transactions_engine()

workflow_create = WorkflowCreate(
**application_create.model_dump(mode="json"),
)
Expand All @@ -267,7 +269,7 @@ def check_url_safety(cls, v: Any) -> Any: # noqa: N805

# Avoid slug collision with existing workflow artifacts (e.g. evaluators)
artifact_slug = git_artifact_create.slug
-async with db_engine.core_session() as session:
+async with db_engine.session() as session:
existing = (
await session.execute(
select(WorkflowArtifactDBE).filter(
Expand Down Expand Up @@ -298,7 +300,7 @@ def check_url_safety(cls, v: Any) -> Any: # noqa: N805
dto=artifact_dto,
)

-async with db_engine.core_session() as session:
+async with db_engine.session() as session:
session.add(artifact_dbe)
await session.commit()

Expand Down Expand Up @@ -364,7 +366,7 @@ def check_url_safety(cls, v: Any) -> Any: # noqa: N805
dto=variant_dto,
)

-async with db_engine.core_session() as session:
+async with db_engine.session() as session:
session.add(variant_dbe)
await session.commit()

Expand Down Expand Up @@ -415,7 +417,7 @@ def check_url_safety(cls, v: Any) -> Any: # noqa: N805
dto=revision_dto,
)

-async with db_engine.core_session() as session:
+async with db_engine.session() as session:
session.add(revision_dbe)
await session.commit()

4 changes: 2 additions & 2 deletions api/ee/databases/postgres/migrations/core/env.py
Expand Up @@ -6,7 +6,7 @@

from alembic import context

-from oss.src.dbs.postgres.shared.engine import engine
+from oss.src.utils.env import env
from oss.src.dbs.postgres.shared.base import Base

# Side-effect imports: register SQLAlchemy models with Base.metadata
Expand All @@ -29,7 +29,7 @@
# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config
-config.set_main_option("sqlalchemy.url", engine.postgres_uri_core) # type: ignore
+config.set_main_option("sqlalchemy.url", env.postgres.uri_core)


# Interpret the config file for Python logging.
@@ -0,0 +1,38 @@
"""add default evaluation queues

Revision ID: a1d2e3f4a5b6
Revises: b2c3d4e5f7a8
Create Date: 2026-05-15 00:00:00

Previously shared revision id `a1b2c3d4e5f6` with
`drop_corrupted_metrics_for_some_runs`, so alembic skipped it and the index
below never ran. Renamed to `a1d2e3f4a5b6`. The EE chain extends past the shared
OSS head `e6f7a8b9c0d1` with EE-only migrations
(`9d3e8f0a1b2c -> a1b2c3d4e5f7 -> b2c3d4e5f7a8`), so this EE copy chains after
`b2c3d4e5f7a8` while the OSS copy chains after `e6f7a8b9c0d1`.

The partial unique index is scoped to ACTIVE default queues (`deleted_at IS
NULL`) so a default can be archived then recreated/unarchived by reconcile.
"""

from typing import Sequence, Union

from alembic import op

revision: str = "a1d2e3f4a5b6"
down_revision: Union[str, None] = "b2c3d4e5f7a8"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.execute("DROP INDEX IF EXISTS ux_evaluation_queues_default_per_run")
    op.execute("""
        CREATE UNIQUE INDEX ux_evaluation_queues_default_per_run
        ON evaluation_queues (project_id, run_id)
        WHERE (flags ->> 'is_default')::boolean = true AND deleted_at IS NULL
    """)


def downgrade() -> None:
    op.execute("DROP INDEX IF EXISTS ux_evaluation_queues_default_per_run")
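
The archive-then-recreate behaviour the docstring describes can be tried out with a partial unique index in an in-memory database. The sketch below swaps in SQLite (which also supports partial indexes) for PostgreSQL and replaces the `flags ->> 'is_default'` expression with a plain `is_default` integer column; both simplifications, and the function name, are assumptions made for illustration only.

```python
import sqlite3
from typing import List


def demo_partial_unique_index() -> List[str]:
    """Replay the archive-then-recreate scenario on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE evaluation_queues (
            id INTEGER PRIMARY KEY,
            project_id TEXT NOT NULL,
            run_id TEXT NOT NULL,
            is_default INTEGER NOT NULL,  -- simplified stand-in for flags ->> 'is_default'
            deleted_at TEXT
        )
        """
    )
    # Uniqueness applies only to rows matching the WHERE predicate.
    conn.execute(
        """
        CREATE UNIQUE INDEX ux_evaluation_queues_default_per_run
        ON evaluation_queues (project_id, run_id)
        WHERE is_default = 1 AND deleted_at IS NULL
        """
    )
    insert = (
        "INSERT INTO evaluation_queues (project_id, run_id, is_default, deleted_at) "
        "VALUES (?, ?, ?, NULL)"
    )
    events: List[str] = []

    conn.execute(insert, ("p1", "r1", 1))
    events.append("first default inserted")

    try:
        conn.execute(insert, ("p1", "r1", 1))
        events.append("second default inserted")
    except sqlite3.IntegrityError:
        events.append("second default rejected")

    # Archiving takes the row out of the index predicate...
    conn.execute(
        "UPDATE evaluation_queues SET deleted_at = '2026-05-15' "
        "WHERE run_id = 'r1' AND is_default = 1"
    )
    # ...so a fresh active default is allowed again.
    conn.execute(insert, ("p1", "r1", 1))
    events.append("default recreated after archive")

    # Non-default queues fall outside the predicate and never collide.
    conn.execute(insert, ("p1", "r1", 0))
    conn.execute(insert, ("p1", "r1", 0))
    events.append("two non-default queues inserted")

    conn.close()
    return events
```

Scoping the predicate to `deleted_at IS NULL` is what lets an archived default coexist with a new active one; without that clause the third insert would be rejected as well.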
@@ -0,0 +1,145 @@
"""backfill default evaluation queues

Revision ID: a2b3c4d5e6f8
Revises: a1d2e3f4a5b6
Create Date: 2026-05-15 00:10:00

Backfills source-family flags (`has_traces` / `has_testcases` / `has_queries` /
`has_testsets`) to match the runtime derivation rule, then mass-creates default
queues per the runtime policy and recomputes `is_queue`. The query/testset
recompute keys on exact reference-key presence, not a substring match.
"""

from typing import Sequence, Union

from alembic import op

revision: str = "a2b3c4d5e6f8"
down_revision: Union[str, None] = "a1d2e3f4a5b6"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
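
The exact-key rule described in the docstring can be sketched in plain Python as follows. This mirrors the SQL in this migration's `upgrade()` rather than the runtime code in `dbs/postgres/evaluations/utils.py`, which is not shown here; the function and constant names are hypothetical, and treating a null `references` value as empty is an assumption of the sketch.

```python
from typing import Any, Dict, List

# Step keys treated as direct (reference-less) sources, per the migration SQL.
DIRECT_TRACE_KEYS = {"traces", "query-direct"}
DIRECT_TESTCASE_KEYS = {"testcases", "testset-direct"}


def derive_source_flags(steps: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Derive source-family flags from a run's steps (illustrative only)."""
    flags = {
        "has_traces": False,
        "has_testcases": False,
        "has_queries": False,
        "has_testsets": False,
    }
    for step in steps:
        if step.get("type") != "input":
            continue
        # Assumption: a missing or null `references` counts as empty.
        references = step.get("references") or {}
        key = (step.get("key") or "").lower()
        if not references:
            # Direct sources: exact step key on a reference-less input.
            flags["has_traces"] |= key in DIRECT_TRACE_KEYS
            flags["has_testcases"] |= key in DIRECT_TESTCASE_KEYS
        # Reference-backed sources: exact key presence, not a substring match.
        flags["has_queries"] |= "query_revision" in references
        flags["has_testsets"] |= "testset_revision" in references
    return flags
```

With this rule a step whose references contain only `query_anchor` does not set `has_queries`, which is the misfire the substring match would have produced.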


def upgrade() -> None:
    # Backfill source-family flags to match the runtime derivation rule in
    # dbs/postgres/evaluations/utils.py. `data` is a `json` column, so cast to
    # jsonb before navigating. Direct sources (`has_traces`/`has_testcases`) come
    # from the exact step key on a reference-less input; reference-backed sources
    # (`has_queries`/`has_testsets`) from exact-key presence (JSONB `?`), not a
    # substring match that would misfire on `query_anchor` / `testset_metadata`.
    op.execute("""
        UPDATE evaluation_runs
        SET flags = COALESCE(flags, '{}'::jsonb)
            || jsonb_build_object(
                'has_traces', EXISTS (
                    SELECT 1
                    FROM jsonb_array_elements(COALESCE(data::jsonb -> 'steps', '[]'::jsonb)) AS step
                    WHERE step ->> 'type' = 'input'
                      AND COALESCE(step -> 'references', '{}'::jsonb) = '{}'::jsonb
                      AND lower(COALESCE(step ->> 'key', '')) IN ('traces', 'query-direct')
                ),
                'has_testcases', EXISTS (
                    SELECT 1
                    FROM jsonb_array_elements(COALESCE(data::jsonb -> 'steps', '[]'::jsonb)) AS step
                    WHERE step ->> 'type' = 'input'
                      AND COALESCE(step -> 'references', '{}'::jsonb) = '{}'::jsonb
                      AND lower(COALESCE(step ->> 'key', '')) IN ('testcases', 'testset-direct')
                ),
                'has_queries', EXISTS (
                    SELECT 1
                    FROM jsonb_array_elements(COALESCE(data::jsonb -> 'steps', '[]'::jsonb)) AS step
                    WHERE step ->> 'type' = 'input'
                      AND COALESCE(step -> 'references', '{}'::jsonb) ? 'query_revision'
                ),
                'has_testsets', EXISTS (
                    SELECT 1
                    FROM jsonb_array_elements(COALESCE(data::jsonb -> 'steps', '[]'::jsonb)) AS step
                    WHERE step ->> 'type' = 'input'
                      AND COALESCE(step -> 'references', '{}'::jsonb) ? 'testset_revision'
                )
            )
    """)

    # Mass-create default queues, mirroring the runtime create policy in
    # EvaluationsService._reconcile_default_queue: a default queue should exist
    # only for runs that should have one. The runtime predicate is
    # `EVALUATIONS_DEFAULT_QUEUES_FOR_ALL_RUNS or has_human`, with the env toggle
    # currently hardcoded False, so the backfill condition is `has_human = true`.
    # Existing default queues, active or archived, are preserved and block
    # duplicates. The created queue carries the run's own status instead of a
    # hardcoded 'running', so closed/successful runs are not misrepresented.
    op.execute("""
        INSERT INTO evaluation_queues (
            project_id,
            id,
            created_at,
            created_by_id,
            flags,
            data,
            status,
            run_id
        )
        SELECT
            r.project_id,
            gen_random_uuid(),
            CURRENT_TIMESTAMP,
            r.created_by_id,
            jsonb_build_object('is_default', true, 'is_sequential', false),
            '{}'::json,
            COALESCE(r.status, 'running'),
            r.id
        FROM evaluation_runs r
        WHERE COALESCE((r.flags ->> 'has_human')::boolean, false) = true
          AND NOT EXISTS (
              SELECT 1
              FROM evaluation_queues q
              WHERE q.project_id = r.project_id
                AND q.run_id = r.id
                AND (q.flags ->> 'is_default')::boolean = true
          )
    """)

    # Reconcile the other direction: runs that should NOT have a default queue
    # (has_human = false under the current policy) but carry a stale active
    # default queue get that queue archived, matching the runtime archive branch
    # in _reconcile_default_queue. This keeps the fleet consistent immediately
    # instead of waiting for the first per-run edit to reconcile.
    op.execute("""
        UPDATE evaluation_queues q
        SET deleted_at = CURRENT_TIMESTAMP,
            deleted_by_id = r.created_by_id
        FROM evaluation_runs r
        WHERE q.project_id = r.project_id
          AND q.run_id = r.id
          AND (q.flags ->> 'is_default')::boolean = true
          AND q.deleted_at IS NULL
          AND COALESCE((r.flags ->> 'has_human')::boolean, false) = false
    """)

    # Recompute simple-queue eligibility under the new meaning. An already
    # existing active default queue is as valid as one inserted above.
    op.execute("""
        UPDATE evaluation_runs r
        SET flags = COALESCE(r.flags, '{}'::jsonb)
            || jsonb_build_object(
                'is_queue',
                COALESCE((r.flags ->> 'has_human')::boolean, false)
                AND EXISTS (
                    SELECT 1
                    FROM evaluation_queues q
                    WHERE q.project_id = r.project_id
                      AND q.run_id = r.id
                      AND (q.flags ->> 'is_default')::boolean = true
                      AND q.deleted_at IS NULL
                )
            )
    """)

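The create / archive / `is_queue` steps in `upgrade()` can be read as a per-run decision. The sketch below models that decision in memory under stated assumptions: `backfill_run` and the `"active"` / `"archived"` labels are hypothetical, and the toggle constant only mirrors the value the comments describe as hardcoded `False`. It is not the runtime `_reconcile_default_queue`.

```python
from typing import List, Tuple

# Mirrors the toggle the migration comments describe as hardcoded False.
DEFAULT_QUEUES_FOR_ALL_RUNS = False


def backfill_run(has_human: bool, default_queues: List[str]) -> Tuple[str, bool]:
    """Return (action, is_queue) for one run.

    `default_queues` lists the state of each existing default queue for the
    run, as "active" or "archived" (hypothetical labels for this sketch).
    """
    should_have = DEFAULT_QUEUES_FOR_ALL_RUNS or has_human
    queues = list(default_queues)
    action = "none"

    if should_have and not queues:
        # No default queue in any state: create one.
        queues.append("active")
        action = "create"
    elif not should_have and "active" in queues:
        # Stale active default on a run that should not have one: archive it.
        queues = ["archived" for _ in queues]
        action = "archive"

    # is_queue requires has_human AND an active default queue.
    is_queue = has_human and "active" in queues
    return action, is_queue
```

One consequence visible in this model: a `has_human` run whose only default queue is archived gets no new queue, because any existing default blocks creation, and so ends up with `is_queue = False` after the backfill.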

def downgrade() -> None:
    # Keep generated queues/results intact on downgrade. Remove only the newly
    # inferred flags; old is_queue semantics cannot be reconstructed safely.
    op.execute("""
        UPDATE evaluation_runs
        SET flags = COALESCE(flags, '{}'::jsonb) - 'has_traces' - 'has_testcases'
    """)
4 changes: 2 additions & 2 deletions api/ee/databases/postgres/migrations/tracing/env.py
Expand Up @@ -7,7 +7,7 @@

from alembic import context

-from oss.src.dbs.postgres.shared.engine import engine
+from oss.src.utils.env import env
from oss.src.dbs.postgres.shared.base import Base

# Side-effect import: register SQLAlchemy model with Base.metadata
Expand All @@ -19,7 +19,7 @@
# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config
-config.set_main_option("sqlalchemy.url", engine.postgres_uri_tracing) # type: ignore
+config.set_main_option("sqlalchemy.url", env.postgres.uri_tracing)


# Interpret the config file for Python logging.
6 changes: 3 additions & 3 deletions api/ee/src/apis/fastapi/access/router.py
Expand Up @@ -4,8 +4,8 @@

from oss.src.utils.exceptions import intercept_exceptions

-from ee.src.core.entitlements.types import SCOPES, Tracker
-from ee.src.core.entitlements.controls import (
+from ee.src.core.access.entitlements.types import SCOPES, Tracker
+from ee.src.core.access.controls import (
get_plans,
get_plan_description,
get_roles,
Expand Down Expand Up @@ -70,6 +70,6 @@ async def fetch_roles(self) -> Dict[str, List[Dict[str, Any]]]:
verbatim from access-controls, including the `"*"` wildcard for
`owner` — callers that need to render the full permission list
should expand the wildcard themselves (see
-`ee.src.services.converters._expand_permissions`).
+`ee.src.services.db_manager_ee._expand_permissions`).
"""
return {scope: list(get_roles(scope)) for scope in SCOPES}
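
The wildcard expansion that the docstring leaves to callers could look roughly like the sketch below. It is not the `_expand_permissions` helper the docstring points to, whose signature is not shown in this diff; the function name, the flat list-of-strings shape, and the sample permission names are assumptions for illustration.

```python
from typing import List

WILDCARD = "*"


def expand_permissions(role_permissions: List[str], all_permissions: List[str]) -> List[str]:
    """Replace a "*" entry with the full permission list (hypothetical helper)."""
    if WILDCARD in role_permissions:
        # The wildcard grants everything, so any other entries are redundant.
        return list(all_permissions)
    return list(role_permissions)


# Made-up permission names, for illustration only.
ALL_PERMISSIONS = ["read_system", "edit_workspace", "delete_workspace"]
owner = expand_permissions(["*"], ALL_PERMISSIONS)
viewer = expand_permissions(["read_system"], ALL_PERMISSIONS)
```

Returning copies keeps callers from mutating the shared permission catalogue when they render or filter the expanded list.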
10 changes: 5 additions & 5 deletions api/ee/src/apis/fastapi/billing/router.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
from oss.src.utils.caching import acquire_lock, release_lock, renew_lock
from oss.src.utils.env import env

-from ee.src.utils.entitlements import period_from, scope_from
+from ee.src.core.access.entitlements.service import period_from, scope_from
from ee.src.core.meters.types import Meters, MeterPeriod
from oss.src.utils.context import get_auth_scope

Expand All @@ -24,10 +24,10 @@
)

from ee.src.services import db_manager_ee
-from ee.src.utils.permissions import check_action_access
-from ee.src.models.shared_models import Permission
-from ee.src.core.entitlements.types import Tracker, Quota, Period, Scope
-from ee.src.core.entitlements.controls import get_plan_entitlements, get_plans
+from ee.src.core.access.permissions.service import check_action_access
+from ee.src.core.access.permissions.types import Permission
+from ee.src.core.access.entitlements.types import Tracker, Quota, Period, Scope
+from ee.src.core.access.controls import get_plan_entitlements, get_plans
from ee.src.core.subscriptions.settings import (
get_catalog,
get_pricing,
6 changes: 3 additions & 3 deletions api/ee/src/apis/fastapi/organizations/router.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,16 @@
from fastapi import APIRouter, Request, HTTPException
from fastapi.responses import JSONResponse, Response

-from ee.src.utils.permissions import check_user_org_access
-from ee.src.utils.entitlements import (
+from ee.src.core.access.permissions.service import check_user_org_access
+from ee.src.core.access.entitlements.service import (
check_entitlements,
NOT_ENTITLED_RESPONSE,
Tracker,
Flag,
)

from ee.src.services import db_manager_ee
-from ee.src.services.selectors import get_user_org_and_workspace_id
+from ee.src.services.db_manager_ee import get_user_org_and_workspace_id
from ee.src.services.organization_service import (
OrganizationDomainsService,
OrganizationProvidersService,