Introducing BCal evaluation by haoranpb · Pull Request #669 · microsoft/BC-Bench

haoranpb · 2026-06-18T13:21:00Z

BCal evaluation is a bit of a special case: it leverages BC-Bench for offline evaluation.

Pending:

Cleanup of the dataset
Make the evaluation a cron job (maybe weekly)
Investigate which BCal version is used (do we always get the latest one?)

Fixes hallucinated AL symbols and page-id mismatches verified against baseapp_src/ (BC 28.0):

* block-invoice-posting-without-email-1: OnBeforeCheckSalesDoc -> OnBeforeCheckSalesDocument (codeunit Sales-Post)
* warehouse-activity-released-notification-1: OnAfterReleaseWhseShipment -> OnAfterReleaseWarehouseShipment
* login-audit-subscriber-1: OnAfterUserLoggedIn -> OnAfterLogin (codeunit System Initialization)
* rc-headlines-rotating-daily-1: Page 9026 -> 9022 (Business Manager Role Center)
* rc-profile-and-profile-extension-1: Page 9026 -> 9022 (Business Manager Role Center)

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 18 -> 13.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The 'Install bcal CLI from internal feed' step calls aka.ms/install-artifacts-credprovider.ps1, which in turn hits the GitHub release API to locate the credprovider binary. That API returns 403 (rate-limited) intermittently on cold runs. In run 27207573872 this flake killed one matrix job out of 130.

Wrap the install in a 3-attempt loop with exponential backoff (1s, 4s, 16s) so a transient 403 no longer fails the entire entry. The failure mode is visible and retried up to three times before throwing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
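The retry described above lives in the workflow's install step; the same idea can be sketched in Python for illustration. The function name `retry_with_backoff` and its parameters are hypothetical and not taken from the PR. With three attempts this sketch sleeps only between attempts (1s, then 4s); how the 16s slot is used in the actual step is not stated in the commit message.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception with exponential backoff.

    Hypothetical helper: the delay before retry k is base_delay * factor**k,
    which gives 1s, 4s, 16s, ... for the defaults. The last failure is
    re-raised so the caller still sees the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * factor**attempt
            print(f"attempt {attempt + 1}/{attempts} failed ({exc}); retrying in {delay:g}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would normally leave it at the default.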

For each entry, the bad event name was replaced with a verified real publisher from baseapp_src/ (BC 28.0). Replacements are surgical (one text substitution per entry, no semantic rewrite).

* item-image-format-validation-subscriber-1: drop the OnAfterUploadFile alternation; OnBeforeUploadFile is the real publisher
* attach-quote-pdf-to-order-1: OnAfterMakeSalesOrder on 'Make Order' -> OnAfterSalesQuoteToOrderRun on 'Sales-Quote to Order (Yes/No)' or OnAfterInsertAllSalesOrderLines on 'Sales-Quote to Order'
* quality-inspection-required-line-1: OnAfterInsertPurchRcptLine -> OnAfterPurchRcptLineInsert ('Purch.-Post')
* bin-priority-pageext-1: OnBeforePickLines -> OnBeforePickAccordingToFEFO on 'Create Pick'
* lead-source-enum-shared-1: OnAfterContactToCustomer -> OnBeforeCreateCustomerFromTemplate on Contact
* hard-job-wip-recognition-method-1: OnAfterCalculateWIP -> OnAfterCalcWIP / OnBeforeCalcRecognizedCosts / OnBeforeCalcRecognizedSales on 'Job Calculate WIP'
* hard-deferral-template-on-gl-account-1: OnAfterAssignFieldsForNo is on TABLE 'Sales Line', not codeunit 'Sales Line - Reserve'; fix the location reference
* hard-document-attachment-offload-blob-1: OnAfterImportFromStream on codeunit -> OnBeforeImportFromStream on table 'Document Attachment'
* hard-bank-recon-cheque-and-tolerance-match-1: OnAfterApplyAutomaticMatches / OnAfterMatchManually -> OnAfterMatchBankPayments on 'Match Bank Pmt. Appl.' and OnAfterApplyEntries on 'Bank Acc. Entry Set Recon.-No.'

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 13 -> 6. Of the remaining 6, 4 are destined for C.2 (no real publisher) and 2 are lint false positives (real table-level events not in the codeunit-only symbol index: OnAfterAssignFieldsForNo, OnBeforeImportFromStream).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ents

Unlike C.1 (surgical event-name swaps), these 4 entries asserted events that simply do not exist in BC. The rubric had to be reshaped to the canonical pattern instead.

* routing-quality-check-flag-1, prod-order-release-notification-1: OnAfterReleaseProductionOrder does not exist. Releasing IS setting Status::Released. Rewrote both rubrics to require subscribing to the auto-generated OnAfterValidateEvent for field Status on table 'Production Order', guarding on xRec.Status <> Rec.Status and Rec.Status = Status::Released.
* hard-dimension-priority-and-mandatory-1: OnAfterCheckMandatoryDimensions does not exist on codeunit 408. The real publisher that fires after the standard posting-dim check is OnAfterCheckDimValuePosting. Rewrote both nl_prompt and the relevant expected[] item.
* hard-feature-management-flag-with-cohort-telemetry-1: OnGetFeatureKeyList and OnAfterFeatureKeyList do not exist. The real publisher on the System app 'Feature Management Facade' codeunit is OnGetFeatureKey. Dropped the hallucinated alternations.

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 6 -> 2. The remaining 2 are known lint false positives (real table-level events OnAfterAssignFieldsForNo and OnBeforeImportFromStream, not in the codeunit-only symbol_28_0.json index but verified in baseapp_src/).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Common setup steps that were running 16x in parallel on every matrix job are now run exactly once in a new 'prepare-workspace' job, uploaded as an artifact, and restored in each matrix job. Saves ~3 CPU-h per attempt (measured ~1.5 min/job setup overhead * 128 entries).

Hoisted:

* credprovider plugin install (with the 3-attempt retry from B.3)
* bcal CLI install (now via --tool-path to a portable folder, restored into \C:\Users\martinsrui/.dotnet/tools in the matrix job)
* bc-eval CAPI bridge venv (only when llm-backend=external-command; uv-created venv is relocatable as long as uv python install 3.12 has been run in the restoring job)
* BcContainerHelper module (Save-Module to portable folder, then prepended to PSModulePath via GITHUB_ENV in the matrix job)

NOT hoisted (per-job by design):

* Checkout / setup-python-uv (matrix job needs bcbench installed for 'uv run bcbench evaluate bcal'; setup-uv action's cache makes this cheap anyway)
* Azure Login OIDC for Key Vault (token lifetime is per-job; azure-openai backend doesn't need it at all)
* Download BC Application symbols (per-entry; A.2 will cache the bcartifacts payload separately)
* Run BCal (the actual work)

Behavior changes worth noting:

* The matrix job no longer logs in to Azure for the ADO feed -- it doesn't need an ADO token anymore (nothing pulls from the feed after restore). The CAPI Key Vault login (still per-job, still gated on external-command) is the only Azure Login left.
* If llm-backend=external-command the prepare job builds the venv; for azure-openai it's skipped. The matrix-job restore step also skips moving .bcal-capi-venv into place if it isn't in the artifact.

Untested in CI. To verify, run workflow_dispatch with test-run=true and each llm-backend value, confirm prepare-workspace runs once and matrix jobs no longer execute Install-Module / dotnet tool install / credprovider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The BC sandbox artifact (~3 GB per version) was being downloaded by every matrix job through BcContainerHelper.Download-Artifacts. The nl2al dataset currently pins all 128 entries to BC 28.0, so 128 jobs were paying the same download cost in parallel.

Now: prepare-workspace pre-downloads the artifact once (still calling BcContainerHelper.Download-Artifacts so any platform / version / country-code logic stays in BcContainerHelper), and actions/cache@v4 snapshots C:\bcartifacts.cache at end of job. Each matrix job restores that cache before calling scripts\Download-BCSymbols.ps1, which short-circuits because Download-Artifacts sees the payload locally.

Cache key recipe: SHA-256 of 'sorted-distinct-versions|country|v1', truncated to 16 hex chars. So:

* Dataset edits that don't change BC versions => same key, cache hit
* Adding a new BC version to any entry => new key, cache miss + refill
* The trailing 'v1' lets us force-bust the cache without touching data

Caveats:

* GitHub actions/cache repo limit is 10 GB. Each BC version's bcartifacts.cache is 2-3 GB after compression; we fit ~3 versions comfortably. The dataset would have to span >4 BC versions before evictions start to thrash.
* Matrix job uses actions/cache/restore@v4 (restore-only) to avoid 16 concurrent save attempts racing for the same key. Only the prepare job saves.

Expected savings: previously Download BC Application symbols averaged 2.4 min/job (p50 1.7 min, p95 6.7 min, 5.1 CPU-h total). Restore from cache should drop that to ~30s/job once warm.

Untested in CI. To verify, run workflow_dispatch twice and confirm:

1) First run: prepare-workspace downloads ~3 GB, cache saves.
2) Second run: prepare-workspace hits cache (no download), and matrix jobs' symbol-download step is fast (~30s vs ~2 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
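The cache key recipe above can be sketched in Python to show why dataset edits that keep the same BC versions hit the same key. This is an illustration only: the function name `bc_artifact_cache_key`, the comma used to join versions, and the example country code `w1` are assumptions, not details confirmed by the PR.

```python
import hashlib
from typing import Iterable


def bc_artifact_cache_key(versions: Iterable[str], country: str, salt: str = "v1") -> str:
    """Build a cache key from 'sorted-distinct-versions|country|v1'.

    Hypothetical helper: versions are de-duplicated and sorted so that
    entry order and repeated versions do not change the key. The join
    character (',') is an assumption; the commit message does not name one.
    """
    distinct = sorted(set(versions))
    recipe = f"{','.join(distinct)}|{country}|{salt}"
    # SHA-256 hex digest truncated to 16 hex characters, per the recipe.
    return hashlib.sha256(recipe.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # 128 entries all pinned to BC 28.0 collapse to one distinct version.
    print(bc_artifact_cache_key(["28.0"] * 128, "w1"))
```

Bumping `salt` from `v1` to another value changes every key, which matches the force-bust behavior described in the commit message.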

Self-review of commit f6a868f found the routing-quality-check and prod-order-release-notification rewrites told agents to subscribe to the auto-generated OnAfterValidate event for field Status on table 'Production Order'. That hook DOES NOT FIRE in BC's standard release flow. ProdOrderStatusManagement.Codeunit.al releases an order via direct field assignment:

ProductionOrder.Status := ProductionOrder.Status::Released;
ProductionOrder.Insert();

Direct field assignment never triggers OnValidate. Agents following the previous rubric would have written subscribers that never execute.

Replaced with the dedicated IntegrationEvents on codeunit 5407 'Prod. Order Status Management':

* OnAfterChangeStatusOnProdOrder: Fires after the status change with a NewStatus parameter; the subscriber must filter to NewStatus::Released.
* OnAfterTransferRelatedTablesToReleasedProdOrder: Fires only on the released transition; even more targeted.

Both events are present in the lint's symbol index and verified in baseapp_src/ at .../ProdOrderStatusManagement.Codeunit.al. Lint state unchanged at 2 (still the known table-event false positives).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…fixtures

Moved to dataset/nl2al_quarantine.jsonl (not loaded by `bcbench dataset list`, so excluded from CI matrix):

- nl2al__hard-feature-telemetry-uptake-funnel-1 (Recs Buddy module not in baseapp_src)
- nl2al__hard-permission-set-5-levels-suite-1 (Recs Buddy module not in baseapp_src)
- nl2al__hard-alsearch-item-search-sales-line-picker-1 (ALSearch DotNet types not in baseapp_src)

These 3 entries consistently fail every CI run because the BC modules they reference are not present in the fixture used by NL2AL evaluation. Quarantining them in a separate file (rather than deleting) preserves the rubric for future reactivation once the missing fixture packages are available.

Active dataset: 125 entries (was 128). Lint state unchanged (2 known false positives on table-level events).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
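As a rough illustration of the quarantine move described above, the following Python sketch partitions JSONL lines into an active set and a quarantined set. It is not the code used in the PR: the helper name `split_quarantined` and the `instance_id` field name are hypothetical, since the dataset schema is not shown here.

```python
import json
from typing import Iterable, List, Set, Tuple

# The three entry IDs listed in the commit message.
QUARANTINED_IDS: Set[str] = {
    "nl2al__hard-feature-telemetry-uptake-funnel-1",
    "nl2al__hard-permission-set-5-levels-suite-1",
    "nl2al__hard-alsearch-item-search-sales-line-picker-1",
}


def split_quarantined(
    lines: Iterable[str],
    quarantined_ids: Set[str] = QUARANTINED_IDS,
    id_field: str = "instance_id",  # assumed field name, not confirmed by the PR
) -> Tuple[List[str], List[str]]:
    """Return (active_lines, quarantined_lines), preserving the input order.

    Each non-blank line must be a JSON object containing `id_field`.
    Lines are returned unmodified so rubrics survive the move byte-for-byte.
    """
    active: List[str] = []
    quarantined: List[str] = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between records
        entry = json.loads(line)
        target = quarantined if entry[id_field] in quarantined_ids else active
        target.append(line)
    return active, quarantined
```

Writing the second list to a separate file rather than discarding it mirrors the stated goal of keeping the rubrics available for later reactivation.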

haoranpb and others added 30 commits April 8, 2026 13:57

few more updates for new categories

db388bd

init nl2al category

da462df

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

d689b79

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

5f9b57a

[Enhancement] Add repository reset functionality and improve NL2ALResult structure

aa17c7e

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

3476367

fix pre-commit errors

7f16155

add more comments

316a3b1

revert accidental commit to notebook

72887c6

add force remove readonly function and update repo reset logic

6e43520

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

a2e86f0

try bcal agent POC

94912c4

simplify code by focusing on one category for nl2al

4592e6d

get application.app for symbol reference

1609c4a

Refactor Download-ApplicationApp.ps1 and nl2al.py to improve artifact handling and logging

43fcf7c

Add bcal cli support

8b1e267

fix copying *.app files, we need all of them

63ea301

update evaluate command for nl2al

5aaf537

getting ready for local usage

2301aef

sample checklist

be119c6

Add pipeline support for bcal cli

ec56f27

Add dataset-path to pipeline

6ad94d6

skip setup bc container for nl2al

0ca6003

For nl2al category, remove bc container

83448e5

manual install of bccontainerhelper

9d89725

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

be8b005

fix merge mistake and format workflow

307d84c

try what copilot suggested

c5fa305

fix merge mistake

92622b7

Add first three nl2al datasets

41225fc

martinsrui-msft and others added 24 commits June 9, 2026 14:30

Merge branch 'category/nl2al' of https://github.com/microsoft/BC-Bench into category/nl2al

01078e3

fix build

98dcf40

speed up the github action

6397729

fix build

ab346ea

fix build

bffb71b

tweak dataset

aaef582

improve dataset

2c259b0

add persona data entries

9c67803

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/nl2al

f4a7e82

Simplify and refactor the NL2AL related logic and getting ready to merge (#668)

759e92f

Remove red team & harms testing from nl2al category

fc9b3e5

Red team work moved to the dedicated category/nl2al-red-team branch so this branch stays focused on introducing the nl2al category.

more clean up and ready to merge

ba4940e

cleanup tests

16d0089

fix special category for nl2al in test

8f53f27

cron job run every Sunday

f92972a

update page context for entire dataset and add gpt-4o model in dropdown

0c29c50

haoranpb marked this pull request as ready for review June 19, 2026 12:59

haoranpb enabled auto-merge (squash) June 19, 2026 12:59

scrub

013e353

martinsrui-msft approved these changes Jun 19, 2026

haoranpb merged commit 5139cce into main Jun 19, 2026
13 checks passed

haoranpb deleted the category/nl2al branch June 19, 2026 13:44

haoranpb merged 86 commits into main from category/nl2al
2 participants