Introducing BCal evaluation #669
Merged
… handling and logging
…into category/nl2al
Fixes hallucinated AL symbols and page-id mismatches verified against baseapp_src/ (BC 28.0):
* block-invoice-posting-without-email-1: OnBeforeCheckSalesDoc -> OnBeforeCheckSalesDocument (codeunit Sales-Post)
* warehouse-activity-released-notification-1: OnAfterReleaseWhseShipment -> OnAfterReleaseWarehouseShipment
* login-audit-subscriber-1: OnAfterUserLoggedIn -> OnAfterLogin (codeunit System Initialization)
* rc-headlines-rotating-daily-1: Page 9026 -> 9022 (Business Manager Role Center)
* rc-profile-and-profile-extension-1: Page 9026 -> 9022 (Business Manager Role Center)

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 18 -> 13.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Install bcal CLI from internal feed' step calls aka.ms/install-artifacts-credprovider.ps1, which in turn hits the GitHub release API to locate the credprovider binary. That API returns 403 (rate-limited) intermittently on cold runs. In run 27207573872 this flake killed one matrix job out of 130.

Wrap the install in a 3-attempt loop with exponential backoff (1s, 4s, 16s) so a transient 403 no longer fails the entire entry. The failure mode is visible and retried up to three times before throwing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
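For illustration only, the retry pattern described in the commit message above can be sketched in Python. The workflow step itself is not shown in this conversation, so the function name, parameters, and the injectable `sleep` are hypothetical. The sketch assumes three attempts in total and sleeps only between attempts (1s, then 4s); how the 16s value is used by the real step is not stated in the commit message.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 3,
    base_seconds: float = 1.0,
    factor: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            # Keep every failure visible in the job log.
            print(f"attempt {attempt}/{max_attempts} failed: {error}")
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff between attempts: 1s, 4s, ...
            sleep(base_seconds * factor ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would wrap the install command in `action`.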
For each entry, the bad event name was replaced with a verified real
publisher from baseapp_src/ (BC 28.0). Replacements are surgical (one
text substitution per entry, no semantic rewrite).
* item-image-format-validation-subscriber-1
drop OnAfterUploadFile alt; OnBeforeUploadFile is the real publisher
* attach-quote-pdf-to-order-1
OnAfterMakeSalesOrder on 'Make Order'
-> OnAfterSalesQuoteToOrderRun on 'Sales-Quote to Order (Yes/No)'
or OnAfterInsertAllSalesOrderLines on 'Sales-Quote to Order'
* quality-inspection-required-line-1
OnAfterInsertPurchRcptLine -> OnAfterPurchRcptLineInsert ('Purch.-Post')
* bin-priority-pageext-1
OnBeforePickLines -> OnBeforePickAccordingToFEFO on 'Create Pick'
* lead-source-enum-shared-1
OnAfterContactToCustomer -> OnBeforeCreateCustomerFromTemplate on Contact
* hard-job-wip-recognition-method-1
OnAfterCalculateWIP -> OnAfterCalcWIP / OnBeforeCalcRecognizedCosts /
OnBeforeCalcRecognizedSales on 'Job Calculate WIP'
* hard-deferral-template-on-gl-account-1
OnAfterAssignFieldsForNo is on TABLE 'Sales Line', not codeunit
'Sales Line - Reserve'; fix the location reference
* hard-document-attachment-offload-blob-1
OnAfterImportFromStream on codeunit -> OnBeforeImportFromStream on
table 'Document Attachment'
* hard-bank-recon-cheque-and-tolerance-match-1
OnAfterApplyAutomaticMatches / OnAfterMatchManually
-> OnAfterMatchBankPayments on 'Match Bank Pmt. Appl.'
and OnAfterApplyEntries on 'Bank Acc. Entry Set Recon.-No.'
Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 13 -> 6.
Of the remaining 6, 4 are destined for C.2 (no real publisher) and 2 are
lint false positives (real table-level events not in the codeunit-only
symbol index: OnAfterAssignFieldsForNo, OnBeforeImportFromStream).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ents
Unlike C.1 (surgical event-name swaps), these 4 entries asserted events
that simply do not exist in BC. The rubric had to be reshaped to the
canonical pattern instead.
* routing-quality-check-flag-1
prod-order-release-notification-1
OnAfterReleaseProductionOrder does not exist. Releasing IS setting
Status::Released. Rewrote both rubrics to require subscribing to the
auto-generated OnAfterValidateEvent for field Status on table
'Production Order', guarding on xRec.Status <> Rec.Status and
Rec.Status = Status::Released.
* hard-dimension-priority-and-mandatory-1
OnAfterCheckMandatoryDimensions does not exist on codeunit 408. The
real publisher that fires after the standard posting-dim check is
OnAfterCheckDimValuePosting. Rewrote both nl_prompt and the relevant
expected[] item.
* hard-feature-management-flag-with-cohort-telemetry-1
OnGetFeatureKeyList and OnAfterFeatureKeyList do not exist. The real
publisher on the System app 'Feature Management Facade' codeunit is
OnGetFeatureKey. Dropped the hallucinated alternations.
Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 6 -> 2.
The remaining 2 are known lint false positives (real table-level events
OnAfterAssignFieldsForNo and OnBeforeImportFromStream, not in the
codeunit-only symbol_28_0.json index but verified in baseapp_src/).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
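To make the two remaining false positives concrete, here is a minimal Python sketch of a lint check that only knows codeunit publishers. It is not the lint_v2 implementation: the function name, the shape of the index (codeunit name mapped to publisher names), and the allowlist parameter are all assumptions made for illustration. The event names and their owning objects are taken from the commit messages above.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def unknown_events(
    referenced: Iterable[str],
    codeunit_events: Dict[str, Set[str]],
    table_event_allowlist: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return referenced events found neither in the index nor the allowlist."""
    known = set().union(*codeunit_events.values())
    return sorted(
        name
        for name in set(referenced)
        if name not in known and name not in table_event_allowlist
    )


# Toy codeunit-only index (hypothetical shape): codeunit -> publisher names.
index = {
    "Sales-Post": {"OnBeforeCheckSalesDocument"},
    "Prod. Order Status Management": {"OnAfterChangeStatusOnProdOrder"},
}
rubric_events = [
    "OnBeforeCheckSalesDocument",
    "OnAfterAssignFieldsForNo",  # real, but published by table 'Sales Line'
    "OnBeforeImportFromStream",  # real, but published by table 'Document Attachment'
]

# Both table-level events are flagged because the index has no table entries.
flagged = unknown_events(rubric_events, index)
```

One possible way to silence such findings, again only a suggestion, is to pass the verified table-level events as `table_event_allowlist`; extending the index to cover tables would address the cause rather than the symptom.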
Common setup steps that were running 16x in parallel on every matrix job are now run exactly once in a new 'prepare-workspace' job, uploaded as an artifact, and restored in each matrix job. Saves ~3 CPU-h per attempt (measured ~1.5 min/job setup overhead * 128 entries).

Hoisted:
* credprovider plugin install (with the 3-attempt retry from B.3)
* bcal CLI install (now via --tool-path to a portable folder, restored into \C:\Users\martinsrui/.dotnet/tools in the matrix job)
* bc-eval CAPI bridge venv (only when llm-backend=external-command; uv-created venv is relocatable as long as uv python install 3.12 has been run in the restoring job)
* BcContainerHelper module (Save-Module to portable folder, then prepended to PSModulePath via GITHUB_ENV in the matrix job)

NOT hoisted (per-job by design):
* Checkout / setup-python-uv (matrix job needs bcbench installed for 'uv run bcbench evaluate bcal'; setup-uv action's cache makes this cheap anyway)
* Azure Login OIDC for Key Vault (token lifetime is per-job; azure-openai backend doesn't need it at all)
* Download BC Application symbols (per-entry; A.2 will cache the bcartifacts payload separately)
* Run BCal (the actual work)

Behavior changes worth noting:
* The matrix job no longer logs in to Azure for the ADO feed -- it doesn't need an ADO token anymore (nothing pulls from the feed after restore). The CAPI Key Vault login (still per-job, still gated on external-command) is the only Azure Login left.
* If llm-backend=external-command the prepare job builds the venv; for azure-openai it's skipped. The matrix-job restore step also skips moving .bcal-capi-venv into place if it isn't in the artifact.

Untested in CI. To verify, run workflow_dispatch with test-run=true and each llm-backend value, confirm prepare-workspace runs once and matrix jobs no longer execute Install-Module / dotnet tool install / credprovider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The BC sandbox artifact (~3 GB per version) was being downloaded by every matrix job through BcContainerHelper.Download-Artifacts. The nl2al dataset currently pins all 128 entries to BC 28.0, so 128 jobs were paying the same download cost in parallel.

Now: prepare-workspace pre-downloads the artifact once (still calling BcContainerHelper.Download-Artifacts so any platform / version / country-code logic stays in BcContainerHelper), and actions/cache@v4 snapshots C:\bcartifacts.cache at end of job. Each matrix job restores that cache before calling scripts\Download-BCSymbols.ps1, which short-circuits because Download-Artifacts sees the payload locally.

Cache key recipe: SHA-256 of 'sorted-distinct-versions|country|v1', truncated to 16 hex chars. So:
* Dataset edits that don't change BC versions => same key, cache hit
* Adding a new BC version to any entry => new key, cache miss + refill
* The trailing 'v1' lets us force-bust the cache without touching data

Caveats:
* GitHub actions/cache repo limit is 10 GB. Each BC version's bcartifacts.cache is 2-3 GB after compression; we fit ~3 versions comfortably. The dataset would have to span >4 BC versions before evictions start to thrash.
* Matrix job uses actions/cache/restore@v4 (restore-only) to avoid 16 concurrent save attempts racing for the same key. Only the prepare job saves.

Expected savings: previously Download BC Application symbols averaged 2.4 min/job (p50 1.7 min, p95 6.7 min, 5.1 CPU-h total). Restore from cache should drop that to ~30s/job once warm.

Untested in CI. To verify, run workflow_dispatch twice and confirm:
1) First run: prepare-workspace downloads ~3 GB, cache saves.
2) Second run: prepare-workspace hits cache (no download), and matrix jobs' symbol-download step is fast (~30s vs ~2 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
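The cache key recipe above can be sketched in Python as follows. This is an illustration under assumptions, not the workflow's actual code: the function name, the comma used to join versions, plain lexical sorting, and the example country code are all hypothetical. Only the overall shape (SHA-256 of 'sorted-distinct-versions|country|v1', first 16 hex characters) comes from the commit message.

```python
import hashlib
from typing import Iterable


def bc_artifact_cache_key(
    versions: Iterable[str], country: str, salt: str = "v1"
) -> str:
    """Derive a short, stable cache key from the BC versions in the dataset."""
    # Deduplicate and sort so entry order and repeats do not affect the key.
    distinct = sorted(set(versions))
    # Assumed layout: versions joined by commas, then '|country|salt'.
    recipe = f"{','.join(distinct)}|{country}|{salt}"
    digest = hashlib.sha256(recipe.encode("utf-8")).hexdigest()
    return digest[:16]  # truncated to 16 hex characters


# 128 entries pinned to the same version collapse to a single key.
key_all_28 = bc_artifact_cache_key(["28.0"] * 128, country="w1")
# Adding a second version anywhere in the dataset yields a different key.
key_two_versions = bc_artifact_cache_key(["28.0", "27.0", "28.0"], country="w1")
```

Changing `salt` (for example to a hypothetical "v2") produces a new key for the same data, which is the force-bust behavior the trailing 'v1' is meant to provide.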
Self-review of commit f6a868f found the routing-quality-check and prod-order-release-notification rewrites told agents to subscribe to the auto-generated OnAfterValidate event for field Status on table 'Production Order'. That hook DOES NOT FIRE in BC's standard release flow.

ProdOrderStatusManagement.Codeunit.al releases an order via direct field assignment:

    ProductionOrder.Status := ProductionOrder.Status::Released;
    ProductionOrder.Insert();

Direct field assignment never triggers OnValidate. Agents following the previous rubric would have written subscribers that never execute.

Replaced with the dedicated IntegrationEvents on codeunit 5407 'Prod. Order Status Management':
* OnAfterChangeStatusOnProdOrder
  Fires after the status change with a NewStatus parameter; the subscriber must filter to NewStatus::Released.
* OnAfterTransferRelatedTablesToReleasedProdOrder
  Fires only on the released transition; even more targeted.

Both events are present in the lint's symbol index and verified in baseapp_src/ at .../ProdOrderStatusManagement.Codeunit.al.

Lint state unchanged at 2 (still the known table-event false positives).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fixtures

Moved to dataset/nl2al_quarantine.jsonl (not loaded by `bcbench dataset list`, so excluded from CI matrix):
- nl2al__hard-feature-telemetry-uptake-funnel-1 (Recs Buddy module not in baseapp_src)
- nl2al__hard-permission-set-5-levels-suite-1 (Recs Buddy module not in baseapp_src)
- nl2al__hard-alsearch-item-search-sales-line-picker-1 (ALSearch DotNet types not in baseapp_src)

These 3 entries consistently fail every CI run because the BC modules they reference are not present in the fixture used by NL2AL evaluation. Quarantining them in a separate file (rather than deleting) preserves the rubric for future reactivation once the missing fixture packages are available.

Active dataset: 125 entries (was 128). Lint state unchanged (2 known false positives on table-level events).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
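As a rough sketch of the quarantine move described above, the following Python splits JSONL lines into an active set and a quarantined set. It is not the script used for this commit: the function name and the `instance_id` field are hypothetical, since the dataset schema is not shown in this conversation. Lines are carried over verbatim so the rubric text is preserved for later reactivation.

```python
import json
from typing import Iterable, List, Set, Tuple


def split_quarantined(
    lines: Iterable[str],
    quarantined_ids: Set[str],
    id_field: str = "instance_id",  # hypothetical field name
) -> Tuple[List[str], List[str]]:
    """Partition JSONL lines into (active, quarantined) without rewriting them."""
    active: List[str] = []
    quarantined: List[str] = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        entry = json.loads(line)
        if entry.get(id_field) in quarantined_ids:
            quarantined.append(line)
        else:
            active.append(line)
    return active, quarantined


# Toy dataset: one entry to quarantine, one to keep.
sample_lines = [
    json.dumps({"instance_id": "nl2al__hard-feature-telemetry-uptake-funnel-1"}),
    json.dumps({"instance_id": "nl2al__login-audit-subscriber-1"}),
]
active_lines, quarantined_lines = split_quarantined(
    sample_lines, {"nl2al__hard-feature-telemetry-uptake-funnel-1"}
)
```

Writing `active_lines` back to the main dataset file and `quarantined_lines` to the quarantine file would complete the move; the two counts should add up to the original entry count.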
Red team work moved to the dedicated category/nl2al-red-team branch so this branch stays focused on introducing the nl2al category.
martinsrui-msft
approved these changes
Jun 19, 2026
BCal evaluation is a special case: it leverages BC-Bench for offline evaluation.
Pending: