Introducing BCal evaluation #669

Merged
haoranpb merged 86 commits into main from category/nl2al
Jun 19, 2026

@haoranpb haoranpb commented Jun 18, 2026

Collaborator

BCal evaluation is a bit of a special case: it leverages BC-Bench for offline evaluation.

Pending:

  • Clean up the dataset
  • Set up the evaluation as a cron job (maybe weekly)
  • Investigate which BCal version is used (do we always get the latest one?)

haoranpb and others added 30 commits April 8, 2026 13:57
martinsrui-msft and others added 24 commits June 9, 2026 14:30
Fixes hallucinated AL symbols and page-id mismatches verified against
baseapp_src/ (BC 28.0):

* block-invoice-posting-without-email-1:
  OnBeforeCheckSalesDoc -> OnBeforeCheckSalesDocument (codeunit Sales-Post)
* warehouse-activity-released-notification-1:
  OnAfterReleaseWhseShipment -> OnAfterReleaseWarehouseShipment
* login-audit-subscriber-1:
  OnAfterUserLoggedIn -> OnAfterLogin (codeunit System Initialization)
* rc-headlines-rotating-daily-1:
  Page 9026 -> 9022 (Business Manager Role Center)
* rc-profile-and-profile-extension-1:
  Page 9026 -> 9022 (Business Manager Role Center)

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 18 -> 13.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Install bcal CLI from internal feed' step calls
aka.ms/install-artifacts-credprovider.ps1, which in turn hits the GitHub
release API to locate the credprovider binary. That API returns 403
(rate-limited) intermittently on cold runs.

In run 27207573872 this flake killed one matrix job out of 130. Wrap the
install in a 3-attempt loop with exponential backoff (1s, 4s, 16s) so a
transient 403 no longer fails the entire entry. Each failure stays
visible in the log, and the install is attempted up to three times
before throwing.
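
A minimal Python sketch of the retry shape described above, for illustration only (the workflow step itself is not shown here and is not Python). The function name, the parameters, and the choice to sleep only between attempts are assumptions: the message lists three delays but does not say whether the third is slept before throwing, so with three attempts this sketch uses only the 1s and 4s waits.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            # Keep every failure visible in the job log.
            print(f"attempt {attempt}/{attempts} failed: {error}")
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * factor^(attempt - 1): 1s, 4s, ...
            sleep(base_delay * factor ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.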

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For each entry, the bad event name was replaced with a verified real
publisher from baseapp_src/ (BC 28.0). Replacements are surgical (one
text substitution per entry, no semantic rewrite).

* item-image-format-validation-subscriber-1
    drop OnAfterUploadFile alt; OnBeforeUploadFile is the real publisher
* attach-quote-pdf-to-order-1
    OnAfterMakeSalesOrder on 'Make Order'
    -> OnAfterSalesQuoteToOrderRun on 'Sales-Quote to Order (Yes/No)'
       or OnAfterInsertAllSalesOrderLines on 'Sales-Quote to Order'
* quality-inspection-required-line-1
    OnAfterInsertPurchRcptLine -> OnAfterPurchRcptLineInsert ('Purch.-Post')
* bin-priority-pageext-1
    OnBeforePickLines -> OnBeforePickAccordingToFEFO on 'Create Pick'
* lead-source-enum-shared-1
    OnAfterContactToCustomer -> OnBeforeCreateCustomerFromTemplate on Contact
* hard-job-wip-recognition-method-1
    OnAfterCalculateWIP -> OnAfterCalcWIP / OnBeforeCalcRecognizedCosts /
    OnBeforeCalcRecognizedSales on 'Job Calculate WIP'
* hard-deferral-template-on-gl-account-1
    OnAfterAssignFieldsForNo is on TABLE 'Sales Line', not codeunit
    'Sales Line - Reserve'; fix the location reference
* hard-document-attachment-offload-blob-1
    OnAfterImportFromStream on codeunit -> OnBeforeImportFromStream on
    table 'Document Attachment'
* hard-bank-recon-cheque-and-tolerance-match-1
    OnAfterApplyAutomaticMatches / OnAfterMatchManually
    -> OnAfterMatchBankPayments on 'Match Bank Pmt. Appl.'
       and OnAfterApplyEntries on 'Bank Acc. Entry Set Recon.-No.'

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 13 -> 6.
Of the remaining 6, 4 are destined for C.2 (no real publisher) and 2 are
lint false positives (real table-level events not in the codeunit-only
symbol index: OnAfterAssignFieldsForNo, OnBeforeImportFromStream).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ents

Unlike C.1 (surgical event-name swaps), these 4 entries asserted events
that simply do not exist in BC. The rubric had to be reshaped to the
canonical pattern instead.

* routing-quality-check-flag-1
  prod-order-release-notification-1
    OnAfterReleaseProductionOrder does not exist. Releasing IS setting
    Status::Released. Rewrote both rubrics to require subscribing to the
    auto-generated OnAfterValidateEvent for field Status on table
    'Production Order', guarding on xRec.Status <> Rec.Status and
    Rec.Status = Status::Released.

* hard-dimension-priority-and-mandatory-1
    OnAfterCheckMandatoryDimensions does not exist on codeunit 408. The
    real publisher that fires after the standard posting-dim check is
    OnAfterCheckDimValuePosting. Rewrote both nl_prompt and the relevant
    expected[] item.

* hard-feature-management-flag-with-cohort-telemetry-1
    OnGetFeatureKeyList and OnAfterFeatureKeyList do not exist. The real
    publisher on the System app 'Feature Management Facade' codeunit is
    OnGetFeatureKey. Dropped the hallucinated alternations.

Verified: NL2ALEntry.load() parses 128 entries; lint_v2 findings 6 -> 2.
The remaining 2 are known lint false positives (real table-level events
OnAfterAssignFieldsForNo and OnBeforeImportFromStream, not in the
codeunit-only symbol_28_0.json index but verified in baseapp_src/).
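
To illustrate why table-level events surface as findings, here is a toy Python sketch of a lookup against a codeunit-only index. The data shapes, the `lint` function, and the tiny index are hypothetical; the real lint_v2 and symbol_28_0.json layouts are not shown in this PR.

```python
from typing import Dict, List, Set, Tuple

Reference = Tuple[str, str]  # (publishing object name, event name)

# Hypothetical index that only knows events published by codeunits.
CODEUNIT_INDEX: Dict[str, Set[str]] = {
    "Feature Management Facade": {"OnGetFeatureKey"},
    "Prod. Order Status Management": {"OnAfterChangeStatusOnProdOrder"},
}


def lint(references: List[Reference], index: Dict[str, Set[str]]) -> List[Reference]:
    """Return every reference whose event is missing from the index."""
    return [(obj, event) for obj, event in references
            if event not in index.get(obj, set())]


references = [
    ("Feature Management Facade", "OnGetFeatureKey"),       # real, indexed
    ("Feature Management Facade", "OnGetFeatureKeyList"),   # hallucinated
    ("Sales Line", "OnAfterAssignFieldsForNo"),             # real, on a table
    ("Document Attachment", "OnBeforeImportFromStream"),    # real, on a table
]
findings = lint(references, CODEUNIT_INDEX)
```

In this sketch the two table-level events are flagged alongside the hallucinated one, because tables never appear in the index. That mirrors the two false positives noted above.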

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Common setup steps that were running 16x in parallel on every matrix job
are now run exactly once in a new 'prepare-workspace' job, uploaded as
an artifact, and restored in each matrix job. Saves ~3 CPU-h per attempt
(measured ~1.5 min/job setup overhead * 128 entries).

Hoisted:
* credprovider plugin install (with the 3-attempt retry from B.3)
* bcal CLI install (now via --tool-path to a portable folder, restored
  into $HOME/.dotnet/tools in the matrix job)
* bc-eval CAPI bridge venv (only when llm-backend=external-command;
  uv-created venv is relocatable as long as uv python install 3.12 has
  been run in the restoring job)
* BcContainerHelper module (Save-Module to portable folder, then
  prepended to PSModulePath via GITHUB_ENV in the matrix job)

NOT hoisted (per-job by design):
* Checkout / setup-python-uv (matrix job needs bcbench installed for
  'uv run bcbench evaluate bcal'; setup-uv action's cache makes this
  cheap anyway)
* Azure Login OIDC for Key Vault (token lifetime is per-job;
  azure-openai backend doesn't need it at all)
* Download BC Application symbols (per-entry; A.2 will cache the
  bcartifacts payload separately)
* Run BCal (the actual work)

Behavior changes worth noting:
* The matrix job no longer logs in to Azure for the ADO feed -- it
  doesn't need an ADO token anymore (nothing pulls from the feed after
  restore). The CAPI Key Vault login (still per-job, still gated on
  external-command) is the only Azure Login left.
* If llm-backend=external-command the prepare job builds the venv;
  for azure-openai it's skipped. The matrix-job restore step also
  skips moving .bcal-capi-venv into place if it isn't in the artifact.

Untested in CI. To verify, run workflow_dispatch with test-run=true and
each llm-backend value, confirm prepare-workspace runs once and matrix
jobs no longer execute Install-Module / dotnet tool install / credprovider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The BC sandbox artifact (~3 GB per version) was being downloaded by
every matrix job through BcContainerHelper.Download-Artifacts. The
nl2al dataset currently pins all 128 entries to BC 28.0, so 128 jobs
were paying the same download cost in parallel.

Now: prepare-workspace pre-downloads the artifact once (still calling
BcContainerHelper.Download-Artifacts so any platform / version /
country-code logic stays in BcContainerHelper), and actions/cache@v4
snapshots C:\bcartifacts.cache at end of job. Each matrix job restores
that cache before calling scripts\Download-BCSymbols.ps1, which
short-circuits because Download-Artifacts sees the payload locally.

Cache key recipe: SHA-256 of 'sorted-distinct-versions|country|v1',
truncated to 16 hex chars. So:
* Dataset edits that don't change BC versions => same key, cache hit
* Adding a new BC version to any entry => new key, cache miss + refill
* The trailing 'v1' lets us force-bust the cache without touching data
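
The key recipe can be sketched in Python as follows. This is an illustration, not the workflow's actual implementation: the function name, the comma used to join versions, the lexical sort, and the `w1` country value are assumptions.

```python
import hashlib
from typing import Iterable


def cache_key(versions: Iterable[str], country: str, salt: str = "v1") -> str:
    """Hash 'sorted-distinct-versions|country|salt' and keep 16 hex chars."""
    distinct = sorted(set(versions))  # duplicates and ordering don't matter
    recipe = "|".join([",".join(distinct), country, salt])
    return hashlib.sha256(recipe.encode("utf-8")).hexdigest()[:16]


# 128 entries all pinned to the same version collapse to a single-version key.
key_today = cache_key(["28.0"] * 128, "w1")
# Adding a second version anywhere in the dataset yields a different key.
key_after_new_version = cache_key(["28.0"] * 127 + ["29.0"], "w1")
# Bumping the salt forces a fresh cache without touching the dataset.
key_busted = cache_key(["28.0"] * 128, "w1", salt="v2")
```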

Caveats:
* GitHub actions/cache repo limit is 10 GB. Each BC version's
  bcartifacts.cache is 2-3 GB after compression; we fit ~3 versions
  comfortably. The dataset would have to span >4 BC versions before
  evictions start to thrash.
* Matrix job uses actions/cache/restore@v4 (restore-only) to avoid
  16 concurrent save attempts racing for the same key. Only the
  prepare job saves.

Expected savings: previously Download BC Application symbols averaged
2.4 min/job (p50 1.7 min, p95 6.7 min, 5.1 CPU-h total). Restore from
cache should drop that to ~30s/job once warm.
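
As a quick check of the figures above, a small Python calculation. The 0.5 min warm-restore time is the projection from this message, not a measurement, so the saving it yields is an estimate only.

```python
ENTRIES = 128  # matrix jobs, one per dataset entry at the time of this commit


def cpu_hours(minutes_per_job: float, jobs: int = ENTRIES) -> float:
    """Total CPU-hours spent when every job pays `minutes_per_job`."""
    return minutes_per_job * jobs / 60


before = cpu_hours(2.4)  # measured average per job
after = cpu_hours(0.5)   # projected warm-cache restore (~30s)
saved = before - after
print(f"{before:.1f} CPU-h before, {after:.1f} after, {saved:.1f} saved")
```

The 2.4 min average reproduces the 5.1 CPU-h total quoted above. If the ~30s projection holds, that is roughly 4 CPU-h saved per run.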

Untested in CI. To verify, run workflow_dispatch twice and confirm:
1) First run: prepare-workspace downloads ~3 GB, cache saves.
2) Second run: prepare-workspace hits cache (no download), and matrix
   jobs' symbol-download step is fast (~30s vs ~2 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-review of commit f6a868f found the routing-quality-check and
prod-order-release-notification rewrites told agents to subscribe to
the auto-generated OnAfterValidateEvent for field Status on table
'Production Order'.

That hook DOES NOT FIRE in BC's standard release flow.
ProdOrderStatusManagement.Codeunit.al releases an order via direct
field assignment:

    ProductionOrder.Status := ProductionOrder.Status::Released;
    ProductionOrder.Insert();

Direct field assignment never triggers OnValidate. Agents following
the previous rubric would have written subscribers that never execute.
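
The difference can be modelled with a toy Python class (Python rather than AL, purely as an illustration). `Record`, `assign`, and `validate` are hypothetical names and do not mirror the BC runtime; they only show that a validate-style hook runs on the validate path and not on plain assignment.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, str], None]  # receives (old value, new value)


class Record:
    """Toy stand-in for an AL record with per-field validate subscribers."""

    def __init__(self, **fields: str) -> None:
        self.fields: Dict[str, str] = dict(fields)
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe_after_validate(self, field: str, handler: Handler) -> None:
        self._subscribers.setdefault(field, []).append(handler)

    def assign(self, field: str, value: str) -> None:
        """Like `Rec.Field := Value;`: stores the value, notifies nobody."""
        self.fields[field] = value

    def validate(self, field: str, value: str) -> None:
        """Like `Rec.Validate(Field, Value);`: stores, then notifies."""
        old = self.fields.get(field, "")
        self.fields[field] = value
        for handler in self._subscribers.get(field, []):
            handler(old, value)


notified: List[Tuple[str, str]] = []
order = Record(Status="Firm Planned")
order.subscribe_after_validate("Status", lambda old, new: notified.append((old, new)))

order.assign("Status", "Released")    # release-flow style: subscriber is skipped
seen_after_assign = list(notified)
order.validate("Status", "Finished")  # validate path: subscriber runs
```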

Replaced with the dedicated IntegrationEvents on codeunit 5407
'Prod. Order Status Management':

* OnAfterChangeStatusOnProdOrder
    Fires after the status change with a NewStatus parameter; the
    subscriber must filter to NewStatus::Released.
* OnAfterTransferRelatedTablesToReleasedProdOrder
    Fires only on the released transition; even more targeted.

Both events are present in the lint's symbol index and verified
in baseapp_src/ at .../ProdOrderStatusManagement.Codeunit.al.

Lint state unchanged at 2 (still the known table-event false
positives).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fixtures

Moved to dataset/nl2al_quarantine.jsonl (not loaded by `bcbench dataset list`,
so excluded from CI matrix):

- nl2al__hard-feature-telemetry-uptake-funnel-1   (Recs Buddy module not in baseapp_src)
- nl2al__hard-permission-set-5-levels-suite-1     (Recs Buddy module not in baseapp_src)
- nl2al__hard-alsearch-item-search-sales-line-picker-1  (ALSearch DotNet types not in baseapp_src)

These 3 entries consistently fail every CI run because the BC modules they
reference are not present in the fixture used by NL2AL evaluation. Quarantining
them in a separate file (rather than deleting) preserves the rubric for future
reactivation once the missing fixture packages are available.

Active dataset: 125 entries (was 128). Lint state unchanged (2 known false
positives on table-level events).
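
For reference, a minimal Python sketch of moving entries between two JSONL files. The `instance_id` field name, the `nl2al.jsonl` file name, and the helper itself are assumptions for illustration; the actual move in this commit may have been done differently.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Set


def quarantine_entries(active: Path, quarantine: Path, ids: Set[str],
                       id_field: str = "instance_id") -> int:
    """Move entries whose id is in `ids` from `active` to `quarantine`."""
    kept: List[str] = []
    moved: List[str] = []
    for line in active.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        target = moved if json.loads(line)[id_field] in ids else kept
        target.append(line)
    # Append so previously quarantined entries are preserved.
    with quarantine.open("a", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in moved)
    active.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return len(moved)


# Demo on throwaway files with made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    active_path = Path(tmp) / "nl2al.jsonl"
    quarantine_path = Path(tmp) / "nl2al_quarantine.jsonl"
    active_path.write_text(
        "".join(json.dumps({"instance_id": f"nl2al__demo-{i}"}) + "\n"
                for i in range(4)),
        encoding="utf-8",
    )
    moved_count = quarantine_entries(active_path, quarantine_path, {"nl2al__demo-1"})
    active_count = len(active_path.read_text(encoding="utf-8").splitlines())
    quarantined_count = len(quarantine_path.read_text(encoding="utf-8").splitlines())
```

Keeping the raw lines intact means a quarantined rubric can later be moved back without any change.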

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Red team work moved to the dedicated category/nl2al-red-team branch so this branch stays focused on introducing the nl2al category.
@haoranpb haoranpb marked this pull request as ready for review June 19, 2026 12:59
@haoranpb haoranpb enabled auto-merge (squash) June 19, 2026 12:59
@haoranpb haoranpb merged commit 5139cce into main Jun 19, 2026
13 checks passed
@haoranpb haoranpb deleted the category/nl2al branch June 19, 2026 13:44