2 changes: 2 additions & 0 deletions .gitattributes
@@ -6,3 +6,5 @@
 # Force bash scripts to always use lf line endings so that if a repo is accessed
 # in Unix via a file share from Windows, the scripts will work.
 *.sh text eol=lf
+
+.github/workflows/*.lock.yml linguist-generated=true merge=ours
6 changes: 3 additions & 3 deletions .github/aw/actions-lock.json
@@ -5,10 +5,10 @@
   "version": "v8",
   "sha": "ed597411d8f924073f98dfc5c65a23a2325f34cd"
 },
-"github/gh-aw/actions/setup@v0.45.4": {
+"github/gh-aw/actions/setup@v0.58.0": {
   "repo": "github/gh-aw/actions/setup",
-  "version": "v0.45.4",
-  "sha": "ac090214a48a1938f7abafe132460b66752261af"
+  "version": "v0.58.0",
+  "sha": "cb7966564184443e601bd6135d5fbb534300070e"
 }
 }
 }
188 changes: 188 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,188 @@
---
description: "Guidance for GitHub Copilot when working on ML.NET (dotnet/machinelearning). Use for any task in this repo: code changes, test writing, PR reviews, issue investigation, build troubleshooting, or documentation."
---

# ML.NET Development Guide

## Repository Overview

ML.NET is a cross-platform, open-source machine learning framework for .NET. It provides APIs for training, evaluating, and deploying ML models across classification, regression, clustering, ranking, anomaly detection, time series, recommendation, and generative AI (LLaMA, Phi, Mistral via TorchSharp).

### Key Technologies

- .NET SDK 10.0.100 (see `global.json`)
- Build system: Microsoft Arcade SDK (`eng/common/`)
- Test framework: xUnit (with `AwesomeAssertions`, `Xunit.Combinatorial`)
- Native dependencies: MKL, OpenMP, libmf, oneDNN
- Major dependencies: TorchSharp, ONNX Runtime, TensorFlow, LightGBM, Semantic Kernel
- Central package management: `Directory.Packages.props`

## Build & Test

### Build

```bash
# Linux/macOS
./build.sh

# Windows
build.cmd

# Build specific project
dotnet build src/Microsoft.ML.Core/Microsoft.ML.Core.csproj
```

The repo uses the Arcade SDK. `build.sh`/`build.cmd` wrap `eng/common/build.sh`/`eng/common/build.ps1` with `--restore --build`. On Linux, install native build prerequisites first via `eng/common/native/install-dependencies.sh`.

### Test

```bash
# Run tests for a specific project
dotnet test test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj

# Run tests with filter
dotnet test test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj --filter "FullyQualifiedName~ClassName.MethodName"

# Run all tests (slow, prefer specific projects)
dotnet test Microsoft.ML.sln
```

Test projects multi-target `net8.0;net48;net9.0` on Windows, `net8.0` only on Linux/macOS/arm64.

### Format

```bash
dotnet format Microsoft.ML.sln --no-restore
```

The repo has `.editorconfig` and `EnforceCodeStyleInBuild=true`.

## Project Structure

```
src/
├── Microsoft.ML.Core/ # Core types, contracts, host environment
├── Microsoft.ML.Data/ # Data pipeline, DataView, schema
├── Microsoft.ML/ # MLContext, public API surface
├── Microsoft.ML.StandardTrainers/ # Built-in trainers (logistic regression, SVM, etc.)
├── Microsoft.ML.Transforms/ # Data transforms (normalize, featurize, etc.)
├── Microsoft.ML.AutoML/ # Automated ML pipeline selection
├── Microsoft.ML.FastTree/ # Tree-based trainers
├── Microsoft.ML.LightGbm/ # LightGBM integration
├── Microsoft.ML.Recommender/ # Matrix factorization recommenders
├── Microsoft.ML.TimeSeries/ # Time series analysis
├── Microsoft.ML.Tokenizers/ # BPE/WordPiece/SentencePiece tokenizers
├── Microsoft.ML.GenAI.Core/ # GenAI base types (CausalLM pipeline)
├── Microsoft.ML.GenAI.LLaMA/ # LLaMA model support
├── Microsoft.ML.GenAI.Phi/ # Phi model support
├── Microsoft.ML.GenAI.Mistral/ # Mistral model support
├── Microsoft.ML.TorchSharp/ # TorchSharp-based trainers
├── Microsoft.ML.OnnxTransformer/ # ONNX model inference
├── Microsoft.ML.TensorFlow/ # TensorFlow model inference
├── Microsoft.ML.Vision/ # Image classification
├── Microsoft.ML.ImageAnalytics/ # Image transforms
├── Microsoft.ML.CpuMath/ # SIMD-optimized math operations
├── Microsoft.Data.Analysis/ # DataFrame API
├── Native/ # C/C++ native library sources
└── Common/ # Shared internal code
test/
├── Microsoft.ML.TestFramework/ # Base test classes and helpers
├── Microsoft.ML.TestFrameworkCommon/ # Shared test utilities
├── Microsoft.ML.Tests/ # Main functional tests
├── Microsoft.ML.Core.Tests/ # Core unit tests
├── Microsoft.ML.IntegrationTests/ # End-to-end integration tests
├── Microsoft.ML.Tokenizers.Tests/ # Tokenizer tests
├── Microsoft.ML.GenAI.*.Tests/ # GenAI component tests
└── ... (30+ test projects)
```

## Conventions

### Code Style

Every `.cs` file starts with the 3-line .NET Foundation MIT license header. This is enforced across the codebase and must not be omitted.

Namespaces match assembly name (`Microsoft.ML`, `Microsoft.ML.Data`, `Microsoft.ML.Trainers`). Order usings as `System.*` first, then `Microsoft.*`, then others.

Use `[BestFriend]` attribute for internal members shared across assemblies. The repo has many assemblies that need to share types without making them public; `[BestFriend]` provides controlled cross-assembly visibility for this.
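A minimal sketch of the pattern (type and member names below are hypothetical, chosen only to illustrate where the attribute goes):

```csharp
// Hypothetical example: an internal helper made visible to "best friend"
// assemblies without widening the public API surface.
namespace Microsoft.ML.Data
{
    [BestFriend]
    internal static class ColumnNameHelper
    {
        // Individual members can carry the attribute when only part of a type is shared.
        [BestFriend]
        internal static string Sanitize(string rawName) => rawName.Trim();
    }
}
```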

Use `Contracts.Check*` / `Contracts.Except*` for argument and state validation rather than raw `throw` statements. This ensures consistent error messages and lets the ML.NET host environment intercept validation failures.
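A hedged sketch of the convention (the `Contracts` members named here exist in the repo, but verify exact overloads before relying on them; everything else is illustrative):

```csharp
// Illustrative method; names other than Contracts.* are hypothetical.
public void SetFeatureColumn(string name, int index)
{
    Contracts.CheckNonEmpty(name, nameof(name));
    Contracts.CheckParam(index >= 0, nameof(index), "Index must be non-negative");
    if (_schema == null)
        throw Contracts.Except("Schema has not been initialized"); // Except* returns the exception
    // ...
}
```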

XML docs with `<summary>` tags are required on all public types and members.

When editing an existing file, match its style even if it differs from general guidelines. Consistency within a file matters more than global uniformity.

Follow [dotnet/runtime coding-style](https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/coding-style.md).

### Test Conventions

Framework: xUnit (`[Fact]`, `[Theory]`, `[InlineData]`).

Inherit from `TestDataPipeBase` (for data pipeline tests) or `BaseTestClass` (for simpler tests). Both provide `ITestOutputHelper`, test data paths, and locale pinning to `en-US`.

```csharp
public class MyFeatureTests : TestDataPipeBase
{
public MyFeatureTests(ITestOutputHelper output) : base(output) { }

[Fact]
public void MyFeatureBasicTest()
{
// ...
}
}
```

Name test classes as `{Feature}Tests`, test methods as PascalCase descriptive names (e.g., `RandomizedPcaTrainerBaselineTest`). Do not use `Test_` prefixes or `_Should_` patterns.

Use `Assert.*` (xUnit) or `AwesomeAssertions` for fluent assertions. Do not use `Assert.That` (NUnit style).

Test data: use `Microsoft.ML.TestDatabases` package or files in `test/data/`, referenced via `GetDataPath("filename")` from the base class. Baseline output comparison uses files in `test/BaselineOutput/`. Update baselines carefully since they are the source of truth for output format stability.
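A sketch of the test-data convention (`GetDataPath` and the `ML` context come from the base class; the file name and column layout are illustrative, so substitute a file that actually exists in `test/data/`):

```csharp
// Resolve a data file from test/data/ and do a cheap schema sanity check.
[Fact]
public void IrisLoaderSmokeTest()
{
    string path = GetDataPath("iris.txt"); // illustrative file name
    IDataView data = ML.Data.LoadFromTextFile(path, new[]
    {
        new TextLoader.Column("Label", DataKind.Single, 0),
        new TextLoader.Column("Features", DataKind.Single, 1, 4),
    });
    Assert.True(data.Schema.Count > 0);
}
```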

Gotchas: the base class pins locale to `en-US` (don't override). `AllowUnsafeBlocks` is enabled in test projects for native interop testing. XML doc warnings (CS1573, CS1591, CS1712) are suppressed in test code.

### Architecture

`MLContext` is the main entry point, exposing catalogs for each ML task (classification, regression, etc.).

Data flows through `IDataView`, a lazy, columnar, cursor-based data pipeline. This design avoids loading entire datasets into memory, which matters for ML workloads.

Trainers implement the `IEstimator<T>` to `ITransformer` pattern: call `Fit()` to train, then `Transform()` to apply. New trainers go in their own project under `src/`. New test projects mirror source naming: `Microsoft.ML.Foo` to `Microsoft.ML.Foo.Tests`.
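The pattern above in a minimal sketch (the `HousingData` type, column names, and trainer choice are illustrative, not prescriptive):

```csharp
public class HousingData
{
    [LoadColumn(0)] public float Size;
    [LoadColumn(1)] public float Rooms;
    [LoadColumn(2)] public float Price;
}

// ...
var mlContext = new MLContext(seed: 0);
IDataView trainData = mlContext.Data.LoadFromTextFile<HousingData>(
    "housing.csv", separatorChar: ',', hasHeader: true);

var pipeline = mlContext.Transforms.Concatenate("Features", "Size", "Rooms")
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price"));

ITransformer model = pipeline.Fit(trainData);   // trains the estimator chain
IDataView scored = model.Transform(trainData);  // lazy; rows materialize on cursoring
```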

## Git Workflow

- Default branch: `main`
- Never commit directly to `main`, always create a feature branch
- Branch naming: `feature/description`, `fix/description`
- PRs are squash-merged
- Reference a filed issue in PR description
- Address review feedback in additional commits (don't amend/force-push)
- Use `git rebase` for conflict resolution, not merge commits

## CI

Primary CI: Azure DevOps Pipelines (`build/vsts-ci.yml`), the official signed build. Builds run on Windows, Linux (Ubuntu 22.04), and macOS, covering both managed (.NET) and native components. Code coverage uses `coverlet.collector`. A custom internal Roslyn analyzer (`Microsoft.ML.InternalCodeAnalyzer`) runs on all test projects.

## AI Infrastructure

### Workflows

GitHub Actions in `.github/workflows/`:

| Workflow | Trigger | Purpose |
|----------|---------|---------|
| `copilot-setup-steps.yml` | Manual | Remote Copilot Coding Agent build environment |
| `find-similar-issues.yml` | Issue opened | AI-powered duplicate detection for new issues |
| `inclusive-heat-sensor.yml` | Comments | Detect heated language in issue/PR comments |

### Prompts

Reusable prompt templates in `.github/prompts/`:

| Prompt | Purpose |
|--------|---------|
| `release-notes.prompt.md` | Generate classified release notes between commits |

### Issue Triage

For issue triage workflows (automated milestone assignment, priority labeling, investigation), use [GitHub Agentic Workflows](https://github.github.com/gh-aw/). Define triage automation as natural-language workflow files rather than custom scripts.
20 changes: 20 additions & 0 deletions .github/prompts/release-notes.prompt.md
@@ -0,0 +1,20 @@
# ML.NET Release Notes

Generate classified release notes between two commits.

## Categories

1. **Product** — Bug fixes, features, improvements
2. **Dependencies** — Package/SDK updates
3. **Testing** — Test changes and infrastructure
4. **Documentation** — Docs, samples
5. **Housekeeping** — Build, CI, cleanup

## Process

```bash
# Get commits between two points
git log --pretty=format:"%h - %s (%an)" BRANCH1..BRANCH2 > commits.txt
```

Classify each commit. When uncertain, default to Housekeeping. Group related commits. Flag breaking changes with ⚠️.
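As a first pass, classification can be roughed out with keyword greps before manual review (the patterns below are illustrative assumptions, not part of the prompt):

```shell
# Bucket commits.txt by keyword; anything unmatched goes to manual triage.
mkdir -p buckets
grep -iE 'bump|dependen|update .* sdk' commits.txt > buckets/dependencies.txt || true
grep -iE '\btests?\b|\btesting\b' commits.txt > buckets/testing.txt || true
grep -iE '\bdocs?\b|\bsamples?\b' commits.txt > buckets/documentation.txt || true
grep -ivE 'bump|dependen|\btests?\b|\bdocs?\b|\bsamples?\b' commits.txt > buckets/unclassified.txt || true
```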
92 changes: 92 additions & 0 deletions .github/workflows/find-similar-issues.yml
@@ -0,0 +1,92 @@
name: "Find Similar Issues with AI"

on:
issues:
types: [opened]

permissions:
contents: read
issues: write
models: read

jobs:
find-similar-issues:
runs-on: ubuntu-latest
if: github.event_name == 'issues'
steps:
- uses: actions/setup-node@v4
with:
node-version: '20'

- run: npm init -y && npm install @octokit/rest

- name: Find and post similar issues
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ISSUE_NUMBER: ${{ github.event.issue.number }}
ISSUE_TITLE: ${{ github.event.issue.title }}
ISSUE_BODY: ${{ github.event.issue.body }}
run: |
node << 'SCRIPT'
const { Octokit } = require("@octokit/rest");
const fs = require('fs');
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const endpoint = "https://models.inference.ai.azure.com";
const model = "gpt-4o-mini";
const token = process.env.GITHUB_TOKEN;
const issueNum = parseInt(process.env.ISSUE_NUMBER);
const title = process.env.ISSUE_TITLE;
const body = process.env.ISSUE_BODY || '';
const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');

function extractWords(text) {
const stop = new Set(['the','and','for','with','this','that','from','have','not','are','was','will','can','when','what','how','use','does','issue','error','work']);
return [...new Set(text.replace(/```[\s\S]*?```/g,'').replace(/https?:\/\/\S+/g,'').replace(/[^a-z0-9\s]/gi,' ').toLowerCase().split(/\s+/).filter(w=>w.length>3&&!stop.has(w)))];
}
function jaccard(a,b) { const i=a.filter(w=>b.includes(w)); const u=[...new Set([...a,...b])]; return u.length?i.length/u.length:0; }

(async()=>{
const issues=[];
for(let p=1;p<=10;p++){
const r=await octokit.issues.listForRepo({owner,repo,state:'all',per_page:100,page:p,sort:'updated',direction:'desc'});
if(!r.data.length)break;
issues.push(...r.data.filter(i=>i.number!==issueNum&&!i.pull_request));
}
const words=extractWords(`${title}\n${body}`);
const candidates=issues.map(i=>({issue:i,score:jaccard(words,extractWords(`${i.title}\n${i.body||''}`))}))
.filter(c=>c.score>0.1).sort((a,b)=>b.score-a.score).slice(0,30);

const results=[];
for(const{issue}of candidates){
try{
const r=await fetch(`${endpoint}/chat/completions`,{method:"POST",headers:{"Content-Type":"application/json","Authorization":`Bearer ${token}`},
body:JSON.stringify({model,temperature:0.3,max_tokens:150,messages:[
{role:"system",content:'Analyze GitHub issue similarity. Return JSON only: {"score":0.0,"reason":"brief"}'},
{role:"user",content:`Current:\nTitle: ${title}\nBody: ${body}\n\nCompare:\nTitle: ${issue.title}\nBody: ${issue.body||'None'}`}
]})});
const d=await r.json();
if(!d.choices?.[0])continue;
const parsed=JSON.parse(d.choices[0].message.content.trim().replace(/^```json?\s*/gm,'').replace(/```$/gm,''));
if(parsed.score>=0.6) results.push({number:issue.number,title:issue.title,state:issue.state,url:issue.html_url,score:parsed.score,reason:parsed.reason,labels:issue.labels.map(l=>l.name)});
await new Promise(r=>setTimeout(r,100));
}catch(e){console.error(`#${issue.number}:`,e.message)}
}
results.sort((a,b)=>b.score-a.score);
const top=results.slice(0,5);

let comment='';
if(top.length){
comment=`## 🔍 Similar Issues Found\n\n`;
top.forEach((s,i)=>{
comment+=`<details><summary><strong>${i+1}. <a href="${s.url}">#${s.number}</a>: ${s.title}</strong> (${Math.round(s.score*100)}%)</summary>\n\n`;
comment+=`**State:** ${s.state==='open'?'🟢 Open':'🔴 Closed'} \n**Labels:** ${s.labels.slice(0,5).map(l=>'`'+l+'`').join(', ')||'None'}\n`;
if(s.reason) comment+=`**Why:** ${s.reason}\n`;
comment+=`</details>\n\n`;
});
comment+=`---\n*AI-powered similar issue detection*`;
} else {
comment=`## 🔍 No similar issues found with high confidence.\n\n---\n*AI-powered similar issue detection*`;
}
await octokit.issues.createComment({owner,repo,issue_number:issueNum,body:comment});
})();
SCRIPT
21 changes: 21 additions & 0 deletions .github/workflows/inclusive-heat-sensor.yml
@@ -0,0 +1,21 @@
name: Inclusive Heat Sensor
on:
issues:
types: [opened, reopened]
issue_comment:
types: [created, edited]
pull_request_review_comment:
types: [created, edited]

permissions:
contents: read
issues: write
pull-requests: write

jobs:
detect-heat:
uses: jonathanpeppers/inclusive-heat-sensor/.github/workflows/comments.yml@v0.1.2
with:
minimizeComment: true
offensiveThreshold: 9
angerThreshold: 9