
Build tsikit-learn: scikit-learn → TypeScript migration #5

@mrjf

Description


schedule: every 60m
timeout-minutes: 360


Goal

Build tsikit-learn, a complete TypeScript port of scikit-learn, one feature at a time. This is an open-ended program — it runs continuously, always adding the next piece of scikit-learn functionality.

Data layer: This project uses tsb (a TypeScript port of pandas) as its DataFrame/Series foundation, just as scikit-learn uses pandas/numpy. tsb is a peer dependency.

How each iteration works

  1. Read the README at the repo root. It is the source of truth for all project parameters (package name, stack, conventions, testing requirements).
  2. Read repo-memory (.autoloop/, AGENTS.md, CLAUDE.md, any planning docs) and the full issue thread (comments from other runs and steering from maintainers).
  3. Check for other running jobs. If another autoloop job is in-flight on this program, choose different work that won't conflict. Check the long-running branch (autoloop/build-tsikit-learn) and recent commits to understand what's already in progress. Integrate cleanly when merging.
  4. Plan extensively before writing code. On each iteration, write or update a detailed plan in repo-memory documenting: what scikit-learn modules exist, what's been ported so far, what's next, and why. The plan should reference the scikit-learn source directly.
  5. Pick ONE feature to implement. Start with whatever is most foundational and work outward. Each iteration adds exactly one cohesive piece — never half-finish something.
  6. Implement it fully:
    • Source code in src/ — strict TypeScript, no any, no escape hatches
    • Comprehensive tests — unit, property-based (fast-check), and fuzz tests where applicable. Match the coverage of scikit-learn's Python test suite: port every corresponding test, then add more.
    • Interactive web playground/demo page for the feature
    • Update all docs, exports, and indexes
  7. Commit with a clear message describing what scikit-learn feature was ported.
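The testing bar in step 6 can be illustrated with a dependency-free property check (real iterations would use fast-check; minMaxScale here is a hypothetical stand-in, not a ported feature):

```typescript
// Hypothetical min-max scaler used only to demonstrate the property-test pattern.
function minMaxScale(xs: readonly number[]): number[] {
  const min = Math.min(...xs);
  const max = Math.max(...xs);
  const range = max - min || 1; // constant input maps to 0 rather than dividing by zero
  return xs.map((x) => (x - min) / range);
}

// Property: every scaled value lies in [0, 1], checked over many random inputs.
for (let trial = 0; trial < 100; trial++) {
  const xs = Array.from({ length: 10 }, () => Math.random() * 1000 - 500);
  if (!minMaxScale(xs).every((v) => v >= 0 && v <= 1)) {
    throw new Error(`property violated on trial ${trial}`);
  }
}
console.log("scaled values stayed in [0, 1] across 100 random inputs");
```

fast-check would replace the hand-rolled loop with generators and shrinking, but the invariant being tested is the same.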

First iteration

The very first iteration should:

  • Set up the complete project structure: bun init, tsconfig.json (strictest settings), linting config (Biome), test config, CI workflow (GitHub Actions with Bun), Pages deployment pipeline
  • Install tsb as a dependency for the data layer (DataFrame, Series, Index from tsessebe)
  • Create the initial src/index.ts with the tsikit-learn package entry point
  • Write a minimal "hello world" test to prove the pipeline works end to end
  • Set up the playground infrastructure (copy the pattern from tsessebe's playground) — interactive code editor, browser bundle, GitHub Pages deploy
  • Document the full migration plan in repo-memory: enumerate scikit-learn's top-level modules and features, propose an ordering, note architectural decisions
  • Commit the plan and project skeleton — no scikit-learn features yet, just the foundation
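The entry-point and "hello world" bullets above could start from a sketch like this (VERSION and hello are assumed names, not a committed API):

```typescript
// Hypothetical shape of the initial src/index.ts; VERSION and hello are assumed names.
export const VERSION = "0.0.1";

// The export surface grows one cohesive module at a time as features land, e.g.:
// export * from "./base";
// export * from "./exceptions";

// A trivial function the end-to-end "hello world" test can assert against.
export function hello(): string {
  return `tsikit-learn ${VERSION}`;
}

console.log(hello()); // tsikit-learn 0.0.1
```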

Migration ordering (suggested)

Port scikit-learn modules in dependency order, starting with the foundational pieces everything else builds on:

Phase 1 — Foundation (math & utilities)

  1. base — BaseEstimator, mixins (ClassifierMixin, RegressorMixin, TransformerMixin, ClusterMixin), clone, parameter get/set, sklearn API conventions
  2. utils — validation (check_array, check_X_y, check_is_fitted), type checking, multiclass helpers, class_weight, extmath (safe_sparse_dot, row_norms, softmax, log_logistic)
  3. utils.validation — input validation, array conversion, sample weight checks
  4. exceptions — NotFittedError, ConvergenceWarning, etc.
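The base-layer conventions in item 1 can be previewed in strict TypeScript; Estimator, ScalerParams, and the camelCase method names below are assumptions, not the final API:

```typescript
// Sketch of the get_params/set_params conventions from sklearn's base module.
interface Estimator<P extends object> {
  getParams(): P;
  setParams(params: Partial<P>): this;
}

interface ScalerParams {
  withMean: boolean;
  withStd: boolean;
}

class StandardScalerStub implements Estimator<ScalerParams> {
  private params: ScalerParams = { withMean: true, withStd: true };

  getParams(): ScalerParams {
    return { ...this.params }; // defensive copy, like sklearn's get_params
  }

  setParams(params: Partial<ScalerParams>): this {
    this.params = { ...this.params, ...params };
    return this; // chainable, like sklearn's set_params returning self
  }
}

const est = new StandardScalerStub().setParams({ withStd: false });
console.log(est.getParams()); // { withMean: true, withStd: false }
```

Typing the parameter bag per estimator keeps get/set round-trips checkable at compile time, which a stringly-typed port of Python's `**kwargs` would not be.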

Phase 2 — Preprocessing & metrics
5. preprocessing — StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, Normalizer, Binarizer, LabelEncoder, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, FunctionTransformer, PowerTransformer, QuantileTransformer, KBinsDiscretizer, SplineTransformer
6. metrics — accuracy, precision, recall, f1, confusion_matrix, classification_report, roc_auc, roc_curve, mean_squared_error, mean_absolute_error, r2_score, log_loss, silhouette_score, adjusted_rand_score, pairwise distances/kernels
7. model_selection — train_test_split, KFold, StratifiedKFold, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV, ParameterGrid, learning_curve, validation_curve
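As a flavor of what Phase 2's metrics look like in pure TypeScript, here is a hypothetical accuracyScore (the eventual tsikit-learn signature may differ):

```typescript
// Hypothetical accuracy metric: fraction of predictions matching the labels.
function accuracyScore(yTrue: readonly number[], yPred: readonly number[]): number {
  if (yTrue.length !== yPred.length) {
    throw new Error("yTrue and yPred must have the same length");
  }
  let correct = 0;
  for (let i = 0; i < yTrue.length; i++) {
    if (yTrue[i] === yPred[i]) correct++;
  }
  return correct / yTrue.length;
}

console.log(accuracyScore([0, 1, 1, 0], [0, 1, 0, 0])); // 0.75
```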

Phase 3 — Core estimators
8. linear_model — LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SGDClassifier, SGDRegressor, Perceptron, PassiveAggressiveClassifier
9. tree — DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz, plot_tree
10. neighbors — KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors, KDTree, BallTree, radius_neighbors
11. naive_bayes — GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
12. svm — SVC, SVR, LinearSVC, LinearSVR, NuSVC, NuSVR (pure TS implementations, no libsvm)
13. cluster — KMeans, MiniBatchKMeans, DBSCAN, AgglomerativeClustering, SpectralClustering, MeanShift, Birch, OPTICS

Phase 4 — Ensemble & advanced
14. ensemble — RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor, BaggingClassifier, BaggingRegressor, VotingClassifier, StackingClassifier, HistGradientBoosting*
15. decomposition — PCA, IncrementalPCA, KernelPCA, TruncatedSVD, NMF, FactorAnalysis, FastICA, LatentDirichletAllocation
16. manifold — TSNE, MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding
17. feature_selection — SelectKBest, SelectPercentile, GenericUnivariateSelect, RFE, RFECV, SelectFromModel, VarianceThreshold, mutual_info_classif, mutual_info_regression, f_classif, f_regression, chi2
18. feature_extraction — DictVectorizer, FeatureHasher, text (CountVectorizer, TfidfVectorizer, TfidfTransformer, HashingVectorizer)

Phase 5 — Pipelines, imputation & remaining
19. pipeline — Pipeline, FeatureUnion, make_pipeline, make_union, ColumnTransformer
20. impute — SimpleImputer, IterativeImputer, KNNImputer, MissingIndicator
21. compose — ColumnTransformer, TransformedTargetRegressor, make_column_selector
22. calibration — CalibratedClassifierCV, calibration_curve
23. multiclass — OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
24. multioutput — MultiOutputClassifier, MultiOutputRegressor, ClassifierChain, RegressorChain
25. discriminant_analysis — LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
26. gaussian_process — GaussianProcessClassifier, GaussianProcessRegressor, kernels (RBF, Matern, DotProduct, WhiteKernel, ConstantKernel, RationalQuadratic, ExpSineSquared, Sum, Product)
27. isotonic — IsotonicRegression, isotonic_regression, check_increasing
28. kernel_approximation — RBFSampler, Nystroem, AdditiveChi2Sampler, SkewedChi2Sampler
29. kernel_ridge — KernelRidge
30. mixture — GaussianMixture, BayesianGaussianMixture
31. neural_network — MLPClassifier, MLPRegressor, BernoulliRBM
32. semi_supervised — LabelPropagation, LabelSpreading, SelfTrainingClassifier
33. datasets — make_classification, make_regression, make_blobs, make_moons, make_circles, make_swiss_roll, load_iris, load_digits, load_wine, load_breast_cancer
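A make_blobs-style generator from item 33 might look roughly like this (signature and defaults are assumptions):

```typescript
// Hypothetical synthetic-data helper in the spirit of sklearn.datasets.make_blobs.
function makeBlobs(
  nSamples: number,
  centers: readonly (readonly [number, number])[],
  spread = 0.5,
): { X: [number, number][]; y: number[] } {
  const X: [number, number][] = [];
  const y: number[] = [];
  for (let i = 0; i < nSamples; i++) {
    const c = i % centers.length; // assign samples round-robin across centers
    const [cx, cy] = centers[c] ?? [0, 0];
    X.push([cx + (Math.random() - 0.5) * spread, cy + (Math.random() - 0.5) * spread]);
    y.push(c);
  }
  return { X, y };
}

const { X, y } = makeBlobs(6, [[0, 0], [5, 5]]);
console.log(X.length, new Set(y).size); // 6 2
```

The real port would also need a seedable RNG so playground demos and tests are reproducible.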

Key constraints

  • Package name is tsikit-learn. All imports: import { LinearRegression } from 'tsikit-learn'
  • Data layer is tsb. Use tsb (from tsessebe) for DataFrame, Series, Index — just as scikit-learn uses pandas/numpy. For numeric arrays, use typed arrays (Float64Array, Int32Array) directly. Implement a thin ndarray-like wrapper for 2D operations.
  • Bun for runtime, bundling, testing
  • Zero additional dependencies for core library beyond tsb. Build all ML algorithms from scratch in pure TypeScript. No WASM, no native bindings.
  • Strictest TypeScript — strict: true, noUncheckedIndexedAccess: true, exactOptionalPropertyTypes: true; no any anywhere, no @ts-ignore, no as casts unless provably safe
  • Strictest linting — Biome with all rules enabled, zero warnings
  • 100% test coverage — port scikit-learn's Python tests for everything, then add more: unit tests, property-based tests (fast-check), fuzz tests, and Playwright e2e for the web playground
  • Interactive web playground — every feature gets a demo page showing the algorithm in action (visualizations, scatter plots, decision boundaries where applicable), deployed to GitHub Pages
  • Don't worry about performance optimization — another program handles that. Focus on correctness and completeness.
  • scikit-learn API parity — match scikit-learn's public API surface, adapted to TypeScript idioms. fit(), predict(), transform(), fit_transform(), score(), get_params(), set_params() patterns. When in doubt, read the scikit-learn source.
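The "thin ndarray-like wrapper for 2D operations" constraint could take a shape like this row-major Float64Array wrapper (a sketch, not the committed design):

```typescript
// Row-major 2D wrapper over Float64Array; names and API are assumptions.
class Matrix {
  constructor(
    readonly rows: number,
    readonly cols: number,
    readonly data: Float64Array = new Float64Array(rows * cols),
  ) {}

  get(i: number, j: number): number {
    const v = this.data[i * this.cols + j];
    // Typed-array out-of-range reads yield undefined, so this doubles as a bounds check.
    if (v === undefined) throw new RangeError(`index out of bounds: (${i}, ${j})`);
    return v;
  }

  set(i: number, j: number, v: number): void {
    this.data[i * this.cols + j] = v;
  }

  static fromRows(rows: readonly (readonly number[])[]): Matrix {
    const m = new Matrix(rows.length, rows[0]?.length ?? 0);
    rows.forEach((row, i) => row.forEach((v, j) => m.set(i, j, v)));
    return m;
  }
}

const X = Matrix.fromRows([[1, 2], [3, 4]]);
console.log(X.get(1, 0)); // 3
```

A single flat Float64Array keeps the data contiguous, which matters even before the separate performance program gets involved.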

Playground / Pages site

The playground follows the same pattern as tsessebe:

  • Landing page (playground/index.html) with a feature roadmap grid showing ported vs pending modules
  • One page per feature with interactive demos (e.g., train a model, see predictions, visualize decision boundaries)
  • In-browser TypeScript editor powered by the TypeScript compiler
  • Built and deployed to GitHub Pages via CI
  • Use Canvas/SVG for visualizations (scatter plots, decision boundaries, dendrograms, ROC curves, etc.)
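For the Canvas/SVG visualizations, a string-building SVG helper keeps demos testable outside the browser; this scatter-plot sketch assumes points normalized to [0, 1]:

```typescript
// Hypothetical SVG scatter-plot helper; assumes points are normalized to [0, 1].
function scatterSvg(points: readonly (readonly [number, number])[], size = 200): string {
  const circles = points
    // Flip y so larger values render toward the top, matching plot conventions.
    .map(([x, y]) => `<circle cx="${x * size}" cy="${(1 - y) * size}" r="3" />`)
    .join("");
  return `<svg width="${size}" height="${size}" xmlns="http://www.w3.org/2000/svg">${circles}</svg>`;
}

const svg = scatterSvg([[0.1, 0.2], [0.8, 0.9]]);
console.log(svg.startsWith("<svg") && svg.includes("<circle")); // true
```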

Target

This program is building the project from scratch, but its write access is limited to the paths below.

Only modify these files:

  • src/** — library source code
  • tests/** — all test files
  • playground/** — interactive web playground/demos
  • package.json — package config
  • tsconfig.json — TypeScript config
  • biome.json — linter config
  • bunfig.toml — Bun config
  • .github/workflows/** — CI/CD pipelines (but not autoloop workflow files)
  • AGENTS.md — agent instructions
  • CLAUDE.md — Claude Code config
  • .autoloop/memory/** — repo-memory for planning and coordination

Do NOT modify:

  • README.md — source of truth, read-only for this program
  • .autoloop/programs/** — program definitions
  • .github/ISSUE_TEMPLATE/** — issue templates
  • .github/workflows/autoloop* — autoloop workflow files
  • .github/workflows/sync-branches* — sync workflow files

Evaluation

# Type check must pass — reject iterations that introduce type errors
if command -v bunx >/dev/null 2>&1; then
  if ! bunx tsc --noEmit 2>&1; then
    echo '{"sklearn_features_ported": null, "rejected_reason": "type check failed"}'
    exit 0
  fi
fi

# Tests must pass — reject iterations that break existing functionality
if command -v bun >/dev/null 2>&1; then
  if ! bun test 2>&1; then
    echo '{"sklearn_features_ported": null, "rejected_reason": "tests failed"}'
    exit 0
  fi
fi

# Count TypeScript source files that contain sklearn-related functionality
# (excludes config, test infra, playground scaffolding — only counts actual library code)
count=$(find src -name '*.ts' -not -name 'index.ts' -not -name '*.d.ts' -exec grep -l 'export' {} + 2>/dev/null | wc -l | tr -d ' ')
echo "{\"sklearn_features_ported\": ${count:-0}}"

The metric is sklearn_features_ported. Higher is better.
