
Build tsikit-learn: scikit-learn → TypeScript migration #5

@mrjf

Description


schedule: every 60m
timeout-minutes: 360


Goal

Build tsikit-learn, a complete TypeScript port of scikit-learn, one feature at a time. This is an open-ended program — it runs continuously, always adding the next piece of scikit-learn functionality.

Data layer: This project uses tsb (a TypeScript port of pandas) as its DataFrame/Series foundation, just as scikit-learn uses pandas/numpy. tsb is a peer dependency.

How each iteration works

  1. Read the README at the repo root. It is the source of truth for all project parameters (package name, stack, conventions, testing requirements).
  2. Read repo-memory (.autoloop/, AGENTS.md, CLAUDE.md, any planning docs) and the full issue thread (comments from other runs and steering from maintainers).
  3. Check for other running jobs. If another autoloop job is in-flight on this program, choose different work that won't conflict. Check the long-running branch (autoloop/build-tsikit-learn) and recent commits to understand what's already in progress. Integrate cleanly when merging.
  4. Plan extensively before writing code. On each iteration, write or update a detailed plan in repo-memory documenting: what scikit-learn modules exist, what's been ported so far, what's next, and why. The plan should reference the scikit-learn source directly.
  5. Pick ONE feature to implement. Start with whatever is most foundational and work outward. Each iteration adds exactly one cohesive piece — never half-finish something.
  6. Implement it fully:
    • Source code in src/ — strict TypeScript, no any, no escape hatches
    • Comprehensive tests — unit, property-based (fast-check), and fuzz tests where applicable. Match the coverage of scikit-learn's Python test suite: port every corresponding test, then add more.
    • Interactive web playground/demo page for the feature
    • Update all docs, exports, and indexes
  7. Commit with a clear message describing what scikit-learn feature was ported.
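The testing bar in step 6 can be illustrated with a dependency-free property check (real iterations would use fast-check; minMaxScale here is a hypothetical stand-in, not a ported feature):

```typescript
// Hypothetical min-max scaler used only to demonstrate the property-test pattern.
function minMaxScale(xs: readonly number[]): number[] {
  const min = Math.min(...xs);
  const max = Math.max(...xs);
  const range = max - min || 1; // constant input maps to 0 rather than dividing by zero
  return xs.map((x) => (x - min) / range);
}

// Property: every scaled value lies in [0, 1], checked over many random inputs.
for (let trial = 0; trial < 100; trial++) {
  const xs = Array.from({ length: 10 }, () => Math.random() * 1000 - 500);
  if (!minMaxScale(xs).every((v) => v >= 0 && v <= 1)) {
    throw new Error(`property violated on trial ${trial}`);
  }
}
console.log("scaled values stayed in [0, 1] across 100 random inputs");
```

fast-check would replace the hand-rolled loop with generators and shrinking, but the invariant being tested is the same.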

First iteration

The very first iteration should:

  • Set up the complete project structure: bun init, tsconfig.json (strictest settings), linting config (Biome), test config, CI workflow (GitHub Actions with Bun), Pages deployment pipeline
  • Install tsb as a dependency for the data layer (DataFrame, Series, Index from tsessebe)
  • Create the initial src/index.ts with the tsikit-learn package entry point
  • Write a minimal "hello world" test to prove the pipeline works end to end
  • Set up the playground infrastructure (copy the pattern from tsessebe's playground) — interactive code editor, browser bundle, GitHub Pages deploy
  • Document the full migration plan in repo-memory: enumerate scikit-learn's top-level modules and features, propose an ordering, note architectural decisions
  • Commit the plan and project skeleton — no scikit-learn features yet, just the foundation
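The entry-point and "hello world" bullets above could start from a sketch like this (VERSION and hello are assumed names, not a committed API):

```typescript
// Hypothetical shape of the initial src/index.ts; VERSION and hello are assumed names.
export const VERSION = "0.0.1";

// The export surface grows one cohesive module at a time as features land, e.g.:
// export * from "./base";
// export * from "./exceptions";

// A trivial function the end-to-end "hello world" test can assert against.
export function hello(): string {
  return `tsikit-learn ${VERSION}`;
}

console.log(hello()); // tsikit-learn 0.0.1
```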

Migration ordering (suggested)

Port scikit-learn modules in dependency order, starting with the foundational pieces everything else builds on:

Phase 1 — Foundation (math & utilities)

  1. base — BaseEstimator, mixins (ClassifierMixin, RegressorMixin, TransformerMixin, ClusterMixin), clone, parameter get/set, sklearn API conventions
  2. utils — validation (check_array, check_X_y, check_is_fitted), type checking, multiclass helpers, class_weight, extmath (safe_sparse_dot, row_norms, softmax, log_logistic)
  3. utils.validation — input validation, array conversion, sample weight checks
  4. exceptions — NotFittedError, ConvergenceWarning, etc.
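The base-layer conventions in item 1 can be previewed in strict TypeScript; Estimator, ScalerParams, and the camelCase method names below are assumptions, not the final API:

```typescript
// Sketch of the get_params/set_params conventions from sklearn's base module.
interface Estimator<P extends object> {
  getParams(): P;
  setParams(params: Partial<P>): this;
}

interface ScalerParams {
  withMean: boolean;
  withStd: boolean;
}

class StandardScalerStub implements Estimator<ScalerParams> {
  private params: ScalerParams = { withMean: true, withStd: true };

  getParams(): ScalerParams {
    return { ...this.params }; // defensive copy, like sklearn's get_params
  }

  setParams(params: Partial<ScalerParams>): this {
    this.params = { ...this.params, ...params };
    return this; // chainable, like sklearn's set_params returning self
  }
}

const est = new StandardScalerStub().setParams({ withStd: false });
console.log(est.getParams()); // { withMean: true, withStd: false }
```

Typing the parameter bag per estimator keeps get/set round-trips checkable at compile time, which a stringly-typed port of Python's `**kwargs` would not be.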

Phase 2 — Preprocessing & metrics
5. preprocessing — StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, Normalizer, Binarizer, LabelEncoder, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, FunctionTransformer, PowerTransformer, QuantileTransformer, KBinsDiscretizer, SplineTransformer
6. metrics — accuracy, precision, recall, f1, confusion_matrix, classification_report, roc_auc, roc_curve, mean_squared_error, mean_absolute_error, r2_score, log_loss, silhouette_score, adjusted_rand_score, pairwise distances/kernels
7. model_selection — train_test_split, KFold, StratifiedKFold, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV, ParameterGrid, learning_curve, validation_curve
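As a flavor of what Phase 2's metrics look like in pure TypeScript, here is a hypothetical accuracyScore (the eventual tsikit-learn signature may differ):

```typescript
// Hypothetical accuracy metric: fraction of predictions matching the labels.
function accuracyScore(yTrue: readonly number[], yPred: readonly number[]): number {
  if (yTrue.length !== yPred.length) {
    throw new Error("yTrue and yPred must have the same length");
  }
  let correct = 0;
  for (let i = 0; i < yTrue.length; i++) {
    if (yTrue[i] === yPred[i]) correct++;
  }
  return correct / yTrue.length;
}

console.log(accuracyScore([0, 1, 1, 0], [0, 1, 0, 0])); // 0.75
```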

Phase 3 — Core estimators
8. linear_model — LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SGDClassifier, SGDRegressor, Perceptron, PassiveAggressiveClassifier
9. tree — DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz, plot_tree
10. neighbors — KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors, KDTree, BallTree, radius_neighbors
11. naive_bayes — GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
12. svm — SVC, SVR, LinearSVC, LinearSVR, NuSVC, NuSVR (pure TS implementations, no libsvm)
13. cluster — KMeans, MiniBatchKMeans, DBSCAN, AgglomerativeClustering, SpectralClustering, MeanShift, Birch, OPTICS

Phase 4 — Ensemble & advanced
14. ensemble — RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor, BaggingClassifier, BaggingRegressor, VotingClassifier, StackingClassifier, HistGradientBoosting*
15. decomposition — PCA, IncrementalPCA, KernelPCA, TruncatedSVD, NMF, FactorAnalysis, FastICA, LatentDirichletAllocation
16. manifold — TSNE, MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding
17. feature_selection — SelectKBest, SelectPercentile, GenericUnivariateSelect, RFE, RFECV, SelectFromModel, VarianceThreshold, mutual_info_classif, mutual_info_regression, f_classif, f_regression, chi2
18. feature_extraction — DictVectorizer, FeatureHasher, text (CountVectorizer, TfidfVectorizer, TfidfTransformer, HashingVectorizer)

Phase 5 — Pipelines, imputation & remaining
19. pipeline — Pipeline, FeatureUnion, make_pipeline, make_union, ColumnTransformer
20. impute — SimpleImputer, IterativeImputer, KNNImputer, MissingIndicator
21. compose — ColumnTransformer, TransformedTargetRegressor, make_column_selector
22. calibration — CalibratedClassifierCV, calibration_curve
23. multiclass — OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
24. multioutput — MultiOutputClassifier, MultiOutputRegressor, ClassifierChain, RegressorChain
25. discriminant_analysis — LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
26. gaussian_process — GaussianProcessClassifier, GaussianProcessRegressor, kernels (RBF, Matern, DotProduct, WhiteKernel, ConstantKernel, RationalQuadratic, ExpSineSquared, Sum, Product)
27. isotonic — IsotonicRegression, isotonic_regression, check_increasing
28. kernel_approximation — RBFSampler, Nystroem, AdditiveChi2Sampler, SkewedChi2Sampler
29. kernel_ridge — KernelRidge
30. mixture — GaussianMixture, BayesianGaussianMixture
31. neural_network — MLPClassifier, MLPRegressor, BernoulliRBM
32. semi_supervised — LabelPropagation, LabelSpreading, SelfTrainingClassifier
33. datasets — make_classification, make_regression, make_blobs, make_moons, make_circles, make_swiss_roll, load_iris, load_digits, load_wine, load_breast_cancer
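A make_blobs-style generator from item 33 might look roughly like this (signature and defaults are assumptions):

```typescript
// Hypothetical synthetic-data helper in the spirit of sklearn.datasets.make_blobs.
function makeBlobs(
  nSamples: number,
  centers: readonly (readonly [number, number])[],
  spread = 0.5,
): { X: [number, number][]; y: number[] } {
  const X: [number, number][] = [];
  const y: number[] = [];
  for (let i = 0; i < nSamples; i++) {
    const c = i % centers.length; // assign samples round-robin across centers
    const [cx, cy] = centers[c] ?? [0, 0];
    X.push([cx + (Math.random() - 0.5) * spread, cy + (Math.random() - 0.5) * spread]);
    y.push(c);
  }
  return { X, y };
}

const { X, y } = makeBlobs(6, [[0, 0], [5, 5]]);
console.log(X.length, new Set(y).size); // 6 2
```

The real port would also need a seedable RNG so playground demos and tests are reproducible.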

Key constraints

  • Package name is tsikit-learn. All imports: import { LinearRegression } from 'tsikit-learn'
  • Data layer is tsb. Use tsb (from tsessebe) for DataFrame, Series, Index — just as scikit-learn uses pandas/numpy. For numeric arrays, use typed arrays (Float64Array, Int32Array) directly. Implement a thin ndarray-like wrapper for 2D operations.
  • Bun for runtime, bundling, testing
  • Zero additional dependencies for core library beyond tsb. Build all ML algorithms from scratch in pure TypeScript. No WASM, no native bindings.
  • Strictest TypeScript — strict: true, noUncheckedIndexedAccess: true, exactOptionalPropertyTypes: true; no any anywhere, no @ts-ignore, no as casts unless provably safe
  • Strictest linting — Biome with all rules enabled, zero warnings
  • 100% test coverage — port scikit-learn's Python tests for everything, then add more: unit tests, property-based tests (fast-check), fuzz tests, and Playwright e2e for the web playground
  • Interactive web playground — every feature gets a demo page showing the algorithm in action (visualizations, scatter plots, decision boundaries where applicable), deployed to GitHub Pages
  • Don't worry about performance optimization — another program handles that. Focus on correctness and completeness.
  • scikit-learn API parity — match scikit-learn's public API surface, adapted to TypeScript idioms. fit(), predict(), transform(), fit_transform(), score(), get_params(), set_params() patterns. When in doubt, read the scikit-learn source.
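The "thin ndarray-like wrapper for 2D operations" constraint could take a shape like this row-major Float64Array wrapper (a sketch, not the committed design):

```typescript
// Row-major 2D wrapper over Float64Array; names and API are assumptions.
class Matrix {
  constructor(
    readonly rows: number,
    readonly cols: number,
    readonly data: Float64Array = new Float64Array(rows * cols),
  ) {}

  get(i: number, j: number): number {
    const v = this.data[i * this.cols + j];
    // Typed-array out-of-range reads yield undefined, so this doubles as a bounds check.
    if (v === undefined) throw new RangeError(`index out of bounds: (${i}, ${j})`);
    return v;
  }

  set(i: number, j: number, v: number): void {
    this.data[i * this.cols + j] = v;
  }

  static fromRows(rows: readonly (readonly number[])[]): Matrix {
    const m = new Matrix(rows.length, rows[0]?.length ?? 0);
    rows.forEach((row, i) => row.forEach((v, j) => m.set(i, j, v)));
    return m;
  }
}

const X = Matrix.fromRows([[1, 2], [3, 4]]);
console.log(X.get(1, 0)); // 3
```

A single flat Float64Array keeps the data contiguous, which matters even before the separate performance program gets involved.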

Playground / Pages site

The playground follows the same pattern as tsessebe:

  • Landing page (playground/index.html) with a feature roadmap grid showing ported vs pending modules
  • One page per feature with interactive demos (e.g., train a model, see predictions, visualize decision boundaries)
  • In-browser TypeScript editor powered by the TypeScript compiler
  • Built and deployed to GitHub Pages via CI
  • Use Canvas/SVG for visualizations (scatter plots, decision boundaries, dendrograms, ROC curves, etc.)
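For the Canvas/SVG visualizations, a string-building SVG helper keeps demos testable outside the browser; this scatter-plot sketch assumes points normalized to [0, 1]:

```typescript
// Hypothetical SVG scatter-plot helper; assumes points are normalized to [0, 1].
function scatterSvg(points: readonly (readonly [number, number])[], size = 200): string {
  const circles = points
    // Flip y so larger values render toward the top, matching plot conventions.
    .map(([x, y]) => `<circle cx="${x * size}" cy="${(1 - y) * size}" r="3" />`)
    .join("");
  return `<svg width="${size}" height="${size}" xmlns="http://www.w3.org/2000/svg">${circles}</svg>`;
}

const svg = scatterSvg([[0.1, 0.2], [0.8, 0.9]]);
console.log(svg.startsWith("<svg") && svg.includes("<circle")); // true
```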

Target

This program is building the project from scratch, but its write access is limited to the paths below.

Only modify these files:

  • src/** — library source code
  • tests/** — all test files
  • playground/** — interactive web playground/demos
  • package.json — package config
  • tsconfig.json — TypeScript config
  • biome.json — linter config
  • bunfig.toml — Bun config
  • .github/workflows/** — CI/CD pipelines (but not autoloop workflow files)
  • AGENTS.md — agent instructions
  • CLAUDE.md — Claude Code config
  • .autoloop/memory/** — repo-memory for planning and coordination

Do NOT modify:

  • README.md — source of truth, read-only for this program
  • .autoloop/programs/** — program definitions
  • .github/ISSUE_TEMPLATE/** — issue templates
  • .github/workflows/autoloop* — autoloop workflow files
  • .github/workflows/sync-branches* — sync workflow files

Evaluation

# Type check must pass — reject iterations that introduce type errors
if command -v bunx >/dev/null 2>&1; then
  if ! bunx tsc --noEmit 2>&1; then
    echo '{"sklearn_features_ported": null, "rejected_reason": "type check failed"}'
    exit 0
  fi
fi

# Tests must pass — reject iterations that break existing functionality
if command -v bun >/dev/null 2>&1; then
  if ! bun test 2>&1; then
    echo '{"sklearn_features_ported": null, "rejected_reason": "tests failed"}'
    exit 0
  fi
fi

# Count TypeScript source files that contain sklearn-related functionality
# (excludes config, test infra, playground scaffolding — only counts actual library code)
count=$(find src -name '*.ts' -not -name 'index.ts' -not -name '*.d.ts' -exec grep -l 'export' {} + 2>/dev/null | wc -l | tr -d ' ')
echo "{\"sklearn_features_ported\": ${count:-0}}"

The metric is sklearn_features_ported. Higher is better.
