Runnable companion code for the MatrixOne Git4Data Deep Dive article series —
Git-style version control for data at scale (commit, branch, diff, merge,
cherry-pick, time travel), built into MatrixOne.
What is Git4Data?

If you treat a database as a Git repository and a table as a file in it, MatrixOne lets you run everyday Git operations (snapshot, clone, branch, diff, merge, cherry-pick, restore) over terabytes of data, almost instantly. It's the same workflow software engineers use on code, now on data.
| Part | Theme | Topic | Code here |
|---|---|---|---|
| 1 | Concept | The Git moment for data at scale | — |
| 2 | Concept | Hands on: every Git primitive, from zero | 02-hands-on/ |
| 3 | Concept | Under the hood: why snapshot/diff/merge are this fast | — |
| 4 | Data Ops | Incident rescue: snapshot / DIFF investigation / PITR | 04-incident-rescue/ |
| 5 | Data Ops | Collaborative development: one branch per engineer | 05-collaborative-dev/ |
| 6 | Data Ops | Write-Audit-Publish: a release gate for data | 06-write-audit-publish/ |
| 7 | AI Training | ML continuous learning: train only the delta | 07-ml-incremental/ |
| 8 | AI Training | SFT curation: clean in place, with receipts | 08-sft-curation/ |
| 9 | AI Training | Collaborative labeling: disagreement IS the conflict | 09-labeling-collab/ |
| 10 | AI Training | RLHF preference data: consensus, re-judging, reproducibility | 10-rlhf-preference/ |
| 11 | AI Training | Multimodal × lakeFS: bytes there, catalog here | 11-multimodal-lakefs/ |
| 12 | Agents | Agent memory: versioned, branchable, rewindable | 12-agent-memory/ |
| 13 | Agents | Agent traces: queryable, joinable, versioned | 13-agent-trace/ |
| 14 | Agents | Agent self-evolution (finale): branch / evaluate / merge / roll back | 14-agent-evolution/ |
Each later tutorial will add its own folder here.
# 1. Run a local MatrixOne (open source, MySQL-compatible)
docker run -d -p 6001:6001 --name matrixone matrixorigin/matrixone:4.0.0-rc1
# 2. Run the Part 2 walkthrough — every Git primitive on 1,000,000 rows
mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql

Default credentials: user root, password 111, port 6001.
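A freshly started container may need some time before it accepts connections, so step 2 can fail if it runs immediately after step 1. The following is a minimal sketch of a retry helper, not part of this repository: `wait_for` is a hypothetical name, and the 60-attempt limit in the usage comment is an arbitrary assumption. It only relies on the connection settings shown above.

```shell
# wait_for (hypothetical helper): retry a command once per second
# until it succeeds or the attempt limit is reached.
# Usage: wait_for <max_attempts> <command> [args...]
wait_for() {
  tries="$1"
  shift
  i=1
  while [ "$i" -le "$tries" ]; do
    # Run the probe command quietly; success ends the wait.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  # The probe never succeeded within the attempt limit.
  return 1
}

# Example (assumed 60-attempt limit), probing with a trivial query first:
# wait_for 60 mysql -h 127.0.0.1 -P 6001 -u root -p111 -e 'SELECT 1' \
#   && mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql
```

The helper takes the probe as ordinary arguments, so the same function works with any readiness check you prefer.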
02-hands-on/git4data_primitives.sql is a single, copy-paste-runnable script (English comments) that walks through:
- commit / tag / reset: CREATE SNAPSHOT, time-travel SELECT … {snapshot=…}, RESTORE
- clone: zero-copy CREATE TABLE … CLONE
- branch: lineage-tracked DATA BRANCH CREATE
- diff: row-level DATA BRANCH DIFF … OUTPUT SUMMARY / COUNT / LIMIT / FILE
- merge: three-way DATA BRANCH MERGE … WHEN CONFLICT FAIL | SKIP | ACCEPT
- cherry-pick: DATA BRANCH PICK … KEYS(…)
- point-in-time recovery: CREATE PITR + RESTORE … FROM PITR "…"
- granularity: the same semantics at table / database / account / cluster levels
- scale: measured numbers showing snapshot/clone/branch cost is independent of table size
It loads a million rows with a single generate_series statement (no external
files needed) and cleans up after itself.
The same table and the same operations at each size, on a single-node Docker MatrixOne (4.0.0-rc1). Diff and merge each touch only 1,000 rows. Timings are steady-state medians of several runs:
| table size | load | CREATE SNAPSHOT | CLONE | DATA BRANCH CREATE | DIFF (1000) | MERGE (1000) |
|---|---|---|---|---|---|---|
| 1,000,000 | 0.5 s | 6 ms | 6 ms | 7 ms | 13 ms | 64 ms |
| 10,000,000 | 5.3 s | 8 ms | 8 ms | 7 ms | 21 ms | 178 ms |
| 100,000,000 | 41 s | 5 ms | 25 ms | 19 ms | 23 ms | 189 ms |
Snapshot time is effectively constant (5 to 8 ms at every size), because a snapshot only names a metadata directory. Clone and branch copy the metadata directory, not the data: with 100× the data, clone rises only from 6 ms to 25 ms. Diff and merge scale with the number of rows changed, not with table size. (The first snapshot of a freshly loaded table takes ~10–12 ms because of a one-time flush of in-memory data; later snapshots drop to the steady-state numbers above.)
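To make the scaling claim concrete, the sketch below computes how much each timing grows between the 1,000,000-row and 100,000,000-row measurements. The numbers are copied from the table above (load converted from seconds to milliseconds); `growth_factors` is a hypothetical helper name used only for this illustration.

```shell
# growth_factors (hypothetical helper): for each operation, print the
# ratio of its time at 100,000,000 rows to its time at 1,000,000 rows.
# Input columns: operation, ms at 1,000,000 rows, ms at 100,000,000 rows.
growth_factors() {
  LC_ALL=C awk '{ printf "%s %.1fx\n", $1, $3 / $2 }' <<'EOF'
load 500 41000
snapshot 6 5
clone 6 25
branch 7 19
diff 13 23
merge 64 189
EOF
}

growth_factors
```

With 100× the rows, load time grows about 82×, while every versioning operation grows by roughly 4× or less, and snapshot does not grow at all.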
- MatrixOne: https://github.com/matrixorigin/matrixone
- Docs: https://docs.matrixorigin.cn/
Apache 2.0