MatrixOne Git4Data Tutorial

Runnable companion code for the MatrixOne Git4Data Deep Dive article series — Git-style version control for data at scale (commit, branch, diff, merge, cherry-pick, time travel), built into MatrixOne.

What is Git4Data? If you treat a database as a Git repository and a table as a file in it, MatrixOne lets you run everyday Git operations — snapshot, clone, branch, diff, merge, cherry-pick, restore — over terabytes of data, almost instantly. It's the same workflow software engineers use on code, now on data.

The series

Part	Theme	Topic	Code here
1	Concept	The Git moment for data at scale	—
2	Concept	Hands on: every Git primitive, from zero	`02-hands-on/`
3	Concept	Under the hood: why snapshot/diff/merge are this fast	—
4	Data Ops	Incident rescue: snapshot / DIFF investigation / PITR	`04-incident-rescue/`
5	Data Ops	Collaborative development: one branch per engineer	`05-collaborative-dev/`
6	Data Ops	Write-Audit-Publish: a release gate for data	`06-write-audit-publish/`
7	AI Training	ML continuous learning: train only the delta	`07-ml-incremental/`
8	AI Training	SFT curation: clean in place, with receipts	`08-sft-curation/`
9	AI Training	Collaborative labeling: disagreement IS the conflict	`09-labeling-collab/`
10	AI Training	RLHF preference data: consensus, re-judging, reproducibility	`10-rlhf-preference/`
11	AI Training	Multimodal × lakeFS: bytes there, catalog here	`11-multimodal-lakefs/`
12	Agents	Agent memory: versioned, branchable, rewindable	`12-agent-memory/`
13	Agents	Agent traces: queryable, joinable, versioned	`13-agent-trace/`
14	Agents	Agent self-evolution (finale): branch / evaluate / merge / roll back	`14-agent-evolution/`

Each later tutorial will add its own folder here.

Quick start (5 minutes)

# 1. Run a local MatrixOne (open source, MySQL-compatible)
docker run -d -p 6001:6001 --name matrixone matrixorigin/matrixone:4.0.0-rc1

# 2. Run the Part 2 walkthrough — every Git primitive on 1,000,000 rows
mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql

Default credentials: user root, password 111, port 6001.
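
A freshly started container may take a short while before it accepts connections, so piping the script in immediately after `docker run` can fail. The sketch below shows one way to wait for readiness first. `wait_for` is a hypothetical helper, not part of this repository, and the retry count and delay in the usage comment are arbitrary assumptions.

```shell
# Hypothetical helper (not part of this repository): retry a command
# until it succeeds, up to ATTEMPTS times, sleeping DELAY seconds
# after each failed try.
# Usage: wait_for ATTEMPTS DELAY COMMAND [ARGS...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example with the quick-start credentials (assumes the mysql client is installed):
#   wait_for 30 2 mysql -h 127.0.0.1 -P 6001 -u root -p111 -e 'SELECT 1' \
#     && mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql
```

The helper returns non-zero if the command never succeeds, so the walkthrough only runs once a trivial query has gone through.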

What Part 2 covers

02-hands-on/git4data_primitives.sql is a single, copy-paste-runnable script (English comments) that walks through:

commit / tag / reset — CREATE SNAPSHOT, time-travel SELECT … {snapshot=…}, RESTORE
clone — zero-copy CREATE TABLE … CLONE
branch — lineage-tracked DATA BRANCH CREATE
diff — row-level DATA BRANCH DIFF … OUTPUT SUMMARY / COUNT / LIMIT / FILE
merge — three-way DATA BRANCH MERGE … WHEN CONFLICT FAIL | SKIP | ACCEPT
cherry-pick — DATA BRANCH PICK … KEYS(…)
point-in-time recovery — CREATE PITR + RESTORE … FROM PITR "…"
granularity — the same semantics at table / database / account / cluster levels
scale — measured numbers showing snapshot/clone/branch cost is independent of table size

It loads a million rows with a single generate_series statement (no external files needed) and cleans up after itself.

Measured: cost is independent of data size

Same table and same operations at three sizes, on a single-node Docker MatrixOne 4.0.0-rc1. Figures are steady-state medians of several runs; diff and merge each touch only 1,000 rows:

table size (rows)	load	`CREATE SNAPSHOT`	`CLONE`	`DATA BRANCH CREATE`	`DIFF` (1,000 rows)	`MERGE` (1,000 rows)
1,000,000	0.5 s	6 ms	6 ms	7 ms	13 ms	64 ms
10,000,000	5.3 s	8 ms	8 ms	7 ms	21 ms	178 ms
100,000,000	41 s	5 ms	25 ms	19 ms	23 ms	189 ms

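Snapshot cost is effectively constant (5 to 8 ms at every size) because it just names a metadata directory. Clone and branch copy the metadata directory, not the data: at 100× the data, clone rises only from 6 ms to 25 ms and branch from 7 ms to 19 ms. Diff and merge scale with the number of rows changed rather than with table size: with 1,000 changed rows, diff goes from 13 ms to 23 ms and merge from 64 ms to 189 ms across the same 100× range. (The first snapshot of a freshly loaded table takes ~10–12 ms because of a one-time flush of in-memory data, then drops to the steady-state numbers above.)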
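
To make the comparison concrete, the sketch below computes how much each latency grows between the 1,000,000-row and 100,000,000-row lines of the table above. `median` and `growth` are hypothetical helpers, not part of the repository; `median` only illustrates how several timings could be reduced to one figure and is not the authors' measurement harness.

```shell
# Hypothetical helpers for reading the table above; not part of the repository.

# median N1 N2 ... : middle value of the sorted arguments
# (mean of the two middle values when the count is even).
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# growth SMALL LARGE : how many times larger LARGE is than SMALL, to one decimal.
growth() {
  awk -v small="$1" -v large="$2" 'BEGIN { printf "%.1f\n", large / small }'
}

# Latency growth from 1,000,000 to 100,000,000 rows (100x the data), values in ms:
echo "snapshot: $(growth 6 5)x"
echo "clone:    $(growth 6 25)x"
echo "branch:   $(growth 7 19)x"
echo "diff:     $(growth 13 23)x"
echo "merge:    $(growth 64 189)x"
```

With the table's figures, the largest factor is clone at about 4.2x for 100x the rows, while load time grows about 82x (0.5 s to 41 s).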

Links

MatrixOne: https://github.com/matrixorigin/matrixone
Docs: https://docs.matrixorigin.cn/

License

Apache 2.0
