Duckdb by tim-band · Pull Request #75 · SAFEHR-data/datafaker

tim-band · 2025-12-10T18:30:33Z

DuckDB does not work without this change; it uses the PostgreSQL dialect with minor changes, but it really needs a couple more.
This change adds DuckDB as a SQLAlchemy plugin and hooks into the SQL compilation process to remove the PostgreSQL-specific SQL that DuckDB does not understand.

`dump-data` has also been updated to dump all non-ignored, non-vocabulary tables in one call, and to dump the data as Parquet.

So with dump-data for the destination and DuckDB's in-memory database for the source it is now possible to do Parquet-to-Parquet data faking without interacting directly with DuckDB at all! See duckdb.rst for details.

tim-band · 2026-01-14T15:46:32Z

Finally this all works! It's actually fairly easy to fake parquet files now; see the duckdb.rst file for details.

stefpiatek · 2026-01-15T09:44:36Z

Ahh nice, I'll pencil in some time next week hopefully to review. Appreciate it Tim

stefpiatek

Oooh this is very fun. Thanks for working on this and getting the translation so it works ❤️

stefpiatek · 2026-01-29T13:27:07Z

datafaker/generators/base.py

+                except TypeError:
+                    pass


Ooh when are we expecting this to happen, and if so do we want to log it?

datafaker/interactive/base.py

datafaker/interactive/missingness.py

stefpiatek · 2026-01-29T13:32:55Z

datafaker/create.py

 RowCounts = Counter[str]


+@compiles(CreateColumn, "duckdb")


ooh this is fun

Pretty nasty actually. But yes, fun that this hook exists!
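
For anyone who hasn't met this hook before, here is a minimal sketch of how `@compiles` can rewrite DDL for one dialect name. It is not the code from this PR: `ToyDuckDialect` is a hypothetical stand-in (the PostgreSQL dialect registered under the name "duckdb" so the example runs without DuckDB installed), and the `SERIAL` rewrite is only an assumed example of PostgreSQL-only DDL.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects.postgresql.base import PGDialect
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.schema import CreateColumn, CreateTable


class ToyDuckDialect(PGDialect):
    """Hypothetical stand-in: PostgreSQL behaviour under the name "duckdb"."""

    name = "duckdb"


@compiles(CreateColumn, "duckdb")
def _create_column_for_duckdb(element, compiler, **kw):
    # Let the inherited PostgreSQL DDL compiler render the column first...
    text = compiler.visit_create_column(element, **kw)
    # ...then replace the PostgreSQL-only pseudo-type (illustrative rewrite only).
    return text.replace(" SERIAL", " INTEGER")


metadata = MetaData()
person = Table(
    "person",
    metadata,
    Column("person_id", Integer, primary_key=True),
    Column("name", String),
)

# The hook only fires for dialects whose name is "duckdb";
# the plain PostgreSQL dialect is left untouched.
pg_ddl = str(CreateTable(person).compile(dialect=PGDialect()))
duck_ddl = str(CreateTable(person).compile(dialect=ToyDuckDialect()))
print(pg_ddl)
print(duck_ddl)
```

The point of the sketch is that the hook is keyed on the dialect name, so the PostgreSQL output stays as it was while the "duckdb" output is post-processed.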

datafaker/main.py

stefpiatek · 2026-01-29T13:42:08Z

datafaker/serialize_metadata.py

-    if fk_bits[0] not in tables_dict:
-        return False
-    return bool(tables_dict[fk_bits[0]].get("ignore", False))
+    (table, _column) = split_column_full_name(fk)


Ooh I'm not too sure what was happening before but I think this makes sense

Yes, one of those "did this ever work?" moments...

tests/utils.py

stefpiatek · 2026-01-29T13:44:14Z

datafaker/parquet2orm.py

+    column_types = {
+        column: _dtype_to_sql(dtype) for column, dtype in table.dtypes.items()
+    }
+    name_pref = name[: name.rfind(".")]


Could we get to a point where the name doesn't have a .?

This would be if the file doesn't have an extension such as .parquet, but you are right that this needs some sort of defense.

Actually, it's fine. That expression works even if no dot is found.
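
One thing worth checking here: the slice does not raise when there is no dot, but `str.rfind` returns -1 in that case, so `name[:-1]` drops the final character. A quick sketch of the behaviour, with `os.path.splitext` as one possible alternative (the helper names are made up for illustration; whether extensionless names can actually reach this code is not clear from the thread):

```python
import os.path


def prefix_by_slice(name: str) -> str:
    # The expression quoted in the diff above.
    return name[: name.rfind(".")]


def prefix_by_splitext(name: str) -> str:
    # Possible alternative: leaves names without an extension unchanged.
    return os.path.splitext(name)[0]


print(prefix_by_slice("person.parquet"))     # person
print(prefix_by_slice("person"))             # perso  (rfind returned -1)
print(prefix_by_splitext("person.parquet"))  # person
print(prefix_by_splitext("person"))          # person
```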

stefpiatek · 2026-01-29T13:46:41Z

datafaker/interactive/generators.py

+        if last_part in table_names:
+            table_names.append(f"{last_part}.")


oh interesting that this has swapped from first to last

It's just that previously, if there was only one part, it was called `first_part`; now it's called `last_part` because that's how the new `split_column_full_name` function does it.
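
To make the naming concrete, here is a rough sketch of a last-dot split. It is an assumption about the general shape, not the actual `split_column_full_name` from this PR: the function name below is hypothetical, and the real one may represent a missing table part differently.

```python
def split_column_full_name_sketch(full_name: str) -> tuple[str, str]:
    """Hypothetical stand-in: split "table.column" on the LAST dot.

    Splitting on the last dot keeps table names that themselves
    contain dots in one piece.
    """
    table, _sep, last_part = full_name.rpartition(".")
    return table, last_part


print(split_column_full_name_sketch("person.person_id"))
print(split_column_full_name_sketch("my.schema.person.person_id"))
# With only one part, that part lands in last_part and the table part is empty.
print(split_column_full_name_sketch("person_id"))
```

With a split like this, a name with a single part naturally ends up as the last part rather than the first, which matches the rename discussed above.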

Tim Band added 16 commits November 27, 2025 19:10

Refactoring query construction out of the generator proposer

0fe4e31

Removed _get_row_partition

4d8e9c2

A bit more refactoring

0230da2

Fixed #73 Grouped generators query results overlap

6d26ee5

FKs to concept table named, initial implementation

c0125c5

Many extra comments output with partitions

4275014

Named column fetched from config.yaml

a0d4b16

configure-tables allows the setting of naming columns

dd1e852

Experimental DuckDB output support

1377fc0

Parquet output for dump-data

fae9305

DuckDB with files df.py fixed

9dd4073

- DuckDB usage documentation
- Fixed DuckDB parquet output
- dump-data --output as-directory
- choice proposal crash fixed

Fixed test_dump

d70f39c

dump-data tests now run with DuckDB and PostgreSQL

a968e5d

precommit hooks pass

29db13a

A couple more tests for DuckDB

1896db0

More parquet tests, some bugs fixed:

954962c

- Generator writers `go_to` can cope with table names that have dots.
- `dump-data --parquet` can cope with `TIMESTAMP`s
- Foreign Keys to ignored tables fixed.

tim-band requested review from crangelsmith, myyong and stefpiatek January 14, 2026 15:45

Tim Band added 4 commits January 26, 2026 18:40

Added --parquet-dir to make-tables

d8e40bf

Fixed for opiates example

eb91e17

Merge branch 'main' of github:SAFEHR-data/datafaker into duckdb

57350f7

Updated DuckDB documentation

7277ccb

stefpiatek approved these changes Jan 29, 2026

Tim Band added 4 commits February 3, 2026 19:19

Dockerfile needs build tools for DuckDB support

d2a17b1

Response to Stef's comments

86839b0

Black reformat

4071ea0

A little comment

1392372

Tim Band added 3 commits February 4, 2026 17:25

Tests for parquet2orm

958bb95

Made parquet2orm tests a bit looser

61a1546

pre-commit clean

56e71db

tim-band merged commit e1e8992 into main Feb 5, 2026
2 checks passed
