Per-user Iceberg warehouse with bring-your-own S3 storage #5293

mengw15 · 2026-05-20T07:17:25Z

mengw15
May 20, 2026
Collaborator

Feature Summary

A warehouse here is a top-level entity in the catalog hierarchy (Project → Warehouse → Namespace → Table) that owns a set of namespaces (results, runtime_stats, console_logs) and the storage configuration (S3 bucket + credentials) backing their tables. This follows the Lakekeeper warehouse concept.

Today Texera writes all execution outputs (results, runtime_stats, console_logs) into a single global Iceberg warehouse. All users share that one warehouse, and the platform absorbs the storage cost.

This issue proposes a per-user warehouse model: each user registers one or more warehouses, each backed by their own S3 bucket (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.

Background / Motivation

- Billing. S3 cost should be attributed to the user who owns the data, not the platform.
- Isolation. Each tenant gets its own namespaces and tables, so there is no shared blast radius.
- Builds on "Migrate to Catalog Service and MinIO for Execution Results" (#4126), which introduced the REST Catalog Service (Lakekeeper) layer. This issue is the next step: make Lakekeeper multi-tenant.

Scope

Per-user warehouses are scoped to the Kubernetes deployment. Local / single-node Docker Compose deployments continue to work as today: PsqlCatalog remains supported and unchanged, and RestCatalog mode keeps its current single global Lakekeeper warehouse (no per-user split).

Catalog hierarchy

Texera already has two Catalog implementations:

Catalog (interface)
├── PsqlCatalog          — backed by PostgreSQL
└── RestCatalog          — backed by any Iceberg REST Catalog service (Lakekeeper is one implementation of this)

This design uses RestCatalog with Lakekeeper as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB (Postgres); Texera never persists raw S3 creds, only the Lakekeeper warehouse UUID and non-secret metadata.

Proposed Solution or Design

Design 1:

User ─1:N→ Warehouse                (new)
User ─1:N→ ComputingUnit            (existing)
ComputingUnit ─1:N→ Execution       (existing)
Warehouse ─1:N→ ComputingUnit       (new association)

Flow A: Registering a warehouse (same for both designs)

The user fills in the new Dashboard "Warehouse" tab with the S3 bucket, endpoint, region, and credentials.
The backend posts the credentials directly to Lakekeeper to create the warehouse. The credentials never touch the Texera DB.
Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.
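
The three registration steps above can be sketched as follows. This is a minimal illustration, not Texera or Lakekeeper code: `FakeLakekeeper`, `WarehouseRecord`, `S3Credentials`, and `register_warehouse` are hypothetical names, and the in-memory class only stands in for Lakekeeper's warehouse-creation endpoint. The point it shows is the boundary: the secret is handed to the catalog service, and the record Texera keeps has no field that could hold it.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Credentials:
    access_key_id: str
    secret_access_key: str


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera would persist: a reference plus non-secret metadata only."""
    owner_uid: int
    lakekeeper_warehouse_id: str
    name: str
    bucket: str
    endpoint: str
    region: str


class FakeLakekeeper:
    """In-memory stand-in for the catalog service (assumption, not the real API)."""

    def __init__(self) -> None:
        self.secrets = {}

    def create_warehouse(self, name, bucket, endpoint, region, creds):
        warehouse_id = str(uuid.uuid4())
        # The secret is stored on the catalog-service side only.
        self.secrets[warehouse_id] = creds
        return warehouse_id


def register_warehouse(lakekeeper, texera_db, owner_uid, name, bucket, endpoint, region, creds):
    # Step 2: forward the credentials straight to the catalog service.
    warehouse_id = lakekeeper.create_warehouse(name, bucket, endpoint, region, creds)
    # Step 3: keep only the returned UUID and non-secret metadata.
    record = WarehouseRecord(owner_uid, warehouse_id, name, bucket, endpoint, region)
    texera_db.append(record)
    return record
```

Because `WarehouseRecord` is a separate type with no credential fields, a code path that tried to persist the secret would have to change the schema first, which makes the rule easy to review.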

Sequence diagram:

Flow B: Binding a warehouse to a CU (Design 1 only)

When the user creates a CU they pick which warehouse to use.
At execution time, Texera instantiates a RestCatalog for that CU using the warehouse's Lakekeeper UUID — no global singleton on the hot path.
Two-layer split at runtime:
- Catalog path — RestCatalog talks to Lakekeeper for metadata operations (resolve table, create / commit snapshots, schema changes). Lakekeeper owns the warehouse → S3 path mapping.
- Data path — the Iceberg client reads/writes Parquet directly to the user's S3 bucket, using short-lived credentials vended by Lakekeeper per request. Lakekeeper does not proxy S3 traffic.

Files land in the user's S3 bucket under the warehouse's root prefix, organized by namespace (results / runtime_stats / console_logs) and per-execution table.
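
The two-layer split can be illustrated with in-memory stand-ins. Everything here is an assumption made for the sketch: `FakeRestCatalog`, `FakeS3`, `vend_credentials`, `commit_snapshot`, and the object key layout are hypothetical and do not mirror the Iceberg or Lakekeeper APIs. How a real client obtains vended credentials is not modeled. What the sketch shows is the ordering: the catalog hands out short-lived credentials, the payload goes directly to the bucket, and only metadata goes back to the catalog.

```python
import time
from dataclasses import dataclass, field


@dataclass
class VendedCredentials:
    token: str
    expires_at: float  # epoch seconds


@dataclass
class FakeRestCatalog:
    """Stand-in for a RestCatalog bound to one Lakekeeper warehouse."""
    warehouse_id: str
    committed: list = field(default_factory=list)

    def vend_credentials(self, ttl_seconds: float = 900.0) -> VendedCredentials:
        # Catalog path: the catalog service issues short-lived credentials.
        return VendedCredentials(f"tmp-{self.warehouse_id}", time.time() + ttl_seconds)

    def commit_snapshot(self, table: str, data_file: str) -> None:
        # Catalog path: a metadata-only operation, no payload bytes.
        self.committed.append((table, data_file))


@dataclass
class FakeS3:
    """Stand-in for the user's bucket."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, payload: bytes, creds: VendedCredentials) -> None:
        if creds.expires_at <= time.time():
            raise PermissionError("vended credentials expired")
        self.objects[key] = payload


@dataclass(frozen=True)
class ComputingUnit:
    cuid: int
    warehouse_id: str  # Design 1: chosen once, at CU creation


def write_result(cu, s3, namespace, execution_id, payload):
    # One catalog per CU, built from the CU's warehouse; no global singleton.
    catalog = FakeRestCatalog(cu.warehouse_id)
    creds = catalog.vend_credentials()
    # Hypothetical key layout: namespace, then a per-execution table directory.
    key = f"{namespace}/execution_{execution_id}/part-0.parquet"
    s3.put(key, payload, creds)  # data path: straight to the bucket
    catalog.commit_snapshot(f"{namespace}.execution_{execution_id}", key)
    return catalog
```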

Sequence diagram (CU creation + RestCatalog instantiation):

For the execution diagram, see #4126.

Design 2:

Data model

User ─1:N→ Warehouse                (new)
User ─1:N→ ComputingUnit            (existing)
ComputingUnit ─1:N→ Execution       (existing)
Warehouse ─1:N→ Execution           (new association)
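
A minimal sketch of this data model, assuming hypothetical field names (`cuid`, `eid`, `owner_uid`) alongside the `whid` reference mentioned later in this proposal. It is not the actual Texera schema. The point is that the warehouse reference lives on the execution, not on the computing unit, so one CU can serve executions that write to different warehouses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Warehouse:
    whid: int
    owner_uid: int


@dataclass(frozen=True)
class ComputingUnit:
    cuid: int
    owner_uid: int  # Design 2: no warehouse reference on the CU


@dataclass(frozen=True)
class Execution:
    eid: int
    cuid: int
    whid: int  # new association: Warehouse 1:N Execution


def executions_by_warehouse(executions):
    """Group execution ids by the warehouse each one writes to."""
    grouped = {}
    for e in executions:
        grouped.setdefault(e.whid, []).append(e.eid)
    return grouped
```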

ER diagram:

Flow A: Registering a warehouse (same for both designs)

The user fills in the new Dashboard "Warehouse" tab with the S3 bucket, endpoint, region, and credentials.
The backend posts the credentials directly to Lakekeeper to create the warehouse. The credentials never touch the Texera DB.
Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.

Sequence diagram:

Flow B: Binding a warehouse to an Execution (Design 2 only)

CU creation does not ask for a warehouse. CUs are warehouse-agnostic and one CU can serve executions targeting any warehouse the user owns.
The user picks the warehouse this execution will write to (from a warehouse selector in the workflow toolbar, similar to selecting a CU).
The submit-execution RPC to the CU carries the resolved whid/Lakekeeper warehouse name.
The CU JVM maintains a per-warehouse RestCatalog cache (Map[warehouseName, RestCatalog]). For each arriving execution:
- Cache hit → reuses the existing instance
- Cache miss → lazily initializes a new RestCatalog for that warehouse. Adding a new entry is atomic and does not touch other entries; in-flight executions on other warehouses are unaffected.
Two-layer split at runtime (same as Design 1):
- Catalog path — the per-warehouse RestCatalog talks to Lakekeeper for metadata operations.
- Data path — Iceberg reads/writes Parquet directly to the user's S3 bucket via Lakekeeper-vended short-lived credentials. Lakekeeper does not proxy S3 traffic.
Result reading on the amber side looks up workflow_executions.whid first, then routes the IcebergDocument read through the corresponding RestCatalog.
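
The cache behavior and the read-side routing described above can be sketched as follows. This is written in Python for brevity even though the CU runs on the JVM. `CatalogCache`, `FakeRestCatalog`, and `catalog_for_execution` are hypothetical names, and the `workflow_executions` dictionary stands in for the lookup of an execution's warehouse. The sketch uses double-checked locking so that a cache miss creates exactly one instance per warehouse without touching other entries.

```python
import threading


class FakeRestCatalog:
    """Stand-in for a RestCatalog bound to one warehouse."""

    def __init__(self, warehouse_name: str) -> None:
        self.warehouse_name = warehouse_name


class CatalogCache:
    """Per-warehouse catalog cache with lazy, thread-safe initialization."""

    def __init__(self, factory=FakeRestCatalog) -> None:
        self._factory = factory
        self._catalogs = {}
        self._lock = threading.Lock()

    def get(self, warehouse_name: str):
        # Cache hit: reuse the existing instance without taking the lock.
        catalog = self._catalogs.get(warehouse_name)
        if catalog is None:
            with self._lock:
                # Re-check under the lock so racing executions share one instance.
                catalog = self._catalogs.get(warehouse_name)
                if catalog is None:
                    catalog = self._factory(warehouse_name)
                    self._catalogs[warehouse_name] = catalog
        return catalog


def catalog_for_execution(cache, workflow_executions, eid):
    # Read path: resolve the execution's warehouse first, then route through its catalog.
    return cache.get(workflow_executions[eid])
```

On the JVM, `ConcurrentHashMap.computeIfAbsent` gives the same per-key atomicity with less code; whether the CU would use it is a design choice this proposal leaves open.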

Sequence diagram (Execution start + RestCatalog instantiation):

Note that the CU currently communicates with Postgres directly. Issue #5011 tracks this, but it is out of scope for this issue.

For the execution diagram, see #4126.

Open questions

Should we allow warehouses to be shared?
Shared CU (Design 1): when User A runs a workflow on a CU owned by User B, whose warehouse stores the results? In other words, should we allow User A to store results in User B's warehouse?
Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave S3 data orphaned in the user's bucket (Texera has no write access to user buckets), or soft-archive the catalog so existing executions stay readable until the user explicitly purges?

chenlica · 2026-05-21T19:23:49Z

chenlica
May 21, 2026
Collaborator

I have been talking to @mengw15 and @bobbai00 about this important feature. Here are my thoughts on the first question. So far all the resources on this platform (datasets, workflows, and CUs) are shareable. It would be good to make warehouses shareable as well.


bobbai00 · 2026-05-29T22:58:24Z

bobbai00
May 29, 2026
Collaborator

I support Design 2.


kunwp1 · 2026-05-29T23:22:05Z

kunwp1
May 29, 2026
Collaborator

I also support Design 2. In Design 1 the warehouse is coupled with the CU, so CU sharing becomes ambiguous. If user1 shares a CU with user2, user1's warehouse comes along with it. User2's executions would write into user1's warehouse and onto user1's S3 bill, even though user1 only meant to share compute. Design 2 decouples the two: the warehouse is chosen per execution, so sharing a CU and sharing a warehouse stay independent decisions.


Xiao-zhen-Liu · 2026-05-30T01:23:40Z

Xiao-zhen-Liu
May 30, 2026
Collaborator

I lean toward Design 1. I think @kunwp1's argument about billing also applies to sharing computing units if computing units are also billed to a user, so it is a separate issue and is not specific to Design 1. Another downside of Design 2 is that it forces a user to specify a warehouse before each execution, which would be really inconvenient.

1 reply

chenlica May 30, 2026
Collaborator

I also support Design 2. Regarding the second downside mentioned by @Xiao-zhen-Liu, we can address it by using a default warehouse for each execution.

mengw15 · 2026-05-30T06:49:26Z

mengw15
May 30, 2026
Collaborator Author

Thanks for the feedback!

I personally lean toward Design 2. Conceptually, compute and storage are two independent, peer-level resources, and Design 2 preserves that flexibility.

I think @Xiao-zhen-Liu's concern is fair. I was thinking we could mirror the CU pattern and provide a default warehouse for each execution, which should make this much less burdensome. What do you think?

1 reply

chenlica Jun 1, 2026
Collaborator

I support this idea, as mentioned above.
