Replies: 5 comments 2 replies
-
I have been talking to @mengw15 and @bobbai00 about this important feature. Here are my thoughts on the first question. So far, all the resources on this platform (datasets, workflows, and CUs) are shareable, so it would be good to make warehouses shareable as well.
-
I support Idea B.
-
I also support Idea B. In Design A the warehouse is coupled with the CU, so CU sharing becomes ambiguous. If user1 shares a CU with user2, user1's warehouse comes along with it. User2's executions would then write into user1's warehouse and onto user1's S3 bill, even though user1 only meant to share compute. Design B decouples the two: the warehouse is chosen per execution, so sharing a CU and sharing a warehouse stay independent decisions.
-
I lean toward Design 1. I think @kunwp1's argument about billing also applies to sharing computing units if computing units are also billed to a user, so it is a separate issue and is not specific to Design 1. Another downside of Design 2 is that it forces a user to specify a warehouse before each execution, which would be really inconvenient.
-
Thanks for the feedback! I personally lean toward Design B. Conceptually, compute and storage are two independent, peer-level resources, and Design B preserves that flexibility. I think @Xiao-zhen-Liu's concern is fair. I was thinking we could mirror the CU pattern and provide a default warehouse for each execution, which should make this much less burdensome. What do you think?
-
Feature Summary
A warehouse here is a top-level entity in the catalog hierarchy (Project → Warehouse → Namespace → Table) that owns a set of namespaces (results, runtime_stats, console_logs) and the storage configuration (S3 bucket + credentials) backing their tables. This follows the Lakekeeper warehouse concept.
Today Texera writes all execution outputs (results, runtime_stats, console_logs) into a single global Iceberg warehouse. There is one warehouse, all users share it, and storage costs are absorbed by the platform.
This issue proposes a per-user warehouse model: each user registers one or more warehouses, each backed by their own S3 bucket (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.
Background / Motivation
Scope
Per-user warehouses are scoped to the Kubernetes deployment. Local / single-node Docker Compose deployments continue to work as today: PsqlCatalog remains supported and unchanged, and RestCatalog mode keeps its current single global Lakekeeper warehouse (no per-user split).
Catalog hierarchy
Texera already has two Catalog implementations: PsqlCatalog and RestCatalog.
This design uses RestCatalog with Lakekeeper as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB (Postgres); Texera never persists raw S3 credentials, only the Lakekeeper warehouse UUID and non-secret metadata.
Proposed Solution or Design
Design 1:
Flow A — Registering a warehouse (same for both designs)
Sequence diagram:
Flow B — Binding a warehouse to a CU (For Design 1)
RestCatalog for that CU using the warehouse's Lakekeeper UUID, with no global singleton on the hot path.
RestCatalog talks to Lakekeeper for metadata operations (resolve table, create / commit snapshots, schema changes). Lakekeeper owns the warehouse → S3 path mapping.
Files land in the user's S3 bucket under the warehouse's root prefix, organized by namespace (results / runtime_stats / console_logs) and per-execution table.
Sequence diagram (CU creation + RestCatalog instantiation):
For the execution diagram, please see #4126.
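The storage layout described for Flow B (files under the warehouse's root prefix, grouped by namespace and per-execution table) can be illustrated with a small helper. The actual object keys are decided by Lakekeeper and Iceberg, so the path shape, the `table_location` function, and the example bucket and table names below are assumptions for illustration only.

```python
# Illustrative only: the real key layout is chosen by Lakekeeper/Iceberg.
NAMESPACES = ("results", "runtime_stats", "console_logs")


def table_location(bucket: str, root_prefix: str, namespace: str, table: str) -> str:
    """Build a hypothetical S3 location for one per-execution table."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace!r}")
    # Drop empty segments so an empty root prefix does not produce '//'.
    parts = [p.strip("/") for p in (root_prefix, namespace, table) if p.strip("/")]
    return f"s3://{bucket}/" + "/".join(parts)


if __name__ == "__main__":
    # Hypothetical bucket, prefix, and table name.
    print(table_location("alice-bucket", "texera/wh1/", "results", "execution_42"))
```

Under this assumed layout, two warehouses that share a bucket stay separated as long as their root prefixes differ.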
Design 2:
Data model
ER diagram:
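As a rough sketch of what a Texera-side warehouse record might hold, the following follows the statement above that Texera stores only the Lakekeeper warehouse UUID and non-secret metadata, never raw S3 credentials. Every field name here (`wid`, `owner_uid`, `bucket`, `key_prefix`, and so on) is hypothetical and not taken from the actual schema.

```python
import uuid
from dataclasses import dataclass, fields
from typing import List

# Substrings that would suggest a credential is being stored (assumed list).
SECRET_FIELD_HINTS = ("secret", "password", "access_key", "token")


@dataclass(frozen=True)
class WarehouseRecord:
    wid: int                             # Texera-side primary key (assumed)
    owner_uid: int                       # user who registered the warehouse (assumed)
    name: str                            # display name
    lakekeeper_warehouse_id: uuid.UUID   # handle used to reach Lakekeeper
    bucket: str                          # non-secret metadata
    key_prefix: str                      # non-secret metadata


def secret_like_fields(record_type: type) -> List[str]:
    """Return names of dataclass fields that look like stored credentials."""
    return [
        f.name
        for f in fields(record_type)
        if any(hint in f.name.lower() for hint in SECRET_FIELD_HINTS)
    ]


if __name__ == "__main__":
    record = WarehouseRecord(1, 7, "alice-default", uuid.uuid4(), "alice-bucket", "texera/wh1")
    print(record.name, secret_like_fields(WarehouseRecord))
```

A check like `secret_like_fields` could run in a unit test to guard against credentials creeping into the Texera-side table later.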
Flow A — Registering a warehouse (same for both designs)
Sequence diagram:
Flow B — Binding a warehouse to an Execution (For Design 2)
warehouse the user owns.
similar to selecting a CU).
- Cache hit → reuses the existing instance
- Cache miss → lazily initializes a new RestCatalog for that warehouse. Adding a new entry is atomic and does not touch other entries; in-flight executions on other warehouses are unaffected.
- Catalog path — the per-warehouse RestCatalog talks to Lakekeeper for metadata operations.
- Data path — Iceberg reads/writes Parquet directly to the user's S3 bucket via Lakekeeper-vended short-lived credentials. Lakekeeper does not proxy S3 traffic.
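The cache-hit / cache-miss behavior in the list above can be sketched as a small thread-safe cache keyed by warehouse ID. This is a minimal illustration, not Texera's implementation: `CatalogCache` is a hypothetical name, and `factory` stands in for whatever actually constructs a RestCatalog for a given Lakekeeper warehouse.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

C = TypeVar("C")


class CatalogCache(Generic[C]):
    """Lazily creates and reuses one catalog instance per warehouse ID."""

    def __init__(self, factory: Callable[[str], C]) -> None:
        self._factory = factory              # stand-in for RestCatalog construction
        self._catalogs: Dict[str, C] = {}
        self._lock = threading.Lock()

    def get(self, warehouse_id: str) -> C:
        # Cache hit: return the existing instance without taking the lock.
        catalog = self._catalogs.get(warehouse_id)
        if catalog is not None:
            return catalog
        # Cache miss: create under the lock, re-checking so that concurrent
        # callers for the same warehouse end up sharing one instance.
        with self._lock:
            catalog = self._catalogs.get(warehouse_id)
            if catalog is None:
                catalog = self._factory(warehouse_id)
                self._catalogs[warehouse_id] = catalog
            return catalog


if __name__ == "__main__":
    cache = CatalogCache(lambda wid: {"warehouse": wid})  # dict as a fake catalog
    print(cache.get("wh-a") is cache.get("wh-a"))
```

This sketch uses one lock for all warehouses, so a slow catalog initialization briefly delays first-time initialization of other warehouses; cache hits are not blocked. A per-warehouse lock would remove that delay at the cost of a little more bookkeeping.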
Sequence diagram (Execution start + RestCatalog instantiation):
Please note that the CU currently communicates with Postgres directly; this is tracked in #5011 and is out of scope for this issue.
For the execution diagram, please see #4126.
Open questions