feat: writable external tables (INSERT/LOAD into stage files)#24889
feat: writable external tables (INSERT/LOAD into stage files)#24889fengttt wants to merge 4 commits into
Conversation
Make external tables writable when created with a WRITE_FILE_PATTERN option (a strftime template with MO extensions %nN = n random digits and %U = UUID, resolving to a stage:// path). Such tables accept INSERT ... SELECT and LOAD, writing CSV or JSONLine files into the stage; each parallel pipeline writes one file. Reads, UPDATE/DELETE, and read-only external tables are unchanged. - pkg/sql/colexec/externalwrite: strftime expander, ExternalWriter with a fileservice streaming sink, CSV/JSONLine encoders (const-vector aware, emit only the declared columns). - insert operator: third write mode ToExternal/insert_external alongside ToWriteS3/insert_table; write config carried on the Go InsertCtx (no proto change), built in compile from TableDef.Createsql. - compile: compileInsert routes external-write nodes to one writer op per source scope (no S3 merge/shuffle). - planner: minimal external insert plan (build_insert/build_load); op-aware checkTableType allows writable-external for insert; initInsertStmt Pkey guard; modern DML binder defers external targets to the legacy path. - DDL: validate WRITE_FILE_PATTERN (stage:// + csv/jsonline + parseable) and accept it in the read-side option validators. - docs/design + BVT case test/distributed/cases/stage/writable_external_table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- compile/operator.go: prealloc attrs slice in constructExternalInsert (fixes the SCA prealloc lint failure). - externalwrite UT: cover all csvValue/jsonValue type branches, addEscape, colCount, cellIsNull (const/const-null), unsupported-type errors, writer defaults/no-op Close/nil-batch, and every strftime directive. - plan/compile UT: GetWriteFilePattern, getExternParamFromTableDef, isExternalWriteInsert helpers. - BVT: add wide column-type case (int/unsigned/float/decimal/char/text/date/ bool/bit/json + NULL row) for CSV and JSONLine; regenerate result (51/51). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds writable external tables: external tables created with a write_file_pattern option can accept INSERT ... SELECT and LOAD DATA, writing CSV or JSONLine files into a stage:// destination (one file per parallel write pipeline). The change is implemented by extending the existing INSERT operator with a third mode (ToExternal) and routing eligible external-table DML through the legacy planner path.
Changes:
- Adds a new execution path to write INSERT/LOAD output batches into stage files (CSV/JSONLine) via a new
pkg/sql/colexec/externalwritepackage. - Extends planner/DDL validation to recognize and validate
write_file_patternand to generate a minimal insert plan for writable external tables. - Adds unit tests plus a new BVT case for end-to-end writable external table behavior.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/distributed/cases/stage/writable_external_table.sql | New BVT test script covering INSERT/LOAD into writable external tables (CSV/JSONLine) plus error cases. |
| test/distributed/cases/stage/writable_external_table.result | Expected output for the new BVT test. |
| pkg/sql/plan/utils.go | Allows write_file_pattern option during read-side option validation; adds ExternalWriteFilePatternKey and GetWriteFilePattern. |
| pkg/sql/plan/extwrite_helpers_test.go | Unit tests for GetWriteFilePattern and getExternParamFromTableDef. |
| pkg/sql/plan/dml_context.go | Forces external-table DML to fall back to the legacy planner path. |
| pkg/sql/plan/build_insert.go | Detects writable-external targets; builds minimal INSERT plan and helpers for external insert. |
| pkg/sql/plan/build_load.go | Adds LOAD support for writable external table targets (preserves target Createsql, avoids locks/shuffle). |
| pkg/sql/plan/build_ddl.go | Accepts and validates write_file_pattern at DDL time (stage:// only; csv/jsonline only; pattern must parse). |
| pkg/sql/plan/build_ddl_extwrite_test.go | Unit tests for validateWriteFilePattern. |
| pkg/sql/plan/build_constraint_util.go | Adds op parameter to checkTableType to allow external inserts when writable; nil-guards PK access. |
| pkg/sql/compile/operator.go | Detects external-write INSERT nodes and constructs an insert operator configured for external write. |
| pkg/sql/compile/operator_extwrite_test.go | Unit tests for isExternalWriteInsert. |
| pkg/sql/compile/compile.go | Compiles external-write INSERT with one writer per source scope (no S3 merge/shuffle). |
| pkg/sql/colexec/insert/types.go | Adds ToExternal mode and external writer lifecycle handling. |
| pkg/sql/colexec/insert/insert.go | Implements insert_external path that writes batches to stage files. |
| pkg/sql/colexec/externalwrite/expand.go | Implements strftime-like pattern expansion with %nN and %U. |
| pkg/sql/colexec/externalwrite/expand_test.go | Tests for pattern expansion and error cases. |
| pkg/sql/colexec/externalwrite/encode.go | CSV/JSONLine encoders (const-vector aware; emits declared columns only). |
| pkg/sql/colexec/externalwrite/encode_test.go | Basic encoder tests. |
| pkg/sql/colexec/externalwrite/encode_types_test.go | Wide type coverage tests for encoders. |
| pkg/sql/colexec/externalwrite/writer.go | ExternalWriter implementation using fileservice streaming writer; lazy file creation. |
| pkg/sql/colexec/externalwrite/writer_test.go | Writer defaulting and header formatting tests. |
| docs/design/writable_external_table.md | High-level design doc for writable external tables. |
| docs/design/writable_external_table_impl.md | Detailed design/spec & “as-built” notes. |
| cgo/.gitignore | Minor .so ignore + whitespace cleanup. |
| if vec.GetType().Scale < 0 || vec.GetType().Width == 0 { | ||
| return []byte(strconv.FormatFloat(float64(v), 'f', -1, 32)), false, nil | ||
| } | ||
| return []byte(strconv.FormatFloat(float64(v), 'f', int(vec.GetType().Scale), 64)), false, nil |
| cfg := externalwrite.WriterConfig{ | ||
| Pattern: pattern, | ||
| Format: format, | ||
| Attrs: attrs, | ||
| Stmt: time.Now(), | ||
| TimeZone: proc.GetSessionInfo().TimeZone, | ||
| } |
| insert into ext_wide_jl select * from wide_src; | ||
| select c_i8, c_i64, c_u32, c_dec, c_vc, c_bool from ext_wide_jl order by c_i64; | ||
| internal error: the input value 'hi,there' is invalid Decimal64 type for column 3 |
| Insert into external table should be done in the same way as insert into a matrixone table. | ||
| The query should be planned and optimized, and when insert rows, instead of inserting into | ||
| matrixone table, it should just call a API and add rows into the table. `LOAD` should load | ||
| data into the external table using same API. The API should be invokes using Batches, to |
| As implementation, we will only support INSERT to csv files and jsonline files. For external table, | ||
| it must have an additional config option `WRITE_FILE_PATTERN=strftime_string`, such that newly inserted | ||
| data is written to a new file, (or many new files if there are parallel writers, but each of the pipeline | ||
| should only create one file). The `strftime_string` can contain `%` formatting charaters as strftime. |
iamlinjunhong
left a comment
There was a problem hiding this comment.
1. Remote execution loses external-write mode
This external-write mode is stored only in the local Go insert.Insert / insert.InsertCtx (ToExternal + ExternalConfig). It is not serialized through pipeline.Insert in remote-run encoding/decoding. When a LOAD/INSERT plan has remote scopes, the remote CN rebuilds this as a normal insert operator with ToExternal=false, so it will try to insert into an engine relation instead of writing stage files.
This breaks the stated multi-CN requirement for writable external tables. Please either plumb ToExternal and the external writer config through the pipeline/plan proto path, or explicitly prevent this operator from being sent to remote CNs. Given the feature requires parallel multi-CN writes, proto plumbing seems necessary.
2. JSONLine writer ignores jsondata='array'
pkg/sql/colexec/externalwrite/encode.go
encodeJSONLine always emits one JSON object per row, but writable external table DDL still allows format='jsonline', jsondata='array'. The existing JSONLine reader switches on jsondata and expects arrays when jsondata='array', so a table created with that valid option will fail to read back its own inserted data.
Please either reject jsondata='array' for writable external tables at DDL time, or carry JsonData into WriterConfig and emit JSON arrays for that mode.
3. CSV default enclosure does not match the read path
pkg/sql/colexec/externalwrite/writer.go
The writer defaults EnclosedBy to 0, and constructExternalInsert only sets it from the parsed FIELDS clause. For common syntax like FIELDS TERMINATED BY ',', the parser produces an enclosure value of 0, while the external CSV reader still defaults to " as the enclosure. As a result, string values containing commas, newlines, or quotes are written unquoted and cannot round-trip through the same external table.
Please align the writer defaults with the external reader / SELECT INTO OUTFILE defaults, and add an end-to-end test with a comma/newline in a string column without an explicit ENCLOSED BY.
4. SHOW CREATE TABLE loses write_file_pattern
pkg/sql/plan/build_show_util.go
write_file_pattern is persisted in ExternParam.Option, but SHOW CREATE TABLE formatting does not include it for external tables. Recreating a writable external table from SHOW CREATE TABLE output will silently produce a read-only external table.
Please include write_file_pattern in the external-table show-create formatting for both INFILE and S3-style options.
Already raised by Copilot, but I would still require fixes
The BVT baseline currently expects an internal error for the JSONLine wide-type round-trip. That means the test is blessing a failed round-trip instead of validating writable JSONLine. The query should succeed and assert rows, or the unsupported case should be rejected with a stable user-facing error.
WriterConfig.Stmt is set with time.Now() during operator construction, so different pipelines/CNs can expand time directives using different timestamps. The PR documents statement-start semantics; please use the statement start timestamp instead.
In the scaled float32 branch, strconv.FormatFloat uses bitSize=64. It should use 32 to preserve float32 formatting semantics.
Verification I ran: go test ./pkg/sql/colexec/externalwrite passed. Targeted pkg/sql/compile and pkg/sql/plan tests were blocked by Go proxy dependency download failure for github.com/golang/geo.
What
Make external tables writable. An external table created with a new
write_file_patternoption acceptsINSERT ... SELECTandLOAD, writingCSV or JSONLine files into a
stage://destination. The pattern is astrftime(3)template with two MO extensions:%nN→nrandom decimal digits%U→ a UUIDEach parallel write pipeline produces exactly one file. Reads,
UPDATE/DELETE,and read-only external tables are unchanged.
How
pkg/sql/colexec/externalwrite(new): strftime expander,ExternalWriterwith a fileservice streaming sink (io.Pipe), CSV/JSONLine encoders. Encoders
are const-vector aware and emit only the table's declared columns.
ToExternal/insert_externalalongsideToWriteS3/insert_table. Write config is carried on the GoInsertCtx(no proto change), built at compile time from
TableDef.Createsql.compileInsertroutes external-write nodes to one writer op persource scope (no S3 merge/shuffle) → one file per pipeline, parallel across CNs.
build_insert/build_load);op-aware
checkTableTypeallows writable-external forinsert;initInsertStmtPkey nil-guard (external tables have no PK); the modern DMLbinder defers external targets to the legacy planner.
write_file_pattern(must bestage://, csv/jsonline only,pattern must parse) and accepts it in the read-side option validators.
Semantics / limitations
stage://.failed statement may leave partial files. Documented as a v1 limitation.
UPDATE/DELETEon external tables remain unsupported.Tests
pkg/sql/colexec/externalwrite(expander, encoders),build_ddlvalidator.test/distributed/cases/stage/writable_external_table.{sql,result}—CSV/JSONLine insert + readback, multi-file accumulation, LOAD into external,
and all error cases. Passes 37/37 locally. Existing
stage.sql(external readsDesign + as-built notes:
docs/design/writable_external_table_impl.md.🤖 Generated with Claude Code