Skip to content

feat: writable external tables (INSERT/LOAD into stage files)#24889

Open
fengttt wants to merge 4 commits into
matrixorigin:mainfrom
fengttt:feature/writable-external-table
Open

feat: writable external tables (INSERT/LOAD into stage files)#24889
fengttt wants to merge 4 commits into
matrixorigin:mainfrom
fengttt:feature/writable-external-table

Conversation

@fengttt

@fengttt fengttt commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What

Make external tables writable. An external table created with a new
write_file_pattern option accepts INSERT ... SELECT and LOAD, writing
CSV or JSONLine files into a stage:// destination. The pattern is a
strftime(3) template with two MO extensions:

  • %nNn random decimal digits
  • %U → a UUID

Each parallel write pipeline produces exactly one file. Reads, UPDATE/DELETE,
and read-only external tables are unchanged.

CREATE EXTERNAL TABLE t (a int, b varchar(20), c double)
INFILE{'filepath'='stage://s/part_*.csv', 'format'='csv',
       'write_file_pattern'='stage://s/part_%U.csv'}
FIELDS TERMINATED BY ',';

INSERT INTO t SELECT * FROM src;          -- writes one CSV file to the stage
LOAD DATA INFILE '...' INTO TABLE t ...;  -- same, from a source file

How

  • pkg/sql/colexec/externalwrite (new): strftime expander, ExternalWriter
    with a fileservice streaming sink (io.Pipe), CSV/JSONLine encoders. Encoders
    are const-vector aware and emit only the table's declared columns.
  • insert operator: third write mode ToExternal/insert_external alongside
    ToWriteS3/insert_table. Write config is carried on the Go InsertCtx
    (no proto change), built at compile time from TableDef.Createsql.
  • compile: compileInsert routes external-write nodes to one writer op per
    source scope (no S3 merge/shuffle) → one file per pipeline, parallel across CNs.
  • planner: minimal external insert plan (build_insert/build_load);
    op-aware checkTableType allows writable-external for insert;
    initInsertStmt Pkey nil-guard (external tables have no PK); the modern DML
    binder defers external targets to the legacy planner.
  • DDL: validates write_file_pattern (must be stage://, csv/jsonline only,
    pattern must parse) and accepts it in the read-side option validators.

Semantics / limitations

  • Output formats: CSV and JSONLine only; destination must be stage://.
  • Empty pipeline → no file (lazy file creation).
  • Writes are not transactional (files are finalized at pipeline close); a
    failed statement may leave partial files. Documented as a v1 limitation.
  • UPDATE/DELETE on external tables remain unsupported.

Tests

  • Unit: pkg/sql/colexec/externalwrite (expander, encoders), build_ddl validator.
  • BVT: test/distributed/cases/stage/writable_external_table.{sql,result}
    CSV/JSONLine insert + readback, multi-file accumulation, LOAD into external,
    and all error cases. Passes 37/37 locally. Existing stage.sql (external reads
    • DML-rejection checks) still passes 100%.

Design + as-built notes: docs/design/writable_external_table_impl.md.

🤖 Generated with Claude Code

Make external tables writable when created with a WRITE_FILE_PATTERN option
(a strftime template with MO extensions %nN = n random digits and %U = UUID,
resolving to a stage:// path). Such tables accept INSERT ... SELECT and LOAD,
writing CSV or JSONLine files into the stage; each parallel pipeline writes one
file. Reads, UPDATE/DELETE, and read-only external tables are unchanged.

- pkg/sql/colexec/externalwrite: strftime expander, ExternalWriter with a
  fileservice streaming sink, CSV/JSONLine encoders (const-vector aware, emit
  only the declared columns).
- insert operator: third write mode ToExternal/insert_external alongside
  ToWriteS3/insert_table; write config carried on the Go InsertCtx (no proto
  change), built in compile from TableDef.Createsql.
- compile: compileInsert routes external-write nodes to one writer op per
  source scope (no S3 merge/shuffle).
- planner: minimal external insert plan (build_insert/build_load); op-aware
  checkTableType allows writable-external for insert; initInsertStmt Pkey guard;
  modern DML binder defers external targets to the legacy path.
- DDL: validate WRITE_FILE_PATTERN (stage:// + csv/jsonline + parseable) and
  accept it in the read-side option validators.
- docs/design + BVT case test/distributed/cases/stage/writable_external_table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- compile/operator.go: prealloc attrs slice in constructExternalInsert
  (fixes the SCA prealloc lint failure).
- externalwrite UT: cover all csvValue/jsonValue type branches, addEscape,
  colCount, cellIsNull (const/const-null), unsupported-type errors, writer
  defaults/no-op Close/nil-batch, and every strftime directive.
- plan/compile UT: GetWriteFilePattern, getExternParamFromTableDef,
  isExternalWriteInsert helpers.
- BVT: add wide column-type case (int/unsigned/float/decimal/char/text/date/
  bool/bit/json + NULL row) for CSV and JSONLine; regenerate result (51/51).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds writable external tables: external tables created with a write_file_pattern option can accept INSERT ... SELECT and LOAD DATA, writing CSV or JSONLine files into a stage:// destination (one file per parallel write pipeline). The change is implemented by extending the existing INSERT operator with a third mode (ToExternal) and routing eligible external-table DML through the legacy planner path.

Changes:

  • Adds a new execution path to write INSERT/LOAD output batches into stage files (CSV/JSONLine) via a new pkg/sql/colexec/externalwrite package.
  • Extends planner/DDL validation to recognize and validate write_file_pattern and to generate a minimal insert plan for writable external tables.
  • Adds unit tests plus a new BVT case for end-to-end writable external table behavior.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/distributed/cases/stage/writable_external_table.sql New BVT test script covering INSERT/LOAD into writable external tables (CSV/JSONLine) plus error cases.
test/distributed/cases/stage/writable_external_table.result Expected output for the new BVT test.
pkg/sql/plan/utils.go Allows write_file_pattern option during read-side option validation; adds ExternalWriteFilePatternKey and GetWriteFilePattern.
pkg/sql/plan/extwrite_helpers_test.go Unit tests for GetWriteFilePattern and getExternParamFromTableDef.
pkg/sql/plan/dml_context.go Forces external-table DML to fall back to the legacy planner path.
pkg/sql/plan/build_insert.go Detects writable-external targets; builds minimal INSERT plan and helpers for external insert.
pkg/sql/plan/build_load.go Adds LOAD support for writable external table targets (preserves target Createsql, avoids locks/shuffle).
pkg/sql/plan/build_ddl.go Accepts and validates write_file_pattern at DDL time (stage:// only; csv/jsonline only; pattern must parse).
pkg/sql/plan/build_ddl_extwrite_test.go Unit tests for validateWriteFilePattern.
pkg/sql/plan/build_constraint_util.go Adds op parameter to checkTableType to allow external inserts when writable; nil-guards PK access.
pkg/sql/compile/operator.go Detects external-write INSERT nodes and constructs an insert operator configured for external write.
pkg/sql/compile/operator_extwrite_test.go Unit tests for isExternalWriteInsert.
pkg/sql/compile/compile.go Compiles external-write INSERT with one writer per source scope (no S3 merge/shuffle).
pkg/sql/colexec/insert/types.go Adds ToExternal mode and external writer lifecycle handling.
pkg/sql/colexec/insert/insert.go Implements insert_external path that writes batches to stage files.
pkg/sql/colexec/externalwrite/expand.go Implements strftime-like pattern expansion with %nN and %U.
pkg/sql/colexec/externalwrite/expand_test.go Tests for pattern expansion and error cases.
pkg/sql/colexec/externalwrite/encode.go CSV/JSONLine encoders (const-vector aware; emits declared columns only).
pkg/sql/colexec/externalwrite/encode_test.go Basic encoder tests.
pkg/sql/colexec/externalwrite/encode_types_test.go Wide type coverage tests for encoders.
pkg/sql/colexec/externalwrite/writer.go ExternalWriter implementation using fileservice streaming writer; lazy file creation.
pkg/sql/colexec/externalwrite/writer_test.go Writer defaulting and header formatting tests.
docs/design/writable_external_table.md High-level design doc for writable external tables.
docs/design/writable_external_table_impl.md Detailed design/spec & “as-built” notes.
cgo/.gitignore Minor .so ignore + whitespace cleanup.

if vec.GetType().Scale < 0 || vec.GetType().Width == 0 {
return []byte(strconv.FormatFloat(float64(v), 'f', -1, 32)), false, nil
}
return []byte(strconv.FormatFloat(float64(v), 'f', int(vec.GetType().Scale), 64)), false, nil
Comment on lines +948 to +954
cfg := externalwrite.WriterConfig{
Pattern: pattern,
Format: format,
Attrs: attrs,
Stmt: time.Now(),
TimeZone: proc.GetSessionInfo().TimeZone,
}
Comment on lines +75 to +77
insert into ext_wide_jl select * from wide_src;
select c_i8, c_i64, c_u32, c_dec, c_vc, c_bool from ext_wide_jl order by c_i64;
internal error: the input value 'hi,there' is invalid Decimal64 type for column 3
Insert into external table should be done in the same way as insert into a matrixone table.
The query should be planned and optimized, and when insert rows, instead of inserting into
matrixone table, it should just call a API and add rows into the table. `LOAD` should load
data into the external table using same API. The API should be invokes using Batches, to
As implementation, we will only support INSERT to csv files and jsonline files. For external table,
it must have an additional config option `WRITE_FILE_PATTERN=strftime_string`, such that newly inserted
data is written to a new file, (or many new files if there are parallel writers, but each of the pipeline
should only create one file). The `strftime_string` can contain `%` formatting charaters as strftime.

@iamlinjunhong iamlinjunhong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Remote execution loses external-write mode

pkg/sql/compile/operator.go

This external-write mode is stored only in the local Go insert.Insert / insert.InsertCtx (ToExternal + ExternalConfig). It is not serialized through pipeline.Insert in remote-run encoding/decoding. When a LOAD/INSERT plan has remote scopes, the remote CN rebuilds this as a normal insert operator with ToExternal=false, so it will try to insert into an engine relation instead of writing stage files.

This breaks the stated multi-CN requirement for writable external tables. Please either plumb ToExternal and the external writer config through the pipeline/plan proto path, or explicitly prevent this operator from being sent to remote CNs. Given the feature requires parallel multi-CN writes, proto plumbing seems necessary.

2. JSONLine writer ignores jsondata='array'

pkg/sql/colexec/externalwrite/encode.go

encodeJSONLine always emits one JSON object per row, but writable external table DDL still allows format='jsonline', jsondata='array'. The existing JSONLine reader switches on jsondata and expects arrays when jsondata='array', so a table created with that valid option will fail to read back its own inserted data.

Please either reject jsondata='array' for writable external tables at DDL time, or carry JsonData into WriterConfig and emit JSON arrays for that mode.

3. CSV default enclosure does not match the read path

pkg/sql/colexec/externalwrite/writer.go

The writer defaults EnclosedBy to 0, and constructExternalInsert only sets it from the parsed FIELDS clause. For common syntax like FIELDS TERMINATED BY ',', the parser produces an enclosure value of 0, while the external CSV reader still defaults to " as the enclosure. As a result, string values containing commas, newlines, or quotes are written unquoted and cannot round-trip through the same external table.

Please align the writer defaults with the external reader / SELECT INTO OUTFILE defaults, and add an end-to-end test with a comma/newline in a string column without an explicit ENCLOSED BY.

4. SHOW CREATE TABLE loses write_file_pattern

pkg/sql/plan/build_show_util.go

write_file_pattern is persisted in ExternParam.Option, but SHOW CREATE TABLE formatting does not include it for external tables. Recreating a writable external table from SHOW CREATE TABLE output will silently produce a read-only external table.

Please include write_file_pattern in the external-table show-create formatting for both INFILE and S3-style options.

Already raised by Copilot, but I would still require fixes

The BVT baseline currently expects an internal error for the JSONLine wide-type round-trip. That means the test is blessing a failed round-trip instead of validating writable JSONLine. The query should succeed and assert rows, or the unsupported case should be rejected with a stable user-facing error.

WriterConfig.Stmt is set with time.Now() during operator construction, so different pipelines/CNs can expand time directives using different timestamps. The PR documents statement-start semantics; please use the statement start timestamp instead.

In the scaled float32 branch, strconv.FormatFloat uses bitSize=64. It should use 32 to preserve float32 formatting semantics.

Verification I ran: go test ./pkg/sql/colexec/externalwrite passed. Targeted pkg/sql/compile and pkg/sql/plan tests were blocked by Go proxy dependency download failure for github.com/golang/geo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants