
[spark] Support FROM (query) export in COPY INTO location #8096

Open
JunRuiLee wants to merge 1 commit into apache:master from JunRuiLee:copy-into-query-export-v3

@JunRuiLee
Contributor

Purpose

Extend COPY INTO <location> (export) to accept an inline query as the source, not just a table:

COPY INTO '/export/active_users/'
FROM (SELECT id, name FROM my_db.users WHERE active = TRUE)
FILE_FORMAT = (TYPE = CSV, HEADER = TRUE);

Previously only FROM table_name was supported. The inline query is parsed through the
session (Paimon) parser, so it behaves exactly like the same query run via spark.sql,
including Paimon parser rules such as the v1 function rewrite.

Design notes:

  • Grammar: a parenBlock rule captures the parenthesized source with balanced
    parentheses, so nested parens (e.g. WHERE id IN (1, 2)) are matched correctly. The raw
    text is re-parsed by the AST builder.
  • Read-only enforcement: any read-only query is allowed (SELECT, WITH ... SELECT,
    VALUES, ...); statements with side effects are rejected by inspecting the parsed plan for
    Command / ParsedStatement / InsertIntoDir nodes. DDL/DML such as DROP, INSERT, and
    INSERT OVERWRITE DIRECTORY are rejected before any execution, so they cannot run.
  • Source modeling: the source (table name vs. query) is modeled as a sealed ADT in both
    the logical command and the physical exec, so impossible states cannot be constructed.
  • Row count: rows_written is counted before the write, so for a non-deterministic query it
    may differ from the number of rows actually written. This is documented. The result is
    intentionally not staged, so the export does not consume extra executor disk.
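
The grammar and source-modeling notes above can be illustrated with a small standalone sketch. It is not the PR's code: `CopySource`, `TableSource`, `QuerySource`, `parenBlock`, and `parseSource` are hypothetical names, and the scanner is a simplification that counts parentheses character by character (it does not skip parentheses inside string literals or comments, which a real lexer-based grammar rule would handle).

```scala
// Hypothetical sketch; names and structure are assumptions, not taken from the PR.

// Sealed ADT: a source is either a table name or a query, never both or neither.
sealed trait CopySource
final case class TableSource(name: String) extends CopySource
final case class QuerySource(sql: String) extends CopySource

object CopySourceSketch {

  /** Returns the text inside the balanced block opening at index `open`,
    * or None if there is no '(' there or the parentheses never balance. */
  def parenBlock(text: String, open: Int): Option[String] = {
    if (open < 0 || open >= text.length || text.charAt(open) != '(') return None
    var depth = 0
    var i = open
    while (i < text.length) {
      text.charAt(i) match {
        case '(' => depth += 1
        case ')' =>
          depth -= 1
          // Depth back at zero means this ')' closes the opening '('.
          if (depth == 0) return Some(text.substring(open + 1, i))
        case _ => // any other character leaves the depth unchanged
      }
      i += 1
    }
    None // ran out of input while still inside the block
  }

  /** Classifies the text after FROM; empty or unbalanced input yields None. */
  def parseSource(fromClause: String): Option[CopySource] = {
    val s = fromClause.trim
    if (s.startsWith("(")) parenBlock(s, 0).map(_.trim).filter(_.nonEmpty).map(QuerySource(_))
    else if (s.nonEmpty) Some(TableSource(s))
    else None
  }

  /** Because the trait is sealed, the compiler can check this match is exhaustive. */
  def describe(source: CopySource): String = source match {
    case TableSource(name) => s"table $name"
    case QuerySource(sql)  => s"query [$sql]"
  }
}
```

Under these assumptions, `CopySourceSketch.parseSource("(SELECT id FROM t WHERE id IN (1, 2))")` keeps the nested `IN (1, 2)` inside a single `QuerySource`, and `"()"` is rejected as an empty query. Captured text like this would still need to be re-parsed by the real SQL parser, as the grammar note describes.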

This is part of #8005.

Tests

  • CopyIntoTestBase: CSV/JSON/Parquet export from a query, aggregation, nested parentheses,
    OVERWRITE = TRUE, VALUES source, empty-query rejection, and rejection of
    side-effecting statements (with assertions that the source table is untouched and no files
    are written).
  • PaimonV1FunctionTestBase: exporting FROM (SELECT <v1_function>(...)) resolves correctly
    through the session parser.
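
The side-effect rejection cases listed above rest on inspecting the parsed plan before anything executes. The toy model below sketches that idea under stated assumptions: `PlanNode` and its subclasses are stand-ins invented for this example, not Spark's Catalyst classes, and the real check in the PR may be structured differently.

```scala
// Toy plan model; these types only echo the node names in the description
// and are NOT Spark's Command / ParsedStatement / InsertIntoDir classes.
sealed trait PlanNode { def children: Seq[PlanNode] }
final case class Project(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)
}
final case class Filter(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)
}
final case class Relation(name: String) extends PlanNode {
  override def children: Seq[PlanNode] = Nil
}
final case class CommandNode(kind: String) extends PlanNode {
  override def children: Seq[PlanNode] = Nil
}
final case class InsertIntoDirNode(path: String, query: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(query)
}

object ReadOnlyCheckSketch {

  /** Depth-first search for the first node that would have a side effect. */
  def firstSideEffect(plan: PlanNode): Option[PlanNode] = plan match {
    case c: CommandNode       => Some(c)
    case i: InsertIntoDirNode => Some(i)
    case other =>
      other.children.iterator.map(firstSideEffect).collectFirst { case Some(n) => n }
  }

  /** Works on the parsed tree only, so a rejected statement is never executed. */
  def validate(plan: PlanNode): Either[String, PlanNode] =
    firstSideEffect(plan) match {
      case Some(bad) => Left(s"COPY INTO source must be read-only, found: $bad")
      case None      => Right(plan)
    }
}
```

In this model a plain `Project(Filter(Relation("my_db.users")))` passes validation, while a plan with an `InsertIntoDirNode` nested anywhere in it is rejected, which is the property the tests assert by checking that the source table is untouched and no files are written.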

@JingsongLi JingsongLi closed this Jun 3, 2026
@JingsongLi JingsongLi reopened this Jun 3, 2026
@JingsongLi JingsongLi closed this Jun 3, 2026
@JingsongLi JingsongLi reopened this Jun 3, 2026
@JunRuiLee JunRuiLee force-pushed the copy-into-query-export-v3 branch from 714e43b to f6e903c June 3, 2026 08:15
@JunRuiLee JunRuiLee force-pushed the copy-into-query-export-v3 branch from f6e903c to 6f27690 June 3, 2026 10:40
@JingsongLi
Contributor

cc @Zouxxyy to take a look.
