Skip to content

GH-541: Document status of file_path#542

Merged
emkornfield merged 12 commits intoapache:masterfrom
emkornfield:warn_on_file_path
Feb 3, 2026
Merged

GH-541: Document status of file_path#542
emkornfield merged 12 commits intoapache:masterfrom
emkornfield:warn_on_file_path

Conversation

@emkornfield
Copy link
Copy Markdown
Contributor

Rationale for this change

file_path does not have any proper implementations and we should clarify its current usage, and that future
usage must go through a formal feature addition process.

What changes are included in this PR?

Clarification on the file_path field status.

Do these changes have PoC implementations?

No.

Closes #541

Copy link
Copy Markdown
Contributor

@alkis alkis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/main/thrift/parquet.thrift Outdated
Comment on lines +970 to +971
* addition process. CONTRIBUTING.md in the parquet-format repository
* provides details on the process.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wouldn't mention CONTRIBUTING.md here - seems something that can be found easily without explicit mention.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @emkornfield

Comment thread src/main/thrift/parquet.thrift Outdated
* addition process. CONTRIBUTING.md in the parquet-format repository
* provides details on the process.
*
* One known use-case for this field is to batch parquet footers together
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this somewhat confusing as above it says there are no implementations that make use of this implementation but then this paragraph explains a usecase.

Maybe we could say something like

As of December 2025, there are no known open source Parquet implementations that support
reading external data via this field. Some query engines (which ones??) make use of this field
to batch can batch parquet footers together into a single file that serve as an index, but this is
not common.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased above, starting with the known use-case.

@emkornfield
Copy link
Copy Markdown
Contributor Author

Should we add a row to https://parquet.apache.org/docs/file-format/implementationstatus/ for this?

I think we need to update implementation status phrasing to indicate the "_metadata" use case is what is meant.

@emkornfield
Copy link
Copy Markdown
Contributor Author

apache/parquet-site#145 to update implementation status

Comment thread src/main/thrift/parquet.thrift Outdated
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
Comment thread src/main/thrift/parquet.thrift Outdated
Comment on lines +973 to +974
* Writers should not populate this field except for in parquet summary files. Readers
* should ensure this field is empty.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These statements are effectively removing it from the spec, which I feel is too strong of a position for this clarification. I think it's fair to say that "readers should validate this field is empty if they do not support reading external column data", but prohibiting it is not a clarification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed this text entirely. I think this is mostly covered by the having new use go through a formal proposal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this was also referring to below as well. But I updated some language above this specific sentence in the paragraph above to also clarify that using this for reading external columns is not considered part of the specification (which is reflected by current implementation status).

Comment thread src/main/thrift/parquet.thrift Outdated
Comment thread src/main/thrift/parquet.thrift Outdated
remove guidance on populating
Comment thread src/main/thrift/parquet.thrift Outdated
Comment thread src/main/thrift/parquet.thrift Outdated
Comment thread src/main/thrift/parquet.thrift Outdated
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @emkornfield -- it is really nice to see the spec being clarified

* metadata. This path is relative to the current file.
*
* As of December 2025, the only known use-case for this field is writing summary
* parquet files (i.e. "_metadata" files). These files consolidate footers from
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a link that describes what a summary file is and what implementations support it?

This is what came back from a quick google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file

But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this was ever officially part of the parquet specification as far as I can tell.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded this section.

Comment thread src/main/thrift/parquet.thrift Outdated
Comment on lines +966 to +967
* This is legacy feature as modern table formats (e.g. Iceberg, Hudi and Delta Lake)
* are more scalable and serve effectively the same purpose.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem to me that calling this "legacy" may be too opinionated -- maybe we could tone down the language with something like

Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d

It seems like there are potential correctness issues.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this to the text.

Comment thread src/main/thrift/parquet.thrift Outdated
Comment thread src/main/thrift/parquet.thrift Outdated
Comment thread src/main/thrift/parquet.thrift Outdated
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me -- thank you @emkornfield

Comment thread src/main/thrift/parquet.thrift Outdated
clarify reference implementation.
@emkornfield emkornfield requested a review from alkis January 5, 2026 05:06
Copy link
Copy Markdown

@danielcweeks danielcweeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a suggestion to massage the language just a little, but I'm willing to leave it to you.

Comment on lines +974 to +975
* within a standard parquet file. Making use of the field for this purpose is
* not considered part of the Parquet specification.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* within a standard parquet file. Making use of the field for this purpose is
* not considered part of the Parquet specification.
* within a standard parquet file. Making use of the field for this purpose is
* discouraged and may be removed in a later revision of the specification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the consensus is that making use of the field needs to be revisited with a strong use-case, I'd prefer to leave the wording as is.

Copy link
Copy Markdown
Contributor

@alkis alkis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look great. Thank you.

@emkornfield
Copy link
Copy Markdown
Contributor Author

Thanks everyone for the reviews. Merging.

@emkornfield emkornfield merged commit 9621f8c into apache:master Feb 3, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify file_path status

6 participants