GH-541: Document status of file_path#542
Conversation
alkis
left a comment
There was a problem hiding this comment.
Should we add a row to https://parquet.apache.org/docs/file-format/implementationstatus/ for this?
| * addition process. CONTRIBUTING.md in the parquet-format repository | ||
| * provides details on the process. |
There was a problem hiding this comment.
nit: I wouldn't mention CONTRIBUTING.md here - seems something that can be found easily without explicit mention.
alamb
left a comment
There was a problem hiding this comment.
Thanks for this PR @emkornfield
| * addition process. CONTRIBUTING.md in the parquet-format repository | ||
| * provides details on the process. | ||
| * | ||
| * One known use-case for this field is to batch parquet footers together |
There was a problem hiding this comment.
I found this somewhat confusing as above it says there are no implementations that make use of this implementation but then this paragraph explains a usecase.
Maybe we could say something like
As of December 2025, there are no known open source Parquet implementations that support
reading external data via this field. Some query engines (which ones??) make use of this field
to batch can batch parquet footers together into a single file that serve as an index, but this is
not common.
There was a problem hiding this comment.
I rephrased above, starting with the known use-case.
I think we need to update implementation status phrasing to indicate the "_metadata" use case is what is meant. |
|
apache/parquet-site#145 to update implementation status |
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
| * Writers should not populate this field except for in parquet summary files. Readers | ||
| * should ensure this field is empty. |
There was a problem hiding this comment.
These statements are effectively removing it from the spec, which I feel is too strong of a position for this clarification. I think it's fair to say that "readers should validate this field is empty if they do not support reading external column data", but prohibiting it is not a clarification.
There was a problem hiding this comment.
I just removed this text entirely. I think this is mostly covered by the having new use go through a formal proposal.
There was a problem hiding this comment.
Not sure if this was also referring to below as well. But I updated some language above this specific sentence in the paragraph above to also clarify that using this for reading external columns is not considered part of the specification (which is reflected by current implementation status).
remove guidance on populating
alamb
left a comment
There was a problem hiding this comment.
Thank you @emkornfield -- it is really nice to see the spec being clarified
| * metadata. This path is relative to the current file. | ||
| * | ||
| * As of December 2025, the only known use-case for this field is writing summary | ||
| * parquet files (i.e. "_metadata" files). These files consolidate footers from |
There was a problem hiding this comment.
do we have a link that describes what a summary file is and what implementations support it?
This is what came back from a quick google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file
But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code
There was a problem hiding this comment.
I don't think this was ever officially part of the parquet specification as far as I can tell.
There was a problem hiding this comment.
I reworded this section.
| * This is legacy feature as modern table formats (e.g. Iceberg, Hudi and Delta Lake) | ||
| * are more scalable and serve effectively the same purpose. |
There was a problem hiding this comment.
Seem to me that calling this "legacy" may be too opinionated -- maybe we could tone down the language with something like
Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.
There was a problem hiding this comment.
This is at attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d
It seems like there are potential correctness issues.
There was a problem hiding this comment.
added this to the text.
alamb
left a comment
There was a problem hiding this comment.
looks good to me -- thank you @emkornfield
clarify reference implementation.
danielcweeks
left a comment
There was a problem hiding this comment.
I left a suggestion to massage the language just a little, but I'm willing to leave it to you.
| * within a standard parquet file. Making use of the field for this purpose is | ||
| * not considered part of the Parquet specification. |
There was a problem hiding this comment.
| * within a standard parquet file. Making use of the field for this purpose is | |
| * not considered part of the Parquet specification. | |
| * within a standard parquet file. Making use of the field for this purpose is | |
| * discouraged and may be removed in a later revision of the specification. |
There was a problem hiding this comment.
I think the consensus is that making use of the field needs to be revisited with a strong use-case, I'd prefer to leave the wording as is.
alkis
left a comment
There was a problem hiding this comment.
This look great. Thank you.
|
Thanks everyone for the reviews. Merging. |
Rationale for this change
file_pathdoes not have any proper implementations and we should clarify its current usage, and that futureusage must go through a formal feature addition process.
What changes are included in this PR?
Clarification on the
file_pathfield status.Do these changes have PoC implementations?
No.
Closes #541