Allow I/O of interpreted class if the underlying data format allows#22234
Merged
Conversation
fa0c418 to
93748f1
Compare
Test Results 22 files 22 suites 3d 11h 20m 3s ⏱️ For more details on these failures, see this check. Results for commit e7b6af3. ♻️ This comment has been updated with latest results. |
ed725fc to
fb44bbe
Compare
vepadulano
added a commit
to vepadulano/awkward
that referenced
this pull request
May 13, 2026
Change the implementation of the generated RDataSource class to use the newer signature that returns one column reader. This, together with the changes at root-project/root#22234, allow to restore the behaviour that awkward relied upon to store arrays to disk via RDataFrame. Irrespective of these changes, the current in-memory layout of the awkward array is not fit for storage, leading to writing silently garbage data to disk. Fixes scikit-hep#3885
vepadulano
added a commit
to vepadulano/awkward
that referenced
this pull request
May 13, 2026
Change the implementation of the generated RDataSource class to use the newer signature that returns one column reader. This, together with the changes at root-project/root#22234, allow to restore the behaviour that awkward relied upon to store arrays to disk via RDataFrame. Irrespective of these changes, the current in-memory layout of the awkward RecordArray is not fit for storage, leading to writing silently garbage data to disk. Fixes scikit-hep#3885
pcanal
reviewed
May 13, 2026
pcanal
reviewed
May 13, 2026
hageboeck
approved these changes
May 14, 2026
RDataFrame needs the type_info of input column types to create the column readers. This is searched first in a map of known simple types, then through TClass. Sometimes TClass might not know about a type_info even though it's available to the interpreter. This is the case for example of runtime-generated classes that are declared to the interpreter but do not have a dictionary, in which case `tclass->IsLoaded()` will return false and the TClass will not expose the type_info but the interpreter knows about it. This commit proposes to extend the search in RDataFrame to include also a call to the interpreter in case the RTTI cannot be found through TClass. This removes a long-standing safeguard introduced in RDataFrame that would prevent users from trying to Snapshot a column if the type had no dictionary. But both TTree and RNTuple support this to different degrees. By removing the safeguard and searching querying the interpreter for the RTTI RDataFrame delegates responsability to the underlying I/O technology. As a side effect, this also helps in the integration with awkward arrays, where a C++ type is generated per different awkward array layout at runtime in the Python application. The awkward data source is still responsible to communicate correctly to TTree or RNTuple what they should store to disk when a Snapshot is called.
fb44bbe to
e7b6af3
Compare
Member
Author
|
/backport to 6.40 |
|
Preparing to backport PR #22234 to branch 6.40 requested by vepadulano |
|
This PR has been backported to branch 6.40: #22303 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
RDataFrame needs the type_info of input column types to create the column readers. This is searched first in a map of known simple types, then through TClass. Sometimes TClass might not know about a type_info even though it's available to the interpreter. This is the case for example of runtime-generated classes that are declared to the interpreter but do not have a dictionary, in which case
tclass->IsLoaded()will return false and the TClass will not expose the type_info but the interpreter knows about it.This commit proposes to extend the search in RDataFrame to include also a call to the interpreter in case the RTTI cannot be found through TClass.
This removes a long-standing safeguard introduced in RDataFrame that would prevent users from trying to Snapshot a column if the type had no dictionary. But both TTree and RNTuple support this to different degrees. By removing the safeguard and querying the interpreter for the RTTI RDataFrame delegates responsability to the underlying I/O technology.
As a side effect, this also helps in the integration with awkward arrays, where a C++ type is generated per different awkward array layout at runtime in the Python application. The awkward data source is still responsible to communicate correctly to TTree or RNTuple what they should store to disk when a Snapshot is called.
FYI @ianna
Fixes #22233