-
Notifications
You must be signed in to change notification settings - Fork 0
TIMX 417 - read from dataset #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: Currently, both loading and filtering the dataset resulted in a large number of keyword arguments for each method to aligned with dataset columns. Moving into read methods, which will also support filtering, this was a substantial amount of duplication and could be error prone over time. How this addresses that need: * Creates a typed dictionary DatasetFilters that includes all columns or partitions that we can use when filtering the dataset * Each method that can filter the dataset accepts kwargs, but they are typed to this typed dictionary Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-417
Why these changes are being introduced: A primary responsibility of the TIMDEXDataset class is to provide performant and memory efficient reading of a dataset. It is anticipated that additional read methods may be required, for specific or niche situations, but some simple baseline ones are needed at this time. How this addresses that need: * Adds methods for reading pyarrow batches, pandas dataframes, and python dictionaries from a dataset. Side effects of this change: * Applications like timdex lambdas or TIM can now read records from dataset Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-417
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, one question about a docstring. Consider it approved once @jonavellecuerdo weighs in!
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the code changes and the sample code (without running it as I think a lot of hands-on experience with the read methods will occur during TIM dev), and I think the updates make sense! The learnings re: PEP692 were great to be aware of as it opens up more opportunities for **kwargs.
Have a couple clarification questions for you!
| f"total size: {total_size}" | ||
| ) | ||
|
|
||
| def read_batches_iter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting here but applies to all the read_* methods: None of the read_* methods actually update self.dataset. Does this mean that it is always and only the load method that updates self.dataset?
I think this is an important distinction to note somewhere in future documentation (if not already documented somewhere beyond our PR discussions). 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. Once the dataset is loaded, by design, any filtering performed in the read methods do not modify the loaded dataset. As you mention above, I think this will prove helpful in TIM where you could:
- load the dataset with
run_dateandrun_id - perform a read, additionally filtering by
action="index" - perform another read, this time filtering by
action="delete"
Both reads work, as they don't modify the originally loaded dataset. That's what this part of a test is meant to exercise.
Purpose and background context
This PR adds the ability to read from a TIMDEX dataset.
Applications like the TIMDEX pipeline lambdas or TIM will need to read records from the dataset to perform further actions. This PR introduces some baseline methods on the
TIMDEXDatasetclass to allow for quickly and efficiently reading records from a dataset.This PR also introduces some refactoring of dataset filtering (commit 7251258) to accomodate 4-5 new read methods that all also support datasaet filtering. The approach was to follow python PEP 692 which introduced typed
kwargsfor a function. In this way, we are able to create a single typed dictionary of kwargs we might expect for a function (i.e. dataset filters).How can a reviewer manually see the effects of these changes?
1- Set Dev1
TIMDEXManagerscredentials2- Start ipython shell
3- Load a dataset without any filtering
4- Filter by a
run_idand load entire run as pandas dataframe5- Retrieve a single record as a dictionary, similar to what TIM might require
For a bit more advanced usage, here is an example of subsetting what columns are returned for some analysis, and then some batch reading. This pulls from a simulated dataset in Dev1,
s3://timdex-extract-dev-222053980223/dataset-five-year-simulation/, that has lots of records.Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)