Conversation

Contributor

@ghukill ghukill commented Jan 2, 2025

Purpose and background context

This PR adds the ability to read from a TIMDEX dataset.

Applications like the TIMDEX pipeline lambdas or TIM will need to read records from the dataset to perform further actions. This PR introduces some baseline methods on the TIMDEXDataset class to allow for quickly and efficiently reading records from a dataset.

This PR also introduces some refactoring of dataset filtering (commit 7251258) to accommodate the 4-5 new read methods, all of which also support dataset filtering. The approach follows Python PEP 692, which introduced typed kwargs for functions. This lets us define a single typed dictionary of the kwargs a function might expect (i.e. dataset filters) and reuse it across methods.
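
As a rough sketch of the PEP 692 pattern (the field names and signature below are illustrative, not necessarily exactly what this PR implements):

from typing import TypedDict, Unpack  # Unpack is in typing on Python 3.11+, otherwise typing_extensions

class DatasetFilters(TypedDict, total=False):
    """Columns/partitions available when filtering the dataset (illustrative subset)."""
    source: str
    run_date: str
    run_type: str
    run_id: str
    action: str

def read_dataframe(**filters: Unpack[DatasetFilters]):
    """Any method that filters the dataset accepts the same typed kwargs."""
    ...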

How can a reviewer manually see the effects of these changes?

1- Set Dev1 TIMDEXManagers credentials

2- Start ipython shell

pipenv run ipython

3- Load a dataset without any filtering

from timdex_dataset_api import TIMDEXDataset
td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset/")
td.load()

4- Filter by a run_id and load the entire run as a pandas dataframe

run_df = td.read_dataframe(run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02")

# observe dataframe is 360 rows all for the same run
display(run_df)
"""
             timdex_record_id                                      source_record                                 transformed_record     source    run_date run_type                                run_id action  year month day
0     libguides:guides-175846  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
1     libguides:guides-175847  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
2     libguides:guides-175849  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
3     libguides:guides-175853  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
4     libguides:guides-175855  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
..                        ...                                                ...                                                ...        ...         ...      ...                                   ...    ...   ...   ...  ..
355  libguides:guides-1402904  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
356  libguides:guides-1415734  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
357  libguides:guides-1429216  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
358  libguides:guides-1434814  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
359  libguides:guides-1435500  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18

[360 rows x 11 columns]
"""

5- Retrieve a single record as a dictionary, similar to what TIM might require

record = next(td.read_dicts_iter(run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02"))

display(record)
"""
{'timdex_record_id': 'libguides:guides-175846',
 'source_record': b'<?xml version="1.0" encoding="utf-8"?>\n<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><header><identifier>oai:libguides.com:guides/175846</identifier><datestamp>2024-07-09T17:17:40Z</datestamp><setSpec>guides</setSpec></header><metadata><oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:title>Materials Science &amp; Engineering</dc:title><dc:creator>Tina Chan</dc:creator><dc:subject>Engineering</dc:subject><dc:subject>Science</dc:subject><dc:description>Useful databases and other research tips for materials science.</dc:description><dc:publisher>MIT Libraries</dc:publisher><dc:date>2008-06-19 17:55:27</dc:date><dc:identifier>https://libguides.mit.edu/materials</dc:identifier></oai_dc:dc></metadata></record>',
 'transformed_record': b'{"source": "LibGuides", "source_link": "https://libguides.mit.edu/materials", "timdex_record_id": "libguides:guides-175846", "title": "Materials Science & Engineering", "citation": "Tina Chan. Materials Science & Engineering. MIT Libraries. libguides. https://libguides.mit.edu/materials", "content_type": ["libguides"], "contributors": [{"value": "Tina Chan", "kind": "Creator"}], "dates": [{"kind": "Created", "value": "2008-06-19T17:55:27"}], "format": "electronic resource", "identifiers": [{"value": "oai:libguides.com:guides/175846", "kind": "OAI-PMH"}], "links": [{"url": "https://libguides.mit.edu/materials", "kind": "LibGuide URL", "text": "LibGuide URL"}], "publishers": [{"name": "MIT Libraries"}], "subjects": [{"value": ["Engineering", "Science"], "kind": "Subject scheme not provided"}], "summary": ["Useful databases and other research tips for materials science."]}',
 'source': 'libguides',
 'run_date': datetime.date(2024, 12, 18),
 'run_type': 'full',
 'run_id': 'fe6e9d6d-67f7-4250-8842-4e43ebd53c02',
 'action': 'index',
 'year': '2024',
 'month': '12',
 'day': '18'}
"""

For slightly more advanced usage, here is an example of selecting a subset of columns for analysis, followed by some batch reading. This pulls from a simulated dataset in Dev1, s3://timdex-extract-dev-222053980223/dataset-five-year-simulation/, which contains a large number of records.

from timdex_dataset_api import TIMDEXDataset
td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset-five-year-simulation/")

# filter to a single day on load; this uses partitions and is quite quick
td.load(run_date='2026-06-15')

# get run_ids from this day
td.read_dataframe(columns=['source','run_id']).value_counts()
"""
Out[23]: 
source             run_id                              
alma               8efa0222-0889-4f16-aeb3-3ca60db89260    30332
gisogm             f3834404-348c-4a76-9154-def53233a1c4     5245
dspace             9cabbbe8-922f-4758-b680-707d8b802cc9      184
aspace             df2daf00-1381-4755-80ba-2964422333e7       33
libguides          272cc8fc-87af-4f9c-b6ed-6132e8a2cb8a        9
researchdatabases  f82c7458-37a8-425d-8f0a-8af03ba193c3        7
gismit             bf891864-8c66-4441-ab18-b2bb0281aeba        2
Name: count, dtype: int64
"""

# get record batches as lists of dictionaries, similar to what TIM would bulk insert into OpenSearch
for record_batch in td.read_dicts_iter(run_id="8efa0222-0889-4f16-aeb3-3ca60db89260"):
    print("Doing something with batch...")

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

Currently, both loading and filtering the dataset require a large
number of keyword arguments on each method in order to align with
dataset columns.  Moving into read methods, which will also support
filtering, this would mean a substantial amount of duplication and
could become error prone over time.

How this addresses that need:
* Creates a typed dictionary DatasetFilters that includes all columns
or partitions that we can use when filtering the dataset
* Each method that can filter the dataset accepts kwargs that are typed
against this typed dictionary

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-417

Why these changes are being introduced:

A primary responsibility of the TIMDEXDataset class is to provide
performant and memory efficient reading of a dataset.  It is anticipated
that additional read methods may be required, for specific or niche
situations, but some simple baseline ones are needed at this time.

How this addresses that need:
* Adds methods for reading pyarrow batches, pandas dataframes, and Python
dictionaries from a dataset.

Side effects of this change:
* Applications like the TIMDEX lambdas or TIM can now read records from the dataset

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-417

@ehanson8 ehanson8 left a comment


Looks good to me, one question about a docstring. Consider it approved once @jonavellecuerdo weighs in!

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


Reviewed the code changes and the sample code (without running it as I think a lot of hands-on experience with the read methods will occur during TIM dev), and I think the updates make sense! The learnings re: PEP692 were great to be aware of as it opens up more opportunities for **kwargs.

Have a couple clarification questions for you!

f"total size: {total_size}"
)

def read_batches_iter(
Contributor


Commenting here but applies to all the read_* methods: None of the read_* methods actually update self.dataset. Does this mean that it is always and only the load method that updates self.dataset?

I think this is an important distinction to note somewhere in future documentation (if not already documented somewhere beyond our PR discussions). 🤔

Contributor Author

@ghukill ghukill Jan 3, 2025


That's correct. Once the dataset is loaded, by design, any filtering performed in the read methods does not modify the loaded dataset. As you mention above, I think this will prove helpful in TIM, where you could:

  1. load the dataset with run_date and run_id
  2. perform a read, additionally filtering by action="index"
  3. perform another read, this time filtering by action="delete"

Both reads work, as they don't modify the originally loaded dataset. That's what this part of a test is meant to exercise.
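
A minimal sketch of that pattern, reusing values from the examples above:

td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset/")
td.load(run_date="2024-12-18", run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02")

index_df = td.read_dataframe(action="index")
delete_df = td.read_dataframe(action="delete")  # the loaded dataset is unchanged between reads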

@ghukill ghukill merged commit 47e74c3 into main Jan 3, 2025
2 checks passed
