Conversation

Contributor

@ghukill ghukill commented Jan 2, 2025

Purpose and background context

This PR adds the ability to read from a TIMDEX dataset.

Applications like the TIMDEX pipeline lambdas or TIM will need to read records from the dataset to perform further actions. This PR introduces some baseline methods on the TIMDEXDataset class to allow for quickly and efficiently reading records from a dataset.

This PR also introduces some refactoring of dataset filtering (commit 7251258) to accommodate the 4-5 new read methods, all of which also support dataset filtering. The approach follows Python PEP 692, which introduced typed kwargs for functions. This lets us define a single typed dictionary of the kwargs a function might expect (i.e. dataset filters) and reuse it across methods.
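
As a rough sketch of the PEP 692 pattern (the field names and signature below are illustrative, not necessarily exactly what this PR implements):

from typing import TypedDict, Unpack  # Unpack is in typing on Python 3.11+, otherwise typing_extensions

class DatasetFilters(TypedDict, total=False):
    """Columns/partitions available when filtering the dataset (illustrative subset)."""
    source: str
    run_date: str
    run_type: str
    run_id: str
    action: str

def read_dataframe(**filters: Unpack[DatasetFilters]):
    """Any method that filters the dataset accepts the same typed kwargs."""
    ...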

How can a reviewer manually see the effects of these changes?

1- Set Dev1 TIMDEXManagers credentials

2- Start ipython shell

pipenv run ipython

3- Load a dataset without any filtering

from timdex_dataset_api import TIMDEXDataset
td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset/")
td.load()

4- Filter by a run_id and load the entire run as a pandas dataframe

run_df = td.read_dataframe(run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02")

# observe dataframe is 360 rows all for the same run
display(run_df)
"""
             timdex_record_id                                      source_record                                 transformed_record     source    run_date run_type                                run_id action  year month day
0     libguides:guides-175846  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
1     libguides:guides-175847  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
2     libguides:guides-175849  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
3     libguides:guides-175853  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
4     libguides:guides-175855  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
..                        ...                                                ...                                                ...        ...         ...      ...                                   ...    ...   ...   ...  ..
355  libguides:guides-1402904  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
356  libguides:guides-1415734  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
357  libguides:guides-1429216  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
358  libguides:guides-1434814  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18
359  libguides:guides-1435500  b'<?xml version="1.0" encoding="utf-8"?>\n<rec...  b'{"source": "LibGuides", "source_link": "http...  libguides  2024-12-18     full  fe6e9d6d-67f7-4250-8842-4e43ebd53c02  index  2024    12  18

[360 rows x 11 columns]
"""

5- Retrieve a single record as a dictionary, similar to what TIM might require

record = next(td.read_dicts_iter(run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02"))

display(record)
"""
{'timdex_record_id': 'libguides:guides-175846',
 'source_record': b'<?xml version="1.0" encoding="utf-8"?>\n<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><header><identifier>oai:libguides.com:guides/175846</identifier><datestamp>2024-07-09T17:17:40Z</datestamp><setSpec>guides</setSpec></header><metadata><oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:title>Materials Science &amp; Engineering</dc:title><dc:creator>Tina Chan</dc:creator><dc:subject>Engineering</dc:subject><dc:subject>Science</dc:subject><dc:description>Useful databases and other research tips for materials science.</dc:description><dc:publisher>MIT Libraries</dc:publisher><dc:date>2008-06-19 17:55:27</dc:date><dc:identifier>https://libguides.mit.edu/materials</dc:identifier></oai_dc:dc></metadata></record>',
 'transformed_record': b'{"source": "LibGuides", "source_link": "https://libguides.mit.edu/materials", "timdex_record_id": "libguides:guides-175846", "title": "Materials Science & Engineering", "citation": "Tina Chan. Materials Science & Engineering. MIT Libraries. libguides. https://libguides.mit.edu/materials", "content_type": ["libguides"], "contributors": [{"value": "Tina Chan", "kind": "Creator"}], "dates": [{"kind": "Created", "value": "2008-06-19T17:55:27"}], "format": "electronic resource", "identifiers": [{"value": "oai:libguides.com:guides/175846", "kind": "OAI-PMH"}], "links": [{"url": "https://libguides.mit.edu/materials", "kind": "LibGuide URL", "text": "LibGuide URL"}], "publishers": [{"name": "MIT Libraries"}], "subjects": [{"value": ["Engineering", "Science"], "kind": "Subject scheme not provided"}], "summary": ["Useful databases and other research tips for materials science."]}',
 'source': 'libguides',
 'run_date': datetime.date(2024, 12, 18),
 'run_type': 'full',
 'run_id': 'fe6e9d6d-67f7-4250-8842-4e43ebd53c02',
 'action': 'index',
 'year': '2024',
 'month': '12',
 'day': '18'}
"""

For slightly more advanced usage, here is an example of selecting a subset of columns for analysis, followed by some batch reading. This pulls from a simulated dataset in Dev1, s3://timdex-extract-dev-222053980223/dataset-five-year-simulation/, which contains a large number of records.

from timdex_dataset_api import TIMDEXDataset
td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset-five-year-simulation/")

# filter to a single day on load; this uses partitions and is quite quick
td.load(run_date='2026-06-15')

# get run_ids from this day
td.read_dataframe(columns=['source','run_id']).value_counts()
"""
Out[23]: 
source             run_id                              
alma               8efa0222-0889-4f16-aeb3-3ca60db89260    30332
gisogm             f3834404-348c-4a76-9154-def53233a1c4     5245
dspace             9cabbbe8-922f-4758-b680-707d8b802cc9      184
aspace             df2daf00-1381-4755-80ba-2964422333e7       33
libguides          272cc8fc-87af-4f9c-b6ed-6132e8a2cb8a        9
researchdatabases  f82c7458-37a8-425d-8f0a-8af03ba193c3        7
gismit             bf891864-8c66-4441-ab18-b2bb0281aeba        2
Name: count, dtype: int64
"""

# get record batches as lists of dictionaries, similar to what TIM would bulk insert into OpenSearch
for record_batch in td.read_dicts_iter(run_id="8efa0222-0889-4f16-aeb3-3ca60db89260"):
    print("Doing something with batch...")

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

Currently, both loading and filtering the dataset require a large
number of keyword arguments on each method in order to align with
dataset columns.  Moving into read methods, which will also support
filtering, this would mean a substantial amount of duplication and
could become error prone over time.

How this addresses that need:
* Creates a typed dictionary DatasetFilters that includes all columns
or partitions that we can use when filtering the dataset
* Each method that can filter the dataset accepts kwargs that are typed
against this typed dictionary

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-417

Why these changes are being introduced:

A primary responsibility of the TIMDEXDataset class is to provide
performant and memory efficient reading of a dataset.  It is anticipated
that additional read methods may be required, for specific or niche
situations, but some simple baseline ones are needed at this time.

How this addresses that need:
* Adds methods for reading pyarrow batches, pandas dataframes, and Python
dictionaries from a dataset.

Side effects of this change:
* Applications like the TIMDEX lambdas or TIM can now read records from the dataset

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-417

@ehanson8 ehanson8 left a comment


Looks good to me, one question about a docstring. Consider it approved once @jonavellecuerdo weighs in!

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


Reviewed the code changes and the sample code (without running it as I think a lot of hands-on experience with the read methods will occur during TIM dev), and I think the updates make sense! The learnings re: PEP692 were great to be aware of as it opens up more opportunities for **kwargs.

Have a couple clarification questions for you!

f"total size: {total_size}"
)

def read_batches_iter(
Contributor


Commenting here but applies to all the read_* methods: None of the read_* methods actually update self.dataset. Does this mean that it is always and only the load method that updates self.dataset?

I think this is an important distinction to note somewhere in future documentation (if not already documented somewhere beyond our PR discussions). 🤔

Contributor Author

@ghukill ghukill Jan 3, 2025


That's correct. Once the dataset is loaded, by design, any filtering performed in the read methods does not modify the loaded dataset. As you mention above, I think this will prove helpful in TIM, where you could:

  1. load the dataset with run_date and run_id
  2. perform a read, additionally filtering by action="index"
  3. perform another read, this time filtering by action="delete"

Both reads work, as they don't modify the originally loaded dataset. That's what this part of a test is meant to exercise.
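
A minimal sketch of that pattern, reusing values from the examples above:

td = TIMDEXDataset(location="s3://timdex-extract-dev-222053980223/dataset/")
td.load(run_date="2024-12-18", run_id="fe6e9d6d-67f7-4250-8842-4e43ebd53c02")

index_df = td.read_dataframe(action="index")
delete_df = td.read_dataframe(action="delete")  # the loaded dataset is unchanged between reads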

@ghukill ghukill merged commit 47e74c3 into main Jan 3, 2025
2 checks passed
