Closes #59 - Add CEI dataset by napsternxg · Pull Request #530 · bigscience-workshop/biomedical

napsternxg · 2022-04-30T05:49:49Z

Fixes #59 - Add CEI dataset

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Name: CEI
Description: short description of the dataset (or link to social media or blog post)
Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173132
Data: https://github.com/sb895/chemical-exposure-information-corpus/archive/refs/heads/master.zip

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
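
The BUILDER_CONFIGS item in the checklist above can be checked mechanically. The sketch below is only an illustration: it uses a stand-in dataclass, `StubConfig`, rather than the real `BigBioConfig`, so it runs without the `datasets` or `bigbio` packages. The config names `cei_source` and `cei_bigbio_text` and the helper `has_required_schemas` are assumptions, not taken from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubConfig:
    """Hypothetical stand-in for a builder config; only the fields needed here."""

    name: str
    schema: str


def has_required_schemas(configs: List[StubConfig]) -> bool:
    """Return True if there is at least one source config and one bigbio-schema config."""
    has_source = any(c.schema == "source" for c in configs)
    has_bigbio = any(c.schema.startswith("bigbio_") for c in configs)
    return has_source and has_bigbio


# Assumed config names, for illustration only.
configs = [
    StubConfig(name="cei_source", schema="source"),
    StubConfig(name="cei_bigbio_text", schema="bigbio_text"),
]
ok = has_required_schemas(configs)
```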

- Initial commit to add CEI

mariosaenger · 2024-10-28T10:44:54Z

@phlobo I revised the implementation of this dataset. Please have a look at it.

phlobo · 2024-10-30T20:34:24Z

@mariosaenger

I noticed there are some duplicate labels per document:

{'id': '10022290',
 'document_id': '10022290',
 'text': '...',
 'labels': ['Biomonitoring--exposure biomarker--blood--cord blood',
  'Biomonitoring--exposure biomarker--mothers milk',
  'Biomonitoring--exposure biomarker--blood--cord blood',
  'Biomonitoring--exposure biomarker--mothers milk',
  'Biomonitoring--effect marker--physiological parameter']}

This way, the label statistics don't match the ones reported in the paper: e.g., there are 1467 instances of Biomonitoring--exposure biomarker--urine vs 784 in the paper.

I'm not sure I entirely understand the syntax of the source dataset labels (e.g., https://github.com/sb895/chemical-exposure-information-corpus/blob/master/labels/10022290.txt), but duplicate removal after parsing the labels might already do the trick.
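
A minimal sketch of the effect described above, assuming each document carries a list of already-parsed label strings. The helper name `label_counts` and the sample record are illustrative (modelled on the example above, with the text omitted) and are not taken from the loader. Counting a label at most once per document avoids the inflation that appears when a label is repeated.

```python
from collections import Counter
from typing import Dict, Iterable, List


def label_counts(docs: Iterable[Dict], dedupe: bool = True) -> Counter:
    """Count label occurrences across documents.

    With dedupe=True, a label is counted at most once per document.
    """
    counts: Counter = Counter()
    for doc in docs:
        labels: List[str] = doc["labels"]
        counts.update(set(labels) if dedupe else labels)
    return counts


# Illustrative record with repeated labels.
docs = [
    {
        "id": "10022290",
        "labels": [
            "Biomonitoring--exposure biomarker--blood--cord blood",
            "Biomonitoring--exposure biomarker--mothers milk",
            "Biomonitoring--exposure biomarker--blood--cord blood",
            "Biomonitoring--exposure biomarker--mothers milk",
            "Biomonitoring--effect marker--physiological parameter",
        ],
    }
]

raw = label_counts(docs, dedupe=False)    # repeated labels counted every time
unique = label_counts(docs, dedupe=True)  # each label counted once per document
```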

phlobo · 2024-10-30T20:34:58Z

bigbio/hub/hub_repos/cei/cei.py

+        text_files = sorted(list(base_dir.glob("./text/*.txt")))
+
+        if self.config.schema == "source":
+            # TODO: yield (key, example) tuples in the original dataset schema


Please remove TODO comments

phlobo · 2024-10-30T20:35:03Z

bigbio/hub/hub_repos/cei/cei.py

+                yield key, example
+
+        elif self.config.schema == "bigbio_text":
+            # TODO: yield (key, example) tuples in the bigbio schema


Please remove TODO comments

phlobo · 2024-10-30T20:35:43Z

bigbio/hub/hub_repos/cei/cei.py

+        with open(label_file, encoding="utf-8") as fp:
+            label_text = fp.read()
+
+        labels = [line.strip(" -") for line in LABEL_REGEX.findall(label_text)]


This results in many duplicate labels. Maybe just wrap it in a set?
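
One way to act on this suggestion, sketched under the assumption that label order should stay stable between runs: a plain `set` removes duplicates but does not preserve order, so the sketch uses `dict.fromkeys`, which keeps first-seen order. The function name `unique_labels` and the sample strings are hypothetical; the loader obtains its raw labels from `LABEL_REGEX`, whose pattern is not shown in this hunk.

```python
from typing import Iterable, List


def unique_labels(raw_labels: Iterable[str]) -> List[str]:
    """Strip surrounding spaces and dashes, then drop duplicates in first-seen order."""
    cleaned = (label.strip(" -") for label in raw_labels)
    return list(dict.fromkeys(cleaned))


# Hypothetical raw matches, including one repeated label.
raw_labels = [
    "--Biomonitoring--exposure biomarker--mothers milk",
    "--Biomonitoring--effect marker--physiological parameter",
    "--Biomonitoring--exposure biomarker--mothers milk",
]
labels = unique_labels(raw_labels)
```

If alphabetical order is acceptable, `sorted(set(...))` would give an equally deterministic result.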

phlobo · 2024-10-30T20:36:35Z

bigbio/hub/hub_repos/cei/cei.py

+_DESCRIPTION = """\
+The Chemical Exposure Information (CEI) Corpus consists of 3661 PubMed publication abstracts manually annotated by \
+experts according to a taxonomy. The taxonomy consists of 32 classes in a hierarchy. Zero or more class labels are \
+assigned to each sentence in the corpus.


The corpus does not really contain "sentences", but I guess the description was copied from the original source.

napsternxg and others added 3 commits April 11, 2022 16:20

Fixes bigscience-workshop#59 - Add CEI dataset

352e1d7

- Initial commit to add CEI

Added info. Need to figure out data parsing.

99507a1

Added working code.

105774e

napsternxg requested review from debajyotidatta, galtay, hakunanatasha, jason-fries, leonweber, ruisi-su, sg-wbi and sunnnymskang as code owners April 30, 2022 05:49

napsternxg mentioned this pull request Apr 30, 2022

Create a dataset loader for CEI #59

Open

sg-wbi changed the title ~~Fixes #59 - Add CEI dataset~~ Closes #59 - Add CEI dataset May 9, 2022

mariosaenger self-assigned this Oct 28, 2024

Mario Sänger added 2 commits October 28, 2024 11:22

Merge branch 'main' into cei

ed12144

refactor: Revise implementation of CEI to hub-style integration

52c208d

mariosaenger requested a review from phlobo October 28, 2024 10:44

phlobo requested changes Oct 30, 2024

napsternxg wants to merge 5 commits into bigscience-workshop:main from napsternxg:cei

3 participants
