Closes #59 - Add CEI dataset #530
napsternxg wants to merge 5 commits into bigscience-workshop:main
- Initial commit to add CEI
@phlobo I revised the implementation of this dataset. Please have a look at it.
I noticed there are some duplicate labels per document:

{'id': '10022290',
 'document_id': '10022290',
 'text': '...',
 'labels': ['Biomonitoring--exposure biomarker--blood--cord blood',
            'Biomonitoring--exposure biomarker--mothers milk',
            'Biomonitoring--exposure biomarker--blood--cord blood',
            'Biomonitoring--exposure biomarker--mothers milk',
            'Biomonitoring--effect marker--physiological parameter']}

This way, the label statistics don't match the ones reported in the paper: e.g., there are 1467 instances of

I'm not sure I entirely understand the syntax of the source dataset labels (e.g., https://github.com/sb895/chemical-exposure-information-corpus/blob/master/labels/10022290.txt), but duplicate removal after parsing the labels might already do the trick.
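A quick way to confirm the duplication on this one example is to count the labels. The sketch below only uses the label list quoted above; the variable names are illustrative and not taken from the dataloader.

```python
from collections import Counter

# Labels quoted above for document 10022290.
labels = [
    "Biomonitoring--exposure biomarker--blood--cord blood",
    "Biomonitoring--exposure biomarker--mothers milk",
    "Biomonitoring--exposure biomarker--blood--cord blood",
    "Biomonitoring--exposure biomarker--mothers milk",
    "Biomonitoring--effect marker--physiological parameter",
]

counts = Counter(labels)

# Labels that occur more than once within the same document.
duplicates = sorted(label for label, n in counts.items() if n > 1)

print(duplicates)
print(len(labels), "parsed labels,", len(counts), "distinct")
```

If per-label statistics are computed over the whole corpus, counting each label at most once per document is presumably what is needed for the numbers to be comparable with the paper, though I have not verified that against the source data.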
text_files = sorted(list(base_dir.glob("./text/*.txt")))

if self.config.schema == "source":
    # TODO: yield (key, example) tuples in the original dataset schema
Please remove TODO comments
    yield key, example

elif self.config.schema == "bigbio_text":
    # TODO: yield (key, example) tuples in the bigbio schema
Please remove TODO comments
with open(label_file, encoding="utf-8") as fp:
    label_text = fp.read()

labels = [line.strip(" -") for line in LABEL_REGEX.findall(label_text)]
This results in many duplicate labels. Maybe just wrap it in a set?
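A set would work, but it discards the order in which labels appear in the label file. If a stable order matters (an assumption on my side), `dict.fromkeys` drops duplicates while keeping the first occurrence of each label. In the sketch below, `dedupe_labels` is a hypothetical helper name and the input list stands in for the result of `LABEL_REGEX.findall(label_text)`.

```python
from typing import Iterable, List


def dedupe_labels(raw_labels: Iterable[str]) -> List[str]:
    """Strip spaces and hyphens from both ends, then drop duplicates in order."""
    cleaned = (label.strip(" -") for label in raw_labels)
    # dict preserves insertion order, so first occurrences keep their position.
    return list(dict.fromkeys(cleaned))


# Stand-in for LABEL_REGEX.findall(label_text); the values are illustrative.
raw = [
    "Biomonitoring--exposure biomarker--blood--cord blood",
    "Biomonitoring--exposure biomarker--mothers milk",
    "Biomonitoring--exposure biomarker--blood--cord blood",
    " Biomonitoring--exposure biomarker--mothers milk ",
]

labels = dedupe_labels(raw)
print(labels)
```

Stripping before deduplicating matters here: two matches that differ only in surrounding whitespace or hyphens collapse into one label.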
_DESCRIPTION = """\
The Chemical Exposure Information (CEI) Corpus consists of 3661 PubMed publication abstracts manually annotated by \
experts according to a taxonomy. The taxonomy consists of 32 classes in a hierarchy. Zero or more class labels are \
assigned to each sentence in the corpus.
The corpus does not really contain "sentences", but I guess the description was copied from the original source...
Fixes #59 - Add CEI dataset
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
If the following information is NOT present in the issue, please populate:
- biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
- _info(), _split_generators() and _generate_examples() in dataloader script.
- BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
- datasets.load_dataset function.
- python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.