Commit 9e42bd2

Merge pull request #97 from NeotomaDB/dev
Final Changes for Submission
2 parents: 8005366 + a420bc8

File tree

144 files changed: +8543 −16423 lines


.github/workflows/pull-request-testing.yml

Lines changed: 5 additions & 2 deletions
```diff
@@ -35,7 +35,10 @@ jobs:
           flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
       - name: Test with pytest
         run: |
-          pytest
+          pytest --cov=src --cov-report=xml
       - name: Upload coverage reports to Codecov
         uses: codecov/codecov-action@v3
-        env: CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+        with:
+          files: ./coverage.xml # coverage report
```
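The updated workflow has `pytest --cov` write a Cobertura-style `coverage.xml` that Codecov then ingests. As a minimal sketch of what that report carries, the overall line rate can be read with the standard library (the XML below is a simplified, hypothetical report, not one produced by this repository):

```python
import xml.etree.ElementTree as ET

# Toy Cobertura-style report, structurally similar to the coverage.xml
# produced by `pytest --cov=src --cov-report=xml` (simplified for illustration).
SAMPLE = """<?xml version="1.0"?>
<coverage line-rate="0.87" branch-rate="0.75">
  <packages>
    <package name="src" line-rate="0.87"/>
  </packages>
</coverage>"""

def overall_line_coverage(xml_text: str) -> float:
    """Return the report's overall line coverage as a percentage."""
    root = ET.fromstring(xml_text)
    return round(float(root.attrib["line-rate"]) * 100, 2)

print(overall_line_coverage(SAMPLE))  # 87.0
```

The `line-rate` attribute on the root `<coverage>` element is part of the Cobertura format that coverage.py emits; Codecov reads the same file, so no extra parsing is needed in CI.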

CODE_OF_CONDUCT.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -34,7 +34,7 @@ This Code of Conduct applies both within project spaces and in public spaces whe
 
 ## Enforcement
 
-Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at goring@wisc.edu. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
 
 Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
```

README.md

Lines changed: 74 additions & 47 deletions
```diff
@@ -3,51 +3,62 @@
 [![Stargazers][stars-shield]][stars-url]
 [![Issues][issues-shield]][issues-url]
 [![MIT License][license-shield]][license-url]
+[![codecov][codecov-shield]][codecov-url]
 
+![Banner](assets/ffossils-logo-text.png)
 # **MetaExtractor: Finding Fossils in the Literature**
 
 This project aims to identify research articles which are relevant to the [_Neotoma Paleoecological Database_](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards and then submitted to Neotoma. It is being completed as part of the _University of British Columbia (UBC)_ [_Masters of Data Science (MDS)_](https://masterdatascience.ubc.ca/) program in partnership with the [_Neotoma Paleoecological Database_](http://neotomadb.org).
 
 **Table of Contents**
 
 - [**MetaExtractor: Finding Fossils in the Literature**](#metaextractor-finding-fossils-in-the-literature)
-  - [**Article Relevance Prediction**](#article-relevance-prediction)
-  - [**Data Extraction Pipeline**](#data-extraction-pipeline)
-  - [**Data Review Tool**](#data-review-tool)
+  - [About](#about)
+    - [Article Relevance Prediction](#article-relevance-prediction)
+    - [Data Extraction Pipeline](#data-extraction-pipeline)
+    - [Data Review Tool](#data-review-tool)
   - [How to use this repository](#how-to-use-this-repository)
-    - [Entity Extraction Model Training](#entity-extraction-model-training)
     - [Data Review Tool](#data-review-tool-1)
+    - [Article Relevance \& Entity Extraction Model](#article-relevance--entity-extraction-model)
     - [Data Requirements](#data-requirements)
       - [Article Relevance Prediction](#article-relevance-prediction-1)
       - [Data Extraction Pipeline](#data-extraction-pipeline-1)
-    - [Development Workflow Overview](#development-workflow-overview)
-    - [Analysis Workflow Overview](#analysis-workflow-overview)
     - [System Requirements](#system-requirements)
-  - [**Directory Structure and Description**](#directory-structure-and-description)
-  - [**Contributors**](#contributors)
+  - [Directory Structure and Description](#directory-structure-and-description)
+  - [Contributors](#contributors)
     - [Tips for Contributing](#tips-for-contributing)
 
 There are 3 primary components to this project:
 
 1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma, and submit them for processing.
-2. **MetaData Extraction Pipeline** - extract relevant entities from the article including geographic locations, taxa, etc.
+2. **Data Extraction Pipeline** - extract relevant entities from the article, including geographic locations, taxa, etc.
 3. **Data Review Tool** - this takes the extracted data and allows the user to review and correct it for submission to Neotoma.
 
-![](assets/project-flow-diagram.png)
+<p align="center">
+  <img src="assets/project-flow-diagram.png" width="800">
+</p>
 
-## **Article Relevance Prediction**
+## **About**
+
+Information on each component is outlined below.
+
+### **Article Relevance Prediction**
 
 The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract, and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
 
 The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.
 
 Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.
 
-![](assets/article_prediction_flow.png)
+<p align="center">
+  <img src="assets/article_prediction_flow.png" width="800">
+</p>
+
+To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](docker/article-relevance/README.md).
 
-To run the Docker image for article relevance prediction pipeline, please refer to the instruction [here](docker/article-relevance/README.md)
+The model can be retrained using reviewed article data. Please refer to [here](docker/article-relevance-retrain/README.md) for instructions.
 
-## **Data Extraction Pipeline**
+### **Data Extraction Pipeline**
 
 The full text is provided by the xDD team for the articles that are deemed to be relevant, and a custom-trained **Named Entity Recognition (NER)** model is used to extract entities of interest from the article.
 
```
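The README above describes a logistic regression model that classifies article metadata as relevant or not. As a minimal, self-contained sketch of that idea (pure standard library, hypothetical toy titles and labels; the project's actual features, training code, and data differ):

```python
import math
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def train_logistic(docs, labels, epochs=200, lr=0.5):
    """Tiny bag-of-words logistic regression fit by gradient descent."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            toks = tokenize(text)
            z = bias + sum(weights[t] for t in toks)
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            g = p - y                        # gradient of the log-loss
            bias -= lr * g
            for t in toks:
                weights[t] -= lr * g
    return weights, bias

def predict(weights, bias, text):
    """Probability that `text` is relevant (label 1)."""
    z = bias + sum(weights[t] for t in tokenize(text))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical titles: 1 = relevant to Neotoma, 0 = not relevant.
docs = ["holocene pollen record lake sediment",
        "fossil diatom assemblage late glacial",
        "deep learning image segmentation benchmark",
        "quantum error correction surface codes"]
labels = [1, 1, 0, 0]
w, b = train_logistic(docs, labels)
print(predict(w, b, "pollen record from lake sediment") > 0.5)  # True
```

Because the learned per-token weights are directly inspectable, a model of this family offers the interpretability the README cites as a reason for choosing logistic regression.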

````diff
@@ -64,72 +75,89 @@ The entities extracted by this model are:
 The model was trained on ~40 existing Paleoecology articles manually annotated by the team, consisting of **~60,000 tokens** with **~4,500 tagged entities**.
 
 The trained model is available for inference and further development on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).
-![](assets/hugging-face-metaextractor.png)
 
-## **Data Review Tool**
+<p align="center">
+  <img src="assets/hugging-face-metaextractor.png" width="1000">
+</p>
+
+### **Data Review Tool**
 
 Finally, the extracted data is loaded into the Data Review Tool where members of the Neotoma community can review the data and make any corrections necessary before submitting to Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework. The tool allows users to view the extracted data, make corrections, and submit the data to be entered into Neotoma.
 
-![](assets/data-review-tool.png)
+<p align="center">
+  <img src="assets/data-review-tool.png" width="1000">
+</p>
 
 ## How to use this repository
 
-First, begin by installing the requirements and Docker if not already installed ([Docker install instructions](https://docs.docker.com/get-docker/))
+First, begin by installing the requirements.
+
+For pip:
 
 ```bash
 pip install -r requirements.txt
 ```
 
-A conda environment file will be provided in the final release.
-
-### Entity Extraction Model Training
-
-The Entity Extraction Models can be trained using the HuggingFace API by following the instructions in the [Entity Extraction Training README](src/entity_extraction/training/hf_token_classification/README.md).
-
-The spaCy model training documentation is a WIP.
+For conda:
+```bash
+conda env create -f environment.yml
+```
 
-### Data Review Tool
+If you plan to use the pre-built Docker images, install Docker by following these [instructions](https://docs.docker.com/get-docker/).
 
-The Data Review Tool can be launched by running the following command from the root directory of this repository:
+To launch the app, run the following command from the root directory of this repository:
 
 ```bash
 docker-compose up --build data-review-tool
 ```
 
-Once the image is built and the container is running, the Data Review Tool can be accessed at http://localhost:8050/. There is a sample "extracted entities" JSON file provided for demo purposes.
+Once the image is built and the container is running, the Data Review Tool can be accessed at <http://0.0.0.0:8050/>. A sample `article-relevance-output.parquet` and `entity-extraction-output.zip` are provided for demo purposes.
+
+### **Article Relevance & Entity Extraction Model**
+
+Please refer to the project wiki for development and analysis workflow details: [MetaExtractor Wiki](https://github.com/NeotomaDB/MetaExtractor/wiki)
 
-### Data Requirements
+### **Data Requirements**
 
 Each of the components of this project has different data requirements. The data requirements for each component are outlined below.
 
-#### Article Relevance Prediction
+#### **Article Relevance Prediction**
+
+The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download [HERE](https://drive.google.com/drive/folders/1NpOO7vSnVY0Wi0rvkuwNiSo3sqq-5AkY?usp=sharing). Download all files and extract the contents into `MetaExtractor/data/article-relevance/raw/`.
+
+#### **Data Extraction Pipeline**
 
-The article relevance prediction component requires a list of journals that are relevant to Neotoma. This dataset used to train and develop the model is available for download HERE. TODO: Setup public link for data download from project GDrive.
+As the full-text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests, please contact Simon Goring at <goring@wisc.edu> or Ty Andrews at <ty.elgin.andrews@gmail.com>.
 
-#### Data Extraction Pipeline
+#### **Data Review Tool**
 
-As the full text articles provided by the xDD team are not publicly available we cannot create a public link to download the labelled training data. For access requests please contact Ty Andrews at ty.elgin.andrews@gmail.com.
+Once the article relevance prediction and data extraction pipelines have been run, the output files can be used as input for the Data Review Tool. The Data Review Tool requires the following files:
 
-### Development Workflow Overview
+- `article-relevance-output.parquet` - output file from the article relevance prediction pipeline
+- `entity-extraction-output.zip` - output file from the data extraction pipeline
 
-WIP
+These files should be placed under a single folder; the path to that folder can be updated in the `docker-compose.yml` file. The default location is the `data/data-review-tool` directory.
 
-### Analysis Workflow Overview
+### **System Requirements**
 
-WIP
+The project has been developed and tested on the following systems:
 
-### System Requirements
+- macOS Monterey 12.5.1
+- Windows 11 Pro Version: 22H2
+- Ubuntu 22.04.2 LTS
 
-WIP
 
-### **Directory Structure and Description**
+The pre-built Docker images were built with Docker Desktop 4.20.0 but should work with any Docker Desktop release since 4.0.
+
+## **Directory Structure and Description**
 
 ```
 ├── .github/                       <- Directory for GitHub files
 │   ├── workflows/                 <- Directory for workflows
 ├── assets/                        <- Directory for assets
 ├── docker/                        <- Directory for docker files
 │   ├── article-relevance/         <- Directory for docker files related to article relevance prediction
+│   ├── article-relevance-retrain/ <- Directory for docker files related to article relevance retraining
 │   ├── data-review-tool/          <- Directory for docker files related to data review tool
 │   ├── entity-extraction/         <- Directory for docker files related to named entity recognition
 ├── data/                          <- Directory for data
````
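The Data Review Tool described above takes an `entity-extraction-output.zip` as one of its inputs. A minimal sketch of unpacking JSON records from such an archive with the standard library (the archive's internal layout and field names below are assumptions for illustration, not the project's documented schema):

```python
import io
import json
import zipfile

# Build a toy entity-extraction-output.zip in memory; the member name
# and JSON structure are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("article-123.json", json.dumps({
        "doi": "10.0000/example",
        "entities": [
            {"text": "Picea", "label": "TAXA"},
            {"text": "45.2N, 93.1W", "label": "GEOG"},
        ],
    }))

def load_extracted_entities(zip_bytes: bytes) -> dict:
    """Map each JSON member of the archive to its parsed contents."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".json"):
                out[name] = json.loads(zf.read(name))
    return out

records = load_extracted_entities(buf.getvalue())
print(records["article-123.json"]["entities"][0]["label"])  # TAXA
```

The companion `article-relevance-output.parquet` would typically be read with `pandas.read_parquet`, which the sketch omits to stay dependency-free.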
```diff
@@ -142,9 +170,6 @@ WIP
 │   │   ├── processed/          <- Processed data
 │   │   └── interim/            <- Temporary data location
 │   ├── data-review-tool/       <- Directory for data related to data review tool
-│   │   ├── raw/                <- Raw unprocessed data
-│   │   ├── processed/          <- Processed data
-│   │   └── interim/            <- Temporary data location
 ├── results/                    <- Directory for results
 │   ├── article-relevance/      <- Directory for results related to article relevance prediction
 │   ├── ner/                    <- Directory for results related to named entity recognition
```
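The `ner/` results referenced in the tree come from the token-level NER model described earlier. A common post-processing step for such models is grouping BIO-style token tags into entity spans; a hedged standard-library sketch (not the project's actual code; the labels and tokens are illustrative):

```python
def group_bio(tokens, tags):
    """Collapse BIO-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # start of a new entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)             # continuation of the open entity
        else:                               # "O" or inconsistent I- closes it
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:                             # flush a span open at end of input
        entities.append((" ".join(current), label))
    return entities

tokens = ["Pollen", "from", "Lake", "Superior", "dated", "9000", "BP"]
tags   = ["O", "O", "B-SITE", "I-SITE", "O", "B-AGE", "I-AGE"]
print(group_bio(tokens, tags))
# [('Lake Superior', 'SITE'), ('9000 BP', 'AGE')]
```

HuggingFace token-classification pipelines offer similar aggregation built in; the explicit version shows what that grouping actually does.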
```diff
@@ -169,10 +194,10 @@ This project is an open project, and contributions are welcome from any individu
 
 The UBC MDS project team consists of:
 
-- **Ty Andrews**
-- **Kelly Wu**
-- **Jenit Jain**
-- **Shaun Hutchinson**
+- [![ORCID](https://img.shields.io/badge/orcid-0009--0003--0699--5838-brightgreen.svg)](https://orcid.org/0009-0003-0699-5838) [Ty Andrews](http://www.ty-andrews.com)
+- [![ORCID](https://img.shields.io/badge/orcid-0009--0004--2508--4746-brightgreen.svg)](https://orcid.org/0009-0004-2508-4746) Kelly Wu
+- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--1998--3392-brightgreen.svg)](https://orcid.org/0009-0007-1998-3392) Shaun Hutchinson
+- [![ORCID](https://img.shields.io/badge/orcid-0009--0007--8913--2403-brightgreen.svg)](https://orcid.org/0009-0007-8913-2403) [Jenit Jain](https://www.linkedin.com/in/jenit-jain-0b31b0160/)
 
 Sponsors from Neotoma supporting the project are:
 
```
```diff
@@ -195,3 +220,5 @@ All products of the Neotoma Paleoecology Database are licensed under an [MIT Lic
 [issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues
 [license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge
 [license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
+[codecov-shield]: https://img.shields.io/codecov/c/github/NeotomaDB/MetaExtractor?style=for-the-badge
+[codecov-url]: https://codecov.io/gh/NeotomaDB/MetaExtractor
```

Binary assets changed:

- assets/article_prediction_flow.png (69.9 KB)
- assets/ffossils-logo-text.png (-13.4 KB)
- five further image assets (58.1 KB, 122 KB, 69.4 KB, 58.6 KB, 156 KB)
