Commit dd6550d (1 parent: 8d99a73): documentation, changelog, security mds

6 files changed: 162 additions & 19 deletions
CHANGELOG.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@ (new file)

# Changelog

All notable changes to the DataBUS project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- Universal YAML template (`data/template_example.yml`).
- Example CSV data file (`data/data_example.csv`) demonstrating the full column set.
- Comprehensive test suite with coverage reporting via Codecov.
- CI pipeline with Ruff linting, pytest + coverage, and Codecov upload (`.github/workflows/ci.yml`).
- MkDocs documentation site with auto-generated API reference via mkdocstrings.
- Tutorials rewritten to reflect the actual two-pass workflow (`databus_example.py`).
- OpenSSF Best Practices badge tracking.

### Changed

- **Major refactor of the validation/upload architecture** (BU-334, BU-349): each validator now also handles insertion when a populated `databus` dict is supplied, eliminating the separate `neotomaUploader` module and reducing code duplication.
- Refactored `pull_params` into smaller, testable helper functions in `utils.py`, removing the dependency on pandas.
- Contact handling consolidated: all contact types (PI, collector, processor, analyst) now go through `valid_contact`, with chronology modeler assignment handled within `valid_chronologies`. This significantly reduces repeated code.
- Data upload now tracks inserted IDs so that data uncertainties can be linked correctly.
- Chronology handling improved to properly manage calendar years, default chronologies, and sample age linkage.
- Geopolitical unit insertion updated to handle entities like Scotland under the UK.
- Improved logging with `logging_dict` and per-file `.valid.log` output.
- Adopted Ruff as the sole linter and formatter, replacing previous tooling.
- Switched to `uv` for dependency management and script execution.

### Fixed

- Chron controls now handle calendar years properly.
- U-Th series insertion works correctly when the number of geochron indices differs from sample indices.
- Fixed dataset–publication and dataset–database linking during upload.
- Fixed collector insertion for NODE community datasets.
- Fixed variable validation to handle null values without comparing null against null.
- Numerous typos across `chroncontrols.py`, `sample.py`, `Chronology.py`, and others.

## [1.0.0] - 2025-11-27

### Added

- Support for speleothem datasets (SISAL community): U-Th series, external speleothem data, speleothem reference inserts, and entity samples.
- `ExternalSpeleothem` class and corresponding `valid_external_speleothem` validator.
- `UThSeries` class with independent insertion of U-series analytical data.
- Lead-210 (`210Pb`) community support with lead model classes and geochronology workflows.
- Ostracode surface sample support.
- Script for batch speleothem reference inserts after initial upload.
- `hash_file` and `check_file` helpers for file integrity verification before upload.
- `safe_step` wrapper for error-safe validation with automatic logging and rollback.
- `CITATION.cff` for academic citation.
- `code_of_conduct.md`.

### Changed

- Expanded contact name parsing to handle initials and periods in given names.
- Improved handling of diverse data groups across communities.

### Fixed

- Geochronology data handling for SISAL-specific dating methods.
- Entity cover insertion errors in the database layer.
- Various fixes for community-specific edge cases (NODE, 210Pb, SISAL).

## [0.0.1] - 2023-11-15

### Added

- Initial release of DataBUS.
- Core data classes: `Site`, `CollectionUnit`, `AnalysisUnit`, `Sample`, `Dataset`, `Datum`, `Variable`, `Chronology`, `ChronControl`, `Geochron`, `GeochronControl`, `Contact`, `Geog`, `Hiatus`, `Response`.
- Validation framework with `neotomaValidator` module.
- Helper utilities: `template_to_dict`, `read_csv`, `pull_params`, `pull_required`.
- CLI argument parsing via `parse_arguments`.
- Basic pollen dataset upload workflow.

[Unreleased]: https://github.com/NeotomaDB/DataBUS/compare/v1.0.0...HEAD
[1.0.0]: https://github.com/NeotomaDB/DataBUS/compare/v0.0.1...v1.0.0
[0.0.1]: https://github.com/NeotomaDB/DataBUS/releases/tag/v0.0.1

README.md

Lines changed: 41 additions & 14 deletions
@@ -7,15 +7,19 @@
 [![codecov](https://codecov.io/gh/NeotomaDB/DataBUS/branch/main/graph/badge.svg)](https://codecov.io/gh/NeotomaDB/DataBUS)
 <!-- badges: end -->

-# Working with the Python Data Upload Template
+# Working with Neotoma's DataBUS (Data Bulk Uploading System)

-This set of python scripts is intended to support the bulk upload of a set of records to Neotoma. It consists of three key steps:
+This set of Python scripts is intended to support the bulk upload of a set of records to Neotoma.

-1. Development of a data template (YAML and CSV)
-2. Template validation
-3. Data upload
+It consists of three key components:

-Once these three steps are completed the uploader will push the template files to the `neotomaholding` database. This is a temporary database that is intended to hold data within the Neotoma Paleoecology Database system for access by Tilia. Tilia is then used to provide a final data check and upload of data to Neotoma proper.
+1. A folder of CSV files.
+2. A YAML data template that maps data in the CSV files to the Neotoma Database.
+3. A Python script that validates and uploads the data.
+
+Once these three components are in place, the main script first pushes the CSV files into the `neotomaholdingtank` database. This is a temporary database intended to hold data within the Neotoma Paleoecology Database system for access by Tilia.
+
+After the data is verified and the stewards feel confident in the upload, the script is run once more with the flag `--upload True` to upload the data to Neotoma proper.

 ![The process of uploading records using the bulk uploader. Individuals follow the steps outlined above and described further in this README file.](img/BulkUploaderSchema.svg)

@@ -32,12 +36,12 @@ metadata:
 - column: Site.name
   neotoma: ndb.sites.sitename
   vocab: False
-  repeat: True
+  rowwise: True
   type: string
   ordered: False
 ```

-The template is used to link the template CSV file (the file that will be generated by the upload team) to the Neotoma database. It is a form of cross-walk between the upload team and the existing database structure.
+The template is used to link the template CSV file (the file that will be generated by the stewards' upload team) to the Neotoma database. It is a form of cross-walk between the upload team and the existing database structure.

 All YAML files should begin with an `apiVersion` header that indicates we are using `neotoma v2.0`. This is the current API version for Neotoma (accessible through [api.neotomadb.org](https://api.neotomadb.org)). This field is intended to support future development of the Neotoma API.

@@ -51,19 +55,18 @@ Each entry in the `metadata` tab can have the following entries:
 * `neotoma`: A database table and column combination from the database schema.
 * `vocab`: If there is a fixed vocabulary for the column, include the possible terms here.
 * `rowwise`: [`true`, `false`] Is each entry unique and tied to the row (`false`, this isn't a set of repeated values), or is this a set of entries associated with the site (`true`, there is only a single value that repeats throughout)?
-* `type`: [`integer`, `numeric`, `date`] The variable type for the field.
+* `type`: [`integer`, `numeric`, `date`, `str`] The variable type for the field.

 ```yaml
 metadata:
   - column: Coordinate.precision
     neotoma: ndb.collectionunits.location
     vocab: ['core-site','GPS','core-site approximate','lake center']
-    repeat: True
-    type: character
-    ordered: False
+    rowwise: True
+    type: str
 ```

-In this case we see that the team has chosen to create a column in their spreadsheet called `Coordinate.precision`, it is linked to the Neotoma table/column `ndb.collectionunits.location`. We state that it requires one term from a fixed vocabulary, the value repeats within the column, it is expected to be a `character` (as opposed to an `integer` or `numeric` value) and the order of the values does not matter.
+In this case the team has chosen to create a column in their spreadsheet called `Coordinate.precision`, linked to the Neotoma table/column `ndb.collectionunits.location`. We state that it requires one term from a fixed vocabulary, that the value repeats within the column (`rowwise: True`), and that it is expected to be a `str` (as opposed to an `integer` or `numeric` value).

 A complete list of Neotoma tables and columns is included in [`tablecolumns.csv`](docs/tablecolumns.csv), and additional support for table concepts and content can be found either in the [Neotoma Paleoecology Database Manual](https://open.neotomadb.org/manual) or in the [online database schema](https://open.neotomadb.org/dbschema).
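To make the crosswalk concrete, the sketch below checks a single CSV cell against one template entry's `vocab` and `type` declarations. The helper name `check_cell` and the coercion rules are illustrative assumptions, not DataBUS's actual API; the real validators live inside the DataBUS package:

```python
# Illustrative only: shows the kind of per-cell check the YAML template drives.
CASTS = {"integer": int, "numeric": float, "date": str, "str": str, "string": str}


def check_cell(value: str, entry: dict) -> list:
    """Return a list of validation errors for one CSV cell (hypothetical helper)."""
    errors = []
    vocab = entry.get("vocab")
    # vocab is either False (no fixed vocabulary) or a list of allowed terms.
    if isinstance(vocab, list) and value not in vocab:
        errors.append(f"{value!r} not in fixed vocabulary {vocab}")
    cast = CASTS.get(entry.get("type", "str"))
    if cast is not None:
        try:
            cast(value)
        except ValueError:
            errors.append(f"{value!r} is not a valid {entry['type']}")
    return errors


entry = {"column": "Coordinate.precision",
         "neotoma": "ndb.collectionunits.location",
         "vocab": ["core-site", "GPS", "core-site approximate", "lake center"],
         "rowwise": True, "type": "str"}

print(check_cell("GPS", entry))        # passes: in vocabulary, valid str
print(check_cell("satellite", entry))  # fails the vocabulary check
```

Running a check like this over every column named in the template is what produces the per-file validation log described below.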

@@ -76,7 +79,7 @@ On completion of the YAML file, each column of the CSV will have an entry that f
 We execute the validation process by running (see [`databus_example.py`](databus_example.py) for the full example script):

 ```bash
-uv run databus_example.py --data FILEFOLDER/ --template template.yml --logs FILEFOLDER/logs/ --upload False
+uv run databus_example.py --data FILEFOLDER/ --template template.yml --upload False
 ```

 This will then search the folder provided in `FILEFOLDER` for csv files and parse them for validity.
@@ -113,10 +116,34 @@ The validation step identifies each element of the template being validated, pro

 ## Upload

+The script is run a second time for the upload; if the validation pass has not been run first, there will be no validation logs and the upload will not be allowed.
+
 The upload process is initiated using the command:

 ```bash
 uv run databus_example.py --data FILEFOLDER/ --template template.yml --logs FILEFOLDER/logs/ --upload True
 ```

 The upload process will return the distinct siteids, and related data identifiers for the uploads.
+
+## Contributors
+
+This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
+
+* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez](https://ht-data.com/about)
+* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)
+
+### Tips for Contributing
+
+Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/DataBUS/network/members) or [project branches](https://github.com/NeotomaDB/DataBUS/branches).
+
+Before submitting a pull request, please ensure that:
+
+* All existing tests pass: `uv run pytest tests/`
+* Code passes Ruff linting and formatting: `uv run ruff check src/` and `uv run ruff format --check src/`
+* New functionality includes corresponding tests in the `tests/` directory
+
+These checks are enforced automatically by the [CI workflow](.github/workflows/ci.yml) on every push and pull request.
+
+All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE.md) unless otherwise noted.

SECURITY.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@ (new file)

# Security Policy

## Supported Versions

| Version | Supported          |
|---------|--------------------|
| 1.0.x   | :white_check_mark: |
| < 1.0   | :x:                |

## Reporting a Vulnerability

If you discover a security vulnerability in DataBUS, please report it privately. **Do not open a public GitHub issue.**

**Email:** [dominguezvid@wisc.edu]

Please include:

- A description of the vulnerability
- Steps to reproduce the issue
- Any relevant logs or screenshots

## Response Timeline

We aim to acknowledge all vulnerability reports within **14 days** and will provide an update on next steps within **30 days**.

## Scope

DataBUS is a data validation and upload tool that connects to PostgreSQL databases. Security concerns most relevant to this project include:

- SQL injection via user-supplied CSV or YAML input
- Credential exposure in `.env` files or logs
- Dependency vulnerabilities in Python packages
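The SQL-injection concern above is conventionally mitigated by passing user-supplied CSV/YAML values as bound parameters instead of interpolating them into SQL strings. A minimal sketch using the stdlib `sqlite3` driver for illustration (DataBUS targets PostgreSQL, where the same pattern applies with the driver's `%s` placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (sitename TEXT)")

# Value as it might arrive from an untrusted CSV cell.
sitename = "Lake'); DROP TABLE sites; --"

# Parameterized insert: the driver binds the value, so the payload is
# stored as plain text rather than executed as SQL.
conn.execute("INSERT INTO sites (sitename) VALUES (?)", (sitename,))

rows = conn.execute("SELECT sitename FROM sites").fetchall()
print(rows)
```

Building the statement with string formatting (`f"... VALUES ('{sitename}')"`) would instead hand the payload to the database as executable SQL.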
## Disclosure Policy

We follow coordinated disclosure. Once a fix is available, we will publish details in the [CHANGELOG](CHANGELOG.md) and, if applicable, issue a GitHub Security Advisory.

docs/how-to-guide.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@ (new file)

# How-To Guide

- Coming soon.

docs/reference.md

Lines changed: 0 additions & 2 deletions
@@ -19,8 +19,6 @@ Core classes representing the fundamental data models used throughout the DataBU
 ::: DataBUS.Geog
 ::: DataBUS.Hiatus
 ::: DataBUS.LeadModel
-::: DataBUS.Publication
-::: DataBUS.Repository
 ::: DataBUS.Response
 ::: DataBUS.Sample
 ::: DataBUS.SampleAge

mkdocs.yml

Lines changed: 2 additions & 3 deletions
@@ -11,10 +11,9 @@ plugins:

 nav:
   - DataBUS Docs: index.md
-  - tutorials.md
+  - Tutorials: tutorials.md
   - How-To Guides: how-to-guide.md
-  - reference.md
-  - explanation.md
+  - Documentation: reference.md

 extra:
   version: 1.0
