- Example CSV data file (`data/data_example.csv`) demonstrating the full column set.
- Comprehensive test suite with coverage reporting via Codecov.
- CI pipeline with Ruff linting, pytest + coverage, and Codecov upload (`.github/workflows/ci.yml`).
- MkDocs documentation site with auto-generated API reference via mkdocstrings.
- Tutorials rewritten to reflect the actual two-pass workflow (`databus_example.py`).
- OpenSSF Best Practices badge tracking.
### Changed
- **Major refactor of the validation/upload architecture** (BU-334, BU-349): each validator now also handles insertion when a populated `databus` dict is supplied, eliminating the separate `neotomaUploader` module and reducing code duplication.
- Refactored `pull_params` into smaller, testable helper functions in `utils.py`, removing the dependency on pandas.
- Contact handling consolidated: all contact types (PI, collector, processor, analyst) now go through `valid_contact`, with chronology modeler assignment handled within `valid_chronologies`. This significantly reduces repeated code.
- Data upload now tracks inserted IDs so that data uncertainties can be linked correctly.
- Chronology handling improved to properly manage calendar years, default chronologies, and sample age linkage.
- Geopolitical unit insertion updated to handle entities like Scotland under the UK.
- Improved logging with `logging_dict` and per-file `.valid.log` output.
- Adopted Ruff as the sole linter and formatter, replacing previous tooling.
- Switched to `uv` for dependency management and script execution.
### Fixed
- Chron controls now handle calendar years properly.
- U-Th series insertion works correctly when the number of geochron indices differs from sample indices.
- Fixed dataset–publication and dataset–database linking during upload.
- Fixed collector insertion for NODE community datasets.
- Fixed variable validation to handle null values without comparing null against null.
- Numerous typos across `chroncontrols.py`, `sample.py`, `Chronology.py`, and others.
## [1.0.0] - 2025-11-27
### Added
- Support for speleothem datasets (SISAL community): U-Th series, external speleothem data, speleothem reference inserts, and entity samples.
- `ExternalSpeleothem` class and corresponding `valid_external_speleothem` validator.
- `UThSeries` class with independent insertion of U-series analytical data.
- Lead-210 (`210Pb`) community support with lead model classes and geochronology workflows.
- Ostracode surface sample support.
- Script for batch speleothem reference inserts after initial upload.
- `hash_file` and `check_file` helpers for file integrity verification before upload.
- `safe_step` wrapper for error-safe validation with automatic logging and rollback.
- `CITATION.cff` for academic citation.
- `code_of_conduct.md`.
### Changed
- Expanded contact name parsing to handle initials and periods in given names.
- Improved handling of diverse data groups across communities.
### Fixed
- Geochronology data handling for SISAL-specific dating methods.
- Entity cover insertion errors in the database layer.
- Various fixes for community-specific edge cases (NODE, 210Pb, SISAL).
# Working with Neotoma's DataBUS (Data Bulk Uploading System)
This set of Python scripts is intended to support the bulk upload of a set of records to Neotoma. It consists of three key components:
1. A folder with CSV files.
2. A YAML data template that maps data in the CSV files to the Neotoma Database.
3. A Python script that validates and uploads the data.
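For orientation, a minimal layout of these three pieces might look like the sketch below (only `data/data_example.csv` and `databus_example.py` are named in this repository; the template file name is illustrative):

```text
data/                  # 1. folder of CSV files (e.g. data/data_example.csv)
template.yml           # 2. YAML template mapping CSV columns to Neotoma
databus_example.py     # 3. Python script for validation and upload
```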
Once these three components are in place, the main script first pushes the CSV files into the `neotomaholdingtank` database. This is a temporary database intended to hold data within the Neotoma Paleoecology Database system for access by Tilia.
After the data is verified and the stewards are confident in the upload, the script is run once more with the flag `--upload = True` to push the data to Neotoma proper.
An excerpt from an example template's `metadata` section:

```yaml
metadata:
  - column: Site.name
    neotoma: ndb.sites.sitename
    vocab: False
    rowwise: True
    type: string
    ordered: False
```
The template is used to link the template CSV file (the file that will be generated by the stewards' upload team) to the Neotoma database. It is a form of cross-walk between the upload team and the existing database structure.
All YAML files should begin with an `apiVersion` header that indicates we are using `neotoma v2.0`. This is the current API version for Neotoma (accessible through [api.neotomadb.org](https://api.neotomadb.org)). This field is intended to support future development of the Neotoma API.
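For example, a template file would open with something like this minimal sketch (see the repository's example templates for the canonical form):

```yaml
apiVersion: neotoma v2.0
```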
Each entry in the `metadata` tab can have the following entries:
* `neotoma`: A database table and column combination from the database schema.
* `vocab`: If there is a fixed vocabulary for the column, include the possible terms here.
* `rowwise`: [`true`, `false`] Is each entry unique and tied to its row (`false`: this is not a set of repeated values), or is this a set of entries associated with the site (`true`: a single value repeats throughout)?
* `type`: [`integer`, `numeric`, `date`, `str`] The variable type for the field.
In this case we see that the team has chosen to create a column in their spreadsheet called `Coordinate.precision`, linked to the Neotoma table/column `ndb.collectionunits.location`. We state that it requires one term from a fixed vocabulary, that each value is expected to be a `str` (as opposed to an `integer` or `numeric` value), and that each row has its own value.
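For illustration, such an entry might look like the following sketch (the `vocab` terms shown are hypothetical placeholders, not the project's actual controlled vocabulary):

```yaml
- column: Coordinate.precision
  neotoma: ndb.collectionunits.location
  vocab: ["gps", "map-based", "estimated"]  # hypothetical terms
  rowwise: True
  type: str
  ordered: False
```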
A complete list of Neotoma tables and columns is included in [`tablecolumns.csv`](docs/tablecolumns.csv), and additional support for table concepts and content can be found either in the [Neotoma Paleoecology Database Manual](https://open.neotomadb.org/manual) or in the [online database schema](https://open.neotomadb.org/dbschema).
On completion of the YAML file, each column of the CSV will have an entry that follows the format described above.
We execute the validation process by running the following (see [`databus_example.py`](databus_example.py) for the full example script):
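The exact invocation lives in the example script; what follows is a sketch of the two-pass run, assuming the script is executed with `uv` as elsewhere in this README (argument spelling is illustrative):

```bash
# Pass 1: validate each CSV file against the YAML template.
# Per-file .valid.log output is written; nothing is uploaded.
uv run python databus_example.py

# Pass 2: once stewards are confident in the validation logs,
# run again with the upload flag to push the records onward.
uv run python databus_example.py --upload=True
```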
The upload process will return the distinct siteids and related data identifiers for the uploads.
## Contributors
This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.

Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/DataBUS/network/members) or [project branches](https://github.com/NeotomaDB/DataBUS/branches).
Before submitting a pull request, please ensure that:
* All existing tests pass: `uv run pytest tests/`
* Code passes Ruff linting and formatting: `uv run ruff check src/` and `uv run ruff format --check src/`
* New functionality includes corresponding tests in the `tests/` directory
These checks are enforced automatically by the [CI workflow](.github/workflows/ci.yml) on every push and pull request.
All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE.md) unless otherwise noted.
## Reporting a Vulnerability

If you discover a security vulnerability in DataBUS, please report it privately. **Do not open a public GitHub issue.**
**Email:** dominguezvid@wisc.edu
Please include:
- A description of the vulnerability
- Steps to reproduce the issue
- Any relevant logs or screenshots
## Response Timeline
We aim to acknowledge all vulnerability reports within **14 days** and will provide an update on next steps within **30 days**.
## Scope
DataBUS is a data validation and upload tool that connects to PostgreSQL databases. Security concerns most relevant to this project include:
- SQL injection via user-supplied CSV or YAML input (see the sketch after this list)
- Credential exposure in `.env` files or logs
- Dependency vulnerabilities in Python packages
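On the first point, here is a minimal sketch of the parameterized-query pattern that keeps such input inert, assuming `psycopg2` (the project's actual database layer may differ; `ndb.sites.sitename` is borrowed from the template example in the README):

```python
import psycopg2

# Hypothetical DSN; real credentials should come from the environment,
# never from committed .env files or logs.
conn = psycopg2.connect("dbname=neotomaholdingtank")

with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # A hostile CSV cell is harmless when passed as a bound parameter
        # rather than interpolated into the SQL string.
        sitename = "Lake X'); DROP TABLE ndb.sites; --"
        cur.execute(
            "INSERT INTO ndb.sites (sitename) VALUES (%s) RETURNING siteid;",
            (sitename,),
        )
        siteid = cur.fetchone()[0]

conn.close()
```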
## Disclosure Policy
We follow coordinated disclosure. Once a fix is available, we will publish details in the [CHANGELOG](CHANGELOG.md) and, if applicable, issue a GitHub Security Advisory.