- Example CSV data file (`data/data_example.csv`) demonstrating the full column set.
- Comprehensive test suite with coverage reporting via Codecov.
- CI pipeline with Ruff linting, pytest + coverage, and Codecov upload (`.github/workflows/ci.yml`).
- MkDocs documentation site with auto-generated API reference via mkdocstrings.
- Tutorials rewritten to reflect the actual two-pass workflow (`databus_example.py`).
- OpenSSF Best Practices badge tracking.
### Changed
- **Major refactor of the validation/upload architecture** (BU-334, BU-349): each validator now also handles insertion when a populated `databus` dict is supplied, eliminating the separate `neotomaUploader` module and reducing code duplication.
- Refactored `pull_params` into smaller, testable helper functions in `utils.py`, removing the dependency on pandas.
- Contact handling consolidated: all contact types (PI, collector, processor, analyst) now go through `valid_contact`, with chronology modeler assignment handled within `valid_chronologies`. This significantly reduces repeated code.
- Data upload now tracks inserted IDs so that data uncertainties can be linked correctly.
- Chronology handling improved to properly manage calendar years, default chronologies, and sample age linkage.
- Geopolitical unit insertion updated to handle entities like Scotland under the UK.
- Improved logging with `logging_dict` and per-file `.valid.log` output.
- Adopted Ruff as the sole linter and formatter, replacing previous tooling.
- Switched to `uv` for dependency management and script execution.
### Fixed
- Chron controls now handle calendar years properly.
- U-Th series insertion works correctly when the number of geochron indices differs from sample indices.
- Fixed dataset–publication and dataset–database linking during upload.
- Fixed collector insertion for NODE community datasets.
- Fixed variable validation to handle null values without comparing null against null.
- Numerous typos across `chroncontrols.py`, `sample.py`, `Chronology.py`, and others.
## [1.0.0] - 2025-11-27
### Added
- Support for speleothem datasets (SISAL community): U-Th series, external speleothem data, speleothem reference inserts, and entity samples.
- `ExternalSpeleothem` class and corresponding `valid_external_speleothem` validator.
- `UThSeries` class with independent insertion of U-series analytical data.
- Lead-210 (`210Pb`) community support with lead model classes and geochronology workflows.
- Ostracode surface sample support.
- Script for batch speleothem reference inserts after initial upload.
- `hash_file` and `check_file` helpers for file integrity verification before upload.
- `safe_step` wrapper for error-safe validation with automatic logging and rollback.
- `CITATION.cff` for academic citation.
- `code_of_conduct.md`.
### Changed
- Expanded contact name parsing to handle initials and periods in given names.
- Improved handling of diverse data groups across communities.
### Fixed
- Geochronology data handling for SISAL-specific dating methods.
- Entity cover insertion errors in the database layer.
- Various fixes for community-specific edge cases (NODE, 210Pb, SISAL).
# Working with Neotoma's DataBUS (Data Bulk Uploading System)
This set of Python scripts is intended to support the bulk upload of a set of records to Neotoma. It consists of three key components:
1. A folder with CSV files.
2. A YAML data template that maps data in the CSV files to the Neotoma Database.
3. A Python script that validates and uploads the data.
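For orientation, a minimal layout of these three pieces might look like the sketch below (only `data/data_example.csv` and `databus_example.py` are named in this repository; the template file name is illustrative):

```text
data/                  # 1. folder of CSV files (e.g. data/data_example.csv)
template.yml           # 2. YAML template mapping CSV columns to Neotoma
databus_example.py     # 3. Python script for validation and upload
```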
Once these three components are in place, the main script first pushes the CSV files into the `neotomaholdingtank` database. This is a temporary database intended to hold data within the Neotoma Paleoecology Database system for access by Tilia.
After the data is verified and the stewards are confident in the upload, the script is run once more with the flag `--upload = True` to push the data to Neotoma proper.
An excerpt from an example template's `metadata` section:

```yaml
metadata:
  - column: Site.name
    neotoma: ndb.sites.sitename
    vocab: False
    rowwise: True
    type: string
    ordered: False
```
The template is used to link the template CSV file (the file that will be generated by the stewards' upload team) to the Neotoma database. It is a form of cross-walk between the upload team and the existing database structure.
All YAML files should begin with an `apiVersion` header that indicates we are using `neotoma v2.0`. This is the current API version for Neotoma (accessible through [api.neotomadb.org](https://api.neotomadb.org)). This field is intended to support future development of the Neotoma API.
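For example, a template file would open with something like this minimal sketch (see the repository's example templates for the canonical form):

```yaml
apiVersion: neotoma v2.0
```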
Each entry in the `metadata` tab can have the following entries:
* `neotoma`: A database table and column combination from the database schema.
* `vocab`: If there is a fixed vocabulary for the column, include the possible terms here.
* `rowwise`: [`true`, `false`] Is each entry unique and tied to its row (`false`: this is not a set of repeated values), or is this a set of entries associated with the site (`true`: a single value repeats throughout)?
* `type`: [`integer`, `numeric`, `date`, `str`] The variable type for the field.
In this case we see that the team has chosen to create a column in their spreadsheet called `Coordinate.precision`, linked to the Neotoma table/column `ndb.collectionunits.location`. We state that it requires one term from a fixed vocabulary, that each value is expected to be a `str` (as opposed to an `integer` or `numeric` value), and that each row has its own value.
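For illustration, such an entry might look like the following sketch (the `vocab` terms shown are hypothetical placeholders, not the project's actual controlled vocabulary):

```yaml
- column: Coordinate.precision
  neotoma: ndb.collectionunits.location
  vocab: ["gps", "map-based", "estimated"]  # hypothetical terms
  rowwise: True
  type: str
  ordered: False
```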
A complete list of Neotoma tables and columns is included in [`tablecolumns.csv`](docs/tablecolumns.csv), and additional support for table concepts and content can be found either in the [Neotoma Paleoecology Database Manual](https://open.neotomadb.org/manual) or in the [online database schema](https://open.neotomadb.org/dbschema).
On completion of the YAML file, each column of the CSV will have an entry that follows the format described above.
We execute the validation process by running the following (see [`databus_example.py`](databus_example.py) for the full example script):
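The exact invocation lives in the example script; what follows is a sketch of the two-pass run, assuming the script is executed with `uv` as elsewhere in this README (argument spelling is illustrative):

```bash
# Pass 1: validate each CSV file against the YAML template.
# Per-file .valid.log output is written; nothing is uploaded.
uv run python databus_example.py

# Pass 2: once stewards are confident in the validation logs,
# run again with the upload flag to push the records onward.
uv run python databus_example.py --upload=True
```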
The upload process will return the distinct siteids and related data identifiers for the uploads.
## Contributors
This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.

Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/DataBUS/network/members) or [project branches](https://github.com/NeotomaDB/DataBUS/branches).
Before submitting a pull request, please ensure that:
* All existing tests pass: `uv run pytest tests/`
* Code passes Ruff linting and formatting: `uv run ruff check src/` and `uv run ruff format --check src/`
* New functionality includes corresponding tests in the `tests/` directory
These checks are enforced automatically by the [CI workflow](.github/workflows/ci.yml) on every push and pull request.
All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE.md) unless otherwise noted.
## Reporting a Vulnerability

If you discover a security vulnerability in DataBUS, please report it privately. **Do not open a public GitHub issue.**
**Email:** dominguezvid@wisc.edu
Please include:
- A description of the vulnerability
- Steps to reproduce the issue
- Any relevant logs or screenshots
## Response Timeline
We aim to acknowledge all vulnerability reports within **14 days** and will provide an update on next steps within **30 days**.
## Scope
DataBUS is a data validation and upload tool that connects to PostgreSQL databases. Security concerns most relevant to this project include:
- SQL injection via user-supplied CSV or YAML input (see the sketch after this list)
- Credential exposure in `.env` files or logs
- Dependency vulnerabilities in Python packages
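On the first point, here is a minimal sketch of the parameterized-query pattern that keeps such input inert, assuming `psycopg2` (the project's actual database layer may differ; `ndb.sites.sitename` is borrowed from the template example in the README):

```python
import psycopg2

# Hypothetical DSN; real credentials should come from the environment,
# never from committed .env files or logs.
conn = psycopg2.connect("dbname=neotomaholdingtank")

with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # A hostile CSV cell is harmless when passed as a bound parameter
        # rather than interpolated into the SQL string.
        sitename = "Lake X'); DROP TABLE ndb.sites; --"
        cur.execute(
            "INSERT INTO ndb.sites (sitename) VALUES (%s) RETURNING siteid;",
            (sitename,),
        )
        siteid = cur.fetchone()[0]

conn.close()
```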
## Disclosure Policy
We follow coordinated disclosure. Once a fix is available, we will publish details in the [CHANGELOG](CHANGELOG.md) and, if applicable, issue a GitHub Security Advisory.