24 changes: 24 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,29 @@
# Changelog

## [2025-10-17T06:42:38-04:00 (America/New_York)]
### Changed
- Added `entity_label` to triplet CSV rows generated by `scripts/generate_synthetic_dataset.py` and refreshed ingestion
documentation (`docs/retrieval.md`, `README.md`, `docs/operations.md`, `docs/testing.md`, `SETUP.md`) plus planning collateral
(`PROJECT.md`, `PLAN.md`, `ROADMAP.md`, `SOT.md`, `ENVIRONMENT_NEEDS.md`, `NEEDED_FOR_TESTING.md`, `PLANNING_THOUGHTS.md`,
`ISSUES.md`, `TODO.md`, `RESUME_NOTES.md`) so synthetic dataset guidance stays accurate.

## [2025-10-16T22:44:21-04:00 (America/New_York)]
### Changed
- Simplified roadmap section headings in `ROADMAP.md` by removing week estimates from the horizon labels to
emphasise qualitative prioritisation.

## [2025-10-16T21:44:46-04:00 (America/New_York)]
### Added
- Documented a synthetic dataset ingestion workflow in `docs/retrieval.md` (including sample loader code) so benchmarking
runs can hydrate graph drivers without recomputing embeddings.

### Changed
- Expanded operations, setup, and environment guides (`docs/operations.md`, `SETUP.md`, `ENVIRONMENT_NEEDS.md`,
`NEEDED_FOR_TESTING.md`) with batching/verification tips for loading generated JSONL/CSV corpora.
- Updated core documentation and planning artifacts (`README.md`, `PROJECT.md`, `PLAN.md`, `ROADMAP.md`, `SOT.md`,
`RECOMMENDATIONS.md`, `PLANNING_THOUGHTS.md`, `ISSUES.md`, `RESUME_NOTES.md`, `TODO.md`) to reference the ingestion workflow
and capture the follow-up automation task.

## [2025-10-16T20:39:06-04:00 (America/New_York)]
### Added
- Added live integration coverage for Memgraph, Neo4j, and Redis via `meshmind/tests/test_integration_live.py` and configured
5 changes: 4 additions & 1 deletion ENVIRONMENT_NEEDS.md
@@ -24,7 +24,10 @@
consolidation heuristics and pagination under load. The new
`scripts/generate_synthetic_dataset.py` utility produces JSONL/CSV corpora
(defaults: 10k memories, 20k triplets, 384-dim embeddings) that can be copied to
shared storage for on-demand benchmarking.
shared storage for on-demand benchmarking. Triplet rows now embed `entity_label`,
so pairing the shared datasets with the ingestion workflow documented in
`docs/retrieval.md` lets operators seed environments quickly without recomputing
embeddings or rewriting CSV headers.
- Maintain outbound package download access to PyPI and vendor repositories; this
session confirmed package installation works when the network is open, and future
sessions need the same capability to refresh locks or install new optional
4 changes: 3 additions & 1 deletion ISSUES.md
@@ -35,7 +35,9 @@

- [ ] Validate the new Docker Compose stacks (root and `meshmind/tests/docker/`) on an environment with container support and document host requirements (ports, resources).
## Low Priority / Nice to Have
- [x] Align synthetic dataset triplet CSV headers with `Triplet` schema (added `entity_label`) and refresh ingestion docs.
- [x] Remove week-based horizon estimates from roadmap headings to avoid implying precise delivery dates in planning docs.
- [x] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development.
- [x] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics (CLI admin subcommands now expose predicates, telemetry, and graph checks).
- [ ] Publish onboarding guides and troubleshooting FAQs for contributors.
- [ ] Publish onboarding guides and troubleshooting FAQs for contributors (synthetic dataset ingestion docs landed in `docs/retrieval.md`, but a broader newcomer guide is still pending).
- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring.
2 changes: 1 addition & 1 deletion NEEDED_FOR_TESTING.md
@@ -69,7 +69,7 @@
external services are unavailable.
- Use `meshmind/testing` fakes (`FakeMemgraphDriver`, `FakeRedisBroker`, `FakeEmbeddingEncoder`, `FakeLLMClient`) in tests or demos to eliminate external infrastructure requirements. Integration suites marked with `@pytest.mark.integration` exercise live Memgraph/Neo4j/Redis instances and expect the docker stack to be running.
- Invoke `meshmind admin predicates` and `meshmind admin maintenance --max-attempts <n> --base-delay <seconds> --run <task>` during local runs to inspect predicate registries, telemetry, and tune maintenance retries without external services.
- Use the benchmarking utilities in `scripts/` (`evaluate_importance.py`, `consolidation_benchmark.py`, `benchmark_pagination.py`) to validate heuristics and driver performance offline before connecting to live infrastructure. Generate large corpora with `scripts/generate_synthetic_dataset.py` when you need ≥10k memories for stress tests.
- Use the benchmarking utilities in `scripts/` (`evaluate_importance.py`, `consolidation_benchmark.py`, `benchmark_pagination.py`) to validate heuristics and driver performance offline before connecting to live infrastructure. Generate large corpora with `scripts/generate_synthetic_dataset.py` when you need ≥10k memories for stress tests; triplet CSV rows now ship with `entity_label`, so the ingestion workflow in `docs/retrieval.md` can hydrate graph drivers without extra mutation.
- Seed demo data as needed using the `examples/extract_preprocess_store_example.py` script after configuring environment
variables.
- Create a `.env` file storing the environment variables above for consistent local configuration.
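As a quick sanity check before pointing benchmarks at a generated corpus, the JSONL shape can be verified with a short script. A minimal sketch: only the 384-dim embedding default comes from the generator's documented defaults, and the field names below are assumptions standing in for whatever `scripts/generate_synthetic_dataset.py` actually emits.

```python
import json
import random

# Stand-in for the JSONL layout emitted by scripts/generate_synthetic_dataset.py;
# field names here are illustrative assumptions, not the script's real schema.
def synth_memories(count, dim=384, seed=7):
    rng = random.Random(seed)
    for i in range(count):
        yield {
            "id": f"mem-{i:05d}",
            "text": f"synthetic memory {i}",
            "embedding": [rng.random() for _ in range(dim)],
        }

# Serialize a tiny corpus and verify row count plus embedding width,
# the two properties a truncated or mis-generated file usually breaks.
lines = [json.dumps(record) for record in synth_memories(count=25)]
first = json.loads(lines[0])
print(len(lines), len(first["embedding"]))  # 25 384
```

Running the same two checks against a real generated file (line count, embedding length of the first record) is cheap insurance before a long ingestion run.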
5 changes: 4 additions & 1 deletion PLAN.md
@@ -1,5 +1,7 @@
# Plan of Action

Roadmap milestones now reference qualitative horizons (Near/Mid/Long-Term) instead of week estimates to focus this plan on sequencing rather than timeboxing.

## Phase 1 – Stabilize Runtime Basics ✅
1. **Dependency Guards** – Implemented lazy driver factories, optional imports, and clear ImportErrors for missing packages.
2. **Default Encoder Registration** – Bootstraps register encoders/entities automatically and the CLI invokes them on startup.
@@ -20,7 +22,8 @@
2. **Maintenance Tasks** – Tasks emit telemetry, persist consolidation/compression results, and now retry conflicting writes with
configurable exponential backoff (`MAINTENANCE_MAX_ATTEMPTS`, `MAINTENANCE_BASE_DELAY_SECONDS`). Synthetic benchmark scripts,
the new `scripts/generate_synthetic_dataset.py`, and integration tests against live Memgraph/Neo4j validate behaviour on larger
workloads; next, replay production-like datasets to tune thresholds.
workloads. Fresh documentation in `docs/retrieval.md` and `docs/operations.md` now describes how to ingest those synthetic datasets
(with triplet CSVs that include `entity_label`) into the target backend; next, replay production-like datasets to tune thresholds.
3. **Importance Scoring Improvements** – Heuristic scoring is live, records distribution metrics via telemetry, and ships with
`scripts/evaluate_importance.py` for synthetic/offline evaluation. Next: incorporate real feedback loops or LLM-assisted
ranking to tune weights over time.
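The retry behaviour described in item 2 can be sketched as a small wrapper. The `MAINTENANCE_MAX_ATTEMPTS` and `MAINTENANCE_BASE_DELAY_SECONDS` names are the real settings; the wrapper itself is illustrative, not the actual task implementation.

```python
import random
import time

# Illustrative exponential backoff for conflicting maintenance writes;
# defaults mirror the role of MAINTENANCE_MAX_ATTEMPTS / MAINTENANCE_BASE_DELAY_SECONDS.
def with_backoff(op, max_attempts=3, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return op()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles per attempt, with a little jitter to spread retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

attempts = []

def flaky_write():
    # Simulates a write that conflicts twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("conflicting write")
    return "ok"

print(with_backoff(flaky_write))  # ok
```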
2 changes: 1 addition & 1 deletion PLANNING_THOUGHTS.md
@@ -14,7 +14,7 @@
- **Pydantic Model Policy** – Follow the documented plan (target Pydantic 2.12+, refresh locks when 3.13 wheels land, record migration guidance) to avoid resurrecting compatibility shims.

## Upcoming Research
- Benchmark consolidation heuristics on synthetic datasets representing customer scale and capture telemetry snapshots (seed data via `scripts/generate_synthetic_dataset.py`).
- Benchmark consolidation heuristics on synthetic datasets representing customer scale and capture telemetry snapshots (seed data via `scripts/generate_synthetic_dataset.py`—whose triplet CSV now includes `entity_label`—and load it using the ingestion workflow documented in `docs/retrieval.md`).
- Compare graph query latency across in-memory, SQLite, Memgraph, and Neo4j drivers when using pagination and filtering.
- Evaluate rerank quality across LLM providers using a labelled evaluation set to determine optimal default models.
- Investigate options for secure secret storage (e.g., Vault, AWS Secrets Manager) to standardise API key management.
2 changes: 1 addition & 1 deletion PROJECT.md
@@ -78,7 +78,7 @@
- Docker Compose now provisions Memgraph, Neo4j, and Redis; integration-specific stacks (including the Celery worker) live under
`meshmind/tests/docker/`. `pytest -m integration` exercises live services once the stack is running. See `ENVIRONMENT_NEEDS.md`
and `SETUP.md` for enabling optional services locally.
- `scripts/generate_synthetic_dataset.py` produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) to stress retrieval and consolidation flows prior to ingesting real datasets.
- `scripts/generate_synthetic_dataset.py` produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) to stress retrieval and consolidation flows prior to ingesting real datasets. Triplet rows ship with `entity_label` so the ingestion workflow documented in `docs/retrieval.md` hydrates graph drivers without additional preprocessing.

## Roadmap Highlights
- Push graph-backed retrieval deeper into the drivers (vector similarity, structured filters) so the new server-side filtering/pagination evolves into full backend-native ranking.
6 changes: 5 additions & 1 deletion README.md
@@ -202,7 +202,11 @@ Tasks instantiate the driver lazily, emit structured logs/metrics, and persist c
## Benchmarking & Evaluation
- **Synthetic dataset generation** – `scripts/generate_synthetic_dataset.py` creates large JSONL/CSV corpora of
memories/triplets (defaults: 10k memories, 20k triplets, 384-dim embeddings) so you can stress retrieval, consolidation,
and integration flows before ingesting real data.
and integration flows before ingesting real data. Triplet rows now ship with `entity_label` to match
`meshmind.core.types.Triplet`.
- **Synthetic dataset ingestion** – Follow the workflow documented in `docs/retrieval.md` to load the generated JSONL/CSV
payloads into MeshMind via the Python client. The operations guide walks through batching tips and post-ingestion
verification so benchmark runs start from a consistent baseline.
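The batching tips mentioned above can be illustrated with a short loop. This is a hypothetical sketch: the docs do not show the MeshMind Python client's method names, so `store_batch` below is a placeholder for whatever the client actually exposes.

```python
# Hypothetical batched-ingestion loop; `store_batch` is a stand-in for the
# real MeshMind Python client call, which is not specified here.
def batched(items, size):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def store_batch(batch):
    # Placeholder: a real run would hand the batch to the MeshMind client.
    return len(batch)

records = ({"id": i} for i in range(250))
stored = [store_batch(chunk) for chunk in batched(records, size=100)]
print(stored)  # [100, 100, 50]
```

Keeping batch sizes fixed makes the post-ingestion count verification straightforward: the stored totals should sum to the number of generated records.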
- **Importance scoring** – `scripts/evaluate_importance.py` runs the heuristic against JSON or synthetic datasets and reports
descriptive statistics for quick regression checks.
- **Consolidation throughput** – `scripts/consolidation_benchmark.py` generates synthetic workloads to measure batch merging
4 changes: 3 additions & 1 deletion RECOMMENDATIONS.md
@@ -30,7 +30,9 @@

## Documentation & Onboarding
- Keep `README.md`, `SOT.md`, `docs/`, and onboarding guides synchronized with each release; document rerank, retrieval, and
registry flows with diagrams when possible.
registry flows with diagrams when possible. The new synthetic dataset ingestion workflow in `docs/retrieval.md` should be
incorporated into future onboarding materials.
- Keep roadmap horizons qualitative (Near/Mid/Long-Term) instead of week-based estimates so planning docs emphasise sequencing and flexibility.
- Maintain the troubleshooting section for optional tooling (ruff, pyright, typeguard, toml-sort, yamllint) now referenced in
the Makefile and expand it as new developer utilities are introduced. Keep `SETUP.md` synchronized when dependencies change.
- Provide walkthroughs for configuring LLM reranking, including sample prompts and response expectations.
6 changes: 4 additions & 2 deletions RESUME_NOTES.md
@@ -10,9 +10,11 @@

## Latest Changes

- Removed week-based estimates from roadmap section headings and refreshed planning docs (`PLAN.md`, `SOT.md`, `RECOMMENDATIONS.md`, `ISSUES.md`, `TODO.md`) to emphasise qualitative sequencing.
- Added live integration coverage (`meshmind/tests/test_integration_live.py`) for Memgraph, Neo4j, and Redis, introduced a pytest marker configuration, and documented the workflow across README/SETUP/docs.
- Generated a fresh `uv.lock`, pinned `.python-version` to 3.12, and updated install docs to standardise on `uv sync --all-extras`.
- Created `scripts/generate_synthetic_dataset.py` for large JSONL/CSV corpora and referenced it across benchmarking docs.
- Created `scripts/generate_synthetic_dataset.py` for large JSONL/CSV corpora, added `entity_label` to triplet CSV rows, and referenced it across benchmarking docs.
- Documented the synthetic dataset ingestion workflow across `docs/retrieval.md`, `docs/operations.md`, README, and supporting planning guides so benchmarks can load corpora without recomputing embeddings.
- Updated documentation and planning collateral (README.md, SETUP.md, docs/development.md, docs/testing.md, docs/operations.md, PROJECT.md, PLAN.md, RECOMMENDATIONS.md, ROADMAP.md, ENVIRONMENT_NEEDS.md, NEEDED_FOR_TESTING.md, SOT.md, PLANNING_THOUGHTS.md, DUMMIES.md, TODO.md, RESUME_NOTES.md) to reflect the integration workflow, dataset generation, and the new Pydantic policy.

## Environment State
@@ -26,5 +28,5 @@
1. Address remaining `TODO.md` priority items (backend-native vector similarity, Celery worker integration, grpcurl end-to-end tests) now that graph services are accessible locally.
2. Automate the integration suite in CI and capture resource requirements for shared infrastructure.
3. Prepare grpcurl-based smoke tests for `meshmind serve-grpc` and plan protobuf client packaging once integration coverage extends beyond the Python stub.
4. Feed findings from large synthetic datasets into retry/backoff defaults and document recommended values in `ENVIRONMENT_NEEDS.md`.
4. Feed findings from large synthetic datasets into retry/backoff defaults and document recommended values in `ENVIRONMENT_NEEDS.md`, validating the new ingestion workflow as part of those runs.
5. Continue tracking shim retirements in `DUMMIES.md` and follow the cleanup plan in `CLEANUP.md` so remaining fakes can be removed when infrastructure allows.
8 changes: 4 additions & 4 deletions ROADMAP.md
@@ -5,21 +5,21 @@
- Support multiple graph backends (in-memory, SQLite, Memgraph, Neo4j) with consistent telemetry, maintenance, and LLM orchestration knobs.
- Provide developers with reproducible tooling, comprehensive documentation, and automation scripts that keep local and CI environments aligned.

## Near-Term (0–2 Weeks)
## Near-Term
- Automate the new integration suite (`pytest -m integration`) in CI so Memgraph/Neo4j/Redis regressions fail fast.
- Finalize maintenance write policies by implementing retry/backoff semantics and measuring consolidation accuracy against representative datasets (now aided by `scripts/generate_synthetic_dataset.py`).
- Finalize maintenance write policies by implementing retry/backoff semantics and measuring consolidation accuracy against representative datasets (now aided by `scripts/generate_synthetic_dataset.py`, whose triplet CSV exposes `entity_label`, and the documented ingestion workflow in `docs/retrieval.md`).
- Publish ROADMAP and PLANNING_THOUGHTS artifacts, and seed the `research/` folder with competitive analysis to ground prioritization discussions.
- Expand automated smoke tests for REST `/memories/counts`, CLI `meshmind admin counts`, and provisioning scripts to ensure guardrails stay trustworthy.
- Capture outstanding shim retirement work (FastAPI tests now live; continue tracking FakeLLM/Fake drivers) in CLEANUP.md with precise acceptance criteria for each removal.

## Mid-Term (2–6 Weeks)
## Mid-Term
- Run load tests against SQLite and hosted graph backends to tune pagination defaults, consolidation heuristics, and token compression strategies.
- Implement backend-native vector similarity queries and schema indexes so embeddings never leave the database during scoring.
- Finalise the gRPC surface by building on the new asyncio server helpers—exercise the `meshmind serve-grpc` CLI entry point within Docker Compose, publish generated clients (Python + additional languages), and add integration smoke tests so external agents can integrate without the in-process stub.
- Instrument observability exports (Prometheus/OpenTelemetry) and wire dashboards/alerts for ingestion latency, queue depth, and error rates.
- Replace compatibility shims with official Pydantic/FastAPI packages once dependency constraints are lifted, and backfill validation coverage.

## Long-Term (6+ Weeks)
## Long-Term
- Build evaluation loops—analytics dashboards and LLM-assisted reviews—that continuously score memory importance heuristics and rerank quality.
- Introduce human-in-the-loop tooling for conflict resolution, allowing operators to approve merges or override automated maintenance plans.
- Explore federated deployments that synchronise multiple MeshMind instances, including replication strategies and eventual-consistency guarantees.
5 changes: 4 additions & 1 deletion SETUP.md
@@ -80,7 +80,10 @@ docker compose -f meshmind/tests/docker/memgraph.yml up -d
```

> Need synthetic load? Run `python scripts/generate_synthetic_dataset.py build/datasets/benchmark`
> to seed JSONL/CSV fixtures before loading them into Memgraph/Neo4j for stress tests.
> to seed JSONL/CSV fixtures before loading them into Memgraph/Neo4j for stress tests. Triplet rows
> now include `entity_label`, so the ingestion workflow in `docs/retrieval.md` can materialize
> `Triplet` models without mutating CSV fields. Follow the ingestion steps when copying fixtures so
> benchmarks reuse the same namespace/layout.
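As a rough illustration of the workflow the note describes, this sketch parses a triplet CSV row straight into a model-shaped record with no field mutation. The `entity_label` column is documented; the remaining column names, and the `TripletRow` dataclass standing in for `meshmind.core.types.Triplet`, are assumptions.

```python
import csv
import io
from dataclasses import dataclass

# `entity_label` is the documented column; subject/predicate/object are
# assumed names standing in for the generator's real CSV header.
SAMPLE = (
    "subject,predicate,object,entity_label\n"
    "mem-00001,mentions,Acme Corp,Organization\n"
)

@dataclass
class TripletRow:  # illustrative stand-in for meshmind.core.types.Triplet
    subject: str
    predicate: str
    object: str
    entity_label: str

def load_triplet_rows(handle):
    # Header names already match the model, so rows map across directly.
    return [TripletRow(**row) for row in csv.DictReader(handle)]

rows = load_triplet_rows(io.StringIO(SAMPLE))
print(rows[0].entity_label)  # Organization
```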

### 3.2 Cleaning up

3 changes: 2 additions & 1 deletion SOT.md
@@ -28,11 +28,12 @@ Supporting assets:
- `SETUP.md`: End-to-end provisioning instructions covering Python deps, environment variables, and Compose workflows.
- `run/install_setup.sh`, `run/maintenance_setup.sh`: Automation scripts for provisioning fresh environments and refreshing cached workspaces.
- `scripts/evaluate_importance.py`, `scripts/consolidation_benchmark.py`, `scripts/benchmark_pagination.py`: Evaluation and benchmarking tools for importance heuristics, consolidation throughput, and driver pagination performance.
- `scripts/generate_synthetic_dataset.py`: Produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) for integration and benchmark scenarios.
- `scripts/generate_synthetic_dataset.py`: Produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) for integration and benchmark scenarios. Triplet rows include `entity_label`, so the ingestion workflow in `docs/retrieval.md` stores the generated payloads without recomputing embeddings or mutating CSV fields.
- `.github/workflows/ci.yml`: GitHub Actions workflow running linting/formatting checks and pytest.
- `pyproject.toml`: Project metadata and dependency list (pins Python `>=3.11,<3.13`; see compatibility notes in `ISSUES.md`).
- Documentation (`PROJECT.md`, `PLAN.md`, `SOT.md`, `README.md`, etc.) describing the system and roadmap.
- Strategic context (`ROADMAP.md`, `PLANNING_THOUGHTS.md`, `research/overview.md`) summarising milestones, planning questions, and competitor analysis.
Roadmap horizons now use qualitative labels (Near/Mid/Long-Term) without week estimates to emphasise sequencing over exact timing.
- `DUMMIES.md`: Catalog of temporary shims (REST/gRPC stubs, Celery dummies, fake drivers) with removal guidance and a retired
section for historical compatibility layers.
