5 changes: 5 additions & 0 deletions AGENTS.md
@@ -0,0 +1,5 @@
# Agent Instructions

- Use descriptive Markdown headings starting at level 1 for top-level documents.
- Keep lines to 120 characters or fewer when practical.
- Prefer bullet lists for enumerations instead of inline commas.
43 changes: 43 additions & 0 deletions DISCREPANCIES.md
@@ -0,0 +1,43 @@
# README vs Implementation Discrepancies

## Overview
- The legacy README promises a fully featured memory graph service with multi-level APIs, relationship storage, and diverse retrieval methods.
- The current codebase delivers a narrower pipeline that focuses on extracting `Memory` nodes, preprocessing them, and writing them to Memgraph.
- Many examples in the README are not executable because the described methods, configuration defaults, and dependency behaviors do not exist.

## API Surface
- README shows `MeshMind.register_entity`, `register_allowed_predicates`, `add_predicate`, `store_memory`, `add_memory`, `add_triplet`, `search`, `search_facts`, `search_procedures`, `update_memory`, and `delete_memory` methods. Only `extract_memories`, `deduplicate`, `score_importance`, `compress`, and `store_memories` exist on `meshmind.client.MeshMind`.
- Entity registration is depicted as a first-class feature that validates custom Pydantic models (e.g., `Person`). The implementation merely checks that the `entity_label` returned by the LLM matches the class name; there is no registry enforcing schemas or instantiating the models.
- Triplet storage is central to the README narrative, yet the pipeline never creates edges. `MeshMind` exposes no method that calls `GraphDriver.upsert_edge`, and the tests never cover triplet scenarios.
- CRUD operations (`add_memory`, `update_memory`, `delete_memory`) are discussed as mid-level helpers. Only the lower-level `MemoryManager` class (not surfaced through `MeshMind`) contains these methods.

## Retrieval Capabilities
- README advertises embedding vector search, BM25, LLM reranking, fuzzy search, exact comparison, regex search, filter support, and hybrid methods. The codebase exposes BM25, fuzzy, hybrid, and metadata filters. There are no endpoints for exact comparison, regex search, or LLM reranking, and vector search exists only as a helper inside `MemgraphDriver.vector_search` with no integration.
- The README implies that search operates against the graph database. Actual retrieval utilities work on in-memory lists provided by the caller and do not query Memgraph.
- Usage examples show three layered search calls (`search`, `search_facts`, `search_procedures`). Only the single `search` dispatcher exists, alongside `search_bm25` and `search_fuzzy` helper functions.

## Data & Relationship Modeling
- README claims memories encompass nodes and edges, including relationship predicates registered ahead of time. The code lacks predicate management beyond an unused `PredicateRegistry` and never writes relationships to the database.
- Low-level `add_triplet` examples assume subject/object lookups by name. `MemgraphDriver.upsert_edge` expects UUIDs and assumes the nodes already exist, so the documented behavior cannot work.
- Memory importance, consolidation, and expiry are presented as rich features. Implementations are minimal: importance defaults to `1.0`, consolidation simply keeps the highest-importance duplicate in-memory, and expiry only runs inside a Celery task that depends on optional infrastructure.

## Configuration & Dependencies
- README omits instructions for registering embedding encoders. In practice, `meshmind.pipeline.extract` fails with `KeyError` unless `EncoderRegistry.register` is called before extraction.
- README suggests broad Python support and effortless setup. `pyproject.toml` requires Python 3.13, yet many dependencies (Celery, mgclient) do not publish wheels for that version. Missing `mgclient` triggers an import-time failure inside `MeshMind.__init__`.
- Required environment variables (OpenAI API key, Memgraph URI/credentials, Redis URL) are not documented. The README examples instantiate `MeshMind()` with no mention of configuration, but the code depends on these settings.
- README instructs storing and searching without mentioning external services. The current implementation requires a running Memgraph instance and optional Redis for Celery beat.

## Example Code Path
- README’s extraction example instantiates `MeshMind`, registers entity classes, and expects memories to be generated with corresponding attributes. Actual extraction enforces `entity_label` only by string name and returns `Memory` objects rather than instances of the provided Pydantic models.
- The `mesh_mind.store_memory(memory)` loop in the README references a nonexistent method; the equivalent in code is `store_memories([memory])` or direct use of `MemoryManager` (see the sketch after this list).
- Search examples call `mesh_mind.search`, `search_facts`, and `search_procedures`. Only `search` exists, and it requires preloaded `Memory` objects.
- Update/delete examples rely on `mesh_mind.update_memory` and `mesh_mind.delete_memory`, which are absent.
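
A hedged sketch of the path that does work today, using only the method names listed in this document; the argument names and return shapes below are assumptions, not verified signatures from `meshmind.client`.

```python
from meshmind.client import MeshMind

mesh_mind = MeshMind()  # assumes Memgraph and OpenAI configuration is already in place

# extract_memories is one of the methods that actually exists; its exact signature is assumed here.
memories = mesh_mind.extract_memories("Alice moved to Berlin in 2021.")

# store_memory(memory) from the README does not exist; the batch method is the documented equivalent.
mesh_mind.store_memories(memories)

# The single search dispatcher works on Memory objects already in memory, not on the graph.
results = mesh_mind.search(memories, query="Where does Alice live?")
```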

## Tooling & Operations
- README does not mention the need to register encoders before running the CLI; the default ingest command fails unless `EncoderRegistry` has an entry matching the configured embedding model.
- README implies functioning Celery maintenance processes. The Celery tasks are importable but disabled when dependencies are missing, and they do not persist consolidated results.
- README lacks troubleshooting guidance for OpenAI SDK changes. The shipped code uses response access patterns (`response['data']`) incompatible with the current SDK, leading to runtime errors.

## Documentation State
- README positions the document as the authoritative source of truth, yet large sections (triplet storage, relationship management, retrieval coverage) describe unimplemented functionality.
- The original README does not point readers to supporting documents such as configuration references, dependency requirements, or operational runbooks, leaving gaps for anyone onboarding today.
32 changes: 32 additions & 0 deletions FINDINGS.md
@@ -0,0 +1,32 @@
# Findings

## General Observations
- The codebase contains a wide range of functionality, but most modules are loosely integrated; many components are present without being wired into the main client or CLI flows.
- Optional dependencies are imported eagerly (OpenAI, tiktoken, mgclient). In minimal environments the package cannot be imported without installing every dependency.
- Tests capture the intended behaviors more accurately than the runtime code, yet they rely on deprecated OpenAI interfaces and attributes that no longer exist on the models.

## Dependency & Environment Issues
- `MeshMind` always instantiates `MemgraphDriver`, so installing `mgclient` and having a reachable Memgraph instance are prerequisites even for local experimentation.
- `meshmind.pipeline.extract` calls the OpenAI Responses API but never registers an encoder. Unless callers register a matching encoder manually, extraction fails before returning any memories.
- The OpenAI SDK objects expose attributes rather than dictionary keys, so `response['data']` in `OpenAIEmbeddingEncoder` raises at runtime (see the sketch after this list).
- `meshmind.core.utils` imports `tiktoken` globally; importing `meshmind.core.utils` without the package installed raises immediately.
- Celery and Redis are referenced in tasks, yet `docker-compose.yml` does not provision those services. There is no documented way to launch a working stack.
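
For the embeddings issue specifically, a minimal sketch of the attribute-style access the current OpenAI SDK (1.x) expects; the helper function is hypothetical, but `client.embeddings.create` and `response.data[...].embedding` follow the published SDK.

```python
from openai import OpenAI


def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Hypothetical helper showing the access pattern OpenAIEmbeddingEncoder would need."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    # The 1.x SDK returns a pydantic-style object, not a dict: response["data"] raises,
    # while the .data attribute holds one embedding item per input text.
    return [item.embedding for item in response.data]
```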

## Data Flow & Persistence
- Pipelines only ever upsert nodes. `GraphDriver.upsert_edge` and the `Triplet` model are unused, so relationship data is currently lost.
- Compression, consolidation, and expiry utilities operate on lists of `Memory` objects but do not persist the results back into the graph within standard workflows.
- `Memory.importance` defaults to `1.0` and is not recalculated; there is no ranking algorithm or heuristics as described in the README.

## CLI & Tooling
- The CLI hardcodes `entity_types=[Memory]` for extraction, which undermines the intent of user-defined entity models.
- CLI ingestion will fail without an encoder registered under the configured embedding model name, yet the CLI does not perform or document this registration step.
- `docker-compose.yml` appears to be a placeholder. It lacks service definitions for Memgraph or Redis and cannot launch the environment described in the README.

## Testing & Quality
- Tests import `pytest`, but there is no automated workflow or Makefile target for running them; even when invoked manually, `pytest` fails because `mgclient`, OpenAI, and `tiktoken` are not installed.
- Some tests reference `Memory.pre_init` hooks that are absent from the production model, indicating drift between tests and implementation.
- There is no linting, formatting, or type-checking configuration, despite the project aiming for production-level reliability.

## Documentation
- The new README must explain encoder registration, dependency setup, and realistic capabilities. The existing README significantly overpromises.
- Additional developer onboarding materials (environment setup, service provisioning, troubleshooting) are required to make the project approachable.
30 changes: 30 additions & 0 deletions ISSUES.md
@@ -0,0 +1,30 @@
# Issues Checklist

## Blockers
- [ ] MeshMind client fails without `mgclient`; introduce lazy driver initialization or a documented in-memory fallback.
- [ ] Register a default embedding encoder (OpenAI or sentence-transformers) during startup so extraction and hybrid search can run.
- [ ] Update OpenAI integration to match the current SDK (Responses API payload, embeddings API response structure).
- [ ] Replace eager `tiktoken` imports in `meshmind.core.utils` and `meshmind.pipeline.compress` with guarded, optional imports (both this and lazy driver initialization are sketched after this list).
- [ ] Align declared Python requirement with supported dependencies (currently set to Python 3.13 despite ecosystem gaps).
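
A sketch of the guarded-import and lazy-initialization patterns called for by the blockers above; nothing is assumed about the real module layout beyond the names mentioned in this document, and the `meshmind.storage.memgraph` import path in particular is hypothetical.

```python
# Guarded optional import: modules can then be imported even when tiktoken is absent.
try:
    import tiktoken
except ImportError:  # minimal environments without the tokenizer installed
    tiktoken = None


def count_tokens(text: str) -> int:
    if tiktoken is None:
        # Crude fallback so callers degrade instead of failing at import time.
        return max(1, len(text) // 4)
    return len(tiktoken.get_encoding("cl100k_base").encode(text))


class MeshMind:
    """Sketch of lazy driver construction: mgclient is only needed once storage is used."""

    def __init__(self) -> None:
        self._driver = None

    @property
    def driver(self):
        if self._driver is None:
            from meshmind.storage.memgraph import MemgraphDriver  # hypothetical import path
            self._driver = MemgraphDriver()
        return self._driver
```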

## High Priority
- [ ] Implement relationship persistence (`GraphDriver.upsert_edge`) within the storage pipeline and expose triplet APIs.
- [ ] Restore high-level API methods promised in README (`register_entity`, predicate management, `add_memory`, `update_memory`, `delete_memory`).
- [ ] Ensure CLI ingestion registers entity models and embedding encoders or fails fast with actionable messaging.
- [ ] Provide configuration documentation and examples for Memgraph, Redis, and OpenAI environment variables.
- [ ] Add automated tests or smoke checks that run without external services (mock OpenAI, stub Memgraph driver); see the sketch after this list.
- [ ] Create real docker-compose services for Memgraph and Redis or remove the placeholder file.
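
One possible shape for the smoke check mentioned above, sketched with `unittest.mock`; the patch target `meshmind.client.MemgraphDriver` and the idea that patching there prevents a Bolt connection are assumptions that depend on how the client actually imports its driver.

```python
from unittest import mock


def test_store_memories_without_memgraph():
    # Patch the driver class so MeshMind never connects to a real Memgraph instance.
    # Whether this seam exists depends on how meshmind.client imports MemgraphDriver.
    with mock.patch("meshmind.client.MemgraphDriver") as fake_driver_cls:
        from meshmind.client import MeshMind

        client = MeshMind()
        client.store_memories([])  # empty batch: exercises the call path without fixtures

    fake_driver_cls.assert_called_once()
```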

## Medium Priority
- [ ] Persist results from consolidation and compression tasks back to the database (currently in-memory only).
- [ ] Refine `Memory.importance` scoring to reflect actual ranking heuristics instead of a constant.
- [ ] Add vector, regex, and exact-match search helpers to match the stated feature set, or update documentation to demote them (see the sketch after this list).
- [ ] Harden Celery tasks to initialize dependencies lazily and log failures when the driver is unavailable.
- [ ] Reconcile tests that depend on `Memory.pre_init` and outdated OpenAI interfaces with the current implementation.
- [ ] Add linting, formatting, and type-checking tooling to improve code quality.
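
If the regex and exact-match helpers are added rather than demoted, they can stay dependency-free; a sketch follows, where `MemoryStub.text` is a placeholder for whatever field the real `Memory` model exposes.

```python
import re
from dataclasses import dataclass


@dataclass
class MemoryStub:
    """Stand-in for meshmind's Memory model; only the text field matters for this sketch."""
    text: str


def search_regex(memories: list[MemoryStub], pattern: str) -> list[MemoryStub]:
    compiled = re.compile(pattern, re.IGNORECASE)
    return [m for m in memories if compiled.search(m.text)]


def search_exact(memories: list[MemoryStub], query: str) -> list[MemoryStub]:
    return [m for m in memories if m.text == query]


memories = [MemoryStub("Alice moved to Berlin"), MemoryStub("Quarterly report due 2024-06-30")]
print(search_regex(memories, r"\d{4}-\d{2}-\d{2}"))  # matches only the dated memory
```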

## Low Priority / Nice to Have
- [ ] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development; a rough in-memory sketch follows this list.
- [ ] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics.
- [ ] Publish onboarding guides and troubleshooting FAQs for contributors.
- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring.
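
For the alternative-backend idea, a rough in-memory sketch; the method names mirror `upsert_edge` from this document plus a hypothetical node counterpart, and the real `GraphDriver` interface would need to be checked before adopting anything like it.

```python
from typing import Any
from uuid import UUID


class InMemoryGraphDriver:
    """Rough in-memory stand-in for MemgraphDriver, useful for local experimentation."""

    def __init__(self) -> None:
        self.nodes: dict[UUID, dict[str, Any]] = {}
        self.edges: list[tuple[UUID, str, UUID]] = []

    def upsert_node(self, node_id: UUID, properties: dict[str, Any]) -> None:
        # Hypothetical counterpart to the node upserts performed by the pipeline.
        self.nodes.setdefault(node_id, {}).update(properties)

    def upsert_edge(self, subject_id: UUID, predicate: str, object_id: UUID) -> None:
        # Mirrors the documented expectation that both endpoints already exist as nodes.
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must be upserted before creating an edge")
        self.edges.append((subject_id, predicate, object_id))
```
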
51 changes: 51 additions & 0 deletions NEEDED_FOR_TESTING.md
@@ -0,0 +1,51 @@
# Needed for Testing MeshMind

## Python Runtime
- Python 3.11 or 3.12 is recommended; project metadata claims 3.13+, but several dependencies (e.g., `pymgclient`, `sentence-transformers`)
have not been validated there.
- A virtual environment (via `venv`, `uv`, or `conda`) to isolate dependencies.

## Python Dependencies
- Install the project in editable mode: `pip install -e .` from the repository root.
- Ensure optional extras that ship as hard dependencies in `pyproject.toml` are present:
- `pymgclient` for Memgraph connectivity.
- `celery[redis]` for scheduled maintenance tasks.
- `tiktoken`, `sentence-transformers`, `rapidfuzz`, `scikit-learn`, and `numpy` for embedding, retrieval, and compression.
- Additional development tooling that may be required when running the current test suite:
- `pytest` (already listed but verify it installs successfully under the chosen Python version).
- `python-dotenv` if you plan to load environment variables from a `.env` file.

## External Services and Infrastructure
- **Memgraph** (preferred) or a Neo4j-compatible Bolt graph database reachable at the URI exported via `MEMGRAPH_URI`.
- Requires the Bolt port (default `7687`) to be exposed.
- Ensure user credentials provided in `MEMGRAPH_USERNAME` and `MEMGRAPH_PASSWORD` have write access.
- **Redis** instance when exercising Celery tasks (expiry, consolidation, compression). Set its location via `REDIS_URL`.
- **OpenAI API access** for extraction and embedding encoders used throughout the pipeline.
- Optional but useful: an orchestration layer (Docker Compose or Kubernetes) to manage Memgraph and Redis in tandem if you plan to
mimic production workflows.

## Environment Variables
- `OPENAI_API_KEY` — required for any extraction or embedding calls via the OpenAI SDK.
- `MEMGRAPH_URI` — Bolt connection string, e.g., `bolt://localhost:7687`.
- `MEMGRAPH_USERNAME` — username for the Memgraph (or Neo4j) instance.
- `MEMGRAPH_PASSWORD` — password for the database user.
- `REDIS_URL` — Redis connection URI (defaults to `redis://localhost:6379/0`).
- `EMBEDDING_MODEL` — key used by `EncoderRegistry` (defaults to `text-embedding-3-small`). Ensure a matching encoder is
registered at runtime before running ingestion or retrieval steps.

## Local Configuration Steps
- Register an embedding encoder before tests that rely on embeddings:
```python
from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder
EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder("text-embedding-3-small"))
```
- Provide seed data or fixtures for the graph database if end-to-end tests assume pre-existing memories.
- Optionally create a `.env` file mirroring the environment variables above for convenient local setup.
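
A small bootstrap sketch for that optional `.env` flow, assuming `python-dotenv` is installed; the variable names and defaults come from the list above.

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

REQUIRED = ("OPENAI_API_KEY", "MEMGRAPH_URI", "MEMGRAPH_USERNAME", "MEMGRAPH_PASSWORD")
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")

redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
embedding_model = os.environ.get("EMBEDDING_MODEL", "text-embedding-3-small")
```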

## Current Blockers in This Environment
- Neo4j and Memgraph binaries are not available, and container tooling (Docker, `mgconsole`, `neo4j-admin`) cannot be installed,
preventing local graph database provisioning inside this workspace.
- Outbound network restrictions may block installation of proprietary dependencies or remote database provisioning without
pre-baked artifacts.
- Redis is likewise unavailable without Docker or host-level package managers; Celery tasks cannot be validated locally until a
remote instance is supplied.