diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..e842439 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,5 @@ +# Agent Instructions + +- Use descriptive Markdown headings starting at level 1 for top-level documents. +- Keep lines to 120 characters or fewer when practical. +- Prefer bullet lists for enumerations instead of inline commas. diff --git a/DISCREPANCIES.md b/DISCREPANCIES.md new file mode 100644 index 0000000..5e391f9 --- /dev/null +++ b/DISCREPANCIES.md @@ -0,0 +1,43 @@ +# README vs Implementation Discrepancies + +## Overview +- The legacy README promises a fully featured memory graph service with multi-level APIs, relationship storage, and diverse retrieval methods. +- The current codebase delivers a narrower pipeline that focuses on extracting `Memory` nodes, preprocessing them, and writing them to Memgraph. +- Many examples in the README are not executable because the described methods, configuration defaults, and dependency behaviors do not exist. + +## API Surface +- README shows `MeshMind.register_entity`, `register_allowed_predicates`, `add_predicate`, `store_memory`, `add_memory`, `add_triplet`, `search`, `search_facts`, `search_procedures`, `update_memory`, and `delete_memory` methods. Only `extract_memories`, `deduplicate`, `score_importance`, `compress`, and `store_memories` exist on `meshmind.client.MeshMind`. +- Entity registration is depicted as a first-class feature that validates custom Pydantic models (e.g., `Person`). The implementation merely checks that the `entity_label` returned by the LLM matches the class name; there is no registry enforcing schemas or instantiating the models. +- Triplet storage is central to the README narrative, yet the pipeline never creates edges. `MeshMind` exposes no method that calls `GraphDriver.upsert_edge`, and the tests never cover triplet scenarios. +- CRUD operations (`add_memory`, `update_memory`, `delete_memory`) are discussed as mid-level helpers.
Only the lower-level `MemoryManager` class (not surfaced through `MeshMind`) contains these methods. + +## Retrieval Capabilities +- README advertises embedding vector search, BM25, LLM reranking, fuzzy search, exact comparison, regex search, filter support, and hybrid methods. The codebase exposes BM25, fuzzy, hybrid, and metadata filters. There are no endpoints for exact comparison, regex search, or LLM reranking, and vector search exists only as a helper inside `MemgraphDriver.vector_search` with no integration. +- The README implies that search operates against the graph database. Actual retrieval utilities work on in-memory lists provided by the caller and do not query Memgraph. +- Usage examples show three layered search calls (`search`, `search_facts`, `search_procedures`). Only the single `search` dispatcher exists, alongside `search_bm25` and `search_fuzzy` helper functions. + +## Data & Relationship Modeling +- README claims memories encompass nodes and edges, including relationship predicates registered ahead of time. The code lacks predicate management beyond an unused `PredicateRegistry` and never writes relationships to the database. +- Low-level `add_triplet` examples assume subject/object lookups by name. `MemgraphDriver.upsert_edge` expects UUIDs and assumes the nodes already exist, so the documented behavior cannot work. +- Memory importance, consolidation, and expiry are presented as rich features. Implementations are minimal: importance defaults to `1.0`, consolidation simply keeps the highest-importance duplicate in-memory, and expiry only runs inside a Celery task that depends on optional infrastructure. + +## Configuration & Dependencies +- README omits instructions for registering embedding encoders. In practice, `meshmind.pipeline.extract` fails with `KeyError` unless `EncoderRegistry.register` is called before extraction. +- README suggests broad Python support and effortless setup. 
`pyproject.toml` requires Python 3.13, yet many dependencies (Celery, mgclient) do not publish wheels for that version. Missing `mgclient` triggers an import-time failure inside `MeshMind.__init__`. +- Required environment variables (OpenAI API key, Memgraph URI/credentials, Redis URL) are not documented. The README examples instantiate `MeshMind()` with no mention of configuration, but the code depends on these settings. +- README instructs storing and searching without mentioning external services. The current implementation requires a running Memgraph instance and optional Redis for Celery beat. + +## Example Code Path +- README’s extraction example instantiates `MeshMind`, registers entity classes, and expects memories to be generated with corresponding attributes. Actual extraction enforces `entity_label` only by string name and returns `Memory` objects rather than instances of the provided Pydantic models. +- The `mesh_mind.store_memory(memory)` loop in the README references a nonexistent method; the equivalent in code is `store_memories([memory])` or direct use of `MemoryManager`. +- Search examples call `mesh_mind.search`, `search_facts`, and `search_procedures`. Only `search` exists, and it requires preloaded `Memory` objects. +- Update/delete examples rely on `mesh_mind.update_memory` and `mesh_mind.delete_memory`, which are absent. + +## Tooling & Operations +- README does not mention the need to register encoders before running the CLI; the default ingest command fails unless `EncoderRegistry` has an entry matching the configured embedding model. +- README implies functioning Celery maintenance processes. The Celery tasks are importable but disabled when dependencies are missing, and they do not persist consolidated results. +- README lacks troubleshooting guidance for OpenAI SDK changes. The shipped code uses response access patterns (`response['data']`) incompatible with the current SDK, leading to runtime errors. 
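The incompatibility is visible in one line: the current `openai` SDK (>= 1.0) returns typed objects, so `response["data"]` must become `response.data`. A minimal sketch using a stub that mirrors the shape of the SDK's embedding response (the stub classes and helper name are illustrative, not part of this repo; the real call would be `client.embeddings.create(model=..., input=...)`):

```python
from dataclasses import dataclass
from typing import List

# Stand-ins mirroring the shape of openai's CreateEmbeddingResponse.
@dataclass
class EmbeddingItem:
    embedding: List[float]

@dataclass
class EmbeddingResponse:
    data: List[EmbeddingItem]

def extract_vector(response: EmbeddingResponse) -> List[float]:
    # Attribute access works on openai>=1.0 response objects;
    # subscripting like response["data"] raises at runtime.
    return response.data[0].embedding

stub = EmbeddingResponse(data=[EmbeddingItem(embedding=[0.1, 0.2, 0.3])])
print(extract_vector(stub))  # [0.1, 0.2, 0.3]
```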
+ +## Documentation State +- README positions the document as the authoritative source of truth, yet large sections (triplet storage, relationship management, retrieval coverage) describe unimplemented functionality. +- The original README does not point readers to supporting documents such as configuration references, dependency requirements, or operational runbooks, leaving gaps for anyone onboarding today. diff --git a/FINDINGS.md b/FINDINGS.md new file mode 100644 index 0000000..4fb2bf4 --- /dev/null +++ b/FINDINGS.md @@ -0,0 +1,32 @@ +# Findings + +## General Observations +- The codebase compiles a wide range of functionality, but most modules are loosely integrated; many components are present without being wired into the main client or CLI flows. +- Optional dependencies are imported eagerly (OpenAI, tiktoken, mgclient). In minimal environments the package cannot be imported without installing every dependency. +- Tests capture the intended behaviors more accurately than the runtime code, yet they rely on deprecated OpenAI interfaces and attributes that no longer exist on the models. + +## Dependency & Environment Issues +- `MeshMind` always instantiates `MemgraphDriver`, so installing `mgclient` and having a reachable Memgraph instance are prerequisites even for local experimentation. +- `meshmind.pipeline.extract` calls the OpenAI Responses API but never registers an encoder. Unless callers register a matching encoder manually, extraction fails before returning any memories. +- The OpenAI SDK objects expose attributes rather than dictionary keys, so `response['data']` in `OpenAIEmbeddingEncoder` raises at runtime. +- `meshmind.core.utils` imports `tiktoken` globally; importing `meshmind.core.utils` without the package installed raises immediately. +- Celery and Redis are referenced in tasks, yet `docker-compose.yml` does not provision those services. There is no documented way to launch a working stack. 
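One way to keep modules like `meshmind.core.utils` importable without `tiktoken` is the guarded-import pattern below. This is a sketch, not the repo's current behavior: the ~4-characters-per-token fallback heuristic and the function name are assumptions.

```python
try:
    import tiktoken
except ImportError:  # degrade gracefully instead of failing at import time
    tiktoken = None

def count_tokens(text: str) -> int:
    if tiktoken is None:
        # Rough fallback: roughly 4 characters per token, floor of 1.
        return max(1, len(text) // 4)
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

print(count_tokens("MeshMind stores memories in a graph."))
```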
+ +## Data Flow & Persistence +- Pipelines only ever upsert nodes. `GraphDriver.upsert_edge` and the `Triplet` model are unused, so relationship data is currently lost. +- Compression, consolidation, and expiry utilities operate on lists of `Memory` objects but do not persist the results back into the graph within standard workflows. +- `Memory.importance` defaults to `1.0` and is not recalculated; there is no ranking algorithm or heuristics as described in the README. + +## CLI & Tooling +- The CLI hardcodes `entity_types=[Memory]` for extraction, which undermines the intent of user-defined entity models. +- CLI ingestion will fail without an encoder registered under the configured embedding model name, yet the CLI does not perform or document this registration step. +- `docker-compose.yml` appears to be a placeholder. It lacks service definitions for Memgraph or Redis and cannot launch the environment described in the README. + +## Testing & Quality +- Tests import `pytest` but there is no automated workflow or Makefile target for running them; `pytest` will still fail because mgclient, OpenAI, and tiktoken are missing. +- Some tests reference `Memory.pre_init` hooks that are absent from the production model, indicating drift between tests and implementation. +- There is no linting, formatting, or type-checking configuration, despite the project aiming for production-level reliability. + +## Documentation +- The new README must explain encoder registration, dependency setup, and realistic capabilities. The existing README significantly overpromises. +- Additional developer onboarding materials (environment setup, service provisioning, troubleshooting) are required to make the project approachable. 
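For reference, a minimal compose file covering the two missing services might look like the sketch below. The image tags and ports are upstream defaults, not values taken from this repo's placeholder `docker-compose.yml`.

```yaml
# Sketch only: this repo does not yet define these services.
services:
  memgraph:
    image: memgraph/memgraph
    ports:
      - "7687:7687"   # Bolt port referenced by MEMGRAPH_URI
  redis:
    image: redis:7
    ports:
      - "6379:6379"   # target of REDIS_URL for Celery
```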
diff --git a/ISSUES.md b/ISSUES.md new file mode 100644 index 0000000..45e8572 --- /dev/null +++ b/ISSUES.md @@ -0,0 +1,30 @@ +# Issues Checklist + +## Blockers +- [ ] MeshMind client fails without `mgclient`; introduce lazy driver initialization or documented in-memory fallback. +- [ ] Register a default embedding encoder (OpenAI or sentence-transformers) during startup so extraction and hybrid search can run. +- [ ] Update OpenAI integration to match the current SDK (Responses API payload, embeddings API response structure). +- [ ] Replace eager `tiktoken` imports in `meshmind.core.utils` and `meshmind.pipeline.compress` with guarded, optional imports. +- [ ] Align declared Python requirement with supported dependencies (currently set to Python 3.13 despite ecosystem gaps). + +## High Priority +- [ ] Implement relationship persistence (`GraphDriver.upsert_edge`) within the storage pipeline and expose triplet APIs. +- [ ] Restore high-level API methods promised in README (`register_entity`, predicate management, `add_memory`, `update_memory`, `delete_memory`). +- [ ] Ensure CLI ingestion registers entity models and embedding encoders or fails fast with actionable messaging. +- [ ] Provide configuration documentation and examples for Memgraph, Redis, and OpenAI environment variables. +- [ ] Add automated tests or smoke checks that run without external services (mock OpenAI, stub Memgraph driver). +- [ ] Create real docker-compose services for Memgraph and Redis or remove the placeholder file. + +## Medium Priority +- [ ] Persist results from consolidation and compression tasks back to the database (currently in-memory only). +- [ ] Refine `Memory.importance` scoring to reflect actual ranking heuristics instead of a constant. +- [ ] Add vector, regex, and exact-match search helpers to match stated feature set or update documentation to demote them. +- [ ] Harden Celery tasks to initialize dependencies lazily and log failures when the driver is unavailable. 
+- [ ] Reconcile tests that depend on `Memory.pre_init` and outdated OpenAI interfaces with the current implementation. +- [ ] Add linting, formatting, and type-checking tooling to improve code quality. + +## Low Priority / Nice to Have +- [ ] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development. +- [ ] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics. +- [ ] Publish onboarding guides and troubleshooting FAQs for contributors. +- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring. diff --git a/NEEDED_FOR_TESTING.md b/NEEDED_FOR_TESTING.md new file mode 100644 index 0000000..ad6f3cf --- /dev/null +++ b/NEEDED_FOR_TESTING.md @@ -0,0 +1,51 @@ +# Needed for Testing MeshMind + +## Python Runtime +- Python 3.11 or 3.12 is recommended; project metadata claims 3.13+, but several dependencies (e.g., `pymgclient`, `sentence-transformers`) + have not been validated there. +- A virtual environment (via `venv`, `uv`, or `conda`) to isolate dependencies. + +## Python Dependencies +- Install the project in editable mode: `pip install -e .` from the repository root. +- Ensure optional extras that ship as hard dependencies in `pyproject.toml` are present: + - `pymgclient` for Memgraph connectivity. + - `celery[redis]` for scheduled maintenance tasks. + - `tiktoken`, `sentence-transformers`, `rapidfuzz`, `scikit-learn`, and `numpy` for embedding, retrieval, and compression. +- Additional development tooling that may be required when running the current test suite: + - `pytest` (already listed but verify it installs successfully under the chosen Python version). + - `python-dotenv` if you plan to load environment variables from a `.env` file. + +## External Services and Infrastructure +- **Memgraph** (preferred) or a Neo4j-compatible Bolt graph database reachable at the URI exported via `MEMGRAPH_URI`. 
+ - Requires the Bolt port (default `7687`) to be exposed. + - Ensure user credentials provided in `MEMGRAPH_USERNAME` and `MEMGRAPH_PASSWORD` have write access. +- **Redis** instance when exercising Celery tasks (expiry, consolidation, compression). Set its location via `REDIS_URL`. +- **OpenAI API access** for extraction and embedding encoders used throughout the pipeline. +- Optional but useful: an orchestration layer (Docker Compose or Kubernetes) to manage Memgraph and Redis in tandem if you plan to + mimic production workflows. + +## Environment Variables +- `OPENAI_API_KEY` — required for any extraction or embedding calls via the OpenAI SDK. +- `MEMGRAPH_URI` — Bolt connection string, e.g., `bolt://localhost:7687`. +- `MEMGRAPH_USERNAME` — username for the Memgraph (or Neo4j) instance. +- `MEMGRAPH_PASSWORD` — password for the database user. +- `REDIS_URL` — Redis connection URI (defaults to `redis://localhost:6379/0`). +- `EMBEDDING_MODEL` — key used by `EncoderRegistry` (defaults to `text-embedding-3-small`). Ensure a matching encoder is + registered at runtime before running ingestion or retrieval steps. + +## Local Configuration Steps +- Register an embedding encoder before tests that rely on embeddings: + ```python + from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder + EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder("text-embedding-3-small")) + ``` +- Provide seed data or fixtures for the graph database if end-to-end tests assume pre-existing memories. +- Optionally create a `.env` file mirroring the environment variables above for convenient local setup. + +## Current Blockers in This Environment +- Neo4j and Memgraph binaries are not available, and container tooling (Docker, `mgconsole`, `neo4j-admin`) cannot be installed, + preventing local graph database provisioning inside this workspace. 
+- Outbound network restrictions may block installation of proprietary dependencies or remote database provisioning without + pre-baked artifacts. +- Redis is likewise unavailable without Docker or host-level package managers; Celery tasks cannot be validated locally until a + remote instance is supplied. diff --git a/NEW_README.md b/NEW_README.md new file mode 100644 index 0000000..78c2b55 --- /dev/null +++ b/NEW_README.md @@ -0,0 +1,136 @@ +# MeshMind + +MeshMind is an experimental memory orchestration service that pairs large language models with a property graph. The current code turns unstructured text into `Memory` records, applies light preprocessing, and stores them via a Memgraph driver. Retrieval helpers run in-memory using lexical, fuzzy, and hybrid scoring strategies. The project is a work in progress; many features described in the legacy README are not yet implemented. + +## Status at a Glance +- ✅ High-level client (`meshmind.client.MeshMind`) with helpers for extraction, preprocessing, and storage. +- ✅ Pipelines for deduplication, default importance scoring, compression, and persistence of memory nodes. +- ✅ Retrieval helpers for TF-IDF (BM25-style), fuzzy string matching, and hybrid vector + lexical ranking (requires registered encoder). +- ✅ Celery task stubs for expiry, consolidation, and compression (require Celery + Redis + Memgraph to function). +- ⚠️ Relationship handling (triplets, predicate registration) is scaffolded but not wired into the storage pipeline. +- ⚠️ CLI ingestion requires manual encoder registration and a running Memgraph instance with `mgclient` installed. +- ❌ Mid-level CRUD APIs (`add_memory`, `update_memory`, `delete_memory`), triplet storage, regex search, and LLM reranking are not implemented. + +## Requirements +- Python 3.11 or 3.12 recommended (project metadata claims 3.13 but dependency support is unverified). +- Memgraph instance reachable via Bolt and the `mgclient` Python package. 
+- OpenAI API key for extraction and embeddings. +- Optional but recommended: Redis and Celery for scheduled maintenance tasks. +- Additional Python packages installed via `pip install -e .` (see `pyproject.toml`). Some optional modules (`tiktoken`, `sentence-transformers`) are required for specific features. + +## Installation +1. Create and activate a virtual environment using Python 3.11 or 3.12. +2. Install the package in editable mode: + ```bash + pip install -e . + ``` +3. Install optional dependencies as needed: + ```bash + pip install mgclient tiktoken sentence-transformers redis celery + ``` +4. Export required environment variables (adapt values to your setup): + ```bash + export OPENAI_API_KEY=sk-... + export MEMGRAPH_URI=bolt://localhost:7687 + export MEMGRAPH_USERNAME=neo4j + export MEMGRAPH_PASSWORD=secret + export REDIS_URL=redis://localhost:6379/0 + export EMBEDDING_MODEL=text-embedding-3-small + ``` + +## Encoder Registration +`meshmind.pipeline.extract` expects an encoder registered in `meshmind.core.embeddings.EncoderRegistry` that matches `settings.EMBEDDING_MODEL`. Before calling extraction or the CLI, register an encoder: +```python +from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder +EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder("text-embedding-3-small")) +``` +For offline experimentation you may register a custom encoder that returns deterministic embeddings. 
+ +## Quick Start (Python) +```python +from meshmind.client import MeshMind +from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder +from meshmind.core.types import Memory + +# Register an embedding encoder once at startup +EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder()) + +mm = MeshMind() +texts = ["Python is a programming language created by Guido van Rossum."] +memories = mm.extract_memories( + instructions="Extract key facts as Memory objects.", + namespace="demo", + entity_types=[Memory], + content=texts, +) +memories = mm.deduplicate(memories) +memories = mm.score_importance(memories) +memories = mm.compress(memories) +mm.store_memories(memories) +``` +This workflow persists `Memory` nodes in Memgraph. Relationships are not yet created automatically. + +## Command-Line Ingestion +The CLI performs the same pipeline for files and folders of text documents. +```bash +meshmind ingest \ + --namespace demo \ + --instructions "Extract key facts as Memory objects." \ + ./path/to/text/files +``` +Before running: +- Ensure `mgclient` can connect to Memgraph using credentials in environment variables. +- Register an embedding encoder in a startup script (e.g., run a small Python snippet prior to invocation) or extend the CLI to perform registration. +- Provide a `.env` file or export environment variables for configuration. + +## Retrieval Helpers +Retrieval utilities operate on in-memory lists of `Memory` objects. Load the records from your graph (e.g., via `MemoryManager.list_memories`) before calling these helpers. 
+```python +from meshmind.api.memory_manager import MemoryManager +from meshmind.core.types import SearchConfig +from meshmind.db.memgraph_driver import MemgraphDriver +from meshmind.retrieval.search import search + +# Load memories from Memgraph +manager = MemoryManager(MemgraphDriver("bolt://localhost:7687")) +memories = manager.list_memories(namespace="demo") + +# Register the same encoder used during ingestion +from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder +EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder()) + +config = SearchConfig(encoder="text-embedding-3-small", top_k=10) +results = search("Python", memories, namespace="demo", config=config) +for memory in results: + print(memory.name, memory.metadata) +``` +`search_bm25` and `search_fuzzy` are also available for lexical and fuzzy-only scoring. Vector search against Memgraph is not implemented; hybrid search uses embeddings stored on each memory object. + +## Maintenance Tasks +`meshmind.tasks.scheduled` defines Celery beat jobs for: +- Expiring memories once `created_at + ttl_seconds` has elapsed. +- Consolidating duplicate names by keeping the highest-importance record. +- Compressing long metadata content with tiktoken. + +To enable these tasks: +1. Ensure Celery and Redis are installed and running. +2. Start a Celery worker with the meshmind app: + ```bash + celery -A meshmind.tasks.celery_app.app worker -B + ``` + The module attempts to instantiate `MemgraphDriver` at import time; provide valid credentials and ensure `mgclient` is available. + +## Testing +- Pytests live under `meshmind/tests`. They rely on heavy monkeypatching and may need updates to align with the current OpenAI SDK. +- Some tests assume fixtures or hooks (`Memory.pre_init`) that are absent; expect failures until the suite is modernized. +- No continuous integration pipeline is currently provided. 
+ +## Known Limitations +- No triplet/relationship persistence; only nodes are stored. +- Mid-level CRUD helpers and predicate registration are missing from the public client API. +- CLI lacks ergonomics for registering encoders or custom entity models. +- Optional dependencies (`tiktoken`, `sentence-transformers`) are required at import time, leading to crashes when absent. +- The project requires significant configuration (Memgraph, Redis, OpenAI) before any end-to-end scenario succeeds. + +## Roadmap Snapshot +See `PROJECT.md` and `PLAN.md` for prioritized workstreams, including restoring the documented API, improving dependency handling, and expanding retrieval coverage. diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 0000000..1dc1ca8 --- /dev/null +++ b/PLAN.md @@ -0,0 +1,52 @@ +# Plan of Action + +## Phase 1 – Stabilize Runtime Basics +1. **Dependency Guards** + - Add lazy imports and feature flags for `mgclient`, `tiktoken`, `celery`, and `sentence-transformers` so missing packages do not crash the import path. + - Provide clear error messages and documentation when dependencies are absent. +2. **Default Encoder Registration** + - Introduce a bootstrap helper that registers an OpenAI encoder (or configurable fallback) on package import or via CLI option. + - Update CLI to call the bootstrap helper before extraction. +3. **OpenAI SDK Compatibility** + - Refactor `OpenAIEmbeddingEncoder` and pipeline extraction to use the latest `openai` SDK response objects with proper retry and error handling. +4. **Configuration Clarity** + - Publish a setup guide covering environment variables, Memgraph/Redis provisioning, and CLI usage. + +## Phase 2 – Restore Promised API Surface +1. **Entity & Predicate Registry Wiring** + - Connect `EntityRegistry` and `PredicateRegistry` to `MeshMind`, ensuring registered models and predicates persist to the database. +2. 
**CRUD & Triplet Support** + - Add `add_memory`, `update_memory`, `delete_memory`, `register_entity`, `register_allowed_predicates`, and `add_triplet` methods on `MeshMind` that wrap pipeline + driver operations. + - Extend storage pipeline to create relationships via `GraphDriver.upsert_edge` with sensible defaults for subject/object resolution. +3. **Relationship-Aware Examples** + - Update example scripts and documentation to demonstrate triplet storage and retrieval once implemented. + +## Phase 3 – Retrieval & Maintenance Enhancements +1. **Search Coverage** + - Implement vector-only, regex, and exact-match search helpers and expose them through `meshmind.retrieval.search`. + - Optionally integrate LLM reranking for high-quality results. +2. **Maintenance Tasks** + - Ensure consolidation and compression tasks persist results back to Memgraph. + - Move Celery driver initialization into task functions to avoid import-time failures and add logging for missing dependencies. +3. **Importance Scoring Improvements** + - Replace the constant importance score with a heuristic or LLM-based evaluator aligned with README claims. + +## Phase 4 – Developer Experience & Tooling +1. **Testing Overhaul** + - Modernize pytest suites to align with new SDKs, provide fixtures for stub drivers, and ensure tests run without external services. + - Add coverage for new API methods and relationship handling. +2. **Automation & CI** + - Expand the Makefile with lint, format, test, and type-check targets. + - Configure CI (GitHub Actions or similar) to run the suite and static checks on push/PR. +3. **Environment Provisioning** + - Replace `docker-compose.yml` with services for Memgraph and Redis or document alternative local development setups. + +## Phase 5 – Strategic Enhancements +1. **Pluggable Storage Backends** + - Abstract `GraphDriver` further to support alternative backends (Neo4j, in-memory driver, SQLite prototype). +2. 
**Service Interfaces** + - Expose REST/gRPC endpoints for ingestion and retrieval to enable external integrations. +3. **Operational Observability** + - Add logging, metrics, and dashboards for maintenance jobs, ingestion throughput, and retrieval latency. +4. **Onboarding & Documentation** + - Promote `NEW_README.md` to `README.md`, archive the legacy document, and maintain `SOT.md` with diagrams and workflow maps. diff --git a/PROJECT.md b/PROJECT.md new file mode 100644 index 0000000..ba9d9f6 --- /dev/null +++ b/PROJECT.md @@ -0,0 +1,69 @@ +# MeshMind Project Overview + +## Vision and Scope +- Build a practical memory service that turns unstructured text into graph-backed `Memory` records. +- Provide pipelines for extraction, preprocessing, storage, and retrieval without tightly coupling to a specific UI. +- Support background maintenance (expiry, consolidation, compression) once storage and scheduling dependencies are available. + +## Current Architecture Snapshot +- **Client façade**: `meshmind.client.MeshMind` wires together an OpenAI client, a configured embedding model name, and a `MemgraphDriver` instance. Every workflow starts here. +- **Pipelines**: Extraction (LLM + function-calling), preprocessing (deduplicate, score, compress), and storage utilities live in `meshmind.pipeline`. +- **Graph layer**: `meshmind.db` exposes an abstract `GraphDriver` and a Memgraph implementation that relies on `mgclient`. +- **Retrieval helpers**: `meshmind.retrieval` operates on in-memory `Memory` lists with TF-IDF, fuzzy, and hybrid (vector + lexical) scoring. +- **Task runners**: `meshmind.tasks` defines Celery wiring for expiry, consolidation, and compression jobs when Redis and Celery are present. +- **Support code**: `meshmind.core` contains models, configuration, similarity utilities, and embedding encoder helpers. Optional dependencies (OpenAI, sentence-transformers, tiktoken) are required at runtime for many modules. 
+- **Tooling**: A CLI ingest command (`meshmind ingest`) demonstrates the extract → preprocess → store loop. Tests exist but depend on heavy monkeypatching and outdated SDK assumptions. + +## Implemented Capabilities +- Serialize knowledge as `Memory` Pydantic models with namespace, entity label, metadata, optional embeddings, timestamps, TTL, and importance fields. +- Extract structured memories from text via the OpenAI Responses API using function-calling against the `Memory` schema (requires manual encoder registration). +- Deduplicate memories by name and (optionally) cosine similarity, assign a default importance score, and truncate metadata content using a tiktoken-based compressor. +- Persist memory nodes to Memgraph by calling `GraphDriver.upsert_entity` for each record. +- Run lexical, fuzzy, and hybrid retrieval against caller-provided in-memory lists of `Memory` objects, including optional metadata and namespace filters. +- Schedule expiry, consolidation, and compression maintenance tasks through Celery beat when both Celery and Redis are configured and the Memgraph driver initializes successfully. +- Provide an example script and CLI entry point that ingest plaintext files into a configured Memgraph instance. + +## Partially Implemented or Fragile Areas +- Hybrid search depends on an encoder registered in `EncoderRegistry`; nothing is auto-registered, so out-of-the-box calls fail. +- The OpenAI embedding wrapper assumes dictionary-style responses and does not match the latest SDK payload objects. +- Celery tasks instantiate the Memgraph driver at import time; without `mgclient` they silently degrade to no-ops. +- The compressor and utility helpers import `tiktoken` eagerly and fail if the package is absent. +- Tests reference hooks (`Memory.pre_init`, OpenAI chat completions) that are no longer present, so the suite does not execute cleanly. 
+- Python 3.13 is declared in `pyproject.toml`, yet third-party dependencies (Celery, mgclient) have not been validated for that interpreter. + +## Missing or Broken Capabilities +- No public API for registering entities, predicates, or storing triplets as promised in the legacy README. +- Graph relationships are never created because the storage pipeline only upserts nodes. +- There is no mid-level `add_memory`/`update_memory`/`delete_memory` surface on `MeshMind`; the CLI relies solely on extraction and store helpers. +- Vector search, regex search, exact match search, and LLM re-ranking endpoints described in the README are absent. +- Memory consolidation and expiry are not integrated into the ingestion workflow, and consolidation never writes results back to the database. +- Configuration guidance is minimal; missing environment variables lead to runtime failures when constructing the client. + +## External Services & Dependencies +- **Memgraph + mgclient**: Required for any persistence. Without `mgclient`, constructing `MeshMind` raises immediately. +- **OpenAI SDK**: Needed for both extraction and embeddings. Newer SDK versions return typed objects, not dicts, which breaks current assumptions. +- **tiktoken**: Used by compression and token counting utilities. Imported at module load time without fallbacks. +- **RapidFuzz, scikit-learn, numpy**: Support fuzzy and lexical retrieval. +- **Celery + Redis**: Optional but necessary for scheduled maintenance tasks. +- **sentence-transformers**: Optional embedding backend for offline models. + +## Tooling and Operational State +- `docker-compose.yml` lists services but does not provision Memgraph or Redis containers. +- No encoder instances are registered automatically; setup scripts are missing. +- Pytests rely on manual monkeypatches to simulate OpenAI and mgclient. Running `pytest` out of the box fails due to missing optional dependencies and incompatible SDK interfaces. 
+- Continuous integration or linting workflows are not defined. + +## Roadmap Highlights +- Restore the high-level API surface promised in the README (entity registration, predicate management, CRUD helpers, triplet storage). +- Introduce a safe, dependency-light initialization path (lazy imports, graceful fallbacks, injectable storage backends). +- Expand retrieval to include driver-backed vector search, regex/exact match helpers, and optional LLM-based re-ranking. +- Implement relationship persistence and richer metadata handling in the graph layer. +- Harden maintenance jobs to run independently of import-time side effects and to write results back to Memgraph. +- Rewrite the test suite around modern OpenAI SDK semantics and provide fixtures for running without external services. +- Document setup thoroughly (encoders, environment variables, dependency installation, service provisioning) and provide automation scripts. + +## Future Potential Extensions +- Plug-in architecture for alternative vector databases or document stores. +- Streaming ingestion workers that watch queues or webhooks instead of filesystem batches. +- UI or API gateway to expose memory search and curation to downstream agents or humans. +- Analytics dashboards that summarize namespace health, expiry cadence, and consolidation outcomes. diff --git a/RECOMMENDATIONS.md b/RECOMMENDATIONS.md new file mode 100644 index 0000000..fce74de --- /dev/null +++ b/RECOMMENDATIONS.md @@ -0,0 +1,33 @@ +# Recommendations + +## Stabilize the Foundation +- Refactor `MeshMind` initialization so that graph and OpenAI dependencies are optional or injectable, enabling local development without Memgraph. +- Provide a bootstrap module (or CLI option) that registers default embedding encoders and entity models before extraction. +- Update the OpenAI integration to use the current SDK response objects and add robust error handling for rate limits and API failures. 
+- Introduce optional dependency guards across the package; defer importing `tiktoken`, `mgclient`, and `celery` until the functionality is invoked. +- Align Python version support with dependency availability (target 3.11/3.12 until 3.13 is validated). + +## Restore Promised Functionality +- Implement entity and predicate registries with persistence hooks so that README workflows (`register_entity`, `add_predicate`, `add_triplet`) become real. +- Add mid-level CRUD methods (`add_memory`, `update_memory`, `delete_memory`) on `MeshMind` that delegate to `MemoryManager` and ensure stored records round-trip correctly. +- Extend the storage pipeline to create relationships via `GraphDriver.upsert_edge`, including handling for subject/object lookups by name or UUID. +- Build out retrieval helpers for vector-only, regex, and exact-match queries, and optionally integrate reranking via the LLM client. + +## Improve Developer Experience +- Replace the placeholder `docker-compose.yml` with services for Memgraph and Redis (or document how to run them separately). +- Ship sample scripts that register encoders, seed demo data, and demonstrate retrieval end-to-end. +- Add Makefile tasks for running tests, linting, type checking, and starting Celery workers. +- Modernize the pytest suite to rely on fixtures that do not require live services and that mirror the new OpenAI SDK APIs. +- Set up continuous integration to run unit tests and static checks on every change. + +## Documentation & Onboarding +- Promote `NEW_README.md` to `README.md` after validation and archive the legacy document for historical reference. +- Document configuration and dependency expectations in a dedicated setup guide linked from the README. +- Expand `SOT.md` with diagrams or tables that map modules to workflows once the architecture stabilizes. +- Provide troubleshooting steps for common failures (missing encoder registration, mgclient import errors, OpenAI authentication). 
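+
+The optional-dependency guard recommended above can be sketched roughly as follows. This is a minimal illustration, not the package's actual API; `optional_import` and the error wording are assumptions:
+
+```python
+from importlib import import_module
+
+
+def optional_import(module_name: str, feature: str):
+    """Import an optional dependency on first use instead of at module load time."""
+    try:
+        return import_module(module_name)
+    except ImportError as exc:
+        # Surface an actionable message only when the feature is actually invoked.
+        raise ImportError(
+            f"The '{feature}' feature requires the optional dependency "
+            f"'{module_name}'. Install it with: pip install {module_name}"
+        ) from exc
+
+
+# Example: a compression helper would call the guard lazily rather than
+# importing tiktoken at the top of the module.
+def count_tokens(text: str) -> int:
+    tiktoken = optional_import("tiktoken", "compression")
+    return len(tiktoken.get_encoding("cl100k_base").encode(text))
+```
+
+With this pattern, importing the package never fails; only calling `count_tokens` without `tiktoken` installed raises the descriptive `ImportError`.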
+ +## Future Enhancements +- Explore alternative storage backends (e.g., Neo4j driver, SQLite) for environments without Memgraph. +- Offer a lightweight REST or gRPC API to interact with memories programmatically. +- Instrument maintenance jobs with metrics and logging so operators can observe expiry/consolidation outcomes. +- Investigate incremental ingestion pipelines (message queues, streaming connectors) for real-time memory updates. diff --git a/SOT.md b/SOT.md new file mode 100644 index 0000000..4b21d05 --- /dev/null +++ b/SOT.md @@ -0,0 +1,127 @@ +# MeshMind Source of Truth + +This document captures the current architecture, modules, and operational expectations for MeshMind. Update it whenever code structure or workflows change so new contributors can ramp up quickly. + +## Repository Layout +``` +meshmind/ +├── api/ # MemoryManager CRUD adapter +├── cli/ # CLI entry point and ingest command +├── client.py # High-level MeshMind façade +├── core/ # Config, types, embeddings, similarity, utilities +├── db/ # Graph driver abstractions (Memgraph) +├── models/ # Entity/predicate registries (not yet integrated) +├── pipeline/ # Extraction, preprocessing, storage, maintenance steps +├── retrieval/ # In-memory search strategies and filters +├── tasks/ # Celery wiring and scheduled jobs +├── tests/ # Pytest suites (require extensive monkeypatching) +└── examples/ # Example extraction/preprocess/store script +``` +Supporting files include: +- `pyproject.toml`: project metadata (declares Python >=3.13, which is aspirational). +- `docker-compose.yml`: placeholder with no services defined. +- `Makefile`: minimal development targets (currently none for testing). +- Documentation artifacts (`PROJECT.md`, `PLAN.md`, etc.). + +## Configuration (`meshmind/core/config.py`) +- Loads environment variables for Memgraph (`MEMGRAPH_URI`, `MEMGRAPH_USERNAME`, `MEMGRAPH_PASSWORD`), Redis (`REDIS_URL`), OpenAI (`OPENAI_API_KEY`), and default embedding model (`EMBEDDING_MODEL`). 
+- Uses `python-dotenv` if available to load `.env` files at import time. +- Exposes a module-level `settings` object consumed by clients, drivers, and tasks. + +## Core Data Models (`meshmind/core/types.py`) +- `Memory`: Pydantic model representing a knowledge record. Fields include `uuid`, `namespace`, `name`, `entity_label`, optional `embedding`, `metadata`, `reference_time`, `created_at`, `updated_at`, `importance`, and `ttl_seconds`. +- `Triplet`: Subject–predicate–object relationship with namespace, entity label, metadata, and optional reference time. Not used in current flows. +- `SearchConfig`: Retrieval configuration (encoder name, `top_k`, `rerank_k`, metadata filters, hybrid weights). + +## Client (`meshmind/client.py`) +- `MeshMind` constructor wires: + - `OpenAI()` as the default LLM client (fails if OpenAI SDK is missing or API key absent). + - Embedding model name from `settings.EMBEDDING_MODEL`. + - `MemgraphDriver` instantiated with configured URI/credentials (raises `ImportError` if `mgclient` is unavailable). +- Provides high-level helpers: + - `extract_memories` delegates to `pipeline.extract.extract_memories` (requires an encoder in `EncoderRegistry`). + - `deduplicate`, `score_importance`, `compress` delegate to `pipeline.preprocess`. + - `store_memories` delegates to `pipeline.store.store_memories`. +- Does **not** expose registration, triplet storage, or CRUD methods promised in the legacy README. + +## Embeddings & Utilities (`meshmind/core/embeddings.py`, `meshmind/core/utils.py`) +- `EncoderRegistry`: Class-level map from string key to encoder instance. Call `register(name, encoder)` before extraction or hybrid search. +- `OpenAIEmbeddingEncoder`: Wraps the OpenAI Embeddings API with retry logic but assumes dictionary-style responses (`response['data']`), which is incompatible with current SDK objects. +- `SentenceTransformerEncoder`: Provides local embedding support via `sentence-transformers`. 
+- `meshmind.core.utils`: Supplies UUID generation, timestamp helpers, hashing, and token counting. Imports `tiktoken` at module load, so missing the package raises immediately. + +## Database Layer (`meshmind/db`) +- `base_driver.py`: Abstract `GraphDriver` defining `upsert_entity`, `upsert_edge`, `find`, and `delete` signatures. +- `memgraph_driver.py`: + - Imports `mgclient` and opens a Bolt connection on instantiation. + - Implements node upserts (MERGE by `uuid`), edge upserts (MERGE by `uuid`), arbitrary Cypher `find`, deletion, and a naive Python-based vector search over stored embeddings. + - Requires `mgclient`; otherwise constructing the driver raises `ImportError`. + +## Pipeline Modules (`meshmind/pipeline`) +1. **Extraction (`extract.py`)** + - Builds an OpenAI Responses API call with function-calling against the `Memory` JSON schema. + - Validates `entity_label` against supplied `entity_types` (string comparison only) and populates embeddings via `EncoderRegistry`. +2. **Preprocess (`preprocess.py`)** + - `deduplicate`: Removes duplicates by name and optionally by cosine similarity when embeddings exist. + - `score_importance`: Assigns a default importance of `1.0` when missing. + - `compress`: Delegates to `pipeline.compress.compress_memories` and falls back on errors. +3. **Compress (`compress.py`)** + - Uses `tiktoken` to truncate `metadata['content']` to a token budget (requires the package). +4. **Consolidate (`consolidate.py`)** + - Groups memories by name and selects the highest-importance entry (no persistence built in). +5. **Expire (`expire.py`)** + - Deletes memories whose `created_at + ttl_seconds` is in the past using `MemoryManager`. +6. **Store (`store.py`)** + - Iterates memories and calls `GraphDriver.upsert_entity`. Relationships are not touched. + +## Retrieval (`meshmind/retrieval`) +- `filters.py`: Filter helpers by namespace, entity labels, and metadata equality. 
+- `bm25.py`: TF-IDF vectorizer + cosine similarity (scikit-learn) used as a lexical scorer. +- `fuzzy.py`: RapidFuzz WRatio scoring for fuzzy name matching. +- `hybrid.py`: Combines query embeddings (from registered encoder) with BM25 scores using configurable weights. +- `search.py`: Dispatchers for hybrid (`search`), lexical (`search_bm25`), and fuzzy (`search_fuzzy`) retrieval. Operate on caller-provided lists of `Memory` objects; no direct graph querying. + +## CLI (`meshmind/cli`) +- `__main__.py`: Defines the `meshmind` CLI with an `ingest` subcommand. +- `ingest.py`: Walks files/directories, reads text contents, runs extraction + preprocessing + storage. Hardcodes `entity_types=[Memory]` and assumes an encoder is already registered. + +## Tasks (`meshmind/tasks`) +- `celery_app.py`: Creates a Celery app when the library is installed; otherwise exposes a no-op shim. +- `scheduled.py`: + - Attempts to instantiate `MemgraphDriver` and `MemoryManager` at import time, falling back to `None` when dependencies fail. + - Configures Celery beat schedules for expiry (daily), consolidation (every 6 hours), and compression (every 12 hours). + - Defines tasks that operate on the global `manager` instance; if initialization failed they return empty results. + +## API Adapter (`meshmind/api/memory_manager.py`) +- Wraps a graph driver to provide CRUD helpers (`add_memory`, `update_memory`, `delete_memory`, `get_memory`, `list_memories`). +- Converts Pydantic objects to dicts via `memory.dict(exclude_none=True)` with fallback to `__dict__`. +- Currently the primary way to list memories for retrieval; not exposed through the CLI or `MeshMind` convenience methods. + +## Models (`meshmind/models/registry.py`) +- `EntityRegistry` and `PredicateRegistry` store registered models and allowed relationship labels. +- No production code writes to these registries yet; integrating them is part of the future roadmap. 
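+
+As a rough illustration of the registry pattern these classes could grow into once integrated (the `register` and `is_allowed` method names are assumptions, not the current API):
+
+```python
+class PredicateRegistry:
+    """Track relationship labels that the pipeline is allowed to persist."""
+
+    _allowed: set[str] = set()
+
+    @classmethod
+    def register(cls, predicate: str) -> None:
+        # Normalise casing so "knows" and "KNOWS" name the same predicate.
+        cls._allowed.add(predicate.upper())
+
+    @classmethod
+    def is_allowed(cls, predicate: str) -> bool:
+        return predicate.upper() in cls._allowed
+```
+
+A storage pipeline wired to this registry would reject or flag triplets whose predicate has not been registered, which is exactly the enforcement step the current code never performs.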
+ +## Examples & Tests +- `examples/extract_preprocess_store_example.py`: Demonstrates extraction and storage using `MeshMind`. Requires valid OpenAI credentials and Memgraph. +- Tests under `meshmind/tests` cover extraction, preprocessing, driver behavior, retrieval, similarity, and maintenance tasks. They rely on monkeypatching dummy encoders, OpenAI clients, and mgclient modules. Some tests assume attributes (`Memory.pre_init`) that are not defined in production code, so the suite will fail until updated. + +## External Dependencies +- **OpenAI SDK**: Required for extraction and embeddings. Update to latest `openai` package and adjust code accordingly. +- **mgclient**: Required for Memgraph persistence; missing package prevents `MeshMind` construction. +- **tiktoken**: Required for compression and utility token counting. Currently imported eagerly. +- **scikit-learn**, **rapidfuzz**, **numpy**: Support retrieval algorithms. +- **sentence-transformers** (optional): Alternative embedding encoder. +- **celery** and **redis** (optional): Required for scheduled maintenance tasks. + +## Operational Caveats +- No encoder is registered by default; failing to register one causes extraction and hybrid search to raise `KeyError`. +- `MeshMind` cannot be instantiated in environments lacking Memgraph or mgclient, limiting portability. +- Relationship data is not persisted, so graph analyses beyond isolated nodes are impossible. +- Tests and CLI commands assume manual setup of encoders and environment variables. +- `docker-compose.yml` does not start required services; developers must provision Memgraph and Redis separately. + +## Related Documentation +- `PROJECT.md`: Architectural summary, capability matrix, and roadmap themes. +- `PLAN.md`: Actionable next steps to close gaps. +- `DISCREPANCIES.md`: Detailed comparison between the legacy README and actual implementation. +- `RECOMMENDATIONS.md`: Suggested improvements ranked by impact. 
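+
+The naive Python-based vector search noted in the database layer amounts to a cosine-similarity scan over stored embeddings. A self-contained sketch under that assumption (function names and the `(uuid, embedding)` row shape are illustrative, not `MemgraphDriver`'s actual signature):
+
+```python
+import math
+
+
+def cosine_similarity(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = math.sqrt(sum(x * x for x in a))
+    norm_b = math.sqrt(sum(y * y for y in b))
+    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
+
+
+def naive_vector_search(
+    query: list[float],
+    rows: list[tuple[str, list[float]]],
+    top_k: int = 5,
+) -> list[str]:
+    """Score every stored embedding in Python and return the best-matching uuids."""
+    scored = sorted(rows, key=lambda row: cosine_similarity(query, row[1]), reverse=True)
+    return [uuid for uuid, _ in scored[:top_k]]
+```
+
+Scanning every row in Python is O(n) per query, which is why the roadmap calls for driver-backed vector search rather than this helper.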
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..306d562
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,19 @@
+# TODO
+
+- [ ] Implement dependency guards and lazy imports for optional packages (`mgclient`, `tiktoken`, `celery`, `sentence-transformers`).
+- [ ] Add a bootstrap helper for default encoder registration and call it from the CLI.
+- [ ] Update the OpenAI encoder implementation to align with the latest SDK responses and retry semantics.
+- [ ] Improve configuration guidance and automation for environment variables and service setup.
+- [ ] Wire `EntityRegistry` and `PredicateRegistry` into the storage pipeline and client.
+- [ ] Implement CRUD and triplet methods on `MeshMind`, including relationship persistence in `GraphDriver`.
+- [ ] Refresh examples to cover relationship-aware ingestion and retrieval flows.
+- [ ] Extend the retrieval module with vector-only, regex, exact-match, and optional LLM rerank search helpers.
+- [ ] Harden Celery maintenance tasks to initialize drivers lazily and persist consolidation results.
+- [ ] Replace constant importance scoring with a data-driven or LLM-assisted heuristic.
+- [ ] Modernize the pytest suites and add fixtures to run without external services.
+- [ ] Expand the Makefile and add CI workflows for linting, testing, and type checks.
+- [ ] Document or provision local Memgraph and Redis services (e.g., via docker-compose) for onboarding.
+- [ ] Abstract `GraphDriver` to support alternative storage backends (Neo4j, in-memory, SQLite prototype).
+- [ ] Add service interfaces (REST/gRPC) for ingestion and retrieval.
+- [ ] Introduce observability (logging, metrics) for ingestion and maintenance pipelines.
+- [ ] Promote `NEW_README.md` to `README.md`, archive the legacy README, and keep the diagrams and maps in `SOT.md` current.
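+
+The lazy driver initialization called for above can be sketched as follows; the `Lazy` wrapper and its `get` method are illustrative names under the assumption that tasks construct their resources at run time:
+
+```python
+from typing import Callable, Generic, Optional, TypeVar
+
+T = TypeVar("T")
+
+
+class Lazy(Generic[T]):
+    """Defer construction of a heavyweight resource (e.g. a graph driver) until first use."""
+
+    def __init__(self, factory: Callable[[], T]) -> None:
+        self._factory = factory
+        self._instance: Optional[T] = None
+
+    def get(self) -> T:
+        if self._instance is None:
+            # Constructed here, at task run time, instead of at module import
+            # time, so importing the tasks module no longer requires mgclient.
+            self._instance = self._factory()
+        return self._instance
+```
+
+A Celery task would then call `get()` inside the task body, so a missing dependency fails that task with a clear error instead of leaving a global `None` manager behind at import time.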