Restructure project as monorepo. #2111

Merged
dworthen merged 12 commits into v3/main from monorepo
Nov 4, 2025

Conversation

@dworthen
Contributor

Restructure codebase as a monorepo project.

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

@dworthen dworthen requested a review from a team as a code owner October 22, 2025 17:31
Comment thread packages/graphrag/README.md
Comment thread packages/graphrag/graphrag/__init__.py
Comment thread packages/graphrag/pyproject.toml
@AlonsoGuevara AlonsoGuevara requested a review from Copilot October 30, 2025 01:34
Contributor

Copilot AI left a comment

Pull Request Overview

This PR restructures the codebase into a monorepo by extracting the Factory pattern into a separate package (graphrag-factory) and updating all import references. The Factory class is enhanced with singleton support and improved error messages.

Key Changes:

  • Extracted Factory class into standalone graphrag-factory package with enhanced singleton/transient service scope support
  • Updated all Factory imports from graphrag.factory.factory to graphrag_factory
  • Added --all-packages flags to CI/CD workflows to support monorepo structure
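For readers unfamiliar with the pattern, here is a minimal sketch of a factory with singleton and transient scopes. It is not the code in graphrag-factory: the `register`/`create` method names, the `scope` parameter, and the `MemoryCache` class are assumptions made purely for illustration.

```python
from collections.abc import Callable
from typing import Any, Generic, TypeVar

T = TypeVar("T")


class Factory(Generic[T]):
    """Illustrative registry mapping strategy names to initializers."""

    def __init__(self) -> None:
        # strategy name -> (initializer, scope)
        self._services: dict[str, tuple[Callable[..., T], str]] = {}
        # instances already created for singleton-scoped strategies
        self._singletons: dict[str, T] = {}

    def register(
        self, strategy: str, initializer: Callable[..., T], scope: str = "transient"
    ) -> None:
        if scope not in ("transient", "singleton"):
            msg = f"Unknown scope '{scope}'; expected 'transient' or 'singleton'."
            raise ValueError(msg)
        self._services[strategy] = (initializer, scope)

    def create(self, strategy: str, **kwargs: Any) -> T:
        if strategy not in self._services:
            registered = ", ".join(sorted(self._services)) or "<none>"
            msg = f"Strategy '{strategy}' is not registered. Registered: {registered}"
            raise ValueError(msg)
        initializer, scope = self._services[strategy]
        if scope == "singleton":
            # Build once, then reuse; kwargs on later calls are ignored.
            if strategy not in self._singletons:
                self._singletons[strategy] = initializer(**kwargs)
            return self._singletons[strategy]
        # Transient: a fresh instance on every call.
        return initializer(**kwargs)


class MemoryCache:
    """Hypothetical service used only to exercise the factory."""

    def __init__(self, name: str = "default") -> None:
        self.name = name


cache_factory: Factory[MemoryCache] = Factory()
cache_factory.register("memory", MemoryCache)
cache_factory.register("shared", MemoryCache, scope="singleton")
```

With this sketch, `cache_factory.create("memory")` returns a new object each time, while `cache_factory.create("shared")` always returns the same instance. Asking for an unregistered strategy raises a `ValueError` that lists the registered names.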

Reviewed Changes

Copilot reviewed 26 out of 403 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/graphrag-factory/pyproject.toml | New package configuration for extracted Factory module |
| packages/graphrag-factory/graphrag_factory/factory.py | Enhanced Factory class with singleton/transient service scopes |
| packages/graphrag-factory/graphrag_factory/__init__.py | Package initialization exposing Factory class |
| packages/graphrag-factory/README.md | Documentation for the new Factory package with usage examples |
| packages/graphrag/graphrag/logger/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/providers/litellm/services/retry/retry_factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/language_model/providers/litellm/services/rate_limiter/rate_limiter_factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/index/input/factory.py | Updated Factory import to use new package |
| packages/graphrag/graphrag/cache/factory.py | Updated Factory import to use new package |
| packages/graphrag/README.md | New README for graphrag package within monorepo |
| .vscode/launch.json | Enhanced debug configuration with user input prompts |
| .github/workflows/*.yml | Updated CI/CD workflows to use --all-packages flag |
| docs/examples_notebooks/*.ipynb | Formatting cleanup of import statements |

Comment thread packages/graphrag-factory/graphrag_factory/factory.py Outdated
Comment thread .github/workflows/python-publish.yml
@AlonsoGuevara
Collaborator

Please include an architecture diagram in the documentation illustrating this change, along with a short explanation of what each submodule is responsible for. This will help establish clear guardrails for future development.

I might be misunderstanding, but from what I see, this change introduces two modules: Factories and GraphRAG. Could you clarify the role of the Factory module? Specifically, what’s the rationale for treating it as a standalone logical unit worth exposing independently?

@dworthen
Contributor Author

> Please include an architecture diagram in the documentation illustrating this change, along with a short explanation of what each submodule is responsible for. This will help establish clear guardrails for future development.

Hey @AlonsoGuevara, the monorepo structure does not change the system architecture or public API surface of GraphRAG. The workflows and all the pieces still fit together and work as they did before. The monorepo structure just pulls some code out into separate, independent PyPI packages so that they can be used in isolation in our other projects. Our team's GraphRAG Monorepo loop page discusses the goals, principles, and modules to pull out. I think that might answer the questions about guardrails and future development plans, but let me know if I am misunderstanding that piece.

> I might be misunderstanding, but from what I see, this change introduces two modules: Factories and GraphRAG. Could you clarify the role of the Factory module? Specifically, what’s the rationale for treating it as a standalone logical unit worth exposing independently?

So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class. Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

@andresmor-ms
Contributor

> So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class.

Just so that I understand correctly, what you are describing here is to have something like:

flowchart TD
    A[graphrag] -->|depends on| B[graphrag-vectorstore]
    B --> |depends on| C[graphrag-factory]
    A -->|depends on| C

> Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

I kind of don't like the idea of exposing a package that only has one file in it, and we would need to publish this to PyPI so that it can be used as a dependency in other packages.

Also, would this mean that, for example, if I had my own custom implementation of a vector store, I would need to first register that vector store in some factory in graphrag-vectorstore and then pass that to graphrag-core?

What do you think about not having a graphrag-factory package and letting graphrag-core manage the factories, so that we don't have that dependency and only graphrag-core depends on the different packages? Since graphrag-core will depend on the different vectorstore, cache, etc. packages, it will have access to the ABCs or Protocols defined there, so it would be able to create and manage all the factories it needs and register default implementations, without having to copy-paste the factories into every module.

Let me know what you think :)
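The dependency direction proposed above could look roughly like the following sketch, written as a single file for brevity. Everything here is hypothetical: the `VectorStore` Protocol, `InMemoryVectorStore`, and the `register_vector_store`/`create_vector_store` helpers are illustrative names, not existing GraphRAG APIs.

```python
from collections.abc import Callable
from typing import Protocol, runtime_checkable


# --- Would live in a sub-package such as graphrag-vectorstore (hypothetical) ---
@runtime_checkable
class VectorStore(Protocol):
    """Interface only; in this layout the sub-package ships no factory."""

    def add(self, doc_id: str, vector: list[float]) -> None: ...

    def search(self, query: list[float], k: int) -> list[str]: ...


class InMemoryVectorStore:
    """Default implementation shipped alongside the Protocol."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._vectors[doc_id] = vector

    def search(self, query: list[float], k: int) -> list[str]:
        # Rank stored documents by dot product with the query, highest first.
        def score(doc_id: str) -> float:
            return sum(q * v for q, v in zip(query, self._vectors[doc_id]))

        return sorted(self._vectors, key=score, reverse=True)[:k]


# --- Would live in graphrag core, which depends on the sub-package ---
_vector_stores: dict[str, Callable[..., VectorStore]] = {}


def register_vector_store(name: str, initializer: Callable[..., VectorStore]) -> None:
    _vector_stores[name] = initializer


def create_vector_store(name: str, **kwargs) -> VectorStore:
    if name not in _vector_stores:
        raise ValueError(f"Vector store '{name}' is not registered.")
    return _vector_stores[name](**kwargs)


# Core registers the defaults; user code would register custom stores the same way.
register_vector_store("memory", InMemoryVectorStore)
```

In this layout the sub-package knows nothing about factories, and core is the only place where names are mapped to implementations.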

@dworthen
Contributor Author

> What do you think about not having a graphrag-factory package and letting graphrag-core manage the factories, so that we don't have that dependency and only graphrag-core depends on the different packages? Since graphrag-core will depend on the different vectorstore, cache, etc. packages, it will have access to the ABCs or Protocols defined there, so it would be able to create and manage all the factories it needs and register default implementations, without having to copy-paste the factories into every module.

Fair point. GraphRAG core can and will manage some of the factories. The one other package I know of that will need a factory implementation is graphrag-llm. The language model config contains configuration for subservices such as retries, rate limiting, etc. That means graphrag-llm encapsulates service definitions (ABCs), service implementations, and the factories for managing those implementations. Even if the other packages don't contain factories, that still leaves at least two packages that need a factory implementation: graphrag and graphrag-llm. In my early monorepo explorations, graphrag-llm was one of the first packages I started to pull out of graphrag, and I immediately ran into a situation where I needed to share the factory across packages, so I pulled it out into its own package. I included it here as the second package since it was simple and easy to grok, but perhaps I should have included graphrag-config as the second package.
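To make the graphrag-llm case concrete, here is a rough sketch of a package that holds a service definition, its implementations, and the lookup that picks one from a nested model config. It is an assumption-laden illustration, not the actual graphrag-llm code: `Retry`, `NoRetry`, `FixedRetry`, `RetryConfig`, `ModelConfig`, and `create_retry` are all hypothetical names.

```python
from abc import ABC, abstractmethod
from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any


class Retry(ABC):
    """Service definition (hypothetical)."""

    @abstractmethod
    def run(self, fn: Callable[[], Any]) -> Any: ...


class NoRetry(Retry):
    def run(self, fn: Callable[[], Any]) -> Any:
        return fn()


class FixedRetry(Retry):
    def __init__(self, max_retries: int = 3) -> None:
        if max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        self.max_retries = max_retries

    def run(self, fn: Callable[[], Any]) -> Any:
        retries_left = self.max_retries
        while True:
            try:
                return fn()
            except Exception:
                # Re-raise the last error once the retry budget is spent.
                if retries_left == 0:
                    raise
                retries_left -= 1


# Strategy name -> initializer; a shared base Factory class could replace this dict.
RETRY_STRATEGIES: dict[str, Callable[..., Retry]] = {
    "none": NoRetry,
    "fixed": FixedRetry,
}


@dataclass
class RetryConfig:
    type: str = "none"
    options: dict[str, Any] = field(default_factory=dict)


@dataclass
class ModelConfig:
    model: str
    retry: RetryConfig = field(default_factory=RetryConfig)


def create_retry(config: RetryConfig) -> Retry:
    if config.type not in RETRY_STRATEGIES:
        raise ValueError(f"Retry strategy '{config.type}' is not registered.")
    return RETRY_STRATEGIES[config.type](**config.options)
```

Because the retry sub-config sits inside the model config, the code that resolves it naturally lives in the same package as the model config, which is why that package needs its own factory machinery.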

> > So far there are only two packages but there will be more packages pulled out from graphrag core in future PRs that I am working on. I did two packages instead of one in this PR to give a better idea of the monorepo structure and how it will look as we add more packages. Factory was chosen as the first package to pull out from core because it is simple with minimal impact and will need to exist as a package as other packages we pull out (cache, vectorstore, etc) will rely on the base factory class.

> Just so that I understand correctly, what you are describing here is to have something like:

> flowchart TD
>     A[graphrag] -->|depends on| B[graphrag-vectorstore]
>     B --> |depends on| C[graphrag-factory]
>     A -->|depends on| C

> > Let me know if you disagree with factory needing to be its own package and what alternate approach may be better suited. One such alternative approach may be to just copy the factory class code to packages that need it.

Not exactly. I did a poor job of listing out packages. My list was merely a hypothetical list of packages that may need a factory but I agree with your point that some of this management should be done by graphrag core. I should have listed out graphrag-llm.

> I kind of don't like the idea of exposing a package that only has one file in it, and we would need to publish this to PyPI so that it can be used as a dependency in other packages.

Why not? GitHub Actions will manage publishing to PyPI, so that's not problematic. Another approach would be to not roll our own DI container logic and instead lean on an existing library like https://pypi.org/project/dependency-injector/, but that is a bigger lift and there has been hesitation to do this in the past.

> Also, would this mean that, for example, if I had my own custom implementation of a vector store, I would need to first register that vector store in some factory in graphrag-vectorstore and then pass that to graphrag-core?

I may be misunderstanding this point, but this is true regardless of where the factory lives. Whether the factory is in graphrag core or graphrag-vectorstore, users need to register custom vector stores with the factory under custom strategy names in order to use them in graphrag. I don't think the extensibility model changes based on where the factories are defined.

If the concern is around what gets imported, graphrag core can and will still manage a public API surface, so users will not need to write from graphrag-vectorstore import VectorStoreFactory even if that is where the factory is defined. As an example, we don't expect end users to directly import python-dotenv for managing environment variables. Instead, we encapsulate the functionality of third-party libraries in our own public API surface (load_config in this case). To extend that example, I have pulled the config loading logic (based on your work in benchmark-qed) out into graphrag-config (not in this PR, as I am trying to keep these PRs small and manageable), but I did not update our docs or sample notebooks to use from graphrag-config import load_config. Instead, the sample notebooks still show from graphrag.config.load_config import load_config, as that function still exists as part of the graphrag core API surface; it just now sits on top of the new graphrag-config package. The same approaches we take to encapsulate third-party dependencies can be used to encapsulate our own packages in order to maintain a public API that works. Please let me know if I completely misunderstood this last point.

I hope I was able to address your concerns in a reasonable manner. In hindsight, I wish I had kept this PR more focused and included only one package, graphrag. If the blocker to merging is graphrag-factory, then I am super-duper happy to take it out and revisit the need for that package. The primary goal of this PR was to establish the monorepo folder structure and the CI/CD processes around managing a monorepo.

@dworthen dworthen merged commit 6192692 into v3/main Nov 4, 2025
12 checks passed
@dworthen dworthen deleted the monorepo branch November 4, 2025 17:52
dworthen added a commit that referenced this pull request Jan 27, 2026
* Remove graph embedding and UMAP (#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (#2082)

* reduce schema fields (#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (#2095)

* Sort deps alpha

* Remove multi search (#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (#2133)

* Init command asks for models (#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (#2127)

* Add graphrag-storage.

* Python update (3.13) (#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (#2154)

* Issue #2004 fix (#2159)

* fix issue #2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing #2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161)

* fix issue #860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (#2174)

* Update documentation for v3 release (#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (#2181)

* Migration update (#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
takanori-ugai added a commit to takanori-ugai/graphrag that referenced this pull request Feb 13, 2026
* Pin pandas (microsoft#2179)

* Release v2.7.1 (microsoft#2186)

* Release v2.7.1 (microsoft#2187)

* Update Python publish workflow for PyPI (microsoft#2188)

Debug publish workflow

* V3/main (microsoft#2190)

* Remove graph embedding and UMAP (microsoft#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (microsoft#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (microsoft#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (microsoft#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (microsoft#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (microsoft#2082)

* reduce schema fields (microsoft#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (microsoft#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (microsoft#2095)

* Sort deps alpha

* Remove multi search (microsoft#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (microsoft#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (microsoft#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (microsoft#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (microsoft#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (microsoft#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (microsoft#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (microsoft#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (microsoft#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (microsoft#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (microsoft#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (microsoft#2133)

* Init command asks for models (microsoft#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (microsoft#2127)

* Add graphrag-storage.

* Python update (3.13) (microsoft#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app
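To illustrate the difference between a loose `>=` constraint and the compatible release operator `~=` mentioned above, here is a minimal sketch. It handles plain numeric releases only (no pre-releases, epochs, or local versions), and the helper names are hypothetical, not part of any packaging tool.

```python
def _parse(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '1.3.3' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, spec: str) -> bool:
    """Return True if `candidate` matches `~=spec` (numeric releases only)."""
    lower = _parse(spec)
    if len(lower) < 2:
        raise ValueError("~= needs at least two release segments")
    # Drop the last segment and bump the one before it: 1.3.3 -> 1.4, 2.2 -> 3
    upper = lower[:-2] + (lower[-2] + 1,)
    cand = _parse(candidate)
    width = max(len(lower), len(upper), len(cand))

    def pad(release: tuple[int, ...]) -> tuple[int, ...]:
        # Pad with zeros so that 1.4 and 1.4.0 compare as equal
        return release + (0,) * (width - len(release))

    return pad(lower) <= pad(cand) < pad(upper)
```

Under this sketch, `~=1.3.3` accepts 1.3.9 but rejects 1.4.0, while `>=1.3.3` would accept both, which is why the compatible release form gives tighter control over what a lock file can resolve to.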

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (microsoft#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (microsoft#2154)

* Issue microsoft#2004 fix (microsoft#2159)

* fix issue microsoft#2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing microsoft#2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161)

* fix issue microsoft#860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (microsoft#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (microsoft#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
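As a rough illustration of the dot-notation lookup described above, a utility of this kind could look like the following. The signature and the choice to raise `KeyError` on a missing segment are assumptions; the actual `get_property` in graphrag-input may differ.

```python
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Walk nested dictionaries using a dot-separated path (sketch only)."""
    current: Any = data
    for key in path.split("."):
        # Assumption: a missing or non-dict segment is an error, not a default
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Property '{path}' not found")
        current = current[key]
    return current
```

For example, `get_property({"meta": {"author": {"name": "Ada"}}}, "meta.author.name")` walks three levels and returns the innermost value.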

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (microsoft#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (microsoft#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (microsoft#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (microsoft#2174)

* Update documentation for v3 release (microsoft#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (microsoft#2181)

* Migration update (microsoft#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>

* Release v3.0.0 (microsoft#2191)

* Fix deps (microsoft#2193)

* fix missing project urls

* fix missing deps.

* Release v3.0.1 (microsoft#2195)

* add TableProvider to enable future row-by-row streaming (microsoft#2189)

* write dataframe

* changed some workflows

* 1a

* add fixed files

* add versioning

* add patch and remove utility

* pr changes

* Python 3.13 (microsoft#2208)

* make graphrag-llm support 3.13

* Semver

---------

Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>

* update vector store example. (microsoft#2202)

* Table factory (microsoft#2214)

* Add table provider factory

* Semver

* Remove unnecessary response format check. (microsoft#2213)

- Fixes: microsoft#2203

* add csv table provider (microsoft#2215)

* add csv table provider

* add in provider

* add semver

* change list_tables to list()

* Add DataReader class for typed dataframe loading (microsoft#2220)

* Add DataReader class for typed dataframe loading

Introduce DataReader that wraps TableProvider and applies type coercion
functions when loading dataframes from weakly-typed formats (e.g. CSV).

- Add DataReader class with methods for each table type: entities,
  relationships, communities, community_reports, covariates, text_units,
  and documents
- Add typed loading functions in dfs.py for community_reports, covariates,
  text_units, and documents (entities, relationships, communities already
  existed)
- Integrate DataReader into all 17 indexing workflows replacing raw
  read_dataframe calls
- Integrate DataReader into CLI query's _resolve_output_files for typed
  loading across all search types (global, local, drift, basic)
- Export DataReader from data_model package __init__
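The idea of wrapping a table provider and coercing types on load can be sketched as below. Everything here is a simplified assumption: the real `DataReader` and `TableProvider` work with dataframes and have their own method signatures, while this sketch uses synchronous calls, in-memory CSV text, plain dictionaries, and a made-up `|` separator for list columns.

```python
import csv
import io
from typing import Any

Row = dict[str, Any]


class CsvTableProvider:
    """Hypothetical provider: every value comes back as a string."""

    def __init__(self, tables: dict[str, str]) -> None:
        self._tables = tables  # table name -> CSV text

    def read(self, name: str) -> list[Row]:
        return list(csv.DictReader(io.StringIO(self._tables[name])))


def coerce_entities(rows: list[Row]) -> list[Row]:
    """Restore the types a weakly-typed format loses (int and list columns)."""
    return [
        {
            **row,
            "degree": int(row["degree"]),
            "text_unit_ids": row["text_unit_ids"].split("|"),
        }
        for row in rows
    ]


class DataReader:
    """Wraps a provider and applies the matching coercion per table."""

    def __init__(self, provider: CsvTableProvider) -> None:
        self._provider = provider

    def entities(self) -> list[Row]:
        return coerce_entities(self._provider.read("entities"))
```

Callers then ask for `reader.entities()` rather than reading the raw table, so the coercion lives in one place instead of being repeated in every workflow.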

* Fix column check

* Add notebook example support for each package (microsoft#2205)

* add notebook example support for each package

* add notebook example support for each package

* semversioner change

* feedback implemented for notebooks

* feedback implemented for notebooks

* feedback implemented for notebooks

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Streamline workflows (microsoft#2225)

* Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows

Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths.
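A rough sketch of the field initialization described above, using plain dictionaries instead of the DataFrame the workflows operate on. The function name and the empty-dict default for `raw_data` are assumptions for illustration.

```python
from typing import Any


def prepare_documents(documents: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Apply the core document field setup in one place (sketch only)."""
    prepared = []
    for index, doc in enumerate(documents):
        prepared.append(
            {
                **doc,
                "id": str(doc["id"]),                 # id string cast
                "human_readable_id": index,           # positional index
                "raw_data": doc.get("raw_data") or {},  # assumed default: {}
            }
        )
    return prepared
```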

* Remove overzealous input document assignment

* Semver

* Format

* Add async iterator support to InputReader and use in load workflows (microsoft#2226)

* Add async iterator support to InputReader and use in load workflows

InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction.
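The arrangement described above can be sketched as follows. The names `__aiter__`, `_iterate_files()`, `read_files()`, and `dataclasses.asdict` come from the commit message; the constructor, the in-memory file mapping, the `TextDocument` fields, and the `load_rows` helper are assumptions standing in for the real storage-backed reader.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class TextDocument:
    id: str
    title: str
    text: str


class InputReader:
    def __init__(self, files: dict[str, str]) -> None:
        self._files = files  # hypothetical in-memory stand-in for storage

    async def _iterate_files(self) -> AsyncIterator[TextDocument]:
        """Core iteration logic: yield one document at a time."""
        for index, (name, text) in enumerate(self._files.items()):
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield TextDocument(id=str(index), title=name, text=text)

    def __aiter__(self) -> AsyncIterator[TextDocument]:
        return self._iterate_files()

    async def read_files(self) -> list[TextDocument]:
        """Batch loading delegates to the async iterator."""
        return [doc async for doc in self]


async def load_rows(reader: InputReader) -> list[dict[str, str]]:
    # Rows like these could then be handed to a DataFrame constructor
    return [asdict(doc) async for doc in reader]
```

With this shape, a workflow can either stream with `async for doc in reader` or call `read_files()` when it needs the whole batch, and both paths share the same iteration code.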

* Format

* add memory profiling (microsoft#2227)

* add profiling

* add unit test for profiling

* fix property name

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
takanori-ugai added a commit to takanori-ugai/graphrag that referenced this pull request Feb 13, 2026
* initial version

* update version

* implementation of CommunityDetection

* add vector search

* improve community detection

* add Sample program

* fix the prompts

* update community summarization

* improvements

* Use structured AiServices

* small improvements

* Query (#2)

* small improvements

* add initial version of query part

* add advanced methods

* update prompts

* update

* add global and drift mode.

* drift mode and global mode improvement

* improve the query part

* getting closer

* small implement

* parameterized

* Sample query program

* small improvements

* add reading parquet.

* update

* improvement

* update

* CLI improvement

* default values

* add question generator

* getting closer

* get closer

* Add progress logging

* update based on review

* improvements based on review

* 📝 Add docstrings to `query`

Docstrings generation was requested by @takanori-ugai.

* #2 (comment)

The following files were modified:

* `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleIndexer.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleQueries.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/cli/GraphRagCli.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/CommunityReportWorkflow.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/EmbedWorkflow.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/ExtractGraphWorkflow.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/LocalVectorStore.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/PipelineTypes.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/RunPipeline.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/StateCodec.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/index/Workflows.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/logger/Progress.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/AdvancedQueryEngines.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicQueryEngine.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicSearchContextBuilder.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/CollectingQueryCallbacks.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/ContextRecords.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/DriftSearchEngine.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/GlobalSearchEngine.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/LocalSearchContextBuilder.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/NameUtils.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryCallbacks.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryConfigLoader.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryIndexLoader.kt`
* `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QuestionGen.kt`

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* improvement based on review

* fix some warnings

* rest of implementation

* add AGENTS.md

* Catch up (#4)

* Pin pandas (microsoft#2179)

* Release v2.7.1 (microsoft#2186)

* Release v2.7.1 (microsoft#2187)

* Update Python publish workflow for PyPI (microsoft#2188)

Debug publish workflow

* V3/main (microsoft#2190)

* Remove graph embedding and UMAP (microsoft#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (microsoft#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (microsoft#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (microsoft#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (microsoft#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (microsoft#2082)

* reduce schema fields (microsoft#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (microsoft#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (microsoft#2095)

* Sort deps alpha

* Remove multi search (microsoft#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (microsoft#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (microsoft#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (microsoft#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (microsoft#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (microsoft#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (microsoft#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (microsoft#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (microsoft#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (microsoft#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (microsoft#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (microsoft#2133)

* Init command asks for models (microsoft#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (microsoft#2127)

* Add graphrag-storage.

* Python update (3.13) (microsoft#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (microsoft#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (microsoft#2154)

* Issue microsoft#2004 fix (microsoft#2159)

* fix issue microsoft#2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing microsoft#2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161)

* fix issue microsoft#860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (microsoft#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (microsoft#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (microsoft#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (microsoft#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (microsoft#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (microsoft#2174)

* Update documentation for v3 release (microsoft#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (microsoft#2181)

* Migration update (microsoft#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>

* Release v3.0.0 (microsoft#2191)

* Fix deps (microsoft#2193)

* fix missing project urls

* fix missing deps.

* Release v3.0.1 (microsoft#2195)

* add TableProvider to enable future row-by-row streaming (microsoft#2189)

* write dataframe

* changed some workflows

* 1a

* add fixed files

* add versioning

* add patch and remove utility

* pr changes

* Python 3.13 (microsoft#2208)

* make graphrag-llm support 3.13

* Semver

---------

Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>

* update vector store example. (microsoft#2202)

* Table factory (microsoft#2214)

* Add table provider factory

* Semver

* Remove unnecessary response format check. (microsoft#2213)

- Fixes: microsoft#2203

* add csv table provider (microsoft#2215)

* add csv table provider

* add in provider

* add semver

* change list_tables to list()

* Add DataReader class for typed dataframe loading (microsoft#2220)

* Add DataReader class for typed dataframe loading

Introduce DataReader that wraps TableProvider and applies type coercion
functions when loading dataframes from weakly-typed formats (e.g. CSV).

- Add DataReader class with methods for each table type: entities,
  relationships, communities, community_reports, covariates, text_units,
  and documents
- Add typed loading functions in dfs.py for community_reports, covariates,
  text_units, and documents (entities, relationships, communities already
  existed)
- Integrate DataReader into all 17 indexing workflows replacing raw
  read_dataframe calls
- Integrate DataReader into CLI query's _resolve_output_files for typed
  loading across all search types (global, local, drift, basic)
- Export DataReader from data_model package __init__

* Fix column check

* Add notebook example support for each package (microsoft#2205)

* add notebook example support for each package

* add notebook example support for each package

* semversioner change

* feedback implemented for notebooks

* feedback implemented for notebooks

* feedback implemented for notebooks

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Streamline workflows (microsoft#2225)

* Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows

Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths.

* Remove overzealous input document assignment

* Semver

* Format

* Add async iterator support to InputReader and use in load workflows (microsoft#2226)

* Add async iterator support to InputReader and use in load workflows

InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction.
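
The pattern described here can be illustrated with a small standalone sketch. The `SketchReader` and `Doc` classes below are hypothetical stand-ins, not the real `InputReader` or `TextDocument`; only the method names `_iterate_files` and `read_files` and the use of `dataclasses.asdict` come from the commit message.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class Doc:
    """Hypothetical document record; the real model has other fields."""

    id: str
    text: str


class SketchReader:
    """Reader usable as `async for doc in reader`."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    async def _iterate_files(self) -> AsyncIterator[Doc]:
        # Core iteration logic: yield one document at a time.
        for name, text in self._files.items():
            await asyncio.sleep(0)  # stands in for real async file I/O
            yield Doc(id=name, text=text)

    def __aiter__(self) -> AsyncIterator[Doc]:
        # An async generator object is already an async iterator.
        return self._iterate_files()

    async def read_files(self) -> list[Doc]:
        # Batch loading delegates to the iterator.
        return [doc async for doc in self]


async def load_rows(reader: SketchReader) -> list[dict]:
    # Convert each dataclass to a dict, ready for DataFrame construction.
    return [asdict(doc) async for doc in reader]


rows = asyncio.run(load_rows(SketchReader({"a.txt": "alpha", "b.txt": "beta"})))
```

Because `read_files()` is built on the iterator, batch and streaming callers share one code path.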

* Format

* add memory profiling (microsoft#2227)

* add profiling

* add unit test for profiling

* fix property name

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>

* update with review

* update with review

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
takanori-ugai added a commit to takanori-ugai/graphrag that referenced this pull request Feb 13, 2026
* Pin pandas (microsoft#2179)

* Release v2.7.1 (microsoft#2186)

* Release v2.7.1 (microsoft#2187)

* Update Python publish workflow for PyPI (microsoft#2188)

Debug publish workflow

* V3/main (microsoft#2190)

* Remove graph embedding and UMAP (microsoft#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (microsoft#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (microsoft#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (microsoft#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (microsoft#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (microsoft#2082)

* reduce schema fields (microsoft#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (microsoft#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (microsoft#2095)

* Sort deps alpha

* Remove multi search (microsoft#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (microsoft#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (microsoft#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (microsoft#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas
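
A shared factory base class of the kind these commits migrate to might look roughly like the sketch below. It is not the `graphrag_factory` implementation: `SketchFactory`, its `register`/`create` signatures, and the `scope` parameter are assumptions made for illustration, loosely following the singleton/transient scopes mentioned in this PR's overview.

```python
from typing import Any, Callable, Generic, TypeVar

T = TypeVar("T")


class SketchFactory(Generic[T]):
    """Hypothetical registry mapping strategy names to initializers."""

    def __init__(self) -> None:
        self._initializers: dict[str, tuple[Callable[..., T], str]] = {}
        self._singletons: dict[str, T] = {}

    def register(
        self, strategy: str, initializer: Callable[..., T], scope: str = "transient"
    ) -> None:
        if scope not in ("transient", "singleton"):
            raise ValueError(f"Unknown scope: {scope}")
        self._initializers[strategy] = (initializer, scope)

    def create(self, strategy: str, **kwargs: Any) -> T:
        if strategy not in self._initializers:
            known = ", ".join(sorted(self._initializers))
            raise ValueError(
                f"Strategy '{strategy}' is not registered. Registered: {known}"
            )
        initializer, scope = self._initializers[strategy]
        if scope == "singleton":
            # Build once, then reuse the same instance on later calls.
            if strategy not in self._singletons:
                self._singletons[strategy] = initializer(**kwargs)
            return self._singletons[strategy]
        return initializer(**kwargs)  # transient: new instance every call


class MemoryCache:
    """Hypothetical service type used only to exercise the factory."""

    def __init__(self, name: str = "default") -> None:
        self.name = name


factory: SketchFactory[MemoryCache] = SketchFactory()
factory.register("memory", MemoryCache)
factory.register("shared", MemoryCache, scope="singleton")
```

Passing the class itself as the initializer, rather than wrapping it in a lambda, is enough when the constructor already accepts the keyword arguments.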

* Prefix vector store (microsoft#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (microsoft#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (microsoft#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (microsoft#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (microsoft#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (microsoft#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (microsoft#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (microsoft#2133)

* Init command asks for models (microsoft#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (microsoft#2127)

* Add graphrag-storage.

* Python update (3.13) (microsoft#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app
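
To make the difference between `>=` and the compatible release operator `~=` concrete, here is a small sketch. It handles only plain dotted numeric versions (no pre-releases, epochs, or local tags), and the helper names are hypothetical; real tooling should rely on a full PEP 440 implementation.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse a plain dotted numeric version such as '1.3.3'."""
    return tuple(int(part) for part in version.split("."))


def pad(release: tuple[int, ...], length: int) -> tuple[int, ...]:
    """Right-pad with zeros so that 1.3 and 1.3.0 compare equal."""
    return release + (0,) * (length - len(release))


def matches_compatible_release(candidate: str, spec: str) -> bool:
    """Approximate `~=spec`: at least spec, with the prefix before its last segment fixed."""
    base = parse_release(spec)
    if len(base) < 2:
        raise ValueError("~= requires at least two release segments")
    cand = parse_release(candidate)
    # Upper bound: drop the last segment and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    width = max(len(base), len(cand), len(upper))
    return pad(base, width) <= pad(cand, width) < pad(upper, width)


def matches_minimum(candidate: str, spec: str) -> bool:
    """Approximate `>=spec`: any version at or above spec."""
    cand, base = parse_release(candidate), parse_release(spec)
    width = max(len(cand), len(base))
    return pad(cand, width) >= pad(base, width)
```

Under this reading, `~=22.0` admits 22.1 but not 23.0, while `>=22.0` admits both, which is why the tighter operator gives more predictable lock files.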

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (microsoft#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (microsoft#2154)

* Issue microsoft#2004 fix (microsoft#2159)

* fix issue microsoft#2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing microsoft#2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161)

* fix issue microsoft#860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (microsoft#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (microsoft#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
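
Dot-notation access into nested dictionaries, as mentioned above, can be sketched like this. The function name `get_nested` and its `default` parameter are hypothetical; the actual `get_property` signature is not shown in this commit message.

```python
from typing import Any

_MISSING = object()  # sentinel so that None can be a legitimate default


def get_nested(data: dict[str, Any], path: str, default: Any = _MISSING) -> Any:
    """Walk `data` one key at a time along a dotted path like 'a.b.c'."""
    current: Any = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif default is _MISSING:
            raise KeyError(path)
        else:
            return default
    return current


record = {"meta": {"author": {"name": "Ada"}}, "title": "Notes"}
author = get_nested(record, "meta.author.name")
missing = get_nested(record, "meta.editor.name", default=None)
```

A structured reader could use such a helper to let a configured column name like `meta.author.name` point into nested JSON records.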

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (microsoft#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (microsoft#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (microsoft#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (microsoft#2174)

* Update documentation for v3 release (microsoft#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (microsoft#2181)

* Migration update (microsoft#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>

* Release v3.0.0 (microsoft#2191)

* Fix deps (microsoft#2193)

* fix missing project urls

* fix missing deps.

* Release v3.0.1 (microsoft#2195)

* add TableProvider to enable future row-by-row streaming (microsoft#2189)

* write dataframe

* changed some workflows

* 1a

* add fixed files

* add versioning

* add patch and remove utility

* pr changes

* Python 3.13 (microsoft#2208)

* make graphrag-llm support 3.13

* Semver

---------

Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>

* update vector store example. (microsoft#2202)

* Table factory (microsoft#2214)

* Add table provider factory

* Semver

* Remove unnecessary response format check. (microsoft#2213)

- Fixes: microsoft#2203

* add csv table provider (microsoft#2215)

* add csv table provider

* add in provider

* add semver

* change list_tables to list()

* Add DataReader class for typed dataframe loading (microsoft#2220)

* Add DataReader class for typed dataframe loading

Introduce DataReader that wraps TableProvider and applies type coercion
functions when loading dataframes from weakly-typed formats (e.g. CSV).

- Add DataReader class with methods for each table type: entities,
  relationships, communities, community_reports, covariates, text_units,
  and documents
- Add typed loading functions in dfs.py for community_reports, covariates,
  text_units, and documents (entities, relationships, communities already
  existed)
- Integrate DataReader into all 17 indexing workflows replacing raw
  read_dataframe calls
- Integrate DataReader into CLI query's _resolve_output_files for typed
  loading across all search types (global, local, drift, basic)
- Export DataReader from data_model package __init__

* Fix column check

* Add notebook example support for each package (microsoft#2205)

* add notebook example support for each package

* add notebook example support for each package

* semversioner change

* feedback implemented for notebooks

* feedback implemented for notebooks

* feedback implemented for notebooks

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Streamline workflows (microsoft#2225)

* Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows

Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths.

* Remove overzealous input document assignment

* Semver

* Format

* Add async iterator support to InputReader and use in load workflows (microsoft#2226)

* Add async iterator support to InputReader and use in load workflows

InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction.

* Format

* add memory profiling (microsoft#2227)

* add profiling

* add unit test for profiling

* fix property name

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: Deo <liangzhanzhao@metrodata.cn>
Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
JonasReuter pushed a commit to JonasReuter/graphrag that referenced this pull request Apr 13, 2026
* Remove graph embedding and UMAP (microsoft#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (microsoft#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (microsoft#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (microsoft#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (microsoft#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (microsoft#2082)

* reduce schema fields (microsoft#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (microsoft#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (microsoft#2095)

* Sort deps alpha

* Remove multi search (microsoft#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (microsoft#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (microsoft#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (microsoft#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (microsoft#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (microsoft#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (microsoft#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (microsoft#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (microsoft#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (microsoft#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (microsoft#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (microsoft#2133)

* Init command asks for models (microsoft#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (microsoft#2127)

* Add graphrag-storage.

* Python update (3.13) (microsoft#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (microsoft#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (microsoft#2154)

* Issue microsoft#2004 fix (microsoft#2159)

* fix issue microsoft#2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing microsoft#2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161)

* fix issue microsoft#860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (microsoft#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (microsoft#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (microsoft#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (microsoft#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (microsoft#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (microsoft#2174)

* Update documentation for v3 release (microsoft#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (microsoft#2181)

* Migration update (microsoft#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
