Description
Is your feature request related to a problem or challenge?
A Wikipedia article would be useful for Apache DataFusion to make the project easier to discover, easier to explain, and easier to cite from a neutral source.
The main benefit is not “marketing copy”; it is legitimacy and referenceability.
This matters even more now that Wikipedia is a core training corpus for LLMs and a common source for search engine results.
- It gives newcomers a neutral landing page distinct from https://datafusion.apache.org.
- It makes the project easier for journalists, analysts, conference organizers, students, and procurement people to cite quickly.
- It strengthens search visibility and entity recognition. In practice Wikipedia pages often feed search summaries, knowledge panels, mirrors, and LLM retrieval.
- It signals that the project is notable beyond its own community because the article must be supported by independent reliable sources.
- It gives a durable place to document ecosystem facts like history, governance, and adoption that do not fit cleanly into product docs.
Describe the solution you'd like
I would like a neutral Wikipedia page for Apache DataFusion.
Here are some similar pages:
- https://en.wikipedia.org/wiki/DuckDB
- https://en.wikipedia.org/wiki/Apache_Spark
- https://en.wikipedia.org/wiki/Polars_(software)
DuckDB’s page shows the pattern clearly: a short neutral definition, history, architecture, language bindings, commercial use, and foundation/governance in one place, with references to papers and third-party coverage.
Describe alternatives you've considered
I think a strong article will include many citations. Here are several I found with the help of Codex.
Some third-party citations that are probably useful for this article:
- DataFusion is a standalone Apache top-level project as of April 16, 2024, announced publicly by the Apache Arrow PMC and the ASF: Apache Arrow blog (https://arrow.apache.org/blog/2024/05/07/datafusion-tlp/), ASF announcement (https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion).
SIGMOD 2024 technical paper
- It appears in the SIGMOD 2024 program as an accepted industry-track paper: SIGMOD accepted papers (https://2024.sigmod.org/industrial-list.shtml), SIGMOD session listing (https://2024.sigmod.org/program_sigmod.shtml).
- The DOI is 10.1145/3626246.3653368 (https://dl.acm.org/doi/10.1145/3626246.3653368).
Citations for technical importance
- crates.io: 17,668,287 all-time downloads (https://crates.io/crates/datafusion)
- CRN: “The 10 Coolest Open-Source Software Tools Of 2024” (https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3)
  It explicitly includes Apache DataFusion, describes it as a fast extensible query engine, notes its Rust/Arrow basis, and mentions its 2024 top-level-project milestone. This is among the strongest sources for general notability.
- Datanami: “How the FDAP Stack Gives InfluxDB 3.0 Real-Time Speed, Efficiency” (https://www.datanami.com/2024/03/15/how-the-fdap-stack-gives-influxdb-3-0-real-time-speed-efficiency/)
  This quotes Paul Dix saying DataFusion had matured substantially and had best-in-class performance on a number of queries versus other columnar query engines. It is not a ranking article, but it is meaningful third-party validation of technical importance.
Third-party citations for usage in products
- SiliconANGLE: “Enterprise DB begins rolling AI features into PostgreSQL” (https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/)
  Independent coverage stating EDB combined Apache DataFusion, Arrow, and Delta Lake in its analytics/lakehouse capability.
- Spice AI: “How we use Apache DataFusion at Spice AI” (https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai)
  This says Spice uses DataFusion as its SQL query engine and extends it with custom TableProviders, optimizer rules, and UDFs for federated SQL workloads.
- Cloudflare Log Explorer GA announcement (https://blog.cloudflare.com/logexplorer-ga/) from June 10, 2025.
  Queriers fetch matching files from R2 and “process SQL queries using Apache DataFusion.”
- InfluxData: “Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0” (https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/)
  Clearly states InfluxDB 3.0 chose DataFusion as its query engine foundation and explains why.
- Pydantic Logfire issue: “We’re changing database” (pydantic/logfire#408)
  Usable as a primary source for adoption only. It says Logfire is moving from Timescale to a custom database built on DataFusion and gives reasons.
- Palantir Foundry announcements for July 2025 (https://www.palantir.com/docs/foundry/announcements/2025-07)
  This says lightweight pipelines are “powered by DataFusion.”
- Cube: “Query pushdown in Cube’s semantic layer” (https://cube.dev/blog/query-push-down-in-cubes-semantic-layer)
  Good third-party primary source for “used in production by Cube” and for describing how Cube uses DataFusion internally.
- Kamu: “100X faster ingestion, and FlightSQL support for connecting BI tools” (https://www.kamu.dev/blog/2023-09-datafusion-flightsql/)
  Good third-party primary source for ecosystem adoption. It explicitly says Kamu added support for Apache DataFusion and reports performance claims in its own product.
- LanceDB: “Columnar File Readers in Depth: APIs and Fusion” (https://lancedb.com/blog/columnar-file-readers-in-depth-apis-and-fusion/)
  Usable for ecosystem context. It says Lance uses DataFusion extensively and demonstrates integration with it.
- Bauplan Labs: “Duck Hunt: moving Bauplan from DuckDB to DataFusion” (https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion)
  Bauplan explains the migration as driven by DataFusion’s Arrow-first architecture, extensibility, and community-driven development.
Additional context
No response