An event-driven repository signal engine tracking "Mag 7" open-source velocity using Apache Spark, Kafka, and a dual-write Iceberg/ClickHouse architecture.
Project Description:
An event-driven data platform designed to capture, process, and visualize the heartbeat of open-source activity among the "Magnificent 7" (M7) companies and other key contributors.
Ultimately, this project seeks to surface up-to-date trends in the open-source ecosystem, with a focus on the repositories of key industry players and contributors. It features a scalable, event-driven micro-batch pipeline powered by Apache Spark and Kafka that ingests live repository data from the GitHub GraphQL API into a dual-write ClickHouse and Apache Iceberg storage layer. This architecture decouples ingestion from processing, enabling resilient, low-latency analytics on high-dimensional data.
- Executive Summary
- System Overview & Architecture
- Data Modeling & Storage Strategy
- Orchestration & Data Quality
- Key Engineering Challenges
- How to Run
- Lessons Learned & Roadmap
- Author
- Appendices
In an era where frameworks like PyTorch (Meta) and TensorFlow (Google) define the industry's technological direction, engineering leaders lack a single pane of glass to benchmark their open-source health against competitors. For engineers (and engineers-to-be), staying current with technology trends, tooling, and ways of working is equally crucial. Since major technology adoption correlates strongly with open-source projects and the contributions of their developer communities, this platform bridges that gap: it ingests GitHub repositories and events to provide sub-second insights into the development velocity, community engagement, and operational efficiency of those repositories.
The Goal: Provide analytical intelligence on the development velocity of "Magnificent 7" companies/key contributors by analyzing the pulse of their open-source repositories (Commits (WIP), PRs, Issues).
The Solution: A Kappa Architecture pipeline that ingests micro-batches via the GitHub GraphQL API every hour, processes them statefully with Spark, stores raw history in an Iceberg lakehouse, and serves low-latency analytics via ClickHouse.
Key Metric: Optimized Data-to-Decision Latency. While ingestion runs on a 1-hour schedule to respect API quotas, the downstream query performance is sub-second, enabling instant slicing and dicing of millions of data points.
Key Architectural Decision:
- Kappa Architecture: Kappa over Lambda simplifies code maintenance by using a single Spark Structured Streaming codebase for both the hot and cold paths.
- Dual-Write Pattern: Writing simultaneously to Iceberg (data lake) and ClickHouse (real-time OLAP) provides cheap, effectively unlimited storage for history alongside blazing-fast performance for dashboards.
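The dual-write pattern can be illustrated with a minimal sketch. This is not the project's actual writer: the sink classes, names, and write ordering below are illustrative stand-ins for the real Iceberg and ClickHouse sinks.

```python
# Minimal illustration of the dual-write pattern: every processed micro-batch
# is written to both a cheap archival sink (standing in for Iceberg) and a
# fast OLAP sink (standing in for ClickHouse). All names here are illustrative.

class InMemorySink:
    """Stand-in for a real sink (an Iceberg table or a ClickHouse table)."""
    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, batch):
        self.rows.extend(batch)

def dual_write(batch, lake_sink, olap_sink):
    """Write the same micro-batch to both storage layers.

    The lake write happens first in this sketch: if the append-only archive
    fails, the dashboards should not show data that history never recorded.
    """
    lake_sink.write(batch)
    olap_sink.write(batch)

iceberg = InMemorySink("iceberg")
clickhouse = InMemorySink("clickhouse")
dual_write([{"repo": "pytorch/pytorch", "stars": 80000}], iceberg, clickhouse)
```

In the real pipeline this shape typically lives inside a Spark `foreachBatch` callback, so both writes happen per micro-batch.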
| Component | Technology | Why it was chosen |
|---|---|---|
| Ingestion | GitHub GraphQL API | Prevents over-fetching compared to REST; allows precise retrieval of nested metrics (PRs, issues, stargazers, etc.) in single calls. |
| Data Bus | Apache Kafka | Decouples ingestion from processing and processing from writing; absorbs backpressure; enables services and functions to evolve independently without breakage. |
| Schema | Apache Avro | Enforces schema evolution and compatibility; serializes payloads in compact binary for higher Kafka throughput than JSON. |
| Processing | Spark Structured Streaming | A single unified engine for both streaming and batch workloads, eliminating the code duplication of traditional Lambda architectures. |
| Real-Time OLAP | Apache ClickHouse | Serves as the "Speed Layer," enabling sub-second aggregation queries over massive numbers of rows for live dashboards. |
| Data Lakehouse | Iceberg + MinIO + Nessie | Brings ACID transactions and versioning to object storage. Nessie allows "Git-like" branching of data for safe experimentation. |
| Serving | Trino | Acts as the query engine and federation layer across multiple storage backends. |
| Consumption | Superset | Delivers enterprise-grade visualizations and dashboards. |
The Data Flow:
- Ingestion Layer: A Python-based `ParallelScheduler` queries the GitHub GraphQL API, handles pagination and rate limits, and pushes raw events to Kafka (`github.base_repos`).
- Stream Processing: Apache Spark (Structured Streaming) reads from Kafka, deserializes Avro payloads (via Confluent Schema Registry), and performs stateful aggregation and processing.
- Data Lakehouse (Storage): Raw and processed data is committed to Apache Iceberg tables backed by MinIO (S3), managed by a Nessie catalog for Git-like data versioning.
- Serving Layer (Speed): A specialized consumer pushes aggregated metrics into ClickHouse for sub-second dashboard rendering.
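The rate-limit handling in the ingestion step can be sketched in plain Python. This is a hedged illustration, not the project's `ParallelScheduler`: the function names and the threshold value are assumptions, but the mechanism (read the remaining quota, pause until GitHub's `resetAt` timestamp) matches the adaptive throttling described under Key Engineering Challenges.

```python
import time
from datetime import datetime, timezone

def seconds_until_reset(reset_at: str, now=None) -> float:
    """How long to pause, given GitHub's ISO-8601 resetAt timestamp."""
    now = now or datetime.now(timezone.utc)
    reset = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    return max(0.0, (reset - now).total_seconds())

def throttle_if_needed(remaining: int, reset_at: str,
                       threshold: int = 50, sleep=time.sleep) -> bool:
    """Pause ingestion until the quota resets when the budget runs low.

    `remaining` and `reset_at` come from GitHub's rate-limit response fields;
    the threshold of 50 is an illustrative safety margin, not the project's.
    Returns True if we slept.
    """
    if remaining <= threshold:
        sleep(seconds_until_reset(reset_at))
        return True
    return False
```

Sleeping exactly until `resetAt` (rather than a fixed backoff) maximizes throughput while avoiding 403 responses.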
Schema Design: One Big Table (OBT)
- Instead of a traditional, highly normalized star schema (which requires expensive joins), I implemented a wide One Big Table (OBT) design optimized for columnar stores like ClickHouse.
Storage
- Hot Store (ClickHouse)
  - `github_processed_repos` (State Table):
    - Engine: `ReplacingMergeTree`
    - Purpose: Stores the current state of every repository (e.g., current star count, primary language).
    - Design Pattern: Acts as a slowly changing dimension. `ReplacingMergeTree` handles deduplication, replacing old rows with the latest data without complex queries.
    - Optimization: `LowCardinality(String)` for fields like `primary_language` and `owner_type` to reduce the storage footprint and improve scan speed.
  - `github_metrics_timeseries` (Fact Table):
    - Engine: `MergeTree` (append-only)
    - Purpose: A historical log that stores metrics over time.
    - Design Pattern: Immutable fact table. Allows analysts to answer "How fast did the star count grow last week?" by aggregating over the time-series data, partition by partition.
- Cold Store (Iceberg + Nessie, backed by MinIO, stored in Parquet format)
  - Catalog: Project Nessie for Git-like semantics
  - Partitioning: `days(processed_at)` to optimize predicate pushdown for time-range queries.
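The deduplication semantics that `ReplacingMergeTree` provides for the state table can be illustrated in plain Python: for each sorting key, only the row with the latest version survives a merge. This is a conceptual sketch (field names are illustrative); note that in real ClickHouse this collapse happens asynchronously at merge time, so exact queries use `FINAL` or `argMax`.

```python
def replacing_merge(rows, key="repo_id", version="processed_at"):
    """Mimic ReplacingMergeTree's merge-time behavior: for each sorting key,
    keep only the row with the highest version column. Field names are
    illustrative, not the project's actual schema."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"repo_id": 1, "stars": 100, "processed_at": "2024-01-01"},
    {"repo_id": 1, "stars": 120, "processed_at": "2024-01-02"},  # newer state wins
    {"repo_id": 2, "stars": 50,  "processed_at": "2024-01-01"},
]
```

This is what lets the state table act as a slowly changing dimension without `UPDATE` statements: new snapshots are simply inserted, and old versions disappear on merge.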
WIP
| Challenge | Engineering Solution Implemented |
|---|---|
| API Rate Limiting | Adaptive Throttling: The GraphQL client monitors X-RateLimit-Remaining headers. I implemented an exponential backoff strategy that pauses ingestion exactly until the resetAt timestamp provided by GitHub, maximizing throughput without hitting 403s. |
| Database Resilience | Circuit Breaker Pattern: Implemented a custom CircuitBreaker class in the ClickHouse writer. If the DB fails (e.g., connection timeout), the circuit "opens" (Fails Fast) to prevent cascading failures and resource exhaustion, transitioning to "Half-Open" to test recovery. |
| Concurrency Control | Distributed Locking: Used Redis to implement a distributed lock (github:pipeline:lock). This ensures that multiple instances of the scheduler (e.g., in Kubernetes) do not trigger duplicate ingestion jobs for the same organization simultaneously. |
| Schema Evolution | Avro & Schema Registry: Utilized the ABRiS library in Spark to integrate with Confluent Schema Registry. This strictly enforces schema compatibility, rejecting upstream API changes that would break downstream consumers. |
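The circuit-breaker state machine from the table above can be sketched as follows. This is a minimal illustration under assumptions: the thresholds, timings, and method names are invented for the example and may differ from the project's `CircuitBreaker` class.

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open fails fast;
    after a cooldown the breaker becomes Half-Open and allows a probe call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast: do not touch the struggling database at all.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Any success fully closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each ClickHouse write in `breaker.call(...)` is enough to stop a connection-timeout storm from exhausting writer resources.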
This project is fully containerized. Prerequisites: Docker Engine & 16GB+ RAM.
- Clone the repo.
- Configure the credentials and dependency files, and place them according to the project structure.
- Spin up the stack: `make up`
- Access and log in to the UI (Superset dashboard): http://localhost:8088
Lessons:
| Lesson | Description |
|---|---|
| GitHub GraphQL vs GitHub REST API | The GraphQL API is more flexible and efficient for complex, nested data retrieval such as repository metadata, resulting in more precise queries and lower rate-limit consumption. |
| Avro schema vs JSON for Kafka | Avro is generally the better choice for high throughput (compact binary, smaller payloads) and for robust schema enforcement and evolution without breakage, while JSON is preferred when human readability matters for quick debugging (e.g., dead-letter-queue topics). |
| Multithreading vs Rate Limit | Multithreading is used in the ingestion pipeline to speed up paginated, I/O-bound fetches. The trade-off is faster rate-limit consumption, so rate-limit monitoring drives a retry/sleep mechanism when the quota is close to exhausted. |
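The multithreading trade-off from the lessons table can be sketched with `concurrent.futures.ThreadPoolExecutor`. The fetch function and organization names below are illustrative stand-ins for the real GraphQL client; the point is that the pool size is the single knob trading ingestion speed against rate-limit burn.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_repo_page(org, page):
    """Stand-in for an I/O-bound GraphQL page fetch (illustrative only)."""
    return {"org": org, "page": page,
            "repos": [f"{org}/repo-{page}-{i}" for i in range(2)]}

def ingest(orgs, pages_per_org=3, max_workers=4):
    """Fetch paginated repo data for several orgs concurrently.

    More workers means faster ingestion of I/O-bound pages, but also faster
    quota consumption, so max_workers is chosen with the rate limit in mind.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_repo_page, org, p)
                   for org in orgs for p in range(pages_per_org)]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

In the real pipeline, each worker would also consult the rate-limit monitor before submitting the next page.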
Roadmap:
Phase 1.5:
- To implement observability and monitoring
Phase 2: Reliability & Hardening
- To implement data quality monitoring using Great Expectations.
- To implement more robust retry logic on ingestion, and optimise the multithread logic.
- To implement automated testing and chaos testing.
Phase 3: Infrastructure & Scale
- To use Terraform for IaC.
- To adopt a container orchestration tool with cron-job scheduling.
- To implement storage/retention of Kafka logs in MinIO using Kafka Connect.
- To implement ingestion of the events API for more advanced analytics.
- To implement Kafka Connect to write directly to ClickHouse and Iceberg.
Phase 4:
- To implement self service layer to customize the organization/repo to be tracked.
- To improve Pyspark processing for advanced aggregation and processing on topics.
- To implement organization-based queries instead of search queries for exhaustive ingestion of repos.
- To improve rate-limit usage and ingestion capability.
I am always looking to elevate my work, and I view every project as a stepping stone. I genuinely welcome constructive criticism and suggestions on my approach, code architecture, or documentation. Let's make this code better together: feel free to connect with your thoughts!
This project is open source and available under the MIT License.
Dashboard 1: Organization Performance & Portfolio Dashboard
- This dashboard provides a high-level view of a specific organization's open-source portfolio (in this instance, filtered to show Facebook). It focuses on the aggregate popularity, activity status, and technological composition of the organization's repositories.
Dashboard 2: Repository Health & Performance Dashboard
- This is a granular, "deep-dive" dashboard focused on a single repository. It moves away from portfolio aggregates to specific project health, maintenance velocity, and community engagement.
Dashboard 3: Competitive Intelligence and Trends
- This dashboard shifts focus from a single organization to the broader market. It identifies industry trends, dominant players using specific languages, and the most "viral" repositories across the entire ecosystem as overall and filtered by languages used.
Dashboard 4: Comparative Analysis Dashboard