An event-driven repository signal engine tracking "Mag 7" open-source velocity using Apache Spark, Kafka, and a dual-write Iceberg/ClickHouse architecture.
Project Description:
An event-driven data platform designed to capture, process, and visualize the heartbeat of open-source activity among the "Magnificent 7" (M7) companies and other key contributors.
Ultimately, this project seeks to surface up-to-date trends in the open-source ecosystem, with a focus on the repositories of key industry players and contributors. It features a scalable, event-driven micro-batch pipeline powered by Apache Spark and Kafka that ingests live repository data from the GitHub GraphQL API into a dual-write ClickHouse and Apache Iceberg storage layer. This architecture decouples ingestion from processing, enabling resilient, low-latency analytics on high-dimensional data.
- Executive Summary
- System Overview & Architecture
- Data Modeling & Storage Strategy
- Orchestration & Data Quality
- Key Engineering Challenges
- How to Run
- Lessons Learned & Roadmap
- Author
- Appendices
In an era where frameworks like PyTorch (Meta) and TensorFlow (Google) define the industry's technological direction, engineering leaders lack a single pane of glass to benchmark their open-source health against competitors. For engineers (and engineers-to-be), staying current with technology trends, tooling, and ways of working is equally crucial. Since major technology adoption correlates strongly with open-source projects and the contributions of their developer communities, this platform bridges that gap: it ingests GitHub repositories and events to provide sub-second insights into the development velocity, community engagement, and operational efficiency of those repositories.
The Goal: Provide analytical intelligence on the development velocity of "Magnificent 7" companies/key contributors by analyzing the pulse of their open-source repositories (Commits (WIP), PRs, Issues).
The Solution: A Kappa Architecture pipeline that ingests micro-batches via the GitHub GraphQL API every hour, processes them statefully with Spark, stores raw history in an Iceberg lakehouse, and serves low-latency analytics via ClickHouse.
Key Metric: Optimized Data-to-Decision Latency. While ingestion runs on a 1-hour schedule to respect API quotas, the downstream query performance is sub-second, enabling instant slicing and dicing of millions of data points.
Key Architectural Decision:
- Kappa Architecture: Kappa over Lambda simplifies code maintenance by using a single Spark Structured Streaming codebase for both the hot and cold paths.
- Dual-Write Pattern: Writing simultaneously to Iceberg (data lake) and ClickHouse (real-time OLAP) provides cheap, effectively unlimited storage for history alongside blazing-fast performance for dashboards.
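The dual-write pattern can be illustrated with a minimal sketch. This is not the project's actual writer: the sink classes, names, and write ordering below are illustrative stand-ins for the real Iceberg and ClickHouse sinks.

```python
# Minimal illustration of the dual-write pattern: every processed micro-batch
# is written to both a cheap archival sink (standing in for Iceberg) and a
# fast OLAP sink (standing in for ClickHouse). All names here are illustrative.

class InMemorySink:
    """Stand-in for a real sink (an Iceberg table or a ClickHouse table)."""
    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, batch):
        self.rows.extend(batch)

def dual_write(batch, lake_sink, olap_sink):
    """Write the same micro-batch to both storage layers.

    The lake write happens first in this sketch: if the append-only archive
    fails, the dashboards should not show data that history never recorded.
    """
    lake_sink.write(batch)
    olap_sink.write(batch)

iceberg = InMemorySink("iceberg")
clickhouse = InMemorySink("clickhouse")
dual_write([{"repo": "pytorch/pytorch", "stars": 80000}], iceberg, clickhouse)
```

In the real pipeline this shape typically lives inside a Spark `foreachBatch` callback, so both writes happen per micro-batch.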
| Component | Technology | Why it was chosen |
|---|---|---|
| Ingestion | GitHub GraphQL API | Prevents over-fetching compared to REST; allows precise retrieval of nested metrics (PRs, issues, stargazers, etc.) in single calls. |
| Data Bus | Apache Kafka | Decouples ingestion from processing and processing from writing; absorbs backpressure; enables services and functions to evolve independently without breakage. |
| Schema | Apache Avro | Enforces schema evolution and compatibility; serializes payloads in compact binary for higher Kafka throughput than JSON. |
| Processing | Spark Structured Streaming | A single unified engine for both streaming and batch workloads, eliminating the code duplication of traditional Lambda architectures. |
| Real-Time OLAP | Apache ClickHouse | Serves as the "Speed Layer," enabling sub-second aggregation queries over massive numbers of rows for live dashboards. |
| Data Lakehouse | Iceberg + MinIO + Nessie | Brings ACID transactions and versioning to object storage. Nessie allows "Git-like" branching of data for safe experimentation. |
| Serving | Trino | Acts as the query engine and federation layer across multiple storage backends. |
| Consumption | Superset | Delivers enterprise-grade visualizations and dashboards. |
The Data Flow:
- Ingestion Layer: A Python-based `ParallelScheduler` queries the GitHub GraphQL API, handles pagination and rate limits, and pushes raw events to Kafka (`github.base_repos`).
- Stream Processing: Apache Spark (Structured Streaming) reads from Kafka, deserializes Avro payloads (via Confluent Schema Registry), and performs stateful aggregation and processing.
- Data Lakehouse (Storage): Raw and processed data is committed to Apache Iceberg tables backed by MinIO (S3), managed by a Nessie catalog for Git-like data versioning.
- Serving Layer (Speed): A specialized consumer pushes aggregated metrics into ClickHouse for sub-second dashboard rendering.
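The rate-limit handling in the ingestion step can be sketched in plain Python. This is a hedged illustration, not the project's `ParallelScheduler`: the function names and the threshold value are assumptions, but the mechanism (read the remaining quota, pause until GitHub's `resetAt` timestamp) matches the adaptive throttling described under Key Engineering Challenges.

```python
import time
from datetime import datetime, timezone

def seconds_until_reset(reset_at: str, now=None) -> float:
    """How long to pause, given GitHub's ISO-8601 resetAt timestamp."""
    now = now or datetime.now(timezone.utc)
    reset = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    return max(0.0, (reset - now).total_seconds())

def throttle_if_needed(remaining: int, reset_at: str,
                       threshold: int = 50, sleep=time.sleep) -> bool:
    """Pause ingestion until the quota resets when the budget runs low.

    `remaining` and `reset_at` come from GitHub's rate-limit response fields;
    the threshold of 50 is an illustrative safety margin, not the project's.
    Returns True if we slept.
    """
    if remaining <= threshold:
        sleep(seconds_until_reset(reset_at))
        return True
    return False
```

Sleeping exactly until `resetAt` (rather than a fixed backoff) maximizes throughput while avoiding 403 responses.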
Schema Design: One Big Table (OBT)
- Instead of a traditional, highly normalized star schema (which requires expensive joins), I implemented a wide One Big Table (OBT) design optimized for columnar stores like ClickHouse.
Storage
- Hot Store (ClickHouse)
  - `github_processed_repos` (State Table):
    - Engine: `ReplacingMergeTree`
    - Purpose: Stores the current state of every repository (e.g., current star count, primary language).
    - Design Pattern: Acts as a slowly changing dimension. `ReplacingMergeTree` handles deduplication, replacing old rows with the latest data without complex queries.
    - Optimization: `LowCardinality(String)` for fields like `primary_language` and `owner_type` to reduce the storage footprint and improve scan speed.
  - `github_metrics_timeseries` (Fact Table):
    - Engine: `MergeTree` (append-only)
    - Purpose: A historical log that stores metrics over time.
    - Design Pattern: Immutable fact table. Allows analysts to answer "How fast did the star count grow last week?" by aggregating over the time-series data, partition by partition.
- Cold Store (Iceberg + Nessie, backed by MinIO, stored in Parquet format)
  - Catalog: Project Nessie for Git-like semantics
  - Partitioning: `days(processed_at)` to optimize predicate pushdown for time-range queries.
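The deduplication semantics that `ReplacingMergeTree` provides for the state table can be illustrated in plain Python: for each sorting key, only the row with the latest version survives a merge. This is a conceptual sketch (field names are illustrative); note that in real ClickHouse this collapse happens asynchronously at merge time, so exact queries use `FINAL` or `argMax`.

```python
def replacing_merge(rows, key="repo_id", version="processed_at"):
    """Mimic ReplacingMergeTree's merge-time behavior: for each sorting key,
    keep only the row with the highest version column. Field names are
    illustrative, not the project's actual schema."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"repo_id": 1, "stars": 100, "processed_at": "2024-01-01"},
    {"repo_id": 1, "stars": 120, "processed_at": "2024-01-02"},  # newer state wins
    {"repo_id": 2, "stars": 50,  "processed_at": "2024-01-01"},
]
```

This is what lets the state table act as a slowly changing dimension without `UPDATE` statements: new snapshots are simply inserted, and old versions disappear on merge.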
WIP
| Challenge | Engineering Solution Implemented |
|---|---|
| API Rate Limiting | Adaptive Throttling: The GraphQL client monitors X-RateLimit-Remaining headers. I implemented an exponential backoff strategy that pauses ingestion exactly until the resetAt timestamp provided by GitHub, maximizing throughput without hitting 403s. |
| Database Resilience | Circuit Breaker Pattern: Implemented a custom CircuitBreaker class in the ClickHouse writer. If the DB fails (e.g., connection timeout), the circuit "opens" (Fails Fast) to prevent cascading failures and resource exhaustion, transitioning to "Half-Open" to test recovery. |
| Concurrency Control | Distributed Locking: Used Redis to implement a distributed lock (github:pipeline:lock). This ensures that multiple instances of the scheduler (e.g., in Kubernetes) do not trigger duplicate ingestion jobs for the same organization simultaneously. |
| Schema Evolution | Avro & Schema Registry: Utilized the ABRiS library in Spark to integrate with Confluent Schema Registry. This strictly enforces schema compatibility, rejecting upstream API changes that would break downstream consumers. |
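The circuit-breaker state machine from the table above can be sketched as follows. This is a minimal illustration under assumptions: the thresholds, timings, and method names are invented for the example and may differ from the project's `CircuitBreaker` class.

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open fails fast;
    after a cooldown the breaker becomes Half-Open and allows a probe call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast: do not touch the struggling database at all.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Any success fully closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each ClickHouse write in `breaker.call(...)` is enough to stop a connection-timeout storm from exhausting writer resources.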
This project is fully containerized. Prerequisites: Docker Engine & 16GB+ RAM.
- Clone the repo.
- Configure the credentials and dependency files, and place them according to the project structure.
- Spin up the stack: `make up`
- Access and log in to the UI (Superset dashboard): http://localhost:8088
Lessons:
| Lesson | Description |
|---|---|
| GitHub GraphQL vs GitHub REST API | The GraphQL API is more flexible and efficient for complex, nested data retrieval such as repository metadata, resulting in more precise queries and lower rate-limit consumption. |
| Avro schema vs JSON for Kafka | Avro is generally the better choice for high throughput (compact binary, smaller payloads) and for robust schema enforcement and evolution without breakage, while JSON is preferred when human readability matters for quick debugging (e.g., dead-letter-queue topics). |
| Multithreading vs Rate Limit | Multithreading is used in the ingestion pipeline to speed up paginated, I/O-bound fetches. The trade-off is faster rate-limit consumption, so rate-limit monitoring drives a retry/sleep mechanism when the quota is close to exhausted. |
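The multithreading trade-off from the lessons table can be sketched with `concurrent.futures.ThreadPoolExecutor`. The fetch function and organization names below are illustrative stand-ins for the real GraphQL client; the point is that the pool size is the single knob trading ingestion speed against rate-limit burn.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_repo_page(org, page):
    """Stand-in for an I/O-bound GraphQL page fetch (illustrative only)."""
    return {"org": org, "page": page,
            "repos": [f"{org}/repo-{page}-{i}" for i in range(2)]}

def ingest(orgs, pages_per_org=3, max_workers=4):
    """Fetch paginated repo data for several orgs concurrently.

    More workers means faster ingestion of I/O-bound pages, but also faster
    quota consumption, so max_workers is chosen with the rate limit in mind.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_repo_page, org, p)
                   for org in orgs for p in range(pages_per_org)]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

In the real pipeline, each worker would also consult the rate-limit monitor before submitting the next page.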
Roadmap:
Phase 1.5:
- To implement observability and monitoring
Phase 2: Reliability & Hardening
- To implement data quality monitoring using Great Expectations.
- To implement more robust retry logic on ingestion, and optimise the multithread logic.
- To implement automated testing and chaos testing.
Phase 3: Infrastructure & Scale
- To use Terraform for IaC.
- To adopt a container orchestration tool with cron-job scheduling.
- To implement storage/retention of Kafka logs in MinIO using Kafka Connect.
- To implement ingestion of the events API for more advanced analytics.
- To implement Kafka Connect to write directly to ClickHouse and Iceberg.
Phase 4:
- To implement self service layer to customize the organization/repo to be tracked.
- To improve Pyspark processing for advanced aggregation and processing on topics.
- To implement organization-based queries instead of search queries for exhaustive ingestion of repos.
- To improve rate-limit usage and ingestion capability.
I am always looking to elevate my work, and I view every project as a stepping stone. I genuinely welcome constructive criticism and suggestions on my approach, code architecture, or documentation. Let's make this code better together: feel free to connect with your thoughts!
This project is open source and available under the MIT License.
Dashboard 1: Organization Performance & Portfolio Dashboard
- This dashboard provides a high-level view of a specific organization's open-source portfolio (in this instance, filtered to show Facebook). It focuses on the aggregate popularity, activity status, and technological composition of the organization's repositories.
Dashboard 2: Repository Health & Performance Dashboard
- This is a granular, "deep-dive" dashboard focused on a single repository. It moves away from portfolio aggregates to specific project health, maintenance velocity, and community engagement.
Dashboard 3: Competitive Intelligence and Trends
- This dashboard shifts focus from a single organization to the broader market. It identifies industry trends, dominant players using specific languages, and the most "viral" repositories across the entire ecosystem as overall and filtered by languages used.
Dashboard 4: Comparative Analysis Dashboard