Status: Proposal / Architecture Draft
Author: Alexandre Ganea
Updated: 2026-02-22
Table of Contents
- 1. Executive Summary
- 2. Design Principles
- 2b. Why a New Compiler Instead of Modifying LLVM/Clang
- 3. High-Level Architecture
- 4. Component Design
- 4.1 The Distributed Content-Addressed Store
- 4.2 The Memoizer
- 4.3 Lexer & Preprocessor
- 4.4 Parser
- 4.5 Semantic Analysis (Modular Design)
- 4.6a consteval/constexpr via In-Process JIT
- 4.6b ForgeIR — Intermediate Representation
- 4.7 x86-64 Code Generation
- 4.8 Runtime Loader / JIT Execution
- 4.9 Linker
- 4.10 Unreal Build Tool (UBT) Integration
- 4.11 Daemon Architecture
- 4.12 Distributed P2P Network
- 4.13 Unified Protocol Design
- 5. Crate Structure
- 6. Key Data Flows
- 7. Build System Integration
- 8. Phased Implementation Plan
- 9. Risk Assessment
- 10. Key Dependencies (Rust Crates)
- 11. Lines-of-Code Estimates
- 12. Parallelism Strategy (Detailed)
- 13. Standard Library Strategy
- 14. C++20 Modules Decision
- 15. UE5 Codebase Analysis Findings
- 16. Deterministic Compilation (Required)
- 17. JIT Runtime Details
- 18. Live PGO Injection (PoC Feature)
- 19. A VM for C++?
- 20. Future Work (Beyond the PoC)
- 21. N4950 Cross-Reference (C++23 Standard Draft)
- 22. Success Criteria
forgecc is an architecture proposal for a C++ compiler designed from scratch around memoization, daemon persistence, and JIT execution. It takes the form of an incremental C++20 compiler and linker, written in Rust, running as a persistent daemon. Its primary goal is to dramatically reduce build times for large C++ codebases (target: Unreal Engine 5+ / Chromium) by:
- Eliminating redundant work through fine-grained memoization at every compilation stage
- Persisting compilation state across builds in a content-addressed distributed database
- Distributing compilation knowledge across machines via a peer-to-peer synchronized index
- Unifying compilation and linking in a single in-process pipeline
- Generating debug information on demand — no PDB or DWARF files during development; the daemon retains all compilation metadata (types, source maps, frame layouts) and serves debug info live through a built-in DAP server
- Runtime patching / JIT execution — the daemon acts as an OS loader, building the target application in-memory progressively, compiling functions on demand
forgecc is, in essence, a native JIT runtime for C++ — a daemon that compiles, loads, patches, and debugs C++ code entirely in-memory, without producing intermediate files. It is informed by LLVM/Clang's architecture but is a clean implementation, not a fork. It targets a minimal but sufficient feature set: C++20 (and C), x86-64, Windows only (initially). COFF/CodeView/PDB are supported for release builds only; during development, the daemon's JIT loader and DAP debugger eliminate the need for object files or debug info serialization.
Notable precedents:
- CERN Cling (2011–present): Interactive C++ interpreter/JIT built on LLVM/Clang. Used in production at CERN for high-energy physics (1 EB of data analyzed, 1000+ scientific publications). Demonstrates that C++ JIT is viable at scale, though Cling is an interpreter with JIT, not a full compiler daemon. https://root.cern/cling/
- Zapcc (2018): Caching C++ compiler daemon based on Clang. Keeps parsed headers in memory across compilations. Achieved 10–50x speedups on template-heavy code. Closest architectural precedent to forgecc's daemon model, though it doesn't do JIT execution or runtime patching. https://github.com/yrnkrn/zapcc
- Live++ (commercial): Binary hot-reload tool for C++. Patches machine code of running executables in-place. Used by 100+ companies including many UE5 studios. Demonstrates that runtime patching of C++ is production-viable on Windows. https://liveplusplus.tech/
- Runtime Compiled C++ (RCC++): Open-source runtime C++ recompilation for game development. Recompiles changed code and hot-swaps it into a running process. https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus
- Circle (Sean Baxter): Single-developer C++ compiler with novel language extensions. Demonstrates that a single person/small team can build a working C++ compiler targeting real-world code.
- SN-Systems Program Repository (Sony, 2018–2022): LLVM-based research project that replaced object files with a content-addressed database (pstore). Proved that storing compilation results in a database with minimal stamp files for build system compatibility works at scale. https://github.com/SNSystems/llvm-project-prepo/wiki
forgecc differs from all of these by unifying compilation, linking, loading, debugging, and distributed caching into a single daemon process.
- No 32-bit support, no ARM64
- No ELF/Mach-O/DWARF (architecture allows future extension)
- No optimization passes beyond trivial ones (dead code elimination, constant folding)
- No standalone compiler — `forge-cc.exe` and the forge tool family are thin RPC clients (~800 lines total) that forward to the daemon; all real work happens in the daemon (see §7.3)
- No Objective-C/OpenCL/CUDA
- No cross-compilation
- No LLVM IR for the PoC — forgecc defines its own IR for determinism and memoization control; LLVM IR lowering deferred to release-mode optimization path (see §4.6b)
- No PCH — the memoization/caching system replaces PCH transparently
- No C++20 modules for the PoC (see §14 and §15.14)
- No C++20 coroutines for the PoC (see §15.14)
- Selective language/library/extension scope: low-usage features outside the UE5 target subset are deferred (details in §15.14 and §21.2)
- Resource compilation is supported natively in the PoC via `forge-rc.exe` and `/v1/rc` (MSVC/LLVM-compatible command-line surface)
- No HLSL shader compilation — shaders are compiled by UE5's ShaderCompileWorker, a completely separate pipeline outside forgecc's scope
- No memory budget optimization — this is a PoC; memory profiling and optimization come later
- Minimal code, maximum leverage: Every component is as small as possible. Prefer simplicity over generality. forgecc does not target every edge case in the C++ standard — it targets the subset that UE5/Game uses.
- Content-addressed everything: Every intermediate artifact (tokens, AST nodes, types, IR, machine code sections) is identified by a hash of its inputs. If the inputs haven't changed, the output is reused. The indexing model uses Merkle-style content trees with copy-on-write updates, structural diffing, and layered indexers for efficient deduplication and change detection.
- No files, only database records: Traditional compilers produce .obj files that the linker reads. forgecc skips this — the compiler writes sections directly into a distributed content-addressed store, and the linker reads from the same store. No serialization/deserialization of object files.
- Always-on, always-warm: The daemon keeps hot data in memory. A full rebuild after a single-line change takes milliseconds, not seconds.
- Distributed by default (IPFS-style): Multiple daemon instances form a peer-to-peer network. Storage is distributed across machines. No central server, no centralized storage. Machines discover each other via seeds.
- No user-managed complexity: Users do not manage PCH, unity builds, include-what-you-use, or module maps. The compiler handles caching, deduplication, and incremental compilation transparently.
- Parallel by design: Every component must support parallelism. Unlike LLVM/Clang, which is single-threaded per TU, forgecc parallelizes within a single TU where possible and across TUs by default.
- Runtime-first: The primary execution model is runtime patching / JIT. The daemon loads and patches the target application in-memory. Generating a standalone .EXE/.DLL is a secondary "release" mode.
The natural first question is: why not build forgecc's features on top of LLVM/Clang? Clang is an excellent, mature C++ compiler — arguably the best open-source C++ front-end ever built. This document respects the LLVM community and the decades of work that went into it. This project contributes ideas upstream where alignment exists. This section explains why, regrettably, the specific combination of features forgecc needs does not fit within LLVM/Clang's current architecture without changes so deep that they amount to a rewrite of core subsystems.
Prior art: projects that extended LLVM/Clang
Several teams have explored variations of this idea. Their experiences are instructive — not as failures, but as evidence of where the architectural boundaries lie:
| Project | Approach | What happened |
|---|---|---|
| Zapcc (2018) | Forked Clang; kept AST/Sema in memory across TUs as a daemon | Achieved impressive 10–50x speedups, validating the daemon concept. Unfortunately, maintaining a Clang fork proved unsustainable — Zapcc fell behind upstream within ~2 years. The fork touched ~200 files across Clang's internals, making rebasing prohibitively expensive. |
| Cling (CERN, 2011–present) | Built on top of Clang/LLVM as a library; interactive C++ JIT | A remarkable success story — used in production at CERN. However, the team reports that every LLVM major version upgrade requires significant porting effort (~62 files changed, ~1,300 insertions against Clang-9). The CaaS (Compiler-as-a-Service) project is working to upstream these patches, which speaks to how much effort it takes to maintain them out-of-tree. |
| SN-Systems pstore (Sony, 2018–2022) | Modified LLVM's output layer to write to a content-addressed database instead of .obj files | Proved that content-addressed compilation artifacts work at scale — a key validation for forgecc's approach. The project was eventually discontinued; the modifications were difficult to maintain alongside upstream LLVM evolution. |
| Clangd / rust-analyzer | Built as separate tools using compiler infrastructure as libraries | Clangd works well within Clang's constraints but inherits its single-threaded, file-at-a-time model. Notably, rust-analyzer chose to reimplement parsing and type-checking independently of rustc, specifically to achieve the incrementality that rustc's architecture could not easily provide. |
These projects demonstrate that LLVM/Clang is a powerful foundation, but that certain architectural properties — persistent daemon state, fine-grained memoization, intra-TU parallelism — require changes deep enough that maintaining them as patches or forks becomes the dominant engineering cost.
What forgecc's feature set would require inside LLVM/Clang:
- Content-addressed memoization at every stage — This is the core challenge. Clang's AST nodes are allocated in a `BumpPtrAllocator` with pointer identity; they are not content-hashable by design. To add content-addressing, forgecc needs either (a) serialization of every AST node to compute a hash (which adds overhead that can negate the memoization benefit) or (b) rework of Clang's AST allocation model to support structural hashing. Option (b) touches a very large number of files across `clang/AST/` and `clang/Sema/`. Content-addressable AST nodes in Clang are valuable, but the scope of that change is beyond what a single project can reasonably propose.
- Daemon mode with persistent state — Clang's `CompilerInstance` is designed around a create-use-destroy lifecycle for each TU. Global state (`IdentifierTable`, `SourceManager`, `Preprocessor`) is not designed to be reused across compilations. Zapcc demonstrated that this can be made to work, but the cleanup logic required careful handling of Clang internals that change with each upstream release. The CaaS project at CERN is making progress on upstreaming incremental compilation support (`IncrementalAction`, persistent `CompilerInstance`), which is encouraging — but even their scope is narrower than what forgecc needs (they focus on interactive REPL, not full build-system integration with cross-build caching).
- Fine-grained parallelism within a TU — Clang's Sema is single-threaded per TU, which is a reasonable design choice for a traditional compiler. Sema uses mutable state extensively (`Sema::CurContext`, `Sema::ExprEvalContexts`, delayed diagnostics). Making Sema thread-safe would require reworking most of `lib/Sema/` (~180,000 lines) — a change so large that it would effectively be a new Sema. This is not a criticism of Clang's design; single-threaded Sema is the right trade-off for a general-purpose compiler. forgecc's different trade-off (daemon with warm caches) makes intra-TU parallelism more valuable.
- Deterministic output for memoization — Clang emits LLVM IR, which has sources of non-determinism described in §4.6b (sequential metadata IDs, emission-order-dependent value numbering). These are not bugs — they are consequences of LLVM IR's design, which optimizes for compilation speed rather than output reproducibility. Fixing this requires canonicalization passes in `llvm/lib/IR/` and `llvm/lib/Bitcode/`, touching ~50+ files. Some of these changes are upstreamable candidates (deterministic output is valuable for build reproducibility in general), and this project contributes them when community alignment exists.
- JIT execution with runtime patching — This is where LLVM is closest to forgecc's vision. ORC JIT already provides impressive infrastructure: COFF/x86-64 JIT linking, SEH frame support, function redirection via atomically-rewritable stubs, a re-optimization layer, out-of-process execution, and lazy compilation. forgecc's JIT concept is directly informed by ORC's design. The remaining gaps are the DAP debug server, native Windows TLS (ORC currently uses emulated TLS), build-system daemon integration, and — most importantly — content-addressed memoization. The JIT in forgecc is tightly coupled to the memoization system: every JIT'd function must be content-addressed, cached, and distributable. Integrating this into ORC requires either feeding it pre-built .obj files (reducing ORC to just a linker) or modifying ORC's internals to work with the forgecc store. Instead, forgecc builds a simpler, purpose-built runtime loader (~3,000 lines) that consumes sections directly from the content-addressed store, using ORC as a reference implementation for the hard parts (relocation application, SEH registration, TLS setup).
- Distributed P2P caching — This is entirely new functionality that does not exist in LLVM. It is a pure addition, but it needs deep integration with the content-addressed store (point 1), which itself requires the architectural changes described above.
Estimated scope of LLVM/Clang modifications:
| Change | Estimated files touched | Complexity |
|---|---|---|
| Content-hashable AST | ~500+ files in `clang/AST/`, `clang/Sema/` | Very high — fundamental data structure redesign |
| Daemon mode (persistent state) | ~100+ files (Zapcc touched ~200) | High — CaaS project is making progress but scope is narrower |
| Intra-TU parallelism | ~300+ files in `clang/Sema/` | Very high — Sema's threading model is deeply embedded |
| LLVM IR determinism fixes | ~50+ files in `llvm/lib/IR/`, `llvm/lib/Bitcode/` | Medium — potentially upstreamable; benefits reproducible builds in general |
| JIT + runtime patching | ~50 new files | Low — ORC provides most primitives; integration with content store is the gap |
| P2P distributed store | ~30 new files | Low — purely additive |
| Total | ~1,000+ files modified in a ~3M LoC codebase | |
The fork maintenance challenge: LLVM/Clang receives ~1,000 commits per week. A fork touching 1,000+ files diverges from upstream within months, making it increasingly difficult to pull in bug fixes, new platform support, and C++ standard updates. Zapcc's experience illustrates this — it was a well-executed fork by an experienced team, and it still became impractical to maintain.
The upstreaming challenge: This project prefers to contribute these features to LLVM. However, the changes forgecc needs are architectural (AST allocation model, Sema threading, IR determinism), not incremental. Proposing changes of this scope to Clang's core data structures is a multi-year effort requiring broad community consensus — and rightly so, since these changes affect every Clang user. The CaaS project's experience is informative: even with CERN's backing and a relatively modest patch set (~62 files), upstreaming incremental compilation support has been a years-long effort. forgecc's changes are an order of magnitude larger.
What forgecc gains from a fresh start:
| Advantage | Detail |
|---|---|
| Pure Rust, no C++ dependency | Eliminates the LLVM build (10–30 min), C++ FFI complexity, and mixed-language debugging. The entire compiler is one cargo build. |
| Designed for memoization from day one | Every data structure is content-hashable by construction. No retrofitting. |
| Designed for parallelism from day one | Rust's ownership model enforces thread safety at compile time. No global mutable state in Sema. |
| Minimal code | ~110K LoC for the full compiler vs. ~3M LoC for LLVM/Clang. Easier to understand, debug, and modify. |
| No upstream dependency | No version tracking, no merge conflicts, no waiting for upstream reviews. |
| Subset targeting | Only implement the C++ features UE5 actually uses. Clang must support everything — a much harder problem. |
forgecc trade-offs and mitigations:
| Cost | Mitigation |
|---|---|
| 20+ years of Clang bug fixes | forge-verify (§8b) continuously compares forgecc's output against Clang, catching corner-case divergences early. |
| Optimization pipeline | The hybrid path (§4.6b) lowers ForgeIR → LLVM IR for release builds, reusing LLVM's full -O2/-O3 pipeline. |
| Platform breadth | forgecc targets one platform (x86-64 Windows) for the PoC. The architecture supports future extension. |
| Community and ecosystem | Clang has hundreds of contributors, sanitizers, static analyzers, clang-tidy, clang-format. forgecc starts without these, but the content-addressed store and daemon architecture enable new tooling patterns. |
| Standard compliance | forgecc implements the C++ subset that UE5 uses, see §21. |
A hybrid path preserves the relationship with LLVM: The design includes
ForgeIR → LLVM IR lowering for release builds (§4.6b). On this path, forgecc
does not replace LLVM and instead uses it as a backend for
optimized builds — similar to how Swift (SIL → LLVM IR) and Rust (MIR → LLVM IR)
leverage LLVM today. In that scenario, the clean break is only in the
front-end and the development-mode pipeline; improvements to LLVM's codegen or
optimization passes benefit forgecc's release builds automatically.
UBT / ninja
VS Code (DAP)
Peer daemons
│
HTTP (msgpack/json)
│
┌──────────────────────────────────────────────────────────────▼───────┐
│ forgecc daemon │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ HTTP Server (axum) — unified endpoint :9473 │ │
│ │ /v1/compile, /v1/artifact/*, /v1/peers, /v1/debug/* │ │
│ └──────────┬──────────────┬──────────────────┬─────────────────┘ │
│ │ │ │ │
│ ┌──────────▼──┐ ┌───────▼────────┐ ┌──────▼───────────────┐ │
│ │ Build │ │ P2P Network │ │ Content-Addressed │ │
│ │ Scheduler │──│ (seeds, │──│ Distributed Store │ │
│ │ │ │ IPFS-like) │ │ (local + remote) │ │
│ └──────┬──────┘ └────────────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ┌────────────────────────────┐ │ │
│ │ Compilation Pipeline │ │ │
│ │ (parallel throughout) │ │ │
│ │ │ │ │
│ │ ┌─────┐ ┌──────────┐ │ │ │
│ │ │Lexer│─>│Preproc- │ │ │ │
│ │ │ │ │essor │ │ │ │
│ │ └─────┘ └────┬─────┘ │ │ │
│ │ │ │ │ │
│ │ ┌──────▼──────┐ │ │ │
│ │ │ Parser │ │ │ │
│ │ └──────┬──────┘ │ │ │
│ │ │ │ │ │
│ │ ┌───────────▼────────┐ │ │ │
│ │ │ Sema (modular) │ │ ┌──────┐ │ │
│ │ │ ┌──────┐┌───────┐ │ │ │Memo- │ │ │
│ │ │ │Lookup││Overld.│ │ │ │izer │<┘ │
│ │ │ ├──────┤├───────┤ │──┼─>│ │ │
│ │ │ │Templ.││ Types │ │ │ └──────┘ │
│ │ │ ├──────┤├───────┤ │ │ │
│ │ │ │Layout││Mangle │ │ │ │
│ │ │ └──────┘└───────┘ │ │ │
│ │ └───────────┬────────┘ │ │
│ │ │ │ │
│ │ ┌──────▼──────┐ │ │
│ │ │ IR Lower │ │ │
│ │ │ (ForgeIR) │ │ │
│ │ └──────┬──────┘ │ │
│ │ │ │ │
│ │ ┌──────▼──────┐ │ │
│ │ │ x86-64 │ │ │
│ │ │ CodeGen │ │ │
│ │ └──────┬──────┘ │ │
│ │ │ │ │
│ └────────────────┼───────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌─▼────────┐ ┌▼───────────┐ │
│ │ Linker │ │ Runtime │ │ DAP │ │
│ │ (release │ │ Loader / │ │ Debugger │<── VS Code │
│ │ mode only) │ │ JIT │ │ Server │ │
│ └─────────────┘ └──────────┘ └────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
This is the heart of forgecc. Every compilation artifact is stored as a record keyed by the hash of its inputs. Storage is distributed across all daemon instances in the network — there is no central server.
Key: blake3(input_content + compiler_flags + dependency_hashes)
Value: the artifact (tokens, AST fragment, IR, machine code section, etc.)
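A dependency-free sketch of that key derivation may help. The design calls for blake3 (an external crate); std's `DefaultHasher` stands in here so the sketch compiles on its own, and the function signature is illustrative rather than the actual forgecc API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical store-key derivation: hash the input content together with
/// the compiler flags and the hashes of all dependencies. DefaultHasher is
/// a stand-in for blake3.
fn store_key(input_content: &[u8], compiler_flags: &[String], dependency_hashes: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    input_content.hash(&mut h);
    compiler_flags.hash(&mut h);
    dependency_hashes.hash(&mut h);
    h.finish()
}
```

Because the key is a pure function of the inputs, two machines compiling the same content with the same flags derive the same key, which is what allows any peer holding a key to serve the artifact.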
Single content-store implementation for both index and artifacts:
- forgecc uses one custom content-store abstraction for both:
  - compilation artifacts (tokens/AST/IR/code/debug records)
  - Merkle/index metadata (build index, dependency signatures, namespace hashes, peer metadata)
- The build index is not a separate storage system; it is another table/view in the same content-store backend.
Multi-tier storage hierarchy:
- Tier 0 (memory): Hot objects, hash maps, and recently used record pages.
- Tier 1 (local SSD): Persistent embedded KV pages (redb-backed) with ACID semantics.
- Tier 2 (network peers): On-demand remote fetch by content hash via P2P endpoints.
- Read path is memory → SSD → network; write path commits to SSD, then advertises to peers.
- Eviction is explicit: cold pages/objects are unloaded from Tier 0 under memory pressure. The OS can still page out process memory, but forgecc does not rely on random swap behavior for cache policy correctness.
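A minimal sketch of this read/write path, using plain `HashMap`s as stand-ins for the real tiers (the actual Tier 1 is redb-backed and Tier 2 is a network fetch; all names here are illustrative):

```rust
use std::collections::HashMap;

/// Three-tier store sketch: Tier 0 = memory, Tier 1 = local SSD,
/// Tier 2 = network peers.
struct TieredStore {
    memory: HashMap<u64, Vec<u8>>,     // Tier 0: hot objects
    local_db: HashMap<u64, Vec<u8>>,   // Tier 1: stand-in for the redb-backed KV store
    peers: Vec<HashMap<u64, Vec<u8>>>, // Tier 2: stand-in for remote P2P fetch
}

impl TieredStore {
    /// Read path: memory → SSD → network. Hits in colder tiers are
    /// promoted into the warmer ones.
    fn get(&mut self, key: u64) -> Option<Vec<u8>> {
        if let Some(v) = self.memory.get(&key) {
            return Some(v.clone());
        }
        if let Some(v) = self.local_db.get(&key) {
            self.memory.insert(key, v.clone()); // warm Tier 0
            return Some(v.clone());
        }
        for peer in &self.peers {
            if let Some(v) = peer.get(&key) {
                self.memory.insert(key, v.clone());
                self.local_db.insert(key, v.clone()); // persist the remote hit
                return Some(v.clone());
            }
        }
        None
    }

    /// Write path: commit to SSD first, then keep the object hot in memory.
    /// (The real daemon would also announce `key` to peers here.)
    fn put(&mut self, key: u64, value: Vec<u8>) {
        self.local_db.insert(key, value.clone());
        self.memory.insert(key, value);
    }
}
```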
Local persistent backend:
- redb — pure Rust, ACID, MVCC, zero-copy reads
Distributed layer (IPFS-like):
- Each daemon maintains a local store and a distributed hash table (DHT) index
- When a daemon produces an artifact, it stores it locally and announces the key to peers
- When a daemon needs an artifact, it checks: local store → DHT → compile locally
- Data is fetched on demand from the peer that has it (content-addressed, so any peer with the same key has the same data)
- Seeds: Daemons bootstrap into the network via a seed list (IP:port pairs). No mDNS, no central discovery server. Seeds are configured in a config file or environment variable.
- No centralized storage: Each machine stores what it compiles. Popular artifacts (e.g., frequently-included headers) naturally replicate across machines as they're requested.
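The announce/lookup flow above can be sketched as a toy model (all identifiers are hypothetical; the real DHT is a networked index, not an in-process map):

```rust
use std::collections::{HashMap, HashSet};

/// Each daemon keeps a local store plus a DHT view mapping content keys to
/// the peers known to hold them.
struct Daemon {
    id: &'static str,
    local: HashMap<u64, Vec<u8>>,
    dht: HashMap<u64, HashSet<&'static str>>,
}

impl Daemon {
    /// Produce an artifact: store locally, then announce the key.
    fn produce(&mut self, key: u64, artifact: Vec<u8>) {
        self.local.insert(key, artifact);
        self.dht.entry(key).or_default().insert(self.id);
    }

    /// Need an artifact: local store → DHT peers → compile locally.
    fn resolve(
        &self,
        key: u64,
        fetch: impl Fn(&str, u64) -> Option<Vec<u8>>,
        compile: impl FnOnce() -> Vec<u8>,
    ) -> Vec<u8> {
        if let Some(a) = self.local.get(&key) {
            return a.clone();
        }
        if let Some(peers) = self.dht.get(&key) {
            for p in peers {
                if let Some(a) = fetch(*p, key) {
                    return a; // content-addressed: any peer's copy is identical
                }
            }
        }
        compile()
    }
}
```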
Record types stored:
| Record Type | Key Inputs | Value |
|---|---|---|
| `SourceFile` | file path + content hash | raw source bytes |
| `TokenStream` | source hash + predefines | serialized token stream |
| `PreprocResult` | token hash + include graph hash | preprocessed token stream |
| `ASTFragment` | preproc hash + parse context | serialized AST |
| `TypeChecked` | AST hash + type context | type-annotated AST |
| `ForgeIR` | sema hash + codegen options | IR for a function/global |
| `MachineCode` | IR hash + target options | x86-64 bytes + relocations |
| `Section` | machine code hash + section name | COFF section content |
| `DebugInfo` | sema hash + debug options | Source maps, type info, frame layouts (internal format for DAP; serialized to CodeView only for release builds) |
The memoizer wraps every pipeline stage. Pseudocode:
```rust
fn memoized<I: ContentHash, O: Serialize + DeserializeOwned>(
    store: &ContentStore,
    stage: Stage,
    input: &I,
    compute: impl FnOnce(&I) -> O,
) -> O {
    // Key = (pipeline stage, content hash of the input)
    let key = Key::new(stage, input.content_hash());
    // Check local store first, then distributed peers
    if let Some(cached) = store.get(&key) {
        return cached.deserialize();
    }
    let result = compute(input);
    store.put(&key, &result.serialize());
    result
}
```

The fundamental memoization invariant: For memoization to be worthwhile at any granularity level, the cost of computing and validating the cache key (the "signature") must be significantly less than the cost of executing the computation it guards. If checking the signature takes as long as just re-doing the work, the cache is overhead, not optimization. This invariant must hold not just on average but in the common case — the "change one file, rebuild" inner loop that developers hit hundreds of times per day. Every design decision below is evaluated against this invariant.
Granularity levels (from coarse to fine):
- Translation unit level: Hash(source + all includes + flags) → full compiled output. This is what ccache/sccache do. forgecc does this as the outermost cache.
- Header level: Hash(header content + its transitive includes) → preprocessed + parsed result. This replaces PCH entirely. Every header is automatically cached at the granularity the compiler determines is optimal. Users never manage PCH or unity builds.
- Function level: Hash(function AST + type context) → IR → machine code. If a header change doesn't affect a function's types, the function isn't recompiled.
- Section level: Hash(machine code + relocations) → COFF section. Identical sections across TUs are stored once (COMDAT-like deduplication for free).
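The section-level deduplication falls out of content addressing automatically, as this minimal sketch shows (`DefaultHasher` stands in for blake3; the function names are illustrative):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Key a section by its machine code and relocations.
fn section_key(code: &[u8], relocs: &[(u32, u64)]) -> u64 {
    let mut h = DefaultHasher::new();
    code.hash(&mut h);
    relocs.hash(&mut h);
    h.finish()
}

/// Identical sections from different TUs hash to the same key, so the
/// store keeps exactly one copy (COMDAT-like dedup with no extra logic).
fn store_section(store: &mut HashMap<u64, Vec<u8>>, code: &[u8], relocs: &[(u32, u64)]) -> u64 {
    let key = section_key(code, relocs);
    // Storing the same section twice is a no-op: one copy, many references.
    store.entry(key).or_insert_with(|| code.to_vec());
    key
}
```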
Why this replaces PCH: In traditional compilers, PCH is a user-managed optimization where the developer chooses which headers to precompile. This is fragile (wrong PCH invalidates everything) and requires manual maintenance. forgecc's approach:
- Every header's parsed/type-checked result is cached automatically
- The cache key includes the exact macro state at the point of inclusion
- Headers included with the same macro state share a single cached result
- No `SharedPCHHeaderFile`, no `/Yu`/`/Yc` flags, no unity builds
Macro-set pruning (two-pass optimization): On the first inclusion, the cache key
is the full macro environment — a conservative over-approximation. During
preprocessing, the preprocessor records which macros were actually tested
(#ifdef, #if defined(), #if, macro expansion). After the first pass, the
dependency signature stores only those tested macros. On subsequent builds, the
cache key narrows to just the tested subset. This prevents unrelated #define
changes from invalidating the cache. In practice, a typical UE header tests 5–15
macros out of 200+ defined on the command line, so the pruned key is dramatically
more stable.
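A sketch of the pruned key (the function signature is hypothetical, and `DefaultHasher` stands in for blake3). For brevity this sketch only hashes macros that are defined; the real signature must also record macros that were tested but undefined:

```rust
use std::collections::{BTreeMap, HashSet};
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Only macros the preprocessor actually tested contribute to the key,
/// so unrelated -D churn leaves cached results valid.
fn pruned_key(
    full_env: &BTreeMap<String, String>, // every -D on the command line
    tested: &HashSet<String>,            // macros hit by #ifdef / #if / expansion
) -> u64 {
    let mut h = DefaultHasher::new();
    // BTreeMap iterates in sorted order, so the key is order-stable.
    for (name, value) in full_env {
        if tested.contains(name) {
            name.hash(&mut h);
            value.hash(&mut h);
        }
    }
    h.finish()
}
```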
This is one of the most impactful optimizations forgecc implements, and it is something no mainstream C++ compiler does today. The idea: after the first parse of a file, record a dependency signature — the minimal set of external inputs that actually affected the compilation result. On subsequent builds, check only those dependencies instead of re-hashing the entire transitive include graph.
What the dependency signature captures:
- Macro dependencies: Which `#ifdef`/`#if defined()` macros were tested, and their values. If a file tests `#ifdef _DEBUG` and `#ifdef UE_BUILD_DEVELOPMENT`, only those two macros are dependencies — not the hundreds of other `-D` flags on the command line.
- Symbol dependencies: Which names were looked up from other TUs/headers, and the hash of their declarations. If `Foo.cpp` uses `class Bar` from `Bar.h`, the dependency is hash(Bar's declaration), not hash(entire Bar.h).
- Type layout dependencies: Which type layouts were queried (for `sizeof`, member access, vtable dispatch). If `Bar` adds a new private method that doesn't change its layout, `Foo.cpp` doesn't need recompilation.
- Include dependencies: Which headers were actually included (after `#ifdef` guards resolved), and the hash of their content.
Data structure (stored per TU in the content-addressed store):
```rust
struct DependencySignature {
    /// Macros tested and their values at point of use
    macro_deps: BTreeMap<InternedString, Option<TokenStream>>,
    /// External symbols referenced and hash of their declarations
    symbol_deps: BTreeMap<QualifiedName, Blake3Hash>,
    /// Type layouts depended on (for sizeof, member access)
    layout_deps: BTreeMap<QualifiedName, Blake3Hash>,
    /// Headers included and their content hashes
    include_deps: BTreeMap<PathBuf, Blake3Hash>,
}
```

On rebuild: instead of re-parsing the file, check whether any dependency in the signature has changed. If none have, skip the file entirely — not even re-lexing is needed. This is strictly more precise than the TU-level hash (which re-hashes all includes and flags).
Expected impact: For a typical "change one .cpp file" scenario in UE5, this reduces the set of TUs that need any work from hundreds (due to shared headers) to just the one file that changed. For header changes, it narrows recompilation to only the TUs that actually use the changed declarations — not all TUs that transitively include the header.
Dependency pruning strategy (anti-thrashing):
forgecc avoids "hash the whole world" invalidation by recording only the subset of facts that actually affected a result:
- Preprocessor pruning: cache keys shrink from the full macro environment to macros that were actually tested/expanded (`#ifdef`, `defined`, expansion sites).
- Semantic pruning: signatures include symbol/type/layout dependencies actually consulted by parse+sema, not entire transitive header text.
- ADL pruning: lookup records include both positive and negative dependencies (e.g., `EMPTY_HASH` when no candidate existed), so newly introduced overloads in associated namespaces invalidate precisely.
- Namespace pruning: per-namespace content hashes and version checks avoid rescanning unrelated declarations; global namespace uses partitioned/Merkle hashing.
This pruning model is the primary mechanism that keeps warm-cache hit rates high and
prevents cache thrashing from unrelated #define or declaration churn.
Implementation note: This is a Phase 1–3 feature. The dependency signature is built incrementally during lexing (macro deps, include deps), parsing (symbol deps), and sema (layout deps), then stored in the content-addressed store alongside the compilation result. Estimated cost: ~500–800 additional lines in the memoizer.
Template instantiation memoization is one of the hardest problems in the entire forgecc design. Unlike TU-level or header-level caching where the inputs are relatively self-contained (source text + flags + includes), a template instantiation's result depends on a large, open-ended set of contextual information — name lookup results, overload resolution outcomes, ADL-discovered functions, concept satisfaction, SFINAE success/failure, partial specialization selection, and more. Getting this wrong means either incorrect caching (returning stale results after a change) or excessive invalidation (recomputing instantiations that haven't actually changed).
UE5's template-heavy code (TArray, TMap, TSet, TSharedPtr, TFunction, TDelegate,
TVariant, TOptional, plus MSVC STL headers) means the same templates are instantiated
across hundreds of TUs with the same or similar arguments. A single TArray<AActor*>
instantiation produces ~2,000–5,000 lines of IR. Memoizing it avoids
repeating that work across every TU that uses it. The potential savings are enormous —
but only if the signature check is fast enough.
The result of instantiating a template T<Args...> depends on all of the following:
A. The template definition
| Dependency | Why it matters |
|---|---|
| Template body AST hash | Any change to the template source code |
| Base class template definitions (recursive) | e.g. _Compressed_pair inside std::vector |
| Default template argument expressions | template<class T, class Alloc = allocator<T>> |
| Member function template bodies | Each may be instantiated on demand |
| Friend declarations | Friends can inject names into enclosing namespaces |
B. The template arguments (deep hash)
| Dependency | Why it matters |
|---|---|
| Full type definition of each type argument | Layout, members, bases, virtual functions — not just the name |
| Nested types of arguments (`T::value_type`, `T::iterator`) | Dependent name resolution |
| Member functions of arguments | If template calls t.foo(), the overload set of foo matters |
| Base classes of arguments | Affects member lookup, conversions, concept satisfaction |
| Friend functions of arguments | Found via ADL |
| NTTP values | array<int, 5> — the 5 is part of the key |
| Template template argument definitions | template<template<class> class C> — C's definition matters |
C. Two-phase name lookup results
This is the hardest category. C++ templates use two-phase lookup (N4950 §13.8):
Phase 1 — non-dependent names (resolved at template definition point):
| Dependency | Why it matters |
|---|---|
| Non-dependent name resolutions | Names that don't depend on T are resolved once when the template is first parsed |
| Non-dependent operator overloads | a + b where both types are known at definition |
| Using declarations visible at definition | using std::swap; |
| Namespace-scope names visible at definition | All names from enclosing/used namespaces |
Phase 2 — dependent names (resolved at instantiation point):
| Dependency | Why it matters |
|---|---|
| ADL-found functions for dependent calls | swap(a, b) where a is type T — ADL searches T's associated namespaces |
| ADL-found operators for dependent expressions | a == b, os << a where a is dependent |
| Associated namespaces of all argument types | The set of namespaces ADL searches |
| Associated namespaces of base classes of arguments | ADL also searches base class namespaces |
| Associated namespaces of template arguments of arguments | vector<MyClass> → MyClass's namespace is associated |
D. Overload resolution outcomes
Every call expression involving dependent types produces an overload resolution:
| Dependency | Why it matters |
|---|---|
| Full candidate set (all visible overloads) | Adding/removing an overload changes the winner |
| Implicit conversion sequences for each candidate | Whether T is convertible to parameter types |
| Concept constraints on candidates | Constrained candidates may become more/less viable |
| Partial ordering of function templates | Which function template is more specialized |
| User-defined conversion operators/constructors | explicit/non-explicit conversions |
| Deleted functions in candidate set | A deleted overload can make the call ill-formed |
| Default argument expressions | Affect viable candidate determination |
E. Concept satisfaction and SFINAE
| Dependency | Why it matters |
|---|---|
| Concept definition hashes | The concept body itself (e.g. CSameAs) |
| Atomic constraint satisfaction results | Each requires sub-expression evaluated against T |
| Nested requires-expression validity | requires { t.foo(); } — depends on T::foo existing |
| Subsumption ordering | Which constrained overload is more constrained |
| SFINAE substitution success/failure | enable_if_t<is_integral_v<T>> — depends on the trait |
| void_t detection idiom results | Whether an expression is well-formed |
| decltype on dependent expressions | The deduced type of decltype(t + u) |
F. Specialization selection
| Dependency | Why it matters |
|---|---|
| Set of visible partial specializations | Adding a partial spec can change which is selected |
| Set of visible explicit specializations | An explicit spec replaces the primary template entirely |
| Partial ordering of specializations | Which partial spec is "more specialized" |
G. Implicit special member functions
| Dependency | Why it matters |
|---|---|
| Defaulted/deleted status of T's special members | Whether T is copyable, movable, destructible |
| Trivially copyable/destructible status | Affects memcpy optimization, constexpr eligibility |
| Aggregate status of T | Whether aggregate initialization applies |
H. Instantiation-point context
| Dependency | Why it matters |
|---|---|
| #pragma pack state at instantiation point | Affects layout of instantiated class templates |
| [[no_unique_address]] on members | Affects class layout |
| Compiler flags (-D, -std) | Feature-test macros, __cplusplus value |
Given the complexity above, forgecc uses a three-tier strategy that respects the fundamental memoization invariant (signature cost << computation cost):
Tier 1 — Structural key (Phase 3, day one)
```rust
struct TemplateInstantiationKey {
    /// Hash of the template definition AST (body, bases, defaults, friends)
    template_def_hash: Blake3Hash,
    /// Deep hash of all template arguments (full type definitions, not just names)
    template_args_hash: Blake3Hash,
    /// Hash of the "instantiation context": all declarations visible in the
    /// associated namespaces of the template arguments at the point of instantiation.
    /// This is the conservative over-approximation that makes ADL safe.
    instantiation_context_hash: Blake3Hash,
    /// Active #pragma pack state, relevant compiler flags
    context_flags_hash: Blake3Hash,
}
```

Cost analysis: Computing `template_def_hash` and `template_args_hash` is cheap —
these are hashes of AST nodes that are already in memory (or in the store) from
earlier pipeline stages. The expensive part is instantiation_context_hash, which
requires enumerating all declarations in the associated namespaces. However:
- Associated namespaces are typically small (1–3 namespaces for most UE5 types)
- The declarations in those namespaces are already parsed and hashed (from header caching) — forgecc combines existing hashes rather than re-parsing
- forgecc caches the "namespace content hash" per namespace and invalidates it only when a declaration is added/removed/changed in that namespace
Estimated signature cost: ~1–5 μs per instantiation (hash lookups + combine). Estimated instantiation cost: ~50–500 μs for a typical class template (type checking all members, overload resolution, codegen). Ratio: 50–500x, well within the memoization invariant.
Trade-off: This over-invalidates. Adding an unrelated function to namespace game
invalidates all instantiations whose arguments have game as an associated namespace.
In practice this is acceptable because (a) most UE5 types are in the global namespace
or a small set of UE namespaces, and (b) namespace-level changes are relatively rare
compared to function-body changes.
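The Tier 1 key assembly can be sketched as follows. This is illustrative only: it uses a 64-bit std hasher as a stand-in for the 32-byte Blake3Hash, and the helper names (`tier1_key`, `combine`) are hypothetical, not the forge-sema API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the 32-byte blake3 digest used by the real design.
type H = u64;

fn hash_of<T: Hash>(value: &T) -> H {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// Combine already-computed digests into one key component.
/// In the real design this would be blake3 over the concatenated digests.
fn combine(hashes: &[H]) -> H {
    hash_of(&hashes)
}

struct TemplateInstantiationKey {
    template_def_hash: H,
    template_args_hash: H,
    instantiation_context_hash: H,
    context_flags_hash: H,
}

/// Build the Tier 1 key purely from hashes that earlier pipeline stages
/// already produced — no re-parsing, just combining.
fn tier1_key(
    template_def_hash: H,
    arg_hashes: &[H],           // deep hashes of each template argument
    associated_ns_hashes: &[H], // cached per-namespace content hashes
    flags_hash: H,
) -> TemplateInstantiationKey {
    TemplateInstantiationKey {
        template_def_hash,
        template_args_hash: combine(arg_hashes),
        instantiation_context_hash: combine(associated_ns_hashes),
        context_flags_hash: flags_hash,
    }
}
```

Note how a change to any associated namespace's content hash flows into `instantiation_context_hash`, which is exactly the over-approximation described above.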
Tier 2 — Recorded dependency signature (Phase 7+, optimization)
During the first instantiation, instrument every lookup, overload resolution, and concept check to record exactly which external facts were consulted:
```rust
struct TemplateInstantiationSignature {
    /// Tier 1 key (for initial lookup)
    key: TemplateInstantiationKey,
    /// Phase 1 (definition-time) name resolutions — shared across all instantiations
    /// of the same template, computed once when the template is first parsed
    nondependent_lookups: BTreeMap<QualifiedName, Blake3Hash>,
    /// Phase 2 (instantiation-time) dependent name resolutions
    dependent_name_resolutions: BTreeMap<QualifiedName, Blake3Hash>,
    /// ADL results: (unqualified function name, argument type hashes) → winning candidate hash.
    /// If ADL found no candidates, the value is a sentinel EMPTY_HASH — this is a
    /// "negative dependency." If someone later adds a matching function to an associated
    /// namespace, the hash changes and the instantiation is invalidated. Negative
    /// dependencies are critical for soundness: without them, adding a new overload
    /// in a seemingly unrelated namespace would not trigger re-instantiation.
    adl_results: BTreeMap<(InternedString, Vec<Blake3Hash>), Blake3Hash>,
    /// Overload resolution outcomes: call site → selected candidate hash
    overload_winners: BTreeMap<CallSiteId, Blake3Hash>,
    /// Concept satisfaction: (concept hash, argument hashes) → satisfied?
    concept_results: BTreeMap<(Blake3Hash, Vec<Blake3Hash>), bool>,
    /// SFINAE: substitution site → success/failure.
    /// Both outcomes are recorded — a substitution that *failed* is a dependency on
    /// the non-existence of a valid substitution. If a later change makes the
    /// substitution succeed, the recorded "false" no longer matches and the
    /// instantiation is invalidated.
    sfinae_results: BTreeMap<SubstitutionSiteId, bool>,
    /// Specialization selection: which partial/explicit spec was chosen
    specialization_choices: BTreeMap<Blake3Hash, SpecializationChoice>,
    /// Type properties queried: sizeof, alignof, is_trivially_copyable, etc.
    type_properties: BTreeMap<Blake3Hash, TypePropertySet>,
}
```

Cost analysis: On the first instantiation, recording these dependencies adds ~10–20% overhead (an extra hash per lookup/resolution). On subsequent builds, the validation cost is: iterate the recorded dependencies, check each hash against the current value. This is O(number of recorded dependencies), typically 10–100 lookups for a class template instantiation.
Estimated validation cost: ~5–20 μs (10–100 hash comparisons). Estimated re-instantiation cost: ~50–500 μs. Ratio: 5–50x, still well within the invariant, and with far fewer false invalidations than Tier 1.
Key insight: The Tier 2 signature is only computed once (during the first instantiation) and then stored alongside the cached result. On subsequent builds, forgecc only validates it (check if any recorded dependency changed), which is much cheaper than computing it from scratch. This is analogous to how build systems record file dependencies during compilation and check mtimes on rebuild — the recording happens as a side effect of the work, not as a separate pass.
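A minimal sketch of that validation loop, covering ADL results (including the EMPTY_HASH negative dependency) and SFINAE outcomes. The closures stand in for live Sema queries, and all names here are illustrative, not the actual forge-sema types:

```rust
use std::collections::BTreeMap;

type H = u64;
const EMPTY_HASH: H = 0; // sentinel: "ADL found nothing" (negative dependency)

struct RecordedSignature {
    // (unqualified name, argument type hashes) → winner hash, or EMPTY_HASH
    adl_results: BTreeMap<(String, Vec<H>), H>,
    // substitution site id → did substitution succeed?
    sfinae_results: BTreeMap<u32, bool>,
}

/// Re-validate a cached instantiation: every recorded fact must still hold
/// in the current program state. `current_adl` / `current_sfinae` stand in
/// for queries against the live semantic context.
fn validate(
    sig: &RecordedSignature,
    current_adl: impl Fn(&str, &[H]) -> H,
    current_sfinae: impl Fn(u32) -> bool,
) -> bool {
    sig.adl_results
        .iter()
        .all(|((name, args), recorded)| current_adl(name, args) == *recorded)
        && sig
            .sfinae_results
            .iter()
            .all(|(site, ok)| current_sfinae(*site) == *ok)
}
```

If the recorded value was EMPTY_HASH and a new overload now wins the lookup, `validate` returns false and the instantiation is recomputed — the negative-dependency case from the struct's comments.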
Tier 3 — Namespace content hash caching (cross-cutting optimization)
The most expensive part of Tier 1's instantiation_context_hash is enumerating
associated namespaces. forgecc uses a per-namespace content hash that
is maintained incrementally:
```rust
struct NamespaceContentHash {
    /// Hash of all declaration signatures in this namespace
    decl_hash: Blake3Hash,
    /// Monotonic version counter — bumped on any add/remove/change
    version: u64,
}
```

When a header is re-parsed and a namespace gains/loses/changes a declaration, the
namespace's content hash is updated. Template instantiation keys that reference this
namespace can do a cheap version check (version == cached_version) before falling
back to full hash comparison. This makes the common case (nothing changed in the
namespace) essentially free — a single integer comparison.
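The fast path can be sketched as below, with u64 digests standing in for Blake3Hash and `namespace_unchanged` as a hypothetical helper: compare versions first, and only fall back to the full content-hash comparison when the version has moved.

```rust
struct NamespaceContentHash {
    decl_hash: u64, // hash of all declaration signatures in the namespace
    version: u64,   // bumped on any add/remove/change
}

/// Cache-hit check for a key that recorded (version, decl_hash) when it was
/// built. The version compare is the cheap common case; the hash compare
/// catches version bumps that did not actually change the namespace content.
fn namespace_unchanged(
    current: &NamespaceContentHash,
    seen_version: u64,
    seen_hash: u64,
) -> bool {
    current.version == seen_version || current.decl_hash == seen_hash
}
```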
ADL is the single most dangerous dependency for memoization correctness. Consider:
```cpp
namespace game {
    struct Actor { /* ... */ };
    void serialize(const Actor& a) { /* old implementation */ }
}

// In TU A (compiled first):
template<class T> void process(T& x) { serialize(x); } // ADL finds game::serialize(const Actor&)
void foo() { game::Actor a; process(a); }
```

Later, someone adds a new overload in the same namespace:

```cpp
namespace game {
    void serialize(Actor& a) { /* new overload / implementation */ }
}
```

Adding `game::serialize(Actor&)` changes the ADL/overload result for
process<game::Actor>. The
memoized instantiation of process<game::Actor> must be invalidated.
Tier 1 handles this correctly because instantiation_context_hash includes all
declarations in namespace game. Adding serialize changes the hash.
Tier 2 handles this precisely because adl_results records that the call to
serialize(x) resolved to a specific candidate — or to nothing (a negative
dependency). On rebuild, forgecc checks whether the same unqualified lookup with the same
argument types still produces the same winner. If the original lookup found no
candidate and a new overload now exists, the sentinel EMPTY_HASH no longer matches,
and the instantiation is invalidated. This "negative cache invalidation" is what makes
the system sound in the presence of ADL: forgecc tracks not just what was found, but also
what was not found.
Worst case for ADL: Types in the global namespace. The global namespace can contain thousands of declarations, making the namespace content hash expensive to compute. Mitigation: for the global namespace specifically, forgecc partitions the content hash by declaration name prefix or uses a Merkle tree structure, so that adding one function only invalidates the relevant subtree.
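A sketch of the prefix-partitioned variant, bucketing global-namespace declarations by the first byte of their name (all names hypothetical; a real implementation would bucket more finely and use blake3 rather than the std hasher used here):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn h<T: Hash>(v: &T) -> u64 {
    let mut s = DefaultHasher::new();
    v.hash(&mut s);
    s.finish()
}

/// Global-namespace content hash partitioned by the first byte of each
/// declaration name: adding one function only changes one bucket, so an
/// instantiation can depend on just the buckets its lookups touched.
struct PartitionedNamespaceHash {
    buckets: BTreeMap<u8, u64>, // first byte of name → folded hash of its decls
}

impl PartitionedNamespaceHash {
    fn new() -> Self {
        Self { buckets: BTreeMap::new() }
    }

    fn add_decl(&mut self, name: &str, decl_hash: u64) {
        let key = name.as_bytes().first().copied().unwrap_or(0);
        let e = self.buckets.entry(key).or_insert(0);
        *e = h(&(*e, name, decl_hash)); // fold the new decl into its bucket
    }

    fn bucket_for(&self, name: &str) -> u64 {
        let key = name.as_bytes().first().copied().unwrap_or(0);
        *self.buckets.get(&key).unwrap_or(&0)
    }
}
```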
To maintain the memoization invariant, forgecc enforces these cost budgets:
| Granularity | Max signature cost | Typical computation cost | Min ratio |
|---|---|---|---|
| TU level | ~100 μs (hash source + includes) | ~100–500 ms (full compile) | 1000x |
| Header level | ~10–50 μs (hash content + macro state) | ~1–10 ms (parse + type-check) | 50x |
| Function level | ~1–5 μs (hash AST + type context) | ~10–100 μs (IR + codegen) | 10x |
| Template instantiation (Tier 1) | ~1–5 μs (combine existing hashes) | ~50–500 μs (full instantiation) | 50x |
| Template instantiation (Tier 2) | ~5–20 μs (validate recorded deps) | ~50–500 μs (full instantiation) | 5x |
If profiling shows a granularity level violating its ratio, forgecc either (a) makes the signature cheaper (coarser hash, fewer deps checked) or (b) drops to a coarser granularity level for that case. The system must never spend more time checking the cache than it would spend just doing the work.
Escape hatch: For pathological cases (deeply nested template instantiations, templates with hundreds of dependent calls, types in the global namespace with thousands of declarations), the memoizer falls back to Tier 1 or even TU-level caching. The instrumentation in Tier 2 can detect when the dependency set is growing too large and abort recording, falling back to the structural key.
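The abort-and-fall-back behavior can be sketched as a capped recorder (illustrative names, not the actual forge-sema API): once the dependency set exceeds a threshold, recording is abandoned and the caller caches under the Tier 1 structural key only.

```rust
/// Tier 2 dependency recorder with a hard cap: if an instantiation
/// consults too many external facts, recording aborts and the memoizer
/// falls back to the coarser Tier 1 structural key.
struct DepRecorder {
    deps: Vec<(u64, u64)>, // (fact id, observed hash)
    cap: usize,
    aborted: bool,
}

impl DepRecorder {
    fn new(cap: usize) -> Self {
        Self { deps: Vec::new(), cap, aborted: false }
    }

    fn record(&mut self, fact: u64, observed: u64) {
        if self.aborted {
            return; // already gave up on a precise signature
        }
        if self.deps.len() >= self.cap {
            self.deps.clear(); // discard the partial signature
            self.aborted = true; // caller caches under Tier 1 only
        } else {
            self.deps.push((fact, observed));
        }
    }
}
```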
- Phase 3 (Sema): Implement Tier 1 (structural key). This is sufficient for correctness and provides significant speedup for the common case.
- Phase 7+ (Optimization): Implement Tier 2 (recorded signatures) for the most-instantiated templates. Profile first to identify which templates benefit most.
- Ongoing: Implement Tier 3 (namespace content hash caching) as a cross-cutting optimization that benefits all tiers.
Estimated cost: ~1,500 lines for Tier 1, ~2,500 lines for Tier 2, ~500 lines for
Tier 3. Total: ~4,500 lines across forge-sema and forge-store.
forgecc uses forward (push-based) memoization: the pipeline runs stages in order (lex → preprocess → parse → sema → IR → codegen), each stage hashes its inputs, checks the content-addressed store, and either returns a cached result or computes and stores a new one. This is a fundamentally different model from query-based (pull-based) incrementality as implemented by salsa-rs, which powers rust-analyzer.
How salsa works: The program is expressed as a graph of tracked functions (queries). Each query declares its dependencies implicitly — salsa records which other queries it called during execution. When an input changes, salsa walks the dependency graph top-down using a "red-green" algorithm: it marks potentially affected queries as "red" (may need re-execution), then verifies each by checking whether its recorded dependencies actually changed. If a query's inputs are unchanged (or changed but produced the same output — "backdating"), the query stays "green" and its cached result is reused. Execution is demand-driven: nothing runs until a root query is requested.
| Aspect | Forward Memoization (forgecc) | Query-Based (salsa-rs) |
|---|---|---|
| Execution model | Push: pipeline runs stages in fixed order | Pull: queries execute on demand, top-down |
| Dependency tracking | Explicit: each stage declares its input hash upfront | Implicit: salsa records dependencies during execution |
| Granularity control | Manual: granularity levels are chosen explicitly (TU, header, function, template) | Automatic: every tracked function is a memoization point |
| Cache key | Content hash of inputs (blake3) | Revision counter + dependency graph walk |
| Invalidation | Content-based: if the hash matches, the result is valid regardless of history | Revision-based: if any transitive input's revision changed, re-verify |
| Determinism | Guaranteed: same inputs → same hash → same result, always | Guaranteed within a session; cross-session requires careful input identity |
| Distribution | Natural: content hashes are globally unique, any peer can serve a cached result | Hard: revision numbers are local to a single database instance |
| Persistence | Natural: content-addressed store is inherently persistent and shareable | Possible but not native; salsa's database is in-memory, persistence requires serialization |
| Parallelism | Coarse: stages run in parallel across TUs; within a TU, pipeline is sequential | Fine: salsa supports concurrent query execution with cycle detection |
| Overhead per cache check | ~1–5 μs (hash lookup in store) | ~0.1–1 μs (revision comparison, no hashing) |
| Overhead per cache miss | Hash computation + store write (~5–20 μs) | Execution + dependency recording (~1–5 μs recording overhead) |
| False re-executions | Low: content-based — if the output bytes are identical, it's a hit | Lower: backdating means even if an input changed, if the output didn't, downstream stays green |
| Incremental precision | Depends on granularity tier chosen (Tier 1 over-invalidates, Tier 2 is precise) | High: automatic fine-grained tracking of every query dependency |
| Code complexity | Lower: each stage is a standalone function with explicit hash-in/hash-out | Higher: requires structuring the entire program as tracked queries with a database |
| Bootstrapping cost | Low: works with any function signature | Moderate: requires salsa macros, database setup, tracked struct definitions |
Why forgecc uses forward memoization instead of salsa:
- Distribution is the killer feature. Content hashes are globally meaningful — if machine A computes blake3(inputs) → result, machine B can use that result without knowing anything about machine A's revision history. salsa's revision numbers are local to a single database instance. Making salsa distributed would require mapping local revisions to content hashes anyway, at which point the system has rebuilt forward memoization on top of salsa.
- Persistence is free. The content-addressed store is inherently persistent — shut down the daemon, restart it, and all cached results are still valid because they're keyed by content, not by ephemeral revision numbers. salsa's database is in-memory; persisting it requires serializing the entire dependency graph and re-validating it on reload (since file contents may have changed while the process was down).
- C++ compilation is coarse-grained enough. salsa shines when the computation graph has thousands of fine-grained interdependent queries (like rust-analyzer's type inference, where changing one function signature can ripple through hundreds of call sites). C++ compilation is more pipeline-shaped: lex → preprocess → parse → sema → IR → codegen. The natural memoization points (TU, header, function, template instantiation) are well-defined and relatively few per TU. The overhead of salsa's automatic dependency tracking doesn't pay for itself when the granularity is already coarse.
- Backdating is less valuable for C++. salsa's backdating (detecting that a re-executed query produced the same output as before) is powerful for IDE scenarios where the user is typing and most keystrokes don't change the semantic result. forgecc targets batch compilation where inputs change between builds, not keystroke-by-keystroke. Our content-based approach achieves the same effect: if the output hash is the same, downstream stages see no change.
- Simpler mental model. Each pipeline stage is a pure function: hash(inputs) → check store → compute or return cached. There is no global dependency graph to reason about, no cycle detection, no revision bookkeeping. This makes the system easier to debug, profile, and extend.
Where salsa would win: If forgecc ever adds an IDE mode (language server), salsa's demand-driven model with fine-grained invalidation would be superior for interactive use cases. The architecture does not preclude adding a salsa-based layer on top of the content-addressed store for IDE features — salsa queries could use the store as their backing cache, getting the best of both worlds.
Cost comparison: Forward memoization adds ~500–800 lines to the memoizer (§4.2). A salsa-based approach would require restructuring the entire pipeline into tracked queries (~2,000–3,000 lines of boilerplate) plus the salsa dependency itself. The forward approach is smaller, simpler, and directly supports the distributed use case that is core to forgecc's value proposition.
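The whole forward model reduces to one wrapper per stage — hash the inputs, probe the store, compute on a miss. A toy sketch with a string-keyed stage, using a 64-bit std hasher and an in-memory map as stand-ins for blake3 and the real content-addressed store:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn content_hash<T: Hash>(v: &T) -> u64 {
    let mut s = DefaultHasher::new();
    v.hash(&mut s);
    s.finish()
}

/// Toy content-addressed store: digest → serialized stage output.
struct Store {
    map: HashMap<u64, String>,
    hits: usize,
}

impl Store {
    fn new() -> Self {
        Self { map: HashMap::new(), hits: 0 }
    }

    /// The forward-memoization model in one function:
    /// hash the inputs, look up the store, compute and insert on miss.
    fn memoize(&mut self, input: &str, stage: impl Fn(&str) -> String) -> String {
        let key = content_hash(&input);
        if let Some(cached) = self.map.get(&key) {
            self.hits += 1;
            return cached.clone();
        }
        let out = stage(input);
        self.map.insert(key, out.clone());
        out
    }
}
```

Because the key is derived only from content, the same map could be served by any peer or reloaded after a daemon restart — nothing in it is tied to process-local state.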
The lexer and preprocessor are tightly coupled in C++ because the preprocessor operates on tokens, and macro expansion can produce new tokens.
Key design decisions:
- Lazy lexing: Lex on demand as the preprocessor or parser requests tokens.
- Phases of translation (N4950 §5.2): The lexer implements the standard's phases:
  - Phase 1: UTF-8 input decoding, CR/CRLF → LF normalization, BOM (`U+FEFF`) stripping at start of file
  - Phase 2: Backslash-newline line splicing (must be reverted inside raw string literals per the standard)
  - Phase 3: Decomposition into preprocessing tokens and whitespace
  - Phase 5–6: Adjacent string literal concatenation with encoding prefix compatibility checking (e.g., `u8"a" "b"` → `u8"ab"`, but `u"a" U"b"` is ill-formed). This includes user-defined string literal concatenation where ud-suffixes must match.
- Include-once optimization: Track `#pragma once` and include guards. If the header content hash is unchanged, skip re-lexing entirely.
- Automatic header caching: Any header's fully parsed and type-checked result is cached in the store — not just the preprocessed token stream, but the complete AST and type information. This is closer to an "automatic serialized AST per header" than traditional PCH. The key difference from PCH: the cache is per-header (not per-project), keyed by content + macro state, and requires zero user configuration.
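A sketch of that per-header cache key — the content digest paired with a digest of only the macros the header actually consults. The helper `header_key` and the u64 digests are illustrative stand-ins for the real blake3-based design:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn digest<T: Hash>(v: &T) -> u64 {
    let mut s = DefaultHasher::new();
    v.hash(&mut s);
    s.finish()
}

/// A header's cached AST is keyed by content *and* macro state: the same
/// header text can expand differently under different defines.
#[derive(PartialEq, Eq, Hash, Debug)]
struct HeaderCacheKey {
    content_hash: u64,
    macro_state_hash: u64,
}

/// `consulted` holds only the macros the header actually reads (name →
/// definition, None if undefined) — keying on the full macro environment
/// would invalidate far too often.
fn header_key(content: &str, consulted: &BTreeMap<String, Option<String>>) -> HeaderCacheKey {
    HeaderCacheKey {
        content_hash: digest(&content),
        macro_state_hash: digest(&consulted),
    }
}
```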
Literal support (N4950 §5.13):
- Integer literals: decimal, octal, hex, binary; digit separators (`1'000'000`); suffixes `u`/`U`, `l`/`L`, `ll`/`LL`, `z`/`Z` (C++23 `size_t` suffix — needed for MSVC STL headers)
- Floating-point literals: decimal and hexadecimal (`0xC.68p+2`); digit separators; suffixes `f`/`F`, `l`/`L`; extended float suffixes (`f16`, `f32`, `f64`, `f128`, `bf16`) recognized but treated as conditionally-unsupported
- Character literals: all encoding prefixes (`u8`, `u`, `U`, `L`); escape sequences (simple, octal, hex, `\u`, `\U`, `\N{name}`, `\o{}`/`\x{}`); multicharacter literals
- String literals: all encoding prefixes; raw string literals (`R"delim(...)delim"`) with special lexer handling (line splicing reverted, delimiter matching); adjacent string concatenation (phase 5–6)
- User-defined literals (N4950 §5.13.8): ud-suffix recognition for integer, floating-point, string, and character literals (e.g., `123_km`, `"hello"s`, `100ms`). The lexer produces a `UserDefinedLiteral` token; Sema resolves it as a call to `operator""` (see §4.5 overload module)
- Boolean literals: `true`, `false`
- Pointer literal: `nullptr`
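As a small illustration of one phase-3 detail from the list above, here is a sketch of digit-separator handling for decimal integer literals. The helper name is hypothetical, and a real lexer would also handle bases, suffixes, and overflow:

```rust
/// Minimal sketch of C++ digit-separator handling: validate separator
/// placement, strip `'`, and parse the remaining digits. Separators must
/// sit strictly between digits — never leading, trailing, or doubled.
fn lex_decimal_literal(s: &str) -> Option<u64> {
    if s.is_empty() || s.starts_with('\'') || s.ends_with('\'') || s.contains("''") {
        return None; // misplaced digit separator
    }
    let digits: String = s.chars().filter(|&c| c != '\'').collect();
    if digits.chars().all(|c| c.is_ascii_digit()) {
        digits.parse().ok()
    } else {
        None
    }
}
```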
Preprocessor features needed (based on UE5/Game codebase analysis + N4950 §15):
- `#include` / `#include_next`
- `#define` / `#undef` (object-like and function-like macros, variadic)
- `#if` / `#ifdef` / `#ifndef` / `#elif` / `#elifdef` / `#elifndef` / `#else` / `#endif`
- `#pragma once`, `#pragma push_macro`, `#pragma pop_macro`
- `#pragma comment(lib, ...)`, `#pragma comment(linker, ...)`
- `#pragma pack(push, N)`, `#pragma pack(pop)`, `#pragma pack(N)` — critical for MSVC struct layout compatibility; heavily used in UE5 and MSVC STL headers
- `#pragma warning(push)`, `#pragma warning(pop)`, `#pragma warning(disable: N)` — heavily used in UE5 and MSVC STL headers to suppress warnings
- `#line` (N4950 §15.7) — used by code generators including UHT-generated `.gen.cpp`
- `#error` (N4950 §15.8) — used in UE5 headers for platform/config validation
- `#warning` — non-standard but supported by MSVC and clang-cl
- Null directive `#` (bare `#` on a line, N4950 §15.10)
- `__has_include`, `__has_cpp_attribute`
- `__has_builtin` — clang extension, queried by MSVC STL and UE5 headers to detect compiler built-in availability (e.g., `__has_builtin(__builtin_addressof)`)
- `__has_feature`, `__has_extension` — clang extensions, queried by third-party headers and some UE5 code for feature detection
- `_Pragma`
- `__VA_ARGS__`, `__VA_OPT__`
- `__FILE__`, `__LINE__`, `__COUNTER__`, `__FUNCTION__`, `__FUNCSIG__`
- MSVC-specific: `__declspec(...)`, `__forceinline`, `__pragma`
- UE-specific: `UCLASS()`, `UPROPERTY()`, `GENERATED_BODY()` — standard macros, but the compiler must handle UHT annotations without issues
Predefined macros (N4950 §15.11 + MSVC/clang-cl extensions):
- Standard: `__cplusplus` (must be `202002L` for C++20), `__STDC_HOSTED__`, `__STDCPP_DEFAULT_NEW_ALIGNMENT__`, `__STDCPP_THREADS__`
- Feature-test macros: `__cpp_concepts`, `__cpp_consteval`, `__cpp_designated_initializers`, `__cpp_structured_bindings`, `__cpp_three_way_comparison`, `__cpp_if_constexpr`, `__cpp_fold_expressions`, `__cpp_deduction_guides`, etc. — queried by MSVC STL headers via `#if __cpp_lib_*` / `#if __cpp_*` to enable/disable features
- MSVC-specific: `_MSC_VER`, `_MSC_FULL_VER`, `_MSVC_LANG`, `_WIN32`, `_WIN64`, `_M_X64`, `_M_AMD64`, `_DEBUG` / `NDEBUG`
- Clang-cl-specific: `__clang__`, `__clang_major__`, `__clang_minor__`
C language support: The preprocessor and lexer must handle both C and C++ modes. Based on the codebase analysis, UE5's Engine/Source/ThirdParty contains 4,335 .c files (libcurl, zlib, FreeType, Bink Audio, Rad Audio, etc.). These are third-party libraries that must compile as C.
Recursive descent parser (like Clang), producing an AST. Parsing is inherently
sequential within a TU because C++ is context-dependent: whether a name refers to a
type or a value affects how subsequent tokens are parsed (e.g., T * p; is a pointer
declaration if T is a type, or a multiplication if T is a value). typedef,
using, class/struct/enum declarations, and template declarations all change the
parsing context for everything that follows. This means the parser must process
top-level declarations in order — there is no practical way to parallelize C++ parsing
within a single file. (Parallelism across TUs is unaffected.)
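The `T * p;` ambiguity can be shown concretely: the parse decision is a symbol-table lookup against everything declared earlier in the TU. An illustrative sketch (not the forge-parse API):

```rust
use std::collections::HashSet;

/// Why C++ parsing is inherently sequential: at `T * p ;` the parser must
/// already know whether `T` names a type, which depends on every
/// declaration (typedef, using, class, template) seen so far.
#[derive(Debug, PartialEq)]
enum StarStmt {
    PointerDecl,    // `T` is a type: declares `p` with type `T*`
    Multiplication, // `T` is a value: expression statement `T * p`
}

fn disambiguate(name: &str, known_types: &HashSet<&str>) -> StarStmt {
    if known_types.contains(name) {
        StarStmt::PointerDecl
    } else {
        StarStmt::Multiplication
    }
}
```

Because `known_types` is built incrementally as declarations are parsed, no later declaration can be parsed before all earlier ones — which is exactly why parallelism is only available across TUs.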
C++20 features needed (based on UE5/Game codebase analysis + N4950 cross-reference):
- Classes, structs, unions, enums (including `enum class`)
- Bit-fields — used in UE5 for flags, compact representations; affects parser (member-declarator with `:width`), Sema (layout computation), and codegen
- Non-static data member initializers (NSDMI) — default member initializers (`int x = 42;` in class body); heavily used in UE5
- Templates (class, function, variable, alias templates)
- Template template parameters — used in MSVC STL headers
- Template specialization (partial and full)
- Explicit template instantiation (`template class X<int>;`) and `extern template` (`extern template class X<int>;`) — used in UE5 Core for compile-time control (StringFormatter, TokenStream, SizedHeapAllocator for ANSICHAR/WIDECHAR)
- Class template argument deduction (CTAD) and deduction guides (N4950 §13.3.1) — used in MSVC STL (`std::vector v = {1, 2, 3};`, `std::pair p = {1, "hello"};`); requires parsing deduction-guide declarations and synthesizing deduction candidates
- Concepts and requires-clauses — UE5 uses these! Found in `Core/Public/Concepts/` (`CSameAs`, `CDerivedFrom`, `CConvertibleTo`, etc.), `MassEntity`, `CoreUObject`, `TypedElementFramework`, and many more. ~200+ files with concept/requires usage.
- Lambdas (including generic lambdas, init-capture)
- `constexpr` / `constinit` / `consteval`
  - `constinit` (N4950 §9.2.1) — guarantees constant initialization of variables; used for static initialization safety
  - `consteval` — used in D3D12RHI, VerseVM, CoreUObject (~20 files)
- Structured bindings (N4950 §9.6)
- `if constexpr` / `if consteval` (N4950 §8.5.2)
- Fold expressions
- Designated initializers
- Aggregate initialization including C++20 parenthesized aggregate init (`T(args...)` for aggregates)
- Brace-enclosed initializer lists / `std::initializer_list` construction — complex rules for narrowing conversions, brace elision, copy-list-init vs. direct-list-init (N4950 §9.4.5)
- Three-way comparison (`<=>`) — used in abseil, D3D12RHI
- Defaulted comparison operators (`= default` for `operator==` and `operator<=>`, N4950 §11.10) — compiler must synthesize member-wise comparison logic
- `using enum`
- `static_assert` with optional message (N4950 §9.1)
- `alignas` / `alignof` (N4950 §9.12.2) — used in UE5 for SIMD alignment
- `extern "C"` / linkage specifications (N4950 §9.11) — used extensively in UE5 for C interop and DLL exports; affects name mangling and calling conventions
- `explicit(bool)` conditional explicit (N4950 §9.2.3) — `explicit(condition)` where condition is a constant expression; used in MSVC STL headers (`std::pair`, `std::tuple`, `std::optional` constructors)
- `noexcept` specifier and operator (N4950 §14.5, §7.6.2.7) — `noexcept` is part of the function type in C++17+; affects overload resolution and type system
- `typeid` and RTTI — `dynamic_cast`, `typeid` used in some UE5 paths; requires vtable RTTI info generation
- Attributes (`[[nodiscard]]`, `[[nodiscard("reason")]]`, `[[likely]]`, `[[unlikely]]`, `[[no_unique_address]]`, `[[deprecated("msg")]]`, `[[fallthrough]]`, `[[maybe_unused]]`) — attribute argument parsing needed for string-argument attributes
- User-defined literals — parser must recognize ud-suffix on literals (see §4.3)
- MSVC extensions: `__declspec`, `__forceinline`, `__int64`, SEH (`__try` / `__except` / `__finally`), `__uuidof` (COM interface identification, ~70 files), `__assume` (optimizer hint), `_alloca` (stack allocation)
- C language parsing mode (for .c files)
NOT needed (confirmed from UE5 codebase scan — see §15.14):
- Coroutines (`co_await`, `co_yield`, `co_return`) — zero occurrences in Engine
- C++20 modules (`export module`, `import`) — zero occurrences
- `std::ranges` / `std::views` — zero occurrences (UE uses custom iterators)
- `std::format` — zero occurrences (UE uses `FString::Printf` / fmt)
- `std::span` — zero occurrences (UE uses `TArrayView` / `FMemoryView`)
- `__fastcall` / `__vectorcall` / `__thiscall` — zero occurrences
- `__declspec(novtable)` / `__declspec(property)` — zero occurrences
This is the largest component. Critical design requirement: Sema must NOT be monolithic. Clang's Sema is notoriously hard to navigate and debug due to massive source files (tens of thousands of lines). forgecc's Sema is split into small, focused modules.
Modular Sema architecture:
forge-sema/
├── src/
│ ├── lib.rs # Public API, orchestration (~200 lines)
│ │
│ ├── context.rs # Shared semantic context (type tables, scope stack)
│ │
│ ├── lookup/ # Name lookup (separate crate-like module)
│ │ ├── mod.rs # Public interface
│ │ ├── unqualified.rs # Unqualified name lookup
│ │ ├── qualified.rs # Qualified name lookup (::)
│ │ ├── adl.rs # Argument-dependent lookup
│ │ ├── using.rs # Using declarations/directives
│ │ └── two_phase.rs # Two-phase name lookup for templates (N4950 §13.8):
│ │ # non-dependent names resolved at definition,
│ │ # dependent names resolved at instantiation
│ │
│ ├── types/ # Type system
│ │ ├── mod.rs
│ │ ├── repr.rs # Type representation (~500 lines max)
│ │ ├── equivalence.rs # Type comparison, compatibility
│ │ ├── conversion.rs # Implicit/explicit conversions, narrowing detection
│ │ ├── deduction.rs # Template argument deduction, auto, CTAD
│ │ ├── qualifiers.rs # const, volatile, ref qualifiers
│ │ └── noexcept.rs # noexcept as part of function type (C++17+),
│ │ # noexcept operator evaluation (N4950 §7.6.2.7)
│ │
│ ├── overload/ # Overload resolution
│ │ ├── mod.rs
│ │ ├── candidates.rs # Candidate set construction (including built-in
│ │ │ # operator candidates per N4950 §12.5)
│ │ ├── ranking.rs # Implicit conversion ranking
│ │ ├── resolution.rs # Best viable function selection
│ │ ├── literals.rs # User-defined literal operator resolution:
│ │ │ # literal-operator-id lookup, literal operator
│ │ │ # templates (N4950 §12.6)
│ │ └── address_of.rs # Address of overloaded function (N4950 §12.3)
│ │
│ ├── templates/ # Template instantiation
│ │ ├── mod.rs
│ │ ├── instantiate.rs # On-demand instantiation with memoization
│ │ │ # (class, function, variable, alias templates)
│ │ ├── specialize.rs # Partial/full specialization matching
│ │ ├── sfinae.rs # Substitution failure handling
│ │ ├── ctad.rs # Class template argument deduction (N4950 §13.3.1):
│ │ │ # deduction guide synthesis, implicit guides from
│ │ │ # constructors, aggregate deduction guides (C++20)
│ │ └── nttp.rs # Non-type template parameters: class-type NTTPs
│ │ # (C++20 structural types), template template params
│ │
│ ├── concepts/ # C++20 concepts
│ │ ├── mod.rs
│ │ ├── satisfaction.rs # Concept satisfaction checking
│ │ └── constraints.rs # Requires-clause evaluation
│ │
│ ├── consteval/ # Constant expression evaluator (JIT-based, see §4.6a)
│ │ ├── mod.rs # evaluate_constexpr(expr) -> Value
│ │ ├── compiler.rs # Drives IR lowering + codegen for consteval functions
│ │ ├── sandbox.rs # Memory sandbox (custom allocator, UB detection)
│ │ └── builtins.rs # Compiler built-in constexpr functions:
│ │ # __builtin_is_constant_evaluated() (used by UE5
│ │ # via UE_IF_CONSTEVAL), __builtin_addressof,
│ │ # __builtin_unreachable, __builtin_expect
│ │
│ ├── decl/ # Declaration processing
│ │ ├── mod.rs
│ │ ├── variables.rs # Including constinit validation
│ │ ├── functions.rs # Including explicit(bool), noexcept spec
│ │ ├── classes.rs # Including bit-fields, NSDMI, aggregate detection
│ │ ├── enums.rs
│ │ ├── namespaces.rs
│ │ ├── linkage.rs # extern "C" / linkage specifications (N4950 §9.11):
│ │ │ # affects name mangling and calling conventions
│ │ └── static_assert.rs # static_assert with optional message
│ │
│ ├── expr/ # Expression type-checking
│ │ ├── mod.rs
│ │ ├── binary.rs # Including three-way comparison (<=>)
│ │ ├── unary.rs # Including alignof, sizeof, noexcept operator, typeid
│ │ ├── call.rs
│ │ ├── member.rs
│ │ ├── cast.rs # Including dynamic_cast (requires RTTI)
│ │ ├── lambda.rs
│ │ └── init.rs # Brace-enclosed init lists, std::initializer_list
│ │ # construction, narrowing checks, brace elision
│ │
│ ├── stmt/ # Statement checking
│ │ ├── mod.rs
│ │ └── control_flow.rs
│ │
│ ├── bindings.rs # Structured binding declarations (N4950 §9.6):
│ │ # tuple-like (std::tuple_size/get<>), array, and
│ │ # data member decomposition
│ │
│ ├── comparisons.rs # Defaulted comparison operators (N4950 §11.10):
│ │ # synthesize member-wise ==, <=>, rewrite rules
│ │
│ ├── rtti.rs # RTTI support: typeid, dynamic_cast, vtable RTTI
│ │ # info generation
│ │
│ ├── msvc/ # MSVC-specific
│ │ ├── mod.rs
│ │ ├── layout.rs # MSVC class layout computation (including
│ │ │ # #pragma pack effects, __declspec(align),
│ │ │ # __declspec(empty_bases))
│ │ ├── mangle.rs # Microsoft name mangling
│ │ ├── seh.rs # SEH __try/__except/__finally
│ │ ├── abi.rs # Calling conventions (__cdecl, __stdcall),
│ │ │ # vtable layout, COM (__uuidof)
│ │ ├── declspec.rs # __declspec variants: dllexport/dllimport,
│ │ │ # noinline, thread, selectany, noreturn,
│ │ │ # allocate, restrict, safebuffers, code_seg
│ │ └── intrinsics.rs # MSVC compiler intrinsics: _BitScanForward/Reverse,
│ │ # _InterlockedCompareExchange*, __debugbreak,
│ │ # _ReturnAddress, __cpuid, _byteswap_*,
│ │ # __assume, _alloca
│ │
│ └── access.rs # Access control (public/private/protected/friend)
Design rules for Sema:
- No source file exceeds 1,000 lines
- Each module has a clear, narrow responsibility
- Modules communicate through the shared SemaContext (type tables, scope stack)
- Each module is independently testable
- Memoization hooks at every module boundary
Parallelism in Sema: Sema has a sequential spine and parallel fan-out:
- Sequential (the "spine"): Top-level declaration signatures must be processed in declaration order, because each declaration can reference types and names introduced by earlier declarations. This is the same ordering constraint as parsing — a function signature can use a class declared above it, so signatures are resolved top-to-bottom.
- Parallel (the "fan-out"): Once all signatures in a TU are resolved, function bodies can be type-checked in parallel — each body only reads from the shared type table (populated during the sequential phase) and does not modify it. This is a read-heavy workload ideal for shared-memory multithreading (RwLock/DashMap). Template instantiations are also parallelized (each instantiation is independent once the template definition and arguments are known).
- The SemaContext uses concurrent data structures (lock-free maps for type tables, read-write locks for scope stacks) to support the parallel phase.
MSVC ABI compatibility (critical for UE5):
- Name mangling scheme (Microsoft C++ mangling, not Itanium)
- Class layout (including virtual bases, vtable pointer placement)
- Virtual function table layout and thunks
- Exception handling (SEH-based) — 22 files in Engine/Source/Runtime use __try/__except/__finally, mostly in crash handling and D3D code
- Calling conventions (__cdecl, __stdcall, __fastcall, __vectorcall)
Traditional compilers implement consteval/constexpr evaluation by building an
AST interpreter — a mini virtual machine that walks the AST and evaluates
expressions. Clang's implementation (ExprConstant.cpp) is ~17,000 lines and one
of the most complex files in the entire codebase. It must reimplement every expression,
statement, and type operation that can appear in a constexpr context.
forgecc takes a radically different approach: instead of interpreting the AST,
compile the consteval function to native x86-64 code, execute it in the daemon's
own process, and capture the result.
Implementation dependency: This approach requires the full compilation pipeline (forge-ir → forge-codegen → JIT loader) to be operational. During Sema (Phase 3), consteval contexts are detected and recorded as deferred evaluations. The actual compilation and execution happens in Phase 6, after the JIT infrastructure from Phase 5 is available. The generated native code is mapped into the daemon's own address space and executed in-process to produce compile-time constants.
consteval int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; ++i)
result *= i;
return result;
}
constexpr int x = factorial(10); // needs compile-time evaluation
Flow:
1. Sema encounters factorial(10) in a constant expression context
│
▼
2. forge-sema requests consteval compilation of factorial()
│
▼
3. forge-ir: Lower factorial() to ForgeIR (same pipeline as normal code)
│
▼
4. forge-codegen: Compile to x86-64 machine code (just this function + callees)
│
▼
5. Map the machine code into the daemon's own address space (VirtualAlloc + PAGE_EXECUTE)
│
▼
6. Call the function pointer: result = compiled_factorial(10)
│
▼
7. Capture the return value (3628800)
│
▼
8. Sema uses the result as a compile-time constant
Why this is advantageous:
- Massively reduces Sema complexity: No need for a ~17,000-line AST interpreter. The consteval module becomes ~800 lines of "compile and call" orchestration instead.
- Lower semantic duplication: Reusing the normal lowering/codegen path reduces "two implementations of semantics" risk, but it does not eliminate constexpr-specific correctness work. The evaluator still needs explicit checks for constant-evaluation rules that are stricter than runtime execution.
- Performance: Native execution is orders of magnitude faster than AST interpretation. For heavy constexpr computation (compile-time string processing, compile-time hashing, etc.), this matters.
- Reuses existing infrastructure: The JIT compilation pipeline (forge-ir → forge-codegen → map into memory) is the same one used for the runtime loader. No new compilation infrastructure needed.
- Better evolution path for new constexpr features: As the language adds features to constexpr (dynamic allocation, virtual calls, try-catch), the shared lowering path reduces maintenance, but each new rule still needs conformance tests and targeted semantic checks in the evaluator.
Handling transitive calls: consteval functions can call other constexpr
functions, use std:: algorithms, etc. The evaluator compiles the entire transitive closure of
called functions. This is the same lazy compilation model as the runtime JIT — when
a called function hasn't been compiled yet, compile it on demand.
Undefined behavior detection: The C++ standard requires that constexpr evaluation
detect UB and report it as a compile error. Native execution doesn't trap on all UB.
Solution: compile consteval code with lightweight sanitizer instrumentation:
- Signed integer overflow → overflow-checking arithmetic instructions
- Null/dangling pointer dereference → insert null checks before dereferences
- Out-of-bounds array access → insert bounds checks
- Division by zero → insert zero-check before division
- Shift by negative/too-large amount → insert range checks
This is far simpler than a full AST interpreter because forgecc inserts checks during lowering rather than reimplementing the whole C++ execution model.
Beyond UB checks: constexpr requires enforcing abstract-machine constraints that
runtime code may not enforce: allowed side effects, lifetime validity, pointer
provenance, constant initialization/destruction rules, and forbidden operations in
constant-evaluation context. These are validated in Sema and a dedicated IR validation
pass before JIT execution.
Hybrid evaluator strategy: JIT-first for the common subset, with a ForgeIR interpreter fallback for cases where strict abstract-machine tracking is easier or safer than native execution. This preserves the "no full AST interpreter" goal while maintaining standards conformance.
Memory allocation in constexpr (C++20): new/delete are allowed in constexpr
contexts, but all allocations must be freed before evaluation completes. Solution: use
a sandboxed allocator for consteval execution — a bump allocator that tracks all
allocations. After the function returns, verify all allocations were freed. If not,
report a compile error.
Memoization: consteval results are cached in the content-addressed store:
Key = blake3(function_hash + argument_values) → Value = result. If the same
consteval function is called with the same arguments in a different TU, the result
is reused without re-execution.
Cross-compilation and consteval: When the daemon runs on x86-64 Windows but
compiles for a different target (e.g., aarch64), consteval functions
cannot simply be JIT'd as native x86-64 code — the target may have different
sizeof, alignof, endianness, or type-promotion rules, and the C++ standard
requires consteval to evaluate under target semantics. Two strategies:
- ForgeIR interpreter fallback: Evaluate consteval by interpreting ForgeIR directly (slower but correct for any target). This is a small interpreter over the SSA graph (~1,500–2,000 lines), not a full C++ AST interpreter — ForgeIR is already lowered past most language complexity.
- Embeddable CPU emulator: JIT-compile the consteval function to the target architecture's machine code, then execute it under a QEMU-based emulator. QEMU's CPU core supports ARM, ARM64, x86, RISC-V, MIPS, PowerPC, SPARC, and more. Its user-mode emulation is Linux/BSD-only, but Unicorn Engine extracts QEMU's CPU emulation core into an embeddable library that runs natively on Windows x64 (pure C with Rust bindings, thread-safe). This preserves the performance advantage of native execution while honoring target semantics.
For the PoC (x86-64 → x86-64 only), host-native JIT is the fast path, and it runs in an isolated evaluator context (separate process and sanitized runtime state) rather than sharing mutable host state directly.
Isolation policy for consteval evaluation:
- Fast path (default): host-native JIT in an isolated evaluator process.
- Strict path: ForgeIR interpreter for maximal semantic control.
- Cross-target path: target emulation (QEMU/Unicorn) when host and target semantics differ.
QEMU is not the default for same-target evaluation because its overhead is too high for the edit-build-debug loop. QEMU/Unicorn is the mandated path for cross-target correctness and available as a strict verification mode.
forgecc uses a custom SSA-based IR (ForgeIR), not MLIR or LLVM IR. Here is why, and how the alternatives compare:
| | Custom ForgeIR | MLIR-based | LLVM IR |
|---|---|---|---|
| Simplicity | Simpler to start | More infrastructure upfront | Moderate — well-documented but large spec |
| Extensibility | Must build everything | Dialects, passes, transformations for free | Fixed instruction set; extending requires forking |
| Parallelism | Must implement | MLIR has parallel pass infrastructure | No built-in parallel pass infrastructure |
| Optimization | Must build pass manager | Can leverage existing MLIR passes | Full LLVM optimization pipeline available |
| Debugging | Must build printer/verifier | Built-in verification, printing, testing | Mature tools: llvm-dis, opt, llc, verifier |
| Code size | Smaller | Larger (MLIR dependency) | Large (LLVM dependency — millions of LoC) |
| Rust interop | Native | FFI to C++ MLIR libraries (or Rust MLIR bindings) | FFI to C++ LLVM (inkwell / llvm-sys crates) |
| Memoization | Full control over hashing and serialization | Full control | Hard — non-deterministic metadata IDs, pointer identity, global numbering make content-addressing difficult |
| Determinism | Full control over ordering | Full control | Difficult — many sources of non-determinism (metadata !0/!1 numbering, type uniquing, unnamed value %0/%1 sequencing depend on emission order) |
| Incremental granularity | Function-level (forgecc design) | Function-level (forgecc design) | Naturally module-oriented — LLVM's primary unit is a Module; function-level reuse is possible but requires additional partitioning/canonicalization infrastructure |
| JIT | Must build loader | Must build loader | LLVM ORC JIT available — mature, handles relocations, exception tables, TLS |
| Codegen quality (-O0) | Our codegen (sufficient) | Our codegen (sufficient) | LLVM's codegen (higher quality but overkill for -O0) |
| Codegen quality (-O2+) | Not available | Not available | Full LLVM optimization pipeline — critical for release builds |
| SIMD intrinsics | Must map each intrinsic manually | Must map each intrinsic | All x86 SIMD intrinsics already defined in LLVM IR |
| Build complexity | Pure Rust, no external deps | C++ FFI, LLVM/MLIR build | C++ FFI, full LLVM build (~10–30 min) |
Why not LLVM IR: LLVM IR conflicts with forgecc's two core differentiators:
- Memoization: LLVM IR assigns sequential numeric IDs to unnamed values (%0, %1, ...) and metadata nodes (!0, !1, ...). These IDs depend on emission order, which can vary. Our content-addressed store requires byte-identical output for identical inputs — LLVM IR makes this hard to guarantee without careful canonicalization.
- Fine-grained incrementality: LLVM is primarily organized around Modules. Function-level memoization is possible, but requires non-trivial module partition/merge infrastructure and strict canonicalization to make content hashes stable. That complexity is workable, but higher than a purpose-built function-level pipeline.
LLVM IR does give us a mature JIT (ORC), all SIMD intrinsics for free, and production-quality codegen for release builds. forgecc captures these benefits through the hybrid path described below, without giving up control over the dev pipeline.
Why not MLIR: MLIR is a C++ library. Adding a C++ dependency to a Rust project adds significant build complexity. MLIR's abstractions (dialects, passes) are powerful but unnecessary for the PoC — ForgeIR is simple enough that a custom implementation is smaller and faster to iterate on.
Hybrid path for release builds: ForgeIR is used for the development pipeline
(where forgecc needs full control over determinism, memoization, and function-level
granularity). For release builds with -O2+ optimizations, ForgeIR lowers to
LLVM IR to leverage the full LLVM optimization pipeline. This is the same pattern
used by Swift (SIL → LLVM IR) and Rust (MIR → LLVM IR). The release-mode path:
ForgeIR → LLVM IR → LLVM opt → LLVM codegen → .obj. This path is deferred —
the PoC focuses on -O0 / JIT where forgecc codegen is sufficient.
ForgeIR uses MLIR-compatible concepts (operations, regions, blocks) so that migration to MLIR is straightforward if the need arises post-PoC.
ForgeIR design:
; Example: int add(int a, int b) { return a + b; }
func @"?add@@YAHHH@Z"(i32 %a, i32 %b) -> i32 {
entry:
%result = add i32 %a, %b
ret i32 %result
}
Minimal pass manager: Only these passes for the prototype:
- Dead code elimination (remove unreachable basic blocks)
- Constant folding (evaluate constant expressions at compile time)
- Stack slot allocation (convert SSA to stack-based before register allocation)
- Register allocation (linear scan — simple and fast, good enough for -O0)
Generates machine code from ForgeIR. Since forgecc targets unoptimized code here, this is essentially a 1:1 lowering.
SIMD intrinsics support: This is critical. Based on codebase analysis, UE5 uses SIMD extensively:
- 383 matches in UnrealMathSSE.h alone (the core math library)
- SSE2 baseline (<emmintrin.h>), optional SSE4.1 (<smmintrin.h>), AVX/AVX2 (<immintrin.h>)
- Key types: __m128, __m128d, __m128i, __m256, __m256d
- Key intrinsics: _mm_add_ps, _mm_mul_ps, _mm_shuffle_ps, _mm_castps_si128, etc.
- UE5 wraps these in VectorRegister4Float, VectorRegister4Double abstractions
Strategy: Map each SIMD intrinsic to a ForgeIR intrinsic operation, then lower directly to the corresponding x86-64 instruction. No optimization of SIMD code — just faithful 1:1 translation.
Inline assembly (__asm): Based on analysis, only ~23 files use __asm in Engine/Source/Runtime, mostly in:
- Audio codecs (Bink Audio, Rad Audio) — third-party, linked as prebuilt where practical
- Platform detection (cpux86.cpp)
- Floating point state manipulation
- LongJump (AutoRTFM)
Strategy: Support basic __asm blocks for x86-64. For the PoC, if specific __asm
blocks are too complex, emit a call to a precompiled stub.
Output: Not .obj files, but database records:
- Section records (.text, .data, .bss, .rdata, .pdata, .xdata)
- Relocation records
- Symbol records (mangled names, visibility)
- DebugInfo records (source maps, type info, frame layouts — kept in internal format for DAP; serialized to CodeView only in release mode)
This is a key innovation. Instead of the traditional compile → link → run cycle, forgecc's daemon acts as an OS loader:
- Process creation: The daemon creates a suspended process (or uses itself as host)
- Section mapping: Maps compiled sections into the target process's address space
- Lazy compilation: Functions that haven't been compiled yet get a stub that traps back to the daemon, triggers compilation, patches the call site, and resumes
- Incremental patching: When source changes, only affected functions are recompiled and patched in the running process
This is conceptually similar to LLVM ORC JIT but integrated with the compiler.
Prior art: the in-process clang branch
explored compiling multiple files in-process within a single clang invocation, and
encountered related challenges (TLS via LdrpHandleTlsData, global state cleanup,
etc.) that inform this design.
┌─────────────────────────────────────────────────┐
│ Target Process │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ main() │ │ Foo::bar │ │ stub: Baz() │ │
│ │ compiled │ │ compiled │ │ → trap to │ │
│ │ │──│ │──│ daemon │ │
│ └──────────┘ └──────────┘ └───────┬──────┘ │
│ │ │
└──────────────────────────────────────┼──────────┘
│
┌──────────────────▼──────────┐
│ forgecc Daemon │
│ │
│ 1. Compile Baz() │
│ 2. Map into target process │
│ 3. Patch stub → real code │
│ 4. Resume execution │
└─────────────────────────────┘
Relationship to UE5's Live Coding: UE5 already has Live Coding (via Live++) which patches running code. However, Live Coding is limited:
- Only function body changes in .cpp files
- No header changes, no new members, no constructor/destructor changes
- No UCLASS/USTRUCT/UFUNCTION changes
forgecc's approach is more fundamental — the entire application is JIT-compiled, so most body-level changes and many declaration-level changes can be patched by recompiling and re-linking affected code. ABI-shape changes (class layout, vtable shape, base-class changes) require either object migration or a process restart. For the PoC, restart-on-ABI-change is the default safety policy; live object migration is future work (see §17.5).
Hot-patch contract (PoC, normative):
- Patchable now: function body edits that preserve ABI shape (same signature, calling convention, parameter/return ABI, and referenced type layouts).
- Guarded / restart-only: class layout changes, vtable shape changes, base-class changes, calling-convention changes, and any edit that requires migrating live objects.
- Atomicity: patch application is all-or-nothing at a safepoint. On failure, keep old code active and return an explicit diagnostic/restart-required status.
Runtime preconditions before resume:
- All external symbol references for the patch set are resolved.
- Unwind metadata (.pdata/.xdata) is registered for newly mapped code.
- TLS registration for newly mapped code succeeded (or target does not require TLS).
- No active stack frame executes code that is being replaced in-place.
- Debug metadata for patched functions is published as one versioned update.
Failure policy:
- If any precondition above fails, the daemon must not partially apply the patch.
- For ABI-epoch mismatches with no proven adapter/migration path, return restart_required and keep execution on the previous code epoch.
Debugging: DAP-first approach (no PDB generation)
Since the daemon always reconstructs the executable in-memory and acts as the loader, it has perfect knowledge of everything a debugger needs: source locations, types, variable locations, stack frame layouts. Serializing this into PDB format and then having a separate debugger deserialize it is redundant work.
Instead, the daemon itself IS the debugger. It implements the Debug Adapter Protocol (DAP), allowing VS Code (or any DAP client, including Visual Studio via extensions) to debug the JIT-compiled application directly.
What the daemon provides as a DAP server:
- Breakpoints: Patch int3 instructions into JIT'd code at the requested source line
- Single-stepping: Use hardware debug registers (DR0-DR3) or int3 at next source line
- Variable inspection: Read target process memory via ReadProcessMemory, interpret using the daemon's own type information from Sema (no CodeView serialization needed)
- Stack unwinding: The daemon knows every frame layout because it generated the code
- Source mapping: Direct — the daemon has the source map from compilation
- Watch expressions: Evaluate using the same consteval JIT infrastructure (§4.6a)
- Conditional breakpoints: Compile the condition, JIT it, evaluate at breakpoint
What this saves vs. always-on PDB generation in the dev loop:
- PDB/CodeView writing is deferred from the primary dev path (still required in release mode)
- No need to keep on-disk PDB synchronized with each live patch in day-to-day iteration
- Debug info can stay in the compiler's internal form during development, avoiding repeated serialization
What DAP costs:
- DAP server implementation (~1,500 lines) — well-documented JSON-over-stdio protocol
- Process debugging primitives (~1,000 lines) — WaitForDebugEvent, ReadProcessMemory, WriteProcessMemory, SetThreadContext, hardware breakpoints
- Stack unwinding logic (~500 lines) — simpler than PDB because forgecc controls frame layout
Net effect (PoC): lower dev-loop complexity, but not elimination of debug-format complexity overall. PDB/CodeView still matters for release builds and ecosystem interop.
The application requires the daemon to run: This is a fundamental architectural property. The target application can only start if the daemon is up, because the daemon creates the process, maps code into it, and manages its execution. The daemon reconstructs the executable from the content-addressed store on startup (fast due to caching). This is analogous to how a JVM or .NET CLR must be running for managed applications to execute.
Release mode: For shipping builds (not day-to-day development), the daemon can still generate standalone PE/DLL + PDB files via the linker (§4.9). But for the PoC, the DAP-based debugging path is the primary focus.
The linker reads from the same content-addressed store the compiler writes to. It
never reads .obj files from disk. When the build system sends a link command with
.obj paths (e.g., POST /v1/link), the daemon resolves each path through the
build index (see §4.11) to find the corresponding content key, then retrieves
sections, symbols, and relocations from the store. In JIT mode, the daemon responds
to the build system immediately after linking completes; JIT injection into the target
process happens asynchronously and does not block the build system.
Two modes:
-
JIT mode (primary, PoC focus): No traditional linking step needed — sections are mapped directly into the target process by the runtime loader. Symbol resolution happens in-memory. The daemon acts as both linker and loader.
-
Release mode (secondary, for shipping builds): Generate PE executables and DLLs for distribution.
- Resolve symbol references between translation units
- COMDAT deduplication (already partially done by content-addressed store)
- Apply relocations
- Generate PE/DLL
- Generate PDB (only needed for release mode — DAP handles dev debugging)
- Handle #pragma comment(lib, ...) and default libraries
- Read import libraries (.lib) for Windows SDK / third-party DLLs
- For interop with external linkers (link.exe, lld), real .obj files can be materialized from the store on demand (future work, see §4.13)
ODR merging: When multiple TUs provide definitions of the same inline function or template instantiation (as the One Definition Rule requires), the linker must pick exactly one. In a content-addressed system this is naturally deterministic: identical definitions produce the same content hash and collapse to a single store entry. For ODR-violating definitions (different content hashes for the same mangled name), the linker emits a hard error and aborts producing a loadable image. The build exposes an explicit opt-in escape hatch for investigation, while the default policy is fail-fast.
UE5 DLL structure: The Game editor target builds ~190 modules (see §15.5),
each as a separate DLL. The *_API macro pattern (CORE_API, ENGINE_API, etc.)
controls __declspec(dllexport/dllimport). In JIT mode, forgecc ignores DLL boundaries
entirely and treats everything as a single address space.
Incremental linking (release mode): When only a few functions change:
- Detect which sections changed (via content hash comparison)
- If changed sections fit in their old locations, patch in place
- Otherwise, re-layout only affected sections
- Update PDB incrementally
Note: PDB generation and CodeView serialization are only needed for release mode. During development, the DAP-based debugger (§4.8) provides debugging without PDB. This means PDB support can be deferred to a later phase.
No MSVC shim. Instead, forgecc integrates with UBT via a direct command-passing
protocol. UBT already structures compilation as a set of VCCompileAction objects
(see VCCompileAction.cs) containing:
- Source file path
- Object file output path
- Include paths (user + system)
- Preprocessor definitions
- Force-include files
- Architecture
- PCH settings
- Compiler flags
Protocol design: UBT sends these structured actions to the forgecc daemon via
HTTP (see §4.13 for the unified protocol design). The same HTTP endpoint serves
both local builds (http://localhost:9473) and remote builds (http://peer:9473).
The request/response uses MessagePack by default (compact, fast) with JSON
available for debugging:
POST /v1/compile
Content-Type: application/msgpack
Accept: application/msgpack
// Decoded request body (shown as JSON for readability):
{
"source_file": "Game/Source/Game/Foo.cpp",
"output": "build/Game/Foo.obj",
"include_paths": ["Engine/Source/Runtime/Core/Public", ...],
"definitions": ["UE_BUILD_DEVELOPMENT=1", "WITH_EDITOR=1", ...],
"force_includes": ["SharedPCH.Engine.h"],
"flags": { "cpp_standard": "c++20", "exceptions": true, "rtti": true }
}
// Response:
{
"status": "success",
"content_key": "blake3:abc123...",
"diagnostics": []
}
Compile output — shallow .obj files: The build system specifies an output path
where it expects an .obj file (e.g., build/Game/Foo.obj). When compilation
completes, the daemon: (1) stores the result in the content-addressed store,
(2) updates the build index (see §4.11) mapping output → content_key, and
(3) kicks off an async background write of a shallow .obj to disk. The HTTP
response is sent immediately after steps 1–2; the file write is non-blocking.
The shallow .obj is a valid COFF file with a proper header, but containing only a
small .forgeid section with metadata (source file path, flags hash, content key).
No .text, .data, or other real sections. It is a few hundred bytes and exists
solely to satisfy the build system's timestamp-based dependency tracking. The daemon's
own pipeline never reads these files back — the content_key in the HTTP response
(and the build index) is the canonical reference to the compilation result.
This approach is inspired by SN-Systems' Program Repository (Sony, 2018–2022), which proved that storing compilation results in a database while writing minimal stamp files for build system compatibility is practical at scale.
UBT integration uses clang-cl mode. The Game project already uses clang-cl. Clang-cl mode uses fewer MSVC-specific flags, has cleaner semantics, and maps naturally to forgecc's internal representation.
UBT modifications needed: A new ForgeccToolChain.cs that either (a) points
CompilerPath at forge-cc.exe (quick start — UBT spawns it like clang-cl), or
(b) sends ForgeccCompileAction objects directly to the daemon via HTTP (optimized
— bypasses process spawning). See §7.1 for details.
The daemon is a long-running process that owns the local store and compilation pipeline.
forge-daemon
│
├── HTTP Server (unified protocol — serves drivers, ninja, UBT, peers, DAP)
│ ├── /v1/compile — compile actions (forge-cc / clang-cl equivalent)
│ ├── /v1/link — link actions (forge-link / lld-link equivalent)
│ ├── /v1/lib — static library actions (forge-lib / llvm-lib equivalent)
│ ├── /v1/rc — resource compile actions (forge-rc native implementation)
│ ├── /v1/ml — assembly actions (forge-ml / llvm-ml equivalent)
│ ├── /v1/artifact/ — content-addressed artifact get/put (local + P2P)
│ └── /v1/peers — peer discovery (gossip protocol)
├── Build Index (output path → content key, same content-store backend)
├── P2P Network (seed-based discovery, DHT)
├── Build Scheduler (parallel task graph)
├── File Watcher (pre-compilation on save)
├── Runtime Loader (JIT execution)
├── Compilation Thread Pool
└── Content-Addressed Store (multi-tier: memory → SSD/redb → network/DHT)
Build model:
- UBT sends compile actions to the daemon via HTTP (POST /v1/compile)
- Daemon checks each source file against the store:
- Content hash unchanged? → skip entirely
- Content hash changed? → determine what needs recompilation (fine-grained)
- Daemon compiles changed units (parallel, using thread pool)
- In JIT mode: patches running process
- In release mode: links incrementally
- Daemon returns results to UBT
File watching: The daemon watches source directories. When a file is saved, it begins pre-compilation immediately, before UBT even sends the compile action. By the time the user triggers a build, the work is already done in the common case.
UBT bypass (future optimization): The daemon caches UBT command lines and uses the file watcher to detect changes, making UBT invocations unnecessary for incremental builds. UBT remains necessary for the initial "Makefile" generation (module discovery, dependency graph), but subsequent rebuilds are driven entirely by the daemon's file watcher + cached command lines.
Parallelism model: The daemon uses a work-stealing thread pool (rayon). Parallelism occurs at multiple levels:
- Across TUs: Different translation units compile in parallel (like today)
- Within a TU: Top-level declarations, template instantiations, and function bodies can be processed in parallel
- Across pipeline stages: Pipelining — TU1's codegen can run while TU2 is still parsing
- Across machines: Via the distributed store, machines share work
Build index: The daemon maintains an in-memory build index that maps output
paths (e.g., build/Game/Foo.obj) to content keys in the store. This is the bridge
between the build system's file-oriented world and forgecc's content-addressed
world. When a compilation completes, the daemon updates the build index and kicks off
an async background write of the shallow .obj to disk (non-blocking — the HTTP
response is sent immediately). When the daemon later receives a link command with
.obj paths on the command line, it resolves each path through the build index to
find the corresponding content key — it never reads the .obj files from disk.
The build index is persisted to the local redb store as a dedicated table. Since redb
is ACID, writes are safe even if the daemon crashes mid-update. On restart, the daemon
loads the build index from redb — recovery is instant, no recompilation needed. This
is essentially the daemon's equivalent of a .ninja_log file, but stored in the same
database as all other compilation artifacts.
The same multi-tier store policy applies to build-index pages and artifact pages: hot entries remain in memory, warm/cold entries live on SSD, and missing content is resolved from peers by hash.
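The shape of the build index and its link-time resolution step can be modeled with a plain map; the real index is a redb table, and `ContentKey`/`BuildIndex` here are illustrative stand-ins:

```rust
use std::collections::HashMap;

// Illustrative model of the build index: output path → content key.
// The real index lives in a redb table; names here are stand-ins.
type ContentKey = [u8; 32]; // BLAKE3 hash

struct BuildIndex {
    by_path: HashMap<String, ContentKey>, // workspace-relative output paths
}

impl BuildIndex {
    // Called when a compilation completes (before the async .obj write).
    fn record(&mut self, obj_path: &str, key: ContentKey) {
        self.by_path.insert(obj_path.to_string(), key);
    }
    // Called when a link command arrives: resolve each .obj path to its
    // content key without ever reading the .obj file from disk.
    fn resolve(&self, obj_path: &str) -> Option<&ContentKey> {
        self.by_path.get(obj_path)
    }
}

fn main() {
    let mut idx = BuildIndex { by_path: HashMap::new() };
    idx.record("build/Game/Foo.obj", [7; 32]);
    assert_eq!(idx.resolve("build/Game/Foo.obj"), Some(&[7u8; 32]));
    assert_eq!(idx.resolve("build/Game/Bar.obj"), None);
    println!("ok");
}
```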
Workspace binding: Each daemon instance is bound to a workspace root — a specific directory tree (git repo, Perforce workspace, or arbitrary source tree):
# forgecc.toml
[workspace]
root = "C:/src/Game" # or auto-detected from .git / P4CONFIG
All paths in the build index are stored relative to the workspace root (consistent with the determinism requirements in §16). This means the build index is portable if the workspace moves, and different developers with different checkout paths produce identical content keys. One daemon per workspace — if you have two projects, you run two daemons.
Multiple daemon instances form a peer-to-peer network with no central authority.
Discovery: Seed-based. Each daemon is configured with a list of seed peers:
# forgecc.toml
[network]
seeds = ["192.168.1.10:9473", "192.168.1.20:9473"]
listen = "0.0.0.0:9473"
Port 9473 is unassigned by IANA and spells "XFCC" (heX ForgeCC) on a phone keypad. Any unassigned port works; this is configurable.
Protocol: The P2P protocol uses the same unified HTTP + MessagePack/JSON transport described in §4.13. Every daemon exposes the same HTTP API on the same port, serving both build system clients and peer daemons. The P2P-specific endpoints are:
- GET /v1/artifact/{blake3_hash} — fetch an artifact by content hash
- PUT /v1/artifact/{blake3_hash} — announce/store an artifact
- GET /v1/peers — list known peers (for gossip protocol bootstrap)
- POST /v1/peers — register as a peer
- On startup, daemon connects to seeds and exchanges peer lists (gossip protocol)
- Each daemon maintains a distributed hash table (DHT) mapping content keys to peer addresses
- When daemon A compiles something, it announces the key to the DHT
- When daemon B needs an artifact:
- Check local store → found? Use it.
- Check DHT → found on peer C? Request data from C via GET /v1/artifact/{hash}.
- Not found anywhere? Compile locally, announce to DHT.
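The lookup order can be sketched as a three-step fallback; the store, DHT, and peer fetch are modeled with plain maps here, and all names are illustrative:

```rust
use std::collections::HashMap;

// Sketch of the artifact lookup order: local store → DHT/peer → compile
// locally and announce. Types are illustrative stand-ins.
type Hash = u64;

struct Daemon {
    local: HashMap<Hash, Vec<u8>>, // local content-addressed store
    dht: HashMap<Hash, String>,    // content key → peer address
}

impl Daemon {
    fn fetch(&mut self, key: Hash, compile: impl FnOnce() -> Vec<u8>) -> &'static str {
        if self.local.contains_key(&key) {
            return "local"; // found locally, use it
        }
        if self.dht.contains_key(&key) {
            // real code: GET /v1/artifact/{hash} from the peer, then
            // verify integrity by re-hashing the bytes (BLAKE3)
            return "peer";
        }
        let bytes = compile(); // miss everywhere: compile locally
        self.local.insert(key, bytes); // then announce the key to the DHT
        "compiled"
    }
}

fn main() {
    let mut d = Daemon { local: HashMap::new(), dht: HashMap::new() };
    d.dht.insert(2, "192.168.1.20:9473".to_string());
    assert_eq!(d.fetch(1, || vec![0xAA]), "compiled"); // miss everywhere
    assert_eq!(d.fetch(1, || vec![0xAA]), "local");    // now cached
    assert_eq!(d.fetch(2, || vec![0xBB]), "peer");     // served by a peer
    println!("ok");
}
```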
Data transfer: HTTP/2 over a single persistent TCP connection per peer (see §4.13
for details on HTTP version choice). Metadata (DHT announcements, peer lists) uses
MessagePack. Artifact bodies are transferred as raw binary (application/octet-stream)
with MessagePack metadata headers. Content-addressed, so integrity is verified by hash.
Multiple artifact fetches from the same peer run as concurrent HTTP/2 streams on one
connection.
Consistency: No strong consistency needed. If two machines compile the same thing independently, they produce the same output (deterministic compilation). The DHT is eventually consistent.
Security / trust model:
- PoC default: trusted workstation/LAN mode. Integrity is enforced by content hashes; peer identity/provenance policy is local and simplified.
- Production target: zero-trust peers. Every daemon has a stable cryptographic identity; all peer traffic is mutually authenticated and encrypted.
Artifact provenance:
Each artifact is stored with a signed provenance envelope containing:
- artifact hash (blake3)
- compiler identity + version + build config digest
- target triple + ABI-relevant flags digest
- source/input root digest (workspace snapshot or VCS commit + command digest)
- dependency artifact hashes
- builder signature (or attestation token)
Local daemon acceptance policy:
- Always verify content hash.
- In zero-trust mode, require valid signature/attestation from trusted identities.
- High-assurance policy requires N-of-M reproducible attestations (independent builders produce the same hash) for designated artifact classes (for example, release-critical paths).
This avoids blockchain-style global consensus for every artifact (too expensive for the inner dev loop) while still enabling strong supply-chain assurances when needed.
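The acceptance policy above can be sketched as a small predicate; the hash check is unconditional, and the N-of-M rule counts distinct attesting builders. All names and thresholds here are illustrative:

```rust
use std::collections::BTreeSet;

// Sketch of the local acceptance policy: always verify the content hash;
// in high-assurance mode, require N distinct trusted builders to have
// reproduced the same hash. Names and thresholds are illustrative.
fn accept(computed_hash: u64, claimed_hash: u64, attesters: &[&str], required_n: usize) -> bool {
    if computed_hash != claimed_hash {
        return false; // content-hash verification is unconditional
    }
    // N-of-M reproducible attestations: count distinct builders only.
    let distinct: BTreeSet<&&str> = attesters.iter().collect();
    distinct.len() >= required_n
}

fn main() {
    // Hash mismatch is rejected regardless of attestations.
    assert!(!accept(1, 2, &["ci-1", "ci-2"], 1));
    // Two distinct builders satisfy a 2-of-M policy; duplicates do not.
    assert!(accept(1, 1, &["ci-1", "ci-2"], 2));
    assert!(!accept(1, 1, &["ci-1", "ci-1"], 2));
    println!("ok");
}
```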
forgecc uses a single transport for all communication: build system to daemon, daemon to daemon, and (future) console client to daemon. This is a deliberate architectural choice — one code path, one port, one protocol.
Transport: HTTP/2 over TCP
All communication uses HTTP/2 over TCP. This includes local communication
(http://localhost:9473) and remote communication (http://peer:9473).
Transport security modes:
- PoC / trusted LAN: plaintext HTTP/2 is acceptable for iteration speed.
- Zero-trust / production: HTTPS (TLS) with mutual TLS (mTLS) between daemons and authenticated client credentials for build-system callers.
Why HTTP/2 (not HTTP/1.1 or HTTP/3):
HTTP/2's key feature is multiplexing — many concurrent request/response exchanges over a single TCP connection, with no head-of-line blocking between streams. This maps perfectly to the forgecc workload:
- Build system → daemon: UBT/ninja sends hundreds of compile actions in parallel. With HTTP/1.1, this would require hundreds of TCP connections (or serialize requests, killing parallelism). With HTTP/2, one connection handles all of them as concurrent streams. The daemon receives them in parallel and responds as each completes — out of order, no blocking.
- Daemon → peer daemons: When fetching artifacts from a peer, the daemon requests dozens concurrently. HTTP/2 multiplexes all of them on a single connection per peer — no connection pool management, no per-request TCP handshake overhead.
- Connection lifecycle: HTTP/2 connections are long-lived. The build system opens one connection to the daemon at startup and reuses it for the entire build session. Peer connections are established on first contact and kept alive for the daemon's lifetime.
Why not HTTP/1.1: HTTP/1.1 allows only one in-flight request per connection (pipelining exists but is effectively unused — no browser or major client implements it reliably). To get parallelism, you'd need a pool of many TCP connections. This works but adds complexity (pool sizing, connection management) and wastes resources (each TCP connection has kernel buffers, file descriptors, etc.).
Why not HTTP/3 (QUIC): HTTP/3 runs over QUIC (UDP) and eliminates TCP-level
head-of-line blocking on lossy networks. This matters for the internet and mobile, but
on a LAN or localhost, TCP packet loss is effectively zero — QUIC's advantage
disappears. Additionally, the Rust server-side ecosystem for HTTP/3 is less mature
(axum doesn't support it natively yet), and QUIC adds complexity (TLS 1.3 is
mandatory, UDP may be blocked by corporate firewalls). HTTP/3 is out of scope; if
WAN-distributed builds become a use case post-PoC, it is a straightforward protocol
upgrade.
Why TCP even for local IPC: Windows implements a
TCP Loopback Fast Path
(enabled by default since Windows 10 1607) that bypasses the full network stack for
localhost connections. Loopback TCP does a direct memory copy between send/receive
buffers — no Winsock Kernel, no AFD driver, no TCP/IP driver. In practice, localhost
TCP throughput is within ~5–10% of shared memory, while giving us a single code
path for local and remote communication. No named pipes, no Unix domain sockets, no
separate IPC mechanism to maintain.
Connection model summary:
UBT / ninja ──── 1 HTTP/2 connection ────▶ forge-daemon (:9473)
(or forge-cc (hundreds of concurrent │
thin drivers) compile/link/lib streams) │
├── 1 HTTP/2 connection ──▶ peer A (:9473)
│ (concurrent artifact
│ get/put streams)
│
└── 1 HTTP/2 connection ──▶ peer B (:9473)
Wire format: MessagePack (default) + JSON (debug)
Every endpoint supports two serialization formats, selected via standard HTTP content negotiation:
| Header | MessagePack | JSON |
|---|---|---|
| Content-Type (request body) | application/msgpack | application/json |
| Accept (desired response) | application/msgpack | application/json |
MessagePack (msgpack.org) is the default. It is a binary format that is schema-less like JSON but:
- ~30–50% smaller on the wire
- ~5–10x faster to serialize/deserialize
- Supports binary data natively (no base64 encoding)
- Trivially interoperable with serde in Rust (via rmp-serde)
JSON is available for debugging and tooling. Any endpoint can be called with
curl by setting Accept: application/json:
# Debug: inspect a compile request as JSON
curl -H "Accept: application/json" http://localhost:9473/v1/status
# Debug: submit a compile action as JSON
curl -X POST -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_file": "foo.cpp", ...}' http://localhost:9473/v1/compile
Format selection rules:
- If a Content-Type header is present, the daemon parses the request body in that format. If absent, the daemon assumes MessagePack.
- If an Accept header is present, the daemon responds in the requested format. If absent, the daemon responds in the same format as the request.
- Artifact bodies (GET /v1/artifact/{hash}) are always raw binary (application/octet-stream), regardless of content negotiation — only the metadata envelope (headers, error responses) uses MessagePack/JSON.
Why not gRPC / Cap'n Proto / FlatBuffers: These require schema compilation (.proto
files, code generation) which adds build complexity. MessagePack + serde gives us
schema-from-code (the Rust structs ARE the schema), zero code generation, and the
JSON fallback for free. gRPC's framing also makes curl debugging harder — this
approach lets you curl any endpoint with Accept: application/json and get readable
output.
Endpoint summary (all served on the same port):
| Endpoint | Method | Purpose | Used by |
|---|---|---|---|
| /v1/compile | POST | Submit compile action(s) | forge-cc.exe, UBT, ninja |
| /v1/link | POST | Submit link action (.obj paths resolved via build index) | forge-link.exe, UBT, ninja |
| /v1/lib | POST | Submit static library creation action | forge-lib.exe, ninja |
| /v1/rc | POST | Submit resource compilation action | forge-rc.exe, ninja |
| /v1/ml | POST | Submit assembly action | forge-ml.exe, ninja |
| /v1/status | GET | Daemon status, build progress | Build system, CLI tools |
| /v1/artifact/{hash} | GET | Fetch artifact by content hash | Peers, forge-verify |
| /v1/artifact/{hash} | PUT | Store/announce artifact | Peers |
| /v1/peers | GET | List known peers | Peers (gossip) |
| /v1/peers | POST | Register as peer | Peers (gossip) |
| /v1/debug/* | various | DAP-over-HTTP tunnel (see §4.8) | Non-VS Code debugger clients |
Future: real .obj materialization (deferred, not needed for PoC): The
/v1/artifact/{hash}?format=coff endpoint could reconstruct a full COFF .obj file
from the store. This would be useful for forge-verify (diffing against reference
compilers), external linker interop (link.exe, lld), and inspection with
dumpbin/llvm-objdump.
Note on DAP transport: VS Code's standard DAP integration uses JSON-over-stdio
(VS Code launches the debug adapter as a child process). The daemon supports this as
its primary DAP transport — a thin forge-dap wrapper process connects to the
daemon's HTTP API internally and bridges to stdio for VS Code. The /v1/debug/* HTTP
endpoints provide an alternative for clients that prefer HTTP (e.g., custom IDE
integrations, web-based debuggers).
forgecc/
├── Cargo.toml # Workspace root
├── crates/
│ ├── forge-daemon/ # The main daemon binary
│ │ ├── src/
│ │ │ ├── main.rs # Entry point, daemon lifecycle
│ │ │ ├── scheduler.rs # Build scheduler, parallel task graph
│ │ │ ├── watcher.rs # File system watcher
│ │ │ ├── loader.rs # Runtime loader / JIT execution
│ │ │ └── determinism.rs # --verify-determinism mode
│ │ └── Cargo.toml
│ │
│ ├── forge-net/ # HTTP server + P2P networking
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── server.rs # axum HTTP server (unified: UBT + peers + debug)
│ │ │ ├── protocol.rs # MessagePack/JSON content negotiation
│ │ │ ├── peer.rs # Peer connection management
│ │ │ ├── dht.rs # Distributed hash table
│ │ │ ├── gossip.rs # Peer discovery gossip protocol
│ │ │ └── transfer.rs # Content transfer (get/put artifacts)
│ │ └── Cargo.toml
│ │
│ ├── forge-store/ # Content-addressed distributed store
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── local.rs # Local redb store
│ │ │ ├── distributed.rs # Distributed layer (local + DHT + remote fetch)
│ │ │ ├── keys.rs # Key computation (BLAKE3 hashing)
│ │ │ └── records.rs # Record type definitions & serialization
│ │ └── Cargo.toml
│ │
│ ├── forge-lex/ # Lexer (C and C++)
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── token.rs # Token types
│ │ │ ├── lexer.rs # Tokenizer
│ │ │ └── literals.rs # Number/string/char literal parsing
│ │ └── Cargo.toml
│ │
│ ├── forge-pp/ # Preprocessor
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── directives.rs # #include, #define, #if, etc.
│ │ │ ├── macros.rs # Macro expansion engine
│ │ │ ├── include.rs # Include resolution & automatic caching
│ │ │ └── builtins.rs # __FILE__, __LINE__, predefined macros
│ │ └── Cargo.toml
│ │
│ ├── forge-parse/ # Parser (C and C++)
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── ast.rs # AST node definitions
│ │ │ ├── parser.rs # Recursive descent parser (orchestration)
│ │ │ ├── expr.rs # Expression parsing
│ │ │ ├── stmt.rs # Statement parsing
│ │ │ ├── decl.rs # Declaration parsing
│ │ │ ├── types.rs # Type parsing
│ │ │ └── templates.rs # Template parsing
│ │ └── Cargo.toml
│ │
│ ├── forge-sema/ # Semantic analysis (MODULAR — see §4.5)
│ │ ├── src/
│ │ │ ├── lib.rs # Public API (~200 lines)
│ │ │ ├── context.rs # Shared semantic context
│ │ │ ├── lookup/ # Name lookup (6 files, ~2500 lines total)
│ │ │ ├── types/ # Type system (6 files, ~3500 lines total)
│ │ │ ├── overload/ # Overload resolution (5 files, ~3000 lines)
│ │ │ ├── templates/ # Template instantiation (5 files, ~4000 lines)
│ │ │ ├── concepts/ # C++20 concepts (2 files, ~1000 lines)
│ │ │ ├── consteval/ # Constant evaluator (2 files, ~2000 lines)
│ │ │ ├── decl/ # Declaration processing (7 files, ~4000 lines)
│ │ │ ├── expr/ # Expression checking (7 files, ~5000 lines)
│ │ │ ├── stmt/ # Statement checking (2 files, ~500 lines)
│ │ │ ├── msvc/ # MSVC-specific (6 files, ~4000 lines)
│ │ │ ├── bindings.rs # Structured binding decomposition (~500 lines)
│ │ │ ├── comparisons.rs # Defaulted comparison synthesis (~500 lines)
│ │ │ ├── rtti.rs # typeid, dynamic_cast support (~300 lines)
│ │ │ └── access.rs # Access control (~500 lines)
│ │ └── Cargo.toml
│ │ # Total: ~55 files, ~34,500 lines, no file > 1000 lines
│ │
│ ├── forge-ir/ # Intermediate representation
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── types.rs # IR type system
│ │ │ ├── instructions.rs # IR instruction set (including SIMD intrinsics)
│ │ │ ├── builder.rs # IR construction helpers
│ │ │ ├── printer.rs # Human-readable IR dump
│ │ │ └── passes/
│ │ │ ├── mod.rs # Pass manager
│ │ │ ├── dce.rs # Dead code elimination
│ │ │ └── constfold.rs # Constant folding
│ │ └── Cargo.toml
│ │
│ ├── forge-codegen/ # x86-64 code generation
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── lower.rs # IR → x86-64 instruction selection
│ │ │ ├── regalloc.rs # Register allocation (linear scan)
│ │ │ ├── emit.rs # Machine code emission
│ │ │ ├── encoding.rs # x86-64 instruction encoding
│ │ │ ├── simd.rs # SSE/AVX intrinsic lowering
│ │ │ └── relocations.rs # Relocation generation
│ │ └── Cargo.toml
│ │
│ ├── forge-debug/ # DAP debugger server
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── dap_server.rs # DAP protocol implementation (JSON-over-stdio)
│ │ │ ├── breakpoints.rs # Breakpoint management (int3 patching)
│ │ │ ├── stepping.rs # Single-step, step-over, step-out
│ │ │ ├── variables.rs # Variable inspection (ReadProcessMemory + type info)
│ │ │ ├── stack.rs # Stack unwinding (using codegen frame layout info)
│ │ │ └── process.rs # WaitForDebugEvent, debug primitives
│ │ └── Cargo.toml
│ │
│ ├── forge-link/ # Linker (release mode only — not needed for dev builds)
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── resolver.rs # Symbol resolution
│ │ │ ├── layout.rs # Section layout
│ │ │ ├── relocate.rs # Relocation application
│ │ │ ├── pe.rs # PE/COFF executable output
│ │ │ ├── dll.rs # DLL generation (for UE5 module structure)
│ │ │ ├── pdb.rs # PDB generation (release mode only)
│ │ │ ├── codeview.rs # CodeView debug info (release mode only)
│ │ │ └── imports.rs # Import library reading (.lib)
│ │ └── Cargo.toml
│ │
│ ├── forge-common/ # Shared utilities
│ │ ├── src/
│ │ │ ├── lib.rs
│ │ │ ├── diagnostics.rs # Error/warning reporting
│ │ │ ├── source_map.rs # Source location tracking
│ │ │ ├── intern.rs # String interning (lock-free)
│ │ │ └── hash.rs # BLAKE3 hashing utilities
│ │ └── Cargo.toml
│ │
│ ├── forge-driver/ # Thin driver binaries (see §7.3)
│ │ ├── src/
│ │ │ ├── main.rs # Entry point: argv[0] detection, dispatch
│ │ │ ├── detect.rs # Tool name detection (forge-cc, forge-link, etc.)
│ │ │ ├── args_cc.rs # forge-cc flag parsing (clang-cl-compatible) → CompileAction
│ │ │ ├── args_link.rs # forge-link flag parsing (lld-link-compatible) → LinkAction
│ │ │ ├── args_lib.rs # forge-lib flag parsing (llvm-lib-compatible) → LibAction
│ │ │ ├── args_rc.rs # forge-rc flag parsing (MSVC/LLVM-compatible) → RcAction
│ │ │ ├── args_ml.rs # forge-ml flag parsing (llvm-ml-compatible) → MlAction
│ │ │ ├── client.rs # HTTP/2 client (reqwest) → daemon RPC
│ │ │ └── autostart.rs # Daemon auto-start (spawn, wait for /v1/status)
│ │ └── Cargo.toml
│ │ # Build produces forge-daemon.exe + symlinks: forge-cc, forge-link,
│ │ # forge-lib, forge-rc, forge-ml
│ │
│ └── forge-verify/ # Verification CLI tool (see §8b)
│ ├── src/
│ │ ├── main.rs # CLI entry point, argument parsing
│ │ ├── daemon_client.rs # HTTP client to connect to forge-daemon
│ │ ├── reference.rs # Invoke clang-cl / MSVC as reference compiler
│ │ ├── diff_sections.rs # Compare COFF sections, symbols, relocations
│ │ ├── diff_debug.rs # Invoke llvm-debuginfo-analyzer --compare
│ │ └── diff_exec.rs # Link, run, compare stdout/stderr/exit code
│ └── Cargo.toml
│
└── tests/
├── level0_handcrafted/ # Hand-crafted per-feature tests (see §8a)
├── level1_stb/ # stb library compilation tests
├── level2_lmdb/ # LMDB compilation tests
├── level3_json/ # nlohmann/json compilation tests
├── level4_imgui/ # Dear ImGui compilation tests
├── level5_gtest/ # Google Test compilation tests
├── level6_lit/ # Selected LLVM lit tests
└── integration/ # End-to-end build tests
Source files ──▶ Lexer ──▶ Preprocessor ──▶ Parser ──▶ Sema ──▶ IR ──▶ CodeGen
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
Store:Token Store:Preproc Store:AST Store: Store: Store:Section
Stream Result Fragment Type IR + Store:Debug
Checked
Every stage writes its output to the store before passing it to the next stage. All stages run in parallel across TUs, and within TUs where possible.
Source file changed
│
▼
Hash source ──▶ Store lookup: TokenStream? ──YES──▶ skip lexing
│ │
NO ▼
│ Store lookup: PreprocResult
▼ with same deps? ──YES──▶ skip
Lex file ──▶ Preprocess ──▶ ... each stage checks store first
User: "Run"
│
▼
Daemon creates target process (suspended)
│
▼
Map already-compiled sections into process
│
▼
Install stubs for uncompiled functions
│
▼
Resume process
│
▼
Function called ──▶ stub traps ──▶ daemon compiles ──▶ patch & resume
Machine A Machine B
┌──────────┐ ┌──────────┐
│ Daemon A │◄── DHT sync (TCP) ─────▶│ Daemon B │
│ │ │ │
│ Store: │ │ Store: │
│ has X123 │◄── get(X123) ────────────│ needs │
│ │─── bytes ──────────────▶│ X123 │
└──────────┘ └──────────┘
A key design goal for forgecc's AST is immutability after construction. In LLVM/Clang, AST nodes are mutable — Sema modifies them in place during type-checking, template instantiation, and implicit conversion insertion. This makes the AST difficult to share across threads and impossible to content-address reliably.
forgecc's approach: once an AST node is constructed, it is immutable. Sema produces new, annotated AST nodes rather than modifying existing ones. This enables:
- Safe sharing across threads: Immutable AST nodes can be read concurrently without locks, enabling the parallel Sema design (§4.5)
- Content-addressable AST: Since nodes don't change, their hash is stable and can be used as a cache key in the content-addressed store
- Efficient caching: Unchanged subtrees from a previous compilation can be reused directly — no need to deep-copy and re-verify
- Simpler debugging: Immutable data structures are easier to reason about and reproduce bugs in
Rust's ownership model naturally encourages this pattern. Arc<AstNode> provides
shared immutable access, and the borrow checker prevents accidental mutation. Where
Sema needs to "modify" a node (e.g., adding type annotations), it creates a new node
that references the original, forming a layered structure.
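A minimal sketch of this layered pattern, with illustrative node shapes (the real AST types are richer):

```rust
use std::sync::Arc;

// Sketch of the layered, immutable AST: Sema never mutates a parsed node;
// it builds a new annotated node that shares the original via Arc.
#[derive(Debug)]
struct ParsedExpr {
    text: &'static str, // e.g. the spelling "a + b"
}

#[derive(Debug)]
struct TypedExpr {
    parsed: Arc<ParsedExpr>, // shared, never copied or modified
    ty: &'static str,        // annotation layered on by Sema
}

fn main() {
    let parsed = Arc::new(ParsedExpr { text: "a + b" });
    // "Modifying" means wrapping: the original node stays intact, so it can
    // still be content-addressed or read concurrently from other threads.
    let typed = TypedExpr { parsed: Arc::clone(&parsed), ty: "int" };
    assert_eq!(Arc::strong_count(&parsed), 2); // shared, not duplicated
    assert_eq!(typed.parsed.text, "a + b");
    assert_eq!(typed.ty, "int");
    println!("ok");
}
```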
A central insight behind forgecc is that full invalidation of the build is rare. On a shared branch, most commits touch a small number of files. Clean rebuilds or cache wipes are the exception. So the daily reality is: each developer and each build machine only needs to compile a delta — the TUs that actually changed or that depend on changed code. With fine-grained dependency signatures (§4.2), that delta is minimal. With a shared content-addressed store over P2P (§4.12), each unique (TU, input context) is compiled once across the entire team and CI; every other daemon fetches the artifact by hash.
Implication for weekly savings: The benefit is not only "one full rebuild is 2× faster" but "across the whole team and all build machines, the system compiles much less in total." Aggregate compilation work in a week can drop by a large factor because work is shared instead of repeated.
Illustrative model (UE-style team):
| Role | Count | Builds per day | Today: TUs recompiled per build (approx.) | Today: TU-minutes per build (approx.) |
|---|---|---|---|---|
| Developer | 20 | 20 | 50–200 | 25–200 |
| Build machine (CI) | 5 | 50 | 100–500 | 50–300 |
- Without forgecc: Each developer does ~20 × 100 ≈ 2,000 TU-minutes/day; 20 devs → 40,000 TU-minutes/day. CI: 5 × 50 × 150 ≈ 37,500 TU-minutes/day. Total ~77,500 TU-minutes/day → ~388k TU-minutes/week (single-threaded equivalent).
- With forgecc (P2P + fine-grained incremental): Many of those "TUs" are the same (same file, same flags, same header state). The first daemon to need a (TU, context) compiles it; all others get a cache hit from the P2P store. If 20% of the week's logical compilations are unique and 80% are served from peers, total actual compilation is 20% of 388k ≈ 78k TU-minutes/week.
- Result: Roughly 5× less aggregate compilation work in the week. The exact factor depends on how aligned the team's changes are (same branch, similar -D flags, similar includes). Conservative estimates still yield 3–5× reduction in total CPU-time spent compiling across the org.
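The arithmetic in the model above checks out as follows (all inputs are the document's rough planning numbers, not measurements):

```rust
// Reproducing the illustrative weekly TU-minute arithmetic above.
fn main() {
    let dev_daily = 20 * 20 * 100; // 20 devs × 20 builds × ~100 TU-min/build
    let ci_daily = 5 * 50 * 150;   // 5 machines × 50 builds × ~150 TU-min
    assert_eq!(dev_daily, 40_000);
    assert_eq!(ci_daily, 37_500);
    let weekly = (dev_daily + ci_daily) * 5; // 5-day week
    assert_eq!(weekly, 387_500); // ≈388k TU-minutes/week
    // If only 20% of logical compilations are unique (80% served by peers):
    assert_eq!(weekly / 5, 77_500); // ≈78k TU-minutes/week, roughly 5× less
    println!("ok");
}
```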
Takeaways:
- Weekly savings are in aggregate: fewer CPU-hours compiling, faster feedback for everyone (incremental + cache hits from peers), and less load on build machines.
- First build of the week (or after a big merge) does more work; subsequent builds are mostly cache hits from peers. The system naturally favors the common case: small deltas, shared codebase.
Today, getting distribution + caching for C++ builds requires assembling and operating several pieces: a cache (sccache, ccache, Reclient, BuildBuddy), a distributed build layer (Incredibuild, FASTBuild, Reclient), cache servers or cluster endpoints, auth, and tuning. forgecc provides equivalent behavior out of the box with no separate servers and no .obj shipping.
| Concern | Today | With forgecc |
|---|---|---|
| Cache | sccache / ccache / Reclient, separate from the compiler | Daemon is the cache; content-addressed store is built in (§4.1, §4.2). |
| Distribution | Incredibuild, FASTBuild, Reclient, distcc, etc. | P2P between daemons (§4.12); no central job scheduler. |
| What is shared | Whole .obj (or equivalent) per TU | Content-addressed sections; same hash ⇒ one logical copy across the network. |
| Warm cache | Dev A's cache ≠ Dev B's; CI cache often separate. | Same codebase + same inputs ⇒ same hashes ⇒ any peer can serve. |
| Setup | Servers, auth, config, cache sizing, invalidation. | Run the daemon; point at seeds (or same LAN). No dedicated cache servers. |
| Network load | 15–18 GB (e.g. UE Editor) of .obj in/out on cache miss or distribute. | Fetch only missing content by hash; heavy dedup (header-derived code shared). |
Value proposition: Teams get shared caching and distribution by running forgecc daemons and connecting them (e.g. via seeds). No separate distribution or cache infrastructure to deploy or maintain.
This section sets practical expectations for a fully implemented forgecc stack in
developer workflows (-O0 / fast-iteration mode). These ranges are planning targets,
not guarantees, and depend on cache warmth, change shape, memory pressure, and
workspace policy.
Important operating model: true full rebuilds are primarily a first import event (or explicit cache wipe / toolchain epoch reset). After import, steady-state development is overwhelmingly incremental. Most rebuilds recompile a small delta.
Build-time reduction ranges:
| Codebase | Full rebuild (cold import / forced invalidation) | Incremental: .cpp body change | Incremental: header/API change |
|---|---|---|---|
| Clang/LLD | ~0.9×–1.3× vs baseline toolchain | ~3×–10× faster | ~1.5×–4× faster |
| Chromium | ~0.8×–1.2× vs baseline toolchain | ~5×–20× faster | ~2×–8× faster |
| Unreal Engine | ~0.8×–1.3× vs baseline toolchain | ~8×–30× faster | ~3×–12× faster |
Interpretation:
- Full rebuild speedup is moderate because most work is genuinely new on a cold import.
- Incremental speedup is large because unchanged units/functions/templates are skipped and cache hits dominate.
- Team/CI aggregate reduction in total compile work is typically 3×–8× due to cross-machine hash reuse (§6.6).
Storage / artifact footprint expectations:
| Storage view | Traditional model (.obj/.pdb intermediates) | forgecc model (index + content store) |
|---|---|---|
| Build index metadata | N/A | Tiny relative footprint (typically <<1% of total) |
| Intermediate duplication | High (near-duplicate artifacts repeated across builds) | Low (content-addressed dedup across stages and builds) |
| Single-machine active workspace | Baseline | Typically ~20%–60% lower net footprint after warm-up |
| Cross-machine/team footprint + transfer | Large repeated artifact transfer | Hash-addressed reuse; only missing content transfers |
Storage notes:
- The content store carries metadata overhead, but deduplication and reuse dominate in steady-state development.
- Retention policy (GC horizon, branch pinning, CI pin sets) determines long-tail size.
Two approaches, used in sequence:
Quick start (spawn driver): A new ForgeccToolChain.cs that points
CompilerPath at forge-cc.exe (see §7.3). UBT spawns it exactly like
clang-cl.exe — no changes to UBT's action execution model, only the toolchain
path changes (~5 lines). This is the initial integration path.
Optimized (direct HTTP): ForgeccToolChain.cs constructs
ForgeccCompileAction objects and sends them to the daemon via HTTP
(POST /v1/compile) directly, bypassing process spawning entirely. This
eliminates per-action process creation overhead for UE5's hundreds of compile
actions. The data sent is the same either way:
- Source file path
- Include paths (user + system + VC include paths)
- Preprocessor definitions
- Force-include files
- Language standard (C++20)
- Architecture (x64)
- PCH settings → ignored by forgecc (automatic caching replaces PCH)
Both approaches report results back to UBT's action graph (success/failure + diagnostics).
For maximum performance, projects can use forgecc's native build description:
# forgecc-build.toml
[project]
name = "game"
standard = "c++20"
[targets.editor]
type = "executable"
sources = ["game/Source/**/*.cpp"]
include_paths = ["unreal/Engine/Source/Runtime/Core/Public", ...]
definitions = ["UE_BUILD_DEVELOPMENT=1", "WITH_EDITOR=1"]
forgecc provides a family of thin driver binaries that present a standard
compiler/linker interface to build systems. Each binary is the same Rust
executable, distinguished by argv[0] — the same pattern clang uses for
clang, clang++, and clang-cl (a single binary with symlinks/copies).
forge-daemon.exe ← main binary (the daemon)
forge-cc.exe ← symlink/copy, native compiler driver (clang-cl-compatible CLI)
forge-link.exe ← symlink/copy, native linker driver (lld-link-compatible CLI)
forge-lib.exe ← symlink/copy, native librarian driver (llvm-lib-compatible CLI)
forge-rc.exe ← symlink/copy, native resource compiler driver
forge-ml.exe ← symlink/copy, native assembler driver (llvm-ml-compatible CLI)
argv[0] detection: On startup, the binary extracts the filename from argv[0],
strips .exe, lowercases on Windows, and matches against the known tool names. This
determines which command-line format to parse and which daemon endpoint to call.
--driver-mode=X overrides the name-based detection (same as clang).
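The detection logic sketched in Rust, mapping tool names to the daemon endpoints they call (function name and return values are illustrative):

```rust
// Sketch of argv[0]-based tool detection: strip the directory and ".exe",
// lowercase, and match against the known tool names. `detect_tool` and its
// endpoint return values are illustrative.
fn detect_tool(argv0: &str) -> Option<&'static str> {
    let name = argv0
        .rsplit(|c: char| c == '/' || c == '\\') // take the filename
        .next()
        .unwrap_or(argv0)
        .to_ascii_lowercase();
    let name = name.strip_suffix(".exe").unwrap_or(&name);
    match name {
        "forge-cc" => Some("/v1/compile"),
        "forge-link" => Some("/v1/link"),
        "forge-lib" => Some("/v1/lib"),
        "forge-rc" => Some("/v1/rc"),
        "forge-ml" => Some("/v1/ml"),
        _ => None, // unknown tool name; --driver-mode=X would override
    }
}

fn main() {
    assert_eq!(detect_tool(r"C:\tools\Forge-CC.exe"), Some("/v1/compile"));
    assert_eq!(detect_tool("/usr/bin/forge-link"), Some("/v1/link"));
    assert_eq!(detect_tool("cl.exe"), None);
    println!("ok");
}
```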
Each driver:
- Parses the command line in the format of its corresponding LLVM/MSVC tool
- Connects to the local daemon (localhost:9473) via HTTP/2
- POSTs the action to the appropriate /v1/* endpoint
- Streams diagnostics (warnings/errors) to stderr
- Writes output files to the paths the build system expects
- Exits with 0 (success) or 1 (failure)
Daemon auto-start: If the driver cannot connect to the daemon, it starts
forge-daemon.exe as a background process (detached, writes PID file), waits
for it to become ready (poll /v1/status), then proceeds. This means developers
never need to manually start the daemon — the first build invocation starts it
automatically.
Tool-to-endpoint mapping:
| Driver binary | CLI compatibility target | Daemon endpoint | Flags accepted |
|---|---|---|---|
| forge-cc.exe | clang-cl-compatible CLI | /v1/compile | /c, /Fo, /I, /D, -std:c++20, etc. |
| forge-link.exe | lld-link-compatible CLI | /v1/link | /OUT:, /DLL, /LIBPATH:, .obj paths |
| forge-lib.exe | llvm-lib-compatible CLI | /v1/lib | /OUT:, .obj paths |
| forge-rc.exe | native .rc compiler (MSVC/LLVM-compatible CLI) | /v1/rc | /fo, .rc path |
| forge-ml.exe | llvm-ml-compatible CLI | /v1/ml | /c, /Fo, .asm path |
Implementation: The driver is a single Rust crate (forge-driver, ~800 lines)
that compiles to one binary. The build system produces the binary as forge-daemon.exe
and creates symlinks/copies for each tool name (via build.rs or a post-build step).
For maximum build performance, forgecc includes a small patch to
ninja that eliminates process spawning entirely for
forge-family tools. When ninja detects that a build rule's command uses a forge
tool, it routes the action directly to the daemon via RPC instead of calling
CreateProcess().
Detection: When ninja loads build.ninja, for each rule, it checks whether
the command's tool path ends with a forge-family binary name (forge-cc.exe,
forge-link.exe, forge-lib.exe, forge-rc.exe, forge-ml.exe). This is a
simple string suffix check on the first token of the command.
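The detection predicate is a suffix match on the command's first token. A logic sketch (the actual patch is C++ inside ninja; this reproduces only the decision, with illustrative names):

```rust
// Logic sketch of ninja's forge-tool detection: suffix-match the first
// token of a rule's command against the forge-family binary names.
// The real patch is C++ in ninja; names here are illustrative.
const FORGE_TOOLS: [&str; 5] = [
    "forge-cc.exe",
    "forge-link.exe",
    "forge-lib.exe",
    "forge-rc.exe",
    "forge-ml.exe",
];

fn is_forge_command(command: &str) -> bool {
    let first = command.split_whitespace().next().unwrap_or("");
    let first = first.to_ascii_lowercase();
    FORGE_TOOLS.iter().any(|t| first.ends_with(t))
}

fn main() {
    assert!(is_forge_command(r"C:\bin\forge-cc.exe /c foo.cpp /Fofoo.obj"));
    assert!(!is_forge_command("rc.exe /fo res.res app.rc")); // spawns normally
    println!("ok");
}
```

Rules that fail this check keep ninja's normal CreateProcess() path, so non-forge tools are unaffected.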
RPC dispatch: For matched rules, instead of spawning a process, ninja:
- Opens a single HTTP/2 connection to the daemon (on first use, kept alive for the entire build session)
- Parses the command line into the tool type + arguments
- POSTs to the appropriate /v1/* endpoint as a new HTTP/2 stream
- Streams stdout/stderr back from the response body
- Reports success/failure to ninja's build graph as if the process had exited
Non-forge rules: Any rule whose tool is NOT a forge-family binary (e.g.,
rc.exe, midl.exe, mt.exe, custom build steps) continues to spawn processes
normally. Ninja's existing process pool handles these unchanged.
Fallback: If the daemon is not reachable, ninja falls back to spawning the thin driver binary (which will auto-start the daemon). The system is always correct, just slower without the daemon running.
Why this matters: A typical UE5 editor build has ~5,000+ compile actions.
Spawning 5,000 processes (even thin ones) has measurable overhead on Windows
(~1–5 ms per CreateProcess call = 5–25 seconds total). With ninja daemon mode,
all 5,000 actions are HTTP/2 streams on a single TCP connection — zero process
creation, zero context switches, zero file system overhead from process startup.
Estimated ninja patch size: ~300–500 lines of C++ in ninja's subprocess.cc
/ build.cc:
- Tool detection in rule parsing (~50 lines)
- HTTP/2 client using a lightweight C library (e.g., nghttp2 or libcurl with HTTP/2) (~200 lines)
- Response streaming and exit code mapping (~100 lines)
- Connection lifecycle management (~50 lines)
CMake must see a real compiler executable for configuration. forge-cc.exe
satisfies this by identifying as clang-cl.
Version string: forge-cc.exe --version outputs:
clang version 17.0.0 (forgecc 0.1.0)
Target: x86_64-pc-windows-msvc
Thread model: posix
This causes CMake to set CMAKE_CXX_COMPILER_ID=Clang and
CMAKE_CXX_SIMULATE_ID=MSVC, inheriting all of CMake's existing clang-cl
support. No CMake changes needed.
What CMake does during configuration (all handled by the thin driver):
- `--version` — parseable version string (see above)
- `try_compile` — compiles a small test program and produces a `.obj`. The driver forwards to the daemon, which produces a valid COFF `.obj`.
- Feature detection — compiles snippets to detect C++ standard support (`-std:c++20`, etc.). This works because forgecc accepts clang-cl flags.
- ABI detection — compiles `CMakeCXXCompilerABI.cpp` and inspects the `.obj` for ABI information. Our `.obj` must be valid COFF with correct section names and symbol types.
Flow: CMake invokes forge-cc.exe during configure (spawning the thin
driver). The generated build.ninja references forge-cc.exe as the compiler.
When ninja runs the build, it detects the forge tool and switches to daemon
mode (§7.4), bypassing process spawning for all compile/link actions.
Note on time estimates: The week ranges below assume one experienced systems programmer working full-time, without AI assistance. With AI-assisted development (vibe coding with an AI agent), realistic speedups are roughly 2–4x depending on the component:
| Component type | AI leverage | Examples |
|---|---|---|
| Boilerplate / table-driven / well-documented formats | 2–4x faster | Lexer, x86-64 encoding tables, DAP protocol, COFF/PE, MSVC name mangling |
| Design-heavy but with good references | 1.5–2x faster | Parser, IR design, codegen lowering, linker |
| Subtle correctness / deep language semantics | 1–1.5x faster | Sema (templates, overload resolution, concepts), JIT patching, distributed consistency |

Adjusted total: The ~108 manual weeks may compress with AI assistance, but the realistic range depends heavily on Sema/JIT/debugger correctness work. A planning range of ~54–90 weeks is more defensible; ~36–54 weeks should be treated as an optimistic best case rather than the baseline expectation.
De-risking priority rule: Highest semantic-risk components are implemented and validated before feature breadth. Template-instantiation memoization and invalidation correctness are front-loaded in the plan because they are on the critical path for incremental performance and correctness.
- Cargo workspace setup
- `forge-store`: Content-addressed local store on redb, build index table
- `forge-common`: Diagnostics, source map, string interning, hashing
- `forge-daemon`: Basic daemon lifecycle, HTTP server (axum), workspace binding
- `forge-net`: Seed-based peer discovery, basic DHT
- CLI / HTTP compile interface: `POST /v1/compile` and `POST /v1/link` endpoints accepting compile/link actions — this is the primary way to feed work into the daemon from day one (UBT integration comes later; during early phases, a simple CLI wrapper or `curl` submits actions directly)
- Determinism infrastructure: deterministic hasher, `--verify-determinism` mode
Milestone: Daemon starts, accepts compile actions via HTTP, peers discover each other, store get/put works locally and across network. All data structures use deterministic ordering.
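The store's get/put contract can be sketched as below. This is a minimal in-memory stand-in: the real forge-store hashes with blake3 and persists in redb, while std's `DefaultHasher` and a `HashMap` are used here so the example stays self-contained.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Content-addressed store sketch: the key IS the hash of the content,
// so identical artifacts dedup to a single entry.
struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    /// Store bytes under their content hash; returns the key.
    fn put(&mut self, data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new(); // stand-in for blake3
        data.hash(&mut h);
        let key = h.finish();
        self.blobs.entry(key).or_insert_with(|| data.to_vec());
        key
    }

    fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = ContentStore::new();
    let k1 = store.put(b"preprocessed header contents");
    let k2 = store.put(b"preprocessed header contents"); // dedups to same key
    assert_eq!(k1, k2);
    assert_eq!(store.get(k1), Some(&b"preprocessed header contents"[..]));
}
```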
- `forge-lex`: Full C/C++20 lexer including:
  - All literal types (integer, float, char, string) with all encoding prefixes
  - Raw string literals with delimiter matching and line-splice reversion
  - User-defined literal token recognition (ud-suffix)
  - Hexadecimal floating-point literals
  - Phase 1–2: UTF-8 decoding, BOM stripping, line splicing
  - Phase 5–6: Adjacent string literal concatenation with encoding compatibility
- `forge-pp`: Preprocessor with all required directives including:
  - `#pragma pack`, `#pragma warning` (critical for MSVC STL compatibility)
  - `#line`, `#error`, `#warning`, null directive
  - `__has_builtin`, `__has_feature`, `__has_extension`
  - All predefined macros (`__cplusplus`, `_MSC_VER`, `__cpp_*` feature-test macros)
- Automatic header caching (replaces PCH)
- Memoization of preprocessed output in the store
- Basic Windows SDK header compatibility (start testing against `<windows.h>` and common system headers — these are needed by every target level)
Milestone: Can preprocess UE5 header files correctly. Headers are automatically cached. No PCH configuration needed. Target level I (hand-crafted tests) preprocesses correctly.
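As one example of the lexer work above, raw-string-literal scanning with delimiter matching can be sketched as follows (simplified: encoding prefixes and line-splice reversion are omitted, and the function name is illustrative):

```rust
/// Scans a C++ raw string literal `R"delim(...)delim"` and returns its body.
/// Assumes `src` begins at the `R"` of the token; returns None on malformed input.
fn scan_raw_string(src: &str) -> Option<&str> {
    let rest = src.strip_prefix("R\"")?;
    let open = rest.find('(')?;
    let delim = &rest[..open]; // d-char-sequence, may be empty
    let body_start = open + 1;
    let close = format!("){}\"", delim); // `)delim"` terminates the literal
    let end = rest[body_start..].find(&close)?;
    Some(&rest[body_start..body_start + end]) // body, with no escape processing
}

fn main() {
    assert_eq!(scan_raw_string(r#"R"(plain)""#), Some("plain"));
    // A `)"` inside the body is fine when a delimiter is used:
    assert_eq!(scan_raw_string(r#"R"x(a )" b)x""#), Some(r#"a )" b"#));
}
```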
- `forge-parse`: Recursive descent parser for C and C++20
  - AST definition covering UE5's usage patterns
  - Concepts/requires-clause parsing
  - `extern "C"` / linkage specification parsing
  - `explicit(bool)` conditional explicit parsing
  - `constinit` declaration parsing
  - CTAD / deduction guide declaration parsing
  - Bit-field member declarations
  - `static_assert` with optional message
  - `alignas`/`alignof` parsing
  - `noexcept` specifier and operator parsing
  - Attribute argument parsing (string arguments for `[[nodiscard("msg")]]`, etc.)
  - Error recovery
- `forge-driver`: Thin driver binaries (see §7.3):
  - argv[0] detection and tool dispatch
  - clang-cl flag parsing (`forge-cc.exe`)
  - HTTP/2 client → daemon RPC
  - Daemon auto-start
  - `--version` output for CMake compatibility (see §7.5)
  - Symlink/copy generation for forge-cc, forge-link, forge-lib, forge-rc, forge-ml
Milestone: Can parse UE5/Game source files into an AST. Target level I parses
correctly. forge-cc.exe --version works with CMake; cmake -G Ninja generates
a valid build.ninja referencing forge-cc.
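The driver's argv[0] dispatch can be sketched as below; the daemon-RPC side is elided, and the enum shape is illustrative. Tool names match the table in §7.3.

```rust
#[derive(Debug, PartialEq)]
enum Tool {
    Cc,
    Link,
    Lib,
    Rc,
    Ml,
    Daemon, // the binary invoked under its own name runs as the daemon
}

/// Picks the tool personality from argv[0]: strip the directory,
/// lowercase, drop a trailing `.exe`, then match the base name.
fn dispatch(argv0: &str) -> Tool {
    let name = argv0
        .rsplit(|c| c == '/' || c == '\\')
        .next()
        .unwrap_or(argv0)
        .to_ascii_lowercase();
    match name.trim_end_matches(".exe") {
        "forge-cc" => Tool::Cc,
        "forge-link" => Tool::Link,
        "forge-lib" => Tool::Lib,
        "forge-rc" => Tool::Rc,
        "forge-ml" => Tool::Ml,
        _ => Tool::Daemon,
    }
}

fn main() {
    assert_eq!(dispatch(r"C:\bin\forge-cc.exe"), Tool::Cc);
    assert_eq!(dispatch("forge-link"), Tool::Link);
    assert_eq!(dispatch("forge-daemon.exe"), Tool::Daemon);
}
```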
- `forge-sema`: Modular Sema (see §4.5 for structure)
- Name lookup (including two-phase lookup for templates), type checking (parallel where possible)
- Template instantiation with Tier 1 memoization (structural key — see §4.2)
- Template memoization de-risk track (front-loaded):
  - Implement Tier 2 dependency-signature recording for a focused template subset (`std::vector`, traits, selected UE containers) before full language breadth
  - Add negative-dependency recording for ADL/concept lookups in this subset
  - Add signature-cost telemetry (`hashes checked`, `lookup count`, `validate-us`) and enforce budget thresholds in CI
  - Run controlled edit-sequence tests (body edit, overload add/remove, namespace injection, macro flip) and require deterministic invalidation outcomes
- CTAD: deduction guide synthesis, implicit guides from constructors, aggregate deduction guides
- Template template parameters, class-type NTTPs (structural types)
- Overload resolution (including built-in operator candidates, user-defined literal operator resolution, address of overloaded function)
- Concept satisfaction checking
- `noexcept` as part of function type; `noexcept` operator evaluation
- `extern "C"` linkage: name mangling suppression, calling convention
- `explicit(bool)` conditional explicit evaluation
- `constinit` validation (must be constant-initialized)
- Structured binding decomposition (tuple-like, array, data member)
- Defaulted comparison operators (`= default` for `==` and `<=>`)
- Brace-enclosed initializer lists / `std::initializer_list` construction / narrowing conversion detection / aggregate initialization rules
- `static_assert` evaluation
- `alignas`/`alignof` evaluation
- RTTI support: `typeid`, `dynamic_cast`
- MSVC class layout and name mangling (including `#pragma pack` effects)
- `constexpr` basic evaluation (literal folding, simple expressions)
- `consteval` context detection — Sema records deferred `consteval` calls for later JIT evaluation (requires IR + codegen + JIT from Phases 4–5; see §4.6a)
- SEH support (`__try`/`__except`/`__finally`)
- MSVC STL header compatibility (progressively — needed for levels IV–VI; link against vcruntime.lib, ucrt.lib)
Milestone: Can type-check UE5/Game source files. No Sema file > 1000 lines.
consteval contexts are detected and recorded; full evaluation deferred to Phase 6.
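The Tier 1 structural key can be sketched as below: a template instantiation is keyed by the hash of the template's definition plus the structural hashes of its arguments. Assumption: the real implementation uses blake3 per §4.2; std's `DefaultHasher` stands in here, and the struct/field names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Tier 1 structural key sketch: same template definition + same argument
// hashes => same key => cache hit, regardless of which TU asks.
#[derive(Hash)]
struct InstantiationKey<'a> {
    template_def_hash: u64, // hash of the template's token stream / AST
    arg_hashes: &'a [u64],  // structural hashes of the template arguments
}

fn key_hash(key: &InstantiationKey) -> u64 {
    let mut h = DefaultHasher::new(); // stand-in for blake3
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let mut cache: HashMap<u64, &str> = HashMap::new(); // key → memoized result id
    let vec_int = key_hash(&InstantiationKey { template_def_hash: 0xAB, arg_hashes: &[1] });
    cache.insert(vec_int, "std::vector<int> instantiation");
    // Same template + same args → hit; a different argument misses.
    let hit = key_hash(&InstantiationKey { template_def_hash: 0xAB, arg_hashes: &[1] });
    let miss = key_hash(&InstantiationKey { template_def_hash: 0xAB, arg_hashes: &[2] });
    assert!(cache.contains_key(&hit));
    assert!(!cache.contains_key(&miss));
}
```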
- `forge-ir`: SSA-based IR with SIMD intrinsic support
- Minimal passes (DCE, constant folding)
- `forge-codegen`: x86-64 lowering, register allocation, emission
- SSE/AVX intrinsic lowering
- COFF section generation
- Debug metadata generation (source maps, type info, frame layouts — stored in internal format for DAP; CodeView serialization deferred to Phase 8)
- Basic `__asm` block support
- Target levels I–II: stb libraries compile and produce correct output
Milestone: Can compile simple C/C++ programs to working x86-64 code. Levels I–II
pass forge-verify.
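A constant-folding pass of the kind listed above can be sketched over a toy expression IR — illustrative only; ForgeIR's real node set and SSA form are far richer:

```rust
// Toy expression IR: constants fold bottom-up; opaque nodes block folding.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    Const(i64),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
    Param(u32), // opaque value (e.g., a function parameter)
}

fn fold(n: &Node) -> Node {
    use Node::*;
    match n {
        Add(a, b) => match (fold(a), fold(b)) {
            (Const(x), Const(y)) => Const(x.wrapping_add(y)),
            (x, y) => Add(Box::new(x), Box::new(y)),
        },
        Mul(a, b) => match (fold(a), fold(b)) {
            (Const(x), Const(y)) => Const(x.wrapping_mul(y)),
            (x, y) => Mul(Box::new(x), Box::new(y)),
        },
        other => other.clone(),
    }
}

fn main() {
    use Node::*;
    // (2 + 3) * 4 folds to a single constant; a Param leaves a residual tree.
    let e = Mul(
        Box::new(Add(Box::new(Const(2)), Box::new(Const(3)))),
        Box::new(Const(4)),
    );
    assert_eq!(fold(&e), Const(20));
    let r = Add(Box::new(Const(1)), Box::new(Param(0)));
    assert_eq!(fold(&r), r);
}
```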
- `forge-daemon/loader.rs`: Process creation, section mapping
- Lazy compilation stubs
- Incremental patching (function-level)
- Import library parsing (.lib) for Windows SDK / third-party DLLs
- Static initializer ordering and execution
- Exception handling registration (`RtlAddFunctionTable`)
- Ninja daemon mode patch (see §7.4):
  - Fork ninja, add forge tool detection in rule parsing
  - HTTP/2 client (nghttp2 or libcurl) for daemon RPC
  - Response streaming and exit code mapping
  - Fallback to process spawning when daemon is unreachable
- Target level III: LMDB compiles, links, and runs via JIT
Milestone: Can JIT-compile and run a simple C++ program. Changes are patched live.
Ninja daemon mode eliminates process spawning for forge tools.
Target level III passes forge-verify.
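Function-level incremental patching can be sketched with an indirection table: calls go through a per-function slot, so recompiling one function only swaps that slot. The real loader patches machine code in the child process; Rust function pointers stand in here, and all names are illustrative.

```rust
use std::collections::HashMap;

// Per-function indirection slots: call sites dispatch through the table,
// so a hot-patch is a single slot swap.
struct PatchTable {
    slots: HashMap<&'static str, fn(i64) -> i64>,
}

impl PatchTable {
    fn call(&self, name: &str, arg: i64) -> i64 {
        (self.slots[name])(arg)
    }
    /// Hot-patch: swap the slot; every subsequent call sees the new body.
    fn patch(&mut self, name: &'static str, new_fn: fn(i64) -> i64) {
        self.slots.insert(name, new_fn);
    }
}

fn gravity_v1(y: i64) -> i64 { y - 9 }
fn gravity_v2(y: i64) -> i64 { y - 20 } // the edited, recompiled body

fn main() {
    let mut table = PatchTable {
        slots: HashMap::from([("apply_gravity", gravity_v1 as fn(i64) -> i64)]),
    };
    assert_eq!(table.call("apply_gravity", 100), 91);
    table.patch("apply_gravity", gravity_v2); // recompile + patch, no restart
    assert_eq!(table.call("apply_gravity", 100), 80);
}
```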
- `forge-sema/consteval`: Wire deferred `consteval` calls to the JIT pipeline
- Compile `consteval` functions to native code via forge-ir → forge-codegen
- Map compiled code into daemon's address space (`VirtualAlloc` + `PAGE_EXECUTE`)
- Execute and capture results; feed back into Sema as compile-time constants
- Lightweight sanitizer instrumentation (overflow, null deref, bounds checks)
- Sandboxed allocator for `constexpr` `new`/`delete` (C++20)
- Memoization of `consteval` results: `blake3(function_hash + args) → result`
- Target levels IV–VI: nlohmann/json, Dear ImGui, Google Test
Milestone: consteval functions compile and execute in-process. Results are
cached and reused across TUs. No AST interpreter needed. Levels IV–VI pass
forge-verify.
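The consteval memoization scheme can be sketched as below. Assumptions: `u64` keys and `i64` values stand in for the real blake3 hashes and encoded constant values, and a plain closure stands in for the JIT-compiled function body.

```rust
use std::collections::HashMap;

// consteval result cache: (function hash, argument values) → computed constant.
// A hit means the JIT never runs again for that call, in any TU.
struct ConstevalCache {
    results: HashMap<(u64, Vec<i64>), i64>,
    jit_calls: usize, // how many times we actually "ran" the JIT
}

impl ConstevalCache {
    fn eval(&mut self, fn_hash: u64, args: &[i64], evaluate: impl Fn(&[i64]) -> i64) -> i64 {
        let key = (fn_hash, args.to_vec());
        if let Some(&r) = self.results.get(&key) {
            return r; // reused across TUs without re-executing
        }
        self.jit_calls += 1;
        let r = evaluate(args);
        self.results.insert(key, r);
        r
    }
}

fn main() {
    let mut cache = ConstevalCache { results: HashMap::new(), jit_calls: 0 };
    let square = |a: &[i64]| a[0] * a[0]; // stands in for the JIT-compiled body
    assert_eq!(cache.eval(0xF00D, &[12], square), 144);
    assert_eq!(cache.eval(0xF00D, &[12], square), 144); // second TU: cache hit
    assert_eq!(cache.jit_calls, 1);
}
```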
- `forge-debug`: DAP protocol server (JSON-over-stdio)
- Breakpoint management (int3 patching)
- Single-stepping (hardware debug registers)
- Variable inspection (ReadProcessMemory + Sema type info)
- Stack unwinding
- VS Code launch.json integration
Milestone: Can debug JIT-compiled programs from VS Code. No PDB needed.
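The DAP base protocol's framing — a `Content-Length` header followed by a JSON body — is simple enough to sketch here; JSON parsing itself is elided, and the function names are illustrative:

```rust
/// Wraps a JSON message body in DAP base-protocol framing.
fn frame(body: &str) -> String {
    format!("Content-Length: {}\r\n\r\n{}", body.len(), body)
}

/// Extracts the first complete message body from a buffer, if one is present.
/// (Simplified: assumes a single Content-Length header and byte == char length.)
fn parse_frame(buf: &str) -> Option<&str> {
    let header_end = buf.find("\r\n\r\n")?;
    let len: usize = buf[..header_end]
        .strip_prefix("Content-Length: ")?
        .trim()
        .parse()
        .ok()?;
    buf.get(header_end + 4..header_end + 4 + len)
}

fn main() {
    let msg = r#"{"seq":1,"type":"request","command":"setBreakpoints"}"#;
    let wire = frame(msg);
    assert_eq!(parse_frame(&wire), Some(msg));
}
```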
- `forge-link`: Symbol resolution, section layout
- PE executable generation
- DLL generation (for UE5 module structure)
- Import library reading
- Incremental linking
- PDB generation (release mode only — deferred priority)
- Target level VII: Selected LLVM lit tests pass
Milestone: Can link multi-file C++ programs into working PE executables and DLLs. Target level VII provides regression coverage.
- Target level VIII: Compile raylib + a small real-time demo
- Multi-TU linking with D3D11/Win32 interop
- JIT-compile and run the demo via the daemon
- Test hot-patching: change a shader parameter or game logic function, see the result in the running application within seconds
- Fix issues found with DLL loading, SIMD, threads, game loop patterns
Milestone: A real-time graphical application builds, runs, and hot-reloads via the daemon. This validates the full pipeline (compile → link → JIT → patch) on a real-world content project before tackling UE5.
- Target level IX: Write and compile forge-demo (~2–5k LoC C++20 app)
- C++20 feature coverage: `concept`, `requires`, structured bindings, `if constexpr`, `auto` return types, `consteval` lookup tables (sin/cos, color palettes)
- SIMD math: SSE/AVX vector/matrix operations (`__m128`, `_mm_mul_ps`)
- SEH: `__try`/`__except` around D3D calls
- STL usage: `std::array`, `std::vector`, `<algorithm>` (plus UE-style `TArrayView`/`FMemoryView` patterns, not `std::span`/`std::ranges`)
- DLL plugin: `plugin.dll` loaded at runtime — validates DLL generation and import
- DAP debugging: set breakpoints in the running demo, inspect variables mid-frame
- JIT hot-patching demo: change gravity/color/logic in `physics.cpp` or `scene.cpp`, see the result in the running application within ~1 second
- Distributed compilation: compile forge-demo across two machines via P2P
Milestone: A single small project exercises every compiler phase. The hot-patching demo provides a compelling visual demonstration of forgecc's value proposition. All C++20 features, SIMD, SEH, DLL linking, DAP, and distributed compilation work correctly on this project before moving to UE5.
- UBT integration (`ForgeccToolChain.cs`) — replaces the CLI/curl workflow with native UBT support; compile and link actions flow through UBT's action graph
- Handle UE5-specific patterns (UHT-generated code, `*_API` DLL exports, etc.)
- Deterministic compilation verification on UE5 codebase
- Progressive UE5 testing (target level X):
- Single UE5 module (e.g., Core) compiles successfully
- Small set of modules (Core + CoreUObject + Engine) compile and link
- UE5 Lyra demo builds and runs
- Full Game editor builds and runs
- Fix edge cases found at each progressive stage
Milestone: Game editor builds and runs via the daemon. Debugging works via DAP.
- Full DHT implementation with replication
- Cross-machine artifact sharing
- Benchmark: two machines sharing compilation work
- File watching and pre-compilation
- Template memoization scale tuning (Tier 2/Tier 3 already implemented earlier): profile UE5/Game builds to refine dependency-signature granularity and reduce over-invalidation under real workloads
- Namespace hash/index scale tuning: bucket sizing, Merkle partition policy, and validation-cost profiling for associated-namespace lookups
Milestone: Two machines share compilation work via P2P network.
Rather than jumping from "Hello World" to UE5, forgecc is validated against a ladder of real-world open-source C/C++ projects of increasing complexity. Each target level (numbered with Roman numerals I–X) serves as a milestone gate: the compiler must produce correct, runnable output for a given target level before moving to the next. (Target levels are distinct from implementation phases — phases describe what is built, target levels describe what is validated against.)
| Target Level | Project | ~LoC | Why | Key features exercised |
|---|---|---|---|---|
| I | Hand-crafted test suite | <1k | Controlled tests for each language feature | All, incrementally |
| II | stb libraries (stb_image.h, etc.) | ~7k per lib | Pure C, single-file, no dependencies, widely used | C99, macros, function pointers |
| III | LMDB | ~12k | Pure C, real-world database, COFF/PE linking exercise | C99, Win32 API, threads |
| IV | nlohmann/json (single-header) | ~25k | Header-only C++11/17, heavy templates, tests memoization | Templates, SFINAE, exceptions, STL |
| V | Dear ImGui | ~70k | C++11, moderate complexity, Win32/DX backends, good for JIT testing | Classes, vtables, function pointers, Win32 |
| VI | Google Test | ~30k | C++14, templates, macros, used by many projects | Templates, macros, exceptions, RTTI |
| VII | LLVM lit test suite (selected C/C++ tests) | varies | Known-good reference tests with expected outputs | Wide coverage |
| VIII | raylib + small real-time demo | ~100k | Real-time rendering, game loop, Win32/D3D, multi-TU linking, hot-reload testing | D3D interop, DLL loading, game loop, SIMD, threads, JIT patching of running app |
| IX | forge-demo (purpose-built C++20 showcase) | ~2–5k | Smallest project that exercises every phase; interactive real-time app designed for the "wow" demo | C++20 concepts, consteval tables, std::array/std::vector (+ UE-style view wrappers), SIMD math, Win32/D3D11, SEH, DLL plugin, DAP debugging, hot-patching live |
| X | UE5 Lyra demo / Game editor | ~millions | Full target | Everything |
Target level IX — forge-demo: No single small open-source project exercises all compiler phases, because real-world projects target broad compiler support and avoid bleeding-edge C++20 features. forge-demo is a purpose-built ~2–5k LoC interactive application (Win32 + D3D11) that intentionally uses every feature the compiler must handle — a compiler stress test disguised as a demo app. Suggested structure:
forge-demo/
main.cpp — Win32 window creation, message loop, D3D11 init
renderer.cpp — D3D11 rendering, shader loading, draw calls
scene.cpp — Scene graph, entity system using C++20 concepts
math.cpp — SIMD vector/matrix math (SSE/AVX), consteval lookup tables
physics.cpp — Particle/physics sim (the "change this and see it live" target)
plugin.dll — Optional DLL plugin to test DLL linking and loading
| Phase | Feature needed | How forge-demo exercises it |
|---|---|---|
| 1 | Preprocessor, Windows SDK headers | #include <windows.h>, #include <d3d11.h>, conditional compilation |
| 2 | C++20 parser | concept, requires, structured bindings, if constexpr, auto return types |
| 3 | Templates, concepts, consteval, STL, SEH | concept Renderable, consteval color/math tables, std::array/std::vector (+ UE-style view wrappers), __try/__except around D3D calls |
| 4 | x86-64 codegen, SIMD | __m128/_mm_mul_ps for transforms, hand-written SIMD math |
| 5 | JIT, live patching, DLL imports | Entire app runs via JIT; changing physics.cpp patches the running app live |
| 6 | consteval via JIT | consteval sin/cos tables, color palettes, vertex data |
| 7 | DAP | Set breakpoints in the running demo, inspect variables mid-frame |
| 8 | PE/DLL linking | Release build produces standalone .exe + plugin.dll |
| 9 | Real-time + hot-reload | This is the real-time hot-reload demo |
| 12 | Distributed | Compile forge-demo across two machines |
The demo scenario for JIT hot-patching: the app is running and rendering a scene;
the developer changes the gravity constant in physics.cpp or the particle color
function in scene.cpp, and within a second the running application reflects the
change — no restart, no visible recompile cycle. This is the "wow" moment that
demonstrates forgecc's value proposition.
How target levels map to implementation phases (each target level is an explicit milestone gate within its phase):
- Target levels I–II gate Phase 4 (codegen) — pure C, simple linking
- Target level III gates Phase 5 (JIT loader) — threads, Win32 API
- Target levels IV–VI gate Phase 6 (consteval) — heavy C++ templates, STL headers
- Target level VII gates Phase 8 (release linker) — regression coverage
- Target level VIII bridges Phase 8 → Phase 10 — real-time rendering, game loop, D3D interop, multi-TU linking, and the first real test of JIT hot-patching a running graphical application
- Target level IX gates Phase 11 — the only project that exercises every phase (C++20 concepts, `consteval`, SIMD, SEH, DLL plugin, DAP debugging, hot-patching); serves as the comprehensive validation gate before UE5
- Target level X is progressive within Phase 11: single UE5 module → small set → Lyra → full Game editor
To validate correctness against existing compilers, forgecc includes a thin CLI tool
called forge-verify (~1,500 LoC). Rather than building a full comparison tool
from scratch, it leverages
llvm-debuginfo-analyzer
(the successor to llvm-diva, already in LLVM trunk) for debug info comparison, and
adds execution-level and structural comparison on top.
forge-verify workflow:
- Connects to the forgecc daemon via HTTP (`POST /v1/compile`)
- Requests compilation of a source file (or set of files)
- Materializes a temporary `.obj` file from the store (via `/v1/artifact/{hash}?format=coff` — see §4.13)
- Compares against a reference `.obj` produced by MSVC or clang-cl using:
  - `llvm-debuginfo-analyzer --compare`: debug info (CodeView) correctness
  - `llvm-objdump --disassemble`: instruction-level comparison (structural, not byte-identical — register allocation will differ)
  - Custom section diff: compare section names, sizes, relocation counts, symbol tables
- Execution diff: link and run both executables, compare stdout/stderr/exit code
Test case structure:
tests/
level0_handcrafted/
hello_world.cpp # source
hello_world.expected # expected stdout/stderr/exit code
level1_stb/
stb_image_test.cpp
stb_image_test.expected
level2_lmdb/
...
Usage in CI: forge-verify is the primary unit-test harness. Each target codebase
level (§8a) has a corresponding test directory. CI runs forge-verify against all
levels that the current phase supports, comparing output against clang-cl as the
reference compiler.
Usage for development: During development, forge-verify --single foo.cpp compiles
a single file with both forgecc and clang-cl, then diffs the results — useful for
debugging individual compilation issues.
Incremental correctness test framework (required for an incremental compiler):
- Edit-sequence tests as first-class citizens: each test defines `S0 -> S1 -> S2 ...` source revisions and expected outcomes at each step (cache hit/miss, invalidation set, compile result, runtime output).
- Deterministic invalidation assertions: for each revision transition, `forge-verify` checks that the invalidated unit set and produced artifacts are stable across repeated runs.
- Differential incremental mode: each transition is run both incrementally and from clean state; outputs and diagnostics must match.
- Template stress suites: dedicated suites for ADL/SFINAE/concepts/two-phase lookup transitions where tiny source edits historically trigger broad recomputation.
Fuzzing roadmap (staged):
- Phase 3 onward: grammar-aware mutational fuzzing for parser+sema with crash-only oracles and reducer integration.
- After Tier 2 memoization is active: stateful incremental fuzzing generates edit sequences and checks invariants (`incremental == clean`, deterministic invalidation, no stale-cache acceptance).
- Later (Phase 8+): differential fuzzing against clang-cl/MSVC for diagnostics, codegen structure, and runtime behavior on generated corpora.
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| C++ corner cases in Sema | Very High | Critical | Focus on UE5/Game's subset; fail gracefully on unsupported features; incremental target codebases (§8a) |
| MSVC ABI incompatibility | High | Critical | Extensive testing against MSVC-compiled code; forge-verify (§8b) + llvm-debuginfo-analyzer |
| JIT execution stability (loader, patching, unwind/TLS) | Very High | Critical | Two dev modes: hot-patch for body-safe edits, restart-on-ABI-change for safety; staged rollout by target level |
| ABI-shape hot-reload (layout/vtable changes) | High | Critical | ABI epoch tracking, generated migrators + optional user callbacks, hard fallback to process restart when unsafe |
| Debugger parity / tooling ecosystem gap | High | High | DAP-first for dev loop, but keep release-mode PDB/CodeView path and interoperability tests |
| Performance of unoptimized code | Medium | Medium | UE5 debug builds are already unoptimized; acceptable |
| SIMD intrinsic coverage | Medium | High | Map intrinsics incrementally; start with SSE2, add SSE4/AVX as needed |
| Windows SDK header complexity | High | High | Test incrementally; many headers are already handled by preprocessor |
| Distributed store consistency / trust boundaries | Medium | High | Determinism verification (--verify-determinism), protocol/version checks, content-hash verification, mTLS peer identity, signed provenance envelopes, policy-gated acceptance (optional N-of-M attestation) |
| Parallelism correctness | Medium | High | Rust ownership helps, but require deterministic merge order and stress/fuzz testing of concurrent paths |
| Crate | Purpose | Link |
|---|---|---|
| redb | Embedded key-value database (local store) | crates.io/crates/redb |
| blake3 | Fast content hashing | crates.io/crates/blake3 |
| rayon | Parallel compilation (work-stealing thread pool) | crates.io/crates/rayon |
| tokio | Async I/O for daemon HTTP server and P2P networking | crates.io/crates/tokio |
| axum | HTTP server framework (lightweight, tokio-native) | crates.io/crates/axum |
| reqwest | HTTP client (for P2P peer communication) | crates.io/crates/reqwest |
| notify | File system watching | crates.io/crates/notify |
| windows | Windows API bindings (process creation, VirtualAlloc, debug API) | crates.io/crates/windows |
| object | Reading import libraries and COFF files | crates.io/crates/object |
| serde / rmp-serde | Serialization: serde for derive, rmp-serde for MessagePack wire format | crates.io/crates/rmp-serde |
| serde_json | JSON wire format (debug/curl fallback) | crates.io/crates/serde_json |
| clap | CLI argument parsing | crates.io/crates/clap |
| tracing | Structured logging | crates.io/crates/tracing |
| dashmap | Concurrent hash maps (for Sema type tables) | crates.io/crates/dashmap |
| crossbeam | Lock-free data structures | crates.io/crates/crossbeam |
| Crate | Estimated LoC | Notes |
|---|---|---|
| forge-common | 2,000 | Diagnostics, source map, interning |
| forge-store | 4,500 | Local store + distributed layer + dependency signature tracking (+500) + namespace content hash caching + template instantiation cache (+1000) |
| forge-net | 5,000 | HTTP server (axum), MessagePack/JSON content negotiation, P2P networking, DHT, gossip |
| forge-lex | 5,000 | Lexer + literal parsing (C + C++); raw strings, UDL, all encoding prefixes, hex floats |
| forge-pp | 7,500 | Preprocessor + macro expansion; #pragma pack/warning, feature-test macros, predefined macros |
| forge-parse | 16,000 | Parser + AST definitions (C + C++); CTAD, extern "C", explicit(bool), bit-fields |
| forge-sema | 34,500 | Modular Sema (~52 files, none > 1000 lines); two-phase lookup, CTAD, UDL resolution, noexcept types, defaulted comparisons, structured bindings, brace init, RTTI, MSVC __declspec/intrinsics; template instantiation memoization instrumentation (+3000); consteval via JIT saves ~3k lines vs interpreter |
| forge-ir | 5,000 | IR + minimal passes |
| forge-codegen | 14,000 | x86-64 lowering + encoding + SIMD |
| forge-debug | 3,000 | DAP server + breakpoints + variable inspection |
| forge-link | 6,000 | Linker + PE + DLL (PDB deferred to release mode) |
| forge-daemon | 5,500 | Daemon + scheduler + loader/JIT + determinism verification (HTTP server moved to forge-net) |
| forge-driver | 800 | Thin native driver binaries: argv[0] detection; clang-cl/lld-link/llvm-lib/MSVC-LLVM-rc/llvm-ml compatible flag parsing; HTTP client; daemon auto-start (see §7.3) |
| forge-verify | 1,500 | Verification CLI: daemon client, reference compiler diff, execution diff (see §8b) |
| Total | ~110,300 | + ~300–500 lines C++ ninja patch (external, see §7.4) |
This is a core differentiator from LLVM/Clang. Every component is designed for parallelism:
Like existing build systems, but managed by the daemon's scheduler instead of make/ninja. The daemon has full knowledge of the dependency graph and can schedule optimally.
Intra-TU parallelism follows a sequential spine → parallel fan-out model. The first three stages are inherently sequential due to C++'s context-dependent grammar and declaration-order semantics; the later stages fan out to all cores:
- Lexer/Preprocessor (sequential): Token order matters. The preprocessor must process `#include` directives in order because each header is preprocessed in the macro context established by everything before it. Different `#include`d files can only be preprocessed in parallel if they are guarded (`#pragma once` / include guards) and don't depend on external macro state.
- Parser (sequential): C++ parsing is context-dependent — whether a name is a type or a value affects how subsequent tokens are parsed. `typedef`, `using`, class/struct/enum declarations all change the parsing context. Top-level declarations must be parsed in order.
- Sema — signatures (sequential): Declaration signatures must be resolved top-to-bottom because each can reference types introduced by earlier declarations.
- Sema — function bodies (parallel): Once all signatures are resolved, function bodies can be type-checked concurrently. Each body only reads from the shared type table (`DashMap`, `RwLock`) — a read-heavy workload ideal for multithreading. Template instantiations are also parallelized (each is independent).
- IR lowering (parallel): Each function is lowered independently → trivially parallel via rayon.
- CodeGen (parallel): Each function is compiled independently → trivially parallel via rayon.
The sequential spine (lex → parse → sema signatures) produces the work items (function bodies, template instantiations) that then fan out across all cores. For a large TU with hundreds of functions, the parallel phase dominates.
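The spine → fan-out shape can be sketched as below. The real pipeline uses rayon's work-stealing pool; `std::thread::scope` keeps this example dependency-free, and `check_body` is a placeholder for the per-body Sema/IR/codegen work.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Fan independent work items (function bodies) out across threads and
/// return how many were processed.
fn fan_out(bodies: &[u64]) -> usize {
    let checked = AtomicUsize::new(0);
    let checked_ref = &checked;
    let check_body = |b: u64| -> u64 { b * 2 }; // placeholder per-body work

    thread::scope(|s| {
        for chunk in bodies.chunks(25) {
            s.spawn(move || {
                for &b in chunk {
                    let _ = check_body(b);
                    // Workers only read shared state (the type table in the
                    // real compiler); here a counter records completion.
                    checked_ref.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    }); // scope joins all workers before returning

    checked.into_inner()
}

fn main() {
    // The sequential spine (lex → parse → signature resolution) produces the
    // work items; numbered placeholders stand in for function bodies.
    let bodies: Vec<u64> = (1..=100).collect();
    assert_eq!(fan_out(&bodies), 100);
}
```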
Different TUs can be at different pipeline stages simultaneously:
Time →
TU1: [lex] [parse] [sema] [ir] [codegen]
TU2: [lex] [parse] [sema] [ir] [codegen]
TU3: [lex] [parse] [sema] [ir] [codegen]
For a TU that takes 2–3 minutes on a single core today (without caching):
| Stage | Parallelism potential | Estimated savings (8 cores) |
|---|---|---|
| Preprocessing | Sequential; limited parallelism for guarded includes | Minimal |
| Parsing | Sequential (context-dependent grammar) | None |
| Sema — signatures | Sequential (declaration order) | None |
| Sema — bodies | Function bodies + template instantiations in parallel | 30–50% of Sema time |
| IR lowering | Per-function, trivially parallel | 60–80% |
| CodeGen | Per-function, trivially parallel | 60–80% |
Realistic overall estimate: For a 2–3 minute TU, intra-TU parallelism alone (no caching) could bring it down to 1–1.5 minutes on 8 cores. The bigger win comes from memoization — if 90% of headers haven't changed, the TU might take 5–15 seconds on a warm cache regardless of parallelism.
Honest assessment: Intra-TU parallelism is a moderate win. The memoization/caching is the transformative win. Parallelism across TUs (which existing build systems already provide) plus caching is where 90% of the speedup comes from.
Via the distributed store. If machine A has already compiled a header that machine B needs, B fetches the result instead of recompiling.
Distributed TU compilation (compiling a single TU across multiple machines) is theoretically possible — individual functions could be farmed out for codegen. However, the overhead of serializing AST/IR, transferring it, and collecting results likely exceeds the benefit for a single TU. The better model is: distribute TUs across machines (coarse-grained), not functions within a TU (fine-grained).
With forgecc's caching, unity files should become unnecessary. Unity files exist because:
- Compiler startup cost per TU is high (loading headers, PCH) — forgecc eliminates this via the daemon's warm cache
- Redundant header parsing across TUs — forgecc's header-level memoization eliminates this
- Linker overhead from many .obj files — forgecc's JIT mode has no linker step
If caching works as designed, compiling 100 separate TUs should be as fast as (or faster than) compiling them as 10 unity files, because the cache eliminates the redundant work that unity files were designed to avoid.
forgecc does NOT compile the MSVC STL. The STL headers are #included in source files and processed by the preprocessor/parser/sema like any other header. forgecc links against the precompiled MSVC runtime libraries:
- `vcruntime.lib` — C runtime basics
- `ucrt.lib` — Universal C runtime
- `msvcrt.lib` / `msvcprt.lib` — C++ runtime, STL implementation
This means:
- Our preprocessor/parser/sema must handle MSVC STL headers (which use heavy template metaprogramming and MSVC extensions)
- Our linker must read MSVC import libraries (.lib)
- Our generated code must be ABI-compatible with MSVC-compiled code in these libraries
The UE5 editor target uses DLLs, so forgecc links against the DLL versions of the CRT
(/MD flag equivalent).
forgecc does not implement C++20 modules. The reasons:
- UE5/Game has zero usage of C++20 modules (`export module`, `import`)
- UE5's "modules" are its own build system concept (`.Build.cs`, `Private`/`Public` directories), not C++20 modules
- C++20 modules exist primarily for build-time improvement — forgecc achieves the same through its memoization/caching system, making modules redundant
- C++20 modules add significant compiler complexity (module dependency scanning, BMI generation, build ordering constraints) for no benefit in the target codebase
The architecture naturally supports modules (the content-addressed store handles BMI caching), so adding them post-PoC is straightforward if a target codebase requires them.
The findings below are based on a thorough automated scan of `C:\src\git\UnrealEngine\Engine\Source\` (Runtime, Editor, ThirdParty).
- 4,335 .c files in Engine/Source (all third-party: zlib, libpng, FreeType, Bink Audio, Rad Audio, ICU, etc.)
- Engine code itself is C++ only
- forgecc must support C compilation for these third-party libraries
- ~20 files in Engine/Source/Runtime use MSVC-style `__asm` / `_asm`
- Key locations: RadAudio `rrCore.h`, BinkAudio `rrCore.h` / `win32_ticks.cpp`, MemPro `MemPro.cpp`, AutoRTFM `LongJump.cpp`
- GNU-style `__asm__` also used in cross-platform code (Unix, ARM, x86 breakpoints)
- Most are in third-party SDK code that could potentially be linked as prebuilt
- Strategy: Support basic `__asm` blocks; complex cases fall back to prebuilt stubs
- Heavily used — hundreds of files, thousands of intrinsic calls
- `UnrealMathSSE.h` alone has 383 SIMD-related lines (core math library)
- SSE2 baseline, optional SSE4.1, AVX, AVX2
- Key files: `UnrealMathSSE.h`, `UnrealPlatformMathSSE.h`, `UnrealPlatformMathSSE4.h`, `AES.cpp` (133 matches), `SecureHash.cpp` (109 matches), `radfft.cpp` (154 matches)
- Critical: Must support SSE/AVX intrinsics from day one
- ~20 files in Engine/Source/Runtime
- Concentrated in: crash handling (`WindowsPlatformCrashContext.cpp`), D3D11/D3D12 device creation (delay-load exception handling), thread management (`WindowsRunnableThread.cpp`), stack walking (`WindowsPlatformStackWalk.cpp`), rendering thread (`RenderingThread.cpp`), shader compilation (`ShaderCore.cpp`), race detector (`RaceDetectorWindows.cpp`)
- Important for Windows stability, but not in hot paths
| Feature | Files | Usage Level | Notes |
|---|---|---|---|
| Concepts/requires | ~95+ | Critical | UE defines 20+ concepts in Core/Public/Concepts/; used in MassEntity, TypedElementFramework, Chaos, CoreUObject. UE_REQUIRES macro wraps requires with SFINAE fallback. |
| `if constexpr` | ~91+ | Critical | Pervasive: TypedElementFramework, SlateCore, RenderCore, MovieScene, Net/Iris, VerseCompiler, VulkanRHI, DelegateSignatureImpl (72 uses), Array.h (31 uses) |
| Structured bindings | ~33+ | Moderate | RenderGraphBuilder, ActorDescContainer, MaterialExpressions, VerseVM |
| `consteval` | ~6 | Rare | Via UE_IF_CONSTEVAL macro (expands to `if consteval` or `__builtin_is_constant_evaluated()`); Platform.h, StringView.h, NameTypes.h |
| `constinit` | ~14 | Rare | Mostly CoreUObject: UObjectGlobals.h, Object.h, Class.h, ObjectMacros.h |
| Three-way comparison (`<=>`) | ~10 | Rare | GuardedInt.h, GenericPlatformMath.h, EdGraphPin.h, RenderGraphUtils.h |
| Designated initializers | ~6 | Rare | Net/Iris, AppleDynamicRHI, ReplicationSystem |
| `using enum` | ~4 | Rare | NavigationSystem.h, ISpatialAcceleration.h |
| User-defined literals | ~4 | Rare | Delegate.h (`operator""_intseq`), StringView.h (`operator""_PrivateSV`) |
| Fold expressions | ~1 | Rare | Solaris/uLangCore Cases.h |
| Lambdas with template params (`[]<`) | ~5 | Rare | Core tests, VerseVM, GeometryCore |
| CTAD (deduction guides) | ~2 | Rare | StringBuilderTest.cpp only |
| `char8_t` | ~6 | Rare | VerseGrammar.h, GenericPlatform.h, StringView.h, StringConv.h |
| Coroutines | 0 | Not used | Zero in Engine; only in ThirdParty docs/comments |
| C++20 modules | 0 | Not used | Zero usage |
| `std::ranges` / `std::views` | 0 | Not used | UE uses custom iterators |
| `std::format` | 0 | Not used | UE uses FString::Printf / fmt |
| `std::span` | 0 | Not used | UE uses TArrayView / FMemoryView |
| `std::expected` | 0 | Not used | Only in ThirdParty (WebRTC) |
| Extension | Usage | Priority |
|---|---|---|
| `__declspec(dllexport/dllimport)` | Pervasive via `*_API` macros | Critical |
| `__forceinline` | 200+ files via `FORCEINLINE` macro | Critical |
| `__declspec(noinline)` | WindowsPlatform.h (FORCENOINLINE), MemPro, AutoRTFM | High |
| `__declspec(align(N))` | ~8 files (mostly third-party); `alignas` preferred (~75 files) | Medium |
| `__declspec(thread)` | syms_base.h, RadAudio, MemPro, FramePro | High |
| `__declspec(selectany)` | Platform.h (UE_SELECT_ANY), RadAudio, BinkAudio | High |
| `__declspec(empty_bases)` | WindowsPlatform.h | Medium |
| `__declspec(noreturn)` | MicrosoftPlatformCodeAnalysis.h | Medium |
| `__declspec(restrict)` / `noalias` | MemoryArena.h, MemPro | Low |
| `__declspec(code_seg)` | WindowsPlatform.h | Low |
| `__declspec(allocate)` | syms_base.h (`.roglob`), RaceDetector (`.CRT$XCT`) | Medium |
| `__declspec(safebuffers)` | Instrumentation/Defines.h | Low |
| `__declspec(deprecated)` | LZ4 only | Low |
| `__declspec(novtable)` | Not used | Skip |
| `__declspec(property)` | Not used | Skip |
| `__pragma` | ~12 files; MSVCPlatformCompilerPreSetup.h (warning macros) | High |
| `__int64` | ~9 files (RadAudio, BinkAudio, SymsLib — third-party) | Medium |
| `__cdecl` / `__stdcall` | Common (Windows API, COM, CRT callbacks) | High |
| `__fastcall` / `__vectorcall` / `__thiscall` | Not used | Skip |
| `__assume` | ~6 files; `UE_ASSUME` macro in Platform.h | Medium |
| `_alloca` / `alloca` | ~15 files; stack allocation in Renderer, Core, BinkAudio | Medium |
| Intrinsic | Used | Key Locations |
|---|---|---|
| `_BitScanForward` / `_BitScanForward64` | Yes | LZ4, RadAudio, D3D12 |
| `_BitScanReverse` / `_BitScanReverse64` | Yes | Same |
| `_InterlockedCompareExchange*` | Yes | WindowsPlatformAtomics.h |
| `__debugbreak` | Yes | Solaris, RadAudio, VerseGrammar |
| `_ReturnAddress` / `_AddressOfReturnAddress` | Yes | MSVCPlatform.h, RaceDetector |
| `__cpuid` / `__cpuidex` | Yes | RadAudio cpux86.cpp, VVMAtomics.h |
| `_byteswap_ushort` / `_ulong` / `_uint64` | Yes | RadAudio, BinkAudio |
| Pattern | Usage | Notes |
|---|---|---|
| `#pragma once` | Dominant | Used in virtually all UE headers |
| `#pragma pack` | ~37 files | SymsLib, TraceLog, RenderCore, Oodle, D3D11/D3D12 |
| `#pragma warning` | ~80 files | MSVCPlatformCompilerSetup.h alone has ~97 uses |
| `#pragma comment(lib, ...)` | ~17 files | psapi, pdh, version, Shlwapi, Dbghelp, Ws2_32, d3d12, etc. |
| `__has_include` | ~60 files | Platform.h, ThirdParty (abseil, WebRTC, Vulkan, SDL3) |
| `__has_cpp_attribute` | ~60 files | Mostly ThirdParty (simde, hedley, abseil, Catch2) |
| `__VA_ARGS__` | 100+ files | PreprocessorHelpers.h, LogMacros, delegates, profiling |
| `__VA_OPT__` | ~10 files | Mostly ThirdParty (fmt, Catch2) |
| `_Pragma` | ~60 files | Mostly ThirdParty (simde, hedley, abseil, WebRTC) |
| Token pasting (`##`) | 60+ files | PreprocessorHelpers, Stats, Delegates, LogMacros |
Most template-heavy areas (in order of complexity):
- `Runtime/Core/Public/Templates/` — 80+ headers: traits, SFINAE, concepts, utilities
- `Runtime/Core/Public/Delegates/` — variadic delegate signatures, policy-based design
- `Runtime/Core/Public/Containers/` — TArray, TMap, TSet, allocators, iterators
- `Runtime/Core/Public/Misc/` — TVariant, TOptional, ExpressionParser, StringBuilder
- `Runtime/Core/Public/Concepts/` — 20+ C++20 concept definitions

Key patterns:
- `TEnableIf` / `std::enable_if_t` for SFINAE (with `UE_REQUIRES` C++20 fallback)
- `TAnd`, `TOr`, `TNot` for trait composition
- Heavy variadic template usage in delegates, tuples, structured logging
- `extern template` for compile-time control (StringFormatter, TokenStream, etc.)
- Explicit template instantiation for common char types (ANSICHAR, WIDECHAR)
- `decltype(auto)` for perfect forwarding (~20 files in Core)
- Variable templates (`TModels_V`, `TIsVariant_V`, etc.)
- `static_assert` — 100+ files in Runtime/Core
- Game editor builds ~190 game modules as DLLs (see `GameEditor.Target.cs`)
- Engine adds hundreds more (Core, Engine, RHI, Slate, etc.)
- Each module uses `*_API` macros for dllexport/dllimport
- `FModuleManager` handles dynamic loading
- Live Coding (Live++) patches running DLLs — limited to function body changes
- Circular dependencies: UE5 uses a two-step lib creation process for modules with circular dependencies — first create an import `.lib` (with exports only), then build the real DLL. forgecc must handle this in release mode; in JIT mode, circular dependencies are resolved naturally since all symbols live in one address space
- ~70+ files in Runtime use COM interfaces (IUnknown, HRESULT, etc.)
- Main areas: D3D11, D3D12, Windows WASAPI, WMF, TextStore, UIA
- `COMPointer.h` in Core provides UE's COM smart pointer
- `__uuidof` used for COM interface identification
- Must support: `IUnknown`, `HRESULT`, `__uuidof`, COM calling conventions
- 137 .rc files in Engine/Source (44 in Runtime/Launch, rest in ThirdParty)
- Primary: `Runtime/Launch/Resources/Windows/PCLaunch.rc`
- Supported natively via `forge-rc.exe` in the PoC (CLI-compatible with existing MSVC/LLVM `.rc` usage patterns)
- `VCToolChain.cs` (~3,979 lines) handles both MSVC and clang-cl modes
- `VCCompileAction.cs` structures compilation as action objects with all necessary metadata
- Clang-cl mode is the better integration point for forgecc
These features have zero usage in Engine/Source and can be safely excluded from the PoC without risk:
| Feature | Verified | Notes |
|---|---|---|
| C++20 Coroutines | Zero in Engine | Only in ThirdParty docs/comments (asio) |
| C++20 Modules | Zero | UE "modules" are build system concept |
| `std::ranges` / `std::views` | Zero | UE uses custom iterators/algorithms |
| `std::format` | Zero | UE uses FString::Printf, fmt |
| `std::span` | Zero | UE uses TArrayView, FMemoryView |
| `std::expected` | Zero in Engine | Only in ThirdParty (WebRTC) |
| `__declspec(novtable)` | Zero | |
| `__declspec(property)` | Zero | |
| `__fastcall` / `__vectorcall` / `__thiscall` | Zero | Only `__cdecl` and `__stdcall` used |
| WinRT (`<winrt/>`) | Zero | |
| WRL (`<wrl/>`) | Zero | |
Deterministic compilation is critical for the distributed content-addressed store. If two machines compile the same inputs and get different outputs, the store's content-addressing breaks — the same key would map to different values.
Reference: Deterministic builds with clang and lld
What forgecc must guarantee:
1. **No timestamp embedding:** `__TIME__`, `__DATE__`, `__TIMESTAMP__` must either be errors or expand to fixed values. The Game project should not use these (to verify).
2. **No non-deterministic iteration order:** All internal data structures that affect output must use deterministic iteration (sorted maps, not hash maps with random seeds). Rust's `HashMap` uses `RandomState` by default — forgecc uses `BTreeMap` or a deterministic hasher for any map that affects output ordering.
3. **No pointer-value-dependent output:** Addresses of AST nodes, types, etc. must not leak into output. Use indices or content hashes instead.
4. **Path independence:** Build outputs must not contain absolute paths (or must normalize them). This is needed for the distributed store — machine A and machine B may have different checkout paths.
   - Use relative paths internally, relative to the branch root (e.g., the repository checkout root), not relative to the current working directory
   - Normalize all paths to forward slashes and consistent casing
   - For debug info, use a configurable source root prefix
   - Different developers may have different branch root paths (e.g., `C:\src\project` vs. `D:\dev\project`) — the content-addressed store must produce identical hashes regardless of the absolute branch root location
5. **Thread-determinism:** Even though compilation is parallel, the output must be independent of thread scheduling. This means:
   - No output in "order threads finish"
   - Merge parallel results in a deterministic order (e.g., sorted by symbol name)
   - Deterministic scheduling is not required; only deterministic outputs and deterministic merge order are required
6. **No `__COUNTER__` across TUs:** `__COUNTER__` is inherently TU-local and deterministic within a TU, so it's fine. forgecc ensures it doesn't leak across TU boundaries.
6b. **Floating-point determinism:** Floating-point constant folding and `constexpr` evaluation must produce identical results across machines. This means:
   - All compile-time floating-point arithmetic uses IEEE 754 semantics with a fixed rounding mode (round-to-nearest-even), regardless of the host FPU state.
   - `-ffast-math` / `/fp:fast` flags affect codegen (what instructions are emitted) but not compile-time evaluation. The flag is part of the cache key, so different FP modes produce different hashes — but the same flag always produces the same output.
   - Cross-compilation between x86 (80-bit extended-precision intermediates) and ARM (64-bit double intermediates) can produce different constant-folding results. Since forgecc targets Windows x86-64 for the PoC, this is not an immediate concern, but the `constexpr` evaluator should use strict 64-bit double semantics (no extended precision) to be forward-compatible with cross-machine distributed builds.
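The sorted-map rule (point 2) can be illustrated with a minimal Rust sketch. The function name and symbol-table shape are hypothetical, and `DefaultHasher` merely stands in for a content hash like blake3; the point is that a `BTreeMap` iterates in key order, so the digest is independent of insertion order, whereas a default `HashMap`'s iteration order is randomized per process.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a symbol table deterministically: BTreeMap guarantees iteration in
// sorted key order, so two machines inserting in different orders still
// feed identical byte sequences to the hasher.
fn hash_symbol_table(symbols: &BTreeMap<String, u64>) -> u64 {
    let mut h = DefaultHasher::new();
    for (name, addr) in symbols {
        name.hash(&mut h);
        addr.hash(&mut h);
    }
    h.finish()
}
```

The same guarantee would not hold if the map were a `HashMap` and its entries were hashed in iteration order, which is exactly the failure mode the rule above forbids.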
Pipeline computation determinism:
The points above cover output determinism (what bytes end up in the store). But the memoization system also requires that intermediate computations within the pipeline are deterministic — if a pipeline stage produces a different intermediate result for the same inputs, downstream cache keys diverge and memoization breaks. Every stage must be a pure function of its inputs:
7. **Overload resolution:** Given the same candidate set and argument types, overload resolution must always select the same winner. This is naturally deterministic in the C++ standard (the rules are fully specified), but the implementation must not depend on the order candidates are discovered. If candidates are collected into a set, that set must be ordered deterministically (e.g., by declaration source location or by a content hash) before ranking.
8. **Template instantiation order:** The order in which templates are instantiated must not affect the result of any individual instantiation. In standard C++, this is guaranteed (instantiations are independent), but implementation shortcuts — such as caching a partially-computed type and reusing it before it is complete — can introduce order-dependence. Each instantiation must see a fully consistent view of the type system.
9. **Name lookup and ADL:** Unqualified name lookup and argument-dependent lookup must produce the same result regardless of which TU triggered the lookup first. Since forgecc caches lookup results, the cache key must include all inputs that affect the result (the name, the argument types, the set of visible declarations in associated namespaces). Two lookups with the same key must return the same result.
10. **Implicit special member generation:** Whether a class's copy constructor, move constructor, destructor, etc. are defaulted, deleted, or trivial must be computed identically regardless of which TU first queries the property. The result depends only on the class definition and its members' properties — not on when or where the query happens.
11. **Constant evaluation:** `constexpr` and `consteval` evaluation must produce identical results for identical inputs. Since forgecc uses JIT for constant evaluation (§4.6a), the JIT'd code must be compiled deterministically (same machine code for the same ForgeIR), and the evaluation must not depend on host-specific state (e.g., stack addresses, allocation addresses). The sandboxed allocator for `constexpr new`/`delete` must assign deterministic addresses.
12. **SFINAE and concept satisfaction:** Substitution failure detection and concept constraint evaluation must be deterministic. Given the same template arguments and the same set of visible declarations, SFINAE must fail or succeed identically, and concept satisfaction must produce the same boolean result. This is naturally the case if name lookup and overload resolution are deterministic (points 7–9).
13. **Diagnostic emission:** Diagnostics (warnings, errors) must be emitted in a deterministic order. If two warnings are generated in parallel, they must be sorted before output. Diagnostic content must not include non-deterministic information (pointer addresses, thread IDs).
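The candidate-ordering requirement in point 7 can be sketched in a few lines of Rust. The `Candidate` type and its fields are illustrative, not forgecc's real Sema data structures; the technique shown is the one described above: sort discovered candidates by a stable key (declaration location) before ranking, so any tie-break is independent of discovery order.

```rust
// Minimal sketch of deterministic overload ranking under assumed types.
#[derive(Clone, Debug, PartialEq)]
struct Candidate {
    name: &'static str,
    file: &'static str, // declaration location: the stable pre-sort key
    line: u32,
    rank: u32, // lower = better match after overload ranking
}

fn resolve(mut candidates: Vec<Candidate>) -> Candidate {
    // Deterministic pre-ordering by declaration source location.
    candidates.sort_by_key(|c| (c.file, c.line));
    // Iterator::min_by_key returns the first of equally-minimum elements,
    // which is now a fixed choice regardless of discovery order.
    candidates.into_iter().min_by_key(|c| c.rank).unwrap()
}
```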
Enforcing referential transparency in code:
Deterministic intent is not sufficient. forgecc enforces stage purity with hard code-structure constraints:
1. **Pure stage boundary:** Every memoized stage is a pure function: `StageOutput = run(StageInput)`. Every dependency that can affect output is present in `StageInput` and therefore in the content hash key.
2. **Core/runtime split:** Each stage is split into:
   - `*-core` module/crate: pure computation only
   - `*-runtime` module/crate: I/O, networking, filesystem, process state
   The runtime layer constructs `StageInput`; the core layer computes `StageOutput`.
3. **No ambient state in core:** Core code does not read wall clock, RNG, env vars, filesystem, network, current directory, locale, or global mutable singletons.
4. **Disallowed stateful types in core:** Core code does not accept or store shared mutable handles such as `Arc<Mutex<_>>`, `Arc<RwLock<_>>`, `DashMap`, interior-mutable globals, or atomics that influence semantic output. `Arc<T>` is allowed only for immutable `T`.
5. **Type-level API gating:** Impure service types (`Store`, `Clock`, `NetClient`, `Env`, `Logger` with side effects) are not exposed to core crates. If a function can observe mutable external state, it is not part of a memoized core stage.
6. **Lint and CI gates:** Pure crates run deny-by-default lint rules for disallowed imports/types and fail CI on violations. Determinism CI replays the same action twice in separate processes and compares intermediate and final artifact hashes.
7. **Fail-closed policy:** If a stage cannot prove complete input closure, the stage is marked non-cacheable and emits a diagnostic in verification mode. forgecc never silently caches impure computations.
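The pure-stage/memoizer split can be sketched as follows. `Memoizer`, `content_key`, and `pure_stage` are illustrative names under assumed types, not forgecc's real API, and `DefaultHasher` stands in for blake3. The runtime side owns the cache and counts misses; the core side is a pure function of its input, which is what makes caching by input hash sound.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Content hash of the full stage input (stand-in for blake3).
fn content_key<T: Hash>(input: &T) -> u64 {
    let mut h = DefaultHasher::new();
    input.hash(&mut h);
    h.finish()
}

// *-runtime side: owns the cache, performs lookups, counts misses.
struct Memoizer {
    cache: BTreeMap<u64, String>,
    misses: u32,
}

impl Memoizer {
    fn run_stage(&mut self, input: &str) -> String {
        let key = content_key(&input);
        if let Some(out) = self.cache.get(&key) {
            return out.clone(); // hit: the pure stage is skipped entirely
        }
        self.misses += 1;
        let out = pure_stage(input);
        self.cache.insert(key, out.clone());
        out
    }
}

// *-core side: pure function of its input — no I/O, clocks, or globals.
fn pure_stage(input: &str) -> String {
    input.to_uppercase()
}
```

Running the same input twice performs the computation only once; only the runtime layer ever touches the cache.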
Verification: The daemon has a `--verify-determinism` mode that compiles everything twice and asserts that all content hashes match — both final outputs and intermediate artifacts at every pipeline stage. This catches determinism bugs early, whether they originate in output formatting or in pipeline computations.
Even in JIT mode, forgecc needs .lib parsing for symbol validation before execution.
The daemon must know what symbols are available from external DLLs (kernel32, d3d12,
ucrt, vcruntime, etc.) to validate that all references can be resolved.
JIT flow:
1. Parse import libraries (`.lib` files) at build start → build external symbol table
2. During compilation, validate that external references exist in the symbol table
3. At JIT execution time, `LoadLibrary` the actual DLLs into the target process
4. `GetProcAddress` to resolve each import to a concrete address
5. Patch call sites to point to the resolved addresses
`forge-link/imports.rs` is therefore needed for both JIT and release mode.
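Steps 1–2 of the flow above reduce to a lookup structure like the following sketch. `ExternalSymbolTable` is an illustrative name, not forge-link's real API; it models the table built from parsed import libraries and the validation pass that runs before any code is mapped into the target process.

```rust
use std::collections::BTreeMap;

// External symbol table built from parsed import .lib files (hypothetical
// shape): DLL name -> symbols that DLL exports.
struct ExternalSymbolTable {
    exports: BTreeMap<String, Vec<String>>,
}

impl ExternalSymbolTable {
    // Validate that every external reference in compiled code resolves to
    // some known export; in the real flow a failure here would surface as a
    // compile-time diagnostic, before JIT execution begins.
    fn validate(&self, references: &[&str]) -> Result<(), String> {
        for r in references {
            let found = self
                .exports
                .values()
                .any(|syms| syms.iter().any(|s| s == r));
            if !found {
                return Err(format!("unresolved external symbol: {r}"));
            }
        }
        Ok(())
    }
}
```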
The C++ standard says initialization order of globals across TUs is unspecified (the "static initialization order fiasco"). Within a single TU, it's top-to-bottom order of definition. Across TUs, it's implementation-defined.
In practice, MSVC/LLD orders them by the order .obj files appear on the link
command line, and within each .obj by the order of CRT initializer entries in
.CRT$XCU sections. Well-written C++ code (and UE5) should not depend on cross-TU
init order.
forgecc's approach:
- Collect all `.CRT$XCU` initializer entries from compiled TUs
- Execute them before `main()` / `WinMain()`
- Order can be arbitrary, or match the order UBT sends compile actions (for compatibility)
- This is straightforward — just collect function pointers and call them
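The collect-and-call step is small enough to sketch directly. The names below (`Initializer`, `run_static_initializers`, the two stand-in TU initializers) are illustrative; the point is that the loader holds a flat list of function pointers gathered from `.CRT$XCU` records and invokes them in one fixed order before handing control to `main()`.

```rust
// One entry per .CRT$XCU record; the Vec parameter is just an observable
// side-effect channel for this sketch.
type Initializer = fn(&mut Vec<&'static str>);

fn run_static_initializers(inits: &[Initializer], log: &mut Vec<&'static str>) {
    // Fixed iteration order = deterministic init order (e.g., the order
    // UBT sent the compile actions).
    for init in inits {
        init(log);
    }
}

// Stand-ins for two TUs' dynamic initializers.
fn init_tu_a(log: &mut Vec<&'static str>) { log.push("tu_a"); }
fn init_tu_b(log: &mut Vec<&'static str>) { log.push("tu_b"); }
```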
When JIT-compiling code that uses __declspec(thread), the OS doesn't know about
the new TLS slots because the module wasn't loaded via LoadLibrary. This is the
same problem encountered in the
in-process clang branch.
The solution is to call the undocumented ntdll!LdrpHandleTlsData to register TLS
data for JIT-compiled modules. This approach has been proven to work in the in-process
clang work.
Alternative approaches (if LdrpHandleTlsData proves fragile across Windows versions):
- Convert `__declspec(thread)` to explicit `TlsAlloc` / `TlsGetValue` / `TlsSetValue` calls during IR lowering
- Use Fiber-Local Storage (FLS) API as a fallback
For the PoC, the LdrpHandleTlsData approach is the most pragmatic since it's
already proven.
SEH and C++ exceptions require .pdata/.xdata sections and runtime support. In JIT
mode, forgecc registers unwind info with RtlAddFunctionTable or
RtlInstallFunctionTableCallback for each JIT-compiled function.
This is well-understood territory — LLVM's ORC JIT does the same thing. The key is
to call RtlAddFunctionTable after mapping code sections and before execution.
In JIT mode, vtables must be laid out correctly and patched when classes are recompiled. If a class layout changes (new virtual function, reordered members), all vtable references must be updated. This requires:
- Tracking which vtables reference which functions
- When a class is recompiled with a layout change, update the vtable and all objects that point to it
ABI-shape updates (layout/vtable/base-class changes) — practical strategy:
- ABI epoch per type: Each class layout has a stable shape hash (ABI epoch). Recompiled code is tagged with the epoch it expects.
- Dual-code coexistence: Old and new code may coexist temporarily. Call boundaries crossing epochs are mediated through generated thunks/adapters where safe.
- Object migration path (future):
- Generated field-wise migrators for trivially mappable cases
- Optional user callbacks for custom migration/serialization of complex state
- Heap objects can be migrated at safepoints; stack objects are not migrated
- Safety rule: If a safe migration/adaptation path cannot be proven, force a process restart. Correctness beats hot-reload convenience.
For the PoC, the default policy is conservative: restart on ABI-shape change. This keeps behavior correct while body-only edits remain hot-patchable.
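The conservative PoC policy can be expressed as a small epoch check. All names and fields below are illustrative (not forgecc's real types), and `DefaultHasher` stands in for the stable content hash of the layout: compare the live type's shape hash with the one the recompiled code expects, hot-patch only on a match, otherwise restart.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stable shape hash of a class layout: the "ABI epoch" from the text.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct AbiEpoch(u64);

fn shape_hash(field_sizes: &[usize], vtable_slots: usize) -> AbiEpoch {
    let mut h = DefaultHasher::new();
    field_sizes.hash(&mut h);
    vtable_slots.hash(&mut h);
    AbiEpoch(h.finish())
}

#[derive(PartialEq, Eq, Debug)]
enum PatchDecision {
    HotPatch, // body-only edit: same layout, safe to patch in place
    Restart,  // ABI-shape change: fall back to a process restart
}

// PoC policy: restart on any shape change; no migration attempted.
fn decide(live: AbiEpoch, recompiled: AbiEpoch) -> PatchDecision {
    if live == recompiled { PatchDecision::HotPatch } else { PatchDecision::Restart }
}
```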
Invariants (must always hold):
- Epoch-correct dispatch: Every call edge and vtable slot dispatches to code that expects the same ABI epoch as the object/signature at that boundary.
- No silent reinterpretation: An object of old layout is never treated as the new layout without an explicit proven adapter or migration.
- Stack safety first: Stack-resident objects are not migrated in-place in PoC mode.
- Deterministic fallback: If invariants cannot be proven, patch is rejected and a restart is required.
Preconditions for future in-place migration:
- The type delta is classified as migratable (e.g., field-preserving or explicitly mapped).
- A generated migrator or user-provided callback exists and type-checks against both epochs.
- Migration runs at a safepoint where target objects are discoverable and quiescent.
- All cross-epoch boundaries are covered by verified adapters (or explicitly blocked).
Cross-epoch ABI adaptation note:
Keeping old and new code alive simultaneously can be safe only with explicit boundary adapters. For example, old-layout pointers passed into new code require a thunk that either (a) migrates to new layout before call, or (b) routes to old-epoch implementation. Absent such an adapter, the call is invalid and must be rejected/restarted.
The daemon's JIT architecture enables a novel form of profile-guided optimization: live PGO injection without the traditional profile-collect-rebuild cycle.
Concept: Since the daemon controls code execution and can re-JIT functions at any time, it can collect profiling data from the running application and use it to recompile hot functions with better optimization decisions — all while the application is running.
Implementation approach for the PoC:
-
Instrumentation: The daemon instruments JIT-compiled code with lightweight counters (branch frequencies, call counts) during initial compilation. This adds minimal overhead (~2-5%) since counters are just memory increments.
-
Profile collection: Periodically (or on demand), the daemon reads the counters from the target process. Alternatively, use ETW/xperf to collect hardware performance counters externally without any code instrumentation.
-
Re-JIT with profile data: Feed the profile data back into the optimizer (even the minimal pass manager). For the PoC, this could be as simple as:
- Reorder basic blocks in hot functions based on branch frequencies
- Inline small, frequently-called functions
- Improve register allocation hints for hot paths
-
Hot-swap: Replace the old function code with the optimized version, exactly like a source-code change would trigger a re-JIT.
Why this is simpler than traditional PGO: Everything stays in memory — no PGO file format needed, no rebuild step, no profile merging. The daemon has the profile data and the compiler in the same process. This is a natural extension of the JIT architecture.
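Step 1's counters can be sketched as plain atomics. The `FunctionProfile` type is illustrative, not forgecc's real instrumentation: each instrumented site does a single relaxed atomic increment (the source of the quoted ~2–5% overhead), and the daemon later reads the counters to pick re-JIT candidates.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-function counters written by JIT-emitted instrumentation stubs.
struct FunctionProfile {
    call_count: AtomicU64,
    branch_taken: AtomicU64,
    branch_not_taken: AtomicU64,
}

impl FunctionProfile {
    const fn new() -> Self {
        Self {
            call_count: AtomicU64::new(0),
            branch_taken: AtomicU64::new(0),
            branch_not_taken: AtomicU64::new(0),
        }
    }
    // Emitted at function entry.
    fn on_call(&self) {
        self.call_count.fetch_add(1, Ordering::Relaxed);
    }
    // Emitted at each instrumented branch site.
    fn on_branch(&self, taken: bool) {
        let c = if taken { &self.branch_taken } else { &self.branch_not_taken };
        c.fetch_add(1, Ordering::Relaxed);
    }
    // Read by the daemon when deciding whether to reorder blocks or re-JIT.
    fn taken_ratio(&self) -> f64 {
        let t = self.branch_taken.load(Ordering::Relaxed) as f64;
        let n = self.branch_not_taken.load(Ordering::Relaxed) as f64;
        if t + n == 0.0 { 0.0 } else { t / (t + n) }
    }
}
```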
Yes, essentially. The forgecc daemon is a C++ virtual machine in the same sense that the JVM is a Java VM or the CLR is a .NET VM:
- Source code goes in, the daemon compiles and executes it
- The application can only run while the daemon is running
- Code can be hot-patched at runtime
- The "VM" has full knowledge of types, symbols, and source locations (enabling the DAP debugger)
- Garbage collection is not needed (C++ manages its own memory)
The key difference: unlike JVM/CLR, forgecc compiles to native x86-64 code (not bytecode), so there's no interpretation overhead. It's more like a "native JIT runtime for C++" — similar to what Julia does for its language.
This framing helps explain the architecture: forgecc is to C++ what the JVM is to Java — a runtime environment that compiles, loads, executes, and debugs code, with the added benefit of persistent caching and distributed compilation. The difference is that C++ code runs at full native speed with zero runtime overhead.
These are out of scope for the proof-of-concept but are natural extensions of the architecture. The design should not preclude them.
Game studios target consoles alongside PC. The forgecc daemon architecture extends naturally to console development via a thin client on the devkit that communicates with the PC daemon over the network.
Developer PC Console Devkit (dev mode)
┌──────────────────────┐ ┌──────────────────────┐
│ forgecc daemon │ │ forgecc-client │
│ │ HTTP+msgpack │ (thin agent) │
│ - Cross-compiles │◄───────────────▶│ │
│ to console arch │ (same unified │ - Receives sections │
│ - Content store │ protocol §4.13)│ - Maps into process │
│ - Sema, IR, codegen │ │ - Patches functions │
│ - DAP server │ │ - Debug primitives │
│ │ │ - Sends back: │
│ Targets: │ │ - crash dumps │
│ - x86-64 (local PC) │ │ - perf counters │
│ - aarch64 (console) │ │ - stdout/stderr │
└──────────────────────┘ └──────────────────────┘
Key principles:
1. **All compilation stays on the PC.** The console client does NOT compile anything. Console devkits have limited CPU/RAM; the daemon already has all the infrastructure. The console client is purely a remote loader/patcher (~2,000–3,000 lines).
2. **The console client is a remote `forge-daemon/loader.rs`.** It does what the local JIT loader does, but over the network: receives compiled machine code sections, maps them into the running game process, patches function stubs, registers exception tables, and manages static initializers.
3. **The P2P protocol already supports this.** The console client is just another peer. It calls `GET /v1/artifact/{hash}` to fetch compiled sections. The only difference is that artifacts are compiled for a different target architecture. The content store key naturally includes the target triple, so `blake3(function_hash + "x86_64-pc-windows")` and `blake3(function_hash + "aarch64-ps5")` are different entries.
4. **Lazy compilation still works.** When the game on the console calls an uncompiled function stub, the stub traps to the console client, which sends a compile request back to the PC daemon over the network. Latency is higher (~10–50ms round trip over LAN) but acceptable for development iteration.
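The multi-target store key described above is a one-liner in spirit. In this sketch, `artifact_key` is an illustrative name and `DefaultHasher` stands in for blake3: the key covers both the function's content hash and the target triple, so the same function compiled for two targets occupies two distinct store entries.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Store key = hash(function content hash ‖ target triple); stand-in for
// the blake3-based scheme in the text.
fn artifact_key(function_hash: u64, target_triple: &str) -> u64 {
    let mut h = DefaultHasher::new();
    function_hash.hash(&mut h);
    target_triple.hash(&mut h);
    h.finish()
}
```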
What changes from the PC-only design:
- `forge-codegen` needs additional backends (aarch64 for PS5/Switch, x86-64 with a different ABI for Xbox). ForgeIR is target-independent; only the lowering changes. Each backend is a separate module (~8,000–12,000 lines per target).
- `forge-daemon/loader.rs` splits into a local loader (current) and a remote loader protocol. The console client implements the remote side.
- The content-addressed store handles multi-target artifacts transparently.
Debug info: DWARF vs. CodeView:
Console targets (PS5, Switch) use DWARF debug info instead of CodeView/PDB. However, this does not affect the DAP-first debugging strategy:
- In development mode, the daemon still acts as the DAP server. Debug info lives in Sema's internal format — it is never serialized to DWARF or CodeView during development. The daemon sends debug commands (set breakpoint, read memory, inspect variable) to the console client, which executes them via the console's debug API.
- DWARF vs. CodeView only matters for release builds, where `forge-link` would need to emit DWARF sections (`.debug_info`, `.debug_abbrev`, `.debug_line`, etc.) instead of CodeView/PDB. This is a separate code path in the linker, not in Sema or the DAP server.
- The codegen itself is unaffected — DWARF and CodeView are debug info formats, not code generation concerns. The same ForgeIR lowers to machine code regardless of which debug format will eventually be used.
Existing devkit agents:
Every console devkit already runs a debug agent that supports the operations forgecc-client would need:
| Console | Debug Agent | Protocol | Key Capabilities |
|---|---|---|---|
| PS5 | Target Manager + DECI5 | DECI5 over TCP | Memory read/write, breakpoints, module loading, process control, core dumps |
| Xbox Series X | XBDM / MSVSMon | TCP (ports 4020+) | Memory access, remote debugging, process attach, deployment |
| Nintendo Switch | GDB server (devkit) | GDB remote protocol | Breakpoints, memory inspection, step execution, variable inspection |
These agents already support memory read/write and process control — the exact primitives forgecc-client needs for JIT patching. If the console maker (Sony, Microsoft, Nintendo) wanted to support forgecc natively, they could extend their existing debug agent with:
- Section mapping API: Map arbitrary code sections into a running process's address space (similar to `VirtualAllocEx` on Windows)
- Function table registration: Register exception handling / unwind info for JIT'd code (equivalent to `RtlAddFunctionTable`)
- TLS registration: Register thread-local storage for dynamically loaded code
These are small extensions (~500–1,000 lines each) to agents that already handle memory writes and process control. The alternative is for forgecc-client to implement these on top of the existing memory read/write primitives, which is also feasible but less efficient.
During development, forgecc keeps everything in the content-addressed store and the daemon's memory. For shipping builds, artifacts must be extracted into standard formats:
Object file extraction (.obj / .o):
- The daemon can emit standard COFF `.obj` files (Windows) or ELF `.o` files (console/Linux) from its internal section records
- This enables interoperability with existing toolchains — e.g., link forgecc-compiled `.obj` files with MSVC's `link.exe` or console-specific linkers
- Useful for: CI/CD pipelines, build verification, mixing forgecc output with third-party prebuilt libraries
Executable generation (.exe / .dll / console executables):
- `forge-link` (Phase 8) handles PE/DLL generation for Windows
- Console targets would need additional linker output formats (ELF for PS5/Switch, XEX-like for Xbox)
- PDB generation (Windows) and DWARF emission (console/Linux) are only needed here
The key insight: release build generation is a batch export from the content store, not a separate compilation. All the compiled artifacts already exist in the store from development iteration. The release pipeline is:
- Collect all section/symbol/relocation records from the store for the target
- Apply release-mode optimizations (if any — beyond PoC scope)
- Run the linker to produce the final executable + debug info
- Sign/package for the target platform
This means a release build after a full development session could be very fast — most of the compilation work is already done and cached.
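The four pipeline steps above can be sketched as a single batch function over the store. This is a minimal illustration under invented names: `SectionRecord`, `ContentStore`, `link_image`, and `export_release` are hypothetical stand-ins for the store and linker interfaces defined elsewhere in this document, and the "linking" here is a placeholder concatenation, not a real PE/ELF writer.

```rust
/// Hypothetical record type: one compiled section from the content store.
struct SectionRecord {
    name: String,
    bytes: Vec<u8>,
}

/// Hypothetical store interface. Step 1: every section/symbol/relocation
/// record for the target already exists from development iteration.
trait ContentStore {
    fn records_for_target(&self, target: &str) -> Vec<SectionRecord>;
}

/// Step 3 stand-in: a real implementation would write PE/ELF headers,
/// resolve relocations, and emit PDB/DWARF here.
fn link_image(records: &[SectionRecord]) -> Vec<u8> {
    let mut image = Vec::new();
    for rec in records {
        image.extend_from_slice(&rec.bytes);
    }
    image
}

fn export_release<S: ContentStore>(store: &S, target: &str) -> Vec<u8> {
    let records = store.records_for_target(target); // 1. collect records
    // 2. release-mode optimizations would run here (beyond PoC scope)
    let image = link_image(&records); // 3. link
    // 4. sign/package for the target platform (platform-specific, elided)
    image
}

fn main() {
    // Trivial in-memory store demonstrating the batch-export shape.
    struct Memory(Vec<SectionRecord>);
    impl ContentStore for Memory {
        fn records_for_target(&self, _target: &str) -> Vec<SectionRecord> {
            self.0
                .iter()
                .map(|r| SectionRecord { name: r.name.clone(), bytes: r.bytes.clone() })
                .collect()
        }
    }
    let store = Memory(vec![
        SectionRecord { name: ".text".into(), bytes: vec![0x90] },
        SectionRecord { name: ".data".into(), bytes: vec![1, 2] },
    ]);
    assert_eq!(export_release(&store, "GameEditor"), vec![0x90, 1, 2]);
}
```

The point of the shape, as the text notes, is that `export_release` does no compilation at all: it only reads artifacts that already exist.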
A Rust front-end for forgecc is a natural long-term extension. The motivations are complementary:
- Bootstrapping: forgecc is written in Rust. A Rust front-end lets forgecc compile itself, closing the bootstrap loop and providing the ultimate dogfooding scenario — every improvement to the compiler immediately benefits its own build.
- Mixed C++/Rust codebases: Game engines are increasingly adopting Rust for systems-level code (gameplay scripting, tooling, asset pipelines). A single daemon that compiles both C++ and Rust, sharing the same content-addressed store and memoization infrastructure, eliminates the overhead of coordinating two separate toolchains. Cross-language LTO and unified debug info become possible.
- Faster Rust compilation: Rust's compile times are a well-known pain point. forgecc's memoization, daemon architecture, and distributed caching apply equally to Rust — the front-end changes, but the entire backend (ForgeIR, codegen, store, JIT, DAP debugger) is reused.
What a Rust front-end shares with the C++ front-end:
| Component | Shared? | Notes |
|---|---|---|
| Content-addressed store | 100% | Language-agnostic |
| Memoizer | 100% | Same hash-in/hash-out model |
| ForgeIR | 100% | Both languages lower to the same IR |
| x86-64 codegen | 100% | Same backend |
| Runtime loader / JIT | 100% | Same patching mechanism |
| DAP debugger | 100% | Same debug info model |
| Linker | 100% | Same COFF/PE output |
| Daemon / HTTP server | 100% | New /v1/compile-rs endpoint |
| Distributed P2P | 100% | Same artifact format |
| Lexer | 0% | Completely different token set |
| Parser | ~10% | Different grammar; some AST node shapes overlap (functions, structs, enums) |
| Sema | ~20% | Type checking, trait solving, and borrow checking are unique to Rust; name resolution and overload-like dispatch (trait method selection) share concepts |
What a Rust front-end requires that the C++ front-end does not:
- Borrow checker — Rust's defining feature. Implementing Polonius/NLL-style borrow checking is a multi-year effort. The borrow checker is a dataflow analysis over MIR (Mid-level IR), which maps to a ForgeIR analysis pass. Estimated: ~15,000–20,000 lines.
- Trait solver — Rust's trait system (coherence, specialization, auto traits, negative impls, associated types with bounds) is comparable in complexity to C++ template instantiation + overload resolution combined. Estimated: ~8,000–12,000 lines.
- Proc macro execution — Rust's procedural macros are arbitrary Rust code that runs at compile time. Two options: (a) run `rustc`-compiled proc macro `.so`/`.dll` files directly (pragmatic, requires ABI compatibility with `rustc`), or (b) compile proc macros with forgecc itself (requires the Rust front-end to be self-hosting first). Option (a) is the practical starting point.
- Pattern matching exhaustiveness — Rust requires exhaustive pattern matching. This is a well-understood algorithm (Maranget's approach) but still ~2,000–3,000 lines.
- Lifetime elision and inference — Implicit lifetime rules that are unique to Rust. ~1,000–2,000 lines.
Estimated scope: ~30,000–40,000 lines for a Rust front-end (parser + sema + borrow checker + trait solver), on top of the shared infrastructure. This roughly doubles the front-end code but adds zero backend code.
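Of the items above, the exhaustiveness check is compact enough to sketch. The following is a minimal illustration of Maranget-style usefulness checking, deliberately simplified to a single algebraic type whose constructor signature (id, arity) is supplied up front; the real pass would run over typed patterns with per-type signatures, guards, ranges, and or-patterns. All names here are illustrative, not forgecc's actual API.

```rust
#[derive(Clone, PartialEq, Debug)]
enum Pat {
    Wild,                  // `_`
    Ctor(usize, Vec<Pat>), // constructor id + sub-patterns
}

/// Keep rows compatible with constructor `c`, unfolding its arguments.
fn specialize(matrix: &[Vec<Pat>], c: usize, arity: usize) -> Vec<Vec<Pat>> {
    matrix
        .iter()
        .filter_map(|row| match &row[0] {
            Pat::Ctor(id, args) if *id == c => {
                let mut r = args.clone();
                r.extend_from_slice(&row[1..]);
                Some(r)
            }
            Pat::Wild => {
                let mut r = vec![Pat::Wild; arity];
                r.extend_from_slice(&row[1..]);
                Some(r)
            }
            _ => None,
        })
        .collect()
}

/// Does some value match row `q` but no row of `matrix`?
/// `sig` lists every (ctor_id, arity) of the matched type.
fn useful(matrix: &[Vec<Pat>], q: &[Pat], sig: &[(usize, usize)]) -> bool {
    if q.is_empty() {
        // Base case: useful iff no earlier row remains.
        return matrix.is_empty();
    }
    match &q[0] {
        Pat::Ctor(c, args) => {
            let arity = sig.iter().find(|(id, _)| id == c).unwrap().1;
            let mut q2 = args.clone();
            q2.extend_from_slice(&q[1..]);
            useful(&specialize(matrix, *c, arity), &q2, sig)
        }
        Pat::Wild => {
            let seen: Vec<usize> = matrix
                .iter()
                .filter_map(|r| match &r[0] {
                    Pat::Ctor(c, _) => Some(*c),
                    Pat::Wild => None,
                })
                .collect();
            if sig.iter().all(|(c, _)| seen.contains(c)) {
                // First column covers every constructor: recurse per-ctor.
                sig.iter().any(|&(c, a)| {
                    let mut q2 = vec![Pat::Wild; a];
                    q2.extend_from_slice(&q[1..]);
                    useful(&specialize(matrix, c, a), &q2, sig)
                })
            } else {
                // Otherwise only wildcard rows can still match.
                let def: Vec<Vec<Pat>> = matrix
                    .iter()
                    .filter(|r| matches!(r[0], Pat::Wild))
                    .map(|r| r[1..].to_vec())
                    .collect();
                useful(&def, &q[1..], sig)
            }
        }
    }
}

/// A match is exhaustive iff a trailing wildcard arm would be dead.
fn exhaustive(arms: &[Pat], sig: &[(usize, usize)]) -> bool {
    let matrix: Vec<Vec<Pat>> = arms.iter().map(|p| vec![p.clone()]).collect();
    !useful(&matrix, &[Pat::Wild], sig)
}

fn main() {
    // Peano-style signature: Zero (id 0, arity 0), Succ (id 1, arity 1).
    let peano = [(0, 0), (1, 1)];
    // `Zero | Succ(_)` covers everything; `Zero | Succ(Zero)` misses Succ(Succ(_)).
    assert!(exhaustive(&[Pat::Ctor(0, vec![]), Pat::Ctor(1, vec![Pat::Wild])], &peano));
    assert!(!exhaustive(&[Pat::Ctor(0, vec![]), Pat::Ctor(1, vec![Pat::Ctor(0, vec![])])], &peano));
}
```

The ~2,000–3,000 line estimate comes from everything this sketch omits (integer ranges, slices, or-patterns, privacy/`#[non_exhaustive]`, diagnostics), not from the core recursion, which is this small.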
Phasing: The Rust front-end is strictly post-PoC. The implementation order:
- Rust lexer + parser (produces Rust AST)
- Name resolution + type checking (without borrow checker — produces typed IR)
- Lower to ForgeIR (at this point, Rust code compiles and runs but without borrow checking — similar to running with `unsafe` everywhere)
- Borrow checker (the hard part — but the code already compiles and runs, so this is a verification pass, not a blocking dependency)
- Proc macro support (load `rustc`-compiled proc macro crates)
- Self-hosting (forgecc compiles itself)
Precedent: The gccrs project (Rust front-end for GCC) has been in development
since 2014 and is still not production-ready, illustrating the difficulty. However,
forgecc has a structural advantage: the entire backend is already built and tested
by the time the Rust front-end starts. gccrs had to integrate with GCC's backend
simultaneously. forgecc's Rust front-end is purely additive — it produces ForgeIR,
and everything downstream works unchanged.
This section documents the results of a systematic cross-reference between this design document and N4950 (the C++23 working draft, which includes all C++20 features). The cross-reference was performed chapter-by-chapter against the standard's core language clauses (§5–§15). Items identified as gaps have been incorporated into the relevant sections above.
| N4950 Chapter | Status | Notes |
|---|---|---|
| §5 Lexical conventions | ✅ Covered | Added: raw strings, UDL tokens, hex floats, all encoding prefixes, phases of translation (line splicing, BOM, string concat) |
| §6 Basics (ODR, scope, lookup, types) | ✅ Covered | ODR handled by content-addressed store; scope/lookup in Sema; added two-phase lookup |
| §7 Expressions | ✅ Covered | Added: noexcept operator, brace-init lists, UDL expression resolution |
| §8 Statements | ✅ Covered | if constexpr, if consteval, range-for, structured bindings |
| §9 Declarations | ✅ Covered | Added: constinit, explicit(bool), extern "C", static_assert, alignas, #pragma pack effects, bit-fields, NSDMI |
| §10 Modules | ⬜ Not needed | UE5 has zero C++20 module usage |
| §11 Classes | ✅ Covered | Added: bit-fields, defaulted comparisons, aggregate init, NSDMI |
| §12 Overloading | ✅ Covered | Added: UDL operator resolution, built-in operator candidates, address of overloaded function |
| §13 Templates | ✅ Covered | Added: CTAD/deduction guides, template template params, class-type NTTPs, two-phase lookup, variable template instantiation |
| §14 Exception handling | ✅ Covered | SEH + C++ exceptions; added: noexcept as part of function type |
| §15 Preprocessing | ✅ Covered | Added: #pragma pack/warning, #line, #error, #warning, __has_builtin, __has_feature, feature-test macros, predefined macros |
| §16–§33 Library | N/A | forgecc links against MSVC STL; it parses STL headers but does not implement the library |
| Feature | N4950 Section | Reason |
|---|---|---|
| C++20 Modules | §10 | Zero usage in UE5; memoization replaces the build-time benefit |
| Coroutines | §7.6.2.4, §17.12 | Zero usage in UE5 Engine/Source/Runtime |
| `char8_t` as distinct type | §6.8.2 | Recognized by lexer; full type system support deferred (UE5 uses TCHAR/wchar_t) |
| Extended float types (`std::float16_t`, etc.) | §6.8.3 | Conditionally-unsupported; lexer recognizes suffixes but rejects them |
| `std::source_location` compiler magic | §17.8 | Requires `__builtin_source_location()`; deferred to MSVC STL compat phase |
These areas of the standard are disproportionately complex relative to their specification length and deserve extra attention during implementation:
- Overload resolution (§12.2) — 26 pages in the standard; the most intricate single algorithm in C++. Implicit conversion sequences, partial ordering of function templates, and concept-constrained candidates all interact.
- Template argument deduction (§13.10.3) — 21 pages; deduction from function calls, partial ordering, CTAD, and SFINAE rules are deeply intertwined.
- Constant expressions (§7.7) — 11 pages of rules defining what is and isn't allowed in constexpr context. The JIT approach (§4.6a) simplifies this significantly, but UB detection still requires careful instrumentation.
- Initialization (§9.4) — 17 pages covering direct-init, copy-init, list-init, aggregate-init, reference-init, and their interactions with constructors, conversion operators, and `std::initializer_list`. This is where many real-world compilation failures occur.
- Name lookup (§6.5) — 13 pages; unqualified, qualified, ADL, and the interaction with `using` declarations/directives. Two-phase lookup in templates adds another layer.
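To make the overload-resolution complexity concrete: the coarse skeleton of §12.2 is "rank each argument's implicit conversion sequence, then prefer the candidate that is no worse on every argument and strictly better on at least one." The sketch below models only that skeleton in Rust; the standard's many tie-breakers (template vs. non-template, constraint subsumption, derived-to-base distance within a rank, reference binding) are omitted, and `IcsRank` and `better_candidate` are names invented here.

```rust
/// Coarse implicit-conversion-sequence categories, ordered best-to-worst.
/// Derived `Ord` uses declaration order, so `Exact < Promotion < ...`.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Debug)]
enum IcsRank {
    Exact,       // identity, lvalue-to-rvalue, qualification conversion
    Promotion,   // e.g. short -> int, float -> double
    Conversion,  // e.g. int -> long, Derived* -> Base*
    UserDefined, // converting constructor or conversion operator involved
    Ellipsis,    // matched a `...` parameter
}

/// Simplified "better function" test over per-argument ICS ranks:
/// no worse everywhere, strictly better somewhere.
fn better_candidate(a: &[IcsRank], b: &[IcsRank]) -> bool {
    a.iter().zip(b).all(|(x, y)| x <= y) && a.iter().zip(b).any(|(x, y)| x < y)
}

fn main() {
    use IcsRank::*;
    // For the call f(1, 2.0): f(int, double) beats f(long, double),
    // since int -> long is a conversion while int -> int is exact.
    assert!(better_candidate(&[Exact, Exact], &[Conversion, Exact]));
    // Neither candidate dominates: the call is ambiguous (an error).
    assert!(!better_candidate(&[Exact, Conversion], &[Conversion, Exact]));
    assert!(!better_candidate(&[Conversion, Exact], &[Exact, Conversion]));
}
```

The 26 pages of §12.2 are almost entirely about computing the ranks correctly and breaking the ties this sketch declares ambiguous.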
The proof-of-concept is successful if:
- ✅ Game editor builds and runs via the forgecc daemon
- ✅ Incremental rebuild after a single-line change completes in < 2 seconds (for a project of Game's scale)
- ✅ Two machines share compilation work via the P2P network
- ✅ Full rebuild time is within 3x of Clang `-O0` (acceptable for unoptimized PoC)
- ✅ JIT mode: change a function, see the result in the running editor within seconds
- ✅ No user-managed PCH, unity builds, or module maps — caching is fully automatic
- ✅ No Sema source file exceeds 1,000 lines
- ✅ All compilation stages support parallelism
- ✅ Debugging works via DAP in VS Code (breakpoints, stepping, variable inspection)
- ✅ Compilation is fully deterministic (`--verify-determinism` passes)
- ✅ The application only runs through the daemon — no standalone .EXE needed for dev
- ✅ All incremental target codebases (§8a) levels I–VI pass `forge-verify` (§8b)
- ✅ `consteval` functions execute via in-process JIT, not an AST interpreter