Kernel embeddings #98
Draft
badnikhil wants to merge 6 commits into
using "stringImportPaths" for better readability
Contributor
Author
`ensureInit` in the buffers is not strictly required, but we can't know what a user will do, so I added a check.
 * runtime.
 *
 * Example:
 *     Program p = Program.fromEmbedded!(import("kernel.ptx"))();
Collaborator
It should be possible to write `Program.fromEmbedded!"kernel.ptx"` and have it do a mixin + import internally.
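A minimal sketch of what the suggested call shape could look like, assuming a simplified stand-in `Program` struct rather than the real DCompute type. It takes the file name as a compile-time string and uses only a string import; the mixin part of the suggestion is left out here. The file has to be reachable through the string import paths (`-J` or `"stringImportPaths"`) of the compilation that instantiates the template, e.g. `ldc2 -J. app.d` with a `kernel.ptx` in the current directory.

```d
// Hypothetical stand-in for DCompute's Program; the field and method
// names are illustrative assumptions, not the library's actual API.
struct Program
{
    string ptx; // PTX text embedded at compile time

    // The file name is a template value parameter, so callers write
    // Program.fromEmbedded!"kernel.ptx"() instead of passing import(...).
    static Program fromEmbedded(string path)()
    {
        // String import: evaluated when the template is instantiated,
        // using the string import paths of that compilation.
        enum data = import(path);
        return Program(data);
    }
}

void main()
{
    // Needs kernel.ptx on a string import path (e.g. -J.).
    auto p = Program.fromEmbedded!"kernel.ptx"();
    assert(p.ptx.length > 0);
}
```

The real implementation would hand `data` to the CUDA module loader instead of storing it in a field.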
Summary
With this PR, all DCompute runtime infrastructure is managed lazily and transparently behind the scenes. Developers only need to write their host code, allocate memory (`Buffer`), and launch their compute kernels directly using `launch!k`.

Major Changes
1. Lazy Static Init Runtime (`source/dcompute/driver/cuda/runtime.d`)
- A process-wide constructor (`shared static this()`) that initializes CUDA, discovers active GPUs, allocates the default `Context` (Device 0), and pushes it onto the context stack.
- A thread-local constructor (`static this()`) that ensures every thread gets a lock-free, dedicated `Queue` (`CUstream`) with zero resource contention.
- An `ensureInit()` guard as a defensive safety fallback for edge cases.

2. Context-Sensitive Compile-Time PTX Embedding (`source/dcompute/driver/cuda/package.d`)
- The `import()` statement sits inside the `launch!k` template definition.
- Because `launch!` is a template, it is instantiated inside the parent project's compilation context.
- This lets the `dcompute` library compile as a standard static library without requiring any local PTX files or string import flags, while seamlessly embedding the consumer project's custom PTX at compile time.

3. Defensive Safety Triggers (`source/dcompute/driver/cuda/buffer.d`)
- `ensureInit()` triggers inside both `Buffer!T` constructors.

4. dub.json update
- `"stringImportPaths": ["."]` or the `-J` flag should be used with the path where the PTX is generated.

Developer Workflow & Flow of State
1. Compilation Flow:
- `@compute` modules (e.g. `tests/kernel.d`) are compiled directly into PTX intermediate assembly (`kernels_cuda800_64.ptx`).
- `-J.` (the current directory) is passed to the host compilation.
- The host code invokes `launch!matmul`. The compiler processes `import("kernels_cuda800_64.ptx")`, embedding the GPU bytecode directly into your executable's text segment.

2. Execution Flow:
- When a `Buffer` is instantiated, the underlying static constructors initialize CUDA, assign the default device, push the GPU context, and initialize the active thread's CUDA stream.
- When `launch!` is executed, it checks if `Program.globalProgram` is initialized. Seeing it is null, it passes the embedded PTX string to `cuModuleLoadData`, registering your custom kernels in the GPU context.

Current State & Validation
All internal unit tests and client applications compile, link, and validate successfully in one command:
- `dub test --compiler=ldc2` completes and passes successfully.
- `dub run --force --compiler=ldc2` builds cleanly from scratch, embeds custom `matmul` kernels, executes them on the GPU, and validates output against host CPU matrices.
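
For reviewers unfamiliar with D module constructors, the initialization scheme described under Major Changes can be sketched roughly as follows. This is a simplified illustration that makes no real CUDA calls: `driverReady`, `threadQueueId`, and `nextQueueId` are hypothetical placeholders for the driver state, default context, and per-thread queue, and only `ensureInit` shares its name with the PR.

```d
import core.atomic : atomicOp;
import core.thread : Thread;

// Hypothetical stand-ins for the CUDA state; no driver calls are made.
__gshared bool driverReady;  // process-wide: driver and default context
shared int nextQueueId;      // counter used to hand out queue ids
int threadQueueId = -1;      // thread-local by default in D: one per thread

// Runs once per process before main(): the place where CUDA init,
// device discovery and the default context push would go.
shared static this()
{
    driverReady = true;
}

// Runs once in every thread as it starts: the place where the
// thread's own queue (stream) would be created.
static this()
{
    threadQueueId = atomicOp!"+="(nextQueueId, 1);
}

// Defensive guard for call sites that cannot be sure the
// constructors above have already run.
void ensureInit()
{
    if (!driverReady)
        driverReady = true;
    if (threadQueueId < 0)
        threadQueueId = atomicOp!"+="(nextQueueId, 1);
}

void main()
{
    ensureInit();
    int mainId = threadQueueId;

    int workerId;
    auto t = new Thread({
        ensureInit();
        workerId = threadQueueId; // this thread's own id
    });
    t.start();
    t.join();

    assert(driverReady);
    assert(mainId != workerId); // each thread got a distinct queue id
}
```

In this sketch the guard is a no-op on the normal path, which matches the intent of keeping `ensureInit()` only as a fallback.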