Design discussion for the texture / surface API in cuda.core — to settle the API shape and
naming before code review of the implementation. Reviewers asked for design sign-off in an issue
before we commit to a ~9k-line feature.
Name of Array. ✅ Decided — rename Array → CUDAArray.
This type is an opaque cudaArray_t — the GPU stores it in a scrambled, hardware-defined layout
with no linear pointer, so it cannot expose __cuda_array_interface__ / DLPack and cannot
share memory zero-copy with cupy / numba-cuda / torch. The name Array implies an n-dimensional
array that participates in that ecosystem — it can't. CuPy names the identical type CUDAarray,
and its whole cupy.cuda.texture module already matches this PR's surface 1:1.
Resolution: use CUDAArray — the PEP 8 CapWords form (deliberately differing from CuPy's exact CUDAarray casing to follow Python's class-naming standard). The name signals "CUDA texture/surface
backing store," not "n-dimensional array."
Open detail resolved: keep ArrayFormat (do not rename to CUDAArrayFormat). The sibling
enums in these modules — AddressMode, FilterMode, ReadMode — are all unprefixed, so ArrayFormat matches the established enum-naming pattern; and the "Array implies an
ndarray/DLPack participant" concern that motivated CUDAArray does not apply to a format enum
(nobody mistakes ArrayFormat for an n-dimensional array).
Interop path. ✅ Decided — ship only copy_from / copy_to; no allocation helper.
Zero-copy is impossible (opaque layout, no linear pointer), so copying is the only option —
this was purely about how polished the path is. The copy path to/from linear cuda.core Buffers already exists: copy_from / copy_to accept a device Buffer or a host
buffer-protocol object, in both directions. The only thing an extra helper would add is
allocating the linear Buffer for the caller, folding buf = mr.allocate(arr.size_bytes, stream=s) and
arr.copy_to(buf, stream=s) into a one-liner, i.e. ~2 lines of convenience.
Resolution: ship copy_from / copy_to only, and document the copy-only contract. We will
not add an allocating convenience helper now. It is purely additive and non-breaking, so we can
add one later if users request it.
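As a rough illustration of what the deferred helper would amount to, here is a minimal sketch. It runs on the host only: StubBuffer, StubMemoryResource, StubCUDAArray and to_linear_buffer are hypothetical stand-ins invented for this example, not cuda.core types. Only the allocate(size, stream=...) and copy_to(buf, stream=...) call shapes are taken from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class StubBuffer:
    """Stand-in for a linear Buffer (hypothetical; backed by host bytes here)."""
    data: bytearray


class StubMemoryResource:
    """Stand-in memory resource; only mimics allocate(size, stream=...)."""

    def allocate(self, size, stream=None):
        return StubBuffer(bytearray(size))


class StubCUDAArray:
    """Stand-in for the opaque array: no pointer is exposed, only copies."""

    def __init__(self, payload):
        self._payload = bytes(payload)

    @property
    def size_bytes(self):
        return len(self._payload)

    def copy_to(self, dst, stream=None):
        # The only way out of the opaque layout is an explicit copy.
        dst.data[:] = self._payload
        return dst


def to_linear_buffer(arr, mr, stream=None):
    """Hypothetical convenience helper (not shipped): allocate, then copy."""
    buf = mr.allocate(arr.size_bytes, stream=stream)
    arr.copy_to(buf, stream=stream)
    return buf


arr = StubCUDAArray(b"\x01\x02\x03\x04")
buf = to_linear_buffer(arr, StubMemoryResource())
```

The helper body is exactly the two calls a user would otherwise write, which is why it can be added later without breaking anything.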
Factory set. ✅ Not a real decision — driver-mandated, all four required.
A texture can be backed by four kinds of memory — the PR exposes one factory per kind:
from_array: texture over a CUDAArray (the headline feature)
from_mipmapped_array: texture over a MipmappedArray (the headline feature)
from_linear: texture over a plain 1D device buffer (ordinary linear memory, no CUDAArray)
from_pitch2d: texture over a plain 2D pitched buffer (ordinary linear memory, no CUDAArray)
ResourceDescriptor binds CUDA_RESOURCE_DESC, a driver struct whose resType tag selects exactly one of four union arms (ARRAY / MIPMAPPED_ARRAY / LINEAR / PITCH2D), so there is one factory per arm. A faithful binding
of that type must cover all four; shipping only two would be an incomplete binding of a mandatory
driver struct, not a smaller-but-valid surface. So there was no real optionality here — the CTK
driver API dictates the set. (Listed only because it sat next to the genuine decisions.)
Resolution: ship all four factories — required by the driver API, not a tradeoff.
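The one-factory-per-union-arm shape can be sketched in plain Python as below. This is an illustration under assumptions, not the PR's code: ResourceKind and ResourceDescriptorSketch are hypothetical names, and the keyword parameters merely mirror the fields each driver union arm carries (format, channel count, size or width/height/pitch).

```python
from enum import Enum


class ResourceKind(Enum):
    """Hypothetical tag mirroring the four resType values."""
    ARRAY = "array"
    MIPMAPPED_ARRAY = "mipmapped_array"
    LINEAR = "linear"
    PITCH2D = "pitch2d"


class ResourceDescriptorSketch:
    """Pure-Python tagged union: a kind, a backing object, per-kind fields."""

    __slots__ = ("kind", "resource", "fields")

    def __init__(self, kind, resource, **fields):
        self.kind = kind
        self.resource = resource  # strong reference keeps the backing alive
        self.fields = fields

    @classmethod
    def from_array(cls, array):
        return cls(ResourceKind.ARRAY, array)

    @classmethod
    def from_mipmapped_array(cls, mipmapped_array):
        return cls(ResourceKind.MIPMAPPED_ARRAY, mipmapped_array)

    @classmethod
    def from_linear(cls, buffer, *, format, num_channels, size_bytes):
        return cls(ResourceKind.LINEAR, buffer, format=format,
                   num_channels=num_channels, size_bytes=size_bytes)

    @classmethod
    def from_pitch2d(cls, buffer, *, format, num_channels,
                     width, height, pitch_bytes):
        return cls(ResourceKind.PITCH2D, buffer, format=format,
                   num_channels=num_channels, width=width,
                   height=height, pitch_bytes=pitch_bytes)


desc = ResourceDescriptorSketch.from_pitch2d(
    object(), format="FLOAT32", num_channels=1,
    width=64, height=32, pitch_bytes=256,
)
```

Each factory only accepts the fields its arm needs, so an invalid combination (say, a pitch on an array-backed descriptor) cannot be expressed.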
Channel format. ✅ Decided — keep the folded format + num_channels parameters.
Each array element has a component type (e.g. 8-bit uint, 32-bit float) and a channel count
(1 = grayscale … 4 = RGBA). Two ways to surface that:
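Folded: two parameters, as in CUDAArray.from_descriptor(shape=..., format=ArrayFormat.FLOAT32, num_channels=4)
Separate (CuPy): one ChannelFormatDescriptor(...) object passed as a unit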
The driver descriptor cuda.core actually fills in (CUDA_ARRAY3D_DESCRIPTOR) already stores
these as two separate fields — Format (a CUarray_format, mirrored 1:1 by ArrayFormat) and NumChannels. So the folded form maps straight onto the driver struct with no translation, and
read-back is already exposed as two properties (.format, .num_channels). The bundled ChannelFormatDescriptor is the runtime API's (cudaChannelFormatDesc) modeling — the form
CuPy wraps because its texture module sits on the runtime API. Adopting it in a driver-based
library would mean a translation wrapper the underlying API doesn't use (and the shapes don't even
map cleanly: the driver uses one uniform component format × channel count, while cudaChannelFormatDesc allows per-channel bit widths).
Resolution: keep folded format + num_channels. It's the driver-faithful surface
(consistent with decision #1 favoring correctness over CuPy parity and decision #3 following the driver API); the
bundled form's only wins are CuPy look-alike and a single read-back object, neither worth a
runtime-style wrapper here.
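To make the shape mismatch concrete, here is a small sketch of the translation a bundled descriptor would need. BundledChannelDesc and to_folded are hypothetical names for this example, the kind tags are illustrative, and the returned format strings are plain labels rather than actual ArrayFormat members.

```python
from dataclasses import dataclass

# Illustrative mapping from a kind tag to a format-name prefix.
_PREFIX = {"float": "FLOAT", "signed": "INT", "unsigned": "UINT"}


@dataclass(frozen=True)
class BundledChannelDesc:
    """Runtime-style bundle: per-channel bit widths plus a kind tag."""
    x: int
    y: int
    z: int
    w: int
    kind: str


def to_folded(desc):
    """Translate a bundle into (format_name, num_channels).

    Raises ValueError when the bundle is not one uniform component
    format repeated over some number of channels.
    """
    used = [bits for bits in (desc.x, desc.y, desc.z, desc.w) if bits > 0]
    if not used:
        raise ValueError("descriptor has no channels")
    if len(set(used)) != 1:
        raise ValueError("per-channel bit widths differ; no uniform format")
    return f"{_PREFIX[desc.kind]}{used[0]}", len(used)


rgba_float = to_folded(BundledChannelDesc(32, 32, 32, 32, "float"))
gray_u8 = to_folded(BundledChannelDesc(8, 0, 0, 0, "unsigned"))
```

A bundle with mixed widths, such as (10, 10, 10, 2), has no folded equivalent and is rejected, which is the translation cost the folded form avoids.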
Descriptor type consistency. ✅ Not a real decision — divergence is intentional and harmless.
Note: the original framing here was factually wrong. ResourceDescriptor is not a cdef class and holds no native C struct; it is a plain Python class with __slots__, storing a
reference to the backing resource plus a few Python fields. The CUDA_RESOURCE_DESC struct is
assembled later, in TextureObject.from_descriptor. So this was never a @dataclass-vs-cdef class / performance question. Both descriptors are pure Python.
The genuine difference is only how you construct each, and it reflects what each type is:
TextureDescriptor — a flat bag of independent sampling settings, built directly with keyword
args (@dataclass fits perfectly).
ResourceDescriptor — a "pick exactly one of four backings" union (array / mipmap / linear /
pitch2d), built via from_* factories because each kind carries different fields. A single __init__ would be a pile of mutually-exclusive optional args plus a kind tag.
Consistency is not a goal in itself — it only matters when inconsistency makes the API harder to
learn or use, and here it doesn't: a user learns each type once and never has to reconcile them.
The only behavioral gap is equality (TextureDescriptor compares by value; ResourceDescriptor
by identity), which is essentially never exercised on these objects and is arguably correct since ResourceDescriptor wraps a live device resource. Forcing both to the same kind of type would
be uniformity for its own sake and would make ResourceDescriptor's constructor worse.
Resolution: keep the split; the divergence is intentional and does not hurt usability. Like decision #3, this resolves to a non-issue once examined (and it rested on a mistaken premise to begin with).
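The equality gap mentioned above follows directly from the two construction styles. The sketch below uses hypothetical stand-ins (TextureSettingsSketch, ResourceHandleSketch, and their field names are invented for illustration) to show the default Python behavior of each.

```python
from dataclasses import dataclass


@dataclass
class TextureSettingsSketch:
    """Flat bag of independent settings; @dataclass generates __eq__."""
    address_mode: str = "clamp"
    filter_mode: str = "point"
    normalized_coords: bool = False


class ResourceHandleSketch:
    """Plain __slots__ class; inherits identity-based equality from object."""

    __slots__ = ("resource",)

    def __init__(self, resource):
        self.resource = resource


a = TextureSettingsSketch(filter_mode="linear")
b = TextureSettingsSketch(filter_mode="linear")

backing = object()
r1 = ResourceHandleSketch(backing)
r2 = ResourceHandleSketch(backing)

same_settings = a == b   # compared field by field
same_handle = r1 == r2   # compared by identity, even with a shared backing
```

Two settings objects with equal fields compare equal, while two handles over the same backing object do not, matching the by-value versus by-identity split described above.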
Bool naming. ✅ Decided — adopt the is_<something> convention.
surface_load_store is a boolean on Array: it records whether the array was created with the
surface load/store capability (CUDA's CUDA_ARRAY3D_SURFACE_LDST), which a SurfaceObject
requires. Exposed both as a constructor keyword (surface_load_store=True) and a read-only
property (arr.surface_load_store).
The repo convention for boolean properties is is_<something>, so a property named surface_load_store doesn't read as a boolean the way arr.is_managed does. Resolution: rename
the property to follow the is_<x> convention (e.g. is_surface_load_store) for consistency with
the cuda-python codebase.
Open detail resolved: the property name is is_surface_load_store (already implemented), and
the constructor keyword is renamed to match — from_descriptor(..., is_surface_load_store=False) —
so one symmetric name serves both set and read-back. This follows the existing StridedMemoryView(is_readonly=...) precedent in cuda.core, where an is_<x> boolean is used as
both the constructor argument and the attribute. (The keyword rename is a small implementation
follow-up in the PR; the property is already done.)
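A minimal sketch of the symmetric naming, assuming a simplified constructor: CUDAArraySketch and make_surface are hypothetical stand-ins, and only the is_surface_load_store name comes from the decision above.

```python
class CUDAArraySketch:
    """Stand-in showing one name for both the keyword and the property."""

    def __init__(self, shape, *, is_surface_load_store=False):
        self._shape = tuple(shape)
        self._is_surface_load_store = bool(is_surface_load_store)

    @property
    def is_surface_load_store(self):
        # Read-only: the capability is fixed when the array is created.
        return self._is_surface_load_store


def make_surface(arr):
    """Hypothetical check mirroring the requirement a surface places on its array."""
    if not arr.is_surface_load_store:
        raise ValueError("surface objects require is_surface_load_store=True")
    return ("surface", arr)


writable = CUDAArraySketch((4, 4), is_surface_load_store=True)
plain = CUDAArraySketch((4, 4))
surface = make_surface(writable)
```

The same identifier appears at construction and at read-back, so a reader who sees is_surface_load_store=True knows which attribute to query later.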
Scope. ✅ Decided — split the examples into a follow-up PR.
The nine gl_interop_*.py examples (~5k lines, not CI-wired, need a GL context CI lacks) are
orthogonal to the core API. Resolution: drop them from this PR and land them in a separate
follow-up PR once this core texture/surface PR merges, since the examples depend on the new API
it introduces.
TextureObject and SurfaceObject? #467
cc @leofang @mdboom @Andy-Jost @kkraus14: you asked for a design pass; this is the home for it.
Proposed public surface (from #2095)
Array + ArrayFormat: opaque, hardware-laid-out GPU allocations backing textures/surfaces.
MipmappedArray: wraps CUmipmappedArray; get_level returns a non-owning Array kept alive by a strong ref to the parent.
TextureObject + TextureDescriptor: bindless texture handle + sampling state.
SurfaceObject: bindless surface handle; requires Array(surface_load_store=True).
ResourceDescriptor: factories from_array, from_mipmapped_array, from_linear, from_pitch2d.