Merged
2 changes: 1 addition & 1 deletion .claude/sweep-security-state.csv
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ fire,2026-04-25,,,,,"Clean. Despite the module's size hint, fire.py is purely pe
flood,2026-05-03,1437,MEDIUM,3,,Re-audit 2026-05-03. MEDIUM Cat 3 fixed in PR #1438 (travel_time and flood_depth_vegetation now validate mannings_n DataArray values are finite and strictly positive via _validate_mannings_n_dataarray helper). No remaining unfixed findings. Other categories clean: every allocation is same-shape as input; no flat index math; NaN propagation explicit in every backend; tan_slope clamped by _TAN_MIN; no CUDA kernels; no file I/O; every public API calls _validate_raster on DataArray inputs.
focal,2026-04-27,1284,HIGH,1,,"HIGH (fixed PR #1286): apply(), focal_stats(), and hotspots() accepted unbounded user-supplied kernels via custom_kernel(), which only checks shape parity. The kernel-size guard from #1241 (_check_kernel_memory) only ran inside circle_kernel/annulus_kernel, so a (50001, 50001) custom kernel on a 10x10 raster allocated ~10 GB on the kernel itself plus a much larger padded raster before any work -- same shape as the bilateral DoS in #1236. Fixed by adding _check_kernel_vs_raster_memory in focal.py and wiring it into apply(), focal_stats(), and hotspots() after custom_kernel() validation. All 134 focal tests + 19 bilateral tests pass. No other findings: 10 CUDA kernels all have proper bounds + stencil guards; _validate_raster called on every public entry point; hotspots already raises ZeroDivisionError on constant-value rasters; _focal_variety_cuda uses a fixed-size local buffer (silent truncation but bounded); _focal_std_cuda/_focal_var_cuda clamp the catastrophic-cancellation case via if var < 0.0: var = 0.0; no file I/O."
geodesic,2026-04-27,1283,HIGH,1,,"HIGH (fixed PR #1285): slope(method='geodesic') and aspect(method='geodesic') stack a (3, H, W) float64 array (data, lat, lon) before dispatch with no memory check. A large lat/lon-tagged raster passed to either function would OOM. Fixed by adding _check_geodesic_memory(rows, cols) in xrspatial/geodesic.py (mirrors morphology._check_kernel_memory): budgets 56 bytes/cell (24 stacked float64 + 4 float32 output + 24 padded copy + slack) and raises MemoryError when > 50% of available RAM; called from slope.py and aspect.py inside the geodesic branch before dispatch. No other findings: 6 CUDA kernels all have bounds guards (e.g. _run_gpu_geodesic_aspect at geodesic.py:395), custom 16x16 thread blocks avoid register spill, no shared memory, _validate_raster runs upstream in slope/aspect, all backends cast to float32, slope_mag < 1e-7 flat threshold prevents arctan2 NaN propagation, curvature correction uses hardcoded WGS84 R."
geotiff,2026-05-10,,,,,"Pass 11 (2026-05-10): CLEAN. One new commit since pass 10: #1559 (centralise GeoTIFF attrs population via _populate_attrs_from_geo_info helper). Pure metadata-handling refactor that dedupes attrs population across the four read paths (eager numpy, dask, GPU stripped, GPU tiled). No new allocations, no file I/O changes, no kernel changes, no dtype handling changes. Helper only writes to in-memory attrs dict from validated geo_info fields already produced by the header parser. Re-verified all prior guards intact: MAX_PIXELS_DEFAULT=1e9 (_reader.py:46), MAX_IFD_ENTRY_COUNT=100_000 (_header.py:34), MAX_IFDS=256 (_header.py:45), MAX_TILE_BYTES_DEFAULT=256 MiB (_reader.py:68), _MAX_DASK_CHUNKS=50_000 (__init__.py:1428), realpath canonicalisation in _reader.py:161 and _vrt.py:160. Cat 1-6 all clean. No PRs opened. | Pass 10 (2026-05-10): CLEAN. Audited 8 commits since pass 9: #1558 (drop nodata copies - buffers uniquely owned), #1556 (fp-predictor ngjit row loop - bit-exact), #1552 (batched GDS-fallback D2H - _check_gpu_memory guards new staging), #1551 (parallel decode at tile_size=256 - pool bounded by min(n_tiles, cpu_count), mmap thread-safe read-only), #1549 (nvjpeg constants - actually closed CRITICAL: prior off-by-two passed PLANAR where INTERLEAVED expected, dereferenced NULL channel ptrs), #1542 (GPU nodata mask - called once per path, stripped fallback returns early), #1543 (combined backend tests-only), #1544 (__all__/deprecation no security surface). Re-verified guards intact: MAX_PIXELS_DEFAULT, MAX_IFD_ENTRY_COUNT, MAX_IFDS, MAX_TILE_BYTES_DEFAULT (256 MiB), _MAX_DASK_CHUNKS, _check_gpu_memory, _MmapCache TOCTOU, VRT realpath, LZW (sp<4096) and Deflate (overflow array) bounds. No issues opened. | Pass 2 (2026-05-09): Found unbounded compressed-byte-count fetch in HTTP COG path (_fetch_decode_cog_http_tiles passes attacker-controlled TileByteCounts to read_ranges_coalesced). Verified exploit: 4 tiles x 100 MB byte_count -> single 100 MB Range GET. Fixed via per-tile cap (MAX_TILE_BYTES_DEFAULT=256 MiB, XRSPATIAL_COG_MAX_TILE_BYTES env override). Other recent PRs (#1530 IFD chain cap, #1527 IFD count, #1535 LERC mask, #1532 PlanarConfig=2, #1534 HTTP coalesce, #1531 parallel write, #1528 nvCOMP batch) audited clean. | Prior: Re-audit 2026-04-28: only #7984f7a since prior pass (predictor=3 multi-band CPU fix, LERC dedup, VRT nodata dtype, AREA_OR_POINT, BigTIFF LONG8 offsets). LZW/inflate GPU kernels gate every write on out_pos<dst_len; _assemble_tiles_kernel safe under existing validate_tile_layout. Documented LOW: RowsPerStrip=0 raises ZeroDivisionError in _read_strips (line 396), VRT realpath canonicalises but doesn't contain (by-design GDAL parity, low impact since we only return arrays), no max-bytes guard on VRT XML parse. None opened; documented only. | Pass 10 (2026-05-10): Audited 8 unaudited commits (f157746 drop defensive copies, 39322c3 fp-predictor ngjit row loop, f23ec8f batch D2H, 1aac3b7 parallel tile decode, 593aee5 nvjpeg constant fix, 118678a GPU nodata mask, 49a9820 dask+cupy tests, fa00718 __all__ exposure). All clean: defensive copy removal (#1558) verified aliasing-safe via two-read independence test; fp_predictor_encode_rows (#1556) bit-exact vs Python loop at 1x1/prime/256x256 dims; _batched_d2h_to_bytes (#1552) GPU round-trip at empty/single/zero-size/100-tile regimes; parallel tile decode (#1551) correct at 1024x1024 and prime 997x1013 dims; nvjpeg constants (#1549) verified fixed - no OOB GPU write; _apply_nodata_mask_gpu (#1542) single-application verified for all GPU paths (stripped returns early); no double-application race or aliasing issue. No new CRITICAL/HIGH/MEDIUM issues found. No PRs opened."
geotiff,2026-05-11,1579,HIGH,5,,"HIGH (fixed #1579, PR pending): xml.etree.ElementTree.fromstring in _vrt.parse_vrt and _geotags._parse_gdal_metadata expanded internal XML entities (CWE-776 billion-laughs). A crafted .vrt file or a hostile GDALMetadata tag (42112) embedded in a TIFF could OOM the reader. Fixed by adding _safe_xml.safe_fromstring which refuses DOCTYPE declarations (where entities live) before parsing and prefers defusedxml when installed. GDALMetadata path falls through to empty dict on rejection (non-essential metadata); VRT path raises ValueError (file is required). 9 new tests added under TestVRTXMLEntityExpansion, TestGDALMetadataXMLEntityExpansion, TestSafeXMLHelper. No other CRITICAL/HIGH/MEDIUM findings: existing guards already cover Cat 1 (MAX_PIXELS_DEFAULT + per-tile/strip checks + VRT _check_dimensions + HTTP tile byte cap via XRSPATIAL_COG_MAX_TILE_BYTES, 256 MiB default), Cat 2 (header uses int64 offsets, MAX_IFD_ENTRY_COUNT=100k, MAX_IFDS=256, MAX_IFD_ENTRY_BYTES=256 KiB), Cat 4 (every CUDA kernel inspected has bounds guard: _byte_swap_lanes_kernel:91, _lzw_decode_tiles_kernel:168, _predictor_decode_kernel_*:725-771, _fp_predictor_decode_kernel:792, _assemble_tiles_kernel:841; shared-memory arrays are 4096-cell fixed allocations sized for LZW table, no overflow path), Cat 5 (VRT SourceFilename canonicalised via realpath, _writer atomic-rename via mkstemp in dirname-of-abspath), Cat 6 (open_geotiff promotes integer + nodata to float64 with NaN; _validate_dtype_cast rejects float->int casts). Decompression-bomb guards already enforced via _DECOMPRESS_MARGIN=1.05 cap in _compression.deflate_decompress."
glcm,2026-04-24,1257,HIGH,1,,"HIGH (fixed #1257): glcm_texture() validated window_size only as >= 3 and distance only as >= 1, with no upper bound on either. _glcm_numba_kernel iterates range(r-half, r+half+1) for every pixel, so window_size=1_000_001 on a 10x10 raster ran ~10^14 loop iterations with all neighbors failing the interior bounds check (CPU DoS). On the dask backends depth = window_size // 2 + distance drove map_overlap padding, so a huge window also caused oversize per-chunk allocations (memory DoS). Fixed by adding max_val caps in the public entrypoint: window_size <= max(3, min(rows, cols)) and distance <= max(1, window_size // 2). One cap covers every backend because cupy and dask+cupy call through to the CPU kernel after cupy.asnumpy. No other HIGH findings: levels is already capped at 256 so the per-pixel np.zeros((levels, levels)) matrix in the kernel is bounded to 512 KB. No CUDA kernels. No file I/O. Quantization clips to [0, levels-1] before the kernel and NaN maps to -1 which the kernel filters with i_val >= 0. Entropy log(p) and correlation p / (std_i * std_j) are both guarded. All four backends use _validate_raster and cast to float64 before quantizing. MEDIUM (unfixed, Cat 1): the per-pixel np.zeros((levels, levels)) allocation inside the hot loop is a perf issue (levels=256 -> 512 KB alloc+free per pixel) but not a security issue because levels is bounded. Could be hoisted out of the loop or replaced with an in-place clear, but that is an efficiency concern, not security."
gpu_rtx,2026-04-29,1308,HIGH,1,,"HIGH (fixed #1308 / PR #1310): hillshade_rtx (gpu_rtx/hillshade.py:184) and viewshed_gpu (gpu_rtx/viewshed.py:269) allocated cupy device buffers sized by raster shape with no memory check. create_triangulation (mesh_utils.py:23-24) adds verts (12 B/px) + triangles (24 B/px) = 36 B/px; hillshade_rtx adds d_rays(32) + d_hits(16) + d_aux(12) + d_output(4) = 64 B/px (100 B/px total); viewshed_gpu adds d_rays(32) + d_hits(16) + d_visgrid(4) + d_vsrays(32) = 84 B/px (120 B/px total). A 30000x30000 raster asked for 90-108 GB of VRAM before cupy surfaced an opaque allocator error. Fixed by adding gpu_rtx/_memory.py with _available_gpu_memory_bytes() and _check_gpu_memory(func_name, h, w) helpers (cost_distance #1262 / sky_view_factor #1299 pattern, 120 B/px budget covers worst case, raises MemoryError when required > 50% of free VRAM, skips silently when memGetInfo() unavailable). Wired into both entry points after the cupy.ndarray type check and before create_triangulation. 9 new tests in test_gpu_rtx_memory.py (5 helper-unit + 4 end-to-end gated on has_rtx). All 81 existing hillshade/viewshed tests still pass. Cat 4 clean: all CUDA kernels (hillshade.py:25/62/106, viewshed.py:32/74/116, mesh_utils.py:50) have bounds guards; no shared memory, no syncthreads needed. MEDIUM not fixed (Cat 6): hillshade_rtx and viewshed_gpu do not call _validate_raster directly but parent hillshade() (hillshade.py:252) and viewshed() (viewshed.py:1707) already validate, so input validation runs before the gpu_rtx entry point - defense-in-depth, not exploitable. MEDIUM not fixed (Cat 2): mesh_utils.py:64-68 cast mesh_map_index to int32 in the triangle index buffer; overflows at H*W > 2.1B vertices (~46341x46341+) but the new memory guard rejects rasters that large first - documentation/clarity item rather than exploitable. MEDIUM not fixed (Cat 3): mesh_utils.py:19 scale = maxDim / maxH divides by zero on an all-zero raster, propagating inf/NaN into mesh vertex z-coords; separate follow-up. LOW not fixed (Cat 5): mesh_utils.write() opens user-supplied path without canonicalization but its only call site (mesh_utils.py:38-39) sits behind if False: in create_triangulation, not reachable in production."
hillshade,2026-04-27,,,,,"Clean. Cat 1: only allocation is the output np.empty(data.shape) at line 32 (cupy at line 165) and a _pad_array with hardcoded depth=1 (line 62) -- bounded by caller, no user-controlled amplifier. Azimuth/altitude are scalars and don't drive size. Cat 2: numba kernel uses range(1, rows-1) with simple (y, x) indexing; numba range loops promote to int64. Cat 3: math.sqrt(1.0 + xx_plus_yy) is always >= 1.0 (no neg sqrt, no div-by-zero); NaN elevation propagates correctly through dz_dx/dz_dy -> shaded -> output (the shaded < 0.0 / shaded > 1.0 clamps don't fire on NaN). Azimuth validated to [0, 360], altitude to [0, 90]. Cat 4: _gpu_calc_numba (line 107) guards both grid bounds and 3x3 stencil reads via i > 0 and i < shape[0]-1 and j > 0 and j < shape[1]-1; no shared memory. Cat 5: no file I/O. Cat 6: hillshade() calls _validate_raster (line 252) and _validate_scalar for both azimuth (253) and angle_altitude (254); all four backend paths cast to float32; tests parametrize int32/int64/float32/float64."
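The billion-laughs expansion the geotiff rows above defend against can be demonstrated at harmless scale with nothing but the stdlib. This sketch uses an illustrative three-level payload (real attacks use ~10 levels of 10x expansion, turning ~1 KB of XML into ~10 GB of text):

```python
import xml.etree.ElementTree as ET

# Scaled-down "billion laughs": each entity expands to four copies of
# the previous one, so three levels turn "AAAA" into 4**3 = 64 chars.
payload = (
    '<!DOCTYPE r [\n'
    '  <!ENTITY a "AAAA">\n'
    '  <!ENTITY b "&a;&a;&a;&a;">\n'
    '  <!ENTITY c "&b;&b;&b;&b;">\n'
    ']>\n'
    '<r>&c;</r>'
)

# Stdlib ElementTree expands internal entities by default (CWE-776),
# which is exactly why the reader must refuse the DOCTYPE up front.
root = ET.fromstring(payload)
print(len(root.text))  # 64
```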
12 changes: 10 additions & 2 deletions xrspatial/geotiff/_geotags.py
@@ -170,9 +170,13 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
     Per-band items are stored as ``{(name, band_int): value}``.
     """
     import xml.etree.ElementTree as ET
+    from ._safe_xml import safe_fromstring
     result = {}
     try:
-        root = ET.fromstring(xml_str)
+        # GDALMetadata XML rides inside TIFF tag 42112; a crafted file
+        # can carry a billion-laughs payload there, so refuse DOCTYPEs
+        # before parsing. See issue #1579.
+        root = safe_fromstring(xml_str)
         for item in root.findall('Item'):
             name = item.get('name', '')
             sample = item.get('sample')
@@ -181,7 +185,11 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
                 result[(name, int(sample))] = text
             else:
                 result[name] = text
-    except ET.ParseError:
+    except (ET.ParseError, ValueError):
+        # ValueError surfaces from safe_fromstring when the payload
+        # carries a DOCTYPE. GDALMetadata is non-essential metadata, so
+        # we silently drop it rather than failing the whole read --
+        # matches the existing ParseError fallback.
         pass
     return result

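The swallow-and-degrade policy in the hunk above can be mimicked in a self-contained sketch. `parse_metadata_items`, the inline DOCTYPE check, and the simplified single-key handling are illustrative stand-ins, not the project's actual helper:

```python
import re
import xml.etree.ElementTree as ET

_DOCTYPE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)

def parse_metadata_items(xml_str):
    """Illustrative stand-in for the patched _parse_gdal_metadata path."""
    result = {}
    try:
        if _DOCTYPE.search(xml_str):
            # Stands in for safe_fromstring's DOCTYPE rejection.
            raise ValueError('DOCTYPE refused')
        root = ET.fromstring(xml_str)
        for item in root.findall('Item'):
            result[item.get('name', '')] = item.text or ''
    except (ET.ParseError, ValueError):
        # Non-essential metadata: drop it rather than fail the read.
        result = {}
    return result

print(parse_metadata_items(
    '<GDALMetadata><Item name="units">m</Item></GDALMetadata>'))
# {'units': 'm'}
print(parse_metadata_items('<!DOCTYPE x []><GDALMetadata/>'))  # {}
```

Hostile and malformed payloads both collapse to an empty dict, so a bad tag 42112 never aborts a TIFF read.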
108 changes: 108 additions & 0 deletions xrspatial/geotiff/_safe_xml.py
@@ -0,0 +1,108 @@
"""Safe XML parsing helpers for VRT and GDALMetadata payloads.

`xml.etree.ElementTree.fromstring` is built on top of expat and, by
default, expands internal entities. A crafted VRT file or a hostile
TIFF carrying a billion-laughs payload in the GDALMetadata tag (42112)
can OOM the host process when read via the standard API (CWE-776,
GHSA-class "XML Entity Expansion"). External entities (SYSTEM/PUBLIC)
are already blocked by expat, but internal entity DoS is not.

This module exposes a single :func:`safe_fromstring` helper that:

* Rejects any input declaring a ``<!DOCTYPE`` (and therefore any
``<!ENTITY ...>`` definitions) before expat ever sees the bytes.
* Falls back to :mod:`defusedxml.ElementTree` when that library is
installed, layering a second, audited defence (defusedxml also
blocks external entities, processing instructions on untrusted
input, and a few other XML pitfalls).
* Otherwise falls back to :func:`xml.etree.ElementTree.fromstring`,
which is fine once DOCTYPEs are pre-rejected: expat exposes no
external-entity fetch by default, and without a DTD there is no
way to declare an entity that would expand to more than the
literal bytes.

VRT and GDALMetadata XML never contain DTDs in legitimate files, so
the pre-rejection is loss-free for real-world inputs.
"""
from __future__ import annotations

import re
import xml.etree.ElementTree as _ET

# A DOCTYPE declaration is the only path to internal entity expansion in
# stdlib XML, so refuse the whole input the moment we see one. The
# regex is intentionally permissive on whitespace because spec-compliant
# parsers accept blanks between ``<!`` and ``DOCTYPE``. We do not look
# inside comments / CDATA: those constructs cannot legally hold a
# DOCTYPE declaration anyway, and trying to be clever about them is how
# parser-confusion bugs get introduced.
_DOCTYPE_RE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)


def _decode_bytes(data: bytes) -> str:
    """Decode XML bytes to text using BOM / XML-declaration detection.

    Mirrors the encoding-detection precedence expat would apply, so a
    UTF-16- or UTF-32-encoded payload cannot smuggle a DOCTYPE past
    the ASCII-only scanner. Falls back to UTF-8 with replacement when
    no BOM or leading null bytes hint at a wider encoding.
    """
    if data[:4] in (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'):
        return data.decode('utf-32')
    if data[:2] in (b'\xfe\xff', b'\xff\xfe'):
        return data.decode('utf-16')
    if data[:3] == b'\xef\xbb\xbf':
        return data.decode('utf-8-sig')
    # No BOM: a leading or alternating null byte gives away UTF-16/32.
    if len(data) >= 4:
        if data[:2] == b'\x00\x00':
            return data.decode('utf-32-be', errors='replace')
        if data[2:4] == b'\x00\x00':
            return data.decode('utf-32-le', errors='replace')
        if data[0] == 0:
            return data.decode('utf-16-be', errors='replace')
        if data[1] == 0:
            return data.decode('utf-16-le', errors='replace')
    return data.decode('utf-8', errors='replace')


def _reject_doctype(data: bytes | str) -> None:
    """Raise ValueError if *data* declares a DOCTYPE.

    Accepts both ``bytes`` and ``str`` so callers don't have to encode
    upstream. Bytes are decoded with BOM/encoding detection so that a
    UTF-16 or UTF-32 encoded payload cannot evade the ASCII scanner.
    Empty / None inputs are passed through and handled by the
    downstream parser.
    """
    if data is None:
        return
    if isinstance(data, str):
        probe = data
    else:
        probe = _decode_bytes(bytes(data))
    if _DOCTYPE_RE.search(probe):
        raise ValueError(
            "XML input contains a DOCTYPE declaration; this is refused "
            "to prevent XML entity expansion (billion-laughs) attacks. "
            "VRT and GDALMetadata payloads never need a DTD."
        )


def safe_fromstring(text: str | bytes):
    """Parse *text* into an ElementTree root, refusing DTDs / entities.

    Returns the parsed root :class:`xml.etree.ElementTree.Element`. Raises
    ``ValueError`` if the input declares a DOCTYPE, or whatever the
    underlying parser raises on malformed input (typically
    :class:`xml.etree.ElementTree.ParseError`).
    """
    _reject_doctype(text)
    try:
        # Prefer defusedxml when available -- it adds belt-and-braces
        # defences (external entity / network fetch / processing
        # instruction handling) on top of the DOCTYPE rejection.
        from defusedxml import ElementTree as _defused_ET
        return _defused_ET.fromstring(text)
    except ImportError:
        return _ET.fromstring(text)
8 changes: 6 additions & 2 deletions xrspatial/geotiff/_vrt.py
@@ -6,11 +6,12 @@
 from __future__ import annotations
 
 import os
-import xml.etree.ElementTree as ET
 from dataclasses import dataclass, field
 
 import numpy as np
 
+from ._safe_xml import safe_fromstring
+
 # Lazy imports to avoid circular dependency
 _DTYPE_MAP = {
     'Byte': np.uint8,
@@ -105,7 +106,10 @@ def parse_vrt(xml_str: str, vrt_dir: str = '.') -> VRTDataset:
     -------
     VRTDataset
     """
-    root = ET.fromstring(xml_str)
+    # ``safe_fromstring`` refuses DOCTYPE declarations so a crafted VRT
+    # cannot trigger XML entity expansion (billion-laughs) attacks
+    # against the reader. See issue #1579.
+    root = safe_fromstring(xml_str)
 
     width = int(root.get('rasterXSize', 0))
     height = int(root.get('rasterYSize', 0))
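The two call sites apply opposite policies: the GDALMetadata path falls back to an empty dict, while the VRT path lets the rejection propagate because the file is required input. A simplified re-sketch of `safe_fromstring` (str-only, no encoding detection, not the shipped implementation) makes the VRT side concrete:

```python
import re
import xml.etree.ElementTree as ET

_DOCTYPE_RE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)

def safe_fromstring(text):
    """Str-only re-sketch of the helper; no bytes/encoding handling."""
    if _DOCTYPE_RE.search(text):
        raise ValueError('XML input contains a DOCTYPE declaration')
    return ET.fromstring(text)

# VRT path: rejection is a hard error that surfaces to the caller.
hostile = '<!DOCTYPE v [<!ENTITY x "boom">]><VRTDataset/>'
try:
    safe_fromstring(hostile)
except ValueError as exc:
    print('refused:', exc)

# A clean VRT parses normally; attribute reads mirror parse_vrt's.
root = safe_fromstring('<VRTDataset rasterXSize="4" rasterYSize="2"/>')
print(int(root.get('rasterXSize', 0)), int(root.get('rasterYSize', 0)))
```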