Merged
2 changes: 1 addition & 1 deletion .claude/sweep-security-state.csv
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ fire,2026-04-25,,,,,"Clean. Despite the module's size hint, fire.py is purely pe
flood,2026-05-03,1437,MEDIUM,3,,Re-audit 2026-05-03. MEDIUM Cat 3 fixed in PR #1438 (travel_time and flood_depth_vegetation now validate mannings_n DataArray values are finite and strictly positive via _validate_mannings_n_dataarray helper). No remaining unfixed findings. Other categories clean: every allocation is same-shape as input; no flat index math; NaN propagation explicit in every backend; tan_slope clamped by _TAN_MIN; no CUDA kernels; no file I/O; every public API calls _validate_raster on DataArray inputs.
focal,2026-04-27,1284,HIGH,1,,"HIGH (fixed PR #1286): apply(), focal_stats(), and hotspots() accepted unbounded user-supplied kernels via custom_kernel(), which only checks shape parity. The kernel-size guard from #1241 (_check_kernel_memory) only ran inside circle_kernel/annulus_kernel, so a (50001, 50001) custom kernel on a 10x10 raster allocated ~10 GB on the kernel itself plus a much larger padded raster before any work -- same shape as the bilateral DoS in #1236. Fixed by adding _check_kernel_vs_raster_memory in focal.py and wiring it into apply(), focal_stats(), and hotspots() after custom_kernel() validation. All 134 focal tests + 19 bilateral tests pass. No other findings: 10 CUDA kernels all have proper bounds + stencil guards; _validate_raster called on every public entry point; hotspots already raises ZeroDivisionError on constant-value rasters; _focal_variety_cuda uses a fixed-size local buffer (silent truncation but bounded); _focal_std_cuda/_focal_var_cuda clamp the catastrophic-cancellation case via if var < 0.0: var = 0.0; no file I/O."
geodesic,2026-04-27,1283,HIGH,1,,"HIGH (fixed PR #1285): slope(method='geodesic') and aspect(method='geodesic') stack a (3, H, W) float64 array (data, lat, lon) before dispatch with no memory check. A large lat/lon-tagged raster passed to either function would OOM. Fixed by adding _check_geodesic_memory(rows, cols) in xrspatial/geodesic.py (mirrors morphology._check_kernel_memory): budgets 56 bytes/cell (24 stacked float64 + 4 float32 output + 24 padded copy + slack) and raises MemoryError when > 50% of available RAM; called from slope.py and aspect.py inside the geodesic branch before dispatch. No other findings: 6 CUDA kernels all have bounds guards (e.g. _run_gpu_geodesic_aspect at geodesic.py:395), custom 16x16 thread blocks avoid register spill, no shared memory, _validate_raster runs upstream in slope/aspect, all backends cast to float32, slope_mag < 1e-7 flat threshold prevents arctan2 NaN propagation, curvature correction uses hardcoded WGS84 R."
geotiff,2026-05-10,,,,,"Pass 11 (2026-05-10): CLEAN. One new commit since pass 10: #1559 (centralise GeoTIFF attrs population via _populate_attrs_from_geo_info helper). Pure metadata-handling refactor that dedupes attrs population across the four read paths (eager numpy, dask, GPU stripped, GPU tiled). No new allocations, no file I/O changes, no kernel changes, no dtype handling changes. Helper only writes to in-memory attrs dict from validated geo_info fields already produced by the header parser. Re-verified all prior guards intact: MAX_PIXELS_DEFAULT=1e9 (_reader.py:46), MAX_IFD_ENTRY_COUNT=100_000 (_header.py:34), MAX_IFDS=256 (_header.py:45), MAX_TILE_BYTES_DEFAULT=256 MiB (_reader.py:68), _MAX_DASK_CHUNKS=50_000 (__init__.py:1428), realpath canonicalisation in _reader.py:161 and _vrt.py:160. Cat 1-6 all clean. No PRs opened. | Pass 10 (2026-05-10): CLEAN. Audited 8 commits since pass 9: #1558 (drop nodata copies - buffers uniquely owned), #1556 (fp-predictor ngjit row loop - bit-exact), #1552 (batched GDS-fallback D2H - _check_gpu_memory guards new staging), #1551 (parallel decode at tile_size=256 - pool bounded by min(n_tiles, cpu_count), mmap thread-safe read-only), #1549 (nvjpeg constants - actually closed CRITICAL: prior off-by-two passed PLANAR where INTERLEAVED expected, dereferenced NULL channel ptrs), #1542 (GPU nodata mask - called once per path, stripped fallback returns early), #1543 (combined backend tests-only), #1544 (__all__/deprecation no security surface). Re-verified guards intact: MAX_PIXELS_DEFAULT, MAX_IFD_ENTRY_COUNT, MAX_IFDS, MAX_TILE_BYTES_DEFAULT (256 MiB), _MAX_DASK_CHUNKS, _check_gpu_memory, _MmapCache TOCTOU, VRT realpath, LZW (sp<4096) and Deflate (overflow array) bounds. No issues opened. | Pass 2 (2026-05-09): Found unbounded compressed-byte-count fetch in HTTP COG path (_fetch_decode_cog_http_tiles passes attacker-controlled TileByteCounts to read_ranges_coalesced). Verified exploit: 4 tiles x 100 MB byte_count -> single 100 MB Range GET. Fixed via per-tile cap (MAX_TILE_BYTES_DEFAULT=256 MiB, XRSPATIAL_COG_MAX_TILE_BYTES env override). Other recent PRs (#1530 IFD chain cap, #1527 IFD count, #1535 LERC mask, #1532 PlanarConfig=2, #1534 HTTP coalesce, #1531 parallel write, #1528 nvCOMP batch) audited clean. | Prior: Re-audit 2026-04-28: only #7984f7a since prior pass (predictor=3 multi-band CPU fix, LERC dedup, VRT nodata dtype, AREA_OR_POINT, BigTIFF LONG8 offsets). LZW/inflate GPU kernels gate every write on out_pos<dst_len; _assemble_tiles_kernel safe under existing validate_tile_layout. Documented LOW: RowsPerStrip=0 raises ZeroDivisionError in _read_strips (line 396), VRT realpath canonicalises but doesn't contain (by-design GDAL parity, low impact since we only return arrays), no max-bytes guard on VRT XML parse. None opened; documented only. | Pass 10 (2026-05-10): Audited 8 unaudited commits (f157746 drop defensive copies, 39322c3 fp-predictor ngjit row loop, f23ec8f batch D2H, 1aac3b7 parallel tile decode, 593aee5 nvjpeg constant fix, 118678a GPU nodata mask, 49a9820 dask+cupy tests, fa00718 __all__ exposure). All clean: defensive copy removal (#1558) verified aliasing-safe via two-read independence test; fp_predictor_encode_rows (#1556) bit-exact vs Python loop at 1x1/prime/256x256 dims; _batched_d2h_to_bytes (#1552) GPU round-trip at empty/single/zero-size/100-tile regimes; parallel tile decode (#1551) correct at 1024x1024 and prime 997x1013 dims; nvjpeg constants (#1549) verified fixed - no OOB GPU write; _apply_nodata_mask_gpu (#1542) single-application verified for all GPU paths (stripped returns early); no double-application race or aliasing issue. No new CRITICAL/HIGH/MEDIUM issues found. No PRs opened."
geotiff,2026-05-11,1579,HIGH,5,,"HIGH (fixed #1579, PR pending): xml.etree.ElementTree.fromstring in _vrt.parse_vrt and _geotags._parse_gdal_metadata expanded internal XML entities (CWE-776 billion-laughs). A crafted .vrt file or a hostile GDALMetadata tag (42112) embedded in a TIFF could OOM the reader. Fixed by adding _safe_xml.safe_fromstring which refuses DOCTYPE declarations (where entities live) before parsing and prefers defusedxml when installed. GDALMetadata path falls through to empty dict on rejection (non-essential metadata); VRT path raises ValueError (file is required). 9 new tests added under TestVRTXMLEntityExpansion, TestGDALMetadataXMLEntityExpansion, TestSafeXMLHelper. No other CRITICAL/HIGH/MEDIUM findings: existing guards already cover Cat 1 (MAX_PIXELS_DEFAULT + per-tile/strip checks + VRT _check_dimensions + HTTP tile byte cap via XRSPATIAL_COG_MAX_TILE_BYTES, 256 MiB default), Cat 2 (header uses int64 offsets, MAX_IFD_ENTRY_COUNT=100k, MAX_IFDS=256, MAX_IFD_ENTRY_BYTES=256 KiB), Cat 4 (every CUDA kernel inspected has bounds guard: _byte_swap_lanes_kernel:91, _lzw_decode_tiles_kernel:168, _predictor_decode_kernel_*:725-771, _fp_predictor_decode_kernel:792, _assemble_tiles_kernel:841; shared-memory arrays are 4096-cell fixed allocations sized for LZW table, no overflow path), Cat 5 (VRT SourceFilename canonicalised via realpath, _writer atomic-rename via mkstemp in dirname-of-abspath), Cat 6 (open_geotiff promotes integer + nodata to float64 with NaN; _validate_dtype_cast rejects float->int casts). Decompression-bomb guards already enforced via _DECOMPRESS_MARGIN=1.05 cap in _compression.deflate_decompress."
glcm,2026-04-24,1257,HIGH,1,,"HIGH (fixed #1257): glcm_texture() validated window_size only as >= 3 and distance only as >= 1, with no upper bound on either. _glcm_numba_kernel iterates range(r-half, r+half+1) for every pixel, so window_size=1_000_001 on a 10x10 raster ran ~10^14 loop iterations with all neighbors failing the interior bounds check (CPU DoS). On the dask backends depth = window_size // 2 + distance drove map_overlap padding, so a huge window also caused oversize per-chunk allocations (memory DoS). Fixed by adding max_val caps in the public entrypoint: window_size <= max(3, min(rows, cols)) and distance <= max(1, window_size // 2). One cap covers every backend because cupy and dask+cupy call through to the CPU kernel after cupy.asnumpy. No other HIGH findings: levels is already capped at 256 so the per-pixel np.zeros((levels, levels)) matrix in the kernel is bounded to 512 KB. No CUDA kernels. No file I/O. Quantization clips to [0, levels-1] before the kernel and NaN maps to -1 which the kernel filters with i_val >= 0. Entropy log(p) and correlation p / (std_i * std_j) are both guarded. All four backends use _validate_raster and cast to float64 before quantizing. MEDIUM (unfixed, Cat 1): the per-pixel np.zeros((levels, levels)) allocation inside the hot loop is a perf issue (levels=256 -> 512 KB alloc+free per pixel) but not a security issue because levels is bounded. Could be hoisted out of the loop or replaced with an in-place clear, but that is an efficiency concern, not security."
gpu_rtx,2026-04-29,1308,HIGH,1,,"HIGH (fixed #1308 / PR #1310): hillshade_rtx (gpu_rtx/hillshade.py:184) and viewshed_gpu (gpu_rtx/viewshed.py:269) allocated cupy device buffers sized by raster shape with no memory check. create_triangulation (mesh_utils.py:23-24) adds verts (12 B/px) + triangles (24 B/px) = 36 B/px; hillshade_rtx adds d_rays(32) + d_hits(16) + d_aux(12) + d_output(4) = 64 B/px (100 B/px total); viewshed_gpu adds d_rays(32) + d_hits(16) + d_visgrid(4) + d_vsrays(32) = 84 B/px (120 B/px total). A 30000x30000 raster asked for 90-108 GB of VRAM before cupy surfaced an opaque allocator error. Fixed by adding gpu_rtx/_memory.py with _available_gpu_memory_bytes() and _check_gpu_memory(func_name, h, w) helpers (cost_distance #1262 / sky_view_factor #1299 pattern, 120 B/px budget covers worst case, raises MemoryError when required > 50% of free VRAM, skips silently when memGetInfo() unavailable). Wired into both entry points after the cupy.ndarray type check and before create_triangulation. 9 new tests in test_gpu_rtx_memory.py (5 helper-unit + 4 end-to-end gated on has_rtx). All 81 existing hillshade/viewshed tests still pass. Cat 4 clean: all CUDA kernels (hillshade.py:25/62/106, viewshed.py:32/74/116, mesh_utils.py:50) have bounds guards; no shared memory, no syncthreads needed. MEDIUM not fixed (Cat 6): hillshade_rtx and viewshed_gpu do not call _validate_raster directly but parent hillshade() (hillshade.py:252) and viewshed() (viewshed.py:1707) already validate, so input validation runs before the gpu_rtx entry point - defense-in-depth, not exploitable. MEDIUM not fixed (Cat 2): mesh_utils.py:64-68 cast mesh_map_index to int32 in the triangle index buffer; overflows at H*W > 2.1B vertices (~46341x46341+) but the new memory guard rejects rasters that large first - documentation/clarity item rather than exploitable. MEDIUM not fixed (Cat 3): mesh_utils.py:19 scale = maxDim / maxH divides by zero on an all-zero raster, propagating inf/NaN into mesh vertex z-coords; separate follow-up. LOW not fixed (Cat 5): mesh_utils.write() opens user-supplied path without canonicalization but its only call site (mesh_utils.py:38-39) sits behind if False: in create_triangulation, not reachable in production."
hillshade,2026-04-27,,,,,"Clean. Cat 1: only allocation is the output np.empty(data.shape) at line 32 (cupy at line 165) and a _pad_array with hardcoded depth=1 (line 62) -- bounded by caller, no user-controlled amplifier. Azimuth/altitude are scalars and don't drive size. Cat 2: numba kernel uses range(1, rows-1) with simple (y, x) indexing; numba range loops promote to int64. Cat 3: math.sqrt(1.0 + xx_plus_yy) is always >= 1.0 (no neg sqrt, no div-by-zero); NaN elevation propagates correctly through dz_dx/dz_dy -> shaded -> output (the shaded < 0.0 / shaded > 1.0 clamps don't fire on NaN). Azimuth validated to [0, 360], altitude to [0, 90]. Cat 4: _gpu_calc_numba (line 107) guards both grid bounds and 3x3 stencil reads via i > 0 and i < shape[0]-1 and j > 0 and j < shape[1]-1; no shared memory. Cat 5: no file I/O. Cat 6: hillshade() calls _validate_raster (line 252) and _validate_scalar for both azimuth (253) and angle_altitude (254); all four backend paths cast to float32; tests parametrize int32/int64/float32/float64."
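The billion-laughs expansion the geotiff rows above defend against can be demonstrated at harmless scale with nothing but the stdlib. This sketch uses an illustrative three-level payload (real attacks use ~10 levels of 10x expansion, turning ~1 KB of XML into ~10 GB of text):

```python
import xml.etree.ElementTree as ET

# Scaled-down "billion laughs": each entity expands to four copies of
# the previous one, so three levels turn "AAAA" into 4**3 = 64 chars.
payload = (
    '<!DOCTYPE r [\n'
    '  <!ENTITY a "AAAA">\n'
    '  <!ENTITY b "&a;&a;&a;&a;">\n'
    '  <!ENTITY c "&b;&b;&b;&b;">\n'
    ']>\n'
    '<r>&c;</r>'
)

# Stdlib ElementTree expands internal entities by default (CWE-776),
# which is exactly why the reader must refuse the DOCTYPE up front.
root = ET.fromstring(payload)
print(len(root.text))  # 64
```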
12 changes: 10 additions & 2 deletions xrspatial/geotiff/_geotags.py
@@ -170,9 +170,13 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
     Per-band items are stored as ``{(name, band_int): value}``.
     """
     import xml.etree.ElementTree as ET
+    from ._safe_xml import safe_fromstring
     result = {}
     try:
-        root = ET.fromstring(xml_str)
+        # GDALMetadata XML rides inside TIFF tag 42112; a crafted file
+        # can carry a billion-laughs payload there, so refuse DOCTYPEs
+        # before parsing. See issue #1579.
+        root = safe_fromstring(xml_str)
         for item in root.findall('Item'):
             name = item.get('name', '')
             sample = item.get('sample')
@@ -181,7 +185,11 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
                 result[(name, int(sample))] = text
             else:
                 result[name] = text
-    except ET.ParseError:
+    except (ET.ParseError, ValueError):
+        # ValueError surfaces from safe_fromstring when the payload
+        # carries a DOCTYPE. GDALMetadata is non-essential metadata, so
+        # we silently drop it rather than failing the whole read --
+        # matches the existing ParseError fallback.
         pass
     return result

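The swallow-and-degrade policy in the hunk above can be mimicked in a self-contained sketch. `parse_metadata_items`, the inline DOCTYPE check, and the simplified single-key handling are illustrative stand-ins, not the project's actual helper:

```python
import re
import xml.etree.ElementTree as ET

_DOCTYPE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)

def parse_metadata_items(xml_str):
    """Illustrative stand-in for the patched _parse_gdal_metadata path."""
    result = {}
    try:
        if _DOCTYPE.search(xml_str):
            # Stands in for safe_fromstring's DOCTYPE rejection.
            raise ValueError('DOCTYPE refused')
        root = ET.fromstring(xml_str)
        for item in root.findall('Item'):
            result[item.get('name', '')] = item.text or ''
    except (ET.ParseError, ValueError):
        # Non-essential metadata: drop it rather than fail the read.
        result = {}
    return result

print(parse_metadata_items(
    '<GDALMetadata><Item name="units">m</Item></GDALMetadata>'))
# {'units': 'm'}
print(parse_metadata_items('<!DOCTYPE x []><GDALMetadata/>'))  # {}
```

Hostile and malformed payloads both collapse to an empty dict, so a bad tag 42112 never aborts a TIFF read.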
108 changes: 108 additions & 0 deletions xrspatial/geotiff/_safe_xml.py
@@ -0,0 +1,108 @@
"""Safe XML parsing helpers for VRT and GDALMetadata payloads.

`xml.etree.ElementTree.fromstring` is built on top of expat and, by
default, expands internal entities. A crafted VRT file or a hostile
TIFF carrying a billion-laughs payload in the GDALMetadata tag (42112)
can OOM the host process when read via the standard API (CWE-776,
GHSA-class "XML Entity Expansion"). External entities (SYSTEM/PUBLIC)
are already blocked by expat, but internal entity DoS is not.

This module exposes a single :func:`safe_fromstring` helper that:

* Rejects any input declaring a ``<!DOCTYPE`` (and therefore any
``<!ENTITY ...>`` definitions) before expat ever sees the bytes.
* Falls back to :mod:`defusedxml.ElementTree` when that library is
installed, layering a second, audited defence (defusedxml also
blocks external entities, processing instructions on untrusted
input, and a few other XML pitfalls).
* Otherwise falls back to :func:`xml.etree.ElementTree.fromstring`,
which is fine once DOCTYPEs are pre-rejected: expat exposes no
external-entity fetch by default, and without a DTD there is no
way to declare an entity that would expand to more than the
literal bytes.

VRT and GDALMetadata XML never contain DTDs in legitimate files, so
the pre-rejection is loss-free for real-world inputs.
"""
from __future__ import annotations

import re
import xml.etree.ElementTree as _ET

# A DOCTYPE declaration is the only path to internal entity expansion in
# stdlib XML, so refuse the whole input the moment we see one. The
# regex is intentionally permissive on whitespace because spec-compliant
# parsers accept blanks between ``<!`` and ``DOCTYPE``. We do not look
# inside comments / CDATA: those constructs cannot legally hold a
# DOCTYPE declaration anyway, and trying to be clever about them is how
# parser-confusion bugs get introduced.
_DOCTYPE_RE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)


def _decode_bytes(data: bytes) -> str:
    """Decode XML bytes to text using BOM / XML-declaration detection.

    Mirrors the encoding-detection precedence expat would apply, so a
    UTF-16- or UTF-32-encoded payload cannot smuggle a DOCTYPE past
    the ASCII-only scanner. Falls back to UTF-8 with replacement when
    no BOM or leading null bytes hint at a wider encoding.
    """
    if data[:4] in (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'):
        return data.decode('utf-32')
    if data[:2] in (b'\xfe\xff', b'\xff\xfe'):
        return data.decode('utf-16')
    if data[:3] == b'\xef\xbb\xbf':
        return data.decode('utf-8-sig')
    # No BOM: a leading or alternating null byte gives away UTF-16/32.
    if len(data) >= 4:
        if data[:2] == b'\x00\x00':
            return data.decode('utf-32-be', errors='replace')
        if data[2:4] == b'\x00\x00':
            return data.decode('utf-32-le', errors='replace')
        if data[0] == 0:
            return data.decode('utf-16-be', errors='replace')
        if data[1] == 0:
            return data.decode('utf-16-le', errors='replace')
    return data.decode('utf-8', errors='replace')


def _reject_doctype(data: bytes | str) -> None:
    """Raise ValueError if *data* declares a DOCTYPE.

    Accepts both ``bytes`` and ``str`` so callers don't have to encode
    upstream. Bytes are decoded with BOM/encoding detection so that a
    UTF-16 or UTF-32 encoded payload cannot evade the ASCII scanner.
    Empty / None inputs are passed through and handled by the
    downstream parser.
    """
    if data is None:
        return
    if isinstance(data, str):
        probe = data
    else:
        probe = _decode_bytes(bytes(data))
    if _DOCTYPE_RE.search(probe):
        raise ValueError(
            "XML input contains a DOCTYPE declaration; this is refused "
            "to prevent XML entity expansion (billion-laughs) attacks. "
            "VRT and GDALMetadata payloads never need a DTD."
        )


def safe_fromstring(text: str | bytes):
    """Parse *text* into an ElementTree root, refusing DTDs / entities.

    Returns the parsed root :class:`xml.etree.ElementTree.Element`. Raises
    ``ValueError`` if the input declares a DOCTYPE, or whatever the
    underlying parser raises on malformed input (typically
    :class:`xml.etree.ElementTree.ParseError`).
    """
    _reject_doctype(text)
    try:
        # Prefer defusedxml when available -- it adds belt-and-braces
        # defences (external entity / network fetch / processing
        # instruction handling) on top of the DOCTYPE rejection.
        from defusedxml import ElementTree as _defused_ET
        return _defused_ET.fromstring(text)
    except ImportError:
        return _ET.fromstring(text)
8 changes: 6 additions & 2 deletions xrspatial/geotiff/_vrt.py
@@ -6,11 +6,12 @@
 from __future__ import annotations
 
 import os
-import xml.etree.ElementTree as ET
 from dataclasses import dataclass, field
 
 import numpy as np
 
+from ._safe_xml import safe_fromstring
+
 # Lazy imports to avoid circular dependency
 _DTYPE_MAP = {
     'Byte': np.uint8,
@@ -105,7 +106,10 @@ def parse_vrt(xml_str: str, vrt_dir: str = '.') -> VRTDataset:
     -------
     VRTDataset
     """
-    root = ET.fromstring(xml_str)
+    # ``safe_fromstring`` refuses DOCTYPE declarations so a crafted VRT
+    # cannot trigger XML entity expansion (billion-laughs) attacks
+    # against the reader. See issue #1579.
+    root = safe_fromstring(xml_str)
 
     width = int(root.get('rasterXSize', 0))
     height = int(root.get('rasterYSize', 0))
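The two call sites apply opposite policies: the GDALMetadata path falls back to an empty dict, while the VRT path lets the rejection propagate because the file is required input. A simplified re-sketch of `safe_fromstring` (str-only, no encoding detection, not the shipped implementation) makes the VRT side concrete:

```python
import re
import xml.etree.ElementTree as ET

_DOCTYPE_RE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)

def safe_fromstring(text):
    """Str-only re-sketch of the helper; no bytes/encoding handling."""
    if _DOCTYPE_RE.search(text):
        raise ValueError('XML input contains a DOCTYPE declaration')
    return ET.fromstring(text)

# VRT path: rejection is a hard error that surfaces to the caller.
hostile = '<!DOCTYPE v [<!ENTITY x "boom">]><VRTDataset/>'
try:
    safe_fromstring(hostile)
except ValueError as exc:
    print('refused:', exc)

# A clean VRT parses normally; attribute reads mirror parse_vrt's.
root = safe_fromstring('<VRTDataset rasterXSize="4" rasterYSize="2"/>')
print(int(root.get('rasterXSize', 0)), int(root.get('rasterYSize', 0)))
```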