diff --git a/.claude/sweep-security-state.csv b/.claude/sweep-security-state.csv
index c91c1c74..00517e27 100644
--- a/.claude/sweep-security-state.csv
+++ b/.claude/sweep-security-state.csv
@@ -18,7 +18,7 @@ fire,2026-04-25,,,,,"Clean. Despite the module's size hint, fire.py is purely pe
 flood,2026-05-03,1437,MEDIUM,3,,Re-audit 2026-05-03. MEDIUM Cat 3 fixed in PR #1438 (travel_time and flood_depth_vegetation now validate mannings_n DataArray values are finite and strictly positive via _validate_mannings_n_dataarray helper). No remaining unfixed findings. Other categories clean: every allocation is same-shape as input; no flat index math; NaN propagation explicit in every backend; tan_slope clamped by _TAN_MIN; no CUDA kernels; no file I/O; every public API calls _validate_raster on DataArray inputs.
 focal,2026-04-27,1284,HIGH,1,,"HIGH (fixed PR #1286): apply(), focal_stats(), and hotspots() accepted unbounded user-supplied kernels via custom_kernel(), which only checks shape parity. The kernel-size guard from #1241 (_check_kernel_memory) only ran inside circle_kernel/annulus_kernel, so a (50001, 50001) custom kernel on a 10x10 raster allocated ~10 GB on the kernel itself plus a much larger padded raster before any work -- same shape as the bilateral DoS in #1236. Fixed by adding _check_kernel_vs_raster_memory in focal.py and wiring it into apply(), focal_stats(), and hotspots() after custom_kernel() validation. All 134 focal tests + 19 bilateral tests pass. No other findings: 10 CUDA kernels all have proper bounds + stencil guards; _validate_raster called on every public entry point; hotspots already raises ZeroDivisionError on constant-value rasters; _focal_variety_cuda uses a fixed-size local buffer (silent truncation but bounded); _focal_std_cuda/_focal_var_cuda clamp the catastrophic-cancellation case via if var < 0.0: var = 0.0; no file I/O."
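The focal row above turns on a kernel-vs-raster memory pre-check. The real `_check_kernel_vs_raster_memory` from PR #1286 is not part of this diff, so the following is only a sketch of the technique; the signature, the float32 itemsize, the padding arithmetic, and the 50%-of-available-RAM threshold are assumptions inferred from the audit notes, not the project's actual code:

```python
def check_kernel_vs_raster_memory(kernel_shape, raster_shape, available_bytes,
                                  itemsize=4, budget_fraction=0.5):
    """Refuse focal operations whose kernel + padded raster cannot fit.

    A (kh, kw) kernel forces the raster to be padded by kh//2 / kw//2 on
    each side before convolution, so the dominant allocations are the
    kernel buffer itself plus the padded copy of the raster.
    """
    kh, kw = kernel_shape
    rh, rw = raster_shape
    kernel_bytes = kh * kw * itemsize
    padded_bytes = (rh + 2 * (kh // 2)) * (rw + 2 * (kw // 2)) * itemsize
    required = kernel_bytes + padded_bytes
    if required > budget_fraction * available_bytes:
        raise MemoryError(
            f"focal kernel {kh}x{kw} on raster {rh}x{rw} needs "
            f"~{required / 2**30:.1f} GiB; refusing before allocation"
        )
    return required
```

Under these assumptions the #1284 scenario is caught up front: a (50001, 50001) float32 kernel on a 10x10 raster asks for roughly 10 GB for the kernel plus another ~10 GB for the padded copy, which trips the guard on any machine with 16 GiB of RAM, while ordinary kernel sizes pass through untouched.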
 geodesic,2026-04-27,1283,HIGH,1,,"HIGH (fixed PR #1285): slope(method='geodesic') and aspect(method='geodesic') stack a (3, H, W) float64 array (data, lat, lon) before dispatch with no memory check. A large lat/lon-tagged raster passed to either function would OOM. Fixed by adding _check_geodesic_memory(rows, cols) in xrspatial/geodesic.py (mirrors morphology._check_kernel_memory): budgets 56 bytes/cell (24 stacked float64 + 4 float32 output + 24 padded copy + slack) and raises MemoryError when > 50% of available RAM; called from slope.py and aspect.py inside the geodesic branch before dispatch. No other findings: 6 CUDA kernels all have bounds guards (e.g. _run_gpu_geodesic_aspect at geodesic.py:395), custom 16x16 thread blocks avoid register spill, no shared memory, _validate_raster runs upstream in slope/aspect, all backends cast to float32, slope_mag < 1e-7 flat threshold prevents arctan2 NaN propagation, curvature correction uses hardcoded WGS84 R."
-geotiff,2026-05-10,,,,,"Pass 11 (2026-05-10): CLEAN. One new commit since pass 10: #1559 (centralise GeoTIFF attrs population via _populate_attrs_from_geo_info helper). Pure metadata-handling refactor that dedupes attrs population across the four read paths (eager numpy, dask, GPU stripped, GPU tiled). No new allocations, no file I/O changes, no kernel changes, no dtype handling changes. Helper only writes to in-memory attrs dict from validated geo_info fields already produced by the header parser. Re-verified all prior guards intact: MAX_PIXELS_DEFAULT=1e9 (_reader.py:46), MAX_IFD_ENTRY_COUNT=100_000 (_header.py:34), MAX_IFDS=256 (_header.py:45), MAX_TILE_BYTES_DEFAULT=256 MiB (_reader.py:68), _MAX_DASK_CHUNKS=50_000 (__init__.py:1428), realpath canonicalisation in _reader.py:161 and _vrt.py:160. Cat 1-6 all clean. No PRs opened. | Pass 10 (2026-05-10): CLEAN. Audited 8 commits since pass 9: #1558 (drop nodata copies - buffers uniquely owned), #1556 (fp-predictor ngjit row loop - bit-exact), #1552 (batched GDS-fallback D2H - _check_gpu_memory guards new staging), #1551 (parallel decode at tile_size=256 - pool bounded by min(n_tiles, cpu_count), mmap thread-safe read-only), #1549 (nvjpeg constants - actually closed CRITICAL: prior off-by-two passed PLANAR where INTERLEAVED expected, dereferenced NULL channel ptrs), #1542 (GPU nodata mask - called once per path, stripped fallback returns early), #1543 (combined backend tests-only), #1544 (__all__/deprecation no security surface). Re-verified guards intact: MAX_PIXELS_DEFAULT, MAX_IFD_ENTRY_COUNT, MAX_IFDS, MAX_TILE_BYTES_DEFAULT (256 MiB), _MAX_DASK_CHUNKS, _check_gpu_memory, _MmapCache TOCTOU, VRT realpath, LZW (sp<4096) and Deflate (overflow array) bounds. No issues opened. | Pass 2 (2026-05-09): Found unbounded compressed-byte-count fetch in HTTP COG path (_fetch_decode_cog_http_tiles passes attacker-controlled TileByteCounts to read_ranges_coalesced). Verified exploit: 4 tiles x 100 MB byte_count -> single 100 MB Range GET. Fixed via per-tile cap (MAX_TILE_BYTES_DEFAULT=256 MiB, XRSPATIAL_COG_MAX_TILE_BYTES env override). Other recent PRs (#1530 IFD chain cap, #1527 IFD count, #1535 LERC mask, #1532 PlanarConfig=2, #1534 HTTP coalesce, #1531 parallel write, #1528 nvCOMP batch) audited clean. | Prior: Re-audit 2026-04-28: only #7984f7a since prior pass (predictor=3 multi-band CPU fix, LERC dedup, VRT nodata dtype, AREA_OR_POINT, BigTIFF LONG8 offsets). LZW/inflate GPU kernels gate every write on out_posint casts). Decompression-bomb guards already enforced via _DECOMPRESS_MARGIN=1.05 cap in _compression.deflate_decompress."
 glcm,2026-04-24,1257,HIGH,1,,"HIGH (fixed #1257): glcm_texture() validated window_size only as >= 3 and distance only as >= 1, with no upper bound on either. _glcm_numba_kernel iterates range(r-half, r+half+1) for every pixel, so window_size=1_000_001 on a 10x10 raster ran ~10^14 loop iterations with all neighbors failing the interior bounds check (CPU DoS). On the dask backends depth = window_size // 2 + distance drove map_overlap padding, so a huge window also caused oversize per-chunk allocations (memory DoS). Fixed by adding max_val caps in the public entrypoint: window_size <= max(3, min(rows, cols)) and distance <= max(1, window_size // 2). One cap covers every backend because cupy and dask+cupy call through to the CPU kernel after cupy.asnumpy. No other HIGH findings: levels is already capped at 256 so the per-pixel np.zeros((levels, levels)) matrix in the kernel is bounded to 512 KB. No CUDA kernels. No file I/O. Quantization clips to [0, levels-1] before the kernel and NaN maps to -1 which the kernel filters with i_val >= 0. Entropy log(p) and correlation p / (std_i * std_j) are both guarded. All four backends use _validate_raster and cast to float64 before quantizing. MEDIUM (unfixed, Cat 1): the per-pixel np.zeros((levels, levels)) allocation inside the hot loop is a perf issue (levels=256 -> 512 KB alloc+free per pixel) but not a security issue because levels is bounded. Could be hoisted out of the loop or replaced with an in-place clear, but that is an efficiency concern, not security."
 gpu_rtx,2026-04-29,1308,HIGH,1,,"HIGH (fixed #1308 / PR #1310): hillshade_rtx (gpu_rtx/hillshade.py:184) and viewshed_gpu (gpu_rtx/viewshed.py:269) allocated cupy device buffers sized by raster shape with no memory check. create_triangulation (mesh_utils.py:23-24) adds verts (12 B/px) + triangles (24 B/px) = 36 B/px; hillshade_rtx adds d_rays(32) + d_hits(16) + d_aux(12) + d_output(4) = 64 B/px (100 B/px total); viewshed_gpu adds d_rays(32) + d_hits(16) + d_visgrid(4) + d_vsrays(32) = 84 B/px (120 B/px total). A 30000x30000 raster asked for 90-108 GB of VRAM before cupy surfaced an opaque allocator error. Fixed by adding gpu_rtx/_memory.py with _available_gpu_memory_bytes() and _check_gpu_memory(func_name, h, w) helpers (cost_distance #1262 / sky_view_factor #1299 pattern, 120 B/px budget covers worst case, raises MemoryError when required > 50% of free VRAM, skips silently when memGetInfo() unavailable). Wired into both entry points after the cupy.ndarray type check and before create_triangulation. 9 new tests in test_gpu_rtx_memory.py (5 helper-unit + 4 end-to-end gated on has_rtx). All 81 existing hillshade/viewshed tests still pass. Cat 4 clean: all CUDA kernels (hillshade.py:25/62/106, viewshed.py:32/74/116, mesh_utils.py:50) have bounds guards; no shared memory, no syncthreads needed. MEDIUM not fixed (Cat 6): hillshade_rtx and viewshed_gpu do not call _validate_raster directly but parent hillshade() (hillshade.py:252) and viewshed() (viewshed.py:1707) already validate, so input validation runs before the gpu_rtx entry point - defense-in-depth, not exploitable. MEDIUM not fixed (Cat 2): mesh_utils.py:64-68 cast mesh_map_index to int32 in the triangle index buffer; overflows at H*W > 2.1B vertices (~46341x46341+) but the new memory guard rejects rasters that large first - documentation/clarity item rather than exploitable. MEDIUM not fixed (Cat 3): mesh_utils.py:19 scale = maxDim / maxH divides by zero on an all-zero raster, propagating inf/NaN into mesh vertex z-coords; separate follow-up. LOW not fixed (Cat 5): mesh_utils.write() opens user-supplied path without canonicalization but its only call site (mesh_utils.py:38-39) sits behind if False: in create_triangulation, not reachable in production."
 hillshade,2026-04-27,,,,,"Clean. Cat 1: only allocation is the output np.empty(data.shape) at line 32 (cupy at line 165) and a _pad_array with hardcoded depth=1 (line 62) -- bounded by caller, no user-controlled amplifier. Azimuth/altitude are scalars and don't drive size. Cat 2: numba kernel uses range(1, rows-1) with simple (y, x) indexing; numba range loops promote to int64. Cat 3: math.sqrt(1.0 + xx_plus_yy) is always >= 1.0 (no neg sqrt, no div-by-zero); NaN elevation propagates correctly through dz_dx/dz_dy -> shaded -> output (the shaded < 0.0 / shaded > 1.0 clamps don't fire on NaN). Azimuth validated to [0, 360], altitude to [0, 90]. Cat 4: _gpu_calc_numba (line 107) guards both grid bounds and 3x3 stencil reads via i > 0 and i < shape[0]-1 and j > 0 and j < shape[1]-1; no shared memory. Cat 5: no file I/O. Cat 6: hillshade() calls _validate_raster (line 252) and _validate_scalar for both azimuth (253) and angle_altitude (254); all four backend paths cast to float32; tests parametrize int32/int64/float32/float64."
diff --git a/xrspatial/geotiff/_geotags.py b/xrspatial/geotiff/_geotags.py
index 401017e7..270d0330 100644
--- a/xrspatial/geotiff/_geotags.py
+++ b/xrspatial/geotiff/_geotags.py
@@ -170,9 +170,13 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
     Per-band items are stored as ``{(name, band_int): value}``.
     """
     import xml.etree.ElementTree as ET
+    from ._safe_xml import safe_fromstring
     result = {}
     try:
-        root = ET.fromstring(xml_str)
+        # GDALMetadata XML rides inside TIFF tag 42112; a crafted file
+        # can carry a billion-laughs payload there, so refuse DOCTYPEs
+        # before parsing. See issue #1579.
+        root = safe_fromstring(xml_str)
         for item in root.findall('Item'):
             name = item.get('name', '')
             sample = item.get('sample')
@@ -181,7 +185,11 @@ def _parse_gdal_metadata(xml_str: str) -> dict:
             if sample is not None:
                 result[(name, int(sample))] = text
             else:
                 result[name] = text
-    except ET.ParseError:
+    except (ET.ParseError, ValueError):
+        # ValueError surfaces from safe_fromstring when the payload
+        # carries a DOCTYPE. GDALMetadata is non-essential metadata, so
+        # we silently drop it rather than failing the whole read --
+        # matches the existing ParseError fallback.
         pass
     return result
diff --git a/xrspatial/geotiff/_safe_xml.py b/xrspatial/geotiff/_safe_xml.py
new file mode 100644
index 00000000..edaffbd4
--- /dev/null
+++ b/xrspatial/geotiff/_safe_xml.py
@@ -0,0 +1,108 @@
+"""Safe XML parsing helpers for VRT and GDALMetadata payloads.
+
+`xml.etree.ElementTree.fromstring` is built on top of expat and, by
+default, expands internal entities. A crafted VRT file or a hostile
+TIFF carrying a billion-laughs payload in the GDALMetadata tag (42112)
+can OOM the host process when read via the standard API (CWE-776,
+GHSA-class "XML Entity Expansion"). External entities (SYSTEM/PUBLIC)
+are already blocked by expat, but internal entity DoS is not.
+
+This module exposes a single :func:`safe_fromstring` helper that:
+
+* Rejects any input declaring a ``<!DOCTYPE`` (the only home for
+  ``<!ENTITY>`` definitions) before expat ever sees the bytes.
+* Falls back to :mod:`defusedxml.ElementTree` when that library is
+  installed, layering a second, audited defence (defusedxml also
+  blocks external entities, processing instructions on untrusted
+  input, and a few other XML pitfalls).
+* Otherwise falls back to :func:`xml.etree.ElementTree.fromstring`,
+  which is fine once DOCTYPEs are pre-rejected: expat exposes no
+  external-entity fetch by default, and without a DTD there is no
+  way to declare an entity that would expand to more than the
+  literal bytes.
+
+VRT and GDALMetadata XML never contain DTDs in legitimate files, so
+the pre-rejection is loss-free for real-world inputs.
+"""
+from __future__ import annotations
+
+import re
+import xml.etree.ElementTree as _ET
+
+# A DOCTYPE declaration is the only path to internal entity expansion in
+# stdlib XML, so refuse the whole input the moment we see one. The
+# regex is intentionally permissive on whitespace because lenient
+# parsers tolerate blanks between ``<!`` and ``DOCTYPE``.
+_DOCTYPE_RE = re.compile(r'<!\s*DOCTYPE', re.IGNORECASE)
+
+
+def _decode_bytes(data: bytes) -> str:
+    """Decode XML bytes to text using BOM / XML-declaration detection.
+
+    Mirrors the encoding-detection precedence expat would apply, so a
+    UTF-16- or UTF-32-encoded payload cannot smuggle a DOCTYPE past
+    the ASCII-only scanner. Falls back to UTF-8 with replacement when
+    no BOM or leading null bytes hint at a wider encoding.
+    """
+    if data[:4] in (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'):
+        return data.decode('utf-32')
+    if data[:2] in (b'\xfe\xff', b'\xff\xfe'):
+        return data.decode('utf-16')
+    if data[:3] == b'\xef\xbb\xbf':
+        return data.decode('utf-8-sig')
+    # No BOM: a leading or alternating null byte gives away UTF-16/32.
+    if len(data) >= 4:
+        if data[:2] == b'\x00\x00':
+            return data.decode('utf-32-be', errors='replace')
+        if data[2:4] == b'\x00\x00':
+            return data.decode('utf-32-le', errors='replace')
+        if data[0] == 0:
+            return data.decode('utf-16-be', errors='replace')
+        if data[1] == 0:
+            return data.decode('utf-16-le', errors='replace')
+    return data.decode('utf-8', errors='replace')
+
+
+def _reject_doctype(data: bytes | str) -> None:
+    """Raise ValueError if *data* declares a DOCTYPE.
+
+    Accepts both ``bytes`` and ``str`` so callers don't have to encode
+    upstream. Bytes are decoded with BOM/encoding detection so that a
+    UTF-16 or UTF-32 encoded payload cannot evade the ASCII scanner.
+    Empty / None inputs are passed through and handled by the
+    downstream parser.
+    """
+    if data is None:
+        return
+    if isinstance(data, str):
+        probe = data
+    else:
+        probe = _decode_bytes(bytes(data))
+    if _DOCTYPE_RE.search(probe):
+        raise ValueError(
+            "XML input contains a DOCTYPE declaration; this is refused "
+            "to prevent XML entity expansion (billion-laughs) attacks. "
+            "VRT and GDALMetadata payloads never need a DTD."
+        )
+
+
+def safe_fromstring(text: str | bytes):
+    """Parse *text* into an ElementTree root, refusing DTDs / entities.
+
+    Returns the parsed root :class:`xml.etree.ElementTree.Element`.
+    Raises ``ValueError`` if the input declares a DOCTYPE, or whatever
+    the underlying parser raises on malformed input (typically
+    :class:`xml.etree.ElementTree.ParseError`).
+    """
+    _reject_doctype(text)
+    try:
+        # Prefer defusedxml when available -- it adds belt-and-braces
+        # defences (external entity / network fetch / processing
+        # instruction handling) on top of the DOCTYPE rejection.
+        from defusedxml import ElementTree as _defused_ET
+        return _defused_ET.fromstring(text)
+    except ImportError:
+        return _ET.fromstring(text)
diff --git a/xrspatial/geotiff/_vrt.py b/xrspatial/geotiff/_vrt.py
index a3fe4f31..7afba86e 100644
--- a/xrspatial/geotiff/_vrt.py
+++ b/xrspatial/geotiff/_vrt.py
@@ -6,11 +6,12 @@
 from __future__ import annotations
 
 import os
-import xml.etree.ElementTree as ET
 from dataclasses import dataclass, field
 
 import numpy as np
 
+from ._safe_xml import safe_fromstring
+
 # Lazy imports to avoid circular dependency
 _DTYPE_MAP = {
     'Byte': np.uint8,
@@ -105,7 +106,10 @@ def parse_vrt(xml_str: str, vrt_dir: str = '.') -> VRTDataset:
     -------
     VRTDataset
     """
-    root = ET.fromstring(xml_str)
+    # ``safe_fromstring`` refuses DOCTYPE declarations so a crafted VRT
+    # cannot trigger XML entity expansion (billion-laughs) attacks
+    # against the reader. See issue #1579.
+    root = safe_fromstring(xml_str)
     width = int(root.get('rasterXSize', 0))
     height = int(root.get('rasterYSize', 0))
diff --git a/xrspatial/geotiff/tests/test_security.py b/xrspatial/geotiff/tests/test_security.py
index f3fb770c..b9929865 100644
--- a/xrspatial/geotiff/tests/test_security.py
+++ b/xrspatial/geotiff/tests/test_security.py
@@ -726,3 +726,158 @@ def test_local_path_unaffected_by_cap(self, tmp_path, monkeypatch):
         monkeypatch.setenv('XRSPATIAL_COG_MAX_TILE_BYTES', '1')
         result = open_geotiff(path)
         np.testing.assert_array_equal(result.values, arr)
+
+
+# ---------------------------------------------------------------------------
+# XML entity expansion (billion-laughs) -- issue #1579
+#
+# VRT and GDALMetadata payloads go through xml.etree.ElementTree, which by
+# default expands internal entities. A crafted file can OOM the host via
+# exponential entity expansion. ``_safe_xml.safe_fromstring`` refuses any
+# input carrying a DOCTYPE declaration, which is where entity definitions
+# live.
+# ---------------------------------------------------------------------------
+
+
+_BILLION_LAUGHS_PROLOGUE = (
+    '<?xml version="1.0"?>\n'
+    '<!DOCTYPE lolz [\n'
+    '  <!ENTITY lol "lol">\n'
+    '  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">\n'
+    '  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">\n'
+    '  <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">\n'
+    ']>\n'
+)
+
+
+class TestVRTXMLEntityExpansion:
+    """Issue #1579: parse_vrt refuses XML entity (billion-laughs) payloads."""
+
+    def test_parse_vrt_rejects_doctype(self, tmp_path):
+        """A VRT that declares a ``<!DOCTYPE``/``<!ENTITY>`` block is rejected outright."""
+        from xrspatial.geotiff._vrt import parse_vrt
+
+        payload = _BILLION_LAUGHS_PROLOGUE + (
+            '<VRTDataset rasterXSize="4" rasterYSize="4">\n'
+            '  <VRTRasterBand dataType="Byte" band="1">\n'
+            '    <Description>&lol4;</Description>\n'
+            '  </VRTRasterBand>\n'
+            '</VRTDataset>\n'
+        )
+
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            parse_vrt(payload, str(tmp_path))
+
+    def test_open_geotiff_vrt_rejects_doctype(self, tmp_path):
+        """Routing through the public open_geotiff entry still rejects DOCTYPEs."""
+        from xrspatial.geotiff import open_geotiff
+
+        payload = _BILLION_LAUGHS_PROLOGUE + (
+            '<VRTDataset rasterXSize="4" rasterYSize="4">\n'
+            '  <VRTRasterBand dataType="Byte" band="1">\n'
+            '    <Description>&lol4;</Description>\n'
+            '  </VRTRasterBand>\n'
+            '</VRTDataset>\n'
+        )
+        vrt_path = tmp_path / "bomb.vrt"
+        vrt_path.write_text(payload)
+
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            open_geotiff(str(vrt_path))
+
+    def test_normal_vrt_still_parses(self, tmp_path):
+        """A VRT without a DOCTYPE parses normally."""
+        from xrspatial.geotiff._vrt import parse_vrt
+
+        payload = (
+            '<VRTDataset rasterXSize="4" rasterYSize="4">\n'
+            '  <VRTRasterBand dataType="Byte" band="1">\n'
+            '  </VRTRasterBand>\n'
+            '</VRTDataset>\n'
+        )
+        vrt = parse_vrt(payload, str(tmp_path))
+        assert vrt.width == 4
+        assert vrt.height == 4
+
+
+class TestGDALMetadataXMLEntityExpansion:
+    """Issue #1579: _parse_gdal_metadata refuses entity-expansion payloads."""
+
+    def test_parse_gdal_metadata_doctype_returns_empty(self):
+        """A DOCTYPE in GDALMetadata yields an empty dict, not expansion.
+
+        GDALMetadata is non-essential auxiliary metadata, so the parser
+        silently drops malformed payloads rather than failing the whole
+        TIFF read. Crucially it must NOT expand the entity.
+        """
+        from xrspatial.geotiff._geotags import _parse_gdal_metadata
+
+        payload = _BILLION_LAUGHS_PROLOGUE + (
+            '<GDALMetadata><Item name="LAUGHS">&lol4;</Item></GDALMetadata>'
+        )
+        result = _parse_gdal_metadata(payload)
+        # safe_fromstring raises ValueError on DOCTYPE, which the
+        # parser catches and turns into an empty dict.
+        assert result == {}
+
+    def test_parse_gdal_metadata_normal_still_works(self):
+        """A well-formed GDALMetadata XML parses into a flat dict."""
+        from xrspatial.geotiff._geotags import _parse_gdal_metadata
+
+        payload = (
+            '<GDALMetadata>'
+            '<Item name="STATISTICS_MINIMUM">0</Item>'
+            '<Item name="STATISTICS_MAXIMUM" sample="0">255</Item>'
+            '</GDALMetadata>'
+        )
+        result = _parse_gdal_metadata(payload)
+        assert result.get('STATISTICS_MINIMUM') == '0'
+        assert result.get(('STATISTICS_MAXIMUM', 0)) == '255'
+
+
+class TestSafeXMLHelper:
+    """Direct unit tests for ``_safe_xml.safe_fromstring``."""
+
+    def test_rejects_bytes_doctype(self):
+        from xrspatial.geotiff._safe_xml import safe_fromstring
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            safe_fromstring(b'<!DOCTYPE foo [<!ENTITY a "b">]><root/>')
+
+    def test_rejects_str_doctype(self):
+        from xrspatial.geotiff._safe_xml import safe_fromstring
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            safe_fromstring('<!DOCTYPE foo [<!ENTITY a "b">]><root/>')
+
+    def test_rejects_doctype_with_whitespace(self):
+        """Whitespace between ``<!`` and ``DOCTYPE`` is still caught."""
+        from xrspatial.geotiff._safe_xml import safe_fromstring
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            safe_fromstring('<!  DOCTYPE foo><root/>')
+
+    @pytest.mark.parametrize("encoding", [
+        "utf-16",      # with BOM
+        "utf-16-le",   # no BOM, alternating nulls
+        "utf-16-be",   # no BOM, leading null
+        "utf-32",      # with BOM
+        "utf-32-le",
+        "utf-32-be",
+        "utf-8-sig",   # UTF-8 BOM
+    ])
+    def test_rejects_doctype_in_wide_encodings(self, encoding):
+        """A DOCTYPE encoded in UTF-16 / UTF-32 still gets rejected.
+
+        The original ASCII byte-regex would miss these because the
+        DOCTYPE bytes interleave null bytes. The decoder normalizes
+        first so wide-encoded payloads cannot smuggle a DTD past
+        the scanner.
+        """
+        from xrspatial.geotiff._safe_xml import safe_fromstring
+        payload = '<!DOCTYPE foo [<!ENTITY a "b">]><root/>'.encode(encoding)
+        with pytest.raises(ValueError, match="DOCTYPE"):
+            safe_fromstring(payload)
+
+    def test_parses_normal_xml(self):
+        from xrspatial.geotiff._safe_xml import safe_fromstring
+        root = safe_fromstring('<root><child>hi</child></root>')
+        assert root.tag == 'root'
+        assert root.find('child').text == 'hi'