Skip to content

Block XML entity expansion in VRT and GDALMetadata parsers (#1579)#1584

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-security-geotiff-2026-05-11-01
May 11, 2026
Merged

Block XML entity expansion in VRT and GDALMetadata parsers (#1579)#1584
brendancol merged 2 commits into
mainfrom
deep-sweep-security-geotiff-2026-05-11-01

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #1579.

Summary

  • xml.etree.ElementTree.fromstring in _vrt.parse_vrt and _geotags._parse_gdal_metadata expanded internal XML entities by default. A crafted .vrt file or a hostile GDALMetadata tag (42112) inside a TIFF could OOM the host via billion-laughs (CWE-776).
  • New xrspatial/geotiff/_safe_xml.safe_fromstring refuses any input declaring a <!DOCTYPE ...> (the only path to entity expansion in stdlib XML) and prefers defusedxml when available.
  • VRT/GDALMetadata payloads never need a DTD in legitimate files, so the change is loss-free for real-world inputs.
  • GDALMetadata is non-essential metadata, so its branch falls through to an empty dict on rejection (matching the existing ParseError handling). VRT raises ValueError because the entire input is the payload.

Reproducer (pre-fix)

from xrspatial.geotiff._vrt import parse_vrt
xml = '''<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
]>
<VRTDataset rasterXSize="4" rasterYSize="4">
  <VRTRasterBand dataType="Float32" band="1">
    <Description>&lol4;</Description>
  </VRTRasterBand>
</VRTDataset>'''
parse_vrt(xml, '.')  # silently expanded &lol4; into 10,000+ copies before this change

Test plan

  • pytest xrspatial/geotiff/tests/test_security.py -q -> 39 passed (30 baseline + 9 new).
  • Full geotiff test suite passes apart from three pre-existing test_features.py::TestPalette failures unrelated to this change (matplotlib deepcopy recursion on Python 3.14 / matplotlib path objects).
  • State file .claude/sweep-security-state.csv updated with the geotiff sweep row.

`xml.etree.ElementTree.fromstring` expands internal entities by default,
so a crafted `.vrt` file or a hostile `GDALMetadata` tag (42112) in a
TIFF could trigger billion-laughs DoS against any caller of
`open_geotiff` or `read_vrt`. Route both parsers through a new
`_safe_xml.safe_fromstring` helper that refuses DOCTYPE declarations
(the only path to entity expansion in stdlib XML) and prefers
`defusedxml` when installed. VRT and GDALMetadata payloads never carry
DTDs in legitimate files, so this is loss-free for real inputs.

The GDALMetadata branch falls through to an empty dict on rejection
(matching the existing ParseError fallback for non-essential metadata);
the VRT branch raises ValueError because the entire file is the
payload.

Test coverage: 9 new tests under TestVRTXMLEntityExpansion,
TestGDALMetadataXMLEntityExpansion, and TestSafeXMLHelper exercise the
DOCTYPE rejection path, the end-to-end `open_geotiff(.vrt)` route, and
backwards-compat with well-formed inputs.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 11, 2026
@brendancol brendancol requested a review from Copilot May 11, 2026 14:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR mitigates XML entity-expansion (billion-laughs, CWE-776) denial-of-service risks in the GeoTIFF VRT and GDALMetadata parsing paths by introducing a shared “safe XML” helper and adding regression tests to ensure DOCTYPE-based payloads are rejected.

Changes:

  • Add xrspatial/geotiff/_safe_xml.safe_fromstring, rejecting <!DOCTYPE ...> and preferring defusedxml when available.
  • Route VRT parsing (_vrt.parse_vrt) and GDALMetadata parsing (_geotags._parse_gdal_metadata) through the safe helper, with VRT raising and GDALMetadata dropping metadata on rejection.
  • Add security-focused tests covering VRT/GDALMetadata DOCTYPE rejection and the helper’s behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
xrspatial/geotiff/_safe_xml.py Introduces centralized safe XML parsing (DOCTYPE rejection + optional defusedxml).
xrspatial/geotiff/_vrt.py Switches VRT parsing to use the safe XML helper.
xrspatial/geotiff/_geotags.py Switches GDALMetadata XML parsing to use the safe XML helper and drops metadata on rejection.
xrspatial/geotiff/tests/test_security.py Adds regression tests ensuring DOCTYPE payloads are rejected and normal payloads still parse.
.claude/sweep-security-state.csv Records the security sweep/audit status update for the geotiff module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_safe_xml.py
Comment thread xrspatial/geotiff/_safe_xml.py Outdated
Comment thread xrspatial/geotiff/_vrt.py
Comment on lines 8 to +15
import os
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

import numpy as np

from ._safe_xml import safe_fromstring

Comment thread xrspatial/geotiff/tests/test_security.py Outdated
- _safe_xml: decode bytes via BOM / leading-null detection before
  scanning for DOCTYPE so UTF-16 / UTF-32 payloads can't bypass the
  ASCII regex.
- _safe_xml: align module docstring with actual fallback
  (ET.fromstring, not XMLParser).
- _vrt: drop unused `xml.etree.ElementTree as ET` import.
- test_security: tighten DOCTYPE-rejection assertion to `== {}` and
  add 7 parametrized wide-encoding (UTF-16 / UTF-32 / UTF-8-BOM)
  cases.
@brendancol
Copy link
Copy Markdown
Contributor Author

Thanks @copilot — all four comments were valid. Pushed 8d1a563:

  1. UTF-16 / UTF-32 bypass — replaced the raw byte regex with a BOM- and leading-null-aware decode step, so wide-encoded payloads are normalized to text before the DOCTYPE scan. Added 7 parametrized regression tests (utf-16, utf-16-le, utf-16-be, utf-32, utf-32-le, utf-32-be, utf-8-sig) confirming the bypass is closed.
  2. Docstring drift — bullet now reads "falls back to xml.etree.ElementTree.fromstring" to match the implementation.
  3. Unused ET import in _vrt.py — removed.
  4. Weak GDALMetadata test assertion — tightened to assert result == {}.

Security suite: 46 passed (was 39).

@brendancol brendancol merged commit cab63f0 into main May 11, 2026
10 of 12 checks passed
Copilot stopped work on behalf of brendancol due to an error May 11, 2026 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: XML entity expansion (billion-laughs) in VRT/GDALMetadata parsers

2 participants