Skip to content

Optimize xml_path and xml_find_all for large node sets#478

Open
astruebi wants to merge 2 commits into
r-lib:mainfrom
astruebi:codex/optimize-xml-path-nodesets
Open

Optimize xml_path and xml_find_all for large node sets#478
astruebi wants to merge 2 commits into
r-lib:mainfrom
astruebi:codex/optimize-xml-path-nodesets

Conversation

@astruebi
Copy link
Copy Markdown

@astruebi astruebi commented Jun 5, 2026

Closes #477

This PR improves performance for large node sets in two places:

  • xml_path() now caches reusable ancestor paths while processing an xml_nodeset, avoiding repeated parent walks and repeated construction of shared path prefixes.
  • xml_find_all.xml_node() avoids an unnecessary duplicate pass for single-context XPath node sets and materializes xml_node results with less per-node overhead.

The intended semantics are unchanged.

Tests added:

  • batch xml_path() output is identical to calling xml_path() on each node individually
  • single-context xml_find_all() XPath node sets remain unique even for union expressions

Implementation note:

Some repetition in the path occurrence logic is intentional. I tested a more compact shared abstraction for sibling occurrence counting, but it made the hot loop noticeably slower for large node sets. The current version keeps those cases explicit to preserve the performance gain while matching xmlGetNodePath() semantics.

Local checks:

  • testthat::test_dir("tests/testthat"): PASS
  • R CMD build --no-manual --no-build-vignettes .: OK
  • _R_CHECK_FORCE_SUGGESTS_=false R CMD check --no-manual --no-build-vignettes xml2_1.5.2.tar.gz: OK, with only vignette warnings due to --no-build-vignettes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize xml_path() and xml_find_all() for large node sets

1 participant