Skip to content

Fix #2790: Add geometry deduplication to contours()#3001

Open
Melissari1997 wants to merge 3 commits into
xarray-contrib:mainfrom
Melissari1997:fix-2790-duplicate-contour-lines
Open

Fix #2790: Add geometry deduplication to contours()#3001
Melissari1997 wants to merge 3 commits into
xarray-contrib:mainfrom
Melissari1997:fix-2790-duplicate-contour-lines

Conversation

@Melissari1997
Copy link
Copy Markdown
Contributor

Closes #2790

Summary

contours() can return duplicate contour lines when raster cells exactly equal the contour level. This happens because:

  1. Dask backend: Overlapping chunk boundaries (via da.overlap.overlap with 1-cell halo) process the same 2x2 quads from adjacent chunks, producing duplicate segments that can stitch into identical polylines.
  2. NumPy backend: Saddle-case disambiguation on flat plateaus can emit the same polyline geometry through different code paths.

The existing code already guards against zero-length segments (_emit_seg line 217-218) and single-point polylines (_stitch_segments line 296), but lacked a final geometry-level deduplication step.

Solution

Added _deduplicate_lines(results) function that:

  • Groups results by contour level
  • Builds a canonical signature for each polyline (sorted tuple of rounded vertex coordinates)
  • Keeps only the first occurrence of each unique signature
  • Returns the deduplicated list

The function is called in contours() after the coordinate transform and before the return-type dispatch, ensuring all backends (numpy, cupy, dask+numpy, dask+cupy) and both output formats (list, GeoDataFrame) benefit from deduplication.

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 6, 2026
@Melissari1997 Melissari1997 marked this pull request as draft June 6, 2026 16:22
@Melissari1997
Copy link
Copy Markdown
Contributor Author

[Human] still in draft

@Melissari1997
Copy link
Copy Markdown
Contributor Author

PR Review: Fix #2790 - Add geometry deduplication to contours()

Blockers (must fix before merge)

None identified.

Suggestions (should fix, not blocking)

  • xrspatial/contour.py:776_deduplicate_lines runs on array-index coordinates, which is correct and consistent with DECIMALS=10 usage elsewhere in the module. However, the Dask backend already calls _deduplicate_by_level() internally (line 440, 472), which performs its own stitching and deduplication. The top-level _deduplicate_lines() call then runs a second time on the same data. For the Dask path, this is redundant work — the polylines from _deduplicate_by_level are already unique. Consider whether the top-level call should be conditional on backend, or whether _deduplicate_by_level should skip its own deduplication and rely on the single top-level call. Either approach would eliminate the double-pass.

  • xrspatial/contour.py:565-606 — The _deduplicate_lines function builds a segment signature for every polyline by iterating over all vertices. For long contour lines with many segments, this is O(n²) per line (sorting the segment list). If a single level produces many long polylines, this could be slow. Consider whether the signature can be built incrementally or whether a hash-based approach (e.g., hashing the sorted segment tuple) would be faster for large inputs.

Nits (optional improvements)

  • xrspatial/tests/test_contour.py:1131,1141,1149 — The three TestDeduplicateLines tests each do from xrspatial.contour import _deduplicate_lines inside the test method. Since this is a test module, a top-level import would be cleaner and consistent with how _stitch_segments is imported at line 7.

  • xrspatial/tests/test_contour.py:431_collect_segments in TestBackendEquivalence also has an inline from collections import defaultdict. Now that defaultdict is imported at the top level (line 5), this inline import is redundant and can be removed.

What looks good

  • The _deduplicate_lines function is well-documented with a clear docstring explaining the canonical segment comparison approach.
  • Deduplication runs on array-index coordinates (before np.interp transform), which is consistent with how DECIMALS=10 is used in _stitch_segments and _remove_duplicate_segments.
  • The reverse-duplicate handling (direction-independent comparison via min(p0, p1), max(p0, p1)) is correct and matches the approach in _remove_duplicate_segments.
  • Tests cover numpy, dask, and geopandas return types for the plateau regression case.
  • The _assert_no_duplicate_lines helper mirrors the same canonical segment logic used in the implementation, making it a faithful regression check.
  • All 52 tests pass.

Checklist

  • Algorithm matches reference/paper — marching squares with deduplication is standard
  • All implemented backends produce consistent results — numpy and dask tested; cupy delegates to numpy
  • NaN handling is correct — deduplication operates on stitched polylines, NaN cells are skipped during extraction
  • Edge cases are covered by tests — plateau, checkerboard, genuine contours all tested
  • Dask chunk boundaries handled correctly — _deduplicate_by_level already handles this; top-level call is a safety net
  • No premature materialization or unnecessary copies — _deduplicate_lines iterates results in-place, no unnecessary copies
  • Benchmark exists or is not needed — deduplication is O(n) per level with small constant overhead
  • README feature matrix updated — N/A, this is an internal fix to existing contours() function
  • Docstrings present and accurate — _deduplicate_lines has a clear docstring
  • No breaking changes without documentation — behavior is strictly additive (removes duplicates that shouldn't have been there)

CI Status

Unable to check via gh pr checks (CLI not authenticated). Local test run: 52 passed, 23 skipped (optional deps: cupy, geopandas, skimage).

When raster cells exactly equal the contour level, overlapping chunk
boundaries (dask) or saddle-case disambiguation (numpy) can emit
duplicate polylines. Add _deduplicate_lines() to remove identical
geometries before returning results, ensuring all backends produce
unique contour lines.

Added tests for numpy, dask, and geopandas backends to verify no
duplicate geometries are returned.
When raster cells exactly equal the contour level, overlapping chunk
boundaries (dask) or saddle-case disambiguation (numpy) can emit
duplicate polylines. Add _deduplicate_lines() to remove identical
geometries before returning results, ensuring all backends produce
unique contour lines.

Changes:
- Add _deduplicate_lines() function with canonical segment comparison
  (direction-independent, rounded to 10 decimals)
- Move deduplication before coordinate transform (array-index space)
- Move defaultdict to top-level import in contour.py and test_contour.py
- Add plateau regression tests for numpy, dask, and geopandas backends
- Add 3 unit tests for _deduplicate_lines (exact dupes, different levels,
  reverse dupes)

Review notes:
- No blockers identified
- All 52 tests pass (23 skipped for optional deps: cupy, geopandas, skimage)
- Deduplication runs on array-index coordinates, consistent with DECIMALS=10
  usage in _stitch_segments and _remove_duplicate_segments
- Dask backend already calls _deduplicate_by_level() internally; top-level
  _deduplicate_lines() is a safety net for all backends
@Melissari1997 Melissari1997 force-pushed the fix-2790-duplicate-contour-lines branch from 749a2e1 to ec4b994 Compare June 6, 2026 20:37
@Melissari1997 Melissari1997 marked this pull request as ready for review June 6, 2026 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

contours() emits degenerate and duplicate lines for exact-level plateaus

1 participant