GEOPY-2781: Inversion stalls on tiling for large problems during redistribution of clusters#365
GEOPY-2781: Inversion stalls on tiling for large problems during redistribution of clusters#365domfournier wants to merge 4 commits intodevelopfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the location-tiling behavior used to partition survey locations into tiles, apparently to remove the previous “even population” rebalancing step (which relied on linear_sum_assignment) and to adjust tests accordingly.
Changes:
- Simplifies
tile_locations()to always use raw KMeans cluster labels (removing the redistribution/balancing step). - Disables the tile-population balancing test by commenting it out and adding TODO notes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
tests/locations_test.py |
Comments out the population-balancing test for tile_locations() and adds TODO notes about a future scalable balancing approach. |
simpeg_drivers/utils/nested.py |
Removes the Hungarian-assignment-based rebalancing logic; tile_locations() now returns KMeans labels directly. |
Comments suppressed due to low confidence (1)
simpeg_drivers/utils/nested.py:545
- When
sortingis provided,grid_locsis permuted before fitting KMeans, but the returned tile indices are positions in that permuted array. Downstream slicing (e.g.,create_survey()filterssurvey.ordering[:, 2]against the provided indices) expects receiver IDs in the original indexing used byordering(often the geoh5/receiver index, not the permuted position). Please map the clustered indices back throughsortingbefore returning (or avoid permutinggrid_locsand instead pass weights/ordering differently), so tiles reference the same index space assurvey.ordering.
cluster_id = kmeans.labels_
tiles = []
for tid in set(cluster_id):
tiles += [np.where(cluster_id == tid)[0]]
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| kmeans = KMeans(n_clusters=n_tiles, random_state=0, n_init="auto") | ||
| cluster_size = int(np.ceil(grid_locs.shape[0] / n_tiles)) | ||
| kmeans.fit(grid_locs) | ||
|
|
||
| if labels is not None: | ||
| cluster_id = kmeans.labels_ | ||
| else: | ||
| # Redistribute cluster centers to even out the number of points | ||
| centers = kmeans.cluster_centers_ | ||
| centers = ( | ||
| centers.reshape(-1, 1, grid_locs.shape[1]) | ||
| .repeat(cluster_size, 1) | ||
| .reshape(-1, grid_locs.shape[1]) | ||
| ) | ||
| distance_matrix = cdist(grid_locs, centers) | ||
| cluster_id = linear_sum_assignment(distance_matrix)[1] // cluster_size | ||
| cluster_id = kmeans.labels_ |
There was a problem hiding this comment.
This change removes the linear_sum_assignment / cdist balancing logic, but the module still imports linear_sum_assignment and cdist at the top-level. If this repo enforces flake8/pylint for unused imports, this will start failing CI; please remove the unused imports (or keep the balancing implementation behind an optional flag).
| # ) | ||
| # tiles = tile_locations(pts.vertices[:, :2], n_tiles=8) | ||
| # | ||
| # values = np.zeros(pts.n_vertices) | ||
| # pop = [] | ||
| # for ind, tile in enumerate(tiles): | ||
| # values[tile] = ind | ||
| # pop.append(len(tile)) | ||
| # | ||
| # pts.add_data( | ||
| # { | ||
| # "values": { | ||
| # "values": values, | ||
| # } | ||
| # } | ||
| # ) | ||
| # assert np.std(pop) / np.mean(pop) < 0.02, ( | ||
| # "Population of tiles are not almost equal {}." | ||
| # ) | ||
|
|
||
|
|
There was a problem hiding this comment.
The test_tile_locations test has been fully commented out, which removes coverage for tile_locations() when labels is not provided (and makes it easy to forget the intended behavior). Instead of leaving a large commented block, consider keeping the test enabled but updating the assertion to match the new intended tiling guarantees (e.g., all indices are covered exactly once and tiles are reasonably balanced), or mark it @pytest.mark.xfail/skip with a reason until an alternative algorithm is implemented.
| # ) | |
| # tiles = tile_locations(pts.vertices[:, :2], n_tiles=8) | |
| # | |
| # values = np.zeros(pts.n_vertices) | |
| # pop = [] | |
| # for ind, tile in enumerate(tiles): | |
| # values[tile] = ind | |
| # pop.append(len(tile)) | |
| # | |
| # pts.add_data( | |
| # { | |
| # "values": { | |
| # "values": values, | |
| # } | |
| # } | |
| # ) | |
| # assert np.std(pop) / np.mean(pop) < 0.02, ( | |
| # "Population of tiles are not almost equal {}." | |
| # ) | |
| def test_tile_locations(): | |
| n_points = 1000 | |
| rng = np.random.default_rng(0) | |
| locations = rng.standard_normal((n_points, 2)) | |
| tiles = tile_locations(locations, n_tiles=8) | |
| # All indices should be covered exactly once across tiles | |
| all_indices = np.concatenate(tiles) | |
| assert np.array_equal(np.sort(all_indices), np.arange(n_points)) | |
| # Tiles should be reasonably balanced in population | |
| pop = np.array([len(tile) for tile in tiles]) | |
| assert pop.min() > 0 | |
| assert np.std(pop) / np.mean(pop) < 0.5 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #365 +/- ##
===========================================
- Coverage 90.78% 90.75% -0.03%
===========================================
Files 112 112
Lines 6389 6383 -6
Branches 787 786 -1
===========================================
- Hits 5800 5793 -7
Misses 405 405
- Partials 184 185 +1
🚀 New features to boost your workflow:
|
GEOPY-2781 - Inversion stalls on tiling for large problems during redistribution of clusters