fix: Prevent and clean up zero-valued entries in Alpha storage maps during hotkey and coldkey swaps#2513

Open
bittoby wants to merge 4 commits into opentensor:devnet-ready from bittoby:fix/remove-zero-alpha-entries

Conversation

@bittoby

@bittoby bittoby commented Mar 18, 2026

Description

Prevent and clean up zero-valued entries in Alpha storage maps during hotkey and coldkey swaps.

Coldkey and hotkey swap logic was inserting zero-valued entries into Alpha, TotalHotkeyShares, TotalHotkeyAlphaLastEpoch, and AlphaDividendsPerSubnet without checking if the value was zero first. These entries persist indefinitely and bloat storage - a single coldkey swap could produce ~18k zero entries, and roughly 52% of Alpha entries on-chain are zero-valued.

This PR adds zero-guards to all insertion points in swap_coldkey.rs and swap_hotkey.rs, and includes a one-time migration to purge existing zero entries from the four affected storage maps.
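The zero-guard added at each insertion point can be sketched in plain Rust. This is a hypothetical illustration only: a `BTreeMap` stands in for the on-chain storage map, and the key/value types are made up; the real code would call `insert`/`remove` on the Substrate storage maps themselves.

```rust
use std::collections::BTreeMap;

// Stand-in for a storage map keyed by (hotkey, coldkey, netuid) -- hypothetical types.
type AlphaMap = BTreeMap<(u64, u64, u16), u64>;

// Insert a value, but remove the key instead when the value is zero,
// so zero-valued entries never persist in the map.
fn insert_or_remove(map: &mut AlphaMap, key: (u64, u64, u16), value: u64) {
    if value == 0 {
        map.remove(&key);
    } else {
        map.insert(key, value);
    }
}
```

The same insert-or-remove shape would apply at each guarded insertion point across the four affected maps.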

Related Issue(s)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Breaking Change

No breaking changes.

Checklist

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have run ./scripts/fix_rust.sh to ensure my code is formatted and linted correctly
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Additional Notes

The migration collects the full Alpha map into memory before removing zero entries. This is intentional - mutating a Substrate storage map while iterating over it is unsafe. The other three maps are smaller and iterated directly. All four maps use HasMigrationRun to ensure idempotency.
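The collect-then-remove pattern described above can be sketched as follows (plain Rust, with a `BTreeMap` standing in for the storage map; names and types are hypothetical):

```rust
use std::collections::BTreeMap;

type AlphaMap = BTreeMap<(u64, u64, u16), u64>;

// Collect the keys of zero-valued entries first, then remove them in a
// second pass, so the map is never mutated while it is being iterated.
fn purge_zero_entries(map: &mut AlphaMap) -> usize {
    let zero_keys: Vec<(u64, u64, u16)> = map
        .iter()
        .filter(|(_, v)| **v == 0)
        .map(|(k, _)| *k)
        .collect();
    for key in &zero_keys {
        map.remove(key);
    }
    zero_keys.len()
}
```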

@bittoby
Author

bittoby commented Mar 18, 2026

@open-junius Could you please review this PR? Any feedback is welcome.
Thank you!

@open-junius
Contributor

Please change the target branch to devnet-ready, not main.

@bittoby bittoby changed the base branch from main to devnet-ready March 19, 2026 02:47
@bittoby bittoby changed the base branch from devnet-ready to main March 19, 2026 03:17
@bittoby bittoby changed the base branch from main to devnet-ready March 19, 2026 04:36
@bittoby bittoby closed this Mar 19, 2026
@bittoby bittoby force-pushed the fix/remove-zero-alpha-entries branch from 8cd929a to 369a319 Compare March 19, 2026 04:37
@bittoby bittoby reopened this Mar 19, 2026
@bittoby
Author

bittoby commented Mar 19, 2026

I changed the target branch. Thanks! Ready for review.

@open-junius
Contributor

Please check https://github.com/opentensor/subtensor/pull/2509/changes, which also prevents some zero-value inserts. Could you include those storage maps in this PR?

@bittoby
Author

bittoby commented Mar 19, 2026

@open-junius Thanks for your feedback! I've updated everything per your feedback. Could you please review again?

@open-junius open-junius added the skip-cargo-audit This PR fails cargo audit but needs to be merged anyway label Mar 20, 2026
@open-junius
Contributor

Please merge the latest devnet-ready; that will fix the e2e test.

@open-junius open-junius mentioned this pull request Mar 20, 2026
13 tasks
@bittoby
Author

bittoby commented Mar 20, 2026

@open-junius I merged the latest devnet-ready. Could you take a look once more?
Thank you!

@bittoby
Author

bittoby commented Mar 20, 2026

@open-junius All e2e tests passed, thank you. The benchmark testing issue is not related to my change.

@bittoby
Author

bittoby commented Mar 23, 2026

@open-junius I hope you had a great weekend. Any updates for me?

@bittoby bittoby requested a review from open-junius March 23, 2026 12:32
open-junius previously approved these changes Mar 23, 2026
@bittoby
Author

bittoby commented Mar 24, 2026

@evgeny-s @open-junius

Converting to SteppedMigration would require adding pallet-migrations to the runtime, since MultiBlockMigrator is currently (). That's a bigger scope change.

I think the cleanest approach is to drop the migration from this PR and just ship the zero-value insert guards (which prevent new zeros from being created). The purge of existing zeros can be handled properly in a separate PR with the right multi-block infrastructure. I’m planning to take care of that follow-up PR too after this one.

Let me know what you'd prefer.

@bittoby
Author

bittoby commented Mar 24, 2026

@evgeny-s @open-junius did you check my description? Do I need to update my PR?


// Set the new root claimed value.
RootClaimed::<T>::insert((netuid, hotkey, coldkey), new_root_claimed);
// Set the new root claimed value, removing if zero to avoid storage bloat.
Collaborator


The comment is saying "removing", but I don't see it here.

Collaborator


But below we remove it. Fix it for consistency.

@evgeny-s
Collaborator

@evgeny-s @open-junius did you check my description? Do I need to update my PR?

I would rather keep it with the migration. Please test it first on the mainnet clone, and use a multi-block migration if needed. Let's see other opinions in this regard.
Also, it would be great to add several TS tests for this PR.

@bittoby
Author

bittoby commented Mar 24, 2026

@evgeny-s @open-junius did you check my description? Do I need to update my PR?

I would rather keep it with the migration. Please test it first on the mainnet clone, and use a multi-block migration if needed. Let's see other opinions in this regard. Also, it would be great to add several TS tests for this PR.

I absolutely agree with you. If possible, I want to keep the migration in this PR, too. I am testing locally now; after finishing, I will share the results here. Thanks @evgeny-s

@bittoby
Author

bittoby commented Mar 25, 2026

@evgeny-s

Updated:

  • Converted migration to multi-block on_idle approach - on_runtime_upgrade is O(1), cleanup runs one map per block across 4 phases
  • Fixed claim_root.rs - added missing else { remove } branch for consistency
  • Tested against Finney mainnet clone via build-patched-spec - node runs and produces blocks normally with the migration code
  • Added TS e2e test for zero-alpha prevention
  • Reverted unrelated check-node-compat.yml change
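The one-map-per-block phasing might look roughly like the following sketch. The enum name and variants are illustrative stand-ins for the four affected storage maps, not the PR's actual types.

```rust
// Hypothetical phase marker: which storage map is cleaned in the current block.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CleanupPhase {
    Alpha,
    TotalHotkeyShares,
    TotalHotkeyAlphaLastEpoch,
    AlphaDividendsPerSubnet,
    Done,
}

// Advance to the next phase once the current map has been fully purged,
// mirroring "one map per block across 4 phases".
fn next_phase(phase: CleanupPhase) -> CleanupPhase {
    use CleanupPhase::*;
    match phase {
        Alpha => TotalHotkeyShares,
        TotalHotkeyShares => TotalHotkeyAlphaLastEpoch,
        TotalHotkeyAlphaLastEpoch => AlphaDividendsPerSubnet,
        AlphaDividendsPerSubnet | Done => Done,
    }
}
```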

@bittoby bittoby force-pushed the fix/remove-zero-alpha-entries branch 2 times, most recently from e7ec1f3 to 3fedc8b Compare March 25, 2026 12:15
const stakeAfterRemove = await getStake(api, hotkey.address, coldkey.address, netuid);

// Stake should be zero or very close to zero (owner minimum may be retained)
log(`Stake after full remove: ${stakeAfterRemove}`);
Contributor


If we can't check it in the code, maybe we don't need this test case. Since 0 is the default value, we can't tell the difference. It seems we need a lower-level API to check whether the key exists or not.

Contributor


In my opinion, it should be enough that you already verify this in the Rust unit tests; no need to add it in e2e.

@bittoby bittoby force-pushed the fix/remove-zero-alpha-entries branch from 02a8034 to 83a8392 Compare March 25, 2026 12:31
…block on_idle migration to purge existing ones
@bittoby bittoby force-pushed the fix/remove-zero-alpha-entries branch from 83a8392 to 2966490 Compare March 25, 2026 12:34
@bittoby
Author

bittoby commented Mar 25, 2026

@open-junius Thanks for your feedback. I removed the e2e test. Please review again.

@bittoby
Author

bittoby commented Mar 25, 2026

@open-junius Could you tell me why the assert-clean-merges check failed?

@open-junius
Contributor

@open-junius Could you tell me why the assert-clean-merges check failed?

You can ignore it. We need to merge main back into devnet-ready occasionally whenever there is a hotfix in the main branch.

@bittoby
Author

bittoby commented Mar 25, 2026

@evgeny-s @open-junius could you please review once more? I think this is almost done.

@bittoby bittoby requested review from evgeny-s and open-junius March 25, 2026 23:26
@bittoby
Author

bittoby commented Mar 26, 2026

These test failures don't seem to be related to my change. Please check the logs.
@open-junius

@open-junius
Contributor

These test failures don't seem to be related to my change. Please check the logs. @open-junius

I see, it's a common issue in the devnet-ready branch. I noticed you split the data cleanup into phases.
Could you follow the method recommended by @evgeny-s to measure the amount of data and time needed to run the migration?

@bittoby
Author

bittoby commented Mar 26, 2026

I have updated the migration logic:
it now uses remaining_weight to dynamically limit how many entries are processed per block, instead of a fixed batch size. This helps prevent block timeouts on large maps.
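A weight-budgeted cleanup step of the kind described could be sketched like this. The `Weight` wrapper and per-removal cost are made-up stand-ins for Substrate's benchmarked `Weight` values, and a `BTreeMap` stands in for the storage map.

```rust
use std::collections::BTreeMap;

// Stand-in for the remaining block weight budget -- hypothetical.
struct Weight(u64);

// Made-up cost of one storage removal, in the same units as `Weight`.
const COST_PER_REMOVAL: u64 = 10;

// Remove zero-valued entries until the per-block budget is exhausted,
// returning how many entries were removed this block.
fn cleanup_step(map: &mut BTreeMap<u64, u64>, remaining: &mut Weight) -> usize {
    let mut removed = 0;
    while remaining.0 >= COST_PER_REMOVAL {
        // Find the next zero-valued key, if any remain.
        let Some(key) = map.iter().find(|(_, v)| **v == 0).map(|(k, _)| *k) else {
            break;
        };
        map.remove(&key);
        remaining.0 -= COST_PER_REMOVAL;
        removed += 1;
    }
    removed
}
```

Calling this once per block from on_idle processes as many entries as the block's leftover weight allows, which is the behavior the comment above describes.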

@bittoby
Author

bittoby commented Mar 26, 2026

Tested on Finney mainnet clone via build-patched-spec:

  • Alpha zero entries: ~580,000
  • Time: ~14 min (70 blocks) to remove 99% of them

No OOM or block timeout issues.

@bittoby
Author

bittoby commented Mar 26, 2026

@open-junius @evgeny-s I'd like you to review once more ^^. It works well for me.

@open-junius
Contributor

I have updated the migration logic: it now uses remaining_weight to dynamically limit how many entries are processed per block, instead of a fixed batch size. This helps prevent block timeouts on large maps.

That's great, let me review it again.

@open-junius
Contributor

Tested on Finney mainnet clone via build-patched-spec:

  • Alpha zero entries: ~580,000
  • Time: ~14 min (70 blocks) to remove 99% of them

No OOM or block timeout issues.

Thanks for collecting the data, it is valuable.


/// Called from `on_runtime_upgrade`. Schedules the cleanup by setting phase = 1
/// if the migration hasn't run yet. This is O(1) — no iteration.
pub fn migrate_remove_zero_alpha<T: Config>() -> Weight {
Contributor


Since the actual cleanup is executed in on_idle, I am wondering if we really need the migration part.
For instance, we could set the ZeroAlphaCleanupPhase initial value to 1 in storage;
then on_idle would handle it until the value goes to 0.
Or use different values for different steps, to make it clearer.
What's your opinion?

Author


Good point. We could simplify by setting the default to 1, but the issue is: after cleanup completes and sets phase=0, a node restart or fresh sync would read the default (1) again and re-trigger cleanup. The current approach uses HasMigrationRun as a permanent guard to prevent this.

Author


Good point. We could simplify by setting the default to 1, but the issue is: after cleanup completes and sets phase=0, a node restart or fresh sync would read the default (1) again and re-trigger cleanup. The current approach uses HasMigrationRun as a permanent guard to prevent this.

@open-junius What do you think about it? Do I still need to update it?
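The trade-off discussed here (a resettable phase counter versus a permanent has-run flag) can be illustrated with a small sketch. The struct and field names are hypothetical; the actual code uses the HasMigrationRun storage item as the permanent guard.

```rust
// Hypothetical in-memory stand-in for the two pieces of on-chain state.
struct MigrationState {
    has_run: bool, // permanent guard, set once and never reset
    phase: u8,     // 0 = idle; nonzero = cleanup in progress
}

// Schedule the cleanup exactly once: the permanent flag prevents the
// cleanup from being re-triggered after the phase returns to 0.
fn schedule_cleanup(state: &mut MigrationState) -> bool {
    if state.has_run {
        return false;
    }
    state.has_run = true;
    state.phase = 1;
    true
}
```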

@bittoby bittoby requested a review from open-junius March 27, 2026 13:19

Labels

skip-cargo-audit This PR fails cargo audit but needs to be merged anyway


Development

Successfully merging this pull request may close these issues.

Alpha and hotkey storage maps accumulate persistent zero-valued entries

3 participants