HDDS-14926. Allow QUASI_CLOSED containers in DiskBalancer with improved debug logs for containers by Gargi-jais11 · Pull Request #10022 · apache/ozone

Gargi-jais11 · 2026-04-01T10:40:26Z

What changes were proposed in this pull request?

Currently DiskBalancer only supports CLOSED containers to be moved but if user wants to also move quasi closed containers then we should support that as well.

1st Improvement:
Below scenario can happen in real, so Disk balancer can only attempt those 10 CLOSED containers. Even if it moves all 10 successfully, disk A might drop only from 90% → 85% — nowhere near balance. The QUASI_CLOSED containers are the real bulk occupying the space and disk balancer is completely blind to them.

Disk A: 90% utilized 
├── 600 containers: QUASI_CLOSED 
└── 10 containers: CLOSED 
Disk B: 10% utilized 
Disk C: 11% utilized

Added a config hdds.datanode.disk.balancer.include.non.standard.containers default=false. If true, balancer include non-standard states, i.e, QUASI_CLOSED. So both CLOSED and QUASI_CLOSED state containers are eligible for move. If false (default), balancer only moves CLOSED containers.

2nd Improvement:
We need to add debug logs for chooseContainer method as user might not understand why if they have over and under utilised volume still container is not moved. This parts needs more clarification. Because I see in escalation it helped a lot with balancer debug logs for container not choose to identify what state the container or volume was in.
I suggest adding these logs for container not choose :

// UsedBytes less than 0
 LOG.debug("Skipping container {} from volume {}: bytes used is {}", containerId, src.getStorageDir().getPath(), containerData.getBytesUsed());
              
//  Skip containers already in progress
LOG.debug("Skipping container {} from volume {}: disk balancer move already in progress", containerId, src.getStorageDir().getPath());

// only closed and quasi closed containers should be moved
LOG.debug("Skipping container {} from volume {}: state is {}. Requires {}", containerId, src.getStorageDir().getPath(), containerData.getState(), phase.getEligibilityCriteria());

// skipping container move as its size is more than destination available space.
LOG.debug("Skipping container {} ({}B) from volume {}: exceeds destination {} usable space {}B, containerId, containerSize, src.getStorageDir().getPath(), dst.getStorageDir().getPath(), usableSpace);

// skipping container move as it will make dest more utilised after movement.
LOG.debug("Skipping container {} ({}B) from volume {}: moving to {} would  result in utilization {} exceeding upper threshold {}", containerId, containerSize, src.getStorageDir().getPath(), dst.getStorageDir().getPath(),

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14926

How was this patch tested?

Added unit test.

Gargi-jais11 · 2026-04-02T03:58:30Z

@ChenSammi Please review the PR.

errose28 · 2026-04-06T16:43:35Z

Added a config hdds.datanode.disk.balancer.include.non.standard.containers default=false. If true, balancer include non-standard states, i.e, QUASI_CLOSED. So both CLOSED and QUASI_CLOSED state containers are eligible for move. If false (default), balancer only moves CLOSED containers.

I think we should remove the config and have all containers that would otherwise be eligible for replication between datanodes (non-open containers) eligible for disk balancing. Since they can be safely copied between datanodes there is no reason to exclude them when copying between disks as well.

Gargi-jais11 · 2026-04-07T04:25:14Z

I think we should remove the config and have all containers that would otherwise be eligible for replication between datanodes (non-open containers) eligible for disk balancing. Since they can be safely copied between datanodes there is no reason to exclude them when copying between disks as well.

Thanks for the suggestion — the idea that anything safe to replicate between datanodes should be safe to move between disks on the same node is reasonable.
But the config was added hdds.datanode.disk.balancer.include.non-standard.containers (default false) to stay aligned with container balancer (in this PR): there, non-standard states such as QUASI_CLOSED are only considered when the corresponding include flag is set. Intially, I and @ChenSammi had the opinion of moving QUASI_CLOSED containers regardless of any config setting. But after discussion we decided to align with container Balancer.

errose28 · 2026-04-07T15:52:53Z

#9964 largely applies to over-replicated containers. It is required because the balancer will create more over-replication as it runs if deletes are not processed. The disk balancer does not create over-replication because a datanode can only load one container at a time, and furthermore the datanodes themselves are not aware of whether a container is over-replicated. The only overlap between the two is handling quasi-closed containers. For quasi-closed containers I don't see a reason for either process to exclude them, and I'm not sure #9964 made the right call here. I can't find any documented reason why quasi-closed containers should be excluded from either process, only jira descriptions saying it will be done with no supporting reasoning.

@siddhantsangwan @sarvekshayr @sadanand48 what do you think?

sadanand48 · 2026-04-09T07:31:51Z

I can't find any documented reason why quasi-closed containers should be excluded from either process.

Same here, Theoretically I think we could skip excluding quasi-closed containers however my guess based on some discussions is that we didn't want cases where same container is acted upon by both balancer and RM, Say balancer schedules a move for a quasi-closed container and RM sends a command to close the container at the same time.

While balancer moves the QC container, If we don't do the bcsId check it can increase the containers that have lower bcsId by the balancer selecting it and delete not processing in the system. Still should not be a major issue. I think we just wanted to play safe by only moving immutable container state. If we have tested these scenarios I believe we can include these in the default criteria

The origin of the check is in https://github.com/apache/ozone/pull/2441/changes#diff-b0ac1e8086a1da3247489a1a6ad742996052aa17a46544f1f6aaf24bfd92a762R76-R82

@siddhantsangwan please correct me if I am wrong

allow moving quasi closed container if default config is false

cb501b5

Gargi-jais11 changed the title ~~HDDS-14926. Allow QUASI_CLOSED containers in DiskBalancer with improved debug logging for containers~~ HDDS-14926. Allow QUASI_CLOSED containers in DiskBalancer with improved debug logs for containers Apr 1, 2026

Gargi-jais11 marked this pull request as ready for review April 2, 2026 03:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-14926. Allow QUASI_CLOSED containers in DiskBalancer with improved debug logs for containers#10022

HDDS-14926. Allow QUASI_CLOSED containers in DiskBalancer with improved debug logs for containers#10022
Gargi-jais11 wants to merge 1 commit intoapache:masterfrom
Gargi-jais11:HDDS-14926

Gargi-jais11 commented Apr 1, 2026

Uh oh!

Gargi-jais11 commented Apr 2, 2026

Uh oh!

errose28 commented Apr 6, 2026

Uh oh!

Gargi-jais11 commented Apr 7, 2026

Uh oh!

errose28 commented Apr 7, 2026 •

edited

Loading

Uh oh!

sadanand48 commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Gargi-jais11 commented Apr 1, 2026

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

Gargi-jais11 commented Apr 2, 2026

Uh oh!

errose28 commented Apr 6, 2026

Uh oh!

Gargi-jais11 commented Apr 7, 2026

Uh oh!

errose28 commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sadanand48 commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

errose28 commented Apr 7, 2026 •

edited

Loading