HDDS-14921. Improve space accounting in SCM with In-Flight container allocation tracking. #10000
ashishkumar50 wants to merge 11 commits into apache:master
Conversation
rakeshadr
left a comment
Thanks @ashishkumar50 for providing the patch. Added a few comments, please take care.
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
  ((ContainerManagerImpl) getContainerManager())
      .getPendingContainerTracker()
      .removePendingAllocation(dd, id);
Say the DN is healthy, all containers are confirmed, and there are no new allocations; then that DN's bucket never rolls, even though heartbeats come every 30 seconds, right?
t=0 Container C1 allocated → pending recorded in tracker
t=60-120 FCR arrives from DN
→ cid = C1
→ alreadyInDn = expectedContainersInDatanode.remove(C1) = FALSE
→ !alreadyInDn = TRUE → removePendingAllocation called → rollIfNeeded fires ✓
→ C1 added to NM DN-set
How about rolling on every processHeartbeat, i.e. every 30 seconds, regardless of container state changes?
Added a roll in every node report, which comes once per minute from the DN.
}
// Cleanup empty buckets to prevent memory leak
if (bucket.isEmpty()) {
Potentially hits a concurrency issue. Say two threads enter this block:
Thread-1 (removePendingAllocation): bucket.isEmpty() returns true.
Thread-2 (recordPendingAllocationForDatanode): computeIfAbsent(uuid) returns the same bucket reference (the key still exists) and calls bucket.add(containerID), so the bucket is now non-empty.
Thread-1: datanodeBuckets.remove(uuid, bucket) then removes the non-empty bucket, and the containerID ends up in a detached bucket object, right?
I think we need to add synchronization to avoid the detached bucket object.
Added sync at bucket level.
Please use a CHM compute-based mutation; that will simplify it:

datanodeBuckets.computeIfPresent(uuid, (k, bucket) -> {
  bucket.rollIfNeeded();
  removed.set(bucket.remove(containerID));
  remaining.set(bucket.getCount());
  LOG.debug("Removed pending container {} from DataNode {}. Removed={}, Remaining={}",
      containerID, node.getUuidString(), removed.get(), remaining.get());
  return bucket.isEmpty() ? null : bucket;
});
if (removed.get() && metrics != null) {
  metrics.incNumPendingContainersRemoved();
}

Important lock design principle: in PendingContainerTracker, please ensure there is no bucket -> map mutation path anymore. Switch remove/roll/clear to ConcurrentHashMap compute-based mutations, keeping the bucket internals intact.
}

@Test
public void testRemoveFromBothWindows() {
Do we have a test scenario covering roll-over?
The two-window rolling behavior: a container in previousWindow rolls off after 2x the interval. Say, add C1 to currentWindow, then C1 moves to previousWindow, then wait for the roll-over.
Added a test for this: testTwoWindowRollAgesOutContainerAfterTwoIntervals.
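For context, the two-window aging behavior under test can be sketched as a minimal standalone class. This is a hedged illustration with assumed fields and signatures (including an injected clock for testability), not the PR's actual TwoWindowBucket:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a two-window tumbling bucket: entries stay in currentWindow
// for up to one interval, move to previousWindow on the next roll, and
// are dropped on the roll after that, so an entry ages out after at
// most 2x the roll interval. The explicit nowMs parameter is an
// assumption made here so the roll can be driven deterministically.
final class TwoWindowBucket {
  private final long rollIntervalMs;
  private long lastRollMs;
  private Set<Long> currentWindow = new HashSet<>();
  private Set<Long> previousWindow = new HashSet<>();

  TwoWindowBucket(long rollIntervalMs, long nowMs) {
    this.rollIntervalMs = rollIntervalMs;
    this.lastRollMs = nowMs;
  }

  synchronized void rollIfNeeded(long nowMs) {
    if (nowMs - lastRollMs >= rollIntervalMs) {
      previousWindow = currentWindow;   // age out current entries
      currentWindow = new HashSet<>();  // start a fresh window
      lastRollMs = nowMs;
    }
  }

  synchronized boolean add(long containerId) {
    // Avoid double-counting an entry already aging in previousWindow.
    return !previousWindow.contains(containerId) && currentWindow.add(containerId);
  }

  synchronized boolean remove(long containerId) {
    boolean inCurrent = currentWindow.remove(containerId);
    boolean inPrevious = previousWindow.remove(containerId);
    return inCurrent || inPrevious;
  }

  synchronized int getCount() {
    return currentWindow.size() + previousWindow.size();
  }
}
```

With a 1-second interval, an entry added at t=0 is still counted after the first roll at t=1000 (it sits in previousWindow) and disappears after the second roll at t=2000, which is the scenario the new test covers.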
sumitagrawl
left a comment
@ashishkumar50 Thanks for working on this, I have a few review comments.
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/ScmConfig.java
 * @param pipeline The pipeline where container is allocated
 * @param containerID The container being allocated
 */
public void recordPendingAllocation(Pipeline pipeline, ContainerID containerID) {
This needs to be part of SCMNodeManager, more specifically SCMNodeStat. Reasons:
- need to handle events like the stale node / dead node handler for cleanup
- may need to report this to the CLI when reporting available space on the DN
- to be used by the pipeline allocation policy, where the container manager does not come into play
It is datanode space; we are just trying to identify already-allocated space. And it needs to be part of the committed space at SCM when reporting to the CLI, or any other breakup.
Moved to node package
...cm/src/main/java/org/apache/hadoop/hdds/scm/container/IncrementalContainerReportHandler.java
processContainerReplica(dd, container, replicaProto, publisher, detailsForLogging);

// Remove from pending tracker when container is added to DN
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
Please check whether the node report is also sent in the ICR; the reason is that node information should be updated at the same time as the ICR.
// (1*5GB) + (2*5GB) = 15GB → actually 3 containers
long totalCapacity = 0L;
long effectiveAllocatableSpace = 0L;
for (StorageReportProto report : storageReports) {
Instead of calculating all available space and then subtracting, we can do it progressively, like:
required = pending + newAllocation
for each report:
    required = required - volumeUsage (in round-off value)
    if required <= 0:
        return true
But we also need to reserve: we can add first and then check, and if there is not enough space, remove the containerId.
OR, the other way: when handling the DN storage report, the total consolidated value can also be kept in memory to avoid looping on every call.
Updated the logic to break out early when enough space is available on any volume.
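The progressive check suggested above could look roughly like this; all names here are illustrative, not the actual SCM API:

```java
// Hedged sketch: subtract each volume's free space from the required
// amount (pending reservations plus the new allocation) and return as
// soon as the requirement is covered, instead of summing every volume
// first and comparing at the end.
final class VolumeSpaceCheck {
  static boolean hasEnoughSpace(long pendingBytes, long newAllocationBytes,
      long[] volumeFreeBytes) {
    long required = pendingBytes + newAllocationBytes;
    for (long free : volumeFreeBytes) {
      required -= free;
      if (required <= 0) {
        return true;  // requirement covered; skip the remaining volumes
      }
    }
    return false;
  }
}
```

The early return avoids scanning every storage report on each allocation when the first few volumes already cover the requirement.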
...s/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/PendingContainerTracker.java
...s/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/PendingContainerTracker.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/ScmConfig.java
...ds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReportHandler.java
  return true;
} catch (Exception e) {
  LOG.warn("Error checking space for pipeline {}", pipeline.getId(), e);
  return true;
If we are not sure whether we can create a container here, should we still choose this pipeline? Instead of catching generically, we can specify what to do for each exception we might see.
Moved the code; there is no exception possible here now.
// Remove from pending tracker when container is added to DN
// This container was just confirmed for the first time on this DN
// No need to remove on subsequent reports (it's already been removed)
if (container != null && getContainerManager() instanceof ContainerManagerImpl) {
Why not just add this to the ContainerManager interface? We can avoid these conversions. Is this because Recon uses the same code path and we don't want it to do this? For Recon we can just make it a no-op.
Moved to the node package, so these conversions are no longer required.
...cm/src/main/java/org/apache/hadoop/hdds/scm/container/IncrementalContainerReportHandler.java
...p-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/MockPipelineManager.java
...s/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/PendingContainerTracker.java
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
long effectiveRemaining = effectiveAllocatableSpace - pendingAllocations;

// Check if there's enough space for a new container
if (effectiveRemaining < maxContainerSize) {
This makes the allocation a little aggressive, right? Even if we have just 5GB, we allocate it. Should we leave some buffer when allocating a container?
No need for an extra buffer here, as we already get a buffer on the DN via the soft and hard limits. So in case of some overflow, the DN will accept containers until the hard limit.
...s/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/PendingContainerTracker.java
...s/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/PendingContainerTracker.java
if (!pipelineManager.hasEnoughSpace(pipeline, maxContainerSize)) {
  LOG.debug("Cannot allocate a new container because pipeline {} does not have the required space {}.",
      pipeline, maxContainerSize);
  return null;
Since this PR adds stricter/defensive "two-window tumbling bucket" logic, there is a high chance of hitting the return-null code path. Please double-check all callers of the ContainerManagerImpl#allocateContainer() API and safeguard them with a null check, otherwise it would result in an NPE.
For example, SCMClientProtocolServer.java#L258 would hit an NPE.
Checked; the other call sites already have a safeguard. SCMClientProtocolServer.java#L258 is used only by tests and tools. Added a safeguard there.
 * Count of pending containers in both windows.
 */
synchronized int getCount() {
  return currentWindow.size() + previousWindow.size();
For safe coding, can you please use the union:

synchronized int getCount() {
  return getAllPending().size();
}
getAllPending creates two new sets, which may be costly; the current version looks cheaper. Sumit had commented above to avoid creating a new list. If we want it the other way, we would need to cache the size value, which I think is not needed here.
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
...c/test/java/org/apache/hadoop/hdds/scm/container/TestPendingContainerTrackerIntegration.java
...ds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestPendingContainerTracker.java
  return;
}
synchronized (bucket) {
  datanodeBuckets.remove(uuid, bucket);
Please use a CHM mutation; that will simplify it:

if (node == null) {
  return;
}
UUID uuid = node.getUuid();
if (datanodeBuckets.remove(uuid) != null) {
  LOG.debug("Cleared pending container allocations for datanode {}",
      node.getUuidString());
}

Important lock design principle: in PendingContainerTracker, please ensure there is no bucket -> map mutation path anymore. Switch remove/roll/clear to ConcurrentHashMap compute-based mutations, keeping the bucket internals intact.
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
}

UUID uuid = node.getUuid();
TwoWindowBucket bucket = datanodeBuckets.computeIfAbsent(
Since datanodeBuckets is a ConcurrentHashMap, can we simplify the logic instead of synchronizing on the bucket, and avoid the unnecessary null check?

boolean added = addContainerToBucket(node.getUuid(), containerID);
if (added && metrics != null) {
  metrics.incNumPendingContainersAdded();
}

private boolean addContainerToBucket(UUID uuid, ContainerID containerID) {
  AtomicBoolean added = new AtomicBoolean(false);
  datanodeBuckets.compute(uuid, (k, existing) -> {
    TwoWindowBucket bucket = (existing != null) ? existing : new TwoWindowBucket(rollIntervalMs);
    bucket.rollIfNeeded();
    added.set(bucket.add(containerID));
    LOG.debug("Recorded pending container {} on DataNode {}. Added={}, Total pending={}",
        containerID, uuid, added.get(), bucket.getCount());
    return bucket;
  });
  return added.get();
}

Important lock design principle: in PendingContainerTracker, please ensure there is no bucket -> map mutation path anymore. Switch remove/roll/clear to ConcurrentHashMap compute-based mutations, keeping the bucket internals intact.
if (node == null) {
  return;
}
UUID uuid = node.getUuid();
Please use a CHM mutation; that will simplify it:

UUID uuid = node.getUuid();
datanodeBuckets.computeIfPresent(uuid, (k, bucket) -> {
  bucket.rollIfNeeded();
  return bucket.isEmpty() ? null : bucket;
});

Important lock design principle: in PendingContainerTracker, please ensure there is no bucket -> map mutation path anymore. Switch remove/roll/clear to ConcurrentHashMap compute-based mutations, keeping the bucket internals intact.
}

UUID uuid = node.getUuid();
TwoWindowBucket bucket = datanodeBuckets.computeIfAbsent(
Important lock design principle: in PendingContainerTracker, please ensure there is no bucket -> map mutation path anymore. Switch remove/roll/clear to ConcurrentHashMap compute-based mutations, keeping the bucket internals intact.
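The compute-based mutation pattern the reviewer asks for can be sketched as a tiny standalone class. This is a hedged illustration with assumed names; the real PendingContainerTracker also rolls windows and updates metrics:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: every bucket-to-map transition (creation, removal of an empty
// bucket) happens inside ConcurrentHashMap.compute/computeIfPresent, so
// a thread can never observe a bucket after it has been detached from
// the map; returning null from the remapping function removes the entry
// atomically.
final class Tracker {
  private final ConcurrentHashMap<UUID, Set<Long>> buckets = new ConcurrentHashMap<>();

  boolean add(UUID dn, long containerId) {
    AtomicBoolean added = new AtomicBoolean(false);
    buckets.compute(dn, (k, bucket) -> {
      if (bucket == null) {
        bucket = new HashSet<>();  // create the bucket under the map's lock
      }
      added.set(bucket.add(containerId));
      return bucket;               // bucket stays mapped
    });
    return added.get();
  }

  boolean remove(UUID dn, long containerId) {
    AtomicBoolean removed = new AtomicBoolean(false);
    buckets.computeIfPresent(dn, (k, bucket) -> {
      removed.set(bucket.remove(containerId));
      return bucket.isEmpty() ? null : bucket;  // drop empty bucket atomically
    });
    return removed.get();
  }

  int getCount(UUID dn) {
    Set<Long> bucket = buckets.get(dn);
    return bucket == null ? 0 : bucket.size();
  }
}
```

Because the empty-bucket cleanup happens inside the same remapping call as the removal, the race described earlier (a concurrent add landing in a bucket that another thread then detaches) cannot occur.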
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManager.java
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
...p-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/PendingContainerTracker.java
private ContainerInfo allocateContainer(final Pipeline pipeline,
    final String owner)
    throws IOException {
  if (!pipelineManager.hasEnoughSpace(pipeline, maxContainerSize)) {
@rakeshadr @ashishkr200 pipelineManager.hasEnoughSpace is not an operation synchronized across datanodes; we only take a lock on the pipeline via synchronized (pipeline.getId()). So two different pipelines can still allocate containers on the same datanode.
We record the pending allocation only on Line 289. Until the code reaches that point, multiple pipelines can be allocated against the same space on the datanode.
Let's say we only have 6GB of space available on d1.
Pipeline 1: d1, d2, d3
Pipeline 2: d1, d4, d5
When we allocate a container on both pipelines, both will get past pipelineManager.hasEnoughSpace and be allocated a container: 5GB * 2 = 10GB, but we only have 6GB on d1.
@aswinshakil Some extra allocation may happen, but that is tolerated, since the DN has enough buffer to absorb these extra allocations. In my opinion, fully synchronizing here would be too costly.
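The "reserve first, then verify" idea floated earlier in the thread (record the pending space before the check, roll back on failure) could look roughly like this per-datanode sketch; the class and method names are illustrative, not the PR's API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: pending bytes are reserved atomically *before* the space
// check, so two concurrent allocations against the same datanode cannot
// both count the same free space; the loser rolls its reservation back.
final class SpaceReservation {
  private final AtomicLong pendingBytes = new AtomicLong();

  boolean tryReserve(long freeBytes, long containerSize) {
    long pendingAfter = pendingBytes.addAndGet(containerSize);  // reserve first
    if (freeBytes - pendingAfter < 0) {
      pendingBytes.addAndGet(-containerSize);  // verify failed: roll back
      return false;
    }
    return true;
  }

  long getPendingBytes() {
    return pendingBytes.get();
  }
}
```

In the 6GB example above, the first 5GB reservation succeeds and the second sees 10GB pending against 6GB free and fails, without taking a lock across datanodes.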
@rakeshadr @aswinshakil Fixed review comments.
szetszwo
left a comment
@ashishkumar50 , thanks for working on this!
This change is quite big and complicated. Let's split it into multiple subtasks. The first one could be adding the new PendingContainerTracker class and the related test.
(Sorry for reviewing this late.)
What changes were proposed in this pull request?
Maintain space accounting during container allocation in SCM. A more detailed description is in the Jira.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14921
How was this patch tested?
UT and IT.