
HBASE-30102 Add metric to account for region data classified as cold by the Time Based Priority logic#8128

Merged
wchevreuil merged 7 commits into apache:master from wchevreuil:HBASE-30102
Apr 30, 2026

Conversation

@wchevreuil
Contributor

No description provided.

…by the Time Based Priority logic

Change-Id: I5601a37300a3f5d10fe4886ba988f2d25e66b546
Change-Id: I8eb236beaf7976ccd02349aa81277ca84925e7e6
Contributor

@taklwu taklwu left a comment


When Time Based Priority is disabled, what would the % Cold Data be? Is it always shown?

getAllCacheKeysForFile(hFileInfo.getHFileContext().getHFileName(), 0, Long.MAX_VALUE);
int evictedBlocks = evictBlockSet(keySet);
if (evictedBlocks > 0) {
LOG.info("Evicted {} blocks for file {} as it is now considered cold by DataTieringManager",
Contributor


nit: should we have it at debug level? I'm wondering if we'll see a lot of these messages.

Contributor Author


Maybe, yeah. Although it would be triggered only upon enabling time based priority on the individual store, and once for each affected file, it could still flood the logs. Let me switch it to DEBUG.

Comment on lines +106 to +111
if (key.getCfName() != null) {
builder.setFamilyName(key.getCfName());
}
if (key.getRegionName() != null) {
builder.setRegionName(key.getRegionName());
}
Contributor

@taklwu taklwu Apr 27, 2026


question: why weren't the cf and region name filled in before? Is it because the cold data logic needs them for log messages or other computation?

Contributor Author


Yes. This is required not only by the new "coldDataRatio" metric we are adding, but also by the existing "regionCachedRatio" that is critical for the CacheAwareLoadBalancer. Without this change, we cannot calculate these metrics when recovering the persistent cache. IMO, it's a bug in the current CacheAwareLoadBalancer implementation.
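To make the point concrete: once the family and region names are persisted with each cache key, per-region cached sizes can be re-aggregated after a cache recovery. The sketch below is purely illustrative; `PersistedKey` and its fields are stand-ins for the actual BlockCacheKey protobuf, not HBase's real schema.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a persisted cache key; the field names are
// illustrative only, not HBase's actual BlockCacheKey protobuf fields.
record PersistedKey(String hfileName, String regionName, long blockSize) {}

public class RegionCachedRatio {
  // Re-aggregates bytes cached per region from recovered keys. Without the
  // region name persisted on each key, this aggregation is impossible.
  static Map<String, Long> cachedBytesPerRegion(List<PersistedKey> keys) {
    Map<String, Long> perRegion = new HashMap<>();
    for (PersistedKey key : keys) {
      if (key.regionName() != null) { // mirrors the null guard in the patch
        perRegion.merge(key.regionName(), key.blockSize(), Long::sum);
      }
    }
    return perRegion;
  }

  public static void main(String[] args) {
    List<PersistedKey> recovered = List.of(
      new PersistedKey("hfile-1", "region-a", 64L),
      new PersistedKey("hfile-2", "region-a", 32L),
      new PersistedKey("hfile-3", "region-b", 16L),
      new PersistedKey("hfile-4", null, 8L)); // legacy key without region name
    Map<String, Long> perRegion = cachedBytesPerRegion(recovered);
    System.out.println(perRegion.get("region-a")); // 96
    System.out.println(perRegion.get("region-b")); // 16
  }
}
```

Keys recovered without a region name (written before this change) simply drop out of the aggregation, which matches the null guard in the diff above.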

Change-Id: I392517f882e7c5a8c6063b16f525f6467956a3bb
@wchevreuil wchevreuil requested a review from taklwu April 28, 2026 15:45
Contributor

@taklwu taklwu left a comment


LGTM.

Something is wrong with the GitHub Actions; can you try re-triggering them?

@wchevreuil
Contributor Author

When Time Based Priority is disabled, what would the % Cold Data be? Is it always shown?

It would show as 0%. With "% Cold Data" and "% Cached" together, operators can infer whether there's indeed a problem with the region cache: since those are mutually exclusive, if both are low it means region caching has run into problems, most likely insufficient cache capacity.
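The operator-side inference described above can be sketched as follows. This is purely illustrative; the method, threshold, and return values are hypothetical, not HBase's metric API.

```java
// Hypothetical sketch of the operator-side inference: both ratios are
// fractions of the region's total data size, and cold data is never cached
// under time based priority, so cachedRatio + coldRatio <= 1.0.
public class CacheHealthCheck {
  static String diagnose(double coldRatio, double cachedRatio) {
    double lowWatermark = 0.2; // illustrative threshold, not an HBase default
    if (coldRatio < lowWatermark && cachedRatio < lowWatermark) {
      // Little data is cold, yet little is cached: the hot data is not
      // fitting in the cache, most likely insufficient cache capacity.
      return "UNDERSIZED_CACHE";
    }
    return "OK";
  }

  public static void main(String[] args) {
    System.out.println(diagnose(0.05, 0.10)); // UNDERSIZED_CACHE
    System.out.println(diagnose(0.70, 0.25)); // OK: most data is simply cold
  }
}
```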

for (HStoreFile file : newFiles) {
// call isHotData to account for the new file size in regionColdDataSize, if the new file is
// considered cold data as per data-tiering logic.
isHotData(file.getFileInfo().getHFileInfo(), file.getFileInfo().getConf());
Contributor


Can this cause a deadlock? This part is inside a regionColdDataSize.computeIfPresent block, and isHotData also runs regionColdDataSize.compute on the same ConcurrentMap.

Contributor Author


You mean if another thread is calling isHotData? I don't think it would; whichever thread reaches the regionColdDataSize atomic methods first would own the lock and block the other, right?

Contributor


I checked this part with Claude and it gave this answer:

No — this is a single-thread deadlock. That's what makes it especially insidious.

One thread does this:

  1. Enters computeIfPresent(regionName, lambda) — acquires the internal bin lock
  2. Inside the lambda, calls isHotData(...)
  3. isHotData calls compute(regionName, lambda2) — tries to acquire the same bin lock
  4. The lock is non-reentrant, so the same thread blocks waiting on itself

It's not a classic two-thread deadlock — it's a single thread trying to re-acquire a lock it already holds, on a lock that doesn't support reentrancy. The thread hangs forever.

This will happen every time a compaction produces a new cold file in a region that already has cold data tracked. That's a normal steady-state scenario, not a rare race condition.

Contributor Author


I don't think this is true. I've added a UT that simulates a compaction resulting in a new cold file (the scenario Claude mentioned), and it doesn't deadlock.

Contributor


Thanks for taking a look and adding a unit test for it!
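For reference, a pattern that sidesteps the reentrancy question entirely is to run the classification outside the map lambda and apply the size delta in a single, non-nested compute call. This is a hypothetical sketch only; the names mirror the discussion above, not the actual HBase code.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ColdDataAccounting {
  private final ConcurrentHashMap<String, Long> regionColdDataSize = new ConcurrentHashMap<>();

  // Stand-in for the data-tiering classification; the real logic depends on
  // file timestamps and the configured hot-data age, not file size.
  private boolean isHot(long fileSize) {
    return fileSize < 100; // illustrative rule only
  }

  // Safe pattern: classify the new files BEFORE entering the map lambda, so
  // no compute() call ever nests inside another lambda on the same map.
  public void accountNewFiles(String regionName, long[] newFileSizes) {
    long coldDelta = 0;
    for (long size : newFileSizes) {
      if (!isHot(size)) {
        coldDelta += size;
      }
    }
    final long delta = coldDelta;
    // Single, non-nested atomic update on the shared map.
    regionColdDataSize.compute(regionName, (k, v) -> (v == null ? 0L : v) + delta);
  }

  public long coldSize(String regionName) {
    return regionColdDataSize.getOrDefault(regionName, 0L);
  }

  public static void main(String[] args) {
    ColdDataAccounting acct = new ColdDataAccounting();
    acct.accountNewFiles("region-a", new long[] { 50, 200, 300 }); // 200+300 cold
    System.out.println(acct.coldSize("region-a")); // 500
  }
}
```

The ConcurrentHashMap javadoc does warn that the remapping function "must not attempt to update any other mappings of this map", so keeping all map mutation in one flat call is the conservative choice even when a particular nesting happens not to hang.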

// to evict it.
Set<BlockCacheKey> keySet =
getAllCacheKeysForFile(hFileInfo.getHFileContext().getHFileName(), 0, Long.MAX_VALUE);
int evictedBlocks = evictBlockSet(keySet);
Contributor


The method name shouldCacheFile suggests this is just a check, but it actually evicts the blocks.

Contributor Author


Yeah, not ideal, but I still think it should be BucketCache's responsibility to evict blocks from files that become classified as cold upon a change to the time based priority configuration. Moving this to DataTieringManager would couple it more tightly with BucketCache.

@wchevreuil
Contributor Author

LGTM.

Something is wrong with the GitHub Actions; can you try re-triggering them?

Sure. I've run the failing tests locally and they all passed. I'm triggering another round here.

wchevreuil and others added 4 commits April 29, 2026 12:35
Co-authored-by: Peter Somogyi <psomogyi@cloudera.com>
Co-authored-by: Peter Somogyi <psomogyi@cloudera.com>
Change-Id: I2194da9f2d1e596ae76a0fa244a521a698ff13f9
Change-Id: I171bd2169c18ca47795d239af60ca2414410d060
@wchevreuil wchevreuil merged commit 01ca956 into apache:master Apr 30, 2026
8 checks passed
wchevreuil added a commit that referenced this pull request Apr 30, 2026
…by the Time Based Priority logic (#8128)

Signed-off-by: Peter Somogyi <psomogyi@apache.com>
Signed-off-by: Tak Lon (Stephen) Wu <taklwu@apache.org>