Skip to content

Conversation

@ColinNV
Copy link
Contributor

@ColinNV ColinNV commented Dec 11, 2025

What?

Disable LRU for infinite size and consistent locking.

Why?

An improvement over the solution in #11049 that fixes a crash in NIXL.

How?

Disable LRU in rcache when max_size and max_regions are "infinity" and ensure consistent locking when it is enabled.

@iyastreb iyastreb requested review from brminich and yosefe December 12, 2025 07:19
iyastreb
iyastreb previously approved these changes Dec 12, 2025
Copy link
Contributor

@iyastreb iyastreb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Would be nice to add a gtest, in this or separate PR

return ucs_ilog2(ucs_rcache_stat_max_pow2() / UCS_RCACHE_STAT_MIN_POW2) + 2;
}

static int ucs_rcache_lru_enable(const ucs_rcache_params_t *params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo no need for a separate func for this
or pls rename as

Suggested change
static int ucs_rcache_lru_enable(const ucs_rcache_params_t *params)
static int ucs_rcache_lru_enabled(const ucs_rcache_params_t *params)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

Comment on lines +1325 to +1331
if (self->lru.enabled) {
ucs_list_head_init(&self->lru.list);
ucs_spinlock_init(&self->lru.lock, 0);
} else {
self->lru.list.prev = (void*)0xdead0042;
self->lru.list.next = (void*)0xdead0043;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we could initialized the list regardless, or add function like ucs_list_head_invalidate() that sets head,tail to NULL


region->refcount++;
ucs_rcache_region_lru_remove(rcache, region);
ucs_rcache_region_lru_get(rcache, region);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function will take a lock which adds overhead, while the "unsafe" functions should not take lock
why the LRU lock needs to be taken, even when rcache lock is not needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. As Artemy suggested, we completely disable LRU when max_regions/max_size is not configured (and it's INF by default). So for default case that should actually reduce the overhead
  2. Here is the workflow for gdr_copy:
uct_gdr_copy_mem_rcache_reg
  ucs_rcache_lookup
    ucs_rw_spinlock_read_lock(&rcache->pgt_lock);
      ucs_rcache_lookup_unsafe
        ucs_rcache_region_lru_remove(rcache, region); << Modify LRU without acquiring LRU lock
    ucs_rw_spinlock_read_unlock(&rcache->pgt_lock);
 
uct_gdr_copy_mem_rcache_dereg
  ucs_rcache_region_put
    ucs_spin_lock(&rcache->lru.lock);
      ucs_rcache_region_lru_add(rcache, region); << This is properly locked
    ucs_spin_unlock(&rcache->lru.lock);

The problem is that lookup updates the LRU without taking a lock.
For regular rcache that works, because all operations are implicitly protected by context mutex.
However gdr_copy is a global object, accessed from multiple contexts -> we can't rely on a context lock.
3) We discussed 2 ways of fixing it:

  • add proper LRU lock (was done here by Colin)
  • add mutex to gdr_copy_md - it will be sequential, not good..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems gdrcopy already takes uct_gdr_copy_context.lock mutex? So it should be able to use unsafe functions and not need to take LRU lock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory corruption happens with rcache ops:

static uct_md_ops_t uct_gdr_copy_md_rcache_ops = {
    .mem_reg            = uct_gdr_copy_mem_rcache_reg,
    .mem_dereg          = uct_gdr_copy_mem_rcache_dereg,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so lookup-unsafe should update LRU without lock, and ucs_rcache_lookup should update LRU with lock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see, this is the way to go

ucs_rcache_region_put_unsafe(ucs_rcache_t *rcache, ucs_rcache_region_t *region)
{
ucs_rcache_region_lru_add(rcache, region);
ucs_rcache_region_lru_put(rcache, region);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function will take a lock which adds overhead, while the "unsafe" functions should not take lock

Copy link
Contributor

@iyastreb iyastreb Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why do we have a separate LRU lock, if the entire rcache state is protected by rcache lock?

I'm not a designer of rcache, but it seems that rcache lock protects rcache state without LRU.
And a separate LRU lock is needed to protect only LRU state, because we have a flow when LRU is updated during lookup, when rcache remains immutable. Due to this flow, we need to take LRU lock in all the places, no matter if it's _unsafe or not

Copy link
Contributor

@yosefe yosefe Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for separate lock is that if we used the regular rcache mutex, we would need to take write-mode lock during rcache_get , instead of just read-mode lock - and wanted to avoid that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see
In principle for this particular corruption in gdr_copy, the function ucs_rcache_region_put_unsafe is not used.
And in other places it's already implicitly protected by a context mutex.
So change in ucs_rcache_region_put_unsafe can be reverted.

On the other hand:

  • now we completely disable LRU for default use case (when max_size/regions are not set), so that we don't do any operations apart from a single if branch
  • IMO it's cleaner from design POV: since we take this lock in lookup_unsafe, we also need to take it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok we don't take LRU lock if not needed. However, we should also not take LRU lock when the user is calling "unsafe" functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not the case here. Even if we stick to "safe" api, the progress may call "unsafe" api without proper lock -> memory corruption

Which "progress" do you mean? can you write the call chain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking many flows I didn't find a problem.
I meant this flow when saying "progress":

#5  0x00007fb9bc1abf29 in uct_gdr_copy_mem_rcache_dereg 
#6  0x00007fb9c98dfa59 in uct_md_mem_dereg              
#7  0x00007fb9c309520e in ucp_mtype_pack_dereg   <<<<<
#8  0x00007fb9c309628c in ucp_mem_type_unreg_buffer
#9  0x00007fb9c30c64b2 in ucp_mem_type_unpack_inner       
#10 ucp_mem_type_unpack                         
#11 0x00007fb9c312dbf7 in ucp_dt_contig_unpack
#12 ucp_put_handler_inner
#13 ucp_put_handler 
#14 0x00007fb9c98fa2f0 in uct_iface_invoke_am
#15 0x00007fb9c9900296 in uct_tcp_ep_comp_recv_am 
#16 0x00007fb9c9900a06 in uct_tcp_ep_progress_am_rx 
#17 0x00007fb9c9900f22 in uct_tcp_ep_progress_data_rx
#23 ucp_worker_progress

I was deluded by ucp_mtype_pack_dereg impl, I thought it calls ucp_memh_put_rcache on gdr_copy_md rcache, but in fact it calls it on context->rcache.

static void ucp_mtype_pack_dereg(ucp_context_h context,
                                 const ucp_mtype_pack_context_t *pack_context)
{
    if (pack_context->ucp_memh == NULL) {
        uct_md_mem_dereg(context->tl_mds[pack_context->md_index].md,
                         pack_context->uct_memh);
    } else {
        ucp_memh_put_rcache(context, pack_context->ucp_memh);  << It's a different rcache
    }
}

So let's conclude: gdr_copy rcache is not accessed from outside => only safe API is used from gdr_copy => lock in unsafe is not needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's conclude: gdr_copy rcache is not accessed from outside => only safe API is used from gdr_copy => lock in unsafe is not needed

So WDYT we should do in this PR?

Copy link
Contributor

@iyastreb iyastreb Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of new LRU.enabled field we might have enum that defines the usage mode:

typedef struct {
    LRU_DISABLED, // LRU is completely disabled
    LRU_NO_LOCK, // LRU is enabled, no locking because only unsafe API is used
    LRU_LOCK // LRU is enabled, lock is operational, because only safe API is used
}

That should work, because we must never mix unsafe with safe APIs
This way we optimize away LRU overhead when it's not needed.
And also optimize away locking when not needed
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use func pointers to minimize branching:
LRU_DISABLED: lru_get, lru_put points to no-op
LRU_NO_LOCK: points to lru_add, lru_remove
LRU_LOCK: points to locked impl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants