-
Notifications
You must be signed in to change notification settings - Fork 511
UCS/RCACHE: Improve LRU handling. #11054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
iyastreb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Would be nice to add a gtest, in this or separate PR
src/ucs/memory/rcache.c
Outdated
| return ucs_ilog2(ucs_rcache_stat_max_pow2() / UCS_RCACHE_STAT_MIN_POW2) + 2; | ||
| } | ||
|
|
||
| static int ucs_rcache_lru_enable(const ucs_rcache_params_t *params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo no need for a separate func for this
or pls rename as
| static int ucs_rcache_lru_enable(const ucs_rcache_params_t *params) | |
| static int ucs_rcache_lru_enabled(const ucs_rcache_params_t *params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
| if (self->lru.enabled) { | ||
| ucs_list_head_init(&self->lru.list); | ||
| ucs_spinlock_init(&self->lru.lock, 0); | ||
| } else { | ||
| self->lru.list.prev = (void*)0xdead0042; | ||
| self->lru.list.next = (void*)0xdead0043; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we could initialized the list regardless, or add function like ucs_list_head_invalidate() that sets head,tail to NULL
|
|
||
| region->refcount++; | ||
| ucs_rcache_region_lru_remove(rcache, region); | ||
| ucs_rcache_region_lru_get(rcache, region); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function will take a lock which adds overhead, while the "unsafe" functions should not take lock
why the LRU lock needs to be taken, even when rcache lock is not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- As Artemy suggested, we completely disable LRU when max_regions/max_size is not configured (and it's INF by default). So for default case that should actually reduce the overhead
- Here is the workflow for gdr_copy:
uct_gdr_copy_mem_rcache_reg
ucs_rcache_lookup
ucs_rw_spinlock_read_lock(&rcache->pgt_lock);
ucs_rcache_lookup_unsafe
ucs_rcache_region_lru_remove(rcache, region); << Modify LRU without acquiring LRU lock
ucs_rw_spinlock_read_unlock(&rcache->pgt_lock);
uct_gdr_copy_mem_rcache_dereg
ucs_rcache_region_put
ucs_spin_lock(&rcache->lru.lock);
ucs_rcache_region_lru_add(rcache, region); << This is properly locked
ucs_spin_unlock(&rcache->lru.lock);
The problem is that lookup updates the LRU without taking a lock.
For regular rcache that works, because all operations are implicitly protected by context mutex.
However gdr_copy is a global object, accessed from multiple contexts -> we can't rely on a context lock.
3) We discussed 2 ways of fixing it:
- add proper LRU lock (was done here by Colin)
- add mutex to gdr_copy_md - it will be sequential, not good..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems gdrcopy already takes uct_gdr_copy_context.lock mutex? So it should be able to use unsafe functions and not need to take LRU lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Memory corruption happens with rcache ops:
static uct_md_ops_t uct_gdr_copy_md_rcache_ops = {
.mem_reg = uct_gdr_copy_mem_rcache_reg,
.mem_dereg = uct_gdr_copy_mem_rcache_dereg,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so lookup-unsafe should update LRU without lock, and ucs_rcache_lookup should update LRU with lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see, this is the way to go
| ucs_rcache_region_put_unsafe(ucs_rcache_t *rcache, ucs_rcache_region_t *region) | ||
| { | ||
| ucs_rcache_region_lru_add(rcache, region); | ||
| ucs_rcache_region_lru_put(rcache, region); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function will take a lock which adds overhead, while the "unsafe" functions should not take lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why do we have a separate LRU lock, if the entire rcache state is protected by rcache lock?
I'm not a designer of rcache, but it seems that rcache lock protects rcache state without LRU.
And a separate LRU lock is needed to protect only LRU state, because we have a flow when LRU is updated during lookup, when rcache remains immutable. Due to this flow, we need to take LRU lock in all the places, no matter if it's _unsafe or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for separate lock is that if we used the regular rcache mutex, we would need to take write-mode lock during rcache_get , instead of just read-mode lock - and wanted to avoid that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see
In principle for this particular corruption in gdr_copy, the function ucs_rcache_region_put_unsafe is not used.
And in other places it's already implicitly protected by a context mutex.
So change in ucs_rcache_region_put_unsafe can be reverted.
On the other hand:
- now we completely disable LRU for default use case (when max_size/regions are not set), so that we don't do any operations apart from a single if branch
- IMO it's cleaner from design POV: since we take this lock in lookup_unsafe, we also need to take it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's ok we don't take LRU lock if not needed. However, we should also not take LRU lock when the user is calling "unsafe" functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's not the case here. Even if we stick to "safe" api, the progress may call "unsafe" api without proper lock -> memory corruption
Which "progress" do you mean? can you write the call chain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After checking many flows I didn't find a problem.
I meant this flow when saying "progress":
#5 0x00007fb9bc1abf29 in uct_gdr_copy_mem_rcache_dereg
#6 0x00007fb9c98dfa59 in uct_md_mem_dereg
#7 0x00007fb9c309520e in ucp_mtype_pack_dereg <<<<<
#8 0x00007fb9c309628c in ucp_mem_type_unreg_buffer
#9 0x00007fb9c30c64b2 in ucp_mem_type_unpack_inner
#10 ucp_mem_type_unpack
#11 0x00007fb9c312dbf7 in ucp_dt_contig_unpack
#12 ucp_put_handler_inner
#13 ucp_put_handler
#14 0x00007fb9c98fa2f0 in uct_iface_invoke_am
#15 0x00007fb9c9900296 in uct_tcp_ep_comp_recv_am
#16 0x00007fb9c9900a06 in uct_tcp_ep_progress_am_rx
#17 0x00007fb9c9900f22 in uct_tcp_ep_progress_data_rx
#23 ucp_worker_progress
I was deluded by ucp_mtype_pack_dereg impl, I thought it calls ucp_memh_put_rcache on gdr_copy_md rcache, but in fact it calls it on context->rcache.
static void ucp_mtype_pack_dereg(ucp_context_h context,
const ucp_mtype_pack_context_t *pack_context)
{
if (pack_context->ucp_memh == NULL) {
uct_md_mem_dereg(context->tl_mds[pack_context->md_index].md,
pack_context->uct_memh);
} else {
ucp_memh_put_rcache(context, pack_context->ucp_memh); << It's a different rcache
}
}
So let's conclude: gdr_copy rcache is not accessed from outside => only safe API is used from gdr_copy => lock in unsafe is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let's conclude: gdr_copy rcache is not accessed from outside => only safe API is used from gdr_copy => lock in unsafe is not needed
So WDYT we should do in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of new LRU.enabled field we might have enum that defines the usage mode:
typedef struct {
LRU_DISABLED, // LRU is completely disabled
LRU_NO_LOCK, // LRU is enabled, no locking because only unsafe API is used
LRU_LOCK // LRU is enabled, lock is operational, because only safe API is used
}
That should work, because we must never mix unsafe with safe APIs
This way we optimize away LRU overhead when it's not needed.
And also optimize away locking when not needed
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can use func pointers to minimize branching:
LRU_DISABLED: lru_get, lru_put points to no-op
LRU_NO_LOCK: points to lru_add, lru_remove
LRU_LOCK: points to locked impl

What?
Disable LRU for infinite size and consistent locking.
Why?
An improvement over the solution in #11049 that fixes a crash in NIXL.
How?
Disable LRU in rcache when
max_sizeandmax_regionsare "infinity" and ensure consistent locking when it is enabled.