
Add android retry #131

Merged
larkox merged 5 commits into master from addAndroidRetry on Feb 4, 2025

Conversation

@larkox
Contributor

@larkox larkox commented Dec 17, 2024

Summary

Add android retry to push proxy

Ticket Link

Fix https://mattermost.atlassian.net/browse/MM-56827

@larkox larkox added the 1: Dev Review (Requires review by a core committer) label Dec 17, 2024
@larkox larkox requested review from a team and agnivade December 17, 2024 13:25
@lieut-data lieut-data removed the request for review from a team December 17, 2024 16:28
@agnivade
Contributor

Hi @larkox - I've not been feeling well for the past few days. I hope you are not blocked on this review?

Contributor

@agnivade agnivade left a comment

Also wondering if there is a way to test this by setting up a mock HTTP server or something? I was looking at https://github.com/firebase/firebase-admin-go/blob/v3.13.0/internal/http_client_test.go but they have access to the client's internal fields so it doesn't help us.
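
For reference, a minimal sketch (using net/http/httptest and sync/atomic) of the kind of mock server this refers to; it assumes the FCM endpoint could be pointed at the test server, which is exactly what firebase-admin-go does not expose without access to its internal fields:

var calls int32
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	// First request: simulate FCM rate limiting; subsequent requests succeed,
	// so a correct retry loop should eventually see a 200.
	if atomic.AddInt32(&calls, 1) == 1 {
		w.Header().Set("Retry-After", "1")
		w.WriteHeader(http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusOK)
}))
defer srv.Close()
// srv.URL would have to replace the real FCM endpoint for this to be usable.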

defer cancelRetryContext()
_, err = me.client.Send(retryContext, fcmMsg)
if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
Contributor

Wrong platform key.
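
i.e. something like the following, assuming the Android counterpart of the constant is named PushNotifyAndroid:

if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
}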

Comment on lines 249 to 255
if err == nil {
	break
}

if !isRetryable(err) {
	break
}
Contributor

Can we combine these in an OR?
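
For example, something along the lines of:

if err == nil || !isRetryable(err) {
	break
}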

Comment on lines 259 to 262
if retries == MAX_RETRIES-1 {
	me.logger.Errorf("Max retries reached did=%v", fcmMsg.Token)
	break
}
Contributor

Doesn't the for loop take care of this? Why do we need an extra check? And if we do need it, can we remove the condition from the for loop?

Contributor Author

On the last iteration, we don't want to wait; that is why the early break.

I would keep the for condition just to be "intention revealing", but completely 0/5 on that.
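
For context, a rough sketch of the loop shape assembled from the snippets in this review (not the exact diff), showing why the inner check exists:

for retries := 0; retries < MAX_RETRIES; retries++ {
	// ... attempt the send; break on success or on a non-retryable error ...

	// Early break on the last attempt: the for condition would also stop the
	// loop, but only after waiting one more waitTime below.
	if retries == MAX_RETRIES-1 {
		me.logger.Errorf("Max retries reached did=%v", fcmMsg.Token)
		break
	}

	select {
	case <-generalContext.Done():
	case <-time.After(waitTime):
	}
}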


select {
case <-generalContext.Done():
case <-time.After(waitTime):
Contributor

I think we need to incorporate some of this logic as well: https://github.com/firebase/firebase-admin-go/blob/54b81142d35e1b98cfd8687a9e56712c78a0f4ec/internal/http_client.go#L427-L438. We need to get the value of the Retry-After header and set that to be the minimum wait time to retry. Otherwise we will get a rate limited error again.
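
Roughly what the linked firebase-admin-go code does, as a sketch (seconds form of Retry-After only; the HTTP-date form is ignored here). It assumes the underlying *http.Response could be extracted from the SDK error, which is the spelunking concern raised below:

func retryAfterWait(resp *http.Response, fallback time.Duration) time.Duration {
	// Use the server-provided Retry-After (in seconds) as a minimum wait,
	// falling back to our own backoff when the header is absent or malformed.
	if resp == nil {
		return fallback
	}
	if secs, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil {
		if d := time.Duration(secs) * time.Second; d > fallback {
			return d
		}
	}
	return fallback
}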

Contributor Author

Doing this would require even more spelunking than what we already did to extract the error. I feel the effort to dig out this value is not worth it just to avoid the risk of retrying before the expected time. We expect this not to happen often, and we hope the delay we are adding is enough.

But if you think otherwise, I can try to look deeper into it.

@larkox larkox requested a review from agnivade January 21, 2025 14:08
@larkox larkox requested review from a team and lieut-data and removed request for a team January 22, 2025 09:27
Member

@lieut-data lieut-data left a comment

Only two nits to add, and entirely non-blocking. Deferring to Agniva's existing comments on whether more is required or not.

start := time.Now()

retryContext, cancelRetryContext := context.WithTimeout(generalContext, me.retryTimeout)
defer cancelRetryContext()
Member

nit: these cancellations queue up to run when SendNotificationWithRetry returns. I don't think that causes any issues, but a defer in a for loop always gives me pause in case the semantics aren't understood by a future reader. Is it worth considering calling this explicitly at the end of the loop, or wrapping it in a closure?
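
e.g. a sketch of the closure variant, so each cancel runs at the end of its own iteration rather than piling up until the function returns:

err = func() error {
	retryContext, cancelRetryContext := context.WithTimeout(generalContext, me.retryTimeout)
	defer cancelRetryContext() // runs when this closure returns, i.e. once per iteration

	_, sendErr := me.client.Send(retryContext, fcmMsg)
	return sendErr
}()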

Contributor Author

I am 0/5 on this. I feel adding extra boilerplate will just cause more cognitive overload, instead of clarifying things.

You are right that having these defers here is a bit counterintuitive (they are not called at the end of the loop, but when the function finishes), but we would need to cancel them on every break, which is also quite boilerplate-heavy. I feel it is OK like this, but I would be happy to hear other options.

Comment on lines 278 to 286
if generalContext.Err() != nil {
	logger.Info(
		"Not retrying because context error",
		mlog.Int("retry", retries),
		mlog.Err(generalContext.Err()),
	)
	err = generalContext.Err()
	break
}
Member

nit: Might we inline this within the case <-generalContext.Done():? When I first read it, I was thinking about the timer expiring vs. generalContext being done. Err just returns nil if the context isn't done, so this code isn't wrong, but a more idiomatic pattern seems to be:

select {
    case <-ctx.Done():
        return ctx.Err()
    case ...:
}

Contributor Author

I added the change, but I am 0/5 about the result. There is something that feels off about returning from a select inside a loop. But maybe it's just that it's been a long time since I've written proper Go :P

@larkox
Contributor Author

larkox commented Feb 4, 2025

@agnivade @lieut-data I added one extra change to how we decide whether an error should be retried: context errors are now also treated as retryable (we want to retry if the request is taking too long).

I will ask for a re-review just in case.

@larkox larkox requested review from agnivade and lieut-data February 4, 2025 15:52
Comment on lines +296 to +298
if errors.Is(err, context.DeadlineExceeded) {
	return true
}
Member

If it does start happening more often than expected, do we have a way of tracking this?

Contributor Author

Apart from the bug reports about notifications arriving more than once, on line 263 we are logging all errors that imply a retry.

@larkox larkox added the 2: Reviews Complete (All reviewers have approved the pull request) label and removed the 1: Dev Review (Requires review by a core committer) label Feb 4, 2025
@larkox larkox merged commit 620c92f into master Feb 4, 2025
5 checks passed
@larkox larkox deleted the addAndroidRetry branch February 4, 2025 16:49