Conversation
Hi @larkox - I've not been feeling well for the past few days. I hope you are not blocked on this review?
agnivade left a comment
Also wondering if there is a way to test this by setting up a mock HTTP server or something? I was looking at https://github.com/firebase/firebase-admin-go/blob/v3.13.0/internal/http_client_test.go but they have access to the client's internal fields so it doesn't help us.
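For what it's worth, a minimal sketch of the httptest-based approach, assuming the FCM endpoint could be redirected at the test server (which, as noted, the firebase-admin-go client does not readily expose). The package name, newTestMessageEngine, and the commented assertions are hypothetical placeholders, not code from this PR:

```go
package server

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestAndroidSendWithRetry_MockFCM(t *testing.T) {
	attempts := 0
	// Mock FCM: fail the first two attempts with 503 so the retry path runs.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		attempts++
		if attempts < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// Hypothetical: a constructor that points the FCM client at srv.URL.
	// me := newTestMessageEngine(t, srv.URL)
	// err := me.SendNotificationWithRetry(context.Background(), msg)
	// require.NoError(t, err)
	// require.Equal(t, 3, attempts)
}
```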
defer cancelRetryContext()
_, err = me.client.Send(retryContext, fcmMsg)
if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
}
if err == nil {
	break
}

if !isRetryable(err) {
	break
}
Can we combine these in an OR?
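If combined, the two early exits might read roughly like this; a runnable sketch with stand-ins for client.Send and isRetryable (names and behavior assumed, not taken from the PR):

```go
package main

import (
	"errors"
	"fmt"
)

var errTransient = errors.New("transient")

// isRetryable is a stand-in for the helper discussed in this PR.
func isRetryable(err error) bool { return errors.Is(err, errTransient) }

// send is a stand-in for me.client.Send; it succeeds on the third attempt.
func send(attempt int) error {
	if attempt < 2 {
		return errTransient
	}
	return nil
}

func main() {
	const maxRetries = 3
	var err error
	for retries := 0; retries < maxRetries; retries++ {
		err = send(retries)
		// Success and a non-retryable failure both end the loop, combined in one OR.
		if err == nil || !isRetryable(err) {
			break
		}
	}
	fmt.Println("final err:", err) // <nil>
}
```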
if retries == MAX_RETRIES-1 {
	me.logger.Errorf("Max retries reached did=%v", fcmMsg.Token)
	break
}
Doesn't the for loop take care of this? Why do we need an extra check? And if we do need it, can we remove the condition from the for loop?
At the last iteration we don't want to wait; that is why the early break.
I would keep the for condition just to be "intention revealing", but completely 0/5 on that.
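To make the loop shape concrete, a small runnable sketch of the pattern being discussed (the sleep stands in for the select on generalContext.Done()/time.After; names and constants are assumed):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	const maxRetries = 3
	send := func() error { return errors.New("unavailable") } // stand-in for client.Send

	var err error
	for retries := 0; retries < maxRetries; retries++ {
		err = send()
		if err == nil {
			break
		}
		// Without this check, the final failed attempt would still sleep
		// before the for condition ends the loop.
		if retries == maxRetries-1 {
			break
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the ctx.Done()/time.After select
	}
	fmt.Println("final err:", err)
}
```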
select {
case <-generalContext.Done():
case <-time.After(waitTime):
I think we need to incorporate some of this logic as well: https://github.com/firebase/firebase-admin-go/blob/54b81142d35e1b98cfd8687a9e56712c78a0f4ec/internal/http_client.go#L427-L438. We need to get the value of the Retry-After header and set that to be the minimum wait time to retry. Otherwise we will get a rate limited error again.
Doing this will require even more spelunking than what we did to get the error out. I feel the effort of extracting this value is not worth it compared to the risk of retrying before the expected time. We expect this not to happen often, and we hope the delay we are adding is enough.
But if you think otherwise, I can try to look deeper into it.
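For reference only (the decision above was to skip this), honoring Retry-After when it is available could look roughly like the following sketch; minRetryDelay is a hypothetical helper and only the seconds form of the header is handled:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// minRetryDelay returns the larger of our own backoff and the server's
// Retry-After hint, so we never retry before the server asked us to.
func minRetryDelay(h http.Header, backoff time.Duration) time.Duration {
	if secs, err := strconv.Atoi(h.Get("Retry-After")); err == nil {
		if d := time.Duration(secs) * time.Second; d > backoff {
			return d
		}
	}
	return backoff
}

func main() {
	h := http.Header{}
	h.Set("Retry-After", "30")
	fmt.Println(minRetryDelay(h, 10*time.Second)) // 30s
}
```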
lieut-data left a comment
Only two nits to add, and entirely non-blocking. Deferring to Agniva's existing comments on whether more is required or not.
start := time.Now()

retryContext, cancelRetryContext := context.WithTimeout(generalContext, me.retryTimeout)
defer cancelRetryContext()
nit: these cancellations queue up to run when SendNotificationWithRetry returns. I don't think that causes any issues, but a defer in a for loop always gives me pause in case the semantics aren't understood by a future reader. Is it worth considering calling this explicitly at the end of the loop, or wrapping it in a closure?
I am 0/5 on this. I feel adding extra boilerplate will just cause more cognitive overload instead of clarifying things.
You are right that having these defers here is a bit counterintuitive (they are not called at the end of the loop, but when the function finishes), but we would need to cancel them on every break, which is also quite boilerplate-heavy. I feel it is OK like this, but I would be happy to hear other options.
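For comparison, the closure variant mentioned above might look roughly like this (a sketch; the surrounding names are assumed from the excerpt, and the fake send always fails):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func main() {
	generalContext, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	send := func(ctx context.Context) error { return errors.New("transient") } // stand-in for client.Send

	var err error
	for retries := 0; retries < 3; retries++ {
		// Wrapping the attempt in a closure means the deferred cancel runs as
		// soon as the attempt finishes, not when the outer function returns.
		err = func() error {
			retryContext, cancelRetryContext := context.WithTimeout(generalContext, 100*time.Millisecond)
			defer cancelRetryContext()
			return send(retryContext)
		}()
		if err == nil {
			break
		}
	}
	fmt.Println("final err:", err)
}
```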
if generalContext.Err() != nil {
	logger.Info(
		"Not retrying because context error",
		mlog.Int("retry", retries),
		mlog.Err(generalContext.Err()),
	)
	err = generalContext.Err()
	break
}
nit: Might we inline this within the case <-generalContext.Done():? When I first read it, I was thinking about the timer expiring vs. generalContext being done. Err just returns nil if it's not, so this code isn't wrong, but a more idiomatic pattern seems to be:
select {
case <-ctx.Done():
	return ctx.Err()
case ...:
}
I added the change, but I am 0/5 about the result. There is something that feels off about returning from a select inside a loop. But maybe it's just that it's been a long time since I've written proper Go :P
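One way to keep that return out of the loop body is to pull the select into a small helper; a runnable sketch (the helper name is hypothetical):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitOrAbort blocks for waitTime unless ctx is cancelled first, in which
// case it returns the context's error so the caller can stop retrying.
func waitOrAbort(ctx context.Context, waitTime time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(waitTime):
		return nil
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fmt.Println(waitOrAbort(ctx, time.Second)) // context deadline exceeded
}
```

Either form is idiomatic; the helper just makes the two exit paths explicit.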
@agnivade @lieut-data I added one extra change to decide whether the error should be retried: a context error is now also considered retryable (we want to retry if the request is taking too long). I will ask for a re-review just in case.
if errors.Is(err, context.DeadlineExceeded) {
	return true
}
If it does start happening more often than expected, do we have a way of tracking this?
Apart from the bug reports about notifications arriving more than once, on line 263 we are logging all errors that imply a retry.
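For context, the shape of the check being discussed, with the deadline case included; the other cases are placeholders, not taken from the PR:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// isRetryable reports whether a failed send should be attempted again.
// A DeadlineExceeded error covers requests that simply took too long.
func isRetryable(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	// ...other retryable cases (e.g. transient FCM errors) would go here...
	return false
}

func main() {
	fmt.Println(isRetryable(context.DeadlineExceeded)) // true
	fmt.Println(isRetryable(errors.New("bad token")))  // false
}
```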
Summary
Add android retry to push proxy
Ticket Link
Fix https://mattermost.atlassian.net/browse/MM-56827