[bugfix] Check response status in the `httpjson` log handler and retry in case of `TOO_MANY_REQUESTS` by ekouts · Pull Request #3356 · reframe-hpc/reframe

ekouts · 2025-01-08T08:18:15Z

Reopening #3354 on a branch based on master.
Fixes #3353.

ekouts · 2025-01-08T08:24:50Z

@vkarak I have addressed the comments from the previous PR. Before this change, ReFrame did not catch the logging error and would stop the execution of the tests with a message like this:

...
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
[  PASSED  ] Ran 1/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan  8 09:11:44 2025+0100
ERROR: run session stopped: logging error: HTTPJSONhandler logging failed: HTTP response code 429
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'

Now we simply get a warning and the tests continue to run:

...
[       OK ] (1/2) SlurmQueueStatusCheck %slurm_partition=normal* /72a54254 @daint:login+builtin
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=normal* @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[       OK ] (2/2) SlurmQueueStatusCheck %slurm_partition=debug /67512ae1 @daint:login+builtin
P: available_nodes_percentage: 96.875 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan  8 09:13:02 2025+0100
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'

(I triggered it with a 429, which we normally handle by retrying, but the output should be similar for an error code such as 500.)
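For readers following along, the behavior change described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `LoggingError`, `check_response` and `log_perf_data` are hypothetical names, and the status code is passed in directly instead of coming from a real HTTP request.

```python
import warnings


class LoggingError(Exception):
    """Hypothetical error type for a failed log emission."""


def check_response(status_code):
    # Treat any status outside the 2xx range as a logging failure
    if not 200 <= status_code < 300:
        raise LoggingError(
            f'HTTPJSONhandler logging failed: HTTP response code {status_code}'
        )


def log_perf_data(test_name, status_code):
    """Return True if the record was logged; warn and return False otherwise."""
    try:
        check_response(status_code)
    except LoggingError as err:
        # Report the problem, but let the run session continue
        warnings.warn(f'could not log performance data for {test_name}: {err}')
        return False

    return True
```

The point of the sketch is only that the failure is downgraded from a session-stopping error to a per-test warning.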

vkarak · 2025-01-08T22:13:47Z

> WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429

@ekouts Looking at the code, how can this message (I mean the one with code 429) be produced? We keep poking if we get a 429. Unless we stop retrying infinitely with itertools.cycle() (which I think is a good idea) and break the loop once we consume the list of intervals.

ekouts · 2025-01-09T08:26:27Z

> @ekouts Looking at the code, how can this message (I mean the one with code 429) be produced? We keep poking if we get a 429.

It can't 😛 I commented out lines 701-703 so that you see the output.

> Unless we stop retrying infinitely with itertools.cycle() (which I think is a good idea) and break the loop once we consume the list of intervals.

Hm, you are probably right. But I am not sure how long is too long, maybe 1 minute? Sometimes the rate limit is set per minute, so it may take some time to "reset". I cannot tell what it is set to in logstash for us.

vkarak · 2025-01-09T14:08:42Z

> Hm, you are probably right. But I am not sure how long is too long, maybe 1 minute?

What if we make it a configuration parameter along with the list of intervals?

ekouts · 2025-01-09T16:19:28Z

> What if we make it a configuration parameter along with the list of intervals?

Makes sense. Then the user can provide a finite list, or a cycle if they want to wait forever. I will leave the [.1, .2, .4, .8, 1.6, 3.2] list as the default.

vkarak · 2025-01-09T22:00:03Z

> Makes sense. Then the user can provide a finite list, or a cycle if they want to wait forever. I will leave the [.1, .2, .4, .8, 1.6, 3.2] list as the default.

I would rather always have the cycle in the code and a timeout as a separate parameter, as this is more intuitive. A timeout=0 would then mean "wait until you get served". I also think that these two parameters can be specific to that log handler only.

ekouts · 2025-01-14T16:25:06Z

> I also think that these two parameters can be specific to that log handler only.

Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over?

vkarak · 2025-01-14T21:31:04Z

> Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over?

Yes, the first one is the timeout after which we give up and issue a warning, and the second one is the list of wait/sleep intervals.
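The design agreed on above (always cycle over the intervals, with a separate timeout where 0 means "retry until served") could look roughly like the sketch below. It is a minimal illustration, not the handler's actual code: `post_with_backoff` and `send` are hypothetical names, `send` stands in for the real HTTP POST and just returns a status code, and the `sleep` and `clock` parameters exist only to make the function easy to test.

```python
import itertools
import time


def post_with_backoff(send,
                      backoff_intervals=(0.1, 0.2, 0.4, 0.8, 1.6, 3.2),
                      retry_timeout=0,
                      sleep=time.sleep,
                      clock=time.monotonic):
    """Call send() until it returns a status code other than 429.

    A retry_timeout of 0 means "retry until served". Otherwise, give up
    once retry_timeout seconds have elapsed and return the last status.
    """
    if not backoff_intervals:
        raise ValueError('backoff_intervals must not be empty')

    start = clock()
    # Cycle over the intervals forever; only the timeout ends the loop
    for wait in itertools.cycle(backoff_intervals):
        status = send()
        if status != 429:
            return status

        if retry_timeout and clock() - start >= retry_timeout:
            # Still rate limited: the caller decides how to report it
            return status

        sleep(wait)
```

With this shape the caller can turn a returned 429 into the warning discussed earlier, while any other status is handled by the regular response check.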

vkarak

I renamed sleep_intervals to backoff_intervals as it's more accurate, enhanced the docs a bit and fixed some coding style issues. LGTM now.

Try it once more tomorrow to make sure that my changes haven't broken anything, and then we're good to merge it!

vkarak · 2025-01-16T23:28:35Z

I've also renamed timeout to retry_timeout and updated the schema accordingly.
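As an illustration of the final parameter names, a configuration using them might look like the fragment below. This is a sketch under assumptions: the two parameters are taken to sit in the `httpjson` entry of `handlers_perflog`, the URL is a placeholder, and the 60-second timeout simply echoes the "maybe 1 minute" idea from the discussion above. The interval list is the default mentioned earlier in this thread.

```python
site_configuration = {
    'logging': [
        {
            'handlers_perflog': [
                {
                    'type': 'httpjson',
                    # Placeholder URL, not a real endpoint
                    'url': 'http://httpjson-server.example:12345/rfm',
                    'level': 'info',
                    # Wait intervals (in seconds) cycled over on HTTP 429
                    'backoff_intervals': [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
                    # Give up and warn after 60 s; 0 would retry until served
                    'retry_timeout': 60,
                }
            ]
        }
    ]
}
```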

ekouts · 2025-01-17T09:21:09Z

@vkarak I tried it and still works as expected, thanks for the fixes :)

ekouts added 5 commits January 8, 2025 08:39

Add simple error checking in httpjson

fd001d5

Handle 429 in httpjson logger

9bfae8a

Fix import

ffae0ec

Small improvements

eb01cde

Add warning for logging errors

32e6bca

ekouts added prio: normal bugfix logging labels Jan 8, 2025

ekouts added this to the ReFrame 4.8 milestone Jan 8, 2025

ekouts requested review from teojgo and vkarak January 8, 2025 08:18

ekouts self-assigned this Jan 8, 2025

Add sleep_times and timeout parameters in the httpjson logger

958372c

ekouts added 2 commits January 15, 2025 11:45

Rename sleep_times to sleep_intervals

2b8adff

Merge branch 'master' of https://github.com/reframe-hpc/reframe into …

bba5e05

…bugfix/log_httpjson_errors_2

vkarak modified the milestones: ReFrame 4.8, ReFrame 4.7.3 Jan 16, 2025

Rename sleep_intervals to backoff_intervals and enhance docs

7bf459d

vkarak approved these changes Jan 16, 2025

Rename also httpjson timeout to retry_timeout

347ef6c

Merge branch 'master' into bugfix/log_httpjson_errors_2

78f6006

vkarak changed the title ~~[bugfix] Improve handling of requests in httpjson logger~~ [bugfix] Check response status in the httpjson log handler and retry in case of TOO_MANY_REQUESTS Jan 17, 2025

vkarak enabled auto-merge January 17, 2025 10:47

teojgo approved these changes Jan 17, 2025

vkarak merged commit 8bccc24 into reframe-hpc:master Jan 17, 2025
36 checks passed

vkarak linked an issue Jan 17, 2025 that may be closed by this pull request

Check response status code in httpjson logger #3353

Closed

vkarak merged 11 commits into reframe-hpc:master from ekouts:bugfix/log_httpjson_errors_2

3 participants
