[bugfix] Check response status in the httpjson log handler and retry in case of TOO_MANY_REQUESTS#3356
Conversation
|
@vkarak I have addressed the comments from the previous PR. Before catching the logging error, Reframe would stop the execution of the tests with a message like this: ...
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
[ PASSED ] Ran 1/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan 8 09:11:44 2025+0100
ERROR: run session stopped: logging error: HTTPJSONhandler logging failed: HTTP response code 429
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'Now we simply get a message but the tests continue to run: ...
[ OK ] (1/2) SlurmQueueStatusCheck %slurm_partition=normal* /72a54254 @daint:login+builtin
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=normal* @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[ OK ] (2/2) SlurmQueueStatusCheck %slurm_partition=debug /67512ae1 @daint:login+builtin
P: available_nodes_percentage: 96.875 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[----------] all spawned checks have finished
[ PASSED ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan 8 09:13:02 2025+0100
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'(I triggered it with 429 and we actually handle this normally, but it should be similar for an error code 500) |
@ekouts Looking in the code, how can this message (I mean with code 429) can be produced? Because we keep poking if we get 429. Unless, we should not retry infinitely with |
It can't 😛 I commented out lines 701-703 so that you see the output.
Hm you are probably right. But I am not sure how much is too long, maybe 1 minute? Some times the rate limit is set per minute so it may take some time to "reset". I cannot tell how much it is in logstash for us. |
What if we make it a configuration parameter along with the list of intervals? |
Makes sense, then the user can add a finite list or a cycle if they want to wait forever. I will leave as default the |
I would rather have the cycle always in the code and a timeout as a separate parameter, as this is more intuitive. If |
Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over? |
Yes, the first one is the timeout after which we give up and issue a warning and the second one are the wait/sleep intervals. |
…bugfix/log_httpjson_errors_2
vkarak
left a comment
There was a problem hiding this comment.
I renamed sleep_intervals to backoff_intervals as it's more accurate, enhanced a bit the docs and fixed some coding style issues. Lgtm now.
Try it once more tomorrow so as to be sure that my changes haven't broken anything and we're good to merge it!
|
I've also renamed |
|
@vkarak I tried it and still works as expected, thanks for the fixes :) |
httpjson log handler and retry in case of TOO_MANY_REQUESTS
Reopening #3354 on a branch based on master.
Fixes #3353 .