This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Fix timeouts when downloading multiple checkpoint files #498

Merged
ant0nsc merged 9 commits into main from antonsc/downloadall on Jun 22, 2021

@ant0nsc commented Jun 21, 2021

Downloading multiple checkpoints in one call uses a code path that has a fixed 120-second timeout. Instead, download each file with its own individual download operation.
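For illustration only, here is a minimal sketch of the "one download per file" idea. It is not the code from this PR: `download_files_individually` and the `download_one` callback are hypothetical names, and the callback stands in for whatever single-file download call the storage client provides.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_files_individually(
    file_names: Iterable[str],
    download_one: Callable[[str, Path], None],
    prefix: str,
    output_dir: Path,
) -> List[Path]:
    """Download every file whose name starts with `prefix`, one call per file.

    `download_one` is a hypothetical single-file download callback. Issuing
    one call per file avoids a batch call whose timeout covers all files.
    """
    downloaded: List[Path] = []
    for name in file_names:
        if not name.startswith(prefix):
            continue  # Not one of the requested files.
        target = output_dir / name
        # Create the destination folder before handing off to the downloader.
        target.parent.mkdir(parents=True, exist_ok=True)
        download_one(name, target)
        downloaded.append(target)
    return downloaded
```

With this shape, a slow or large file can only be limited by whatever timeout applies to its own download, not by a fixed budget shared across the whole set of checkpoints.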

Please follow the guidelines for PRs contained here. Checklist:

  • Ensure that your PR is small, and implements one change.
  • Add unit tests for all functions that you introduced or modified.
  • Run PyCharm's code cleanup tools on your Python files.
  • Link the correct GitHub issue for tracking.
  • Update the Changelog file: Describe your change in terms of
    Added/Changed/Removed/... in the "Upcoming" section.
  • When merging your PR, replace the default merge message with a description of your PR,
    and if needed a motivation why that change was required.

@ant0nsc enabled auto-merge (squash) June 21, 2021 19:12

@dumbledad left a comment

All looks good.

@ant0nsc merged commit 7cd7e58 into main Jun 22, 2021
@ant0nsc deleted the antonsc/downloadall branch June 22, 2021 13:33

Development

Successfully merging this pull request may close these issues.

Download of recovery checkpoints should only happen on rank 0 in distributed training
