Multi-GPU training outside of AzureML fails when creating folder structures #601

This happens when running outside AzureML with more than one GPU.
Rank 0 creates a folder named after a timestamp, like "20211112T....", where its files will go.
When Rank 1 starts, it tries to do exactly the same and fails.
Instead, Rank 1 should write into the same folder as Rank 0.
We need a way of passing the folder name to the subsequent ranks.
There are two options: a commandline argument or environment variables. The latter is probably cleaner.
Possible solution:
In Rank 0, the folders are created, and the output folder and logs folder are stored in environment variables.
In ranks != 0, the call to `self.container.create_filesystem(self.project_root)` in `run_ml.py` should be avoided. Instead, a `DeepLearningFileSystem` should be instantiated with the folders taken from the environment variables.
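
A minimal sketch of the environment-variable approach, not the actual InnerEye implementation. The variable names `EXAMPLE_OUTPUT_FOLDER` and `EXAMPLE_LOGS_FOLDER`, the `RunFolders` class (a stand-in for `DeepLearningFileSystem`), the `outputs`/`logs` subfolder names, and the function `resolve_run_folders` are all hypothetical. It also assumes that the other ranks are started as child processes of Rank 0 after the variables are set, so that they inherit its environment.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import MutableMapping, Optional

# Hypothetical variable names, not taken from the InnerEye codebase.
ENV_OUTPUT_FOLDER = "EXAMPLE_OUTPUT_FOLDER"
ENV_LOGS_FOLDER = "EXAMPLE_LOGS_FOLDER"


@dataclass(frozen=True)
class RunFolders:
    """Hypothetical stand-in for the folders held by DeepLearningFileSystem."""
    outputs: Path
    logs: Path


def resolve_run_folders(rank: int,
                        project_root: Path,
                        env: Optional[MutableMapping[str, str]] = None) -> RunFolders:
    """Create the run folders in rank 0, and reuse them in all other ranks."""
    env = os.environ if env is None else env
    if rank == 0:
        # Rank 0 picks the timestamped folder name and creates the folders.
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        run_root = project_root / timestamp
        folders = RunFolders(outputs=run_root / "outputs", logs=run_root / "logs")
        folders.outputs.mkdir(parents=True)
        folders.logs.mkdir(parents=True)
        # Publish the folder names so that child processes can pick them up.
        env[ENV_OUTPUT_FOLDER] = str(folders.outputs)
        env[ENV_LOGS_FOLDER] = str(folders.logs)
        return folders
    # All other ranks must not create folders: they read what rank 0 published.
    try:
        return RunFolders(outputs=Path(env[ENV_OUTPUT_FOLDER]),
                          logs=Path(env[ENV_LOGS_FOLDER]))
    except KeyError as ex:
        raise RuntimeError(f"Rank {rank} started without folder information "
                           f"from rank 0: {ex} is not set.") from ex
```

Passing `env` explicitly is only there to make the sketch testable. In a real run it would default to `os.environ`, and failing loudly when a variable is missing avoids silently creating a second timestamped folder.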

Workaround: Do not run multi-GPU jobs on VMs. This can be enforced by setting `max_num_gpus=1` on the commandline.

[AB#4747](https://innereye.visualstudio.com/60ce1777-00d6-4015-82bc-488a0c00202f/_workitems/edit/4747)
