I'm trying to bring up a local cluster using Docker and get the following error. I've also tried the ray-ml Docker image, with the same result.
I know Docker can have some DNS challenges, but if I manually start a container on the worker or head nodes from the same Ray image as in my YAML, localhost resolves just fine.
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 125, in __init__
self._localhost = socket.gethostbyname("localhost")
socket.gaierror: [Errno -2] Name or service not known
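One way to narrow this down might be a small diagnostic like the one below, run both on the host and inside the container that Ray starts. It is only a sketch: `resolution_report` is a hypothetical helper name and not part of Ray. It compares the `socket.gethostbyname` call from the traceback, which per the Python docs does not support IPv6 name resolution, with `socket.getaddrinfo` restricted to each address family.

```python
import socket


def resolution_report(name="localhost"):
    """Resolve `name` three ways and collect results or error messages.

    Hypothetical helper for debugging; not part of Ray.
    """
    report = {}

    # Same call as in ray/_private/node.py; it only handles IPv4.
    try:
        report["gethostbyname"] = socket.gethostbyname(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        report["gethostbyname"] = f"error: {exc}"

    # getaddrinfo can be restricted to one address family at a time.
    for label, family in (("ipv4", socket.AF_INET), ("ipv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(name, None, family)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            report[label] = sorted({info[4][0] for info in infos})
        except OSError as exc:
            report[label] = f"error: {exc}"

    return report


if __name__ == "__main__":
    for key, value in resolution_report().items():
        print(f"{key}: {value}")
```

If the `ipv6` entry resolves but `gethostbyname` and `ipv4` report errors, that would suggest localhost is only mapped to an IPv6 address in that environment. This is an assumption to check, not a confirmed diagnosis.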
ray up ./ray-cluster.yaml
cluster_name: local
# The maximum number of workers nodes to launch in addition to the head node.
max_workers: 2
# Cloud-provider specific configuration.
provider:
    type: local
    head_ip: MY_HEAD_NODE
    worker_ips: [WORKER_NODE_ONE, WORKER_NODE_TWO]
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ai
    ssh_private_key: ~/.ssh/ai
docker:
    image: rayproject/ray:2.9.3-cpu
    container_name: ray
    pull_before_run: True
# The commands that will be run on the head node after it has been updated.
#head_setup_commands:
# - conda create -n ray python=3.9 ray=2.9.3 -y
# - pip install -U ray
# The commands to run on worker nodes after they have been updated.
#worker_setup_commands:
# - conda create -n ray python=3.9 ray=2.9.3 -y
# - pip install -U ray
# Command to start Ray on the head node. You might need to adjust the memory and CPU resources.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 # --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start Ray on worker nodes. Adjust resources as necessary.
worker_start_ray_commands:
    - ray stop
    - ray start --address=MY_HEAD_NODE:6379 --object-manager-port=8076
ray up ./ray-cluster.yaml
Cluster: local
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2024-02-24 14:34:00,505 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
No head node found. Launching a new cluster. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
2024-02-24 14:34:18,786 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
The head node will not launch any workers because `ray start` does not have `--autoscaling-config` set.
Potential fix: add `--autoscaling-config=~/ray_bootstrap_config.yaml` to the `ray start` command under `head_start_ray_commands`.
2024-02-24 14:34:18,789 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 192.168.17.2
Warning: Permanently added '192.168.17.2' (ED25519) to the list of known hosts.
14:34:19 up 18:25, 2 users, load average: 0.04, 0.10, 0.18
Shared connection to 192.168.17.2 closed.
Success.
Updating cluster configuration. [hash=c754366af104b4999602666f2635565295c9e05c]
2024-02-24 14:34:19,504 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
[3/7] No worker file mounts to sync
2024-02-24 14:34:20,135 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
Shared connection to 192.168.17.2 closed.
2.9.3-cpu: Pulling from rayproject/ray
Digest: sha256:c864e37f4ce516ff49425f69cac5503a51e84c333d30928416714a2c3da55b43
Status: Image is up to date for rayproject/ray:2.9.3-cpu
docker.io/rayproject/ray:2.9.3-cpu
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
be3ec57c8176bc12599d0f9bf97b026d0caab95d11d68a5a014aa0e64cd34a27
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
sending incremental file list
ray_bootstrap_config.yaml
sent 568 bytes received 35 bytes 1,206.00 bytes/sec
total size is 974 speedup is 1.62
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
sending incremental file list
ray_bootstrap_key.pem
sent 2,088 bytes received 35 bytes 4,246.00 bytes/sec
total size is 2,590 speedup is 1.22
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
[6/7] No setup commands to run.
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 192.168.17.2 closed.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 192.168.17.2
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
return cli()
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 771, in start
node = ray._private.node.Node(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 125, in __init__
self._localhost = socket.gethostbyname("localhost")
socket.gaierror: [Errno -2] Name or service not known
Shared connection to 192.168.17.2 closed.
2024-02-24 14:34:27,065 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
New status: update-failed
!!!
Full traceback: Traceback (most recent call last):
File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 159, in run
self.do_update()
File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 527, in do_update
self.cmd_runner.run(
File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 493, in run
return self.ssh_command_runner.run(
File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
Error message: SSH command failed.
!!!
Failed to setup head node.
What happened + What you expected to happen
I do notice that on my target nodes, ping localhost (when SSHed into the host, not inside the Docker container) resolves to an IPv6 address, but inside the Ray container it resolves to 127.0.0.1, and I'd expect these commands to be run inside the container. I mention this because I ran across: https://discuss.ray.io/t/lack-of-ipv6-support/2323/3.
Versions / Dependencies
ray 2.9.3, python 3.9, arch linux
Reproduction script
ray-cluster.yaml
output:
Issue Severity
None