
[cluster] bringing up cluster fails: socket.gethostbyname("localhost") - socket.gaierror: [Errno -2] Name or service not known #43416

@jon-ressio

Description


What happened + What you expected to happen

Trying to bring up a local cluster using Docker, I get the following error. I've also tried with the ray-ml Docker image, with the same result.

I know Docker can have some DNS challenges, but if I spin up a Docker container on the worker or head nodes using the same Ray image as in my yaml, localhost resolves just fine.

I do notice that on my target nodes, ping localhost (via SSH to the host, not inside the Docker container) resolves to an IPv6 address, while inside the Ray container it resolves to 127.0.0.1; I'd expect these commands to be run inside the container. I mention this because I ran across: https://discuss.ray.io/t/lack-of-ipv6-support/2323/3.

  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 125, in __init__
    self._localhost = socket.gethostbyname("localhost")
socket.gaierror: [Errno -2] Name or service not known
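For what it's worth, the failing call is IPv4-only, so it can fail even when localhost does resolve over IPv6. Here's a quick diagnostic sketch I can run inside the container (e.g. via `docker exec -it ray python ...`, container name taken from my yaml) to compare the two lookup paths:

```python
import socket

# Ray's Node.__init__ calls socket.gethostbyname("localhost"), which is
# IPv4-only: it raises socket.gaierror if "localhost" has no IPv4 mapping
# (e.g. an /etc/hosts that only maps localhost to the IPv6 address ::1).
try:
    print("IPv4:", socket.gethostbyname("localhost"))
except socket.gaierror as e:
    print("IPv4 lookup failed:", e)

# getaddrinfo() also returns IPv6 results, so it can succeed even when
# gethostbyname() fails; listing both shows which address families resolve.
for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo("localhost", None):
    print(family.name, sockaddr[0])
```

If the first lookup fails but the loop still prints an `AF_INET6` entry, that would point at the IPv6-only resolution issue from the discuss thread above.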

Versions / Dependencies

Ray 2.9.3, Python 3.9, Arch Linux

Reproduction script

ray up ./ray-cluster.yaml

ray-cluster.yaml

cluster_name: local

# The maximum number of workers nodes to launch in addition to the head node.
max_workers: 2

# Cloud-provider specific configuration.
provider:
  type: local
  head_ip: MY_HEAD_NODE
  worker_ips: [WORKER_NODE_ONE, WORKER_NODE_TWO]

# How Ray will authenticate with newly launched nodes.
auth:
  ssh_user: ai
  ssh_private_key: ~/.ssh/ai

docker:
  image: rayproject/ray:2.9.3-cpu
  container_name: ray
  pull_before_run: True

# The commands that will be run on the head node after it has been updated.
#head_setup_commands:
# - conda create -n ray python=3.9 ray=2.9.3 -y
# - pip install -U ray

# The commands to run on worker nodes after they have been updated.
#worker_setup_commands:
# - conda create -n ray python=3.9 ray=2.9.3 -y
# - pip install -U ray

# Command to start Ray on the head node. You might need to adjust the memory and CPU resources.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --object-manager-port=8076 # --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start Ray on worker nodes. Adjust resources as necessary.
worker_start_ray_commands:
  - ray stop
  - ray start --address=MY_HEAD_NODE:6379 --object-manager-port=8076
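One workaround I plan to try (untested): the autoscaler's docker config accepts `run_options`, extra flags passed through to `docker run`, so the IPv4 mapping for localhost could be pinned explicitly instead of depending on the image's /etc/hosts:

```yaml
docker:
  image: rayproject/ray:2.9.3-cpu
  container_name: ray
  pull_before_run: True
  # Untested idea: force an IPv4 localhost entry in the container's
  # /etc/hosts via an extra `docker run` flag.
  run_options:
    - --add-host=localhost:127.0.0.1
```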

Output:

ray up ./ray-cluster.yaml 
Cluster: local

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2024-02-24 14:34:00,505 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
No head node found. Launching a new cluster. Confirm [y/N]: y

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Acquiring an up-to-date head node
2024-02-24 14:34:18,786 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
  The head node will not launch any workers because `ray start` does not have `--autoscaling-config` set.
Potential fix: add `--autoscaling-config=~/ray_bootstrap_config.yaml` to the `ray start` command under `head_start_ray_commands`.
2024-02-24 14:34:18,789 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 192.168.17.2
Warning: Permanently added '192.168.17.2' (ED25519) to the list of known hosts.
 14:34:19 up 18:25,  2 users,  load average: 0.04, 0.10, 0.18
Shared connection to 192.168.17.2 closed.
    Success.
  Updating cluster configuration. [hash=c754366af104b4999602666f2635565295c9e05c]
2024-02-24 14:34:19,504 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
  [3/7] No worker file mounts to sync
2024-02-24 14:34:20,135 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initializing command runner
Shared connection to 192.168.17.2 closed.
2.9.3-cpu: Pulling from rayproject/ray
Digest: sha256:c864e37f4ce516ff49425f69cac5503a51e84c333d30928416714a2c3da55b43
Status: Image is up to date for rayproject/ray:2.9.3-cpu
docker.io/rayproject/ray:2.9.3-cpu
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
be3ec57c8176bc12599d0f9bf97b026d0caab95d11d68a5a014aa0e64cd34a27
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
sending incremental file list
ray_bootstrap_config.yaml

sent 568 bytes  received 35 bytes  1,206.00 bytes/sec
total size is 974  speedup is 1.62
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
sending incremental file list
ray_bootstrap_key.pem

sent 2,088 bytes  received 35 bytes  4,246.00 bytes/sec
total size is 2,590  speedup is 1.22
Shared connection to 192.168.17.2 closed.
Shared connection to 192.168.17.2 closed.
  [6/7] No setup commands to run.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 192.168.17.2 closed.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 192.168.17.2
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 771, in start
    node = ray._private.node.Node(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/node.py", line 125, in __init__
    self._localhost = socket.gethostbyname("localhost")
socket.gaierror: [Errno -2] Name or service not known
Shared connection to 192.168.17.2 closed.
2024-02-24 14:34:27,065 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['MY_HEAD_NODE', 'WORKER_NODE_ONE', 'WORKER_NODE_TWO']
  New status: update-failed
  !!!
  Full traceback: Traceback (most recent call last):
  File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 159, in run
    self.do_update()
  File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 527, in do_update
    self.cmd_runner.run(
  File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 493, in run
    return self.ssh_command_runner.run(
  File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/jon/.conda/envs/ml39/lib/python3.9/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

  Error message: SSH command failed.
  !!!
  
  Failed to setup head node.

Issue Severity

None

    Labels

    P2: Important issue, but not time-critical
    bug: Something that is supposed to be working, but isn't
    core: Issues that should be addressed in Ray Core
    core-clusters: For launching and managing Ray clusters/jobs/kubernetes
