Question
I am finding it incredibly difficult to follow Ray's guidelines for running a Docker image on a Ray cluster in order to execute a Python script, and I can't find any simple working examples.
I have the simplest possible Dockerfile:
FROM rayproject/ray
WORKDIR /usr/src/app
COPY . .
CMD ["step_1.py"]
ENTRYPOINT ["python3"]
I use this to build an image and push it to Docker Hub ("myimage" is just an example name):
docker build -t myimage .
docker push myimage
"step_1.py" just prints "hello" once a second for 200 seconds:
import time
for i in range(200):
    time.sleep(1)
    print("hello")
This is my config.yaml, again very simple:
cluster_name: simple-1
min_workers: 0
max_workers: 2
docker:
    image: "myimage"
    container_name: "my_simple_docker_container"
    pull_before_run: True
idle_timeout_minutes: 5
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
file_mounts_sync_continuously: False
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx826a6b31fd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx826a6b31fd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot
head_setup_commands:
    - pip install boto3==1.4.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
I run this in the terminal:
ray up simple1.yaml
and get this error every time:
Shared connection to x.x.xx.119 closed.
"docker cp" requires exactly 2 arguments.
See 'docker cp --help'.
Usage: docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
Copy files/folders between a container and the local filesystem
Shared connection to x.x.xx.119 closed.
I should add that the Docker image runs fine on any other remote machine, just not on the Ray cluster.
If someone could please help me, I would be eternally grateful, and I promise to write a tutorial on Medium after my struggles.
Answer 1:
I think the issue might be the use of ENTRYPOINT. The Ray Cluster Launcher starts Docker with a command roughly like:
docker run --rm --name <NAME> -d -it --net=host <image_name> bash
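As a rough illustration of that command's shape, the sketch below assembles it from the docker section of the question's config. The helper approx_launch_command is hypothetical, and the exact flags the launcher passes vary between Ray versions, so treat this as an approximation rather than Ray's actual code.

```python
# Hypothetical sketch: rebuild the approximate "docker run" command quoted
# above from the "docker" section of the cluster config. Not Ray's real code;
# the flags the launcher actually uses differ between Ray versions.
import shlex
from typing import Dict, List


def approx_launch_command(docker_cfg: Dict[str, object]) -> List[str]:
    """Return an approximate 'docker run' argv for a docker config section."""
    return [
        "docker", "run", "--rm",
        "--name", str(docker_cfg["container_name"]),
        "-d", "-it", "--net=host",
        str(docker_cfg["image"]),
        # Extra argument after the image name: a shell attached to a TTY
        # (-it) keeps the detached (-d) container running.
        "bash",
    ]


# Values taken from the question's config.yaml.
docker_section = {
    "image": "myimage",
    "container_name": "my_simple_docker_container",
    "pull_before_run": True,
}

# shlex.join requires Python 3.8+.
print(shlex.join(approx_launch_command(docker_section)))
```

The important part is the final bash argument: the launcher relies on that shell to keep the container alive, which is exactly what a custom ENTRYPOINT can interfere with.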
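When I ran "docker build -t myimage ." and then "docker run --rm -it myimage bash", Docker errored with:
python3: can't open file 'bash': [Errno 2] No such file or directory
Because the image's ENTRYPOINT is python3, the trailing bash is not run as a shell. It replaces CMD and is passed to python3 as the name of a script, so the container exits instead of staying up.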
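A minimal Python model of Docker's exec-form rule makes the failure explicit: arguments given after the image name replace CMD but are appended to ENTRYPOINT. The function effective_argv is hypothetical and deliberately ignores shell-form instructions and anything inherited from the base image.

```python
# Hypothetical model of how Docker combines exec-form ENTRYPOINT and CMD
# with arguments passed to "docker run". Simplifications (assumptions):
# only exec form (JSON arrays) is handled; base-image inheritance is ignored.
from typing import List, Optional


def effective_argv(entrypoint: List[str], cmd: List[str],
                   run_args: Optional[List[str]] = None) -> List[str]:
    """Return the argv the container would execute.

    Arguments given after the image name replace CMD, but they are
    appended to ENTRYPOINT rather than replacing it.
    """
    args = run_args if run_args else cmd
    return entrypoint + args


# Values from the Dockerfile in the question.
ENTRYPOINT = ["python3"]
CMD = ["step_1.py"]

# "docker run myimage" with no extra arguments runs the script.
print(effective_argv(ENTRYPOINT, CMD))

# The launcher appends "bash": it replaces CMD but not ENTRYPOINT,
# so python3 tries to open a script named "bash" and exits.
print(effective_argv(ENTRYPOINT, CMD, ["bash"]))

# Hypothetical edited Dockerfile with the ENTRYPOINT line removed
# (assuming the base image sets no ENTRYPOINT of its own): bash runs.
print(effective_argv([], CMD, ["bash"]))
```

The last call only holds under the stated assumption that nothing in the base image supplies an ENTRYPOINT.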
Source: https://stackoverflow.com/questions/65570374/launching-a-simple-python-script-on-an-aws-ray-cluster-with-docker