[Bug] KubeRay Worker group pod keeps restarting on EKS - Fails to CrashLoopBackOff #2420

teopopescu · 2024-10-02T16:56:32Z

Search before asking

I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

I am following steps 2-5 here on an Amazon EKS cluster. I can run a job and access the dashboard; however, the worker pods keep restarting (K9s screenshot attached).

Logs of the ray-worker can be found below:

Running the same steps on kind works as expected: the worker pod reaches the Ready state and does not fail.

Reproduction script

kubectl create ns ray-system
helm repo add kuberay https://ray-project.github.io/kuberay-helm/ -n ray-system
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2 -n ray-system
helm install raycluster kuberay/ray-cluster --version 1.2.2 -n ray-system

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!


kevin85421 · 2024-10-05T23:51:01Z

Do you have a reproduction script, especially for the setup of your EKS cluster, so that I can reproduce the issue on EKS?

teopopescu added the bug (Something isn't working) and triage labels on Oct 2, 2024

teopopescu changed the title from "[Bug] KubeRay Worker group pod keeps restarting - Fails to CrashLoopBackOff" to "[Bug] KubeRay Worker group pod keeps restarting on EKS - Fails to CrashLoopBackOff" on Oct 2, 2024

kevin85421 added the external-author-action-required and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Oct 5, 2024
