[Bug] KubeRay Worker group pod keeps restarting on EKS - Fails to CrashLoopBackOff #2420

Open
2 tasks done
teopopescu opened this issue Oct 2, 2024 · 1 comment
Labels
bug (Something isn't working), external-author-action-required, P1 (Issue that should be fixed within a few weeks)

Comments


teopopescu commented Oct 2, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

I am following steps 2-5 here on an Amazon EKS cluster. I am able to run a job and access the dashboard; however, the worker pods keep restarting (K9s screenshot attached).
image

Logs of the ray-worker can be found below:
image

Running the same steps on kind works as expected: the worker pod reaches the Ready state and does not fail.

image

Reproduction script

kubectl create ns ray-system
helm repo add kuberay https://ray-project.github.io/kuberay-helm/ -n ray-system
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2 -n ray-system
helm install raycluster kuberay/ray-cluster --version 1.2.2 -n ray-system
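
Since the worker logs above are attached only as screenshots, a text capture of the restart loop may be easier to compare between EKS and kind. The following is a minimal sketch, not part of the original report: the function name `collect_worker_diagnostics` and the output file name are hypothetical, and it assumes the worker pods carry the KubeRay label `ray.io/node-type=worker` and live in the `ray-system` namespace used in the steps above.

```shell
#!/usr/bin/env bash
# Sketch: collect text diagnostics for restarting Ray worker pods.
# Assumption: worker pods carry the KubeRay label ray.io/node-type=worker.

collect_worker_diagnostics() {
  local ns="${1:-ray-system}"   # namespace from the install steps above
  local selector="ray.io/node-type=worker"

  # Pod status and restart counts.
  kubectl get pods -n "$ns" -l "$selector" -o wide

  # Per-pod details: events, last termination reason, exit code.
  local pod
  for pod in $(kubectl get pods -n "$ns" -l "$selector" -o name); do
    kubectl describe -n "$ns" "$pod"
    # Logs from the previous (crashed) container instance, if there is one.
    kubectl logs -n "$ns" "$pod" --previous --all-containers || true
  done

  # Recent namespace events, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# Usage (hypothetical output file name):
#   collect_worker_diagnostics ray-system > worker-diagnostics.txt 2>&1
```

The `--previous` flag is used because, in a restart loop, the current container may not have logged anything yet, while the last terminated instance usually holds the failure message.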


Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

teopopescu added the bug and triage labels on Oct 2, 2024
teopopescu changed the title from "[Bug] KubeRay Worker group pod keeps restarting - Fails to CrashLoopBackOff" to "[Bug] KubeRay Worker group pod keeps restarting on EKS - Fails to CrashLoopBackOff" on Oct 2, 2024

@kevin85421
Member

Do you have a reproduction script, especially for the setup of your EKS cluster, so that I can reproduce this on EKS?

kevin85421 added the external-author-action-required and P1 labels and removed the triage label on Oct 5, 2024