Cylon container fails #679
Comments
This looks like an execution issue. There are also subtleties related to Docker usage and port mapping. You might want to try host networking and make sure no processes are already running or sitting idle on the various nodes. For ECS, it has been necessary to map ports explicitly to avoid this sort of thing.
@mstaylor Can you elaborate more, please, on
@mstaylor, a gentle reminder about the comment above.
@AymenFJA - your issue is here: [cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98). For my research experiments, I use UCX/UCC/Redis, which is a bit different. For OpenMPI, you might consider the following approach: https://github.com/everpeace/kube-openmpi. If you switch to ECS, you can generate a task that includes port mapping. Here's an example from my ECS task mapping: "family": "cylon-ucc-ucx-redis-ec2-4_26_9100000-8Node-task", The issue is that you are running on pods with addresses already in use (hence the error logged). What does your hosts file look like?
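To check whether a given address and port is already taken on a node, a small probe like the one below may help. It is only a sketch: the helper name `can_bind` is hypothetical, it uses plain Python sockets rather than anything from Cylon or MPI, and it assumes Linux, where errno 98 is `EADDRINUSE` (the same code that appears in the logged error). Run it inside each pod with the address and port the worker is expected to use.

```python
import errno
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration; not part of Cylon or MPI.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # EADDRINUSE is errno 98 on Linux: another socket holds the address.
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # any other failure (e.g. bad address) is reported as-is
    return True


if __name__ == "__main__":
    # Simulate a leftover process by holding a port open on loopback.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
        holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        holder.listen(1)
        busy_port = holder.getsockname()[1]
        print("while held:", can_bind("127.0.0.1", busy_port))
    print("after release:", can_bind("127.0.0.1", busy_port))
```

If the probe returns False for a port the job needs, something on that node is still holding it, which would be consistent with the suggestion to look for leftover or idle processes before relaunching.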
@AymenFJA - did you use our Docker image or build an image based on updates in main?
Thanks, @mstaylor, for your response. Can we have a 1-1 meeting to discuss it? It would be great to do that. If you agree, I can ping you on Slack and take it from there.
@AymenFJA - that sounds great.
@mstaylor I pinged you on slack/cylondata.
Hello @nirandaperera and Cylon team,
I was testing the Cylon container with Kubernetes on AWS. I have a multi-node MPI environment set up on the cluster.
I tested Cylon with 1 and 2 nodes (each node has 128 cores and 16 GB of memory per core, for a total of 2048 GB per node). Both runs worked just fine when executing a join operation with ~35M rows using the following script: https://github.com/cylondata/cylon/blob/main/summit/scripts/cylon_scaling.py. The command line that I used:
I repeated the same setup but this time with 3 or 4 nodes:
And I started getting the following error:
Any help here would be appreciated.