SoftRoCE support might be interesting #136
I was playing around with this on CentOS 8 and it is pretty easy to set up:
the problem is that
it only seems to be visible by the
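For readers who want to try this, a minimal sketch of a Soft-RoCE setup on a recent kernel might look like the following. It is not necessarily the sequence used above: it assumes the `rdma_rxe` kernel module and the iproute2 `rdma` tool are available, and the names `eth0` and `rxe0` are placeholders. The helper names `run` and `softroce_setup` are made up for illustration. By default it only prints the commands, since the real ones need root.

```shell
#!/bin/sh
# Sketch only: the interface and device names are placeholders,
# not values taken from this thread.

# Print the command in dry-run mode (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# softroce_setup NETDEV RXE_NAME
# Loads the rxe module and attaches a Soft-RoCE device to NETDEV.
softroce_setup() {
    netdev="${1:-eth0}"
    rxe_name="${2:-rxe0}"
    run modprobe rdma_rxe
    run rdma link add "$rxe_name" type rxe netdev "$netdev"
    run rdma link show
}
```

Running `DRY_RUN=0 softroce_setup eth0 rxe0` as root would execute the commands instead of printing them.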
There does appear to be a benefit to this. Some initial testing showed it can lower MPI latency by 50%.
The devices are probably not visible to You can try to su from centos as a user
Confirmed, that works.
The setup steps are lost on reboot (IIUC), so setting it up would need a Puppet module, I guess... or more specifically
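One possible way to make the setup survive a reboot, sketched here as an assumption rather than as what was proposed in this thread, is a oneshot systemd unit that recreates the device at boot. The unit name `softroce.service`, the helper `write_softroce_unit`, the interface and device names, and the binary paths under `/usr/sbin` are all placeholders to check against the target system.

```shell
#!/bin/sh
# Sketch only: unit name, device names and binary paths are assumptions.

# write_softroce_unit UNIT_DIR NETDEV RXE_NAME
# Writes a oneshot unit that loads rdma_rxe and recreates the rxe link.
write_softroce_unit() {
    unit_dir="${1:?unit directory required}"
    netdev="${2:-eth0}"
    rxe_name="${3:-rxe0}"
    mkdir -p "$unit_dir"
    cat > "$unit_dir/softroce.service" <<EOF
[Unit]
Description=Create Soft-RoCE device $rxe_name on $netdev
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/sbin/modprobe rdma_rxe
ExecStart=/usr/sbin/rdma link add $rxe_name type rxe netdev $netdev

[Install]
WantedBy=multi-user.target
EOF
}
```

Calling `write_softroce_unit /etc/systemd/system eth0 rxe0` as root, followed by `systemctl daemon-reload` and `systemctl enable --now softroce.service`, would install and start it. A Puppet module could ship the same unit as a file resource.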
In ComputeCanada/puppet-magic_castle#84, you mentioned that installing rdma and ibverbs with yum was not sufficient on CentOS 7. Is it different because you are using CentOS 8?
In this case I was trying CentOS 8 because I thought I needed a more recent kernel for soft-RoCE support (kernel module In general, I don't actually know a whole lot about InfiniBand configuration; this is just a lot of googling and trial and error on my part so far. Now that I've learned a little more about how it all works, I might be able to revisit ComputeCanada/puppet-magic_castle#84 and have more success.
I suspect if I had run the right
Sounds good. Keep us updated.
Within a node and between nodes, it appears to make no difference. I guess the implication is that without exposing hardware capabilities on the network card, it's not going to do anything. Unfortunately, there was a regression in our latest UCX installation, and I'm not certain I am squeezing out all the performance I can.
I see that you are running the benchmarks on the login nodes. If you run on a compute node during a Slurm job, you won't have to disable SELinux, as the user inherits the slurmd context, which is unconfined.
On CentOS 7 it also works, with just one command change, rather than use
and the resulting device is called
I saw a talk today that said the performance is pretty good for small clusters, and that it can also be leveraged for NFS: