SoftRoCE support might be interesting #136

Open
ocaisa opened this issue Jan 27, 2021 · 12 comments
Labels: enhancement (New feature or request)

Comments

ocaisa commented Jan 27, 2021

I saw a talk today that said the performance is pretty good for small clusters, and it can also be leveraged for NFS.
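
For reference, a rough sketch of what NFS over RDMA might look like once a soft-RoCE device exists on both ends (untested here; the server name and export path are placeholders):

# On the NFS server: load the server-side RDMA transport and open the RDMA listener (20049 is the conventional port)
sudo modprobe svcrdma
echo "rdma 20049" | sudo tee -a /proc/fs/nfsd/portlist

# On the client: load the client-side transport and mount with the rdma option
sudo modprobe xprtrdma
sudo mount -t nfs -o rdma,port=20049 nfs-server:/export /mnt/export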

ocaisa commented Feb 1, 2021

I was playing around with this on CentOS 8 and it is pretty easy to set up:

sudo modprobe rdma_rxe
sudo yum install -y rdma-core libibverbs-utils
sudo rdma link add rxe_eth0 type rxe netdev eth0  # need name of network device
rdma link                                         # will show group to use below
# To run the setup commands on the other nodes
clush -w node[1-N] ...
# IB ping pong
ibv_rc_pingpong -d rxe_eth0 -g 1              # server
ibv_rc_pingpong -d rxe_eth0 -g 1 <SERVER_IP>  # client

The problem is that user[1-N] cannot see (and therefore use) the devices:

[user01@login1 GROMACS]$ ibv_devices 
    device          	   node GUID
    ------          	----------------
[user01@login1 GROMACS]$ 

They only seem to be visible to the centos user.

ocaisa commented Feb 1, 2021

It does appear that there would be a benefit to this. Some initial testing showed it can lower MPI latency by 50%.
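
This kind of comparison can be made with the OSU micro-benchmarks; a rough sketch assuming Open MPI built against UCX and the rxe_eth0 device from above (hostnames and the benchmark path are placeholders):

# Latency over the SoftRoCE device (UCX uses the verbs transports on rxe_eth0)
mpirun -np 2 --host node1,node2 -x UCX_NET_DEVICES=rxe_eth0:1 ./osu_latency

# Baseline over plain TCP for comparison
mpirun -np 2 --host node1,node2 -x UCX_TLS=tcp,self -x UCX_NET_DEVICES=eth0 ./osu_latency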

cmd-ntrf self-assigned this Feb 1, 2021
cmd-ntrf added the enhancement (New feature or request) label Feb 1, 2021

cmd-ntrf commented Feb 1, 2021

The devices are probably not visible to user[1-N] because of SELinux confinement.

You can try to su from centos to a regular user (sudo su - user01) and check whether the devices are visible. If they are, that would confirm that the SELinux policies have to be fine-tuned to allow users to see the devices.
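
For a quick check, something along these lines (user01 as a placeholder account):

# From the centos user: switch to a regular account and list the verbs devices
sudo su - user01 -c 'ibv_devices'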

ocaisa commented Feb 1, 2021

Confirmed, that works

ocaisa commented Feb 1, 2021

The setup steps are lost on reboot (IIUC), so setting it up would need a Puppet module, I guess... or more specifically:

sudo rdma link add rxe_eth0 type rxe netdev eth0
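
Until that exists, a minimal sketch of how the module load and the link creation could be made persistent by hand, assuming systemd (the unit name is arbitrary):

# Load the rdma_rxe module at boot
echo rdma_rxe | sudo tee /etc/modules-load.d/rdma_rxe.conf

# Recreate the rxe link at boot with a oneshot unit
sudo tee /etc/systemd/system/soft-roce.service <<'EOF'
[Unit]
Description=Create SoftRoCE link on eth0
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/rdma link add rxe_eth0 type rxe netdev eth0

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable soft-roce.service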

cmd-ntrf commented Feb 1, 2021

In ComputeCanada/puppet-magic_castle#84, you mentioned that installing the rdma and ibverbs packages with yum was not sufficient on CentOS 7.

Is it different because you are using CentOS 8?

ocaisa commented Feb 1, 2021

In this case I was trying CentOS 8 because I thought I needed a more recent kernel for soft-RoCE support (kernel module rdma_rxe). It looks like the support should already be available since CentOS 7.4 though, so I might go back and test again.

In general I actually don't know a whole lot about InfiniBand configuration; this is just a lot of googling and trial and error on my part so far. Now that I've learned a little more about how it all works, I might be able to revisit ComputeCanada/puppet-magic_castle#84 and have more success.

ocaisa commented Feb 1, 2021

I suspect that if I had run the right rdma link ... command I might have had some success.

cmd-ntrf commented Feb 1, 2021

Sounds good. Keep us updated.

ocaisa commented Feb 2, 2021

So between nodes it seems to make little difference, but inside a node it would appear to be able to halve the latency for tiny messages; I need to do more testing...

Within a node and between nodes it would appear to make no difference. I guess the implication is that without exposing hardware capabilities on the network card it's not going to do anything.

Unfortunately there was a regression in our latest UCX installation, so I'm not certain I'm squeezing out all the performance I can.
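
One way to sanity-check whether UCX actually sees the rxe device and which transports it would use:

ucx_info -v                       # version of the UCX installation in use
ucx_info -d | grep -A 3 rxe_eth0  # rxe_eth0 should show up under the verbs transports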

cmd-ntrf commented Feb 2, 2021

I see that you are running the benchmarks on the login nodes. If you run on a compute node during a Slurm job, you won't have to disable SELinux, as the user inherits the slurmd context, which is unconfined.
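
e.g., something like this from the login node should list the devices without touching SELinux (assuming rxe is set up on the compute nodes):

# Run the device listing inside a Slurm job on a compute node
srun -N 1 ibv_devices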

ocaisa commented Feb 4, 2021

On CentOS 7 it also works, with just one command change: rather than using rdma link add ..., you use:

rxe_cfg add eth0

and the resulting device is called rxe0.
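
A rough CentOS 7 equivalent of the CentOS 8 setup above (assuming rxe_cfg is shipped with the rdma-core/libibverbs packages there):

sudo yum install -y rdma-core libibverbs-utils
sudo rxe_cfg start        # loads the rdma_rxe module and enables the rxe service
sudo rxe_cfg add eth0     # creates the soft-RoCE device (shows up as rxe0)
rxe_cfg status            # verify the new device
ibv_devices               # rxe0 should now be listed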
