Segmentation fault after successful transfer #3

Open
drewhemm opened this issue May 9, 2021 · 1 comment · May be fixed by #4

drewhemm commented May 9, 2021

Output on server:

# hdrdmacp -s -n 32 -m 8GB
Looking for IB devices ...

=============================================
Found 1 devices
---------------------------------------------
   device 0 : mlx4_0 : uverbs0 : IB : InfiniBand channel adapter : Num. ports=2 : port num=1 : lid=2
=============================================

Device mlx4_0 opened. num_comp_vectors=32
Port attributes:
           state: 4
         max_mtu: 5
      active_mtu: 5
  port_cap_flags: 38865002
      max_msg_sz: 1073741824
    active_width: 2
    active_speed: 4
      phys_state: 5
      link_layer: 1
buff_len_GB: 8
num_buff_sections: 32
We got this far...
Created 32 buffers of 250MB (8GB total)
Listening for connections on port ... 10470
=== [10 sec avg.] 0 GB/s  --  0 TB total received
=== [10 sec avg.] 0 GB/s  --  0 TB total received
=== [10 sec avg.] 0 GB/s  --  0 TB total received
Receiving file: /root/windows.iso
hi->flags: 0x1


Message from syslogd@HOSTNAME at May  9 16:45:46 ...
 kernel:[15768.841039] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [hdrdmacp:130454]
^C^C^C # tried to exit the program here, but to no avail
Message from syslogd@HOSTNAME at May  9 16:46:14 ...
 kernel:[15796.840997] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [hdrdmacp:130454]
Segmentation fault

Output on client:

# ./hdrdmacp windows.iso 192.168.19.1:/root/windows.iso
Looking for IB devices ...

=============================================
Found 1 devices
---------------------------------------------
   device 0 : mlx4_0 : uverbs0 : IB : InfiniBand channel adapter : Num. ports=2 : port num=1 : lid=1
=============================================

Device mlx4_0 opened. num_comp_vectors=96
Port attributes:
           state: 4
         max_mtu: 5
      active_mtu: 5
  port_cap_flags: 38865000
      max_msg_sz: 1073741824
    active_width: 2
    active_speed: 4
      phys_state: 5
      link_layer: 1
Created 4 buffers of 250MB (1GB total)
IP address: 192.168.19.1 (192.168.19.1)
Connected to 192.168.19.1:10470
Sending file: windows.iso-> (192.168.19.1:)/root/windows.iso   (5.50971 GB)
  queued 9MB (5509/5509 MB -- 100%  - 11.3267 Gbps)   ps)
  waiting for final 1 transfers to complete ...
  Transferred 5.50971 GB in 2.71587 sec  (16.2297 Gbps)
  I/O rate reading from file: 1.65955 sec  (26.56 Gbps)

The transfer looked good from the client side, and I checked that the file size and md5sum of the destination file matched the source. If I can find a solution for the segfault, I'll be a happy man!
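
For anyone wanting to repeat the size and md5sum check described above, here is a minimal sketch in Python. The helper names `file_fingerprint` and `same_file` are made up for illustration and are not part of hdrdmacp; the idea is to run it once on each host and compare the results.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for the file at `path`.

    Hypothetical helper, not part of hdrdmacp.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a multi-GB file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_file(fingerprint_a, fingerprint_b):
    """True only when both the size and the MD5 digest agree."""
    return fingerprint_a == fingerprint_b
```

On the client this would be called as `file_fingerprint("windows.iso")` and on the server as `file_fingerprint("/root/windows.iso")`. Matching tuples only show that the data arrived intact; they say nothing about why the server crashed afterwards.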

It looks like the IB connection itself, as seen by opensm, was interrupted:

# opensm
-------------------------------------------------
OpenSM 5.7.2.MLNX20201014.9378048
Command Line Arguments:
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 5.7.2.MLNX20201014.9378048

Using default GUID 0x2c903004bfc0b
Entering DISCOVERING state

Entering MASTER state


Message from syslogd@HOSTNAME at May  9 16:45:46 ...
 kernel:[15768.841039] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [hdrdmacp:130454]

Message from syslogd@HOSTNAME at May  9 16:46:14 ...
 kernel:[15796.840997] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [hdrdmacp:130454]
SM port is down

Entering DISCOVERING state

faustus123 (Collaborator) commented

Hi Andrew,

This may be the first time someone has tried to build and run this outside of the Hall-D counting house. I'm glad to see that you were actually able to get this far.

First off, I need to confess that this is not the absolute latest version of this code. Development was done within the Hall-D subversion repository, where the rest of our online software lives. This GitHub version is a snapshot that looks like it was pushed there about one month before the last commit to subversion. You can find the subversion code at the following URL, and you might want to try that first to see whether it fixes the issues.

https://halldsvn.jlab.org/repos/trunk/online/packages/miscUtils/src/hdrdmacp/

I should also note that in Hall-D the hdrdmacp server is always run as a system service, and the client is pretty much always run via scripted commands as part of the HOSS system. Thus, the variety of commands we use with it is limited. It has been running very successfully, though, with all of the 2020 GlueX data having passed through it.

More work could be done to make this more user-friendly. It also looks like more work is needed on the GitHub version to improve the build system. If you are at all interested in contributing, I would very much welcome that.

Let me know how it goes.

lucasz93 linked a pull request Jan 6, 2022 that will close this issue