All PaRSEC threads binding to the same physical core #130

Open
josephjohnjj opened this issue Jul 31, 2021 · 5 comments

Comments

josephjohnjj commented Jul 31, 2021

Hi,

When I run a TTG program, all the threads get bound to the same physical core. Things work better when I use

--bind-to none

Are there any performance problems if I use --bind-to none?

The program was compiled with the following modules: intel-mkl/2021.2.0 boost/1.71.0 openmpi/4.0.2 eigen/3.3.7 libunwind/1.2.1 intel-compiler/2021.2.0
and I am working with PaRSEC commit 15b871975fa596e1f2d5e4430c405d9e1b50e54d.

Regards,
Joseph

Contributor

devreal commented Jul 31, 2021 via email

@josephjohnjj
Author

The --bind-to none option was passed to mpirun. This is the PBS script I used initially; with it, all the threads were bound to the same physical core and the job timed out.

#!/bin/bash
#PBS -P kq12
#PBS -q normal
#PBS -l walltime=00:15:00
#PBS -l mem=192GB
#PBS -l jobfs=1GB
#PBS -l ncpus=96

module load openmpi/4.0.5

mpirun -np 2 --map-by node /home/659/jj8451/TTG/ttg/build/examples/uts-parsec  -b 2000 -q 0.124875 -m 8 -r 42

When I added --bind-to none, the run completed in 90 seconds.

#!/bin/bash
#PBS -P kq12
#PBS -q normal
#PBS -l walltime=00:15:00
#PBS -l mem=192GB
#PBS -l jobfs=1GB
#PBS -l ncpus=96

ulimit -c unlimited

module load openmpi/4.0.5
mpirun -np 2 --map-by node --bind-to none /home/659/jj8451/TTG/ttg/build/examples/uts-parsec -b 2000 -q 0.124875 -m 8 -r 42

I am running with one MPI process per node. PaRSEC was built normally, without any additional features, and this external PaRSEC was used to build TTG.

Machine (189GB total)
  Package L#0 + L3 L#0 (36MB)
    Group0 L#0
      NUMANode L#0 (P#0 47GB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#7)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#8)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#13)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#14)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#19)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#20)
      HostBridge
        PCI 00:11.5 (SATA)
        PCI 00:17.0 (SATA)
          Block(Disk) "sda"
        PCIBridge
          PCIBridge
            PCI 02:00.0 (VGA)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 08:00.2 (Ethernet)
                Net "eno1"
    Group0 L#1
      NUMANode L#1 (P#1 47GB)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#4)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#5)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#6)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#9)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#10)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#15)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#16)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 58:00.0 (InfiniBand)
            Net "ib0"
            OpenFabrics "mlx5_0"
  Package L#1 + L3 L#1 (36MB)
    Group0 L#2
      NUMANode L#2 (P#2 47GB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#31)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#32)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#33)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#37)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#38)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#39)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#43)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#44)
    Group0 L#3
      NUMANode L#3 (P#3 47GB)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#28)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#29)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#30)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#34)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#35)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#36)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#40)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#41)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#42)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)

Contributor

devreal commented Aug 2, 2021

Any chance your PaRSEC wasn't built with support for hwloc? According to the OMPI documentation, the default binding with np<=2 is core and if PaRSEC has no support for hwloc it won't enforce any binding itself.
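
On Linux, threads inherit the CPU affinity of the process that creates them, so a rank that mpirun has bound to a single core starts all of its threads on that core unless the runtime rebinds them. One way to check what a rank actually inherits is to print its allowed-CPU list from inside the job. The following is a minimal sketch, assuming a Linux node where /proc/self/status is readable; the helper name count_cpus and the script name show_binding.sh are made up for illustration.

```shell
#!/bin/bash
# count_cpus LIST (hypothetical helper): number of CPUs in a Linux cpulist
# string such as "0-3,7".
count_cpus() {
    local list=$1 total=0 part lo hi
    local IFS=','
    for part in $list; do
        if [[ $part == *-* ]]; then
            lo=${part%-*}
            hi=${part#*-}
            total=$(( total + hi - lo + 1 ))
        else
            total=$(( total + 1 ))
        fi
    done
    echo "$total"
}

# Report the CPUs this process is allowed to run on, as the kernel sees them.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list/ { print $2 }' /proc/self/status)
    echo "$(hostname): allowed CPUs $allowed ($(count_cpus "$allowed") total)"
fi
```

Saved as show_binding.sh and launched in place of the application, for example with mpirun -np 2 --map-by node ./show_binding.sh, it can be run once with and once without --bind-to none to compare the two cases. Open MPI's mpirun also accepts --report-bindings, which prints the binding it applied to each rank.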

Author

josephjohnjj commented Aug 2, 2021

PaRSEC was built with hwloc. Running ldd libparsec.so.3.0.0 gives the following:

    linux-vdso.so.1 (0x00007ffcf9587000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fca2d397000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fca2d18f000)
    libhwloc.so.15 => /lib64/libhwloc.so.15 (0x00007fca2cf3f000)
    libmpi.so.40 => /apps/openmpi/4.0.5/lib/libmpi.so.40 (0x00007fca2cc18000)
    libimf.so => /apps/intel-ct/2021.2.0/compiler/linux/compiler/lib/intel64/libimf.so (0x00007fca2c590000)
    libsvml.so => /apps/intel-ct/2021.2.0/compiler/linux/compiler/lib/intel64/libsvml.so (0x00007fca2aa93000)
    libirng.so => /apps/intel-ct/2021.2.0/compiler/linux/compiler/lib/intel64/libirng.so (0x00007fca2a729000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fca2a3a7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fca2a18f000)
    libintlc.so.5 => /apps/intel-ct/2021.2.0/compiler/linux/compiler/lib/intel64/libintlc.so.5 (0x00007fca29f17000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fca29cf7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fca29932000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fca2d843000)
    libopen-rte.so.40 => /apps/openmpi-mofed5.1-pbs2021.1/4.0.5/lib/libopen-rte.so.40 (0x00007fca2967c000)
    libopen-pal.so.40 => /apps/openmpi-mofed5.1-pbs2021.1/4.0.5/lib/libopen-pal.so.40 (0x00007fca29371000)
    libudev.so.1 => /lib64/libudev.so.1 (0x00007fca290db000)
    libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00007fca28ed1000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007fca28ccd000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fca28ab6000)
    libmount.so.1 => /lib64/libmount.so.1 (0x00007fca2885c000)
    libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fca28609000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fca28401000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fca281d7000)
    libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007fca27f53000)

@josephjohnjj
Author

This is the error output generated by PaRSEC:

W@00000 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00002 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00005 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00004 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00003 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00001 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00007 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00006 binding core #2000 not valid (must be between 0 and 47 (nb_core-1)
W@00000 Couldn't bind to cpuset 0x0
W@00007 Couldn't bind to cpuset 0x0
W@00005 Couldn't bind to cpuset 0x0
W@00006 Couldn't bind to cpuset 0x0
W@00003 Couldn't bind to cpuset 0x0
W@00002 Couldn't bind to cpuset 0x0
W@00004 Couldn't bind to cpuset 0x0
W@00001 Couldn't bind to cpuset 0x0
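
All eight warnings name the same core, #2000, which is also the value given to -b on the mpirun command lines earlier in this thread. Whether PaRSEC is picking up that application argument as a binding request is not confirmed here, but the coincidence is easy to check on longer logs by tallying the requested core numbers. The following is a minimal sketch that only assumes the warning text shown above; the function name summarize_bind_warnings is made up for illustration.

```shell
#!/bin/bash
# summarize_bind_warnings (hypothetical helper): read a log on stdin and print
# each core number that appears in a "binding core #N" warning, followed by
# how many times it was reported.
summarize_bind_warnings() {
    grep -o 'binding core #[0-9][0-9]*' \
        | awk -F'#' '{ count[$2]++ } END { for (c in count) print c, count[c] }' \
        | sort -n
}
```

Fed the log above (for example with summarize_bind_warnings < parsec.log, where the file name is an assumption), it prints one line per distinct core number, so a single line whose first field matches a command-line argument stands out immediately.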
