valgrind reports a ton of 'Uninitialised byte(s) found during client check request' #11

Open
marehr opened this issue May 13, 2018 · 2 comments

@marehr
Contributor

marehr commented May 13, 2018

Look at ./inner_product_mpi

> mpirun -n 1 ./inner_product_mpi : -n 1 valgrind ./inner_product_mpi : -n 2 ./inner_product_mpi 
==28739== Memcheck, a memory error detector
==28739== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28739== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==28739== Command: ./inner_product_mpi
==28739== 
==28739== Uninitialised byte(s) found during client check request
==28739==    at 0x6814E31: ??? (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==28739==    by 0x533310E: PMPI_Allgather (in /usr/lib/openmpi/libmpi.so.40.10.0)
==28739==    by 0x152777: ham::net::communicator::communicator(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x152016: ham::offload::runtime::runtime(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x15FC68: ham::offload::ham_main(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x14801E: main (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==  Address 0x1ffefff5e6 is on thread 1's stack
==28739==  in frame #2, created by ham::net::communicator::communicator(int, char**) (???:)
==28739== 
Using target node 1 with hostname t470p
==28739== Uninitialised byte(s) found during client check request
==28739==    at 0x6814E31: ??? (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==28739==    by 0x5365F2C: PMPI_Send (in /usr/lib/openmpi/libmpi.so.40.10.0)
==28739==    by 0x150F7C: void ham::net::communicator::request::send_result<void>(void*, unsigned long) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x151462: ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct>::operator()() (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x150D96: ham::msg::execution_policy_direct<ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct> >::handler(void*) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x152B40: ham::msg::active_msg_base::operator()(void*) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x1521ED: ham::offload::runtime::run_receive() (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x15FCA5: ham::offload::ham_main(int, char**) (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==    by 0x14801E: main (in /home/marehr/develope/ham/build/inner_product_mpi)
==28739==  Address 0x1ffefff6cc is on thread 1's stack
==28739==  in frame #3, created by ham::offload::detail::offload_result_msg<ham::new_buffer<double>, ham::msg::execution_policy_direct>::operator()() (???:)
==28739== 
Result: 1.78957e+08
==28739== 
==28739== HEAP SUMMARY:
==28739==     in use at exit: 40,241 bytes in 423 blocks
==28739==   total heap usage: 20,553 allocs, 20,130 frees, 4,370,387 bytes allocated
==28739== 
==28739== LEAK SUMMARY:
==28739==    definitely lost: 12,592 bytes in 153 blocks
==28739==    indirectly lost: 8,657 bytes in 215 blocks
==28739==      possibly lost: 0 bytes in 0 blocks
==28739==    still reachable: 18,992 bytes in 55 blocks
==28739==         suppressed: 0 bytes in 0 blocks
==28739== Rerun with --leak-check=full to see details of leaked memory
==28739== 
==28739== For counts of detected and suppressed errors, rerun with: -v
==28739== Use --track-origins=yes to see where uninitialised values come from
==28739== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)

If you look closely, every report originates inside /usr/lib/openmpi/libopen-pal.so. I want to make sure that we are calling MPI correctly and that the reported problems are really caused by /usr/lib/openmpi/libopen-pal.so rather than by our own code.
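For illustration only, here is a minimal sketch of one common way such a report can arise on a stack object that is later handed to a byte-wise collective such as MPI_Allgather: a fixed-size character array that is only partly written. This is an assumption about a typical pattern, not taken from the HAM sources; `node_record`, `kNameLen` and `make_record` are hypothetical names. Defining every byte before the transfer is one way to rule out the calling side as the origin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-size record; the name and size are assumptions,
// not taken from the HAM sources.
constexpr std::size_t kNameLen = 64;

struct node_record {
    char hostname[kNameLen];
};

// Builds a record in which every byte is defined, so that a byte-wise
// transfer of the whole array does not expose undefined tail bytes.
node_record make_record(const char* name) {
    assert(name != nullptr);
    node_record rec;
    std::memset(&rec, 0, sizeof rec);  // define all bytes, including the tail
    // Copy at most kNameLen - 1 characters so the terminator always fits.
    const std::size_t len = std::min(std::strlen(name), kNameLen - 1);
    std::memcpy(rec.hostname, name, len);
    return rec;
}
```

Calling `make_record("t470p")` yields a record whose bytes after the name are all zero. If Memcheck still reports the same context with fully defined input, that points towards the library rather than the caller.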

I created a Valgrind suppression file that silences those warnings (see attachment). With it, the output looks like this:

==28925== Memcheck, a memory error detector
==28925== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28925== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==28925== Command: ./inner_product_mpi
==28925== 
Using target node 1 with hostname t470p
Result: 1.78957e+08
==28925== 
==28925== HEAP SUMMARY:
==28925==     in use at exit: 40,276 bytes in 423 blocks
==28925==   total heap usage: 20,553 allocs, 20,130 frees, 4,370,422 bytes allocated
==28925== 
==28925== LEAK SUMMARY:
==28925==    definitely lost: 12,592 bytes in 153 blocks
==28925==    indirectly lost: 8,657 bytes in 215 blocks
==28925==      possibly lost: 0 bytes in 0 blocks
==28925==    still reachable: 19,027 bytes in 55 blocks
==28925==         suppressed: 0 bytes in 0 blocks
==28925== Rerun with --leak-check=full to see details of leaked memory
==28925== 
==28925== For counts of detected and suppressed errors, rerun with: -v
==28925== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 2)

@marehr
Contributor Author

marehr commented May 25, 2018

==23332== Syscall param process_vm_readv(lvec[...]) points to unaddressable byte(s)
==23332==    at 0x601235A: process_vm_readv (in /usr/lib/libc-2.27.so)
==23332==    by 0xE54EE93: mca_btl_vader_get_cma (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0xEF68E0F: mca_pml_ob1_recv_request_get_frag (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF692CB: mca_pml_ob1_recv_request_progress_rget (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF644F9: ??? (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xEF64773: ??? (in /usr/lib/openmpi/openmpi/mca_pml_ob1.so)
==23332==    by 0xE54D0BE: mca_btl_vader_poll_handle_frag (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0xE54D404: ??? (in /usr/lib/openmpi/openmpi/mca_btl_vader.so)
==23332==    by 0x67BE6BB: opal_progress (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==23332==    by 0x67C5295: ompi_sync_wait_mt (in /usr/lib/openmpi/libopen-pal.so.40.10.0)
==23332==    by 0x5320DFA: ompi_request_default_wait_all (in /usr/lib/openmpi/libmpi.so.40.10.0)
==23332==    by 0x536E5EE: PMPI_Waitall (in /usr/lib/openmpi/libmpi.so.40.10.0)
==23332==  Address 0xe9c9900 is 0 bytes inside a block of size 1,048,576 alloc'd
==23332==    at 0x4C2F246: memalign (vg_replace_malloc.c:857)
==23332==    by 0x4C2F361: posix_memalign (vg_replace_malloc.c:1020)
==23332==    by 0x1554B2: local_allocate(unsigned long) (benchmark_ham_offload.cpp:87)
==23332==    by 0x155B9C: ham_user_main(int, char**) (benchmark_ham_offload.cpp:181)
==23332==    by 0x17295B: ham::offload::runtime::run_main(int, char**) (runtime.cpp:38)
==23332==    by 0x176D88: ham::offload::ham_main(int, char**) (main.cpp:37)
==23332==    by 0x15554D: main (benchmark_ham_offload.cpp:101)
==23332== 

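The report above is a different one: it comes from the process_vm_readv syscall while running benchmark_ham_offload.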

@noma
Owner

noma commented Jul 3, 2018

The uninitialised bytes are probably a result of the difference between the internal buffer size and the size of the buffers that are actually allocated (one page by default). It wouldn't make much sense to initialise the whole buffer just to make Valgrind happy.
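A minimal sketch of that size mismatch, under stated assumptions: `msg_buffer`, `write_payload`, `bytes_to_send` and `undefined_tail` are hypothetical names, the 4096-byte page size is assumed, and none of this reflects HAM's actual buffer layout. The buffer is allocated at page size, but only the payload prefix is ever written, so passing the full capacity as the send count would hand Memcheck the undefined tail, while passing the used size would not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Assumed page size; the real value depends on the platform.
constexpr std::size_t kPageSize = 4096;

// Hypothetical message buffer: one page of storage, of which only the
// first `used` bytes have been written.
struct msg_buffer {
    // new[] without initialiser leaves the bytes indeterminate on purpose.
    std::unique_ptr<unsigned char[]> storage{new unsigned char[kPageSize]};
    std::size_t used = 0;
};

// Copies the payload to the front of the buffer and records its size.
void write_payload(msg_buffer& buf, const void* data, std::size_t n) {
    assert(n <= kPageSize);
    std::memcpy(buf.storage.get(), data, n);
    buf.used = n;
}

// Number of bytes that are defined and therefore safe to transmit.
std::size_t bytes_to_send(const msg_buffer& buf) { return buf.used; }

// Number of trailing bytes that were never written.
std::size_t undefined_tail(const msg_buffer& buf) { return kPageSize - buf.used; }
```

For a single `double` result, `bytes_to_send` is `sizeof(double)` and the remaining bytes of the page stay untouched, which avoids the cost of initialising the whole buffer.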

It could also be something internal to Open MPI. If you can, try MPICH and see whether things change.

The unaddressable byte(s) reported for the process_vm_readv syscall seem to be a false positive or, if not, out of my control.

I think that when dealing with network buffers and DMA accesses, plenty of these errors are to be expected.
