DNM: pybind/mgr: add smart mgr module #1

Open · wants to merge 5 commits into wip-smartctl-pull from wip-mgr-module-smart

Conversation

@yaarith (Owner) commented Jan 18, 2018

This mgr module contains two commands:

  • 'osd smart get <osd.id>' # queries a specific OSD for SMART data of all of its disks.
  • 'osd smart get all' # queries all OSDs in the mgr's osd_map for SMART data.

The mgr invokes a 'tell' command on the OSD side, which scrapes it for SMART metrics.

Signed-off-by: Yaarit Hatuka [email protected]
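
For context, a minimal sketch of how a mgr module can wire up such commands and fan the work out via 'tell' (the COMMANDS table format and send_command() follow the mgr module API of that era; the handler body and descriptions are illustrative, not the actual patch):

```python
import json

from mgr_module import MgrModule, CommandResult


class Module(MgrModule):
    COMMANDS = [
        {
            "cmd": "osd smart get name=osd_id,type=CephString",
            "desc": "Get SMART data from the disks of a given OSD",
            "perm": "r",
        },
    ]

    def handle_command(self, cmd):
        osd_id = cmd['osd_id']
        # Invoke a 'tell' command on the OSD itself, which scrapes its
        # devices for SMART metrics and returns them as JSON.
        result = CommandResult('')
        self.send_command(result, 'osd', osd_id,
                          json.dumps({'prefix': 'smart', 'format': 'json'}), '')
        r, outb, outs = result.wait()
        return r, outb, outs
```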

Yaarit Hatuka added 2 commits January 16, 2018 15:11
Also added 'ceph daemon osd.id list_devices', which prints the OSD's devices
to stdout. 'ceph [tell|daemon] osd.id smart' probes the OSD's devices for
SMART data and prints it to stdout in JSON format. It assumes smartctl's
'--json' feature exists.

Signed-off-by: Yaarit Hatuka <[email protected]>
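
The OSD-side scraping itself lives in C++, but conceptually it boils down to something like this (a sketch; the device path is a placeholder, and it assumes a smartctl build that already supports '--json'):

```python
import json
import subprocess

def scrape_smart(device):
    # Requires smartctl's '--json' output mode, as the commit notes.
    out = subprocess.check_output(['smartctl', '--json', '-a', device])
    return json.loads(out)

# e.g.: scrape_smart('/dev/sda')
```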
This mgr module contains two commands:

- 'osd smart get <osd.id>' # queries a specific OSD for SMART data of all of its disks.
- 'osd smart get all' # queries all OSDs in the mgr's osd_map for SMART data.

The mgr invokes a 'tell' command on the OSD side, which scrapes it for SMART metrics.

Signed-off-by: Yaarit Hatuka <[email protected]>
"perm": "r"
},
{
"cmd": "osd smart scrape all",

scrape-all

@yaarith (Owner, Author):

changed

"perm": "r"
},
{
"cmd": "osd smart get all",

osd smart dump ? or get-all

@yaarith (Owner, Author):

changed to 'smart dump'

r, outb, outs = result.wait()
smart_data = json.loads(outb)

osd_host = self.get_metadata('osd', osd_id)['hostname']

    osd_host = self.get_metadata('osd', osd_id).get('hostname')
    if not osd_host:
        return

@yaarith (Owner, Author):

added; should we throw an exception?

Better to just log something (at debug level, probably?) and continue. Remember this is going to loop over 100s or 1000s of devices, so we can't just stop when we hit one that looks weird.
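
In code, that suggestion amounts to roughly this (sketch only):

```python
osd_host = self.get_metadata('osd', osd_id).get('hostname')
if not osd_host:
    # One odd OSD shouldn't abort a scrape over hundreds of devices;
    # note it at debug level and move on.
    self.log.debug('no hostname in metadata for osd.%s, skipping' % osd_id)
    return
```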

now = str(time.time())
ioctx.set_omap(op, (now,), (str(json.dumps(device_smart_data)),))
ioctx.operate_write_op(op, object_name)
self.log.debug('writing object %s, key: %s, value: %s' % (object_name, now, str(json.dumps(device_smart_data))))

can you confirm that it is writing key/value data with

rados -p smart_data ls -
rados -p smart_data listomapvals $object_name

I would make sure the date looks like "YYYYMMDD-HHMMSS".

Also, it might make sense to have a single record for the entire smart scrape, and the value be the blob of json.

@yaarith (Owner, Author):

  1. "can you confirm that it is writing key/value data":
     • yes, here's a partial output: https://pastebin.com/ds5CNuHP
       All the examples I found used ioctx.set_omap() and ioctx.operate_write_op() to write objects; should I add op.create() regardless?
  2. "I would make sure the date looks like 'YYYYMMDD-HHMMSS'":
     • done
  3. "it might make sense to have a single record for the entire smart scrape":
     • not sure I understand: right now object names are $host:device, and for each object, the omap key:val pairs are $date:$json_blob_of_that_device.
       Should we add another object "$host" with omap key:val of $date:$json_blob_all_host_devices?

Re 3: What you have is right--I just misread the code :)

Re 1: I would expect it to be needed, or else the object won't get created.. but maybe I'm wrong? Try deleting the pool and doing a scrape, and see if the object for the host:device gets recreated.

with rados.WriteOpCtx() as op:
# TODO should we change second arg to be byte?
# TODO what about the flag in ioctx.operate_write_op()?
now = str(time.time())

I think you need an

op.create()

here too to make sure the object exists.
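
Putting the review suggestions together, the write path might end up looking roughly like this (a sketch against the rados Python bindings; WriteOp.new() is assumed here to be the binding's counterpart of the C++ op.create()):

```python
import json
import time

import rados

def store_smart_record(ioctx, object_name, device_smart_data):
    with rados.WriteOpCtx() as op:
        # Ensure the object exists even on a freshly created pool
        # (assumed binding-level counterpart of the C++ op.create()).
        op.new(rados.LIBRADOS_CREATE_IDEMPOTENT)
        # Key format per the review: "YYYYMMDD-HHMMSS" rather than time.time().
        now = time.strftime('%Y%m%d-%H%M%S')
        ioctx.set_omap(op, (now,), (json.dumps(device_smart_data),))
        ioctx.operate_write_op(op, object_name)
```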

# looking for the right pool
# TODO open pool, and if it fails - create and open
pools = self.rados.list_pools()
is_pool = False

pool_name = get_config('pool_name') or 'smart_data'
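
That, combined with the open-or-create TODO above, could collapse to something like this (sketch; self.rados is the mgr module's cluster handle, and get_config() is the accessor the comment references):

```python
import rados

pool_name = self.get_config('pool_name') or 'smart_data'
try:
    ioctx = self.rados.open_ioctx(pool_name)
except rados.ObjectNotFound:
    # The pool doesn't exist yet: create it, then open it.
    self.rados.create_pool(pool_name)
    ioctx = self.rados.open_ioctx(pool_name)
```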

@yaarith force-pushed the wip-mgr-module-smart branch from 5b76950 to b35f62b on February 28, 2018 19:51
yaarith pushed a commit that referenced this pull request Jan 27, 2020
otherwise ODR is violated:

==449025==ERROR: AddressSanitizer: odr-violation (0x000000f03700):
  [1] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
  [2] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
These globals were registered at these points:
  [1]:
    #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
    #1 0x56e9cb in asan.module_ctor (/var/ssd/ceph/clang-build/bin/ceph-conf+0x56e9cb)

  [2]:
    #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
    #1 0x7fe5fed12aeb in asan.module_ctor (/var/ssd/ceph/clang-build/lib/libceph-common.so.2+0x2f34aeb)

==449025==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0

Signed-off-by: Kefu Chai <[email protected]>
liewegas pushed a commit that referenced this pull request Feb 19, 2020
According to cppreference.com [1]:

  "If multiple threads of execution access the same std::shared_ptr
  object without synchronization and any of those accesses uses
  a non-const member function of shared_ptr then a data race will
  occur (...)"

[1]: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic

One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap`
with healthy-looking content but a damaged control block:

  ```
  [Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
  (gdb) bt
  #0  0x0000559cb81c3ea0 in ?? ()
  #1  0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
  #2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
  #3  0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
  #4  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
  #5  std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
  #6  OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
  #7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
      at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
  #8  0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
  #9  0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
  #10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
  #11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
  #12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
  #13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
  #14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
  #15 0x0000000000000000 in ?? ()
  (gdb) frame 7
  #7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
      at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
  9665      in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
  (gdb) print osdmap
  $24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
  (gdb) print *osdmap
     # pretty sane OSDMap
  (gdb) print sizeof(osdmap)
  $26 = 16
  (gdb) x/2a &osdmap
  0x559ca22acef0:   0x559cba028000  0x559cba0ec900

  (gdb) frame 2
  #2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
  148       /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
  (gdb) disassemble
  Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
  ...
     0x0000559c97675b1e <+62>:      mov    (%rdi),%rax
     0x0000559c97675b21 <+65>:      mov    %rdi,%rbx
     0x0000559c97675b24 <+68>:      callq  *0x10(%rax)
  => 0x0000559c97675b27 <+71>:      test   %rbp,%rbp
  ...
  End of assembler dump.
  (gdb) info registers rdi rbx rax
  rdi            0x559cba0ec900      94131624790272
  rbx            0x559cba0ec900      94131624790272
  rax            0x559cba0ec8a0      94131624790176
  (gdb) x/a 0x559cba0ec8a0 + 0x10
  0x559cba0ec8b0:   0x559cb81c3ea0
  (gdb) bt
  #0  0x0000559cb81c3ea0 in ?? ()
  ...
  (gdb) p $_siginfo._sifields._sigfault.si_addr
  $27 = (void *) 0x559cb81c3ea0
  ```

Helgrind seems to agree:
  ```
  ==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
  ==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
  ==00:00:02:54.519 510301==    at 0x7218DD: operator= (shared_ptr_base.h:1078)
  ==00:00:02:54.519 510301==    by 0x7218DD: operator= (shared_ptr.h:103)
  ==00:00:02:54.519 510301==    by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
  ==00:00:02:54.519 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
  ==00:00:02:54.519 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
  ==00:00:02:54.519 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
  ==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
  ==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
  ==00:00:02:54.519 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
  ==00:00:02:54.519 510301==
  ==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
  ==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
  ==00:00:02:54.519 510301==    at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
  ==00:00:02:54.519 510301==    by 0x6B5842: shared_ptr (shared_ptr.h:129)
  ==00:00:02:54.519 510301==    by 0x6B5842: get_osdmap (OSD.h:1700)
  ==00:00:02:54.519 510301==    by 0x6B5842: OSD::create_context() (OSD.cc:9053)
  ==00:00:02:54.519 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
  ==00:00:02:54.519 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
  ==00:00:02:54.519 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
  ==00:00:02:54.519 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
  ==00:00:02:54.519 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
  ==00:00:02:54.519 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
  ==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
  ==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
  ==00:00:02:54.519 510301==  Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
  ==00:00:02:54.519 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
  ==00:00:02:54.519 510301==    by 0x66F766: main (ceph_osd.cc:688)
  ==00:00:02:54.519 510301==  Block was alloc'd by thread #1
  ```

Actually there are plenty of similar issues reported, like:
  ```
  ==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
  ==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
  ==00:00:05:04.903 510301==    at 0x753165: clear (hashtable.h:2051)
  ==00:00:05:04.903 510301==    by 0x753165: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t>
  >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__deta
  il::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
  ==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
  ==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
  ==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
  ==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr
  _base.h:471)
  ==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
  ==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
  ==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
  ==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
  ==00:00:05:04.903 510301==    by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
  ==00:00:05:04.903 510301==    by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
  ==00:00:05:04.903 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
  ==00:00:05:04.903 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
  ==00:00:05:04.903 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
  ==00:00:05:04.903 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
  ==00:00:05:04.903 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
  ==00:00:05:04.903 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
  ==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
  ==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
  ==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
  ==00:00:05:04.903 510301==
  ==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
  ==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
  ==00:00:05:04.903 510301==    at 0x7531E1: clear (hashtable.h:2054)
  ==00:00:05:04.903 510301==    by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
  ==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
  ==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
  ==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
  ==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
  ==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
  ==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
  ==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:747)
  ==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:1078)
  ==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr.h:103)
  ==00:00:05:04.903 510301==    by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
  ==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
  ==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
  ==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
  ==00:00:05:04.903 510301==  Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
  ==00:00:05:04.903 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
  ==00:00:05:04.903 510301==    by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
  ==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:699)
  ==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:1732)
  ==00:00:05:04.903 510301==    by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
  ==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
  ==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
  ==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
  ==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
  ==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
  ==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
  ```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Sep 22, 2020
This fixes the following selinux error when using ceph-iscsi's
rbd-target-api daemon (rbd-target-gw has the same issue). The errors are
a result of a Python library, rtslib, which the daemons use.

Additional Information:
Source Context                system_u:system_r:ceph_t:s0
Target Context                system_u:object_r:configfs_t:s0
Target Objects
/sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication [ file ]
Source                        rbd-target-api
Source Path                   /usr/libexec/platform-python3.6
Port                          <Unknown>
Host                          ans8
Source RPM Packages           platform-python-3.6.8-15.1.el8.x86_64
Target RPM Packages
Policy RPM                    selinux-policy-3.14.3-20.el8.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     ans8
Platform                      Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64
Alert Count                   1
First Seen                    2020-01-08 18:39:47 EST
Last Seen                     2020-01-08 18:39:47 EST
Local ID                      6f8c3415-7a50-4dc8-b3d2-2621e1d00ca3

Raw Audit Messages
type=AVC msg=audit(1578526787.577:68): avc:  denied  { ioctl } for
pid=995 comm="rbd-target-api"
path="/sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication"
dev="configfs" ino=25703 ioctlcmd=0x5401
scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:configfs_t:s0 tclass=file permissive=1

type=SYSCALL msg=audit(1578526787.577:68): arch=x86_64 syscall=ioctl
success=no exit=ENOTTY a0=34 a1=5401 a2=7ffd4f8f1f60 a3=3052cd2d95839b96
items=0 ppid=1 pid=995 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0
egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=rbd-target-api
exe=/usr/libexec/platform-python3.6 subj=system_u:system_r:ceph_t:s0
key=(null)

Hash: rbd-target-api,ceph_t,configfs_t,file,ioctl

Signed-off-by: Mike Christie <[email protected]>
(cherry picked from commit 8187235)
yaarith pushed a commit that referenced this pull request Sep 22, 2020
This fixes selinux errors like the following for /etc/target

-----------------------------------
Additional Information:
Source Context                system_u:system_r:ceph_t:s0
Target Context                system_u:object_r:targetd_etc_rw_t:s0
Target Objects                target [ dir ]
Source                        rbd-target-api
Source Path                   rbd-target-api
Port                          <Unknown>
Host                          ans8
Source RPM Packages
Target RPM Packages
Policy RPM                    selinux-policy-3.14.3-20.el8.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Enforcing
Host Name                     ans8
Platform                      Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64
Alert Count                   1
First Seen                    2020-01-08 18:39:48 EST
Last Seen                     2020-01-08 18:39:48 EST
Local ID                      9a13ee18-eaf2-4f2a-872f-2809ee4928f6

Raw Audit Messages
type=AVC msg=audit(1578526788.148:69): avc:  denied  { search } for
pid=995 comm="rbd-target-api" name="target" dev="sda1" ino=52198
scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:targetd_etc_rw_t:s0 tclass=dir permissive=1

Hash: rbd-target-api,ceph_t,targetd_etc_rw_t,dir,search

which are a result of the rtslib library the ceph-iscsi daemons use
accessing /etc/target to read/write a file which stores metadata the
target uses.

Signed-off-by: Mike Christie <[email protected]>
(cherry picked from commit 53be181)

Conflicts:
	selinux/ceph.te: trivial resolution
yaarith pushed a commit that referenced this pull request Oct 4, 2021
It's perfectly legal for a client to reconnect to a particular `Watch`
using a different socket / `Connection` than the original one. Handling
this properly includes the watch timer, which is currently broken:
when reconnecting, we don't cancel the timer. This led to the
following crash at Sepia:

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-09-02_07:44:51-rados-master-distro-basic-smithi/6372357$ less ./remote/smithi183/log/ceph-osd.4.log.gz
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - client_request(id=12, detail=m=[osd_op(client.5087.0:93 7.1e 7:7c7084bd:::repobj:head {watch reconnect cookie 94478891024832 gen 1} snapc 0={} ondisk+write+know
n_if_redirected e40) v8]): got obc lock
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch
INFO  2021-09-02 08:10:45,462 [shard 0] osd - found existing watch by client.5087
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch_subop_watch
INFO  2021-09-02 08:10:45,462 [shard 0] osd - found existing watch watch(cookie 94478891024832 30s 172.21.15.150:0/3544196211) by client.5087
...
INFO  2021-09-02 08:10:45,462 [shard 0] osd - op_effect: found existing watcher: 94478891024832,client.5087
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7406-g9d30203c/rpm/el8/BUILD/ceph-
17.0.0-7406-g9d30203c/src/seastar/include/seastar/core/timer.hh:95: void seastar::timer<Clock>::arm_state(seastar::timer<Clock>::time_point, std::optional<typename Clock::duration>) [with Clock = seastar::l
owres_clock; seastar::timer<Clock>::time_point = std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > >; typename Clock::duration = std::chrono::duration<long
 int, std::ratio<1, 1000> >]: Assertion `!_armed' failed.
Aborting on shard 0.
Backtrace:
 0# 0x000055CC052CF0B6 in ceph-osd
 1# FatalSignal::signaled(int, siginfo_t const&) in ceph-osd
 2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
 3# 0x00007FA58349FB20 in /lib64/libpthread.so.0
 4# gsignal in /lib64/libc.so.6
 5# abort in /lib64/libc.so.6
 6# 0x00007FA581A98C89 in /lib64/libc.so.6
 7# 0x00007FA581AA6A76 in /lib64/libc.so.6
 8# 0x000055CC0BEEE9DD in ceph-osd
 9# crimson::osd::Watch::connect(seastar::shared_ptr<crimson::net::Connection>, bool) in ceph-osd
10# 0x000055CC00B1D246 in ceph-osd
11# 0x000055CBFFEF01AE in ceph-osd
...
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Oct 4, 2021
`seastar::engine()` is available only for Seastar's threads;
it shouldn't be called outside of a reactor thread.
Unfortunately, this assumption is violated in `AlienStore`
where `OnCommit::finish()`, executed from a finisher thread
of `BlueStore`, calls `alien()` on `seastar::engine()`.
The net effect is crashes like the following one:

```
INFO  2021-09-22 14:26:33,214 [shard 0] osd - operator() writing superblock cluster_fsid 1d8f7908-2ebf-4a91-ae70-f445668c126b osd_fsid 4da9fe9a-1da5-4ea9-aa79-a1178165ede5         [381/1839]
Segmentation fault.
Backtrace:
 0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
 1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:
62
 3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
 4# (anonymous namespace)::OnCommit::finish(int) at /home/rzarzynski/ceph1/build/../src/crimson/os/alienstore/alien_store.cc:53
 5# Context::complete(int) at /home/rzarzynski/ceph1/build/../src/include/Context.h:100
 6# Finisher::finisher_thread_entry() at /home/rzarzynski/ceph1/build/../src/common/Finisher.cc:65
 7# 0x00007F16BBA0915A in /lib64/libpthread.so.0
 8# clone in /lib64/libc.so.6
Dump of siginfo:
  ...
  si_addr: 0x10
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Oct 4, 2021
It's all about these items:

```
 0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
 1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:
62
 3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
```

They are part of our backtrace handling and typically developers
are not interested in them. Let's be consistent with the classical
OSD and hide them.

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Nov 1, 2021
`PG::request_{local,remote}_recovery_reservation()` dynamically allocates
up to 2 instances of `LambdaContext<T>` and transfers their ownership to
the `AsyncReserver<T, F>`. This is expressed in raw-pointer (`new` and
`delete`) notation. Further analysis shows the only place where `delete`
for these objects is called is `AsyncReserver::cancel_reservation()`.
In contrast to the classical OSD, crimson doesn't invoke the method when
stopping a PG during the shutdown sequence. This would explain the
following ASan issue observed at Sepia:

```
Direct leak of 576 byte(s) in 24 object(s) allocated from:
    #0 0x7fa108fc57b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
    #1 0x55723d8b0b56 in non-virtual thunk to crimson::osd::PG::request_local_background_io_reservation(unsigned int, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >) (/usr/bin/ceph-osd+0x24d95b56)
    #2 0x55723f1f66ef in PeeringState::WaitDeleteReserved::WaitDeleteReserved(boost::statechart::state<PeeringState::WaitDeleteReserved, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) (/usr/bin/ceph-osd+0x266db6ef)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Nov 1, 2021
exit() calls pthread_cond_destroy(); attempting to destroy dpdk::eal::cond
while other threads are currently blocked on it results in undefined
behavior. Tests linking against different libc versions show that with
libc-2.17 the process can exit, while with libc-2.27 it will deadlock;
the call stack is as follows:

Thread 3 (Thread 0xffff7e5749f0 (LWP 62213)):
 #0  0x0000ffff7f3c422c in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0xaaaadc0e30f4 <dpdk::eal::cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>, cond=0xaaaadc0e30c8 <dpdk::eal::cond>)
    at pthread_cond_wait.c:502
 #2  __pthread_cond_wait (cond=0xaaaadc0e30c8 <dpdk::eal::cond>, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>)
    at pthread_cond_wait.c:655
 #3  0x0000ffff7f1f1f80 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
    from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
 #4  0x0000aaaad37f5078 in dpdk::eal::<lambda()>::operator()(void) const (__closure=<optimized out>, __closure=<optimized out>)
     at ./src/msg/async/dpdk/dpdk_rte.cc:136
 #5  0x0000ffff7f1f7ed4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
 #6  0x0000ffff7f3be088 in start_thread (arg=0xffffe73e197f) at pthread_create.c:463
 #7  0x0000ffff7efc74ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffff7ee3b010 (LWP 62200)):
 #0  0x0000ffff7f3c3c38 in futex_wait (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
 #1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
    at ../sysdeps/nptl/futex-internal.h:135
 #2  __pthread_cond_destroy (cond=0xaaaadc0e30c8 <dpdk::eal::cond>) at pthread_cond_destroy.c:54
 #3  0x0000ffff7ef2be34 in __run_exit_handlers (status=-6, listp=0xffff7f04a5a0 <__exit_funcs>, run_list_atexit=255,
    run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
 #4  0x0000ffff7ef2bf6c in __GI_exit (status=<optimized out>) at exit.c:139
 #5  0x0000ffff7ef176e4 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>,
     rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:344
 #6  0x0000aaaad2939db0 in _start () at ./src/include/buffer.h:642

Fixes: https://tracker.ceph.com/issues/42890
Signed-off-by: Chunsong Feng <[email protected]>
Signed-off-by: luo rixin <[email protected]>
yaarith pushed a commit that referenced this pull request Dec 8, 2021
The patch is supposed to fix the following problems (captured with extra
debug output onboard):

```
INFO  2021-11-16 01:18:38,713 [shard 0] osd - ~OSD: OSD dtor called
INFO  2021-11-16 01:18:38,713 [shard 0] osd - Heartbeat::Peer: osd.6 removed
INFO  2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.5 removed
INFO  2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.2 removed
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ShardServices: ShardServices dtor called
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: ShardServices dtor called; unref_size=3, size=3
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000115380
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000114980
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000112680
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000114980
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000115380
INFO  2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000112680
INFO  2021-11-16 01:18:38,738 [shard 0] osd - crimson shutdown complete

=================================================================
==33351==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2808 byte(s) in 3 object(s) allocated from:
    #0 0x7fe10c0327b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
    #1 0x55accbe8ffc4 in ceph::common::intrusive_lru<ceph::common::intrusive_lru_config<hobject_t, crimson::osd::ObjectContext, crimson::osd::obc_to_hoid<crimson::osd::ObjectContext> > >::get_or_create(hobject_t const&) (/usr/bin/ceph-osd+0x3b000fc4)

Objects leaked above:
0x619000112680 (936 bytes)
0x619000114980 (936 bytes)
0x619000115380 (936 bytes)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request Dec 8, 2021
Initially, we were assuming that no pointer obtained from SharedLRU
could outlive the LRU itself. However, since switching to the interruption
concept for handling shutdowns, this is no longer valid.

The patch is supposed to deal with crashes like the following one:

```
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8898-ge57ad63c/rpm/el8/BUILD/ceph-17.0.
0-8898-ge57ad63c/src/crimson/common/shared_lru.h:46: SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = OSDMap]: Assertion `weak_refs.empty()' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 1162 ms on shard 0. Backtrace: 0xb14ab 0x46e57428 0x46bc450d 0x46be03bd 0x46be0782 0x46be0946 0x46be0bf6 0x12b1f 0xc8e3b 0x3fdd77e2 0x3fddccdb 0x3fdde1ee 0x3fdde8b3 0x3fdd3f2b 0x3fdd4442 0x3f
dd4c3a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3a5ae1b9 0x3a38c5e2 0x3a0c823d 0x3a1771f1 0x3a1796f5 0x46ff92c9 0x46ff9525 0x46ff9e93 0x46ff8eae 0x46ff8bd9 0x3a160e67 0x39f50c83 0x39f51cd0 0x46b96271 0x46bde51a
 0x46d6891b 0x46d6a8f0 0x4681a7d2 0x4681f03b 0x39fd50f2 0x23492 0x39b7a7dd
 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
 2# 0x00007F9535E04C89 in /lib64/libc.so.6
 3# 0x00007F9535E12A76 in /lib64/libc.so.6
 4# crimson::osd::OSD::~OSD() in ceph-osd
 5# seastar::shared_ptr_count_for<crimson::osd::OSD>::~shared_ptr_count_for() in ceph-osd
 6# seastar::shared_ptr<crimson::osd::OSD>::~shared_ptr() in ceph-osd
 7# seastar::futurize<std::result_of<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) co
nst::{lambda()#1} ()>::type>::type seastar::smp::submit_to<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::opera
tor()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{la
mbda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) in ceph-osd
 8# std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>
::_M_invoke(std::_Any_data const&, unsigned int&&) in ceph-osd
 9# 0x0000562DA18162CA in ceph-osd
10# 0x0000562DA1816526 in ceph-osd
11# 0x0000562DA1816E94 in ceph-osd
12# 0x0000562DA1815EAF in ceph-osd
13# 0x0000562DA1815BDA in ceph-osd
14# seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>::direct_vtable_for<seastar::future<void>::then_wrapped_maybe_erase<true, seastar::future<void>, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}>(seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}&&)::{lambda(seastar::future<void>&&)#1}>::call(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> const*, seastar::future<void>&&) in ceph-osd
15# 0x0000562D9476DC84 in ceph-osd
16# 0x0000562D9476ECD1 in ceph-osd
17# 0x0000562DA13B3272 in ceph-osd
18# 0x0000562DA13FB51B in ceph-osd
19# 0x0000562DA158591C in ceph-osd
20# 0x0000562DA15878F1 in ceph-osd
21# 0x0000562DA10377D3 in ceph-osd
22# 0x0000562DA103C03C in ceph-osd
23# main in ceph-osd
24# __libc_start_main in /lib64/libc.so.6
25# _start in ceph-osd
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

yaarith pushed a commit that referenced this pull request Jun 1, 2022
The problem is:

```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    #1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    #2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    #3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    #4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    #5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    #6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    #7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    #8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    #9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    #10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    #11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    #12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    #13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    #14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    #15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    #16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    #17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    #18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    #19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    #20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    #21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    #22 0x7f6de558de39  (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    #1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    #2 0x7fff3214bc4f  ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
yaarith pushed a commit that referenced this pull request May 23, 2023
Improvement #1:

CapTester.write_test_files() not only creates the test file but also
does the following for every mount object it receives in its parameters -

* carefully produces the path for the test file as per parameters
  received
* generates the unique data for each test file on a CephFS mount
* creates a data structure -- a list of lists -- that holds all this
  information, along with the mount object itself, for each mount object,
  so that tests can be conducted at a later point

Untangle this mess of code by splitting this method into 3 separate
methods -

1. To produce the path for test file (as per user's need).
2. To generate the data that will be written into the test file.
3. To actually create the test file on CephFS.

Improvement #2:

Remove the internal data structure used for testing -- self.test_set --
and use separate class attributes to store all the data required for
testing instead of a tuple. This serves two purposes -

One, it makes it easy to manipulate all this data from helper methods
and during debugging sessions, especially while using a PDB session.

And two, it makes it impossible to have multiple mounts/multiple "test sets"
within the same CapTester instance, for the sake of simplicity. Users can
instead create two CapTester instances if needed.

Signed-off-by: Rishabh Dave <[email protected]>
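
To make the shape of the refactor concrete, a rough sketch (method and attribute names here are illustrative, not the actual patch):

```python
import os


class CapTester:
    def __init__(self, mount, dirpath='', filename='testfile'):
        # Separate class attributes instead of the old self.test_set tuple;
        # one mount per CapTester instance, by design.
        self.mount = mount
        self.path = self._gen_test_file_path(dirpath, filename)
        self.data = self._gen_test_data()

    def _gen_test_file_path(self, dirpath, filename):
        # 1. Produce the test file path as per the caller's needs.
        return os.path.join(dirpath or self.mount.hostfs_mntpt, filename)

    def _gen_test_data(self):
        # 2. Generate data unique to this mount's test file.
        return 'unique data for test file at %s' % self.path

    def write_test_files(self):
        # 3. Actually create the test file on CephFS.
        self.mount.write_file(self.path, self.data)
```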