DNM: pybind/mgr: add smart mgr module #1
base: wip-smartctl-pull
Conversation
Also added 'ceph daemon osd.id list_devices', which prints the osd devices to stdout. 'ceph [tell|daemon] osd.id smart' probes the osd devices for SMART data and prints it to stdout in JSON format. It assumes the smartctl '--json' feature exists.

Signed-off-by: Yaarit Hatuka <[email protected]>
This mgr module contains two commands:

- 'osd smart get <osd.id>'  # queries a specific OSD for SMART data of all of its disks.
- 'osd smart get all'       # queries all OSDs in the mgr's osd_map for SMART data.

The mgr invokes a 'tell' command on the OSD side, which scrapes it for SMART metrics.

Signed-off-by: Yaarit Hatuka <[email protected]>
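For orientation, a minimal sketch of how a mgr module of that era could wire up such a command (the COMMANDS spec, the command strings, and the handle_command() signature follow the old MgrModule convention and are assumptions here, not the PR's actual code):

```python
import errno
import json

from mgr_module import MgrModule, CommandResult


class Module(MgrModule):
    COMMANDS = [
        {
            "cmd": "osd smart get name=osd_id,type=CephString,req=true",
            "desc": "Get SMART data from the devices of an OSD",
            "perm": "r",
        },
    ]

    def handle_command(self, cmd):
        if cmd['prefix'] == 'osd smart get':
            # ask the OSD to probe its devices via 'tell', per the commit message
            result = CommandResult('')
            self.send_command(result, 'osd', cmd['osd_id'],
                              json.dumps({'prefix': 'smart', 'format': 'json'}), '')
            r, outb, outs = result.wait()
            return r, outb, outs
        return -errno.EINVAL, '', 'unknown command'
```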
Force-pushed from d1d21e7 to e42bf30.
src/pybind/mgr/smart/module.py
Outdated
"perm": "r" | ||
}, | ||
{ | ||
"cmd": "osd smart scrape all", |
scrape-all
changed
src/pybind/mgr/smart/module.py
Outdated
"perm": "r" | ||
}, | ||
{ | ||
"cmd": "osd smart get all", |
osd smart dump ? or get-all
changed to 'smart dump'
src/pybind/mgr/smart/module.py
Outdated
r, outb, outs = result.wait()
smart_data = json.loads(outb)

osd_host = self.get_metadata('osd', osd_id)['hostname']
osd_host = self.get_metadata('osd', osd_id).get('hostname')
if not osd_host:
    return
added; should we throw an exception?
Better to just log something (at debug level, probably?) and continue. Remember this is going to loop over 100s or 1000s of devices, so we can't just stop when we hit one that looks weird.
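A minimal sketch of that log-and-continue pattern (the surrounding loop and the scrape_osd() helper are hypothetical; self.log and get_metadata are the MgrModule facilities already used in this PR):

```python
for osd_id in osd_ids:
    meta = self.get_metadata('osd', osd_id)
    osd_host = meta.get('hostname') if meta else None
    if not osd_host:
        # one weird device out of 100s or 1000s must not abort the
        # whole scrape; note it at debug level and keep going
        self.log.debug('no hostname for osd.%s, skipping' % osd_id)
        continue
    self.scrape_osd(osd_id, osd_host)  # hypothetical helper
```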
src/pybind/mgr/smart/module.py
Outdated
now = str(time.time())
ioctx.set_omap(op, (now,), (str(json.dumps(device_smart_data)),))
ioctx.operate_write_op(op, object_name)
self.log.debug('writing object %s, key: %s, value: %s' % (object_name, now, str(json.dumps(device_smart_data))))
can you confirm that it is writing key/value data with

    rados -p smart_data ls
    rados -p smart_data listomapvals $object_name

I would make sure the date looks like "YYYYMMDD-HHMMSS".

Also, it might make sense to have a single record for the entire smart scrape, and the value be the blob of json.
- can you confirm that it is writing key/value data:
  yes, here's a partial output: https://pastebin.com/ds5CNuHP
  all the examples I found used ioctx.set_omap() and ioctx.operate_write_op() to write objects; should I add op.create() regardless?
- I would make sure the date looks like "YYYYMMDD-HHMMSS".
  done
- Also, it might make sense to have a single record for the entire smart scrape, and the value be the blob of json.
  not sure I understand: right now object names are $host:device, and for each object, omap key:val are $date:$json_blob_of_that_device.
  Should we add another object "$host" with omap key:val of $date:$json_blob_all_host_devices?
What you have is right -- I just misread the code :)
I would expect it to be needed, or else the object won't get created... but maybe I'm wrong? Try deleting the pool and doing a scrape, and see if the object for the host:device gets recreated.
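For that check, the omap contents can also be read back programmatically (a librados Python sketch; the pool follows this thread's naming and "myhost:sda" is a hypothetical $host:device object name):

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('smart_data')
try:
    with rados.ReadOpCtx() as op:
        # iterate every omap $date -> $json_blob pair on one object
        it, ret = ioctx.get_omap_vals(op, "", "", -1)
        ioctx.operate_read_op(op, "myhost:sda")  # hypothetical object name
        for date_key, json_blob in it:
            print(date_key, json_blob)
finally:
    ioctx.close()
    cluster.shutdown()
```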
src/pybind/mgr/smart/module.py
Outdated
with rados.WriteOpCtx() as op:
    # TODO should we change second arg to be byte?
    # TODO what about the flag in ioctx.operate_write_op()?
    now = str(time.time())
I think you need an op.create() here too to make sure the object exists.
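A sketch of the suggested fix in context (ioctx, device_smart_data and object_name come from the PR's surrounding code; WriteOp.new() and rados.LIBRADOS_CREATE_IDEMPOTENT are assumed to be the binding's wrappers for rados_write_op_create() -- if the installed python-rados lacks them, an ioctx.write_full(object_name, b'') beforehand also guarantees the object exists):

```python
with rados.WriteOpCtx() as op:
    # make sure the object exists before attaching omap entries to it
    op.new(rados.LIBRADOS_CREATE_IDEMPOTENT)  # assumed wrapper, see lead-in
    # key format per the review: "YYYYMMDD-HHMMSS"
    now = time.strftime('%Y%m%d-%H%M%S')
    ioctx.set_omap(op, (now,), (str(json.dumps(device_smart_data)),))
    ioctx.operate_write_op(op, object_name)
```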
src/pybind/mgr/smart/module.py
Outdated
# looking for the right pool
# TODO open pool, and if it fails - create and open
pools = self.rados.list_pools()
is_pool = False
pool_name = get_config('pool_name') or 'smart_data'
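Spelled out, the suggestion might look like this (a sketch; pool_exists(), create_pool() and open_ioctx() are existing librados binding calls, and get_config() is the mgr module option accessor of the time):

```python
pool_name = self.get_config('pool_name') or 'smart_data'
# open the pool for SMART data, creating it first if needed (per the TODO)
if not self.rados.pool_exists(pool_name):
    self.rados.create_pool(pool_name)
ioctx = self.rados.open_ioctx(pool_name)
```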
Force-pushed from 5b76950 to b35f62b.
otherwise ODR is violated:

    ==449025==ERROR: AddressSanitizer: odr-violation (0x000000f03700):
      [1] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
      [2] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
    These globals were registered at these points:
      [1]:
        #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
        #1 0x56e9cb in asan.module_ctor (/var/ssd/ceph/clang-build/bin/ceph-conf+0x56e9cb)
      [2]:
        #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
        #1 0x7fe5fed12aeb in asan.module_ctor (/var/ssd/ceph/clang-build/lib/libceph-common.so.2+0x2f34aeb)
    ==449025==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0

Signed-off-by: Kefu Chai <[email protected]>
According to cppreference.com [1]: "If multiple threads of execution access the same std::shared_ptr object without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur (...)"

[1]: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic

One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap` with healthy looking content but a damaged control block:

```
[Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
#1  0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#3  0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#4  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#5  std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#6  OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
#8  0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#9  0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) frame 7
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
    at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
9665    in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print osdmap
$24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
(gdb) print *osdmap
# pretty sane OSDMap
(gdb) print sizeof(osdmap)
$26 = 16
(gdb) x/2a &osdmap
0x559ca22acef0: 0x559cba028000  0x559cba0ec900
(gdb) frame 2
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
148     /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
(gdb) disassemble
Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
...
   0x0000559c97675b1e <+62>:    mov    (%rdi),%rax
   0x0000559c97675b21 <+65>:    mov    %rdi,%rbx
   0x0000559c97675b24 <+68>:    callq  *0x10(%rax)
=> 0x0000559c97675b27 <+71>:    test   %rbp,%rbp
...
End of assembler dump.
(gdb) info registers rdi rbx rax
rdi            0x559cba0ec900      94131624790272
rbx            0x559cba0ec900      94131624790272
rax            0x559cba0ec8a0      94131624790176
(gdb) x/a 0x559cba0ec8a0 + 0x10
0x559cba0ec8b0: 0x559cb81c3ea0
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
...
(gdb) p $_siginfo._sifields._sigfault.si_addr
$27 = (void *) 0x559cb81c3ea0
```

Helgrind seems to agree:

```
==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:02:54.519 510301==    at 0x7218DD: operator= (shared_ptr_base.h:1078)
==00:00:02:54.519 510301==    by 0x7218DD: operator= (shared_ptr.h:103)
==00:00:02:54.519 510301==    by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:02:54.519 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:02:54.519 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:02:54.519 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:02:54.519 510301==
==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
==00:00:02:54.519 510301==    at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
==00:00:02:54.519 510301==    by 0x6B5842: shared_ptr (shared_ptr.h:129)
==00:00:02:54.519 510301==    by 0x6B5842: get_osdmap (OSD.h:1700)
==00:00:02:54.519 510301==    by 0x6B5842: OSD::create_context() (OSD.cc:9053)
==00:00:02:54.519 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:02:54.519 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:02:54.519 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:02:54.519 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:02:54.519 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:02:54.519 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
==00:00:02:54.519 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:02:54.519 510301==    by 0x66F766: main (ceph_osd.cc:688)
==00:00:02:54.519 510301== Block was alloc'd by thread #1
```

Actually there are plenty of similar issues reported, like:

```
==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
==00:00:05:04.903 510301==    at 0x753165: clear (hashtable.h:2051)
==00:00:05:04.903 510301==    by 0x753165: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:05:04.903 510301==    by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
==00:00:05:04.903 510301==    by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
==00:00:05:04.903 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:05:04.903 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:05:04.903 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:05:04.903 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:05:04.903 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:05:04.903 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:05:04.903 510301==
==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:05:04.903 510301==    at 0x7531E1: clear (hashtable.h:2054)
==00:00:05:04.903 510301==    by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:747)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:1078)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr.h:103)
==00:00:05:04.903 510301==    by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
==00:00:05:04.903 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:05:04.903 510301==    by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:699)
==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:1732)
==00:00:05:04.903 510301==    by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
This fixes the following selinux error when using ceph-iscsi's rbd-target-api daemon (rbd-target-gw has the same issue). They are a result of a python library, rtslib, which the daemons use.

    Additional Information:
    Source Context                system_u:system_r:ceph_t:s0
    Target Context                system_u:object_r:configfs_t:s0
    Target Objects                /sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication [ file ]
    Source                        rbd-target-api
    Source Path                   /usr/libexec/platform-python3.6
    Port                          <Unknown>
    Host                          ans8
    Source RPM Packages           platform-python-3.6.8-15.1.el8.x86_64
    Target RPM Packages
    Policy RPM                    selinux-policy-3.14.3-20.el8.noarch
    Selinux Enabled               True
    Policy Type                   targeted
    Enforcing Mode                Enforcing
    Host Name                     ans8
    Platform                      Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64
    Alert Count                   1
    First Seen                    2020-01-08 18:39:47 EST
    Last Seen                     2020-01-08 18:39:47 EST
    Local ID                      6f8c3415-7a50-4dc8-b3d2-2621e1d00ca3

    Raw Audit Messages
    type=AVC msg=audit(1578526787.577:68): avc: denied { ioctl } for pid=995 comm="rbd-target-api" path="/sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication" dev="configfs" ino=25703 ioctlcmd=0x5401 scontext=system_u:system_r:ceph_t:s0 tcontext=system_u:object_r:configfs_t:s0 tclass=file permissive=1
    type=SYSCALL msg=audit(1578526787.577:68): arch=x86_64 syscall=ioctl success=no exit=ENOTTY a0=34 a1=5401 a2=7ffd4f8f1f60 a3=3052cd2d95839b96 items=0 ppid=1 pid=995 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=rbd-target-api exe=/usr/libexec/platform-python3.6 subj=system_u:system_r:ceph_t:s0 key=(null)

    Hash: rbd-target-api,ceph_t,configfs_t,file,ioctl

Signed-off-by: Mike Christie <[email protected]>
(cherry picked from commit 8187235)
This fixes selinux errors like the following for /etc/target:

    Additional Information:
    Source Context                system_u:system_r:ceph_t:s0
    Target Context                system_u:object_r:targetd_etc_rw_t:s0
    Target Objects                target [ dir ]
    Source                        rbd-target-api
    Source Path                   rbd-target-api
    Port                          <Unknown>
    Host                          ans8
    Source RPM Packages
    Target RPM Packages
    Policy RPM                    selinux-policy-3.14.3-20.el8.noarch
    Selinux Enabled               True
    Policy Type                   targeted
    Enforcing Mode                Enforcing
    Host Name                     ans8
    Platform                      Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64
    Alert Count                   1
    First Seen                    2020-01-08 18:39:48 EST
    Last Seen                     2020-01-08 18:39:48 EST
    Local ID                      9a13ee18-eaf2-4f2a-872f-2809ee4928f6

    Raw Audit Messages
    type=AVC msg=audit(1578526788.148:69): avc: denied { search } for pid=995 comm="rbd-target-api" name="target" dev="sda1" ino=52198 scontext=system_u:system_r:ceph_t:s0 tcontext=system_u:object_r:targetd_etc_rw_t:s0 tclass=dir permissive=1

    Hash: rbd-target-api,ceph_t,targetd_etc_rw_t,dir,search

These are a result of the rtslib library the ceph-iscsi daemons use accessing /etc/target to read/write a file which stores metadata the target uses.

Signed-off-by: Mike Christie <[email protected]>
(cherry picked from commit 53be181)

Conflicts:
	selinux/ceph.te: trivial resolution
It's perfectly legal for a client to reconnect to a particular `Watch` using a different socket / `Connection` than the original one. This shall include proper handling of the watch timer, which is currently broken: when reconnecting, we don't cancel the timer. This led to the following crash at Sepia:

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-09-02_07:44:51-rados-master-distro-basic-smithi/6372357$ less ./remote/smithi183/log/ceph-osd.4.log.gz
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - client_request(id=12, detail=m=[osd_op(client.5087.0:93 7.1e 7:7c7084bd:::repobj:head {watch reconnect cookie 94478891024832 gen 1} snapc 0={} ondisk+write+known_if_redirected e40) v8]): got obc lock
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch by client.5087
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch_subop_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch watch(cookie 94478891024832 30s 172.21.15.150:0/3544196211) by client.5087
...
INFO 2021-09-02 08:10:45,462 [shard 0] osd - op_effect: found existing watcher: 94478891024832,client.5087
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7406-g9d30203c/rpm/el8/BUILD/ceph-17.0.0-7406-g9d30203c/src/seastar/include/seastar/core/timer.hh:95: void seastar::timer<Clock>::arm_state(seastar::timer<Clock>::time_point, std::optional<typename Clock::duration>) [with Clock = seastar::lowres_clock; seastar::timer<Clock>::time_point = std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > >; typename Clock::duration = std::chrono::duration<long int, std::ratio<1, 1000> >]: Assertion `!_armed' failed.
Aborting on shard 0.
Backtrace:
 0# 0x000055CC052CF0B6 in ceph-osd
 1# FatalSignal::signaled(int, siginfo_t const&) in ceph-osd
 2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
 3# 0x00007FA58349FB20 in /lib64/libpthread.so.0
 4# gsignal in /lib64/libc.so.6
 5# abort in /lib64/libc.so.6
 6# 0x00007FA581A98C89 in /lib64/libc.so.6
 7# 0x00007FA581AA6A76 in /lib64/libc.so.6
 8# 0x000055CC0BEEE9DD in ceph-osd
 9# crimson::osd::Watch::connect(seastar::shared_ptr<crimson::net::Connection>, bool) in ceph-osd
10# 0x000055CC00B1D246 in ceph-osd
11# 0x000055CBFFEF01AE in ceph-osd
...
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
`seastar::engine()` is available only for Seastar's threads; it shouldn't be called outside of a reactor thread. Unfortunately, this assumption is violated in `AlienStore`, where `OnCommit::finish()`, executed from a finisher thread of `BlueStore`, calls `alien()` on `seastar::engine()`. The net effect is crashes like the following one:

```
INFO 2021-09-22 14:26:33,214 [shard 0] osd - operator() writing superblock cluster_fsid 1d8f7908-2ebf-4a91-ae70-f445668c126b osd_fsid 4da9fe9a-1da5-4ea9-aa79-a1178165ede5
Segmentation fault.
Backtrace:
 0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
 1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:62
 3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
 4# (anonymous namespace)::OnCommit::finish(int) at /home/rzarzynski/ceph1/build/../src/crimson/os/alienstore/alien_store.cc:53
 5# Context::complete(int) at /home/rzarzynski/ceph1/build/../src/include/Context.h:100
 6# Finisher::finisher_thread_entry() at /home/rzarzynski/ceph1/build/../src/common/Finisher.cc:65
 7# 0x00007F16BBA0915A in /lib64/libpthread.so.0
 8# clone in /lib64/libc.so.6
Dump of siginfo:
...
si_addr: 0x10
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
It's all about these items:

```
 0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
 1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:62
 3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
```

They are part of our backtrace handling, and typically developers are not interested in them. Let's be consistent with the classical OSD and hide them.

Signed-off-by: Radoslaw Zarzynski <[email protected]>
`PG::request_{local,remote}_recovery_reservation()` dynamically allocates up to 2 instances of `LambdaContext<T>` and transfers their ownership to the `AsyncReserver<T, F>`. This is expressed in the raw-pointer (`new` and `delete`) notion. Further analysis shows the only place where `delete` for these objects is called is `AsyncReserver::cancel_reservation()`. In contrast to the classical OSD, crimson doesn't invoke that method when stopping a PG during the shutdown sequence. This would explain the following ASan issue observed at Sepia:

```
Direct leak of 576 byte(s) in 24 object(s) allocated from:
    #0 0x7fa108fc57b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
    #1 0x55723d8b0b56 in non-virtual thunk to crimson::osd::PG::request_local_background_io_reservation(unsigned int, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >) (/usr/bin/ceph-osd+0x24d95b56)
    #2 0x55723f1f66ef in PeeringState::WaitDeleteReserved::WaitDeleteReserved(boost::statechart::state<PeeringState::WaitDeleteReserved, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) (/usr/bin/ceph-osd+0x266db6ef)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
exit() will call pthread_cond_destroy(); attempting to destroy dpdk::eal::cond, upon which other threads are currently blocked, results in undefined behavior. Testing against different libc versions: with libc-2.17 exit succeeds, while with libc-2.27 it deadlocks; the call stack is as follows:

    Thread 3 (Thread 0xffff7e5749f0 (LWP 62213)):
    #0  0x0000ffff7f3c422c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xaaaadc0e30f4 <dpdk::eal::cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
    #1  __pthread_cond_wait_common (abstime=0x0, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>, cond=0xaaaadc0e30c8 <dpdk::eal::cond>) at pthread_cond_wait.c:502
    #2  __pthread_cond_wait (cond=0xaaaadc0e30c8 <dpdk::eal::cond>, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>) at pthread_cond_wait.c:655
    #3  0x0000ffff7f1f1f80 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
    #4  0x0000aaaad37f5078 in dpdk::eal::<lambda()>::operator()(void) const (__closure=<optimized out>, __closure=<optimized out>) at ./src/msg/async/dpdk/dpdk_rte.cc:136
    #5  0x0000ffff7f1f7ed4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
    #6  0x0000ffff7f3be088 in start_thread (arg=0xffffe73e197f) at pthread_create.c:463
    #7  0x0000ffff7efc74ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

    Thread 1 (Thread 0xffff7ee3b010 (LWP 62200)):
    #0  0x0000ffff7f3c3c38 in futex_wait (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
    #1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>) at ../sysdeps/nptl/futex-internal.h:135
    #2  __pthread_cond_destroy (cond=0xaaaadc0e30c8 <dpdk::eal::cond>) at pthread_cond_destroy.c:54
    #3  0x0000ffff7ef2be34 in __run_exit_handlers (status=-6, listp=0xffff7f04a5a0 <__exit_funcs>, run_list_atexit=255, run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
    #4  0x0000ffff7ef2bf6c in __GI_exit (status=<optimized out>) at exit.c:139
    #5  0x0000ffff7ef176e4 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:344
    #6  0x0000aaaad2939db0 in _start () at ./src/include/buffer.h:642

Fixes: https://tracker.ceph.com/issues/42890
Signed-off-by: Chunsong Feng <[email protected]>
Signed-off-by: luo rixin <[email protected]>
The patch is supposed to fix the following problems (extra debugs onboard):

```
INFO 2021-11-16 01:18:38,713 [shard 0] osd - ~OSD: OSD dtor called
INFO 2021-11-16 01:18:38,713 [shard 0] osd - Heartbeat::Peer: osd.6 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.5 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.2 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ShardServices: ShardServices dtor called
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: ShardServices dtor called; unref_size=3, size=3
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000115380
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000114980
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000112680
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000114980
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000115380
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000112680
INFO 2021-11-16 01:18:38,738 [shard 0] osd - crimson shutdown complete
=================================================================
==33351==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2808 byte(s) in 3 object(s) allocated from:
    #0 0x7fe10c0327b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
    #1 0x55accbe8ffc4 in ceph::common::intrusive_lru<ceph::common::intrusive_lru_config<hobject_t, crimson::osd::ObjectContext, crimson::osd::obc_to_hoid<crimson::osd::ObjectContext> > >::get_or_create(hobject_t const&) (/usr/bin/ceph-osd+0x3b000fc4)

Objects leaked above:
0x619000112680 (936 bytes)
0x619000114980 (936 bytes)
0x619000115380 (936 bytes)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
Initially, we were assuming that no pointer obtained from SharedLRU can outlive the lru itself. However, since going with the interruption concept for handling shutdowns, this is no longer valid. The patch is supposed to deal with crashes like the following one:

```
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8898-ge57ad63c/rpm/el8/BUILD/ceph-17.0.0-8898-ge57ad63c/src/crimson/common/shared_lru.h:46: SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = OSDMap]: Assertion `weak_refs.empty()' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 1162 ms on shard 0. Backtrace: 0xb14ab 0x46e57428 0x46bc450d 0x46be03bd 0x46be0782 0x46be0946 0x46be0bf6 0x12b1f 0xc8e3b 0x3fdd77e2 0x3fddccdb 0x3fdde1ee 0x3fdde8b3 0x3fdd3f2b 0x3fdd4442 0x3fdd4c3a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3a5ae1b9 0x3a38c5e2 0x3a0c823d 0x3a1771f1 0x3a1796f5 0x46ff92c9 0x46ff9525 0x46ff9e93 0x46ff8eae 0x46ff8bd9 0x3a160e67 0x39f50c83 0x39f51cd0 0x46b96271 0x46bde51a 0x46d6891b 0x46d6a8f0 0x4681a7d2 0x4681f03b 0x39fd50f2 0x23492 0x39b7a7dd
 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
 2# 0x00007F9535E04C89 in /lib64/libc.so.6
 3# 0x00007F9535E12A76 in /lib64/libc.so.6
 4# crimson::osd::OSD::~OSD() in ceph-osd
 5# seastar::shared_ptr_count_for<crimson::osd::OSD>::~shared_ptr_count_for() in ceph-osd
 6# seastar::shared_ptr<crimson::osd::OSD>::~shared_ptr() in ceph-osd
 7# seastar::futurize<std::result_of<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1} ()>::type>::type seastar::smp::submit_to<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) in ceph-osd
 8# std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>::_M_invoke(std::_Any_data const&, unsigned int&&) in ceph-osd
 9# 0x0000562DA18162CA in ceph-osd
10# 0x0000562DA1816526 in ceph-osd
11# 0x0000562DA1816E94 in ceph-osd
12# 0x0000562DA1815EAF in ceph-osd
13# 0x0000562DA1815BDA in ceph-osd
14# seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>::direct_vtable_for<seastar::future<void>::then_wrapped_maybe_erase<true, seastar::future<void>, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}>(seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}&&)::{lambda(seastar::future<void>&&)#1}>::call(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> const*, seastar::future<void>&&) in ceph-osd
15# 0x0000562D9476DC84 in ceph-osd
16# 0x0000562D9476ECD1 in ceph-osd
17# 0x0000562DA13B3272 in ceph-osd
18# 0x0000562DA13FB51B in ceph-osd
19# 0x0000562DA158591C in ceph-osd
20# 0x0000562DA15878F1 in ceph-osd
21# 0x0000562DA10377D3 in ceph-osd
22# 0x0000562DA103C03C in ceph-osd
23# main in ceph-osd
24# __libc_start_main in /lib64/libc.so.6
25# _start in ceph-osd
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
The problem is:

```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    #1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    #2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    #3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    #4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    #5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    #6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    #7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    #8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    #9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    #10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    #11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    #12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    #13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    #14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    #15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    #16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    #17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    #18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    #19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    #20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    #21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    #22 0x7f6de558de39 (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    #1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    #2 0x7fff3214bc4f ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <[email protected]>
Improvement #1:

CapTester.write_test_files() not only creates the test file but also does the following for every mount object it receives in parameters -

* carefully produces the path for the test file as per parameters received
* generates the unique data for each test file on a CephFS mount
* creates a data structure -- list of lists -- that holds all this information along with the mount object itself for each mount object, so that tests can be conducted at a later point

Untangle this mess of code by splitting this method into 3 separate methods -

1. To produce the path for the test file (as per the user's need).
2. To generate the data that will be written into the test file.
3. To actually create the test file on CephFS.

Improvement #2:

Remove the internal data structure used for testing -- self.test_set -- and use separate class attributes to store all the data required for testing instead of a tuple. This serves two purposes - one, it makes it easy to manipulate all this data from helper methods and during a debugging session, especially while using a PDB session; and two, it makes it impossible to have multiple mounts/multiple "test sets" within the same CapTester instance, for the sake of simplicity. Users can instead create two instances of CapTester if needed.

Signed-off-by: Rishabh Dave <[email protected]>
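A sketch of what the split could look like (names like hostfs_mntpt and write_file mirror the qa CephFSMount helpers, but the method names here are illustrative, not the patch's actual ones):

```python
import os


class CapTester:
    """One mount and one "test set" per instance, by design."""

    def __init__(self, mount, path=''):
        self.mount = mount
        self.path = self.get_path_for_test_file(path)  # 1. produce the path
        self.data = self.gen_data_for_test_file()      # 2. generate the data

    def get_path_for_test_file(self, path=''):
        # build the test file's path under the mount point, per parameters
        return os.path.join(self.mount.hostfs_mntpt, path, 'testfile')

    def gen_data_for_test_file(self):
        # unique content per mount, so later reads can be verified
        return 'data unique to %s\n' % self.path

    def create_test_file(self):
        # 3. actually create the test file on CephFS
        self.mount.write_file(self.path, self.data)
```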