
pmproxy[1696354]: segfault at 7 ip 00007f2cd8644e2a sp 00007ffd0335e440 error 4 in libpcp_web.so.1 #1815

Open
christianhorn opened this issue Sep 22, 2023 · 15 comments


@christianhorn
Collaborator

On Debian 12/bookworm amd64, with a self-built PCP from today (6.1.1), I see this crash when accessing some metrics via Grafana or for example 'wget "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"':

[1057733.661020] pmproxy[1696354]: segfault at 7 ip 00007f2cd8644e2a sp 00007ffd0335e440 error 4 in libpcp_web.so.1[7f2cd8639000+5a000] likely on CPU 2 (core 2, socket 0)
[1057733.661040] Code: ff ff e8 a9 4e ff ff 66 0f 1f 84 00 00 00 00 00 48 85 ff 0f 84 3f 02 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 83 ec 18 <8b> 07 83 f8 33 0f 86 fb 00 00 00 8b b5 a8 00 00 00 48 c7 04 24 00
[1057734.665869] pmproxy[1696400]: segfault at 7 ip 00007f8c8d374e2a sp 00007fffe115a540 error 4 in libpcp_web.so.1[7f8c8d369000+5a000] likely on CPU 2 (core 2, socket 0)
[1057734.665886] Code: ff ff e8 a9 4e ff ff 66 0f 1f 84 00 00 00 00 00 48 85 ff 0f 84 3f 02 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 83 ec 18 <8b> 07 83 f8 33 0f 86 fb 00 00 00 8b b5 a8 00 00 00 48 c7 04 24 00
[1057736.670009] pmproxy[1696445]: segfault at 7 ip 00007fd712026e2a sp 00007ffea60aa500 error 4 in libpcp_web.so.1[7fd71201b000+5a000] likely on CPU 0 (core 0, socket 0)
[1057736.670028] Code: ff ff e8 a9 4e ff ff 66 0f 1f 84 00 00 00 00 00 48 85 ff 0f 84 3f 02 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 83 ec 18 <8b> 07 83 f8 33 0f 86 fb 00 00 00 8b b5 a8 00 00 00 48 c7 04 24 00

Apparently only some metrics are affected, all of them from pmda-openmetrics; other pmda-openmetrics metrics work. Accessing the metrics which lead to the crash with 'pmrep -a' directly from the archive files, the metrics are available and look OK.
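
For reference, roughly how I am looking at the same metric via the two paths (the metric name and archive path below are placeholders; the series ID is the one from the kernel log above):

pmseries openmetrics.some.metric        # list the series IDs behind an openmetrics metric
wget -qO- "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"   # this crashes pmproxy
pmrep -a /var/log/pcp/pmlogger/HOSTNAME openmetrics.some.metric                                        # reading from the archive works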

@christianhorn
Collaborator Author

It's not about the data: I loaded the same archives with 'pmseries --load' on a Fedora 38 system with pcp-6.1.0-1.fc38.x86_64, and there executing 'wget "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"' succeeds.
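
Roughly what that cross-check looks like (the archive path is just a placeholder):

# index the copied archives into the local Redis, then query pmproxy's REST API
pmseries --load /path/to/copied/archives
wget -qO- "http://localhost:44322/series/instances?series=5f8b5c6d7865695d21729b676d8168cfa070d0a1"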

@christianhorn
Collaborator Author

Before segfaulting, pmproxy logs this:

[Sat Sep 23 01:31:07] pmproxy(2406264) Critical: bad mapping for name of series 5f8b5c6d7865695d21729b676d8168cfa070d0a1

pmproxy_strace.txt
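
(The strace was captured roughly along these lines; the exact options may have differed:)

strace -f -s 200 -o pmproxy_strace.txt -p "$(pgrep -x pmproxy)"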

@natoscott
Member

@christianhorn do you have a way I can reproduce this? I'm not seeing similar behaviour here doing instances queries on an openmetrics metric series ID ... I must be missing some characteristic of your metrics.

Looking at the code, that bad-mapping diagnostic suggests some metric/instance metadata has not been indexed in Redis for some reason ... and if that's the root cause, then the crash may be a cascading error.
But I'm not seeing either issue here.

Re pcp-6.1.0 vs pcp-6.1.1 - does it work OK with one but not the other? Thanks.
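
It might also be worth checking directly in Redis whether the metadata for that series got indexed at all - something along these lines (just a sketch, assuming the default local Redis):

redis-cli SMEMBERS "pcp:instances:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1"
redis-cli HGETALL "pcp:map:inst.name"
redis-cli HMGET "pcp:expr:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1" query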

@christianhorn
Collaborator Author

On the affected system, whenever I run the query with wget (or from Grafana), I see the crash again. Because of this I cannot set up graphs for some metrics in Grafana, as each access fails.

Good idea, let me try pcp-6.1.0.

@natoscott
Member

| Good idea, let me try pcp-6.1.0.

Oh, I thought that's what you were saying in the second comment already?

@christianhorn
Collaborator Author

christianhorn commented Sep 27, 2023

| Oh, I thought that's what you were saying in the second comment already?

Ah sorry, I should have been more specific. I initially saw the issue on Debian Bookworm/PCP 6.1.1, had then for testing loaded these archive files with pmseries into a Fedora system with pcp-6.1.0-1.fc38, and did not see any issues there.

More tests:

Own build of PCP-6.1.1 on that Debian Bookworm: pmproxy crash
Own build of PCP-6.1.0 on that Debian Bookworm: pmproxy crash
Own build of PCP-6.0.5 on that Debian Bookworm: pmproxy crash
Own build of PCP-5.3.7 on that Debian Bookworm: no crash
Debian Bookworm repo package, PCP-6.0.3: pmproxy crash

On a pristine, freshly installed Debian Bookworm KVM guest, the same PCP-6.0.3 from the Debian repo does not crash. The checksums of /usr/lib/pcp/bin/pmproxy and of all the libraries it was linked against are the same on both systems. I tried to compare strace runs of pmproxy receiving the request, but did not get far with that.

I reckon there must be something fishy about the installation, but I am not sure how to pin the issue down further.
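
For the checksum comparison above, I did roughly the following on both systems (md5sum is just an example, any checksum tool works):

md5sum /usr/lib/pcp/bin/pmproxy
ldd /usr/lib/pcp/bin/pmproxy | awk '/=>/ {print $3}' | xargs md5sum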

@natoscott
Member

Can you find a minimal openmetrics configuration that I can run locally to reproduce it here? Thanks.

@christianhorn
Collaborator Author

I see pmproxy crashing even without extra pmdas; also with the configuration reduced to "pmda: root pmcd pmproxy linux" the crash occurs. I'm not able to replicate it on a newly set up Debian Bookworm amd64 deployment. Both the working and the non-working systems are amd64 KVM guests.

@christianhorn
Collaborator Author

On the working system I installed the same packages as on the non-working system, but I am not seeing the crashes there. The pmproxy logfile has this:

pmproxy(84960) Critical: bad mapping for name of series 5f8b5c6d7865695d21729b676d8168cfa070d0a1

@christianhorn
Collaborator Author

Hm.. so I have a plain new deployment of Debian Bookworm with the PCP packages from the Bookworm repos, and do not see the crash there. This is the last part of an strace on that box (before the strace diverges from the one on the system where pmproxy crashes):

sendto(15, "*3\r\n$5\r\nHMGET\r\n$56\r\npcp:expr:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1\r\n$5\r\nquery\r\n", 89, 0, NULL, 0) = 89
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7ffcca0395ac) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7ffcca0395ac) = -1 ENOENT (No such file or directory)
epoll_ctl(3, EPOLL_CTL_ADD, 15, {events=EPOLLIN, data={u32=15, u64=15}}) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=15, u64=15}}], 1024, 306) = 1
recvfrom(15, "*1\r\n$-1\r\n", 16384, 0, NULL, NULL) = 9
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7ffcca03959c) = 0
write(8, "\1\0\0\0\0\0\0\0", 8)   = 8
epoll_ctl(3, EPOLL_CTL_ADD, 15, {events=EPOLLIN, data={u32=15, u64=15}}) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=8, u64=8}}], 1024, 306) = 1
read(8, "\1\0\0\0\0\0\0\0", 1024) = 8
writev(21, [{iov_base="HTTP/1.1 200 OK\r\nConnection: Keep-Alive\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Header"..., iov_len=289}, {iov_base="[]\r\n", iov_len=4}], 2) = 293
epoll_ctl(3, EPOLL_CTL_MOD, 21, {events=EPOLLIN, data={u32=21, u64=21}}) = 0

The same part on the system where pmproxy crashes:

sendto(15, "*5\r\n$5\r\nHMGET\r\n$56\r\npcp:inst:series:50421f1895efb3b957095c7f632efabba2c60189\r\n$4\r\ninst\r\n$4\r\nname\r\n$6"..., 7370, 0, NULL, 0) = 7370
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7fffb6d292bc) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7fffb6d292bc) = -1 ENOENT (No such file or directory)
epoll_ctl(3, EPOLL_CTL_ADD, 15, {events=EPOLLIN, data={u32=15, u64=15}}) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=15, u64=15}}], 1024, 116) = 1
recvfrom(15, "*3\r\n$2\r\n39\r\n$20\r\n\22\244+r>F\206T\302\323*\275,\203\366\204\242*\t\254\r\n$20\r\n\277\251\221\374#\315\336\271\253\305\276\253G\241\372b\243f\246v\r\n*3\r\n$2\r\n36\r\n$20\r\nD\312,\36\344\213)\276\232\366]\32\203\203\271\317,"..., 16384, 0, NULL, NULL) = 4396
epoll_ctl(3, EPOLL_CTL_DEL, 15, 0x7fffb6d292ac) = 0
write(8, "\1\0\0\0\0\0\0\0", 8) = 8
getpid()                        = 2989425
write(2, "[Sat Sep 30 13:22:59] pmproxy(2989425) Critical: bad mapping for name of series 5f8b5c6d7865695d2172"..., 121) = 121
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7} ---
+++ killed by SIGSEGV +++

Handle 15 is the connection to Redis, so right after communicating with Redis, pmproxy writes the log entry and crashes.
I tried transferring /usr and /etc from the non-working system to the working one, but did not get any further in finding the difference between the working and the non-working setup.
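
(That fd 15 is the Redis connection can be confirmed with something like the following; just a sketch:)

ls -l /proc/$(pgrep -x pmproxy)/fd/15     # shows socket:[inode] for that descriptor
ss -tnp | grep pmproxy                    # shows the peer address (Redis is 127.0.0.1:6379 by default)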

@natoscott
Member

Is there anything in the Redis log file? You can also run 'redis-cli monitor' in another terminal and watch the Redis commands arrive up until the failure point; that may prove helpful. Could there be a difference in Redis versions between the working and failing machines, I wonder?

I've audited the PCP code and can see where that diagnostic is produced ... but cannot see any code issue that could result in SIGSEGV around there. Another helpful piece of information would be the pmproxy stack trace from when it crashes (got abrt there?).
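
If abrt isn't available there, something like this should also get a usable trace from a core dump (just a sketch, paths/names are examples):

# with systemd-coredump installed:
coredumpctl list pmproxy
coredumpctl gdb pmproxy                       # then: bt
# or, with a raw core file:
gdb /usr/lib/pcp/bin/pmproxy /path/to/core    # then: bt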

@christianhorn
Collaborator Author

christianhorn commented Oct 2, 2023

The OK and not-OK systems have the same Redis version, both package-wise and in terms of the checksums of the bits on the file systems.

On the system with the non-crashing pmproxy, the monitor reports:

OK
1696211397.679236 [0 127.0.0.1:47004] "HGETALL" "pcp:map:inst.name"
1696211397.680071 [0 127.0.0.1:47004] "SMEMBERS" "pcp:instances:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1"
1696211397.680456 [0 127.0.0.1:47004] "HMGET" "pcp:expr:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1" "query"

On the system with the crashing pmproxy:

1696211173.924942 [0 127.0.0.1:54514] "HGETALL" "pcp:map:inst.name"
1696211173.934850 [0 127.0.0.1:54514] "SMEMBERS" "pcp:instances:series:5f8b5c6d7865695d21729b676d8168cfa070d0a1"
1696211173.935566 [0 127.0.0.1:54514] "HMGET" "pcp:inst:series:50421f1895efb3b957095c7f632efabba2c60189" "inst" "name" "source"
1696211173.935583 [0 127.0.0.1:54514] "HMGET" "pcp:inst:series:74abe88e2b8d93ae89ad4be3b807a691f7de5eec" "inst" "name" "source"
[more HMGETs]
1696211173.936326 [0 127.0.0.1:54514] "HMGET" "pcp:inst:series:f47beefaae98255886e7d8ae9e2b94d2654d5e1b" "inst" "name" "source"
1696211173.936338 [0 127.0.0.1:54514] "HMGET" "pcp:inst:series:9ff8de6b8483ac76652c237ed81d7ec3f4fdedea" "inst" "name" "source"
1696211174.230777 [0 127.0.0.1:46122] "CLUSTER" "NODES"
1696211174.233041 [0 127.0.0.1:46134] "COMMAND"
1696211174.234958 [0 127.0.0.1:46134] "INFO" "SERVER"
1696211174.235317 [0 127.0.0.1:46134] "GET" "pcp:version:schema"

These last 4 are queries from the newly spawned pmproxy, after the crash. If I execute the last query against pmproxy manually with

wget "http://localhost:44322/series/instances?series=9ff8de6b8483ac76652c237ed81d7ec3f4fdedea"

..then pmproxy does not crash:

1696212427.353965 [0 127.0.0.1:56486] "HGETALL" "pcp:map:inst.name"
1696212427.363053 [0 127.0.0.1:56486] "SMEMBERS" "pcp:instances:series:9ff8de6b8483ac76652c237ed81d7ec3f4fdedea"
1696212427.363775 [0 127.0.0.1:56486] "HMGET" "pcp:expr:series:9ff8de6b8483ac76652c237ed81d7ec3f4fdedea" "query"

When I look with strace at redis on the box where pmproxy is crashing, I observe this:

$ grep write logg_redis |grep ERR
1294337 write(8, "-ERR This instance has cluster support disabled\r\n", 49) = 49
1294337 write(8, "-ERR unknown command 'FT.CREATE', with args beginning with: 'pcp:text' 'SCHEMA' 'TYPE' 'TAG' 'SORTABLE' 'NAME' 'TEXT' 'WEIGHT' '9' 'SORTABLE' 'INDOM' 'TAG' 'ONELINE' 'TEXT' 'WEIGHT' '4' 'HE' \r\n", 193) = 193
1294337 write(8, "-ERR This instance has cluster support disabled\r\n", 49) = 49
1294337 write(8, "-ERR unknown command 'FT.CREATE', with args beginning with: 'pcp:text' 'SCHEMA' 'TYPE' 'TAG' 'SORTABLE' 'NAME' 'TEXT' 'WEIGHT' '9' 'SORTABLE' 'INDOM' 'TAG' 'ONELINE' 'TEXT' 'WEIGHT' '4' 'HE' \r\n", 193) = 193

Redis is writing these errors back to pmproxy.
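
FT.CREATE looks like a RediSearch command, so I assume these are pmproxy probing for the optional text-search and cluster features and failing on a plain Redis without that module. What the local Redis actually supports can be checked with, e.g.:

redis-cli MODULE LIST
redis-cli INFO server | grep redis_version
redis-cli CLUSTER INFO | grep cluster_enabled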

@christianhorn
Collaborator Author

stack trace:

# gdb /usr/lib/pcp/bin/pmproxy pmproxy.core13
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/pcp/bin/pmproxy...
[New LWP 3524527]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/pcp/bin/pmproxy -F -A'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  skip_free_value_set (np=np@entry=0x7) at query.c:98
98          if (np->type == N_RATE || np->type == N_RESCALE || np->type == N_ABS
(gdb) bt
#0  skip_free_value_set (np=np@entry=0x7) at query.c:98
#1  0x00007f9746f51db7 in freeSeriesQueryNode (np=0x7) at query.c:114
#2  0x00007f9746f51ff5 in freeSeriesGetQuery (baton=baton@entry=0x5603081c6810) at query.c:154
#3  0x00007f9746f5205a in series_query_finished (arg=arg@entry=0x5603081c6810) at query.c:283
#4  0x00007f9746f5a7ac in series_query_end_phase (arg=arg@entry=0x5603081c6810) at query.c:302
#5  0x00007f9746f5bccf in series_instances_reply_callback (c=<optimized out>, r=<optimized out>, arg=0x5603069049b0)
    at query.c:5858
#6  0x00007f9746f6dd85 in redisSlotsReplyCallback (c=0x560306276a10, r=0x560307e15290, arg=0x5603070616b0)
    at slots.c:515
#7  0x00007f9746f803ce in redisClusterAsyncCallback (ac=<optimized out>, r=0x560307e15290, privdata=0x56030690f590)
    at deps/hiredis-cluster/hircluster.c:3837
#8  0x00007f9746f75edb in __redisRunCallback (ac=ac@entry=0x5603062746d0, cb=cb@entry=0x7fffa230f460,
    reply=<optimized out>) at deps/hiredis/async.c:287
#9  0x00007f9746f76c6d in redisProcessCallbacks (ac=ac@entry=0x5603062746d0) at deps/hiredis/async.c:572
#10 0x00007f9746f76d72 in redisAsyncRead (ac=0x5603062746d0) at deps/hiredis/async.c:635
#11 0x00007f9746f76d94 in redisAsyncHandleRead (ac=0x5603062746d0) at deps/hiredis/async.c:654
#12 0x00007f9746f6db38 in redisLibuvPoll (handle=<optimized out>, status=<optimized out>, events=<optimized out>)
    at deps/hiredis/adapters/libuv.h:21
#13 0x00007f97469f2e3c in ?? () from /lib/x86_64-linux-gnu/libuv.so.1
#14 0x00007f97469df9e4 in uv_run () from /lib/x86_64-linux-gnu/libuv.so.1
#15 0x0000560304faf606 in main_loop (arg=0x56030625bd30, runtime=<optimized out>) at server.c:901
#16 0x0000560304faed7d in main (argc=<optimized out>, argv=0x7fffa2313a28) at pmproxy.c:468

@natoscott
Member

The np@entry=0x7 is the cause of the SIGSEGV - it's not null (0x0), but it's also not a valid pointer address, so when it is dereferenced as such in skip_free_value_set, bad things happen.

The question is, how did it get this value? It could be a use-after-free (valgrind would tell us more there) - auditing the code once more, I cannot see it from visual inspection alone.

@christianhorn
Collaborator Author

christianhorn commented Oct 6, 2023

Output from

valgrind -s --log-file=/tmp/debug2.log --main-stacksize=83886080 --leak-check=full \
  --show-leak-kinds=all --trace-children=yes --track-origins=yes /etc/init.d/pmproxy start

and then doing the request with wget.
debug2.log
