
"invalid http POST response" and later crash #416

Open
pacu opened this issue Sep 29, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@pacu
Contributor

pacu commented Sep 29, 2022

What is the bug?
We found that our testnet server was down.

Additional context

LWD logs

{"app":"lightwalletd","level":"info","msg":"Waiting for block: 2057199","time":"2022-09-29T07:30:54Z"}
lightwalletd_1     | 2022-09-29T07:31:25.170377206Z {"app":"lightwalletd","error":"invalid http POST response (nil), method: getbestblockhash, id: 494636, last error=Post \"http://zcashd:38232\": dial tcp: lookup zcashd on 127.0.0.11:53: no such host","level":"fatal","msg":"error zcashd getbestblockhash rpc","time":"2022-09-29T07:31:25Z"}
lightwalletd_1     | 2022-09-29T07:31:25.170436128Z Lightwalletd died with a Fatal error. Check logfile for details.

Appending logfiles for lightwalletd.

Zcashd Logs

zcashd_1           | 2022-09-29T07:30:35.348777431Z 2022-09-29T07:30:35.348590Z  INFO main: keypool reserve 1
zcashd_1           | 2022-09-29T07:30:35.348897797Z 2022-09-29T07:30:35.348691Z  INFO main: keypool return 1
zcashd_1           | 2022-09-29T07:30:52.699171626Z 2022-09-29T07:30:52.698897Z  INFO ProcessNewBlock: main: UpdateTip: new best hash=0027e9c247ea67c458c9fd51750fbab56ecf72231338189a29f19b0cba0cc1a3 height=2057198 bits=523969164 log2_work=37.287206 tx=2595076 date=2022-09-29 07:30:35 progress=1.000000 cache=0.0MiB(1tx)

Solution
I wish I knew it! 😅

@pacu pacu added the bug Something isn't working label Sep 29, 2022
@defuse
Collaborator

defuse commented Oct 15, 2022

It looks like lightwalletd is configured to connect to zcashd via the hostname zcashd, but that hostname didn't exist when it tried to do so. It's most likely a configuration error on the deployment side, or something went haywire in the Docker/GKE networking.
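For illustration, here is a minimal docker-compose sketch of this kind of deployment. The service names, images, and flags are assumptions for the example, not the project's actual configuration; the point is that lightwalletd resolves the name zcashd through Docker's embedded DNS (127.0.0.11), so if the zcashd service is stopped or removed, the lookup fails with "no such host" exactly as in the log above.

```yaml
# Hypothetical compose fragment -- names and flags are illustrative only.
services:
  zcashd:
    image: electriccoinco/zcashd
    # lightwalletd reaches this container by the service name "zcashd"
    # via Docker's embedded DNS resolver at 127.0.0.11.

  lightwalletd:
    image: electriccoinco/lightwalletd
    command: ["--rpchost", "zcashd", "--rpcport", "38232"]
    depends_on:
      - zcashd
    # Restarting lwd when it dies papers over the fatal-on-rpc-error
    # behavior discussed later in this thread.
    restart: unless-stopped
```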

@LarryRuane
Collaborator

We could certainly improve the error message! I can make a PR.

@LarryRuane
Collaborator

This area is messy right now. What I think happened is that zcashd died while lightwalletd had a getbestblockhash rpc outstanding. (It calls that rpc every 2 seconds in normal operation in order to detect a new block.) I was able to reproduce it once:

{"app":"lightwalletd","level":"info","msg":"getblock 1813132 failed, will retry: error requesting block: invalid http POST response (nil), method: getblock, id: 2381, last error=Post \"http://127.0.0.1:8232\": dial tcp 127.0.0.1:8232: connect: connection refused","time":"2022-10-20T11:00:07-06:00"}
{"app":"lightwalletd","error":"invalid http POST response (nil), method: getbestblockhash, id: 2382, last error=Post \"http://127.0.0.1:8232\": dial tcp 127.0.0.1:8232: connect: connection refused","level":"fatal","msg":"error zcashd getbestblockhash rpc","time":"2022-10-20T11:00:37-06:00"}
Lightwalletd died with a Fatal error. Check logfile for details.

One solution is to always restart lightwalletd if zcashd has died.

The lightwalletd server (lwd) does a series of retries when an rpc to zcashd fails, but it's not consistent, and you'll get timing-dependent variations in behavior. For example, if you start zcashd when it has many blocks to sync and start lwd right after, lwd will retry rpcs a limited number of times and then give up. But if zcashd happens to already be synced to the tip (or close to it), then it will -- usually! -- become responsive to rpcs before lwd's retries hit the limit.

I think a better design is to make lightwalletd retry rpcs forever. The only possible problem I can imagine is if some monitor process is watching that lwd is running (starting it if it isn't) but is not also monitoring the local zcashd. But that or some other process should be monitoring zcashd and restarting it if necessary; I'd think that would be required anyway, since zcashd can die for a wide variety of reasons. So I think making this "infinite retry" change to lightwalletd would be a good idea. It would also simplify its code and make its behavior more understandable and consistent, and it would have prevented this issue from occurring.

This would also make lightwalletd more available to its clients (although of course some gRPCs could hang until they timeout, if zcashd is taking a long time to restart).

While infinitely retrying rpcs, lwd should log the failures, but it should be careful not to log too much (which would increase the danger of running the filesystem out of space). Probably it should log only the first failure, then log again when successful.

The other way to get consistent behavior is if lwd always dies when zcashd dies, but it's not possible to ensure that -- it's possible that lwd attempts no rpcs across a (fast) zcashd restart. So really the only way to get fully-consistent lwd behavior is for it to retry rpcs infinitely.

Maybe @defuse has some thoughts on this?
