
Upon restart, p2p connect-back or sync is very hard, plus RPC usually times out #836

Open
jun0tpyrc opened this issue Sep 30, 2024 · 20 comments
Labels
more-info (More information requested from issue reporter), triage

Comments

@jun0tpyrc

v1.0.1

The nodes were powerful enough to stay in sync before the reboot, but after the reboot everything seems to have changed.

They are hard to connect and keep redoing handshakes:

info  2024-09-30T09:56:43.473 net-3     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 26/26, block producer peers: 0
info  2024-09-30T09:56:45.948 net-1     net_plugin.cpp:2308           recv_handshake       ] ["0.0.0.0:9876:blk - 25c26e4" - 5 95.217.34.40:9876] handshake lib 396920011, fhead 396920013, id d3cd63d3f73b1b42.. sync 1, fhead 396904273, lib 396903272
info  2024-09-30T09:56:45.948 net-1     net_plugin.cpp:1604           operator()           ] ["0.0.0.0:9876:blk - 25c26e4" - 5 95.217.34.40:9876] Sending handshake generation 2059, lib 396903272, fhead 396904273, id 0e0150127f1cae4e
info  2024-09-30T09:56:46.359 net-2     net_plugin.cpp:2308           recv_handshake       ] ["eosn-eos-seed172:9876 - 216c673" - 9 209.249.216.152:9876] handshake lib 396920012, fhead 396920014, id a7ee3f4876e3821e.. sync 1, fhead 396904273, lib 396903272
info  2024-09-30T09:56:46.359 net-2     net_plugin.cpp:1604           operator()           ] ["eosn-eos-seed172:9876 - 216c673" - 9 209.249.216.152:9876] Sending handshake generation 2591, lib 396903272, fhead 396904273, id 0e0150127f1cae4e

This has been going on for hours (the p2p port is already open to the world).

@jun0tpyrc
Author

We have 3 or 4 nodes hitting this. Even when reimporting from almost brand-new snapshots, they are still hard to connect.

@alecalve

I am encountering the same issues

@heifner
Member

heifner commented Sep 30, 2024

See #829. The handshake messages are harmless, but they do make it hard to see what the real issue is if anything. Can you provide your complete log?

@heifner added the more-info label on Sep 30, 2024
@FoieringSerjo

Same here.

The sync is very slow, and when it reaches specific blocks, the node cannot establish a connection at all.


@spoonincode
Member

In my experience this is what logs can look like when nodeos is exceptionally slow applying blocks due to low memory or poor disk performance. Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?
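One quick way to check (a standard procps invocation, nothing nodeos-specific):

ps -o pid,stat,comm -C nodeos

A STAT of D means uninterruptible sleep, which almost always means the process is blocked on disk IO.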

@FoieringSerjo

FoieringSerjo commented Sep 30, 2024

> In my experience this is what logs can look like when nodeos is exceptionally slow applying blocks due to low memory or poor disk performance. Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?

Indeed, it is always listed with status D.

[screenshot of top output showing nodeos in state D]

However, I definitely have more than enough memory: it uses 2 GB, and I have 64 GB.

I will check the disk; however, it's hard to believe that this is the root cause, since the node worked perfectly until it was upgraded to the new version.
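If the sysstat package is available, a generic way to confirm whether the disk is the bottleneck is to watch extended device statistics while nodeos is syncing:

iostat -x 1

A device holding the state/blocks directories sitting near 100% utilization while nodeos is in state D points at disk IO rather than memory.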

@jun0tpyrc
Author

> Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?

> Indeed, it is always listed with status D. However, I definitely have more than enough memory: it uses 2 GB, and I have 64 GB.

Yes, it's D, even stuck. Same memory spec here.
Our nodes were perfectly in sync and feeding the Hyperion indexer without fail BEFORE the reboot, but after the reboot they output few useful logs other than p2p/handshake and are not really serving RPC.

@jun0tpyrc
Author

jun0tpyrc commented Oct 1, 2024

Last 100000 lines of logs from one of the stuck nodes. It should cover the period in which I restarted it a few times and tried various parameter changes to see if they helped.

last100000log-eos.txt

@heifner
Member

heifner commented Oct 1, 2024

Thanks for the log.

Unfortunately you are hitting #830, which will be fixed in 1.0.2.
As a workaround, you can remove any peers that you are failing to connect to:

18.234.6.119:80
fullnode.eoslaomao.com
p2p.meet.one

info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to fn001.eossv.org:443, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to seed2.greymass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to seed1.greymass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to node1.eoscannon.io:59876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to node.eosflare.io:1883, Element not found
info  2024-09-30T11:05:52.788 net-0     net_plugin.cpp:2833           operator()           ] connection failed to publicnode.cypherglass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-1     net_plugin.cpp:2833           operator()           ] connection failed to bp.eosbeijing.one:8080, Element not found
info  2024-09-30T11:05:52.788 net-3     net_plugin.cpp:2833           operator()           ] connection failed to peer2.eosphere.io:9876, Element not found

etc.

Until 1.0.2, just remove all but one, two, or at most three good peers.
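For example, a trimmed peer section of config.ini would look something like the following (the hostnames here are placeholders, not peer recommendations):

# keep only a handful of reliable peers until 1.0.2
p2p-peer-address = peer1.example.com:9876
p2p-peer-address = peer2.example.com:9876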

@jun0tpyrc
Author

jun0tpyrc commented Oct 1, 2024

Even with 1-3 peers it's still pretty easy to hit the same issues on our side after a few attempts.

For example, when using a snapshot plus a single peer that previously worked for a rebuild (and did not show up in the error logs before), we may still encounter something like the below:

info  2024-10-01T08:06:40.385 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync
info  2024-10-01T08:07:09.662 net-0     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.662 net-3     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.900 net-2     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 1, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 2, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2017           set_state            ] old state in sync becoming lib catchup
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2117           request_next_chunk   ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Catching up with chain, our last req is 0, theirs is 397073291, next expected 396068750, fhead 396068749
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2156           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] requesting range 396068750 to 396069249, fhead 396068749, lib 396068749
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:3438           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Local network version: 9
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:2308           recv_handshake       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] handshake lib 397073291, fhead 397073293, id 41da1e90867fe683.. sync 1, fhead 396068749, lib 396068749
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:3506           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] received go_away_message, reason = some other non-fatal condition, possibly unknown block
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:1463           _close               ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] closing
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2139           request_next_chunk   ] Unable to continue syncing at this time
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync

It turns out it never finishes the import->catchup flow, and if you shut the node down at this stage it gets corrupted and the import flow has to be rerun.

===
This would be a big obstacle for anyone who needs to maintain private nodes stably and continuously behind a dapp/exchange, as in our usage. It feels risky to reboot things with the current version, and even a fresh rebuild may not be as seamless as before.

@blocknodehub

> Even with 1-3 peers it's still pretty easy to hit the same issues on our side after a few attempts.

Our nodes have the same problem; they have become very unstable since upgrading to Spring.

@heifner
Member

heifner commented Oct 1, 2024

Please DM me complete logs and your config.ini. I would like to try to reproduce this.

@the-smooth-operator

We are having a similar issue. Nodes are struggling to process blocks, and the RPC API stops responding. Snapshots take a long time to process and, once processed, the node doesn't start syncing up blocks.

We have noticed a huge increase in data per block on Friday and again today.

What is the recommended configuration for running a node that can keep up? Looking for advice on modifications to config.ini, such as the number of net/http threads, etc.

@the-smooth-operator

> the node gets corrupted and the import flow has to be rerun

This corruption seems to happen because the process is so busy that it doesn't catch SIGTERM, so it can't finish gracefully.

@heifner
Member

heifner commented Oct 1, 2024

It will respond to SIGTERM, it just takes a while. When I tried stopping my syncing node, it took 9 minutes to respond to SIGTERM.
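Note that if nodeos runs under systemd, the default stop timeout of 90 seconds will SIGKILL it long before a 9-minute graceful shutdown completes. Something along these lines in the service unit avoids the forced kill (an assumption about the deployment, not something discussed in this thread):

[Service]
KillSignal=SIGTERM
TimeoutStopSec=900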

@the-smooth-operator

@heifner could you share a config.ini for a high-performing (e.g. z1d.6xlarge) RPC node?

@heifner
Member

heifner commented Oct 1, 2024

> @heifner could you share a config.ini for a high-performing (e.g. z1d.6xlarge) RPC node?

If not also being used as a P2P relay node for transactions:
p2p-accept-transactions=false

Since you have plenty of RAM, you can run with:
chain-state-db-size-mb = 180000
database-map-mode = heap
Startup/shutdown is slower, but it runs faster. You could also use:
database-map-mode = mapped_private
which provides fast startup and a slower shutdown than the default, but does not use disk IO the way the default does.

You also have plenty of cores, so:
net-threads = 6
chain-threads = 6

Keep your peer list small, just a small number of reliable peers.

If you are not processing read-only-transactions and not running a public node:
read-only-threads = 0
Although I don't think this makes much, if any, difference.

If you are not running a public node and are not using /v1/chain/get_block, /v1/chain/get_block_info, /v1/chain/get_block_header_state, /v1/chain/get_raw_block, or /v1/chain/get_block_header, then you can run without a block log:
block-log-retain-blocks = 0
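Pulling the above together, a sketch of the relevant config.ini lines for this kind of non-public RPC node (values exactly as suggested in this comment; the peer hostname is a placeholder):

# not a P2P relay for transactions
p2p-accept-transactions = false
# plenty of RAM
chain-state-db-size-mb = 180000
# or mapped_private for faster startup
database-map-mode = heap
# plenty of cores
net-threads = 6
chain-threads = 6
# no read-only transactions, not a public node
read-only-threads = 0
# get_block* endpoints not needed, so no block log
block-log-retain-blocks = 0
# small list of reliable peers
p2p-peer-address = peer1.example.com:9876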

@jun0tpyrc
Author

jun0tpyrc commented Oct 2, 2024

We found that p2p-accept-transactions=false is the key config here to fix the p2p issue during the initial import of a recent snapshot; syncing was smooth until reaching the chain tip.

The RPC thread timeout may still occur after importing multiple days' worth of data. I haven't fully examined catching up from an older snapshot, which I need to do to eliminate a data gap, but that is a much more minor issue.

> It will respond to SIGTERM, it just takes a while. When I tried stopping my syncing node, it took 9 minutes to respond to SIGTERM.

I wonder if this graceful-stop time could be improved (thread priority/race?). In the old version, even on a moderately configured machine, it only took 1 or at most 2 minutes as I recall.

@heifner
Member

heifner commented Oct 2, 2024

> I wonder if this graceful-stop time could be improved (thread priority/race?). In the old version, even on a moderately configured machine, it only took 1 or at most 2 minutes as I recall.

Yes, we have a plan to fix it: #284
