Node Issue: freshly setup backup node can not run with 32GB RAM #12171

Open
wackazong opened this issue Sep 30, 2024 · 6 comments
Labels
community Issues created by community investigation required Node Node team

Comments

@wackazong

Contact Details

[email protected]

Node type

Top 100 Validator

Which network are you running?

mainnet

What happened?

I am currently trying to get a new backup node started. I followed the requirements for the new backup node scenario and downloaded a new snapshot. My machine has 15 cores, 32GB RAM, and four fast SSDs in RAID0.

I start neard with the backup config recommended for scenario 2 (the new one).

  "tracked_shards": [],
  "tracked_shadow_validator": "myvalidatorname.poolv1.near",
  "state_sync_enabled": true,
  "load_mem_tries_for_tracked_shards": true,

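A single missing quote or comma makes a JSON config invalid, so a quick syntax check before restarting neard can rule that out as a cause. The following is a minimal sketch, assuming the fragment above sits inside a top-level JSON object in the node's config file; the function name `check_config_text` is hypothetical and not part of neard.

```python
import json


def check_config_text(text):
    """Return (ok, message) for a JSON config passed in as a string.

    Hypothetical helper: it only checks JSON syntax and that the
    top-level value is an object. It does not validate neard options.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the place where parsing failed.
        return False, f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(config, dict):
        return False, "top-level value is not a JSON object"
    return True, "ok"
```

To use it, read the config file into a string and pass it in, for example `check_config_text(open(path).read())`, where `path` is wherever the node's `config.json` lives on your machine.
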
First, it downloads headers. Then, as soon as the headers are downloaded, I get these lines in the log. I never get a log message indicating that any blocks have been downloaded to catch up with the network.

WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

Then, after 30 minutes or so, I get this:

Sep 30 09:12:07 validator-b systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Failed with result 'oom-kill'.
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Consumed 2h 16min 6.405s CPU time.
Sep 30 09:12:39 validator-b systemd[1]: neard.service: Scheduled restart job, restart counter is at 27.
Sep 30 09:12:39 validator-b systemd[1]: Stopped Run a NEAR protocol node.
Sep 30 09:12:39 validator-b systemd[1]: neard.service: Consumed 2h 16min 6.405s CPU time.
Sep 30 09:12:39 validator-b systemd[1]: Started Run a NEAR protocol node.
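To see how often this happens over a longer period, the systemd lines can be summarized with a short script. This is only a sketch: the function name `summarize_journal` is hypothetical, and the two patterns are taken from the message wording in the excerpt above, so they may need adjusting for other systemd versions.

```python
import re

# Patterns copied from the wording of the systemd messages shown above.
OOM_RE = re.compile(r"Failed with result 'oom-kill'")
RESTART_RE = re.compile(r"restart counter is at (\d+)")


def summarize_journal(lines):
    """Count OOM failures and report the last restart counter seen."""
    oom_kills = 0
    restart_counter = None
    for line in lines:
        if OOM_RE.search(line):
            oom_kills += 1
        match = RESTART_RE.search(line)
        if match:
            restart_counter = int(match.group(1))
    return {"oom_kills": oom_kills, "restart_counter": restart_counter}
```

One way to feed it is to pipe `journalctl -u neard.service --no-pager` into a script that calls `summarize_journal(sys.stdin)`.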

What can I do to help analyse this issue?

Version

2.2.1

Relevant log output

see above

Node head info

2024-09-30T13:28:21.112530Z  WARN genesis: Skipped genesis validation
2024-09-30T13:28:21.112561Z  WARN genesis: Skipped genesis validation
thread 'main' panicked at tools/state-viewer/src/cli.rs:144:55:
called `Result::unwrap()` on an `Err` value: DbDoesNotExist
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: state_viewer::cli::StateViewerSubCommand::run
   4: neard::cli::NeardCmd::parse_and_run
   5: neard::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aborted

Node upgrade history

Now

DB reset history

Now
@wackazong wackazong added community Issues created by community investigation required Node Node team labels Sep 30, 2024
@telezhnaya
Contributor

Config: https://ctxt.io/2/AAC4C18fFg

@telezhnaya
Contributor

May be related: #11927

@wackazong
Author

I added generous swap and now the node is downloading blocks.

@wackazong
Author

I did some memory monitoring while setting up another node from a fresh snapshot. Strangely enough, memory usage never exceeded 30% on a 32GB machine, even though I had seen the same behaviour on that machine before.
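The comment does not say how memory was monitored. One option on Linux is to sample `/proc/<pid>/status`, which reports the current resident set size (`VmRSS`), the peak resident set size (`VmHWM`), and swapped-out memory (`VmSwap`) in kB. The peak value is useful because a short spike between two samples would otherwise go unnoticed. A minimal sketch follows; the function names are hypothetical.

```python
def parse_proc_status(text):
    """Extract VmRSS, VmHWM and VmSwap (in kB) from /proc/<pid>/status text."""
    wanted = ("VmRSS", "VmHWM", "VmSwap")
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "VmRSS:   8000000 kB"; keep the number only.
            values[key] = int(rest.split()[0])
    return values


def read_process_memory(pid):
    """Read the memory fields for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_proc_status(status_file.read())
```

Calling `read_process_memory(pid)` with the neard process ID at regular intervals and logging the result gives a record to compare against the time of an OOM kill.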

@wackazong
Author

wackazong commented Oct 1, 2024

Overnight, the second node started showing the error while downloading blocks and subsequently failed:

Oct 01 00:09:50 validator-a neard[626278]: 2024-09-30T22:09:50.794404Z  INFO stats: #129220486 Downloading blocks 37.22% (47816 left; at 129220486) 31 peers ⬇ 7.23 MB/s ⬆ 5.43 MB/s 3.00 bps>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.971708Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.978342Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:52 validator-a neard[626278]: 2024-09-30T22:09:52.081806Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:10:00 validator-a neard[626278]: 2024-09-30T22:10:00.941000Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:23 validator-a neard[626278]: 2024-09-30T22:35:23.339729Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:24 validator-a neard[626278]: 2024-09-30T22:35:24.342663Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:25 validator-a neard[626278]: 2024-09-30T22:35:25.348669Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:26 validator-a neard[626278]: 2024-09-30T22:35:26.355773Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:27 validator-a neard[626278]: 2024-09-30T22:35:27.458409Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:28 validator-a neard[626278]: 2024-09-30T22:35:28.969291Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:29 validator-a systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Failed with result 'oom-kill'.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.
Oct 01 00:36:01 validator-a systemd[1]: Stopped Run a NEAR protocol node.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.

The setup was 32GB of RAM and 128GB of swap.

@wackazong
Author

I had multiple backup servers with the same node key running. This might have affected the issue presented here.


3 participants