fix link to cli cheatsheet, update trouble shooting for onboarding (#971)
ben-zeng authored Feb 5, 2022
1 parent e142a7f commit 2c115ae
Showing 5 changed files with 102 additions and 10 deletions.
2 changes: 1 addition & 1 deletion ol/documentation/README.md
@@ -1,6 +1,6 @@
# 0L Documentation
## CLI
[CLI Cheatsheet](cli_cheatsheet.md)
[CLI Cheatsheet](../../ol/cli/README.md)

## Running Tower, i.e Mining.

86 changes: 82 additions & 4 deletions ol/documentation/node-ops/validators/troubleshoting_onboarding.md
@@ -5,13 +5,18 @@
2. [Tower start: Epochs not consecutive](#issue-2)
3. [DB should open](#issue-3)
4. [Tower start: EOF error](#issue-4)
5. [When trying to start the tower app, a response was received from an upstream_node but not a remote tower state](#issue-5)
6. [NoAvailablePeers after the fullnode (diem-node) has been started](#issue-6)
7. [When trying to start the web monitor, Connection Failed: Connection refused (os error 111)](#issue-7)

## <a id="issue-1"></a> Issue "Too many open files"
### Validator logs show "Too many open files" or "File descriptor limit exceeded" error and tower app is stopped

The validator is not voting or syncing and the file descriptor limit exceeded error is shown in the node logs. Tower app is stuck at a proof unable to submit the transaction with ".....".

**Temporary Solution:** This is an issue with the file descriptor limit being exceeded.
**Temporary Solution:**

This is an issue with the file descriptor limit being exceeded.

1. Go to the **diem-node.service** template you are using for starting the node.
2. Update **LimitNOFILE=200000**
@@ -35,7 +40,9 @@ Run with RUST_BACKTRACE=full to include source snippets.
```

**Solution:** It is possible that waypoint that tower app is using might be wrong. You can check the waypoint by looking at the logs on tower start.
**Solution:**

The waypoint that the tower app is using might be wrong. You can check the waypoint by looking at the logs on tower start.
```
Waypoint: No waypoint parsed from command line args. Searching for waypoint in key_store.json
[tower/src/config.rs:...] &value = "xxxxx"
@@ -62,12 +69,83 @@ You may be running `diem-node` in a separate process. In that case there may be

## <a id="issue-4"></a> Issue: Tower App start: EOF error

**Problem:** Validator node ran out of space and tower app created empty block_.json file. When starting tower app below error is observed:
**Problem:**

The validator node ran out of space and the tower app created an empty block_.json file. When starting the tower app, the error below is observed:

```
Message: Error EOF while parsing a value
Location: tower/src/block.rs:....
```
**Solution:** Check if the last block proof created is empty. If so, remove the file and start the tower app again. This is after the node is caught up on the network.
**Solution:**

Check whether the last block proof created is empty. If so, remove the file and start the tower app again. Do this only after the node has caught up with the network.
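
The check for an empty proof file can be sketched in shell. This is a minimal sketch, not part of the tower tooling: the function name `find_empty_proofs` is hypothetical, the directory holding the proofs is passed in as an argument because its location depends on your setup, and the `block_*.json` pattern is an assumption based on the file name mentioned above. It only lists candidates and deletes nothing.

```shell
# Hypothetical helper: print every zero-byte block_*.json file
# directly inside the directory given as $1 (assumed to be where
# your tower app writes its proofs).
find_empty_proofs() {
  find "$1" -maxdepth 1 -type f -name 'block_*.json' -size 0
}

# Example (path is a placeholder for your own proofs directory):
#   find_empty_proofs /path/to/your/proofs
```

Review the printed file before removing it by hand, then start the tower app again.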

## <a id="issue-5"></a> Issue: Received response but no remote state found

**Problem:**

When trying to start the tower app, a response was received from an upstream_node but not a remote tower state.

```
Message: called `Result::unwrap()` on an `Err` value: Info: Received response but no remote state found. Exiting.
Location: ol/tower/src/backlog.rs:27
```
**Solution:**

There could be an issue with the address used for the validator, or an issue with mining the genesis block.
If you have another address that has already been onboarded and is being used for mining normally, you can try the following:

- change the `account` and `auth_key` set in `~/.0L/0L.toml` under `[profile]` to the details of your existing address.
- start tower again with `tower start`, `tower -u <upstream-ip-address> start` or with whatever flags you were trying to start tower with.
- you should now see `Mining VDF Proof # 1` or similar in the logs.
- If this works, you should use this address as the validator. To do this, you will need to follow step 2.2 onwards again from [the hard onboarding guide](../../node-ops/validators/validator_onboarding_hard_mode.md#2-generate-account-keys), making sure to use the mnemonic of the working address when using `onboard val`

Alternatively, you can try a brand new address:
- make sure it is onboarded to the network already (not validator yet). If you are able to mine using Carpe on another device, it's highly likely this address will work.
- change the `account` and `auth_key` set in `~/.0L/0L.toml` under `[profile]` to the details of your new address.
- start tower again with `tower start`, `tower -u <upstream-ip-address> start` or with whatever flags you were trying to start tower with.
- you should now see `Mining VDF Proof # 1` or similar in the logs.
- If this works, you should use this address as the validator. To do this, you will need to follow step 2.2 onwards again from [the hard onboarding guide](../../node-ops/validators/validator_onboarding_hard_mode.md#2-generate-account-keys), making sure to use the mnemonic of the working address when using `onboard val`

If this doesn't work, the problem may also be caused by issue 7: [When trying to start the web monitor, Connection Failed: Connection refused (os error 111)](#issue-7)
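
Before restarting tower, it can help to confirm which address the config currently points at. The following is a minimal sketch under stated assumptions: the helper name `profile_value` is hypothetical, it assumes the key is spelled `account` in the file and that each key sits on a single `key = "value"` line, and it does not check that the key is actually inside the `[profile]` section.

```shell
# Hypothetical helper: print the value of a `key = "value"` line.
#   $1 = path to the config file (for example ~/.0L/0L.toml)
#   $2 = key name (for example account or auth_key)
# Takes the first matching line, keeps everything after the first '=',
# and strips spaces and double quotes.
profile_value() {
  grep -E "^[[:space:]]*$2[[:space:]]*=" "$1" | head -n 1 | cut -d '=' -f 2- | tr -d ' "'
}

# Example:
#   profile_value "$HOME/.0L/0L.toml" account
#   profile_value "$HOME/.0L/0L.toml" auth_key
```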

## <a id="issue-6"></a> Issue: NoAvailablePeers after the fullnode (diem-node) has been started.

**Problem:**

After a fullnode (diem-node) has started, the logs repeatedly report `NoAvailablePeers` errors from state-sync:
```
2022-02-05T11:55:46.052036Z [state-sync] ERROR state-sync/src/coordinator.rs:219 {"error":"NoAvailablePeers(\"No peers to send chunk request to!\")","event":"fail","name":"progress_check"}
2022-02-05T11:55:46.551657Z [state-sync] WARN state-sync/src/coordinator.rs:1514 {"name":"timeout","version":23877219}
2022-02-05T11:55:46.551690Z [state-sync] WARN state-sync/src/coordinator.rs:1567 {"event":"missing_peers","name":"send_chunk_request"}
2022-02-05T11:55:46.551818Z [state-sync] ERROR state-sync/src/coordinator.rs:1547 {"error":"NoAvailablePeers(\"No peers to send chunk request to!\")","event":"send_chunk_request_fail","local_epoch":102,"name":"timeout","version":23877219}
2022-02-05T11:55:46.551895Z [state-sync] ERROR state-sync/src/coordinator.rs:219 {"error":"NoAvailablePeers(\"No peers to send chunk request to!\")","event":"fail","name":"progress_check"}
```
**Solution:**

Run `ol restore`.

The issue may have occurred because the configs were not fully updated after changing the address or mnemonic associated with the fullnode. `ol restore` will set the waypoint and update the `key_store.json`.


## <a id="issue-7"></a> Issue: When trying to start the web monitor, Connection Failed: Connection refused (os error 111)

**Problem:**

When starting the web_monitor, you may get `Connection Failed: Connection refused (os error 111)`.

You may also be seeing issue 5, [When trying to start the tower app, a response was received from an upstream_node but not a remote tower state](#issue-5). The solution here could fix that too.


```
can make client but could not get metadata Error { inner: Inner { kind: Request, source: Some(ConnectionFailed("Connection refused (os error 111)")), json_rpc_error: None } }
Caused by:
Connection Failed: Connection refused (os error 111)
ERROR: could not create client connection, message: Cannot connect to any JSON RPC peers in the list of upstream_nodes in 0L.toml
```
**Solution:**

There may be a bad IP address in `upstream_nodes` in `~/.0L/0L.toml`. Try changing the IP address there to one from [here](https://github.com/OLSF/carpe/blob/main/seed_peers/fullnode_seed_playlist.json).
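
To see which upstream addresses are currently configured, the `upstream_nodes` entry can be printed one URL per line. This is a minimal sketch, not an official tool: the helper name `list_upstream_nodes` is hypothetical, and it assumes `upstream_nodes` is written as a single-line array of quoted URLs. A multi-line array would need a real TOML parser.

```shell
# Hypothetical helper: print each URL from a single-line
# `upstream_nodes = ["...", "..."]` entry in the file given as $1.
list_upstream_nodes() {
  sed -n 's/^[[:space:]]*upstream_nodes[[:space:]]*=//p' "$1" \
    | tr -d '[]" ' \
    | tr ',' '\n'
}

# Example:
#   list_upstream_nodes "$HOME/.0L/0L.toml"
```

Each printed address can then be compared against the seed peer list linked above.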
::
@@ -129,10 +129,14 @@ to place in your first proof.
2.2. Run the validator onboarding wizard inside a `tmux` session, and answer questions:

```
# start wizard
# start wizard with template
onboard val -u http://<ip-address-of-the-one-who-onboards-you>:3030
# without template, note: assumes an autopay_batch.json is in the project root.
# note, this person needs to be already running a validator, ask in the discord for their ip address. If you navigate to <ip-address>:3030, you should be able to see their validator node's health.
OR
# start wizard without template, note: assumes an autopay_batch.json is in the project root.
onboard val
```

Expand Down Expand Up @@ -270,6 +274,12 @@ ol serve -c
```
---

## Onboarder Troubleshooting
If you are having trouble onboarding, check whether your problem matches any of the issues here:
[troubleshooting onboarding](../../node-ops/validators/troubleshoting_onboarding.md)



## Onboarder instructions
If you are onboarding someone and have received their `account.json` file
1. Copy the `account.json` to your local node.
@@ -291,4 +301,4 @@ Follow those instructions or reset your terminal.
To configure your current shell, run:
```
source $HOME/.cargo/env
```
```
4 changes: 4 additions & 0 deletions ol/documentation/node-ops/validators/web_monitor.md
@@ -28,6 +28,10 @@ Start the web monitor in this tmux session by the following command:
ol serve -c
```

## Troubleshooting

You may see `Connection Failed: Connection refused (os error 111)` when trying to start the monitor. If you do, check out the troubleshooting steps here

## How to access it

Once the web monitor is running, you can access it in your browser at the following URL:
4 changes: 2 additions & 2 deletions ol/tower/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Miner
# Tower

Miner is an application.
Tower is an application.

## Getting Started

