Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node's ready status is incorrect while the node is starting #5923

Open
stmax82 opened this issue Jan 27, 2024 · 2 comments
Open

Node's ready status is incorrect while the node is starting #5923

stmax82 opened this issue Jan 27, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@stmax82
Copy link

stmax82 commented Jan 27, 2024

To determine if a node is ready / fully synched I'm calling the python client's ready() function. Internally this is probably making a HTTP GET /ready request.

I noticed that a few seconds after starting the node, ready returns OK instead of status 503. Only after a while it switches to status 503 and then back to OK once it is fully synched.

Here's a python script to reproduce the problem:

import algosdk
import time

ALGOD_TOKEN="XXX"
ALGOD_SERVER="http://localhost:8080"

def main():
    client = algosdk.v2client.algod.AlgodClient(ALGOD_TOKEN, ALGOD_SERVER)
    while True:
        try:
            client.ready()
            print(f"node is ready")
        except algosdk.error.AlgodHTTPError as e:
            print(f"node is not ready, status {e.code}")
        except Exception as e:
            print(f"connection failed: {e}")
        time.sleep(1)

if __name__ == "__main__":
    main()

To reproduce the problem:

  1. Stop the node
  2. Start the script
  3. Start the node

You'll see the following output:

connection failed: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>
connection failed: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>
connection failed: Remote end closed connection without response
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is ready <-- this is wrong
node is not ready, status 503 <-- now it figured out that it's actually not ready yet...
node is not ready, status 503
node is not ready, status 503
node is not ready, status 503
node is not ready, status 503
node is ready <-- now it's really ready
node is ready
node is ready
node is ready
node is ready

As you can see it incorrectly starts with "node is ready", then switches to "node is not ready, status 503" and finally back to "node is ready".

I also repeatedly ran goal node status while the node was starting and it displayed the following (note entry number 4):

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.0s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

At this point, client.ready() started returning status 503 (not ready):

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599045
Time since last block: 0.0s
Sync Time: 0.7s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599046
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

root@3ab791cddba2:/algod# goal node status
Last committed block: 35599047
Time since last block: 0.3s
Sync Time: 24.9s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/925a46433742afb0b51bb939354bd907fa88bf95
Round for next consensus protocol: 35599048
Next consensus protocol supported: true
Last Catchpoint:
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=
root@3ab791cddba2:/algod#

While the node was starting it shows "Sync Time: 0.0s". The client.ready() returns OK. Only once the sync time starts increasing, the client.ready() returns status 503 (not ready).

Your environment

  • Software version: 3.21.0-stable
  • Node status if applicable: weird
  • Operating System details: Node running on Docker on Windows, mainnet, FAST_CATCHUP=1, PROFILE=participation, image algorand/algod:stable
@stmax82 stmax82 added the bug Something isn't working label Jan 27, 2024
@algorandskiy
Copy link
Contributor

It appears to be SDK issue

@AustP
Copy link
Contributor

AustP commented Jul 23, 2024

Has the /ready endpoint gotten worse? I use the endpoint in A1CN to know if I need to show the syncing page or the dashboard page. I used to have a timeout on startup to skip the first section of incorrect reporting. It's not ideal, but it worked.

Lately, however, the /ready endpoint has reported that the node is ready after doing a fast catchup but before being fully caught up. There's usually a few thousand blocks that need to be downloaded after a catchup. It used to be that /ready would report that the node wasn't ready while those needed to be downloaded. Now (v3.25.0-stable) /ready reports ready during that final stretch.

Note that I'm not using the SDK. I'm making my own REST calls to the node.

@gmalouf gmalouf assigned gmalouf and unassigned jasonpaulos Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants