Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SRE Testing: Investigate some aberrant behavior #4768

Open
cspramit opened this issue Jun 27, 2024 · 6 comments
Open

SRE Testing: Investigate some aberrant behavior #4768

cspramit opened this issue Jun 27, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@cspramit
Copy link
Contributor

Below is the test output ,
Engineering: please look under the section "some different behaviours observed" to investigate the aberrant behavior , if it has any substance.

SRE @AJ : Please look at the SSE event not reporting data further and update the comments on the ticket

Network: 100 + 20
Name: ajith-condor-big (8s)
Branch: feat-2.0
Commit: ee9c6de
Config:
minimum_block_time = '8192 ms'

block_gas_limit = 3_300_000_000_000

native_mint_lane = [0, 1024, 1024, 65_000_000_000, 325]
native_auction_lane = [1, 2048, 2048, 362_500_000_000, 75]
wasm_lanes = [[2, 1_048_576, 2048, 1_000_000_000_000, 1], [3, 344_064, 1024, 100_000_000_000, 10], [4, 172_032, 1024, 50_000_000_000, 15], [5, 12_288, 512, 1_500_000_000, 25]]
Tests: Mixed Load of V1 and V2 Transfers, WASM Transfers, Auction
Observations:
time spent at switch block now under 5seconds
allocated memory increases as new deploys in buffer
allocated memory spikes by +1G at switch block
at load when average memory was 1.1G, switchblock spiked to 2+ Gig
block timestamps were stable and under 8s through the loading and execution period
was able to hit 325 native transfers per block consistently
also observed Auction Lane saturated at 75 periodically (as per load and config)
also observed V1 wasm deploys allocated to Lane 3 at 10 (as per load and config)

some different behaviours observed
at switch block - the SSE events seem to be not reporting data as per grafana
seeing 2mins of missing data - (trying to investigate the SSE event outputs)
the casper-node log messages show blocks and deploys executions at this period
after all the pending deploys in the network were executed the block timestamps were periodically spiking to 10 to 20s
1 validate node stalled and later synced to tip in idle network
at later switch block eras the finalisation times starting spiking to 18+ seconds
some nodes were running in 'debug' mode so it could have an IO impact on the node itself
Grafana: link
Dumps: http://genesis.casperlabs.io/ajith-condor-big/casper-node-dumps/ajith-condor-big/27062024_0401/dump_download_list.html
DB Snapshot:
genesis.casperlabs.io/release-test-db-snapshots/ajith-condor-big/db.json
genesis.casperlabs.io/release-test-db-snapshots/ajith-condor-big/db.tar.zst
Nodes (debug) - 172.142.59.225, 172.142.71.238, 172.142.94.215

@cspramit cspramit added the bug Something isn't working label Jun 27, 2024
@cspramit
Copy link
Contributor Author

cspramit commented Jun 27, 2024

Slack Discussions : https://casper-labs-team.slack.com/archives/C01ULKL8G8J/p1719463830035189
@thamps-casp cc Pleae investigate the sse events

@cspramit cspramit assigned thamps-casp and unassigned rafal-ch Jul 1, 2024
@thamps-casp
Copy link

thamps-casp commented Jul 2, 2024

Did further investigations into the SSE and custom SSE exporter.

  1. It can be confirmed that SSE events are streaming correctly. No events were lost and it was confirmed by inspecting the events through custom scripts.
  2. There is a delay of SSE events streaming when the network reaches a switch block. The switch block event is streamed after delay of ~180 seconds and then the remainder of events starts flowing after. This delay causes the prometheus exporter export time and actual 'BlockAdded' event timestamp to be way out of sync.

ex.
BlockAdded Event Message Received: 2024-07-02 05:38:55.754281
BlockAdded Event Timestamp in Message: 2024-07-02T05:38:55.645Z

2024-07-02 05:38:55,754 - __main__ - DEBUG - 2024-07-02 05:38:55.754281 UTC: [BlockAdded]
2024-07-02 05:38:55,754 - __main__ - DEBUG - {'block_hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'block': {'Version2': {'hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'header': {'parent_hash': '1956145b563b52426883268c99522ea18f661c79b3fb0cd12703a252e15a8e65', 'state_root_hash': 'e23b3eafdd8b8c66a3f00cc0208dd47810e4eab937da141db375dd09634afe59', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': 'd526837f7cb26a87d7acd79122306b4c9db0bed6b451d2f89da5ff6c071e6a2e', 'era_end': None, 'timestamp': '2024-07-02T05:38:55.645Z', 'era_id': 10, 'height': 2195, 'protocol_version': '2.0.0', 'proposer': '0163e2867616bc81762e416d7ed9c1bc78a52ee7e7452cf2098a5332b6f476aea2', 'current_gas_price': 1, 'last_switch_block_hash': 'b5fae62e01f5894999b564d287d06d710c9065700c5f2581b4a334b608cfe8a4'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

Switch Block
BlockAdded Event Message Received: 2024-07-02 05:41:55.038500
BlockAdded Event Timestamp in Message: 2024-07-02T05:39:03.838Z (event is from past now as per prometheus)

2024-07-02 05:41:55,038 - __main__ - DEBUG - 2024-07-02 05:41:55.038500 UTC: [BlockAdded]
2024-07-02 05:41:55,039 - __main__ - DEBUG - {'block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'block': {'Version2': {'hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'header': {'parent_hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'state_root_hash': '2db60b43f6998a28b5fb94a803c1e164905b0de9be2be360e2477c233d01f135', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': 'ec2af6fe928c71c4f152dc4ed392d9220a37adaae700d561c20dddd32bb288dd', 'era_end': {'equivocators': [], 'inactive_validators': [], 'next_era_validator_weights': [...]},'rewards': {...},'next_era_gas_price': 1}, 'timestamp': '2024-07-02T05:39:03.838Z', 'era_id': 10, 'height': 2196, 'protocol_version': '2.0.0', 'proposer': '01a57f1bd15ef3370e2863591f19a55973a2665a90aa181bc14871878a235d4fae', 'current_gas_price': 1, 'last_switch_block_hash': 'b5fae62e01f5894999b564d287d06d710c9065700c5f2581b4a334b608cfe8a4'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

and thereafter till the time get into sync

2024-07-02 05:41:55,044 - __main__ - DEBUG - 2024-07-02 05:41:55.044877 UTC: [BlockAdded]
2024-07-02 05:41:55,045 - __main__ - DEBUG - {'block_hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'block': {'Version2': {'hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'header': {'parent_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'state_root_hash': '5dfb7e45790e15ec7d18809f8f4751151a29fa85a83e725f149e60b035ad5693', 'body_hash': '91d47c3500b87606b40fbc75eeb4840c214065a926af1d8a4e269384beed7c57', 'random_bit': True, 'accumulated_seed': '5c22c6234d801eef56b87c49b29ce929cec7e67b74712377417dfb9c77c98955', 'era_end': None, 'timestamp': '2024-07-02T05:39:16.299Z', 'era_id': 11, 'height': 2197, 'protocol_version': '2.0.0', 'proposer': '01d9fb3ae0f4e2798dbf82f14dee390adce67c003e470531cf17b7ca5fc31b9edd', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[247, 251, 183, 250, 253, 255, 223, 255, 255, 239, 255, 223, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}
2024-07-02 05:41:55,052 - __main__ - DEBUG - 2024-07-02 05:41:55.052568 UTC: [BlockAdded]
2024-07-02 05:41:55,052 - __main__ - DEBUG - {'block_hash': '5b9fd8a2c98a9746a791cfd4e75ffafd01da4ae846ec312402611476a9fd6b8a', 'block': {'Version2': {'hash': '5b9fd8a2c98a9746a791cfd4e75ffafd01da4ae846ec312402611476a9fd6b8a', 'header': {'parent_hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'state_root_hash': 'de861dd8c46fa0c867c1d073ee268a46f65cb7f7722d552337b4c5e62c46cf15', 'body_hash': '96d220874e8a5bd72c0597a140e6331d8188c1053140e9f1cb84d46e772420c9', 'random_bit': False, 'accumulated_seed': '710556ad739f0e27c19d55a9797d7a8619eed167522647e8f4e71487ebf1bc8c', 'era_end': None, 'timestamp': '2024-07-02T05:39:24.492Z', 'era_id': 11, 'height': 2198, 'protocol_version': '2.0.0', 'proposer': '018f0160018dc92571933192a52aff4989a9253c2b531d12ef45f8b1b2e7c1b528', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [8, 4, 72, 5, 2, 0, 32, 0, 0, 16, 0, 32, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}
...
2024-07-02 05:42:00,313 - __main__ - DEBUG - 2024-07-02 05:42:00.313370 UTC: [BlockAdded]
2024-07-02 05:42:00,313 - __main__ - DEBUG - {'block_hash': '0bcb2b0c01406044cb0ed1ab096ccd8d49c8c547e46b578e49b68cca5f711503', 'block': {'Version2': {'hash': '0bcb2b0c01406044cb0ed1ab096ccd8d49c8c547e46b578e49b68cca5f711503', 'header': {'parent_hash': '752d3d1a70931dafac09edb9f0ea71e5d89baf5a12c0141cab85e9e211483ead', 'state_root_hash': '0d4f938fdd541317db32e618381b728712afd6c9fb2eaeb972538f6a1252e1d6', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': '3970633503208969d9a49ad2f4d910dffa9aa85f5e9a7b18296ff94dd01290c3', 'era_end': None, 'timestamp': '2024-07-02T05:42:00.162Z', 'era_id': 11, 'height': 2217, 'protocol_version': '2.0.0', 'proposer': '01a75c228a5bbf7fbfd396a58900aa56c9ab7b65055a4ee6d7fe52ad0531440127', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

@cspramit
Copy link
Contributor Author

cspramit commented Jul 18, 2024

@alsrdn Alex S will look into

@thamps-casp
Copy link

Re-running the same investigation with latest 2.0 commit 9109c7b3910de54673c642c964a3a8036933332c

Network: 5 +2
Issue: Not observed. The SSE event monitoring does not show any delay of the switch block BlockAdded event. This maybe because of the small network. Next to test with a large 100 + 20 network.

@thamps-casp
Copy link

Ran test with a 60 + 10 network.
Note: 100 + 20 cannot be provisioned due to aws limits enforcement.

With this medium sized network - this issue is not seen.

Switch Block
BlockAdded Event Message Received: 2024-10-16T03:28:33.157Z
BlockAdded Event Timestamp in Message: 2024-10-16T03:28:15.272Z

@thamps-casp
Copy link

100 Validators + 1 - large network
Block Time: 8secs

  • added continous load to the network with wasm and transfer deploys
  • issue still not observed in the scale of issue raised. 30-40s delay only
  • SSE switch block event lag is under 30s with load.

Switch Block
BlockAdded Event Message Received: 2024-10-16T07:46:43.679Z
BlockAdded Event Timestamp in Message: 2024-10-16T07:46:14.898Z

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants