Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry issue #3393

Open
botnumberseven opened this issue Nov 4, 2024 · 2 comments
Open

Telemetry issue #3393

botnumberseven opened this issue Nov 4, 2024 · 2 comments

Comments

@botnumberseven
Copy link

botnumberseven commented Nov 4, 2024

I think current implementation of telemetry is not adequate to v8test node needs and at least partially can explain the misalignment between scoreboard values and node runners expectation based on node performance.

This if how I understand telemetry process, the data submission part of it (assuming default values in constats.js). Please correct me if I'm wrong.

Every 15mins node runs a query to fetch - the earliest 4000 completed/failed (or 5+ minutes old) operation_ids, then pulls all records from event table for these operation_ids, something like this:
WITH
ids as (
SELECT DISTINCT operation_id, timestamp
FROM event
WHERE name IN ('COMPLETED', 'FAILED') OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 5)
ORDER BY timestamp ASC
LIMIT 4000)

select * from event WHERE operation_id IN (select operation_id from ids)

Btw, according to comments value 5 here 1000 - 1000 * 60 * 5 should be 30. Not sure if it's typo in the code or comments are outdated.

Then node sends the output of this query to telemetry server and deletes these records from event table.

So healthy telemetry process submission implies the event table on the node is kept very small. In general it should contain only non-completed events, as all completed/failed (or old enough) are submitted thru telemetry and deleted every 15mins.

Checking our nodes I can clearly see this is not the case. All our nodes have records in event table dated 2024-10-31.
And these nodes were "refreshed" (MySQL and blazegraph were cleared) on 2024-10-30.
That means nodes have been running for 6 days, but they have not submitted any operational data thru telemetry beyond their 1st day of their running.

I'm not entirely sure what are the reasons here. I know the contributors potentially could be:

  1. unstable telemetry server, we see HTTP 500 responses from it quite often
  2. MySQL timeouts on telemetry related queries, our event tables accumulated ~10M rows in 6 days. And event table has no indexes to accommodate telemetry related queries.
  3. Maybe node does not actually executes telemetry submissions every 15mins for whatever reason. (I've enabled sql logger and I do see DELETE FROM event WHERE id IN... executed around every 15mins in the last hour).

My recommendation is to do a review of telemetry submission process, test it and verify it works according to the design. Also it needs to be tested under heavy loads, as they can be a contributor here as well.
Telemetry process has very limited, which makes it harder to debug it. Would be nice to include it, so it clearly says when it starts, how many operations it tries to submit, how many it successfully submitted, how many deleted, etc...

In addition I'd recommend to use one of running nodes (with tons of events already) and manually validate the process. E.g. in case of our nodes they have 5 days old event data, which should not be the case. What goes wrong with them?

@botnumberseven
Copy link
Author

One temporary workaround here could be prioritizing paranet related events in telemetry submission.

@botnumberseven
Copy link
Author

botnumberseven commented Nov 5, 2024

Oh boy, how could i miss it - the main root cause here is likely to be the original query:
SELECT DISTINCT operation_id, timestamp
FROM event
WHERE name IN ('COMPLETED', 'FAILED') OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 5)
ORDER BY timestamp ASC
LIMIT 4000

This thing does not return the first 4000 unique operation_ids (as i believe was originally planned), it returns the first 4000 unique pairs of operation_id + timestamp. Number of unique operation_ids could be (and is) an order of magnitude lower. That would explain why number of unique operation_ids added by the node every hour is below 16k, but telemetry can't keep up. Because it does not submit 4k events every 15mins, it submits around 400.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant