I think the current implementation of telemetry is not adequate for v8test node needs, and it can at least partially explain the misalignment between scoreboard values and node runners' expectations based on node performance.
This is how I understand the telemetry process, or rather its data submission part (assuming default values in constants.js). Please correct me if I'm wrong.
Every 15 minutes the node runs a query to fetch the earliest 4000 completed/failed (or 5+ minutes old) operation_ids, then pulls all records from the event table for these operation_ids, something like this:
WITH ids AS (
    SELECT DISTINCT operation_id, timestamp
    FROM event
    WHERE name IN ('COMPLETED', 'FAILED')
       OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 5)
    ORDER BY timestamp ASC
    LIMIT 4000
)
SELECT * FROM event WHERE operation_id IN (SELECT operation_id FROM ids)
Btw, according to the comments the value 5 in 1000 * 60 * 5 should be 30. Not sure if it's a typo in the code or the comments are outdated.
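If the comments are right, the cutoff would presumably look like this (a sketch of the 30-minute variant, not the actual shipped code):

    -- hypothetical 30-minute cutoff instead of 5 minutes
    WHERE name IN ('COMPLETED', 'FAILED')
       OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 30)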
Then the node sends the output of this query to the telemetry server and deletes these records from the event table.
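Based on the DELETE statements I see in the SQL log (quoted further down), the cleanup step presumably has roughly this shape; the exact statement is an assumption on my part:

    -- hypothetical cleanup step: delete by primary key id, as suggested by the
    -- "DELETE FROM event WHERE id IN ..." entries in the SQL log
    DELETE FROM event
    WHERE id IN (/* ids of the rows that were just submitted */)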
So a healthy telemetry submission process implies the event table on the node is kept very small. In general it should contain only events for operations that haven't completed yet, as all completed/failed (or old enough) ones are submitted through telemetry and deleted every 15 minutes.
Checking our nodes, I can clearly see this is not the case. All our nodes have records in the event table dated 2024-10-31.
And these nodes were "refreshed" (MySQL and Blazegraph were cleared) on 2024-10-30.
That means the nodes have been running for 6 days, but they have not submitted any operational data through telemetry beyond the first day of running.
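For reference, this is roughly the kind of check I ran on the nodes (a sketch, assuming timestamp is stored in milliseconds as in the fetch query above):

    -- how old is the backlog, and how many rows / distinct operations are stuck?
    SELECT
        FROM_UNIXTIME(MIN(timestamp) / 1000) AS oldest_event,
        COUNT(*)                             AS total_rows,
        COUNT(DISTINCT operation_id)         AS distinct_operations
    FROM event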
I'm not entirely sure what the reasons are here. Potential contributors could be:
- an unstable telemetry server: we see HTTP 500 responses from it quite often
- MySQL timeouts on telemetry-related queries: our event tables have accumulated ~10M rows in 6 days, and the event table has no indexes to accommodate the telemetry queries (see the index sketch after this list)
- maybe the node does not actually execute telemetry submissions every 15 minutes for whatever reason (I've enabled the SQL logger and I do see DELETE FROM event WHERE id IN ... executed roughly every 15 minutes over the last hour)
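For illustration, a minimal sketch of indexes that could help the telemetry queries, assuming the column names from the query above (I have not benchmarked these):

    -- separate indexes on name and timestamp let MySQL serve the OR condition via an index merge,
    -- and an index on operation_id speeds up the outer IN (...) lookup
    CREATE INDEX idx_event_name ON event (name);
    CREATE INDEX idx_event_timestamp ON event (timestamp);
    CREATE INDEX idx_event_operation_id ON event (operation_id);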
My recommendation is to review the telemetry submission process, test it, and verify it works according to the design. It also needs to be tested under heavy load, as that can be a contributor here as well.
The telemetry process has very limited logging, which makes it harder to debug. It would be nice to add logging so it clearly says when it starts, how many operations it tries to submit, how many were successfully submitted, how many were deleted, etc.
In addition, I'd recommend taking one of the running nodes (with tons of events already) and manually validating the process. E.g. in the case of our nodes, they have 5-day-old event data, which should not be the case. What goes wrong with them?
Oh boy, how could I miss it: the main root cause here is likely the original query:
SELECT DISTINCT operation_id, timestamp
FROM event
WHERE name IN ('COMPLETED', 'FAILED') OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 5)
ORDER BY timestamp ASC
LIMIT 4000
This thing does not return the first 4000 unique operation_ids (as I believe was originally planned); it returns the first 4000 unique pairs of operation_id + timestamp. The number of unique operation_ids could be (and is) an order of magnitude lower. That would explain why the number of unique operation_ids added by the node every hour is below 16k, yet telemetry can't keep up: it does not submit 4k operations every 15 minutes, it submits around 400.
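A minimal sketch of how the CTE could select the first 4000 distinct operation_ids instead of distinct operation_id + timestamp pairs (grouping on the earliest matching event per operation is my assumption about the intended semantics):

    WITH ids AS (
        SELECT operation_id, MIN(timestamp) AS first_seen
        FROM event
        WHERE name IN ('COMPLETED', 'FAILED')
           OR timestamp < (UNIX_TIMESTAMP() * 1000 - 1000 * 60 * 5)
        GROUP BY operation_id
        ORDER BY first_seen ASC
        LIMIT 4000
    )
    SELECT * FROM event WHERE operation_id IN (SELECT operation_id FROM ids)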