refactor: Speed up Historical Backfill process #379
Conversation
pub const INDEXED_ACTIONS_FILES_FOLDER: &str = "silver/accounts/action_receipt_actions/metadata";
pub const MAX_UNINDEXED_BLOCKS_TO_PROCESS: u64 = 7200; // two hours of blocks takes ~14 minutes.
This kind of served as an alarm for if something went wrong with the index file creation process. Since we are removing this, we should have some metric or error that checks if the latest index file failed to create. It doesn't need to be here or in this PR though.
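A check like the one suggested could be sketched as follows. This is a minimal, hypothetical example (the function name, threshold, and the idea of keying off the newest index file's last-modified timestamp are all assumptions, not the project's actual API): if the newest index file is older than expected, the creation job has probably failed and a warning should be raised.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical threshold: index files are assumed to be produced daily.
const MAX_INDEX_FILE_AGE: Duration = Duration::from_secs(60 * 60 * 24);

// Returns true when the newest index file looks stale, signalling that the
// index-file creation process may have silently failed.
fn index_file_is_stale(last_modified: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(last_modified) {
        Ok(age) => age > MAX_INDEX_FILE_AGE,
        // last_modified is in the future (clock skew): treat as fresh.
        Err(_) => false,
    }
}

fn main() {
    let now = SystemTime::now();
    let two_days_ago = now - Duration::from_secs(2 * 60 * 60 * 24);
    if index_file_is_stale(two_days_ago, now) {
        eprintln!("WARN: latest index file is stale; creation may have failed");
    }
}
```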
+1
This was intentional, forgot to comment on it sorry.
Now that we're using near-lake-framework, the BPS should be significantly faster. I'm not sure this should be a concern anymore. I'll add a warning log for now and we can add more later if this becomes a problem :)
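The warning-log idea could look roughly like this. It is only a sketch: the helper name is made up, and it reuses the `MAX_UNINDEXED_BLOCKS_TO_PROCESS` constant from the removed code as the warning threshold rather than as a hard cap.

```rust
// Former hard cap, repurposed here as a warning threshold (assumption).
const MAX_UNINDEXED_BLOCKS_TO_PROCESS: u64 = 7200;

// Hypothetical helper: instead of truncating the manual-filter range, log a
// warning when it exceeds the old cap, and return the full range size.
fn check_unindexed_range(last_indexed_block: u64, requested_block_height: u64) -> u64 {
    let unindexed = requested_block_height.saturating_sub(last_indexed_block);
    if unindexed > MAX_UNINDEXED_BLOCKS_TO_PROCESS {
        eprintln!(
            "WARN: processing {} unindexed blocks, above the former cap of {}",
            unindexed, MAX_UNINDEXED_BLOCKS_TO_PROCESS
        );
    }
    unindexed
}

fn main() {
    assert_eq!(check_unindexed_range(100, 10_000), 9_900);
}
```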
);

storage::del(
I think there might be a bug in the runner-side prefetch code related to this. If a historical process is kicked off and it fails, the runner-side prefetch continuously pulls from its buffer array rather than the stream (although the stream message is still there). So even if you delete the stream and fill it with a new historical stream, the earlier failed message will continue to block execution. I'll test this and ship a PR to fix it if that's in fact true.
Good catch
Maybe we need to think of alternate ways of handling pre-fetch, perhaps we could:
- Update/overwrite the existing stream messages rather than maintaining an in-memory queue
- Set an expire-able key in Redis, similar to the real-time caching
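The expire-able key idea could be modelled as below. This is an in-memory sketch of the semantics only (the `PrefetchCache` type and its methods are invented for illustration, standing in for Redis `SET ... EX` and `GET`): a failed execution simply lets its entry expire instead of permanently blocking an in-memory queue.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// In-memory model of expiring prefetch entries, mimicking Redis key TTLs.
struct PrefetchCache {
    entries: HashMap<String, (Vec<u8>, Instant)>,
}

impl PrefetchCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Analogue of `SET key value EX ttl`: store the value with an expiry.
    fn set_ex(&mut self, key: &str, value: Vec<u8>, ttl: Duration) {
        self.entries.insert(key.to_string(), (value, Instant::now() + ttl));
    }

    // Analogue of `GET`: expired entries behave as if the key were absent.
    fn get(&self, key: &str) -> Option<&[u8]> {
        self.entries
            .get(key)
            .filter(|(_, expires_at)| Instant::now() < *expires_at)
            .map(|(value, _)| value.as_slice())
    }
}

fn main() {
    let mut cache = PrefetchCache::new();
    cache.set_ex("block:100", vec![1, 2, 3], Duration::from_secs(60));
    assert!(cache.get("block:100").is_some());
    // A zero TTL expires immediately, so the entry reads as missing.
    cache.set_ex("block:101", vec![4], Duration::from_secs(0));
    assert!(cache.get("block:101").is_none());
}
```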
}
    .start_block_height(last_indexed_block + 1)
    .build()
    .context("Failed to build lake config")?;
If we fail to build the lake config, then should we purge the historical feed again? I think it could be confusing to the customer if it looks like their historical process ran only to then have a gap between the block they specified and what blocks actually ran. On that front, I feel it might be better to construct this earlier. Maybe also have a retry on it depending on the error type.
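The retry suggestion could be sketched with a small generic wrapper. This is only an illustration, not code from the PR; the `retry` helper and its backoff are invented, and a real version would inspect the error type before retrying, as the comment suggests.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical retry wrapper: run a fallible step up to `attempts` times,
// pausing briefly between tries, so a transient build/config failure does
// not immediately abort the historical backfill.
fn retry<T, E>(mut attempts: u32, mut f: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    loop {
        match f() {
            Ok(value) => return Ok(value),
            Err(err) if attempts <= 1 => return Err(err),
            Err(_) => {
                attempts -= 1;
                sleep(Duration::from_millis(100));
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Fails twice, then succeeds, mimicking a transient error.
    let result: Result<u32, &str> = retry(5, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
}
```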
There are many things that can fail and cause the historical process to exit; it's not just limited to the construction of near-lake-framework. At this point, I think it's worth pushing this out as is so we can test. We can then have a deeper conversation about error handling later.
Commented on one thing to look into (unindexed start_block_height), otherwise looks great.
Awesome work! Excited to see these changes in action.
Problem
The Historical backfill is composed of two main steps:
1. Fetch matching blocks from the near-delta-lake/Databricks index files
2. Manually filter all blocks between the last_indexed_block from near-delta-lake and the block_height at which the historical process was triggered

Only after each of these steps is completed are the blocks flushed to Redis. This leads to a large delay from when the process was triggered to when the blocks are actually executed by Runner, creating the appearance of things 'not working'.
Changes
This PR makes the following changes to reduce the time between trigger and execution:
Flush indexed blocks immediately
Fetching blocks from the index file (Step 1) is a relatively quick process. Rather than wait for manual filtering (Step 2), we can flush these blocks immediately and then continue on to the following step.
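The flush-immediately flow can be sketched with a channel: blocks found in the index file are handed to the Redis-flushing side as soon as they are known, while the slower manual filtering continues. The function name, block heights, and channel-based structure here are illustrative assumptions, not the PR's actual implementation.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: Step 1 results are flushed as soon as they are known; Step 2
// results follow later on the same channel as filtering finds them.
fn backfill_order() -> Vec<u64> {
    let (tx, rx) = mpsc::channel::<u64>();

    let producer = thread::spawn(move || {
        // Step 1: blocks from the index file, flushed immediately.
        for height in [100u64, 105, 110] {
            tx.send(height).unwrap();
        }
        // Step 2: manually filtered blocks, sent as they are found.
        for height in [111u64, 112] {
            tx.send(height).unwrap();
        }
    });

    // The consumer (standing in for the Redis flush) sees Step 1 blocks
    // without waiting for Step 2 to complete.
    let flushed: Vec<u64> = rx.iter().collect();
    producer.join().unwrap();
    flushed
}

fn main() {
    assert_eq!(backfill_order(), vec![100, 105, 110, 111, 112]);
}
```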
Prefetch blocks in manual filtering
In manual filtering, blocks are processed sequentially. To speed this process up, we can fetch ahead so that we minimise the time spent waiting for S3 requests. Luckily, near-lake-framework does exactly this, so manual filtering has been refactored to use it.

Results
The following is from a backfill of 19,959,790 blocks run locally. I'd expect the Before to be much lower given the geographical distance compared to our actual infrastructure, but the results are still positive :).
Time till first block appears on Redis:
Time to completion: