Prod Release 20/11 #414
Conversation
## Problem

The historical backfill is composed of two main steps:
1. Fetching matching blocks from `near-delta-lake`/Databricks index files
2. Manually filtering all blocks between the `last_indexed_block` from `near-delta-lake` and the `block_height` at which the historical process was triggered

Only after both of these steps are completed are the blocks flushed to Redis. This leads to a large delay between when the process is triggered and when the blocks are actually executed by Runner, creating the appearance of things 'not working'.

## Changes

This PR makes the following changes to reduce the time between trigger and execution:

### Flush indexed blocks immediately

Fetching blocks from the index file (Step 1) is a relatively quick process. Rather than wait for Step 2, we can flush these blocks immediately and then continue on to the following step.

### Prefetch blocks in manual filtering

In manual filtering, blocks are processed sequentially. To speed this up, we can fetch ahead so that we minimise the time spent waiting for S3 requests. Luckily, `near-lake-framework` does exactly this, so manual filtering has been refactored to use it (see the illustrative sketch below).

## Results

The following is from a backfill of 19,959,790 blocks run locally. I'd expect the 'Before' numbers to be much lower on our actual infrastructure, given the geographical distance when running locally, but the results are still positive :).

Time till first block appears on Redis:
- Before: 33 minutes
- After: 2 seconds

Time to completion:
- Before: 33 minutes
- After: 45 seconds
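The fetch-ahead behaviour itself comes from `near-lake-framework`; the sketch below is only a minimal TypeScript illustration of the general prefetch pattern (keep a bounded window of requests in flight while filtering sequentially). The `fetchBlock`/`matchesFilter` helpers and block shape are hypothetical stand-ins, not the actual implementation.

```typescript
// Illustrative prefetch pattern only: keep a bounded window of in-flight fetches
// ahead of the block currently being filtered. fetchBlock and matchesFilter are
// hypothetical placeholders for the S3 fetch and contract-filter matching.
type Block = { height: number; payload: string };

async function fetchBlock(height: number): Promise<Block> {
  // Placeholder for an S3 GetObject round trip.
  return { height, payload: `block-${height}` };
}

function matchesFilter(block: Block): boolean {
  // Placeholder for rules-engine matching.
  return block.height % 2 === 0;
}

async function* prefetchedBlocks(start: number, end: number, windowSize = 10): AsyncGenerator<Block> {
  const inFlight: Promise<Block>[] = [];
  let next = start;

  // Prime the window so several fetches are in flight before filtering starts.
  while (next <= end && inFlight.length < windowSize) {
    inFlight.push(fetchBlock(next++));
  }

  while (inFlight.length > 0) {
    // Await the oldest request, then top the window up so requests keep overlapping.
    const block = await inFlight.shift()!;
    if (next <= end) {
      inFlight.push(fetchBlock(next++));
    }
    yield block;
  }
}

async function manualFilter(start: number, end: number): Promise<number[]> {
  const matched: number[] = [];
  for await (const block of prefetchedBlocks(start, end)) {
    if (matchesFilter(block)) matched.push(block.height);
  }
  return matched;
}
```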
We have two versions of `indexer_rules_engine`: sync and async. This logic does not need to be `async`, so I've removed the async version and refactored so only the sync version is used. Additionally, I've migrated the `async` tests across to the `sync` version, and have added `cargo test` to GitHub Actions so that they are always run. This required fixing the `set_provisioning_flag` test, as well as ignoring the integration tests which rely on AWS access keys.
Runner sets the Indexer Config when the thread is started. As a result, it does not react to updates to that config, such as updates to code. This is a problem because, unless Runner is restarted, published code changes won't be used. I've changed it so that the config is read on each iteration of the loop, so config updates will be consumed. In the short term, this can be tuned, such as reading every X loops and on every failure, if need be. In the long term, improved communication between Coordinator and Runner can let Coordinator tell Runner when to re-read the config, rather than reading it all the time. (A minimal sketch of the per-iteration read is below.) In addition, the metrics for block wait duration and overall execution duration were wrong; I've moved the start time to the correct spot.
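A minimal sketch of re-reading the config every iteration, assuming a Redis-backed config key; the key layout, config shape, and stream read are assumptions for illustration, not the actual Runner code.

```typescript
// Sketch: read the indexer config fresh on every loop iteration rather than once at
// thread start, so published code changes take effect without restarting Runner.
// Key names, config shape, and the lPop stream read are illustrative assumptions.
import { createClient } from 'redis';

interface IndexerConfig {
  account_id: string;
  function_name: string;
  code: string;
}

async function runStream(redisUrl: string, streamKey: string): Promise<void> {
  const redis = createClient({ url: redisUrl });
  await redis.connect();

  while (true) {
    // Fetch the config each iteration so code updates are picked up.
    const raw = await redis.get(`${streamKey}:config`);
    if (raw === null) {
      await new Promise((resolve) => setTimeout(resolve, 1000));
      continue;
    }
    const config: IndexerConfig = JSON.parse(raw);

    // Placeholder for reading the next block height from the stream.
    const blockHeight = await redis.lPop(streamKey);
    if (blockHeight === null) {
      await new Promise((resolve) => setTimeout(resolve, 100));
      continue;
    }

    console.log(`executing ${config.function_name} on block ${blockHeight}`);
  }
}
```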
Coordinator pulls indexer config from the registry upon initialization. While it runs, it continues to check each block for registry update calls. When the registry is updated, it kicks off a historical process if necessary, and updates its local indexer configurations to reflect the new config for that particular indexer.

It turns out that the local indexer config was in fact not updating correctly: it was storing the config under the wrong key. As a result, updates to existing indexers were not being captured, while new indexers would have the wrong config. Coordinator would attempt to set various statuses, such as provisioning, but the incorrect keys would lead to errors around that. And failing to capture indexer config updates means Runner never sees those updates and continues to run stale code. Restarting Coordinator resolved these issues, since the config is read freshly from the registry, hiding the problem for some time before it reappeared.

In addition to the above problem, the configuration is only updated in Redis when the next matching block appears. So, if a new indexer is published without a start block, its Redis config and stream will not be set until a new matching block appears. This can be problematic for developers who are themselves the producers of blocks matching their contract filter; in that case, matching blocks may be rare, leading to a long delay before updated code is read and executed. Setting the config immediately after an update is seen eliminates this problem.

I've fixed the particular line which was incorrectly setting the key, and set the Redis config immediately after an update is seen, to resolve the above issues.
This PR addresses a few pain points of logging and QueryAPI Indexer development.

1. Logs can now be seen through the QueryAPI editor window via a `Show Logs` button.
   - No need to switch between the `Indexer Status` & `Editor` windows anymore, saving time and frustration with the editor taking a few seconds to load.
   - Enables more powerful integrations with the editor for future PRs (e.g. adding blocks to debug directly from the indexer Logs table).
2. Logs are now easier to digest and search.
   - Logs are grouped by `block_height`, so the user sees all the messages belonging to a block's execution together.
   - Additionally, the user can use the `search` functionality to search for any block_height where the function was executed and view its log in its entirety.
   - Logs now have relative dates, e.g. "4 mins ago".
   - Block heights can be clicked to take you to the explorer link.
3. Simplifying QueryAPI links.
   - There is no need for `&view=status/editor-window` to be passed in anymore.
   - As long as `selectedIndexerPath` is defined, the user is sent to the editor window. If `&view=status` is passed, the user is redirected to the indexer logs page within the editor window.
   - Note: I have a PR on `queryapi/docs` to reflect these changes: near/docs#1584

Notes: Instead of using GraphQL to fetch logs, we are using a feature of Hasura which `restifies` your GraphQL queries (see the illustrative request sketch after this description). You can find details here: https://near-queryapi.api.pagoda.co/console/api/rest/list

- The old Indexer Status widget should still be available if needed via the following link: https://near.org/dataplatform.near/widget/QueryApi.IndexerStatus?accountId=dataplatform.near&indexer_name=social_feed

Other small updates:
- If an invalid indexer path is chosen, the editor will show an error.
- QueryAPI is available to users in read-only mode even if they are not logged in on near.org.

### This PR does not add WS support. This will be done in a future PR.

![image](https://github.com/near/queryapi/assets/25015977/0e94ede2-6cca-41ae-9dd4-827ff388ba48) ![image](https://github.com/near/queryapi/assets/25015977/01e655cb-786d-4913-9b09-e00014037863) ![queryapi logs testing](https://github.com/near/queryapi/assets/25015977/c66d980c-0988-492e-b225-46371ff7572e)
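For context on the RESTified-query approach: Hasura can expose a saved GraphQL query as a plain REST endpoint. The sketch below shows roughly what calling such an endpoint from the frontend might look like; the endpoint path, query parameters, and response fields are assumptions for illustration, not the actual endpoint used by the widget.

```typescript
// Hedged example of fetching indexer logs from a Hasura RESTified endpoint.
// The path, parameters, and response shape here are illustrative assumptions.
interface IndexerLog {
  block_height: number;
  message: string;
  timestamp: string;
}

async function fetchLogs(functionName: string, limit = 50): Promise<IndexerLog[]> {
  const url =
    'https://near-queryapi.api.pagoda.co/api/rest/indexer_logs' + // hypothetical endpoint
    `?function_name=${encodeURIComponent(functionName)}&limit=${limit}`;

  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch logs: ${response.status}`);
  }
  const body = await response.json();
  return body.indexer_log_entries ?? []; // hypothetical response field
}
```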
I've updated a frontend environment variable in terraform which needs to be applied in our Prod environment here: https://github.com/near/near-ops/pull/1440. The change is already applied for our dev env.
@roshaans could you please apply it?
So after a lot of testing, I found the problem. It turns out that the lazy-loading implementation I used for context.db was set up so that any call to a db function would first check if the dmlHandler exists (load on access) and, if not, instantiate it. This is helpful for situations where the functions aren't used. However, it had a side effect: if a bunch of db function calls come in simultaneously, they will all check the dmlHandler, see it is uninitialized, and all try to create a new one. Each initialization will try to create a pool with 10 available clients, and each will create at least one client. This exhausts the clients very quickly if there are enough simultaneous calls.

I've changed it so that the dmlHandler initialization is started immediately instead, and each access awaits its completion before calling any functions. This ensures there is no more than one client active for that invocation. I also wrote a test which makes 100 simultaneous calls and verified that only one pool is created. Transaction support can come next, now that the pooling works correctly. (A minimal sketch of the single-initialization pattern is below.)
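A minimal sketch of the shared-initialization pattern described above: initialization starts once, every caller awaits the same promise, and so only one handler (and one pool) is ever created. The class and method names are illustrative, not the actual `context.db` implementation.

```typescript
// Sketch: store the initialization *promise*, not the handler, so concurrent db calls
// all await the same in-flight initialization and share a single pool. Names are illustrative.
class DmlHandler {
  static async create(connectionString: string): Promise<DmlHandler> {
    // Placeholder for creating a pg Pool and verifying connectivity.
    return new DmlHandler();
  }

  async insert(table: string, rows: object[]): Promise<void> {
    // Placeholder for the real INSERT logic.
  }
}

class ContextDb {
  // Initialization is kicked off eagerly in the constructor, exactly once.
  private readonly handlerPromise: Promise<DmlHandler>;

  constructor(connectionString: string) {
    this.handlerPromise = DmlHandler.create(connectionString);
  }

  async insert(table: string, rows: object[]): Promise<void> {
    // Every caller awaits the same promise, so only one handler/pool exists.
    const handler = await this.handlerPromise;
    return handler.insert(table, rows);
  }
}

// Usage: 100 simultaneous calls still result in a single DmlHandler/pool.
async function demo(): Promise<void> {
  const db = new ContextDb('postgres://user:pass@localhost:5432/db');
  await Promise.all(
    Array.from({ length: 100 }, (_, i) => db.insert('posts', [{ id: i }])),
  );
}
```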
The current block wait metric uses a gauge, which causes spikes in block wait to be lost. So, I have switched it to a histogram instead, which does a much better job of capturing peaks and enables a variety of more useful expressions, some of which I list below. These expressions are also customizable.

A more accurate average: `rate(queryapi_runner_block_wait_duration_milliseconds_sum[$__rate_interval]) / rate(queryapi_runner_block_wait_duration_milliseconds_count[$__rate_interval])`

The duration under which 95% of block requests were fulfilled: `histogram_quantile(0.95, sum(rate(queryapi_runner_block_wait_duration_milliseconds_bucket[$__rate_interval])) by (le))`

The % of requests for each indexer where the block wait was under 100ms:
```
sum(rate(queryapi_runner_block_wait_duration_milliseconds_bucket{le="100"}[$__rate_interval])) by (indexer) / sum(rate(queryapi_runner_block_wait_duration_milliseconds_count[$__rate_interval])) by (indexer)
```
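For reference, defining such a histogram in a Node service with `prom-client` typically looks like the sketch below. The metric name mirrors the expressions above, but the bucket boundaries and the `indexer` label are assumptions, not the exact Runner configuration.

```typescript
// Sketch of a block-wait histogram using prom-client. Unlike a gauge, observe()
// records every sample into buckets, so short spikes are not overwritten.
import { Histogram } from 'prom-client';

const blockWaitDuration = new Histogram({
  name: 'queryapi_runner_block_wait_duration_milliseconds',
  help: 'Time spent waiting for the next block, in milliseconds',
  labelNames: ['indexer'],
  buckets: [10, 50, 100, 300, 500, 1000, 3000, 5000, 10000], // assumed boundaries
});

async function waitForBlockWithMetric(indexer: string, waitForBlock: () => Promise<void>): Promise<void> {
  const start = Date.now();
  await waitForBlock();
  // Record the wait for this indexer; the PromQL above aggregates over these buckets.
  blockWaitDuration.labels(indexer).observe(Date.now() - start);
}
```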
refactor: Remove async `indexer_rules_engine` (#387)