
Prod Release 23/04/24 #686

Merged
merged 30 commits into stable on Apr 22, 2024

Conversation

morgsmccauley (Collaborator) commented Apr 22, 2024

Kevin101Zhang and others added 28 commits April 4, 2024 15:22
Prototype Draft:

Integrated the new logs schema. We are writing to both
default/public/indexer_logs_entries and provisioning a new table for new
indexers under the user's provisioned database and respective schema.


https://www.loom.com/share/ff21d7099cac403d9152c905f7e4ddcc?sid=5828ae99-377b-4510-ac8c-76c02fd232f2
Quick & dirty PR which short-circuits updating the status via GraphQL
when it is unchanged. We don't need to check whether block height has
changed, as that is updated in a separate call later on. I've also
updated the integration tests to assert the output of status/logs.
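
For illustration, a minimal TypeScript sketch of the short-circuit, assuming a hypothetical wrapper around the GraphQL mutation (names are illustrative, not the repo's actual API):

```typescript
// Hypothetical sketch: skip the GraphQL mutation when the status is unchanged.
type IndexerStatus = 'PROVISIONING' | 'RUNNING' | 'FAILING' | 'STOPPED';

class StatusWriter {
  private lastStatus?: IndexerStatus;

  constructor(
    private readonly mutate: (query: string, variables: Record<string, unknown>) => Promise<void>,
  ) {}

  async setStatus(functionName: string, status: IndexerStatus): Promise<void> {
    if (status === this.lastStatus) {
      return; // unchanged, avoid the network round-trip
    }
    await this.mutate(
      'mutation SetStatus($name: String!, $status: String!) { set_status(name: $name, status: $status) }',
      { name: functionName, status },
    );
    this.lastStatus = status;
  }
}
```
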
I introduce the code necessary to perform status and last processed
block height writes through Postgres.

I also refactored DmlHandler and its usage in Indexer, as caching the
database credentials allows its constructor to be simplified.
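
A hedged sketch of what the Postgres writes could look like; the table/column names and upsert shape are assumptions, not the repo's actual schema:

```typescript
import { Pool } from 'pg';

// Hedged sketch: write status and last processed block height into a
// per-indexer metadata table. Names are assumptions for illustration only.
class IndexerMetaWriter {
  constructor(private readonly pool: Pool, private readonly schemaName: string) {}

  private async upsert(attribute: string, value: string): Promise<void> {
    await this.pool.query(
      `INSERT INTO ${this.schemaName}.sys_metadata (attribute, value)
         VALUES ($1, $2)
         ON CONFLICT (attribute) DO UPDATE SET value = EXCLUDED.value`,
      [attribute, value],
    );
  }

  async writeStatus(status: string): Promise<void> {
    await this.upsert('STATUS', status);
  }

  async writeLastProcessedBlockHeight(height: number): Promise<void> {
    await this.upsert('LAST_PROCESSED_BLOCK_HEIGHT', String(height));
  }
}
```
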
Feat: created logEntry class and test cases
Chore:  relocated createLogs to abstracted func
Chore: renamed schema idx to prefix with '__'
The provisioning flow will not be run for existing Indexers, so this PR
adds a separate provisioning check/step which sets up the partitioned
logs table for existing users.

I've opted for an in-code approach as a "manual" migration script
requires specific timing, i.e. we'd need to deploy the logs change,
ensure all new Indexers are provisioned correctly, and then migrate all
existing users to ensure that no Indexers are missed. But since the logs
provisioning change is coupled with the logging itself, existing
Indexers would fail to log until the migration is complete.

My only concern for this approach is a "thundering herd". After this is
deployed, all Indexers will attempt to provision their logs table at the
same time - I'll monitor this in Dev.

As this code is temporary, I didn't bother adding
instrumentation/unit-tests, nor worry about the performance impact. It
will be removed promptly.

This is dependent on #608 and should be merged after.
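
A rough sketch of the in-code check/step described above, assuming hypothetical provisioner methods:

```typescript
// Hedged sketch of the provisioning check for existing Indexers; method names
// are assumptions based on the description above.
interface LogsProvisioner {
  logsTableExists: (schemaName: string) => Promise<boolean>;
  provisionLogsTable: (schemaName: string) => Promise<void>;
}

async function ensureLogsTable(provisioner: LogsProvisioner, schemaName: string): Promise<void> {
  // Existing Indexers skipped the normal provisioning flow, so check on every
  // execution and create the partitioned logs table only when it is missing.
  if (await provisioner.logsTableExists(schemaName)) {
    return;
  }
  await provisioner.provisionLogsTable(schemaName);
}
```
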
This PR adds
[tracing-stackdriver](https://github.com/NAlexPear/tracing-stackdriver),
which outputs logs in a GCP compatible JSON format. To enable this, the
`GCP_LOGGING_ENABLED` environment variable must be set.

Further, I've added additional context to errors to aid debugging.

near/near-ops#1695
The first few commits are from branch
#640. This branch was created based off
the initial branch.

This PR intends to introduce the creation of the logs table by
provisioning the logsSchema and the following CRON jobs for new users, but
does not use or write logs to the new logsTable itself.

If this is merged by itself, new users will have an unused logs table, but
the provisioning will occur.
To provision existing users - #636
Migrates data related to the Indexer into a common class to
simplify interactions with common values such as AccountId.

I've also added an integration test for the context DB.
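
A small sketch of what such a common class might look like; field and method names, and the schema-name derivation, are assumptions for illustration:

```typescript
// Hedged sketch of a common class that centralises Indexer identity data such
// as the account ID and function name.
class IndexerConfig {
  constructor(
    public readonly accountId: string,
    public readonly functionName: string,
  ) {}

  // e.g. "morgs.near" + "my_indexer" -> "morgs_near_my_indexer"
  get schemaName(): string {
    return `${this.accountId}_${this.functionName}`.replace(/[^a-zA-Z0-9]/g, '_');
  }

  get fullName(): string {
    return `${this.accountId}/${this.functionName}`;
  }
}
```
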
Enable conditional provisioning of metadata table.
The object we receive from node-sql-parser after astify() changed. We can
think about version control in the future.
Quick fix here: returning the object to how it was and adding an
additional layer in the Editor to ensure the types are mounted.
Uncommented functionality so we actually start writing logs to the new
tables that have been provisioned in #643.
The old logging implementation remains untouched as it still functions
(although it has been renamed from writeLog -> writeLogOld). We are
writing to both log tables.
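
A hedged sketch of the dual-write described above; the LogEntry shape and function names are assumptions:

```typescript
// Hedged sketch: during the migration window, write each entry to both the
// legacy table and the new per-indexer logs table. Names are illustrative.
interface LogEntry {
  blockHeight?: number;
  timestamp: Date;
  type: 'system' | 'user';
  level: 'INFO' | 'WARN' | 'ERROR';
  message: string;
}

declare function writeLogOld(entry: LogEntry): Promise<void>; // legacy public.indexer_logs_entries
declare function writeLogNew(entry: LogEntry): Promise<void>; // new per-indexer partitioned table

async function writeLogs(entry: LogEntry): Promise<void> {
  // Keep the old path untouched while the new table bakes in.
  await Promise.all([writeLogOld(entry), writeLogNew(entry)]);
}
```
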

### 1. Provisioning and Logging (to both tables) for a new Indexer

https://www.loom.com/share/3ad6d6ea3368412e8896340a74759ffb?sid=4d5379e8-5401-41bf-9e38-d0f8e8c4eca5

### 2. Logging (to both tables) for an existing Indexer

https://www.loom.com/share/4ba411f2bcb740e1842650f695ffb347?sid=253ced68-9d4c-459f-871b-b0a3ee00cd91

### 3. Provisioning and Logging to the new logs table for an existing Indexer (that
does not have a logs table)

https://www.loom.com/share/2aa7c0cc882f4dbdb9e51fc2a9e9b7b9?sid=1aa511fe-3054-4d27-9996-2b9fddc44ed8
Depends on near/near-lake-framework-rs#102


This PR exposes a new metric which counts the number of Get requests
made to S3 by `near-lake-framework`. I wanted to start tracking this
metric _before_ I merge the change which reduces them, so I can measure
the impact of that change. The easiest way to track these requests was
to pass a custom `S3Client` to `near-lake-framework`, so we can hook into
the actual requests made.

The custom `S3Client` (`LakeS3Client`) is exactly the same as the
default implementation in `near-lake-framework` itself, but with the
added metric. This is essentially part 1 for #419, as the "reduction" in
requests will build on this custom client, adding
caching/de-duplication.
The CRON statement functions were attempting to access a non-existent scoped
property. Added syntax for dynamic SQL generation to traverse the scope properly.

Tested by setting cron fn_create_partition to trigger every 30 seconds. 
Previously we would not see the Non Trackable functions and we would get
the original error message below.
Now we are able to view the 2 Non Trackable functions and the row
succeeds.

<img width="686" alt="Screenshot 2024-04-16 at 7 39 10 PM"
src="https://github.com/near/queryapi/assets/42101107/b67e6e49-2f66-46d8-a41e-e1b51a6a2f06">

<img width="1378" alt="Screenshot 2024-04-16 at 8 02 24 PM"
src="https://github.com/near/queryapi/assets/42101107/fbb4946e-8f23-4fc3-8765-0fad676897d1">

Original error:

```
ERROR: function fn_delete_partition(unknown, date, unknown, unknown) does not exist
LINE 1: SELECT fn_delete_partition('kevin33_near_component_01.__logs... ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
```
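
Purely as an assumption based on the hint above, the generated partition-maintenance SQL might need schema-qualified arguments and explicit casts, sketched here; the real function signatures, cast targets, and argument values may differ:

```typescript
// Hedged sketch only: illustrates generating the partition calls with explicit
// casts; the cast types ('::text') are placeholders, not the actual signature.
function buildPartitionMaintenanceSql(schemaName: string, tableName: string): string {
  const qualifiedTable = `${schemaName}.${tableName}`;
  return `
    SELECT fn_create_partition('${qualifiedTable}'::text, CURRENT_DATE, '0 day'::text, '1 day'::text);
    SELECT fn_delete_partition('${qualifiedTable}'::text, CURRENT_DATE, '-15 day'::text, '-14 day'::text);
  `;
}
```
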
Enable writes of Status and Last Processed Block Height to Metadata
table. Reorganizes provisioning to ensure writing of PROVISIONING
status. Ensures IndexerMeta is available for writing error logs.
- fix: Use compatible types across inter-dependent crates
- fix: Clippy
Each `BlockStream` uses its own dedicated `near-lake-framework`
instance, and hence manages its own connection with S3. This leads to
many duplicate S3 requests, particularly across the large majority of
Indexers which follow the network tip and therefore request the same block
data at the same time.

This PR introduces a shared S3 client to be used across all
`near-lake-framework` instances. `SharedLakeS3Client` ensures that
duplicate requests made within a short time-frame, including those made
in parallel, result in only a single request to S3.

## Cache Strategy

This implementation will mostly impact `BlockStream`s following the
network tip, i.e. `From Latest`. These streams wait for new data in
Near Lake S3 and request it as soon as it is available, all at the same
time. Because of this, caching the result alone would not be enough: by the
time we actually prime the cache, all other requests would have missed
it and fired a request of their own. Locking while the request is
in-flight is also not feasible, as this would force _every_ request to
execute in sequence.

Instead of caching the result of the request, we cache its computation.
The first request initiates the request and stores its `Future`, then
all subsequent requests retrieve that `Future` from cache and `await`
its result, ensuring only one underlying request at most.
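
The production code is Rust (`SharedLakeS3Client`), but the same "cache the computation, not the result" pattern can be sketched in TypeScript by caching the in-flight Promise; all names here are illustrative:

```typescript
// Hedged sketch: de-duplicate concurrent fetches of the same key by caching
// the in-flight Promise, so at most one underlying request is made.
class SharedBlockFetcher {
  private inFlight = new Map<string, Promise<Uint8Array>>();

  constructor(private readonly fetchFromS3: (key: string) => Promise<Uint8Array>) {}

  getObject(key: string): Promise<Uint8Array> {
    const existing = this.inFlight.get(key);
    if (existing !== undefined) {
      // Subsequent callers await the same in-flight request instead of
      // issuing their own.
      return existing;
    }
    const pending = this.fetchFromS3(key).finally(() => {
      // Evict shortly after completion; near-tip streams only need requests
      // de-duplicated within a short window.
      setTimeout(() => this.inFlight.delete(key), 1_000);
    });
    this.inFlight.set(key, pending);
    return pending;
  }
}
```
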

## Performance Impact

My main concern with this implementation is the impact it will have on
performance. Each request made must block to check the cache,
introducing contention/delays. The lock is only held while checking the
cache, and not while the request is being made, so my hope is that it
does not impact too much. This may be something that needs to be
iterated over time.

From local testing the impact seemed to be negligible, but that was with
5 Indexers; it may be worse with many. I've added a metric to measure
lock wait time, to determine whether this contention becomes a
problem.
Logs Table and Metadata Table are necessary for important functions of
QueryApi. Hasura often invisibly fails to track tables and add
permissions. These operations need to be successful, so I added a check
which verifies tracking and permissions are correct, and reattempts them
if not. When successful, the result is cached. In a successful case, the
expensive hasura calls (getTableNames and
getTrackedTablesWithPermissions) are done at most twice.

I also combined the conditional provisioning functions since they are
verified to work already.
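
A hedged sketch of the verification/retry step; only `getTableNames` and `getTrackedTablesWithPermissions` are named above, the remaining method names are assumptions:

```typescript
// Hedged sketch: verify Hasura actually tracked the tables and applied
// permissions, and reattempt if it silently failed.
interface HasuraClient {
  getTableNames: (schema: string) => Promise<string[]>;
  getTrackedTablesWithPermissions: (schema: string) => Promise<string[]>;
  trackTables: (schema: string, tables: string[]) => Promise<void>;
  addPermissions: (schema: string, tables: string[]) => Promise<void>;
}

async function ensureTrackedWithPermissions(
  hasura: HasuraClient,
  schema: string,
  requiredTables: string[],
): Promise<boolean> {
  const tracked = await hasura.getTrackedTablesWithPermissions(schema);
  const missing = requiredTables.filter((table) => !tracked.includes(table));
  if (missing.length === 0) {
    return true; // caller can cache this result and skip future Hasura calls
  }
  // Hasura sometimes fails invisibly; reattempt tracking and permissions.
  await hasura.trackTables(schema, missing);
  await hasura.addPermissions(schema, missing);
  return false;
}
```
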
The logs and metadata tables were created with a `__` prefix.
Unfortunately, it turns out that the prefix is a reserved prefix used by
Hasura. So, we are renaming the tables to be prefixed with `sys_`, which
is not reserved, to the best of my knowledge.

The specific process for the migration is:
1. Delete and recreate the cron DB in dev. This deletes the DB and any
scheduled jobs.
2. Delete logs/metadata tables and any created partitions. 
3. Create new tables. 
4. Use new tables successfully.
…ilures (#675)

Errors in provisioning should be logged to the machine as they can
potentially be overwritten by errors in the finally block of the parent
try catch.

We ideally want to move the provisioning out to its own try catch but
this is a simple fix for the time being.

In addition, the PRs to replace the logs/metadata tables failed because
untracking was only partially successful. This PR tolerates untracking errors.
Everything has been migrated to the new tables. This code hook
for deleting the old tables is no longer necessary.
This PR adds `winston` to introduce structured logging, and also writes
GCP-compatible logs when `GCP_LOGGING_ENABLED` is set.
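
A minimal sketch with winston, assuming the GCP-compatible output is structured JSON on stdout; the exact format the repo uses may differ:

```typescript
import winston from 'winston';

// Hedged sketch: switch to a JSON format when GCP_LOGGING_ENABLED is set.
const gcpEnabled = process.env.GCP_LOGGING_ENABLED !== undefined;

export const logger = winston.createLogger({
  level: 'info',
  format: gcpEnabled
    ? winston.format.combine(
        winston.format.timestamp(),
        winston.format.json(), // GCP parses JSON payloads from stdout
      )
    : winston.format.simple(),
  transports: [new winston.transports.Console()],
});

logger.info('block processed', { blockHeight: 119_000_000 });
```
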
`context.set` was constructing an incorrect query (#646); this corrects
that query.
Frontend now queries the new metadata table instead of the old one. 
<img width="1013" alt="image"
src="https://github.com/near/queryapi/assets/22734869/4eae0f19-eda6-4e28-b244-cb78453aeeea">

Indexers which have not successfully run since the introduction of the
new logs table will not have a last processed block height since they
never processed a block successfully. So, I opted to set it as N/A and
leave a tooltip on hover which explains why it's N/A.
<img width="1015" alt="image"
src="https://github.com/near/queryapi/assets/22734869/38bc3565-41e1-4562-9966-371337e673fb">
My due diligence was insufficient. As it turns out, yarn.lock IS used by
the frontend during development. Adding it back.
Creates a new `winston` transport method to count logs by level, and
exposes a new prometheus metric to record this value.

Additionally, metrics recorded on the main thread were not captured.
This was not an issue previously, as the majority of metrics were recorded
within the worker. But since we also log on the main thread, this PR updates
metrics aggregation to expose main thread metrics.
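
A hedged sketch of the counting transport using `winston-transport` and `prom-client`; the metric and class names are illustrative:

```typescript
import TransportStream from 'winston-transport';
import { Counter } from 'prom-client';

// Hedged sketch: a winston transport that counts emitted log entries by level
// and exposes the count as a Prometheus counter.
const logCounter = new Counter({
  name: 'queryapi_runner_logs_total', // assumed metric name
  help: 'Total number of log entries emitted, by level',
  labelNames: ['level'],
});

export class LogCounterTransport extends TransportStream {
  log(info: any, callback: () => void): void {
    logCounter.labels(info.level).inc();
    callback();
  }
}
```
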
Temporarily disable usage of new logs/metadata tables so that prod
release is focused on provisioning new tables only.
Adding a rollout for provisioning of the logs and metadata tables for
existing users to soften the prod release impact. Provisioning has been
observed to use many DB connections on startup.
morgsmccauley merged commit 7555eff into stable on Apr 22, 2024
17 checks passed