Prod Release 06/02/24 #546

Merged
merged 25 commits into stable on Feb 6, 2024
Conversation

@darunrs (Collaborator) commented Feb 1, 2024

snyk-bot and others added 22 commits October 21, 2023 00:49
Snyk has created this PR to upgrade @next/font from 13.1.6 to 13.5.3.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade @types/node from 18.13.0 to 18.18.1.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade @types/react from 18.0.28 to 18.2.23.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade eslint from 8.34.0 to 8.50.0.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr
Currently, errors thrown within Coordinator V2 will bubble up to
`main()` and cause the entire application to exit. This PR captures
those errors, handling them accordingly.

Errors are handled in one of the following ways:
1. Exponential retry - data which is 'critical' to the control loop, i.e. the indexer registry and the executor/stream lists, will be continuously retried, blocking the control loop from further progress, as it would not make sense to continue without this information.
2. Swallow - actions such as starting/stopping executors/streams will be logged and swallowed. This is preferable to exponential retries, as individual failures will not block the progress of the control loop, allowing other indexers to be acted on. Skipping should be fine in this case as the action will be retried in the next loop iteration.

I expect this behaviour to evolve over time as we learn more about the system; the important thing here is that Coordinator will not crash on errors.
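For illustration, here is a minimal sketch of the kind of capped exponential retry described above (TypeScript for brevity; the actual Coordinator is Rust, and the function and variable names are assumptions). The 30-second cap reflects the "Cap exponential retry to 30 secs" commit included in this release.

```typescript
// Sketch only: retry a critical fetch (e.g. the indexer registry) forever,
// doubling the delay on each failure and capping it at 30 seconds.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryIndefinitely<T>(operation: () => Promise<T>): Promise<T> {
  let delayMs = 1_000; // initial delay is an assumption
  for (;;) {
    try {
      return await operation();
    } catch (error) {
      console.warn(`critical fetch failed, retrying in ${delayMs}ms`, error);
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 30_000); // cap at 30 seconds
    }
  }
}
```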
This PR updates Coordinator V2 to read an `allowlist` from Redis, and
only start accounts which have been added to that list. The `allowlist`
is a single key in Redis stored as stringified JSON, e.g.:
```json
[{ "account_id": "morgs.near" }]
```

I've used JSON as each entry will eventually contain more fields, such
as an acknowledgement from Coordinator V1, and a flag to set when
completing the migration.

Additionally, I've added various logs across Coordinator V2.
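As a rough illustration of the read path (TypeScript with the node-redis client for brevity; Coordinator itself is Rust, and the key name `allowlist` plus the helper below are assumptions):

```typescript
import { createClient } from 'redis';

// Shape of an allowlist entry as described above; more fields will be added later.
interface AllowlistEntry {
  account_id: string;
}

// Sketch: read the single Redis key holding stringified JSON and parse it.
async function fetchAllowlist(redisUrl: string): Promise<AllowlistEntry[]> {
  const redis = createClient({ url: redisUrl });
  await redis.connect();
  const raw = await redis.get('allowlist'); // assumed key name
  await redis.disconnect();
  return raw ? (JSON.parse(raw) as AllowlistEntry[]) : [];
}
```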
Block Streamer needs a Dockerfile in order to build the image and be
deployable either locally or in GCP. I've created one for the service
and updated the compose file to set the correct env variables needed.
This also serves as a record of what env variables need to be set during Terraform deployments to GCP.

In addition, there's a small issue with running QueryApi locally: if the user runs Runner locally through yarn start, provisioning resources through Hasura fails because the Postgres host is populated as 'localhost' when it should be 'postgres', since the calls are processed inside the Hasura container. This is due to PGHOST being both used by Runner and passed to Hasura. To fix this, I created a separate env variable called PGHOST_HASURA which can be set to 'postgres'. This way, Runner can use 'localhost' while passing 'postgres' to the Hasura container.
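A minimal sketch of the split, where everything other than PGHOST and PGHOST_HASURA (which come from this PR) is hypothetical:

```typescript
// Runner connects to Postgres with PGHOST (e.g. 'localhost' when run via yarn start),
// but tells Hasura to reach Postgres via PGHOST_HASURA (e.g. 'postgres'), since
// Hasura resolves the host from inside its own container.
const pgHost = process.env.PGHOST ?? 'localhost';          // used by Runner's own PG client
const pgHostHasura = process.env.PGHOST_HASURA ?? pgHost;  // passed to Hasura during provisioning

// Hypothetical connection strings built from the two hosts:
const runnerConnection = `postgres://user:password@${pgHost}:5432/db`;
const hasuraDatabaseUrl = `postgres://user:password@${pgHostHasura}:5432/db`;
```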
Coordinator V2 currently defaults to starting block streams with the
start_block_height value populated in the registry. In order for a
smooth migration from V1 to V2, we instead want Coordinator V2 to create
block streams starting where the current indexers left off.

Coordinator V1 now writes a last_published_block value for all indexers
it manages on each matching block.

Coordinator V2's start_block_height handling needs refactoring: the existing flow defaults to using start_block_height as it is present in the registry, even after we have begun processing a stream from that block height. We need to refactor the flow so that the start block height is used only once, with the last published block height from Redis used otherwise, falling back if it is not present.

New flow to handle edge cases will be tracked by
#521.
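As a sketch of the intended selection logic (TypeScript for illustration; the names and exact precedence below are assumptions based on this description):

```typescript
// Use start_block_height from the registry only when the stream has never been
// started; otherwise continue from the last_published_block written by
// Coordinator V1, falling back to the registry value if it is not present.
function resolveStartBlock(options: {
  startBlockHeight?: number;   // from the registry
  lastPublishedBlock?: number; // written by Coordinator V1 to Redis
  streamPreviouslyStarted: boolean;
}): number | undefined {
  if (!options.streamPreviouslyStarted && options.startBlockHeight !== undefined) {
    return options.startBlockHeight; // honour the registry exactly once
  }
  return options.lastPublishedBlock ?? options.startBlockHeight; // otherwise continue where V1 left off
}
```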
…5f6ae38549

[Snyk] Upgrade @types/node from 18.13.0 to 18.18.1
…a4cbc363b0

[Snyk] Upgrade @next/font from 13.1.6 to 13.5.3
- feat: Cap exponential retry to 30 secs
- feat: Debug log stream/executor responses
- feat: Add more logging to coordinator
…386f1e3465

[Snyk] Upgrade @types/react from 18.0.28 to 18.2.23
…253c56

[Snyk] Security upgrade next from 13.1.6 to 13.5.0
…6d4e4ec0e4

[Snyk] Upgrade eslint from 8.34.0 to 8.50.0
As V1 executors are returned via Runner gRPC, Coordinator V2 will
attempt to synchronise them. These executors should be completely
ignored within the synchronisation logic.
It was observed that Runner would execute on the same block multiple
times. This was verified again when @Kevin101Zhang saw that the
components indexer, which he modified, incorrectly incremented the star
counter.

I noticed some indexers such as sweat_blockheight very rarely had
duplicate runs. To validate if Runner was the problem, I shut down
Runner in dev and triggered some social feed interactions. I verified
that the block only appeared in Redis once, indicating Coordinator was
not the problem. When I started Runner again, the problem appeared
again.

After that, I worked on replicating the issue locally. I built up 3
messages on a social feed indexer and modified Runner to write its
executions to a file. Afterward, I ran Runner. I searched the file and
found that the 3 blocks actually appeared in sequence like so: block 1,
block 2, block 3, block 1, block 2, block 3... and so on. This seems to indicate that duplicate workers were not the problem, but instead that the same messages were being read into the block array after they had all been read.

Finally, with that, I found the problem. When Runner fills its array of S3 promises, it reads a stream message and increments a stored message ID. Subsequent stream fetches are asked to fetch only messages after that stream ID. This is needed as deletion of stream messages can only
take place after the message is SUCCESSFULLY processed, to avoid
problems if Runner was reset.

However, the code which handles the case where no messages are available
in the stream reset that counting stream ID to '0', as a mechanism to
ensure messages that were somehow skipped are definitely read. This
resetting of the ID ended up being the cause. To illustrate, here's a
scenario:

1. The stream has 3 messages in it. 
2. Runner starts with 0 and then reads all 3 messages. The counting
stream ID is the ID of the last read message + 1.
3. The messages are added to the array as promises.
4. The first message begins to process. 
5. The producer loop sees that no messages are found anymore. 
6. **The producer waits 100ms and then resets the ID.**
7. The producer, while the first message was processing, fetches the
same 3 messages again.
8. This will repeat as long as no new messages appear, while messages
are in the promise array but not processed.

The fix is simple: Remove the resetting of the stream ID. It is no
longer necessary as the problem of remaining messages in the stream was
fixed back in December.
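To make the scenario concrete, here is a simplified sketch of the producer loop with the fix applied (the names and signatures are assumptions, not Runner's actual code):

```typescript
// Sketch of the producer loop: fetch messages after the last-read ID, queue
// them as promises, and crucially do NOT reset the ID when the stream is empty.
interface StreamMessage { id: string; blockHeight: number }

declare function fetchMessagesAfter(id: string): Promise<StreamMessage[]>;
declare function processBlock(message: StreamMessage): Promise<void>;
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const promiseArray: Array<Promise<void>> = [];
let lastReadMessageId = '0';

async function producerLoop(): Promise<void> {
  for (;;) {
    const messages = await fetchMessagesAfter(lastReadMessageId);
    if (messages.length === 0) {
      // Fix: previously the ID was reset to '0' here, causing the same
      // unprocessed messages to be fetched and queued again while they were
      // still sitting in the promise array.
      await sleep(100);
      continue;
    }
    for (const message of messages) {
      promiseArray.push(processBlock(message)); // stream entries are only deleted after successful processing
      lastReadMessageId = message.id;           // subsequent fetches start after this ID
    }
  }
}
```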
@darunrs marked this pull request as ready for review February 1, 2024 02:11
@darunrs requested a review from a team as a code owner February 1, 2024 02:11
@morgsmccauley (Collaborator) left a comment:

Make sure to "Create Merge Commit" rather than "Squash and Merge" - this keeps the histories aligned

@darunrs changed the title from Prod Release 31/01/24 to Prod Release 01/02/24 on Feb 1, 2024
darunrs and others added 3 commits February 5, 2024 11:35
In each invocation of runFunction on a block, a DML Handler is created. It's responsible for making calls to the database. Part of setting it up involves fetching the credentials for the user's DB from Hasura and creating a PG Client. This takes time, so the process was run as an unawaited async request. While other setup and some of the user code runs, the setup of the DML Handler would complete. The first context.db call would await its completion and subsequent calls would have it ready.

However, it was observed that when the call to Hasura for the DB credentials timed out, the error, instead of propagating into a try/catch, would be treated by the JavaScript runtime as an unhandled promise rejection, terminating the worker thread and stopping the indexer.

In order to fix this problem, we need to transition away from keeping
DmlHandler.create as an unresolved Promise across multiple contexts. The
approach I've decided to take is to defer the creation of the Hasura
call promise until the first call of context.db. This adds more latency
to the first context.db call as it now must wait for the entire process
to complete. However, this also does not penalize indexers that don't use context.db, as their code never needs to connect to Hasura.

Very soon, we will in fact overhaul this existing logic by migrating the
Hasura credentials call away from runFunctions. This eliminates the
underlying problem of unresolved promises as none remain afterward. So,
the focus here is to address the bug, which is a critical problem,
without too much change, as the workflow will be refactored again soon
anyway.


I also fixed a small bug where context.db calls were being logged under the wrong function name in the indexer logs table.
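A rough sketch of the deferred setup described above (the names here are assumptions; the real implementation lives in Runner's DmlHandler):

```typescript
// Sketch: the Hasura credentials fetch is started lazily inside the first
// context.db call and always awaited there, so a timeout surfaces in the
// caller's try/catch instead of becoming an unhandled promise rejection.
interface DbCredentials { host: string; password: string }

declare function fetchDbCredentialsFromHasura(account: string): Promise<DbCredentials>;
declare function createPgClient(credentials: DbCredentials): Promise<unknown>;

class LazyDmlHandler {
  private clientPromise?: Promise<unknown>;

  constructor(private readonly account: string) {}

  private getClient(): Promise<unknown> {
    // Created on first use, then reused by every subsequent context.db call.
    this.clientPromise ??= fetchDbCredentialsFromHasura(this.account)
      .then((credentials) => createPgClient(credentials));
    return this.clientPromise;
  }

  async insert(table: string, rows: object[]): Promise<void> {
    const client = await this.getClient(); // errors propagate to the awaiting caller
    // ... run the insert for `table`/`rows` using `client`
  }
}
```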
This PR builds on the Redis `allowlist` to auto migrate indexers from
the existing infrastructure to the Control Plane. An account migration
requires coordination between both the V1 and V2 architecture - the
indexer must be removed/ignored from V1, and then correctly configured
within V2.

## Allowlist shape
Each `allowlist` entry now contains the following properties:
```rs
pub struct AllowlistEntry {
    account_id: AccountId, // The account which should be migrated
    v1_ack: bool,          // True if Coordinator V1 has acknowledged the entry
    migrated: bool,        // True if the migration was successful
    failed: bool,          // True if the migration failed
}
```

## Coordinator V1
For Coordinator V1, the `allowlist` is really a denylist; the code/types have therefore been named as such. Accounts within the "denylist" should
be ignored completely by V1. Because we cannot guarantee the timing of
when this "ignore" actually happens, a flag (`v1_ack`) will be set from
V1. V2 will only take over once this flag has been set.

Accounts within the "denylist" will be filtered out of the in-memory
registry. Any new indexer "registrations" will also be ignored.
In-progress historical backfills haven't been considered as we'll
disable this functionality anyway.

## Coordinator V2
Once acknowledged by V1, Coordinator V2 will attempt to migrate all
functions under the relevant account. The steps for migration are:
1. Remove the streams from the Redis `streams` set - preventing Runner
from starting these indexers implicitly
2. Stop the existing V1 executors which have already been started via
the `streams` set
3. Merge the existing historical (if it exists) and real-time streams

Once migrated, accounts which have `v1_ack && migrated && !failed` will
be exposed to the control loop, prompting V2 to act on these indexers.
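For illustration, this gate can be thought of as a simple filter over the allowlist (TypeScript sketch; Coordinator is Rust, and these names are assumptions mirroring the struct above):

```typescript
// Only entries that V1 has acknowledged, that migrated successfully, and that
// have not failed are handed to the control loop.
interface AllowlistEntry {
  account_id: string;
  v1_ack: boolean;
  migrated: boolean;
  failed: boolean;
}

const accountsForControlLoop = (allowlist: AllowlistEntry[]): AllowlistEntry[] =>
  allowlist.filter((entry) => entry.v1_ack && entry.migrated && !entry.failed);
```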

### `migrated` flag
For now, the `migrated` flag will not be set on success, preventing V2
from running the indexer on the new architecture. There are some issues
around V2 continuing from the right block correctly, so it's best to not
run them for now. This allows us to test the migration in isolation, not
worrying about what V2 does after that. I'll add this logic back in once
#536 is complete.

### `failed` flag
If any part of the migration fails, the `failed` flag will be set for
that account. It would take a significant amount of time to cover all
the edge cases in code so it would be faster to just set this flag to
ignore the account, fix the migration manually and then reset the
`failed` flag.
Runner sets the status of an Indexer after a successful run or a failed
run whose error was caught. However, if the executor itself crashed, the
error would not be caught as the worker is terminated. As a result, the
status of the indexer would continue to display RUNNING, which was
incorrect and misleading.

I updated the status handling to ensure crashed workers have the STOPPED status set. In addition, I added a new FAILING status, which is now set when an indexer is still running but repeatedly failing on the same block.
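A minimal sketch of the status handling described here (the status names come from this description; the hook points and helper are assumptions):

```typescript
enum IndexerStatus {
  RUNNING = 'RUNNING',
  FAILING = 'FAILING', // still running, but failing on the same block
  STOPPED = 'STOPPED', // set when the worker/executor crashes
}

declare function writeStatus(indexerName: string, status: IndexerStatus): Promise<void>;

// Crashed worker: previously the status stayed RUNNING because the error was
// never caught; now the exit handler marks the indexer as STOPPED.
async function onWorkerExit(indexerName: string, exitCode: number): Promise<void> {
  if (exitCode !== 0) {
    await writeStatus(indexerName, IndexerStatus.STOPPED);
  }
}

// Repeated failure on the same block marks the indexer as FAILING while it keeps running.
async function onBlockError(indexerName: string, failedSameBlockAgain: boolean): Promise<void> {
  if (failedSameBlockAgain) {
    await writeStatus(indexerName, IndexerStatus.FAILING);
  }
}
```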
@darunrs changed the title from Prod Release 01/02/24 to Prod Release 01/06/24 on Feb 6, 2024
@darunrs changed the title from Prod Release 01/06/24 to Prod Release 06/02/24 on Feb 6, 2024
@darunrs merged commit 34017d3 into stable on Feb 6, 2024
22 checks passed