Prod Release 06/02/24 #546

Merged
merged 25 commits into stable on Feb 6, 2024
Conversation

@darunrs (Collaborator) commented Feb 1, 2024

snyk-bot and others added 22 commits October 21, 2023 00:49
Snyk has created this PR to upgrade @next/font from 13.1.6 to 13.5.3.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade @types/node from 18.13.0 to 18.18.1.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade @types/react from 18.0.28 to 18.2.23.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr

Snyk has created this PR to upgrade eslint from 8.34.0 to 8.50.0.
See this project in Snyk: https://app.snyk.io/org/gabehamilton/project/f1490843-1830-4eb0-a957-99816aa5edcc?utm_source=github&utm_medium=referral&page=upgrade-pr
Currently, errors thrown within Coordinator V2 will bubble up to
`main()` and cause the entire application to exit. This PR captures
those errors, handling them accordingly.

Errors are handled in one of the following ways:
1. Exponential retry - data which is 'critical' to the control loop, i.e. the indexer registry and the executor/stream lists, will be continuously retried, blocking the control loop from further progress, as it would not make sense to continue without this information.
2. Swallow - actions such as starting/stopping executors/streams will be logged and swallowed. This is preferable to exponential retries, as individual failures will not block the progress of the control loop, allowing other indexers to be acted on. Skipping should be fine in this case as the action will be retried in the next loop iteration.

I expect this behaviour to evolve over time as we learn more about the system; the important thing here is that Coordinator will not crash on errors.
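For illustration, here is a minimal sketch of the kind of capped exponential retry described above (TypeScript for brevity; the actual Coordinator is Rust, and the function and variable names are assumptions). The 30-second cap reflects the "Cap exponential retry to 30 secs" commit included in this release.

```typescript
// Sketch only: retry a critical fetch (e.g. the indexer registry) forever,
// doubling the delay on each failure and capping it at 30 seconds.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryIndefinitely<T>(operation: () => Promise<T>): Promise<T> {
  let delayMs = 1_000; // initial delay is an assumption
  for (;;) {
    try {
      return await operation();
    } catch (error) {
      console.warn(`critical fetch failed, retrying in ${delayMs}ms`, error);
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 30_000); // cap at 30 seconds
    }
  }
}
```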
This PR updates Coordinator V2 to read an `allowlist` from Redis, and
only start accounts which have been added to that list. The `allowlist`
is a single key in Redis stored as stringified JSON, e.g.:
```json
[{ "account_id": "morgs.near" }]
```

I've used JSON as each entry will eventually contain more fields, such
as an acknowledgement from Coordinator V1, and a flag to set when
completing the migration.

Additionally, I've added various logs across Coordinator V2.
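As a rough illustration of the read path (TypeScript with the node-redis client for brevity; Coordinator itself is Rust, and the key name `allowlist` plus the helper below are assumptions):

```typescript
import { createClient } from 'redis';

// Shape of an allowlist entry as described above; more fields will be added later.
interface AllowlistEntry {
  account_id: string;
}

// Sketch: read the single Redis key holding stringified JSON and parse it.
async function fetchAllowlist(redisUrl: string): Promise<AllowlistEntry[]> {
  const redis = createClient({ url: redisUrl });
  await redis.connect();
  const raw = await redis.get('allowlist'); // assumed key name
  await redis.disconnect();
  return raw ? (JSON.parse(raw) as AllowlistEntry[]) : [];
}
```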
Block Streamer needs a Dockerfile in order to build the image and be
deployable either locally or in GCP. I've created one for the service
and updated the compose file to set the correct env variables needed.
This also serves as a record of what env variables need to be set during Terraform deployments to GCP.

In addition, there's a small issue with running QueryApi locally: if the user runs Runner locally through yarn start, provisioning resources through Hasura fails because the Postgres host is populated as 'localhost' when it should be 'postgres', since the calls are processed inside the Hasura container. This is due to PGHOST being both used by Runner and passed to Hasura. To fix this, I created a separate env variable called PGHOST_HASURA which can be set to 'postgres'. This way, Runner can use 'localhost' while passing 'postgres' to the Hasura container.
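A minimal sketch of the split, where everything other than PGHOST and PGHOST_HASURA (which come from this PR) is hypothetical:

```typescript
// Runner connects to Postgres with PGHOST (e.g. 'localhost' when run via yarn start),
// but tells Hasura to reach Postgres via PGHOST_HASURA (e.g. 'postgres'), since
// Hasura resolves the host from inside its own container.
const pgHost = process.env.PGHOST ?? 'localhost';          // used by Runner's own PG client
const pgHostHasura = process.env.PGHOST_HASURA ?? pgHost;  // passed to Hasura during provisioning

// Hypothetical connection strings built from the two hosts:
const runnerConnection = `postgres://user:password@${pgHost}:5432/db`;
const hasuraDatabaseUrl = `postgres://user:password@${pgHostHasura}:5432/db`;
```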
Coordinator V2 currently defaults to starting block streams with the
start_block_height value populated in the registry. In order for a
smooth migration from V1 to V2, we instead want Coordinator V2 to create
block streams starting where the current indexers left off.

Coordinator V1 now writes a last_published_block value for all indexers
it manages on each matching block.

Coordinator V2's start_block_height handling needs refactoring: the existing flow defaults to using start_block_height as it is present in the registry, even after we have begun processing a stream from that block height. We need to refactor the flow so that the start block height is used only once, with the last published block height from Redis used otherwise, falling back if it is not present.

New flow to handle edge cases will be tracked by
#521.
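As a sketch of the intended selection logic (TypeScript for illustration; the names and exact precedence below are assumptions based on this description):

```typescript
// Use start_block_height from the registry only when the stream has never been
// started; otherwise continue from the last_published_block written by
// Coordinator V1, falling back to the registry value if it is not present.
function resolveStartBlock(options: {
  startBlockHeight?: number;   // from the registry
  lastPublishedBlock?: number; // written by Coordinator V1 to Redis
  streamPreviouslyStarted: boolean;
}): number | undefined {
  if (!options.streamPreviouslyStarted && options.startBlockHeight !== undefined) {
    return options.startBlockHeight; // honour the registry exactly once
  }
  return options.lastPublishedBlock ?? options.startBlockHeight; // otherwise continue where V1 left off
}
```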
…5f6ae38549

[Snyk] Upgrade @types/node from 18.13.0 to 18.18.1
…a4cbc363b0

[Snyk] Upgrade @next/font from 13.1.6 to 13.5.3
- feat: Cap exponential retry to 30 secs
- feat: Debug log stream/executor responses
- feat: Add more logging to coordinator
…386f1e3465

[Snyk] Upgrade @types/react from 18.0.28 to 18.2.23
…253c56

[Snyk] Security upgrade next from 13.1.6 to 13.5.0
…6d4e4ec0e4

[Snyk] Upgrade eslint from 8.34.0 to 8.50.0
As V1 executors are returned via Runner gRPC, Coordinator V2 will
attempt to synchronise them. These executors should be completely
ignored within the synchronisation logic.
It was observed that Runner would execute on the same block multiple
times. This was verified again when @Kevin101Zhang saw that the
components indexer, which he modified, incorrectly incremented the star
counter.

I noticed some indexers such as sweat_blockheight very rarely had
duplicate runs. To validate if Runner was the problem, I shut down
Runner in dev and triggered some social feed interactions. I verified
that the block only appeared in Redis once, indicating Coordinator was
not the problem. When I started Runner again, the problem appeared
again.

After that, I worked on replicating the issue locally. I built up 3
messages on a social feed indexer and modified Runner to write its
executions to a file. Afterward, I ran Runner. I searched the file and
found that the 3 blocks actually appeared in sequence like so: block 1,
block 2, block 3, block 1, block 2, block 3... and so on. This seems to indicate that duplicate workers were not the problem, but instead that the same messages were being read into the block array after they had all been read.

Finally, with that, I found the problem. When Runner fills its array of S3 promises, it reads a stream message and increments a stored message ID. Subsequent stream fetches are asked to fetch only messages after that stream ID. This is needed as deletion of stream messages can only
take place after the message is SUCCESSFULLY processed, to avoid
problems if Runner was reset.

However, the code which handles the case where no messages are available
in the stream reset that counting stream ID to '0', as a mechanism to
ensure messages that were somehow skipped are definitely read. This
resetting of the ID ended up being the cause. To illustrate, here's a
scenario:

1. The stream has 3 messages in it. 
2. Runner starts with 0 and then reads all 3 messages. The counting
stream ID is the ID of the last read message + 1.
3. The messages are added to the array as promises.
4. The first message begins to process. 
5. The producer loop sees that no messages are found anymore. 
6. **The producer waits 100ms and then resets the ID.**
7. The producer, while the first message was processing, fetches the
same 3 messages again.
8. This will repeat as long as no new messages appear, while messages
are in the promise array but not processed.

The fix is simple: Remove the resetting of the stream ID. It is no
longer necessary as the problem of remaining messages in the stream was
fixed back in December.
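To make the scenario concrete, here is a simplified sketch of the producer loop with the fix applied (the names and signatures are assumptions, not Runner's actual code):

```typescript
// Sketch of the producer loop: fetch messages after the last-read ID, queue
// them as promises, and crucially do NOT reset the ID when the stream is empty.
interface StreamMessage { id: string; blockHeight: number }

declare function fetchMessagesAfter(id: string): Promise<StreamMessage[]>;
declare function processBlock(message: StreamMessage): Promise<void>;
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const promiseArray: Array<Promise<void>> = [];
let lastReadMessageId = '0';

async function producerLoop(): Promise<void> {
  for (;;) {
    const messages = await fetchMessagesAfter(lastReadMessageId);
    if (messages.length === 0) {
      // Fix: previously the ID was reset to '0' here, causing the same
      // unprocessed messages to be fetched and queued again while they were
      // still sitting in the promise array.
      await sleep(100);
      continue;
    }
    for (const message of messages) {
      promiseArray.push(processBlock(message)); // stream entries are only deleted after successful processing
      lastReadMessageId = message.id;           // subsequent fetches start after this ID
    }
  }
}
```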
@darunrs marked this pull request as ready for review February 1, 2024 02:11
@darunrs requested a review from a team as a code owner February 1, 2024 02:11
@morgsmccauley (Collaborator) left a comment:

Make sure to "Create Merge Commit" rather than "Squash and Merge" - this keeps the histories aligned

@darunrs changed the title from Prod Release 31/01/24 to Prod Release 01/02/24 on Feb 1, 2024
darunrs and others added 3 commits February 5, 2024 11:35
In each invocation of runFunction on a block, a DML Handler is created. It's responsible for making calls to the database. Part of setting it up involves fetching the credentials for the user's DB from Hasura and creating a PG Client. This takes time, so the process was run as an unawaited async request. While other setup and some of the user code runs, the setup of the DML Handler would complete. The first context.db call would await its completion and subsequent calls would have it ready.

However, it was observed that when the call to Hasura for the DB credentials timed out, the error, instead of propagating into a try/catch, would be treated by the JavaScript runtime as an unhandled promise rejection, terminating the worker thread and stopping the indexer.

In order to fix this problem, we need to transition away from keeping
DmlHandler.create as an unresolved Promise across multiple contexts. The
approach I've decided to take is to defer the creation of the Hasura
call promise until the first call of context.db. This adds more latency
to the first context.db call as it now must wait for the entire process
to complete. However, this also does not penalize indexers that don't use context.db, as their code never needs to connect to Hasura.

Very soon, we will in fact overhaul this existing logic by migrating the
Hasura credentials call away from runFunctions. This eliminates the
underlying problem of unresolved promises as none remain afterward. So,
the focus here is to address the bug, which is a critical problem,
without too much change, as the workflow will be refactored again soon
anyway.


I also fixed a small bug where context.db calls were being logged under the wrong function name in the indexer logs table.
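A rough sketch of the deferred setup described above (the names here are assumptions; the real implementation lives in Runner's DmlHandler):

```typescript
// Sketch: the Hasura credentials fetch is started lazily inside the first
// context.db call and always awaited there, so a timeout surfaces in the
// caller's try/catch instead of becoming an unhandled promise rejection.
interface DbCredentials { host: string; password: string }

declare function fetchDbCredentialsFromHasura(account: string): Promise<DbCredentials>;
declare function createPgClient(credentials: DbCredentials): Promise<unknown>;

class LazyDmlHandler {
  private clientPromise?: Promise<unknown>;

  constructor(private readonly account: string) {}

  private getClient(): Promise<unknown> {
    // Created on first use, then reused by every subsequent context.db call.
    this.clientPromise ??= fetchDbCredentialsFromHasura(this.account)
      .then((credentials) => createPgClient(credentials));
    return this.clientPromise;
  }

  async insert(table: string, rows: object[]): Promise<void> {
    const client = await this.getClient(); // errors propagate to the awaiting caller
    // ... run the insert for `table`/`rows` using `client`
  }
}
```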
This PR builds on the Redis `allowlist` to auto migrate indexers from
the existing infrastructure to the Control Plane. An account migration
requires coordination between both the V1 and V2 architecture - the
indexer must be removed/ignored from V1, and then correctly configured
within V2.

## Allowlist shape
Each `allowlist` entry now contains the following properties:
```rs
pub struct AllowlistEntry {
    account_id: AccountId, // The account which should be migrated
    v1_ack: bool,          // True if Coordinator V1 has acknowledged the entry
    migrated: bool,        // True if the migration was successful
    failed: bool,          // True if the migration failed
}
```

## Coordinator V1
For Coordinator V1, the `allowlist` is really a denylist; the code/types have therefore been named as such. Accounts within the "denylist" should
be ignored completely by V1. Because we cannot guarantee the timing of
when this "ignore" actually happens, a flag (`v1_ack`) will be set from
V1. V2 will only take over once this flag has been set.

Accounts within the "denylist" will be filtered out of the in-memory
registry. Any new indexer "registrations" will also be ignored.
In-progress historical backfills haven't been considered as we'll
disable this functionality anyway.

## Coordinator V2
Once acknowledged by V1, Coordinator V2 will attempt to migrate all
functions under the relevant account. The steps for migration are:
1. Remove the streams from the Redis `streams` set - preventing Runner
from starting these indexers implicitly
2. Stop the existing V1 executors which have already been started via
the `streams` set
3. Merge the existing historical (if it exists) and real-time streams

Once migrated, accounts which have `v1_ack && migrated && !failed` will
be exposed to the control loop, prompting V2 to act on these indexers.
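For illustration, this gate can be thought of as a simple filter over the allowlist (TypeScript sketch; Coordinator is Rust, and these names are assumptions mirroring the struct above):

```typescript
// Only entries that V1 has acknowledged, that migrated successfully, and that
// have not failed are handed to the control loop.
interface AllowlistEntry {
  account_id: string;
  v1_ack: boolean;
  migrated: boolean;
  failed: boolean;
}

const accountsForControlLoop = (allowlist: AllowlistEntry[]): AllowlistEntry[] =>
  allowlist.filter((entry) => entry.v1_ack && entry.migrated && !entry.failed);
```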

### `migrated` flag
For now, the `migrated` flag will not be set on success, preventing V2
from running the indexer on the new architecture. There are some issues
around V2 continuing from the right block correctly, so it's best to not
run them for now. This allows us to test the migration in isolation, not
worrying about what V2 does after that. I'll add this logic back in once
#536 is complete.

### `failed` flag
If any part of the migration fails, the `failed` flag will be set for
that account. It would take a significant amount of time to cover all
the edge cases in code so it would be faster to just set this flag to
ignore the account, fix the migration manually and then reset the
`failed` flag.
Runner sets the status of an Indexer after a successful run or a failed
run whose error was caught. However, if the executor itself crashed, the
error would not be caught as the worker is terminated. As a result, the
status of the indexer would continue to display RUNNING, which was
incorrect and misleading.

I updated the status handling to ensure crashed workers have the STOPPED status set. In addition, I added a new FAILING status, which is now set when an indexer is still running but repeatedly failing on the same block.
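A minimal sketch of the status handling described here (the status names come from this description; the hook points and helper are assumptions):

```typescript
enum IndexerStatus {
  RUNNING = 'RUNNING',
  FAILING = 'FAILING', // still running, but failing on the same block
  STOPPED = 'STOPPED', // set when the worker/executor crashes
}

declare function writeStatus(indexerName: string, status: IndexerStatus): Promise<void>;

// Crashed worker: previously the status stayed RUNNING because the error was
// never caught; now the exit handler marks the indexer as STOPPED.
async function onWorkerExit(indexerName: string, exitCode: number): Promise<void> {
  if (exitCode !== 0) {
    await writeStatus(indexerName, IndexerStatus.STOPPED);
  }
}

// Repeated failure on the same block marks the indexer as FAILING while it keeps running.
async function onBlockError(indexerName: string, failedSameBlockAgain: boolean): Promise<void> {
  if (failedSameBlockAgain) {
    await writeStatus(indexerName, IndexerStatus.FAILING);
  }
}
```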
@darunrs changed the title from Prod Release 01/02/24 to Prod Release 01/06/24 on Feb 6, 2024
@darunrs changed the title from Prod Release 01/06/24 to Prod Release 06/02/24 on Feb 6, 2024
@darunrs merged commit 34017d3 into stable on Feb 6, 2024
22 checks passed