
feat: Count S3 get requests made by near-lake-framework #662

Merged

merged 6 commits into main on Apr 17, 2024

Conversation

Collaborator

@morgsmccauley morgsmccauley commented Apr 15, 2024

Depends on near/near-lake-framework-rs#102

This PR exposes a new metrics which counts the number of Get requests made to S3 by near-lake-framework. I wanted to start tracking this metric before I merge the change which reduces them, so I can measure the impact of that change. The easiest way to track these requests was to pass a custom S3Client to near-lake-framework, so we can hook in to the actual requests made.

The custom S3Client (LakeS3Client) is exactly the same as the default implementation in near-lake-framework itself, but with the added metric. This is essentially part 1 for #419, as the "reduction" in requests will build on this custom client, adding caching/de-duplication.
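As a minimal, std-only sketch of the counting-wrapper idea described above (the `ObjectStore` trait and all names here are hypothetical stand-ins for near-lake-framework's `S3Client` trait, not its real API, and the real PR increments a Prometheus counter rather than an `AtomicU64`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the S3Client trait exposed by near-lake-framework.
trait ObjectStore {
    fn get_object(&self, key: &str) -> Vec<u8>;
}

// Fake "S3" backend so the sketch is self-contained.
struct InMemoryStore;
impl ObjectStore for InMemoryStore {
    fn get_object(&self, _key: &str) -> Vec<u8> {
        Vec::new()
    }
}

// Wrapper that bumps a counter on every Get request, then delegates.
struct CountingStore<S: ObjectStore> {
    inner: S,
    get_requests: Arc<AtomicU64>,
}

impl<S: ObjectStore> ObjectStore for CountingStore<S> {
    fn get_object(&self, key: &str) -> Vec<u8> {
        self.get_requests.fetch_add(1, Ordering::Relaxed);
        self.inner.get_object(key)
    }
}

fn demo() -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let store = CountingStore {
        inner: InMemoryStore,
        get_requests: Arc::clone(&counter),
    };
    store.get_object("000000000001/block.json");
    store.get_object("000000000001/shard_0.json");
    counter.load(Ordering::Relaxed)
}

fn main() {
    println!("GET requests: {}", demo()); // prints "GET requests: 2"
}
```

The wrapper pattern keeps the default client's behaviour untouched, which is why the PR can copy the upstream implementation verbatim and only add the metric.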

@morgsmccauley morgsmccauley requested a review from a team as a code owner April 15, 2024 03:53

// FIX: near lake framework now infinitely retries - we need a way to stop it to allow the test
Collaborator Author

Will try to address this in a future PR, will probably need to add a config option to Near Lake which prevents infinite retries.

Collaborator

What change to Lake caused infinite retries?

}

#[cfg(test)]
mockall::mock! {
Collaborator Author

Unable to use the usual #[automock] due to the multiple impl blocks and trait implementation, so I have to use this custom mock! block. The end result is essentially the same.

  || wildmatch::WildMatch::new(account_id)
-     .matches(&outcome_with_receipt.receipt.predecessor_id)
+     .matches(outcome_with_receipt.receipt.predecessor_id.as_str())
Collaborator Author

The latest near-lake-framework changed these from String to AccountId.

@morgsmccauley morgsmccauley linked an issue Apr 15, 2024 that may be closed by this pull request
Collaborator

@darunrs darunrs left a comment

Sweet work! Excited to see the next PR with the caching!

@@ -116,6 +116,7 @@ impl BlockStream {
}
}

#[allow(clippy::too_many_arguments)]
Collaborator

Is this call-out an indication that we need to rethink the function declaration? We could either encapsulate some or all of these arguments into a BlockStreamContext, or roll some of them into each other (the Redis stream into IndexerConfig, the Lake Framework stream created before being passed into start_block_stream, etc.)?

Collaborator Author

Yeah - this suppresses the Clippy warning, which I was getting sick of. BlockStreamContext is a good idea.
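A sketch of the BlockStreamContext idea: bundling the long argument list into one struct so start_block_stream takes a single parameter and the clippy::too_many_arguments allow can be dropped. Field names here are illustrative, not taken from the PR:

```rust
// Illustrative context struct bundling what start_block_stream would
// otherwise take as many separate arguments.
struct BlockStreamContext {
    start_block_height: u64,
    account_id: String,
    redis_stream: String,
    // ...remaining dependencies (clients, config) would live here too.
}

fn start_block_stream(ctx: &BlockStreamContext) -> String {
    // Stub body: just shows the call site shrinking to one argument.
    format!(
        "streaming {} from block {} into {}",
        ctx.account_id, ctx.start_block_height, ctx.redis_stream
    )
}

fn main() {
    let ctx = BlockStreamContext {
        start_block_height: 100,
        account_id: "example.near".to_string(),
        redis_stream: "block_stream:example.near".to_string(),
    };
    println!("{}", start_block_stream(&ctx));
}
```

Grouping the arguments also means adding a new dependency later changes one struct rather than every call site.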


// FIX: near lake framework now infinitely retries - we need a way to stop it to allow the test
Collaborator

What change to Lake caused infinite retries?

let response = self
.s3_client
.list_objects_v2()
.max_keys(1000)
Collaborator

Is this a value we set somewhere previously or was 1000 the default by the API?

Collaborator Author

This is the default from the API; essentially just a copy-paste from the near_lake_framework code.

Ok(bytes)
}

async fn list_common_prefixes(
Collaborator

What is this function used for? Just testing or something else?

Collaborator Author

This is part of the S3Client trait exposed by near_lake_framework. Under the hood it's used to list block heights, and the trait was originally exposed so those returned heights could be mocked. We don't do anything with it now, but still need to supply it.

We may want to add caching here too, but it's a bit more complicated because the arguments will likely vary widely across all near_lake_framework instances.

@morgsmccauley morgsmccauley merged commit 733c3c6 into main Apr 17, 2024
4 checks passed
@morgsmccauley morgsmccauley deleted the feat/count-lake-requests branch April 17, 2024 01:39
Development

Successfully merging this pull request may close these issues.

Reduce duplicate Lake requests across dedicated streams