Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Repository Search by Substring #8417

Merged
merged 20 commits into from
Dec 18, 2024

Conversation

nadavsteindler
Copy link
Contributor

@nadavsteindler nadavsteindler commented Dec 12, 2024

Closes #7596

Change Description

Background

Community feature request

New Feature

Repo search will seek substrings rather than prefixes, making it more likely to turn up what the user wants

Testing Details

Manual tests
Unit test Added

@nadavsteindler nadavsteindler marked this pull request as draft December 12, 2024 16:43
@nadavsteindler nadavsteindler self-assigned this Dec 12, 2024
@nadavsteindler nadavsteindler added the new-feature Issues that introduce new feature or capability label Dec 12, 2024
Copy link

github-actions bot commented Dec 12, 2024

♻️ PR Preview 4087e10 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

Copy link

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link

github-actions bot commented Dec 12, 2024

E2E Test Results - Quickstart

11 passed

@nadavsteindler nadavsteindler added the include-changelog PR description should be included in next release changelog label Dec 12, 2024
@nadavsteindler nadavsteindler force-pushed the feature/fuzzy-repo-search branch from 1e982e4 to 1178906 Compare December 15, 2024 10:31
@nadavsteindler nadavsteindler marked this pull request as ready for review December 15, 2024 15:47
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature worries me a lot.

## Cost & performance

Update: @guy-har points out that this entire calculation is wrong and that I was confugsed, please ignore and/or see below.

Suppose a user has 100 repositories (any less and they probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

Ideally we would open an immediate tech debt item to improve KV to allow faster scanning: we could add conditions on all supported KVs that would greatly reduce the required traffic.

Also, existing programs that perform list-repositories may now cost much more: every repository listing scans all repositories.

API

AFAICT the code overrides "prefix" to mean "substring". This is both confusing and a breaking change. At the very least, please keep "prefix", and add a new "substring" field.

Also, I believe that this change breaks pagination; please test and fix if needed.

startPos = afterRepositoryID
}
if startPos != "" {
it.SeekGE(startPos)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes scanning very inefficient.

Copy link
Contributor Author

@nadavsteindler nadavsteindler Dec 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I fixed it- seek to start position is back in. This was a logical error before we even talk about performance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we are missing repositories that include the string and are before the prefix (e.g for prefix ple we will never get apple)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should take this one step at a time:

  1. Separate "prefix" (which needs to be kept unchanged) from a new "substring" (actual name TBD) parameter.
  2. Implement "substring".
  3. Now support "after" and "prefix" - their implementations are probably unchanged.

We need to do all 3 in this PR, of course, but I think that I and @guy-har need to lay off supporting "after" until, well, after we agree about "substring".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I implemented the substring part. I thought to just call it search string, since that is what the client is trying to do and maybe we will improve the algorithm at some point- let me know what you think

// collect limit +1 to return limit and has more
if len(repos) >= limit+1 {
break
if strings.Contains(string(record.RepositoryID), prefix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change: it changes the meaning of the "prefix" parameter! (It is also very confusing to users, who will probably not expect "prefix" to mean "substring").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could change the name of the parameter- I don't see any other code that uses it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix is defined in our API and should still be supported, even if we wanted, we can't just change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's lots of code that can use prefix, we will just never know about it:

  • Any third-party code might use it. It could even use our REST API directly, we would have a hard time detecting it.
  • The high-level Python SDK method repositories supports it, so any code that uses that could rely on prefix.

Also,

Not changing the behaviour of prefix may complicate this PR, but unfortunately we need to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have to support prefix and also a search string?

prefixRepositoryID := graveler.RepositoryID(prefix)
startPos := prefixRepositoryID
if afterRepositoryID > startPos {
startPos = afterRepositoryID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this break pagination?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are correct thanks. Added some failing tests and fixed it

@nadavsteindler
Copy link
Contributor Author

nadavsteindler commented Dec 15, 2024

This feature worries me a lot.

Cost & performance

Suppose a user has 100 repositories (any less and they probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

Ideally we would open an immediate tech debt item to improve KV to allow faster scanning: we could add conditions on all supported KVs that would greatly reduce the required traffic.

Also, existing programs that perform list-repositories may now cost much more: every repository listing scans all repositories.

API

AFAICT the code overrides "prefix" to mean "substring". This is both confusing and a breaking change. At the very least, please keep "prefix", and add a new "substring" field.

Also, I believe that this change breaks pagination; please test and fix if needed.

OK OK, let's not get over-excited. I deleted a few lines too much, but now I fixed them, so I think it should be fine. I can performance test it to make sure

Oh as far as the API, I guess i could just rename it to substring right? No one else uses prefix, so no need for backwards compatiblity

startPos = afterRepositoryID
}
if startPos != "" {
it.SeekGE(startPos)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we are missing repositories that include the string and are before the prefix (e.g for prefix ple we will never get apple)

// collect limit +1 to return limit and has more
if len(repos) >= limit+1 {
break
if strings.Contains(string(record.RepositoryID), prefix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix is defined in our API and should still be supported, even if we wanted, we can't just change it.

@guy-har
Copy link
Contributor

guy-har commented Dec 16, 2024

@arielshaqed, can you please explain the calculation that brings us to 1000 RCUs/s?
AFAIU a scan costs 1RCU for 4KB of data, so roughly speaking scanning 100 records would be around 4 RCUs

@arielshaqed
Copy link
Contributor

@arielshaqed, can you please explain the calculation that brings us to 1000 RCUs/s? AFAIU a scan costs 1RCU for 4KB of data, so roughly speaking scanning 100 records would be around 4 RCUs

Sorry, you're right. I always make the same mistake: "kv".Store.Scan is actually DynamoDB Query, which indeed costs 1 RCU per 4 KiB of data. How large is a repository record?

message RepositoryData {
  string id = 1;
  string storage_namespace = 2;
  string default_branch_id = 3;
  google.protobuf.Timestamp creation_date = 4;
  RepositoryState state = 5;
  string instance_uid = 6;
  bool read_only = 7;
}

Recalculating: I can certainly see >160 bytes per repository, which is indeed just under 25 repositories per RCU, so 4-5 RCUs per keystroke. Obviously not ideal but please strike my original numbers (and I'm editing that comment to clarify the error).

@nadavsteindler
Copy link
Contributor Author

nadavsteindler commented Dec 16, 2024

probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

@itaigilo pointed out to me that our frontend uses useDebouncedState with a wait time of 300ms, so even if the user types faster, it will generate at most one request per 300ms

I could add some sort of substrings cache or index so that we aren't constantly iterating through the KV entries. Do you think it's neccessary?

Copy link
Contributor

@itaigilo itaigilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort, this is a tricky full-stack feature.

Blocking for both the param naming,
And for at least considering a different design of the API.

api/swagger.yml Show resolved Hide resolved
api/swagger.yml Outdated
@@ -70,6 +70,13 @@ components:
description: delimiter used to group common prefixes by
schema:
type: string

PaginationSearchString:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why in some places it's SearchString in in others it's search?
Please be consistent, since it's making it much easier to track when dealing with params that are being passed from the FE to the BE services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, well a "search string" is a term: https://www.techtarget.com/whatis/definition/search-string
The typically query strings are shortened to one word. See for example:
image

api/swagger.yml Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
@nadavsteindler nadavsteindler force-pushed the feature/fuzzy-repo-search branch 2 times, most recently from 927f8aa to 8fbb77f Compare December 17, 2024 11:27
@nadavsteindler
Copy link
Contributor Author

Thanks for the effort, this is a tricky full-stack feature.

Blocking for both the param naming, And for at least considering a different design of the API.

OK I addressed these issues (see the comments)

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, thanks! Please see comments about it being a search string, not a pagination search string. Now we'll see if anyone was depending on the old behaviour.1

Footnotes

  1. See XKCD #1172.

api/swagger.yml Show resolved Hide resolved
pkg/api/controller.go Outdated Show resolved Hide resolved
webui/src/lib/api/index.js Outdated Show resolved Hide resolved
Copy link
Contributor

@itaigilo itaigilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice one.

Approving to unblock,
But please fix Ariel's and my style comments before merging (paginationSearch, setsearch, and console.log).

api/swagger.yml Show resolved Hide resolved
webui/src/pages/repositories/index.jsx Outdated Show resolved Hide resolved
Comment on lines 587 to 589
// In this case, pass the last repository name as 'after' on the next call to ListRepositories
func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, after string) ([]*Repository, bool, error) {
func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, searchString, after string) ([]*Repository, bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document searchString

@nadavsteindler nadavsteindler force-pushed the feature/fuzzy-repo-search branch from 128270c to 4087e10 Compare December 18, 2024 09:15
@nadavsteindler nadavsteindler changed the title Feature substring repo search Feature: Repository Search by Substring Dec 18, 2024
@nadavsteindler nadavsteindler merged commit b60c6c7 into master Dec 18, 2024
43 checks passed
@nadavsteindler nadavsteindler deleted the feature/fuzzy-repo-search branch December 18, 2024 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog new-feature Issues that introduce new feature or capability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substring search across repositories
4 participants