Feature: Repository Search by Substring #8417

nadavsteindler · 2024-12-12T16:43:20Z

Closes #7596

Change Description

Background

Community feature request

New Feature

Repo search will seek substrings rather than prefixes, making it more likely to turn up what the user wants

Testing Details

Manual tests
Unit test Added

github-actions · 2024-12-12T16:44:45Z

♻️ PR Preview 4087e10 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

github-actions · 2024-12-12T16:51:24Z

E2E Test Results - DynamoDB Local - Local Block Adapter

github-actions · 2024-12-12T16:52:14Z

E2E Test Results - Quickstart

arielshaqed

This feature worries me ~~a lot~~.

~~## Cost & performance~~

Update: @guy-har points out that this entire calculation is wrong and that I was confused; please ignore it and/or see below.

Suppose a user has 100 repositories (any less and they probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

Ideally we would open an immediate tech debt item to improve KV to allow faster scanning: we could add conditions on all supported KVs that would greatly reduce the required traffic.

Also, existing programs that perform list-repositories may now cost much more: every repository listing scans all repositories.

API

AFAICT the code overrides "prefix" to mean "substring". This is both confusing and a breaking change. At the very least, please keep "prefix", and add a new "substring" field.

Also, I believe that this change breaks pagination; please test and fix if needed.

arielshaqed · 2024-12-15T16:16:27Z

pkg/catalog/catalog.go

-		startPos = afterRepositoryID
-	}
-	if startPos != "" {
-		it.SeekGE(startPos)


This makes scanning very inefficient.

Sorry, I fixed it: the seek to the start position is back in. This was a logic error, before we even talk about performance.

But now we are missing repositories that include the string and sort before the prefix (e.g. for prefix "ple" we will never get "apple").

I think we should take this one step at a time:

1. Separate "prefix" (which needs to be kept unchanged) from a new "substring" (actual name TBD) parameter.

2. Implement "substring".

3. Now support "after" and "prefix" - their implementations are probably unchanged.

We need to do all 3 in this PR, of course, but I think that I and @guy-har need to lay off supporting "after" until, well, after we agree about "substring".

OK, I implemented the substring part. I thought to just call it "search string", since that is what the client is trying to do and maybe we will improve the algorithm at some point. Let me know what you think.

arielshaqed · 2024-12-15T16:16:32Z

pkg/catalog/catalog.go

-		// collect limit +1 to return limit and has more
-		if len(repos) >= limit+1 {
-			break
+		if strings.Contains(string(record.RepositoryID), prefix) {


This is a breaking change: it changes the meaning of the "prefix" parameter! (It is also very confusing to users, who will probably not expect "prefix" to mean "substring").

I guess I could change the name of the parameter; I don't see any other code that uses it.

Prefix is defined in our API and should still be supported; even if we wanted to, we can't just change it.

There's lots of code that can use prefix, we will just never know about it:

Any third-party code might use it. It could even use our REST API directly, we would have a hard time detecting it.

The high-level Python SDK method repositories supports it, so any code that uses that could rely on prefix.

Also,

The S3 gateway method ListBuckets should support prefix; it seems that it doesn't, and that's now bug S3 gateway ListBuckets does not support "prefix" argument #8425 :-/

Not changing the behaviour of prefix may complicate this PR, but unfortunately we need to do it.

So we have to support prefix and also a search string?

arielshaqed · 2024-12-15T16:25:56Z

pkg/catalog/catalog.go

-	prefixRepositoryID := graveler.RepositoryID(prefix)
-	startPos := prefixRepositoryID
-	if afterRepositoryID > startPos {
-		startPos = afterRepositoryID


Doesn't this break pagination?

Yes, you are correct, thanks. I added some failing tests and fixed it.

nadavsteindler · 2024-12-15T17:10:22Z

This feature worries me a lot.

Cost & performance

Suppose a user has 100 repositories (any less and they probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

Ideally we would open an immediate tech debt item to improve KV to allow faster scanning: we could add conditions on all supported KVs that would greatly reduce the required traffic.

Also, existing programs that perform list-repositories may now cost much more: every repository listing scans all repositories.

API

AFAICT the code overrides "prefix" to mean "substring". This is both confusing and a breaking change. At the very least, please keep "prefix", and add a new "substring" field.

Also, I believe that this change breaks pagination; please test and fix if needed.

OK OK, let's not get over-excited. I deleted a few lines too many, but now I fixed them, so I think it should be fine. I can performance test it to make sure.

Oh, as far as the API goes, I guess I could just rename it to substring, right? No one else uses prefix, so no need for backwards compatibility.

guy-har · 2024-12-16T06:55:04Z

pkg/catalog/catalog.go

-		startPos = afterRepositoryID
-	}
-	if startPos != "" {
-		it.SeekGE(startPos)


But now we are missing repositories that include the string and sort before the prefix (e.g. for prefix "ple" we will never get "apple").

guy-har · 2024-12-16T06:57:08Z

pkg/catalog/catalog.go

-		// collect limit +1 to return limit and has more
-		if len(repos) >= limit+1 {
-			break
+		if strings.Contains(string(record.RepositoryID), prefix) {


Prefix is defined in our API and should still be supported; even if we wanted to, we can't just change it.

guy-har · 2024-12-16T07:08:46Z

@arielshaqed, can you please explain the calculation that brings us to 1000 RCUs/s?
AFAIU a scan costs 1 RCU for 4KB of data, so roughly speaking scanning 100 records would be around 4 RCUs

arielshaqed · 2024-12-16T07:47:25Z

@arielshaqed, can you please explain the calculation that brings us to 1000 RCUs/s? AFAIU a scan costs 1RCU for 4KB of data, so roughly speaking scanning 100 records would be around 4 RCUs

Sorry, you're right. I always make the same mistake: "kv".Store.Scan is actually DynamoDB Query, which indeed costs 1 RCU per 4 KiB of data. How large is a repository record?

message RepositoryData {
  string id = 1;
  string storage_namespace = 2;
  string default_branch_id = 3;
  google.protobuf.Timestamp creation_date = 4;
  RepositoryState state = 5;
  string instance_uid = 6;
  bool read_only = 7;
}

Recalculating: I can certainly see >160 bytes per repository, which is indeed just under 25 repositories per RCU, so 4-5 RCUs per keystroke. Obviously not ideal but please strike my original numbers (and I'm editing that comment to clarify the error).
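The recalculation can be checked with a few lines of arithmetic. This is a rough sketch that takes the thread's figure of 1 RCU per 4 KiB as given and ignores consistency mode and key overhead; `rcusPerListing` is a hypothetical helper, and the 200-byte record size is an illustrative assumption.

```go
package main

import "fmt"

// rcusPerListing estimates the read capacity units consumed by scanning
// n records of recordBytes each, at 1 RCU per 4 KiB read, rounded up.
func rcusPerListing(n, recordBytes int) int {
	const bytesPerRCU = 4096
	total := n * recordBytes
	return (total + bytesPerRCU - 1) / bytesPerRCU
}

func main() {
	fmt.Println(rcusPerListing(100, 160)) // 4
	fmt.Println(rcusPerListing(100, 200)) // 5
}
```

With 100 repositories, records of 160 to 200 bytes give 4 to 5 RCUs per listing, matching the estimate above.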

pkg/catalog/catalog_test.go

nadavsteindler · 2024-12-16T13:36:01Z

probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?

@itaigilo pointed out to me that our frontend uses useDebouncedState with a wait time of 300ms, so even if the user types faster, it will generate at most one request per 300ms

I could add some sort of substring cache or index so that we aren't constantly iterating through the KV entries. Do you think it's necessary?

itaigilo

Thanks for the effort, this is a tricky full-stack feature.

Blocking for both the param naming,
And for at least considering a different design of the API.

api/swagger.yml

itaigilo · 2024-12-16T18:47:46Z

api/swagger.yml

@@ -70,6 +70,13 @@ components:
      description: delimiter used to group common prefixes by
      schema:
        type: string
+
+    PaginationSearchString:


Why is it SearchString in some places and search in others?
Please be consistent; it makes params much easier to track as they are passed from the FE to the BE services.

Hmm, well, "search string" is an established term: https://www.techtarget.com/whatis/definition/search-string
Typically, though, query string parameters are shortened to one word. See for example:

api/swagger.yml

pkg/catalog/catalog.go

nadavsteindler · 2024-12-17T15:37:52Z

Thanks for the effort, this is a tricky full-stack feature.

Blocking for both the param naming, And for at least considering a different design of the API.

OK I addressed these issues (see the comments)

arielshaqed

Neat, thanks! Please see comments about it being a search string, not a pagination search string. Now we'll see if anyone was depending on the old behaviour.¹

See XKCD #1172. ↩

api/swagger.yml

pkg/api/controller.go

webui/src/lib/api/index.js

itaigilo

LGTM, nice one.

Approving to unblock,
But please fix Ariel's and my style comments before merging (paginationSearch, setsearch, and console.log).

api/swagger.yml

webui/src/pages/repositories/index.jsx

guy-har · 2024-12-18T08:22:40Z

pkg/catalog/catalog.go

 // In this case, pass the last repository name as 'after' on the next call to ListRepositories
-func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, after string) ([]*Repository, bool, error) {
+func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, searchString, after string) ([]*Repository, bool, error) {


Document searchString

Co-authored-by: Ariel Shaqed (Scolnicov) <[email protected]>

nadavsteindler marked this pull request as draft December 12, 2024 16:43

nadavsteindler self-assigned this Dec 12, 2024

nadavsteindler added the new-feature Issues that introduce new feature or capability label Dec 12, 2024

nadavsteindler added the include-changelog PR description should be included in next release changelog label Dec 12, 2024

nadavsteindler force-pushed the feature/fuzzy-repo-search branch from 1e982e4 to 1178906 Compare December 15, 2024 10:31

nadavsteindler requested review from N-o-Z, itaigilo and guy-har December 15, 2024 15:46

nadavsteindler marked this pull request as ready for review December 15, 2024 15:47

arielshaqed requested changes Dec 15, 2024

guy-har requested changes Dec 16, 2024

nadavsteindler commented Dec 16, 2024

nadavsteindler requested review from arielshaqed and guy-har December 16, 2024 14:30

itaigilo requested changes Dec 16, 2024

nadavsteindler force-pushed the feature/fuzzy-repo-search branch 2 times, most recently from 927f8aa to 8fbb77f Compare December 17, 2024 11:27

nadavsteindler requested a review from itaigilo December 17, 2024 11:50

arielshaqed approved these changes Dec 17, 2024

itaigilo approved these changes Dec 17, 2024

guy-har approved these changes Dec 18, 2024

fuzzy repo search

acc3c06

nadavsteindler and others added 19 commits December 18, 2024 11:15

fuzzy repo search

70204a5

fuzzy repo search

6dedb03

fuzzy repo search

f0267db

fuzzy repo search

18315a7

fuzzy repo search

1fbc35a

moar tests

d012362

moar tests

0cf8b74

api

18e8d00

use new param

324050e

page query param name

85821a5

change name

333a973

Update swagger.yml

f6c7d61

clean code

ce66e8d

moar tests

ce3a825

Apply suggestions from code review

10adbf4

Co-authored-by: Ariel Shaqed (Scolnicov) <[email protected]>

review fixes

5c2a637

yaml fmt

758d170

yaml fmt

9b8be00

code review

4087e10

nadavsteindler force-pushed the feature/fuzzy-repo-search branch from 128270c to 4087e10 Compare December 18, 2024 09:15

nadavsteindler changed the title ~~Feature substring repo search~~ Feature: Repository Search by Substring Dec 18, 2024

nadavsteindler merged commit b60c6c7 into master Dec 18, 2024
43 checks passed

nadavsteindler deleted the feature/fuzzy-repo-search branch December 18, 2024 09:45
