Client Side Embeddings Search #119

Open
0x4007 opened this issue Oct 10, 2024 · 19 comments · May be fixed by #149

Comments

@0x4007
Member

0x4007 commented Oct 10, 2024

Perhaps we can improve our search experience by:

  1. Loading all the vector embeddings of every issue and its associated comments from our database
  2. Running a similarity search (e.g. cosine) to rank the most relevant issues

If running all of these calculations performs poorly, we could potentially compile to Wasm.
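
The two steps above can be sketched in TypeScript as follows. This is only a minimal illustration, assuming the embeddings are already loaded into memory as plain number arrays; the `EmbeddedIssue` shape and the function names are hypothetical and not taken from the project.

```typescript
// Hypothetical shape of an issue with a precomputed embedding (assumption).
interface EmbeddedIssue {
  id: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every issue against the query embedding and sorts best match first.
function rankBySimilarity(
  queryEmbedding: number[],
  issues: EmbeddedIssue[]
): { id: string; score: number }[] {
  return issues
    .map((issue) => ({ id: issue.id, score: cosineSimilarity(queryEmbedding, issue.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

A linear scan like this is O(n · d) per query for n issues and d dimensions, so it may be worth measuring in plain TypeScript before deciding whether Wasm is needed.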

@0x4007
Member Author

0x4007 commented Oct 10, 2024

@sshivaditya2019 rfc on time estimate and spec

@sshivaditya2019

sshivaditya2019 commented Oct 23, 2024

This can be accomplished using the natural library [1], which is highly optimized. The main challenge would be generating the 1024-dimensional embeddings on the client side.

Rather than retrieving embeddings from the database, we could use the wink-js embedding model [2] to generate embeddings for both the query and the entries. These embeddings could be computed at load time and then used in the search process, though this could increase page load time by 15 to 35 seconds, or more in some cases.

Vector-based search may not be particularly beneficial here; instead, heuristic-based retrieval methods, such as NDCG, along with a more effective search algorithm, would likely yield better results.

Footnotes

  1. https://github.com/NaturalNode/natural

  2. https://winkjs.org/
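
As a rough illustration of what a heuristic-based search could look like, here is a minimal keyword-scoring sketch. It is not the approach from the linked pull request: the `SearchableIssue` shape, the weights, and the function names are all assumptions chosen for the example.

```typescript
// Hypothetical issue shape for the example (assumption).
interface SearchableIssue {
  id: string;
  title: string;
  body: string;
}

// Arbitrary illustrative weights: a title match counts more than a body match.
const TITLE_WEIGHT = 3;
const BODY_WEIGHT = 1;

// Lowercases the text and splits it on any non-alphanumeric character.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Sums the weights of every query token found in the title or the body.
function scoreIssue(queryTokens: string[], issue: SearchableIssue): number {
  const titleTokens = new Set(tokenize(issue.title));
  const bodyTokens = new Set(tokenize(issue.body));
  let score = 0;
  for (const token of queryTokens) {
    if (titleTokens.has(token)) score += TITLE_WEIGHT;
    if (bodyTokens.has(token)) score += BODY_WEIGHT;
  }
  return score;
}

// Returns matching issues, highest score first; issues with no match are dropped.
function heuristicSearch(query: string, issues: SearchableIssue[]): SearchableIssue[] {
  const queryTokens = tokenize(query);
  return issues
    .map((issue) => ({ issue, score: scoreIssue(queryTokens, issue) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.issue);
}
```

Unlike the embedding approach, this needs no model download at load time, at the cost of only matching exact tokens.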

@0x4007
Member Author

0x4007 commented Oct 24, 2024

Let's go with your recommendation.

@sshivaditya2019

I think it will take around a day to set up the heuristic-based search functionality.

I'm not sure if there's a gating mechanism for tasks or something similar, but I can incorporate that into this task to create an integrated task recommender, if that's a requirement. @0x4007 rfc

@0x4007
Member Author

0x4007 commented Oct 25, 2024

> I think it will take around a day to set up the heuristic-based search functionality.
>
> I'm not sure if there's a gating mechanism for tasks or something similar,

This is not implemented anywhere yet. However, it will soon be implemented based on contributor/collaborator status and priority level (or time level).

But that will only be on GitHub and not our UI I think. We still need to figure that out.

> but I can incorporate that into this task to create an integrated task recommender, if that's a requirement. @0x4007 rfc

An integrated task recommender sounds very cool at the UI level. I'm on board with exploring this, although the implementation details are not clear to me right now.

@sshivaditya2019

/start

@sshivaditya2019

@0x4007 could you assign this issue to me?

Contributor

ubiquity-os bot commented Oct 26, 2024

@sshivaditya2019 the deadline is at Sun, Oct 27, 5:30 PM UTC

@0x4007
Member Author

0x4007 commented Oct 26, 2024

/start

@gentlementlegen Not working again

Start officially is our most unreliable plugin

@gentlementlegen
Member

Error was

{ "message": "Validation Failed", "errors": [ { "message": "The listed users cannot be searched either because the users do not exist or you do not have permission to view the users.", "resource": "Search", "field": "q", "code": "invalid" } ], "documentation_url": "https://docs.github.com/v3/search/", "status": "422" }

with the search arguments like

{ "q": "org:ubiquity author:sshivaditya2019 state:open", "per_page": 100, "order": "desc", "sort": "created" }

URL for reference
https://api.github.com/search/issues?q=org%3Aubiquity%20author%3Asshivaditya2019%20state%3Aopen&per_page=100&order=desc&sort=created
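
For reference, the request and the failure above can be reproduced in isolation with a small sketch like the one below. It assumes only the search arguments and error body quoted in this comment; the type and function names are hypothetical and are not the plugin's actual code.

```typescript
// Search arguments as quoted in this comment.
interface IssueSearchParams {
  q: string;
  per_page: number;
  order: "asc" | "desc";
  sort: string;
}

// Shape of the error body quoted in this comment.
interface GitHubErrorBody {
  message: string;
  errors?: { message: string; resource: string; field: string; code: string }[];
  documentation_url?: string;
  status?: string;
}

// Builds the REST search URL; encodeURIComponent encodes ":" as %3A and spaces as %20.
function buildIssueSearchUrl(params: IssueSearchParams): string {
  const query = [
    `q=${encodeURIComponent(params.q)}`,
    `per_page=${params.per_page}`,
    `order=${params.order}`,
    `sort=${params.sort}`,
  ].join("&");
  return `https://api.github.com/search/issues?${query}`;
}

// True when the body matches the "Validation Failed" error on the "q" field of a search.
function isInvalidSearchQueryError(body: GitHubErrorBody): boolean {
  return (
    body.message === "Validation Failed" &&
    (body.errors ?? []).some(
      (error) => error.resource === "Search" && error.field === "q" && error.code === "invalid"
    )
  );
}
```

Detecting this specific error shape would let the plugin distinguish an unsearchable user from other 422 responses.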

@0x4007
Member Author

0x4007 commented Oct 27, 2024

Okay, you should figure out the root problem and fix it.

@Keyrxng
Member

Keyrxng commented Oct 27, 2024

> Okay, you should figure out the root problem and fix it.

I've mentioned this before regarding user privacy settings affecting our attempts via GQL and REST. The root problem is that shiv's account privacy settings are restricted, which unfortunately we don't control.

So perhaps we should just assume defaults in this situation and apply the lowest contributor limits. We could then use an alternative search query for PRs/issues in the network and filter the results by their username, since those would be public under our org settings. I assume it's the assigned issues query that caused it here.
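
A minimal sketch of that fallback, assuming the network-wide results have already been fetched: the `NetworkItem` shape, the `LOWEST_TASK_LIMIT` value, and the function names are hypothetical placeholders, not the plugin's real configuration.

```typescript
// Hypothetical, simplified shape of an issue or PR returned by a network-wide query.
interface NetworkItem {
  url: string;
  authorLogin: string;
}

// Hypothetical placeholder for the lowest contributor limit (assumption).
const LOWEST_TASK_LIMIT = 1;

// Filters network-wide results down to one user; GitHub logins are case-insensitive.
function filterByUsername(items: NetworkItem[], username: string): NetworkItem[] {
  const wanted = username.toLowerCase();
  return items.filter((item) => item.authorLogin.toLowerCase() === wanted);
}

// Falls back to the lowest limit when the user's data could not be read.
function resolveTaskLimit(knownLimit: number | null): number {
  return knownLimit ?? LOWEST_TASK_LIMIT;
}
```

This moves the user filter from the search query, where it fails, to the client, at the cost of fetching more results than needed.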

@0x4007
Member Author

0x4007 commented Oct 27, 2024

If it's something the contributor can fix, then the solution is to write a detailed error explaining that they can't self-assign until they fix their settings, explain exactly what to fix, and provide a link to where they can fix it.
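
One way such an error could be assembled is sketched below. The wording, the `SelfAssignErrorInput` shape, and the function name are assumptions; the exact setting to change and the link are passed in as parameters because this thread does not establish them.

```typescript
// Hypothetical inputs for the error comment (assumption).
interface SelfAssignErrorInput {
  username: string;
  whatToFix: string; // exact instruction, supplied by the caller
  settingsUrl: string; // link to where the setting can be changed, supplied by the caller
}

// Builds a comment with the three parts requested: the reason, the fix, and the link.
function buildSelfAssignError(input: SelfAssignErrorInput): string {
  return [
    `@${input.username} you cannot self-assign until you update your account settings.`,
    `What to fix: ${input.whatToFix}`,
    `Where to fix it: ${input.settingsUrl}`,
  ].join("\n");
}
```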

@gentlementlegen
Member

It is still weird to me that the user privacy affects a search because the profile is public. Can we consider using GQL with issues search instead of the search API? Something like

query($organization: String!, $author: String!) {
  organization(login: $organization) {
    repositories(first: 100) {
      nodes {
        issues(first: 100, states: OPEN, filterBy: {createdBy: $author}) {
          nodes {
            title
            url
            createdAt
          }
        }
      }
    }
  }
}

with

{
  "organization": "ubiquity",
  "author": "sshivaditya2019"
}

would achieve the same result. I don't know if that would resolve the issue but it's worth a try.
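
If this query were adopted, the calling code might look roughly like the sketch below. It only builds the request body (to be sent as a POST to https://api.github.com/graphql with an authorization header) and flattens the nested response; the type and function names are hypothetical, and it does not address pagination beyond the first 100 repositories and issues.

```typescript
// The query proposed above, unchanged.
const ORG_ISSUES_QUERY = `
query($organization: String!, $author: String!) {
  organization(login: $organization) {
    repositories(first: 100) {
      nodes {
        issues(first: 100, states: OPEN, filterBy: {createdBy: $author}) {
          nodes {
            title
            url
            createdAt
          }
        }
      }
    }
  }
}`;

interface IssueNode {
  title: string;
  url: string;
  createdAt: string;
}

// Response shape implied by the query; organization is null when access is denied.
interface OrgIssuesResponse {
  data: {
    organization: {
      repositories: { nodes: { issues: { nodes: IssueNode[] } }[] };
    } | null;
  } | null;
}

// Serializes the query and its variables into a GraphQL request body.
function buildRequestBody(organization: string, author: string): string {
  return JSON.stringify({ query: ORG_ISSUES_QUERY, variables: { organization, author } });
}

// Collects the issues of every repository into a single flat list.
function flattenIssues(response: OrgIssuesResponse): IssueNode[] {
  const repositories = response.data?.organization?.repositories.nodes ?? [];
  const issues: IssueNode[] = [];
  for (const repository of repositories) {
    issues.push(...repository.issues.nodes);
  }
  return issues;
}
```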

@0x4007
Member Author

0x4007 commented Oct 28, 2024

You can test and verify pretty quickly. I suggest you do that and let us know.

@Keyrxng
Member

Keyrxng commented Oct 28, 2024

Using the GraphQL Explorer, authenticated with my own login:

{
  "data": {
    "organization": null
  },
  "errors": [
    {
      "type": "FORBIDDEN",
      "path": [
        "organization",
        "repositories"
      ],
      "extensions": {
        "saml_failure": false
      },
      "locations": [
        {
          "line": 3,
          "column": 5
        }
      ],
      "message": "Although you appear to have the correct authorization credentials, the `ubiquity` organization has enabled OAuth App access restrictions, meaning that data access to third-parties is limited. For more information on these restrictions, including how to enable this app, visit https://docs.github.com/articles/restricting-access-to-your-organization-s-data/"
    }
  ]
}
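
A response like this has `data.organization` set to null and the reason carried in `errors`, so calling code would need to check both. A minimal sketch of that check, with hypothetical type and function names and covering only the fields visible in this response:

```typescript
// Only the error fields visible in the response above.
interface GraphQlError {
  type?: string;
  path?: string[];
  message: string;
}

interface GraphQlResponse<T> {
  data: T | null;
  errors?: GraphQlError[];
}

// Returns the first FORBIDDEN error, or undefined if the response has none.
function findForbiddenError<T>(response: GraphQlResponse<T>): GraphQlError | undefined {
  return (response.errors ?? []).find((error) => error.type === "FORBIDDEN");
}
```

Surfacing `message` from the returned error would show the OAuth App restriction directly instead of a generic failure.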

@0x4007
Member Author

0x4007 commented Oct 28, 2024

If OAuth app access is required to read user data, let's use the app for logging in on devpool.directory.

The error can explain that the user needs to sign in on devpool.directory if there is a problem reading their data.

@gentlementlegen
Member

It'd be better to just test locally. Sorry, I didn't have time to do so today.

@Keyrxng
Member

Keyrxng commented Oct 28, 2024

> It'd be better to just test locally, sorry didn't have time to do so today.

Aye, it likely would be, sorry bud.

sshivaditya2019 linked a pull request on Oct 31, 2024 that will close this issue