Use OpenAI embeddings for similarity #646
See also https://github.com/dglazkov/wanderer
Create a native polymath endpoint instead of needing to export a JSON file and resave.
The content that is sent to the embedder should also include the canonical forms of any concept links, to help create semantic connections across multiple synonyms.
A few challenges: the set of embeddings in production will likely be extremely large, and will push the renderer closer to an OOM. There will also be many cases where you don't have the embeddings but still want to do something meaningful, so have a new set of query filters.

One design: host the embeddings in a cloud function. hnsw doesn't allow saving metadata, so we'll have to do it some other way, including maintaining a mapping from cardID -> hnsw index. That will be stored in a new […]. There is also an endpoint, which anyone can hit (because it never sends back card content): you pass it a key card ID and a k, and it passes back an array of […]. The local filter for […].

The content to be embedded is a function that takes a card and a collection of concept cards and produces a canonical text, which includes the canonical form of every concept card's title appended at the end.
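As a rough illustration of the "canonical text" idea above, here is a minimal sketch. The function name and the card/concept field shapes are assumptions for illustration, not the repo's actual API:

```javascript
// Hypothetical sketch: produce the canonical text to embed for a card, with
// the canonical form of every concept card's title appended at the end, so
// that different synonyms for a concept land near each other in embedding
// space. Field names (title, body) are assumptions.
const embeddingTextForCard = (card, conceptCards) => {
  // Deduplicate canonical titles while preserving order.
  const canonicalTitles = [...new Set(conceptCards.map(concept => concept.title))];
  return [card.title, card.body, ...canonicalTitles]
    .filter(piece => piece && piece.trim())
    .join('\n');
};
```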
Just use Qdrant (the open-source Pinecone alternative)? Yeah, just use Qdrant; there are a ton of database administration tasks that would be too annoying to do by hand. Also, if we ran the DB in a cloud function, duplicating the service a million times (with more resource use and possible collisions) would be likely.

Add a […]. Add a client tool that checks if the qdrant_api_key is set, and if so, during the deploy checks whether the DB is configured (via collection_info), and if not, configures it. Configuration creates the named DB collection ([…]). The ID for each point is a […].
There are three endpoints: […]
There is also a check during deploy that asks the user whether they want to trigger content reindexing (defaulting to false). (Or maybe just run it every time, as long as the qdrant_api_key is configured?)
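The deploy-time configuration described above can be sketched as the request bodies such a tool would send. This is a hypothetical sketch: the collection name and payload fields are assumptions, while the body shapes follow Qdrant's REST API (`PUT /collections/{name}` to create a collection, `PUT /collections/{name}/points` to upsert points):

```javascript
// Hypothetical sketch of what a `gulp configure-qdrant` step might send.
const QDRANT_COLLECTION = 'cards'; // assumption
// OpenAI's text-embedding-ada-002 model returns 1536-dimensional vectors.
const EMBEDDING_DIMENSIONS = 1536;

// Body for PUT /collections/{name}: create the collection with cosine distance.
const collectionConfig = () => ({
  vectors: { size: EMBEDDING_DIMENSIONS, distance: 'Cosine' },
});

// Qdrant point IDs must be unsigned integers or UUIDs, so the cardID can't be
// the point ID directly; keep it (and the update time) in the payload instead.
const pointForCard = (pointID, cardID, vector, updatedMillis) => ({
  points: [{
    id: pointID,
    vector,
    payload: { card_id: cardID, updated: updatedMillis },
  }],
});
```

Keeping the cardID in the payload also sidesteps the metadata problem noted earlier for hnsw.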
A cheapo version of src/util:cardPlainContent. Part of #646.
They'll only be actually read on the server. Part of #646.
It mainly checks to see if we already have a card/embedding info stored and if so, stores that. Not actually used yet. Part of #646.
This is the first time exercising the current pipeline, and it reveals a few problems, like "invalid-card-id" somehow being in the document. Part of #646.
.update will fail for a doc that doesn't exist. Part of #646.
Part of #646. This makes them show up in VSCode while editing.
If provided, then an error with code `stale-embedding` will be returned if the embedding is older than card.updated. This allows the client to determine that the embedding isn't yet updated, and try again. This will happen, for example, right after a card is saved, before it is re-embedded and stored. Part of #670. Part of #646.
If provided, then for collections that are a preview, it will render an alert icon with that text. Part of #646.
…ards based on local similarity. This makes it possible to detect whether you're seeing the good or bad similarity in a collection. Part of #646.
…on without an updated timestamp. Part of #646.
There's a current bug where you create a card and it gets a '' embedding (which returns very generic results). While editing, you see those generic results, even as you update. In the future we'll live-fetch the actively edited card, but for now just fall back on the tfidf pipeline. Part of #646.
Before, it was 10 seconds, allowing a lot of slop between servers. But that broke a common case: create-a-card, and then within 10 seconds paste and save. The client would then ask for the new similarCards, and it would still be based on the old empty content embedding, and the server would happily report it as current. That would lead to bizarre similarity for that card (until another card was created/updated, when that cache would be blown away in the client and refetched and would actually be accurate). Part of #646.
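The fix described above, removing the slop window, can be sketched as a tiny server-side check. The function name and return shape are assumptions; the `stale-embedding` error code is the one the issue describes:

```javascript
// Hypothetical sketch of the currency check after removing the 10-second slop
// window: an embedding is current only if its timestamp is at least as new as
// card.updated (both in epoch milliseconds here).
const checkEmbeddingCurrency = (embeddingUpdatedMillis, cardUpdatedMillis) => {
  if (embeddingUpdatedMillis < cardUpdatedMillis) {
    // Tells the client the value isn't current yet, so it should retry.
    return { current: false, error: 'stale-embedding' };
  }
  return { current: true, error: null };
};
```

With the old 10-second tolerance, the create-then-paste-and-save case would have passed this check and returned a stale result.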
This signals "the value isn't current", like a missing entry in CardSimilarityMap would. Part of #646.
…Which can totally happen, right? Part of #646.
If you pass a card object in via REST to similarCards, it is {seconds, nanoseconds} not a literal timestamp. Part of #646.
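A small normalization helper handles the mismatch described above. This is a hypothetical sketch (the helper name is an assumption); it reflects the fact that Firestore Timestamps don't survive JSON serialization, so a REST payload may carry a plain `{seconds, nanoseconds}` object instead:

```javascript
// Hypothetical helper: normalize either a numeric epoch-milliseconds timestamp
// or a serialized {seconds, nanoseconds} object to epoch milliseconds, so the
// two can be compared uniformly.
const timestampToMillis = (ts) => {
  if (typeof ts === 'number') return ts;
  if (ts && typeof ts.seconds === 'number') {
    return ts.seconds * 1000 + Math.floor((ts.nanoseconds || 0) / 1e6);
  }
  throw new Error('Unrecognized timestamp shape: ' + JSON.stringify(ts));
};
```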
Before, the entire payload.content was kept in memory, unnecessarily, because it isn't fetched in the common read path. Run `gulp configure-qdrant` to update configuration. Part of #646.
Use https://beta.openai.com/docs/guides/embeddings/use-cases to calculate similarity.
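The linked OpenAI guide recommends cosine similarity between embedding vectors; a minimal implementation looks like this. (OpenAI embeddings are normalized to length 1, so a plain dot product gives the same ranking, but this form works for arbitrary vectors.)

```javascript
// Cosine similarity between two equal-length vectors: the dot product divided
// by the product of the vector magnitudes. Returns a value in [-1, 1].
const cosineSimilarity = (a, b) => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
```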
If an `openai_secret_key` is provided in `config.SECRET.json`, then it activates embedding-based similarity. A new cloud function is set up so that when a card's title or body is modified, it's flagged to have its embedding fetched. (The fetching has to happen server-side to protect the secret key.) By having it be driven off of cards being edited, we can draft off of the firestore permissions to make it hard to abuse our embedding secret key. Getting an embedding for a live-editing card is harder, though.
Fetching embeddings could take a while and could fail, and sometimes we'd need to do many in bulk, so we'll need some kind of implicit queuing system, plus a way to flag cards that have an embedding fetch in flight, or that need a new one.
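The selection step of such an implicit queue might look like the sketch below. All names here are assumptions for illustration: given the cards, the stored per-card embedding timestamps, and the set of cards whose fetch is already in flight, pick which cards to (re)embed:

```javascript
// Hypothetical sketch: decide which cards need an embedding (re)fetch.
// embeddingUpdatedByID maps card ID -> millis of the last stored embedding;
// inFlightIDs is a Set of card IDs whose fetch is already underway.
const cardsNeedingEmbedding = (cards, embeddingUpdatedByID, inFlightIDs) =>
  cards
    .filter(card => !inFlightIDs.has(card.id))
    .filter(card => {
      const embeddedAt = embeddingUpdatedByID[card.id];
      // Never embedded, or the card changed since the last embedding.
      return embeddedAt === undefined || embeddedAt < card.updated;
    })
    .map(card => card.id);
```

A bulk reindex is then just calling this with an empty timestamp map, and failed fetches can be retried simply by removing them from the in-flight set.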
The function that computes the similarity between two cards instead uses the embeddings. (Maybe have a different type of fingerprint, an EmbeddingFingerprint?)