
Use OpenAI embeddings for similarity #646

Open
jkomoros opened this issue Dec 25, 2022 · 8 comments

Comments

@jkomoros
Owner

Use https://beta.openai.com/docs/guides/embeddings/use-cases to calculate similarity.

If an openai_secret_key is provided in config.SECRET.json then it activates embedding-based similarity.

A new cloud function is set up so that when a card's title or body is modified, the card is flagged to have its embedding fetched. (The fetching has to happen server-side to protect the secret key.) By having it be driven off of cards being edited, we can draft off of the firestore permissions to make it hard to abuse our embedding secret key. Getting an embedding for a card that is being live-edited is harder, though.

Fetching embeddings could take a while and could fail, and sometimes we'd need to do many in bulk, so we'll need some kind of implicit queuing system, plus a way to flag cards that have an embedding fetch in flight or need a new one.

The functions that compute the similarity between two cards would then use the embeddings instead. (Maybe have a different type of fingerprint, an EmbeddingFingerprint?)
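
As a rough illustration, the trigger could look something like the sketch below (the cards collection path, the title/body field names, and the embedding_pending flag are assumptions for illustration, not the repo's actual schema):

```ts
import * as functions from 'firebase-functions';

export const flagCardForEmbedding = functions.firestore
  .document('cards/{cardId}')
  .onUpdate(async (change) => {
    const before = change.before.data();
    const after = change.after.data();
    // Only flag the card if a field that feeds the embedding actually changed.
    if (before.title === after.title && before.body === after.body) return;
    // The actual embedding fetch happens in a separate server-side step that
    // holds the OpenAI secret key; here we just mark the card as needing one.
    await change.after.ref.update({embedding_pending: true});
  });
```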

@jkomoros
Owner Author

See also https://github.com/dglazkov/wanderer

@jkomoros
Owner Author

Create a native polymath endpoint instead of needing to export a JSON file and re-save it.

@jkomoros
Owner Author

jkomoros commented Jun 4, 2023

The content that is sent to the embedder should also include the canonical forms of any concept links, to help create semantic connections across multiple synonyms.

@jkomoros
Owner Author

jkomoros commented Oct 18, 2023

A few challenges: the set of embeddings in production will likely be extremely large, and push the renderer closer to an OOM. There will also be many cases where you don't have the embeddings but still want to do something meaningful.

Have a new query filter, meaning, which, if embeddings are available, sorts based on cosine similarity, and if they aren't, falls back to just being an alias for similarity.

One design: have the embeddings in a cloud function. Use hnsw to do the index. Store the index in Cloud Storage, and every time you save a new snapshot, remove old copies beyond some count (so keep a few just in case). Every time the cloud function loads (will cloud functions v2 help the instance be reused more often?) it loads the most recent snapshot. We can use Object Versioning in Cloud Storage, and ifGenerationMatch to check before writing that no edits have been made. If they have, reload the most recent snapshot and try again (up to, say, 3 times). Once the write succeeds, also write the information to the embeddings firestore collection (see below). There should be a clean operation that looks for IDs in the hnsw index that don't have a corresponding firestore entry and deletes them (otherwise there will be items that continually show up in queries and have to be filtered out).
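
A sketch of that guarded snapshot write, using @google-cloud/storage; exactly where the precondition option is attached may differ by client version, so treat this as illustrative only:

```ts
import {Storage} from '@google-cloud/storage';

const MAX_WRITE_ATTEMPTS = 3;

export const saveIndexSnapshot = async (bucketName: string, serializedIndex: Buffer) => {
  const bucket = new Storage().bucket(bucketName);
  for (let attempt = 0; attempt < MAX_WRITE_ATTEMPTS; attempt++) {
    const plain = bucket.file('hnsw-index.bin');
    const [exists] = await plain.exists();
    // A generation of 0 means "only succeed if the object does not exist yet".
    const generation = exists ? Number((await plain.getMetadata())[0].generation) : 0;
    // Attach the precondition so the write fails if another writer saved a
    // newer generation in the meantime.
    const guarded = bucket.file('hnsw-index.bin', {
      preconditionOpts: {ifGenerationMatch: generation},
    });
    try {
      await guarded.save(serializedIndex);
      return;
    } catch {
      // Precondition failed: someone else wrote a newer snapshot. A real
      // implementation would reload it and reapply its own edits before
      // retrying; this sketch just re-checks the generation and retries.
    }
  }
  throw new Error(`Could not save hnsw snapshot after ${MAX_WRITE_ATTEMPTS} attempts`);
};
```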

hnsw doesn't allow saving metadata, so we'll have to do it some other way, including maintaining a mapping from cardID -> hnsw index. That will be stored in a new embeddings firestore collection, keyed off of cardID + embedding_space (allowing new spaces to be added in the future), like c-123-4567+embedding-ada-002. Each record will have the embedding_index, the last_updated date, a version number for the card extraction version, and a snapshot of the embedded text. Every time a card is saved and its content changes, we check whether there is an embedding record, and if there is, whether the text is equivalent. If either isn't true, we kick off an embedding request, store the result in the hnsw index, save a snapshot, and update the embedding record. The card extraction version allows us to experiment with new formats for the text to embed, including just cardPlainText (note: for content cards we'll need to include the title), but also things like including the canonical form of the concept links, as well as a date (which will inherently get a bit of nearby-date similarity overlap?). There also needs to be an operation to kick off creating new embedding entries when a new extraction version is pushed, in addition to the incremental onCardUpdated hook. Make sure the embeddings collection is not allowed to be downloaded to the client (especially since it contains the full embedded text).
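
A sketch of that embedding record as a TypeScript interface; the field names follow the description above but aren't final:

```ts
import {Timestamp} from 'firebase-admin/firestore';

// Document ID: `${cardID}+${embeddingSpace}`, e.g. 'c-123-4567+embedding-ada-002'
export interface EmbeddingRecord {
  // Position of this card's vector in the hnsw index.
  embedding_index: number;
  // When the embedding was last recomputed.
  last_updated: Timestamp;
  // Version of the card -> embedded-text extraction format, so new formats can be rolled out.
  extraction_version: number;
  // Snapshot of the exact text that was embedded, used to skip redundant re-embeds.
  embedded_text: string;
}
```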

There is also an endpoint, which anyone can hit (because it never sends back card content), that you can pass a key card ID and a k, and it will pass back an array of [CARD_ID, similarity] records of the most similar items. (You can pass -1 to mean 'literally every card'.) When you hit the endpoint, it loads up the hnsw index, looks up the embedding_index of the given card ID, fetches the embedding of that item, fetches the k most similar, and then reverses out the card_id of each one before passing it back. You can also pass the endpoint card content instead of a cardID, which is useful for computing the similarity of a card as it is being edited. In the future we can filter out any cards the given user doesn't have access to (to not leak the existence of other cards, and to ensure the list of records passed back isn't, for example, entirely unpublished cards they can't see).

The local filter for meaning will keep track of cached similarity lists for key cards (invalidating the list each time a card is edited). The filter will have a bit of a delay before it gives a result, as it fetches from the endpoint.

The content to be embedded is produced by a function that takes a card and a collection of concept cards and produces a canonical text, with the canonical form of every linked concept card's title appended at the end.
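
A minimal sketch of that canonical-text function, with a simplified card shape assumed for illustration:

```ts
// Simplified card shape for illustration; the real Card type is richer.
type SimpleCard = {
  title: string;
  body: string;
  conceptLinks: string[]; // IDs of linked concept cards
};

export const textForEmbedding = (
  card: SimpleCard,
  conceptsByID: Map<string, SimpleCard>
): string => {
  const conceptTitles = card.conceptLinks
    .map((id) => conceptsByID.get(id)?.title)
    .filter((title): title is string => Boolean(title));
  // Title first, then body, then the canonical concept titles appended at the
  // end, so different synonyms of the same concept embed near each other.
  return [card.title, card.body, ...conceptTitles].join('\n');
};
```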

@jkomoros
Owner Author

jkomoros commented Oct 26, 2023

Just use Qdrant? (open source pinecone alternative)

Yeah, just use Qdrant; there are a ton of database administration tasks that will be too annoying to do by hand. Also, running a DB inside a cloud function without duplicating the service a million times (with more resource use and the possibility of collisions) seems unlikely to work.

Add a qdrant_api_key and qdrant_url to config.SECRET.json. Document how to set them and what they do. (Warn at gulp file generation if the qdrant key is set and the openai key is not.) Also have .GENERATED. include a VECTOR_STORE_ENABLED flag.

Add a client tool that checks if the qdrant_api_key is set, and if so, during deploy checks whether the DB is configured (via collection_info) and, if not, configures it. Configuration creates the named collection (openai.com:text-embedding-ada-002, with dev- prepended in dev_mode) and then adds two indexes, on card_id and version.
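
A sketch of that deploy-time configuration step using the @qdrant/js-client-rest client (exact option names should be checked against the client version in use):

```ts
import {QdrantClient} from '@qdrant/js-client-rest';

// text-embedding-ada-002 produces 1536-dimensional vectors.
const EMBEDDING_SIZE = 1536;

export const ensureCollection = async (client: QdrantClient, devMode: boolean) => {
  const name = (devMode ? 'dev-' : '') + 'openai.com:text-embedding-ada-002';
  try {
    // Roughly the collection_info check described above.
    await client.getCollection(name);
    return; // Already configured.
  } catch {
    // Collection doesn't exist yet; fall through and create it.
  }
  await client.createCollection(name, {
    vectors: {size: EMBEDDING_SIZE, distance: 'Cosine'},
  });
  // The two payload indexes, on card_id and version.
  await client.createPayloadIndex(name, {field_name: 'card_id', field_schema: 'keyword'});
  await client.createPayloadIndex(name, {field_name: 'version', field_schema: 'integer'});
};
```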

The ID for each point is card_id+version (verify Qdrant doesn't literally require a UUID). This means we don't have to keep track of an integer index and which one to use for the next insert. The payload in the Qdrant store is structured like this:

{
  //Indexed
  "card_id": CardID,
  //The version of the content extraction, allowing adding a new one later
  //Indexed
  "version": 0,
  "content" : "<Extracted content>",
  "last_updated": timestamp
}

functions/src/embeddings.ts creates a Qdrant client, if the api_key is configured.

There are three endpoints:

  1. Re-index any missing cards. Fetches all cards, then goes through them one by one to call updateCard. An HTTPS trigger.
  2. updateCardEmbedding, which just calls updateCard. A firestore trigger. Extracts the text content (bailing early if there is no content). Then does a getPoint with the computed ID (or a scroll with a card_id + version filter if the ID has to be a UUID) to fetch the payload, and compares the text content. If the text is the same, quit. If not, compute the embedding and upsert.
  3. The query endpoint. It takes either a card_id or a card to extract from, computes the embedding (or, if it's a card_id, tries to fetch it via getPoint(with_vector)) and then does the search, passing a filter of version=${currentVersion}. Then it extracts the card_id and score and passes those back as tuples. Note that if the IDs are not UUIDs, it doesn't need to fetch the payload or vector, which allows it to request, say, the top 200 similar items reasonably quickly (how high can it go without getting too slow?).

There is also a check during deploy that asks the user if they want to trigger content reindexing (defaulting to false) (or maybe just run it every time, as long as the qdrant_key is configured?).
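
A sketch of the core of the query endpoint (number 3 above), assuming the query vector has already been computed or fetched; the function and constant names are illustrative:

```ts
import {QdrantClient} from '@qdrant/js-client-rest';

const CURRENT_VERSION = 0;

// Returns [card_id, score] tuples for the k points most similar to `vector`,
// restricted to the current extraction version.
export const similarCardIDs = async (
  client: QdrantClient,
  collection: string,
  vector: number[],
  k: number
): Promise<[string, number][]> => {
  const results = await client.search(collection, {
    vector,
    limit: k,
    filter: {must: [{key: 'version', match: {value: CURRENT_VERSION}}]},
    // Only card_id is needed to map a hit back to a card; skip everything else.
    with_payload: ['card_id'],
    with_vector: false,
  });
  return results.map((r) => [String(r.payload?.card_id), r.score]);
};
```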

@jkomoros
Owner Author

jkomoros commented Oct 28, 2023

  • Fix the error where the card ID is wrong on the card object when the embedding calculation runs
  • Include a short date of the card in the embedding (experiment)
  • Consider renaming payload.version to extraction_version (requires updating indexes)
  • Switch to v2 functions for higher timeout
  • On deploy, if qdrant_enabled, hit the reindexCardEmbeddings endpoint
  • Include card_type and card_created timestamp (for visualization)
  • title should come before body
  • date / card type should come at the end
  • set timeout of 540 seconds
  • Include date / card type in the body
  • fetchCardSimilarity should be kicked off by a similarCard with a new keycard
  • Make reindex-card-embeddings not print errors to the CLI (which can interfere with, for example, text editors running in that terminal), but to a log file
  • When a card is first saved, it gets a no-such-embedding error instead of a 'stale-embedding'. That then leads to cards having the same wrong similarity (why is it always the same wrong one?) until the next card is saved.
    • No, that's the content for an empty card. There's a race: when you save the card, it doesn't clear out the cardSimilarity when the card is updated. Or it does, and the problem is that if you save new content within 10 seconds (LAST_UPDATED_EPSILON on the server), the server happily reports the old value as not stale.
    • What happens is you create an (empty) card. It gets a generic embedding and a last_updated timestamp. If you then save it within 10 seconds (LAST_UPDATED_EPSILON) and then reach out for similarCards, it says "well, the embedding I have is new enough" and returns it, even though it will (soon) be updated. The next time the cached similarities are cleared (when the next card is created/updated), the new embedding is fetched and it works great. (See the staleness-check sketch after this list.)
  • While editing new cards, for now just always fall back to tfidf similarity (until we update)
  • Figure out how to protect against ddos kind of attacks on reindexCardEmbeddings endpoint
  • Bail earlier for when cards are changed where the change doesn't include any properties that might affect the embedding content (e.g. concept cards where all that is changed is references_inbound)
  • the card_created field in qdrant.Point is suspiciously low precision
  • Figure out how to get the endpoint for reindexCardEmbeddings given v2 is at a different URL (document, and allow gulp task to get it)
  • Add a lastUpdated to PointPayload
  • When a card is edited, if similarity is fetched just then, it will almost certainly miss the new index value. Maybe when the card is edited add a fetchCardSimilarity for, say, 10 seconds later? Ideally we'd run it immediately after it's available in qdrant, but we don't really have a callback for that.
  • While editing the card, live update the similarity every few seconds (just like the normal pipeline)
  • For cards that are below the ~500 or so similarity cutoff, take the tfidf similarity and smear it (below the lowest similarity from the qdrant pipeline)
  • Experiment with sticking the canonical text of the concepts at the end of the embedding text
  • Have some kind of loading indicator when a similarity request is ongoing so it's not as weird when it pops in and out?
  • If there are TONS of failures in reindexCardEmbedding, break. It should only survive through, say, 5 in a row. (Failures should be rare, like a card's content being too long, and not persistent, like openai being down.)
  • Consider having a clean_old_versions endpoint, or do it automatically on reindexCardEmbeddings?
  • Set the qdrant config functions values (eg cluster_url, api_key) automatically via gulp task, not manually
  • Enable Object Versioning on new buckets when being created in gulp gsutil task
  • When saving hnsw index, fail if the generation doesn't match (ideally we'd then reapply the stuff that has changed since then on the newly fetched index)
  • Use gpt-tok to clip overly long embeddings before embedding
  • make textContentForEmbeddingForCard function be smarter and more like cardPlainContent
  • Make everything work for non-content and working-notes cards
  • Actually run processCard when a card changes
  • Create the endpoint to get the most similar cards to a given card
  • Create the endpoint to get the most similar cards to a given extracted text
  • Wire in the meaning filter type locally and see how it feels
  • Figure out if everything that touches hnsw should be one cloud function that the other things just proxy to (because otherwise every instance of a cloud function that does something with the embedding store might load up a different version and stomp on each other's edits)
  • There's a bug where '?DEFAULT-INVALID-ID?' has been stored on a lot of cards for a long time, likely from refactoring card updates and diffs etc (Many cards in production have id field of ?DEFAULT-INVALID-ID? #672 )
  • Card processing function silently exits if openai key is not set (or is set to change-me sentinel)
  • Have an 'embedCards' endpoint that triggers making embeddings for every card, so it can be run to calculate and store embeddings for all cards.
  • If cardSimilarity is called on a card that exists but does not have an embedding, calculate it on the spot. (This can happen if, for example, an updateCardEmbedding run had an error.)
  • Actually store the hnsw index too
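
For reference, a small sketch of the staleness check behind the `stale-embedding` / LAST_UPDATED_EPSILON race described in the list above. The constant value and names are illustrative, not the repo's actual ones.

```ts
// Illustrative constant; the real value lives in the server code.
const LAST_UPDATED_EPSILON_MS = 1000; // slack when comparing timestamps across machines

// The embedding point counts as fresh only if it was written at (or after) the
// card's last update, within the epsilon. If this returns false, the server
// responds with a `stale-embedding` error and the client retries shortly after,
// instead of caching a similarity list computed from the old (possibly
// empty-card) embedding.
export const embeddingIsFresh = (
  cardUpdatedMillis: number,
  pointLastUpdatedMillis: number
): boolean => pointLastUpdatedMillis + LAST_UPDATED_EPSILON_MS >= cardUpdatedMillis;
```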

jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
A cheapo version of src/util:cardPlainContent.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
They'll only be actually read on the server.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
It mainly checks to see if we already have a card/embedding info stored and if so, stores that.

Not actually used yet.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
This is the first time exercising the current pipeline, and it reveals a few problems, like "invalid-card-id" somehow being in the document?

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
.update will fail for a doc that doesn't exist.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
Part of #646.

This makes them show up in VSCode while editing.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
If provided, then error of code `stale-embedding` will be returned if the embedding is older than the
card.updated.

This will allow the client to determine that the embedding isn't yet updated, and try again. This will
happen, for example, right after a card is saved, before it is re-embedded and stored.

Part of #670. Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
The similarCards.last_updated flow requires cards to have their point updated every time the card.updated
timestamp updates. That updates more often than the embedding changes.

This makes it so the last_updated filter should work.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
…oint's last_updated.

This meant we erroneously thought every embedding was stale.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
…s if the embedding is stale.

This means that right after a card is updated, before the embedding has been updated, it will get the new
similarity soon after it's available, no more than DELAY_FOR_STALE milliseconds after it's available.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
If provided, then for collections that are a preview, it will render an alert icon with that text.

Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
…ards based on local similarity.

This makes it possible to detect if you're seeing the good or bad similarity in a collection.

Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
Before, there could be a case where a card was updated but no embedding would be updated, and you could
wait for a stale-embedding that would never come.

This is an edge case anyway.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
Technically there could be multiple cards to be similar to.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 13, 2023
jkomoros added a commit that referenced this issue Nov 14, 2023
jkomoros added a commit that referenced this issue Nov 14, 2023
There's a current bug where you create a card, and it gets a '' embedding (which returns very generic results). While editing, you see those generic results, even as you update.

In the future we'll be live fetching the actively edited card, but for now just fall back on tfidf pipeline.

Part of #646.
jkomoros added a commit that referenced this issue Nov 14, 2023
Before, it was 10 seconds, allowing a lot of slop between servers. But that broke a common case: create-a-card, and then within 10 seconds paste and save.

The client would then ask for the new similarCards, and it would still be based on the old empty content embedding, and the server would happily report it as current.

That would lead to bizarre similarity for that card (until another card was created/updated, when that cache would be blown away in the client and refetched and would actually be accurate).

Part of #646.
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
This signals "the value isn't current", like a missing entry in CardSimilarityMap would.

Part of #646.
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 17, 2023
If you pass a card object in via REST to similarCards, it is {seconds, nanoseconds} not a literal
timestamp.

Part of #646.
jkomoros added a commit that referenced this issue Nov 18, 2023
Before, the entire payload.content was in memory, but unnecessarily because in the common read path it's
not fetched.

Run `gulp configure-qdrant` to update configuration.

Part of #646.