
Use OpenAI embeddings for similarity #646

Open
jkomoros opened this issue Dec 25, 2022 · 8 comments

Comments

@jkomoros
Owner

Use https://beta.openai.com/docs/guides/embeddings/use-cases to calculate similarity.

If an openai_secret_key is provided in config.SECRET.json then it activates embedding-based similarity.

A new cloud function is set up so that when a card's title or body is modified, the card is flagged to have its embedding fetched. (The fetching has to happen server-side to protect the secret key.) By having it be driven off of cards being edited, we can draft off of the firestore permissions to make it hard to abuse our embedding secret key. Getting an embedding for a card that is being live-edited is harder, though.

Fetching embeddings could take a while and could fail, and sometimes we'd need to do many in bulk, so we'll need some kind of implicit queuing system, plus a way to flag cards that have an embedding fetch in flight or need a new one.

The functions that compute the similarity between two cards would then use the embeddings instead. (Maybe have a different type of fingerprint, an EmbeddingFingerprint?)
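
As a rough illustration, the trigger could look something like the sketch below (the cards collection path, the title/body field names, and the embedding_pending flag are assumptions for illustration, not the repo's actual schema):

```ts
import * as functions from 'firebase-functions';

export const flagCardForEmbedding = functions.firestore
  .document('cards/{cardId}')
  .onUpdate(async (change) => {
    const before = change.before.data();
    const after = change.after.data();
    // Only flag the card if a field that feeds the embedding actually changed.
    if (before.title === after.title && before.body === after.body) return;
    // The actual embedding fetch happens in a separate server-side step that
    // holds the OpenAI secret key; here we just mark the card as needing one.
    await change.after.ref.update({embedding_pending: true});
  });
```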

@jkomoros
Owner Author

See also https://github.com/dglazkov/wanderer

@jkomoros
Owner Author

Create a native polymath endpoint instead of needing to export a JSON file and re-save it.

@jkomoros
Owner Author

jkomoros commented Jun 4, 2023

The content that is sent to the embedder should also include the canonical forms of any concept links, to help create semantic connections across multiple synonyms.

@jkomoros
Owner Author

jkomoros commented Oct 18, 2023

A few challenges: the set of embeddings in production will likely be extremely large, and push the renderer closer to an OOM. There will also be many cases where you don't have the embeddings but still want to do something meaningful.

Have a new query filter, meaning, which, if embeddings are available, sorts based on cosine similarity, and if they aren't, falls back to just being an alias for similarity.

One design: have the embeddings in a cloud function. Use hnsw to do the index. Store the index in Cloud Storage, and every time you save a new snapshot, remove old copies beyond some count (so keep a few just in case). Every time the cloud function loads (will cloud functions v2 help the instance be reused more often?) it loads the most recent snapshot. We can use Object Versioning in Cloud Storage, and ifGenerationMatch to check before writing that no edits have been made. If they have, reload the most recent snapshot and try again (up to, say, 3 times). Once the write succeeds, also write the information to the embeddings firestore collection (see below). There should be a clean operation that looks for IDs in the hnsw index that don't have a corresponding firestore entry and deletes them (otherwise there will be items that continually show up in queries and have to be filtered out).
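
A sketch of that guarded snapshot write, using @google-cloud/storage; exactly where the precondition option is attached may differ by client version, so treat this as illustrative only:

```ts
import {Storage} from '@google-cloud/storage';

const MAX_WRITE_ATTEMPTS = 3;

export const saveIndexSnapshot = async (bucketName: string, serializedIndex: Buffer) => {
  const bucket = new Storage().bucket(bucketName);
  for (let attempt = 0; attempt < MAX_WRITE_ATTEMPTS; attempt++) {
    const plain = bucket.file('hnsw-index.bin');
    const [exists] = await plain.exists();
    // A generation of 0 means "only succeed if the object does not exist yet".
    const generation = exists ? Number((await plain.getMetadata())[0].generation) : 0;
    // Attach the precondition so the write fails if another writer saved a
    // newer generation in the meantime.
    const guarded = bucket.file('hnsw-index.bin', {
      preconditionOpts: {ifGenerationMatch: generation},
    });
    try {
      await guarded.save(serializedIndex);
      return;
    } catch {
      // Precondition failed: someone else wrote a newer snapshot. A real
      // implementation would reload it and reapply its own edits before
      // retrying; this sketch just re-checks the generation and retries.
    }
  }
  throw new Error(`Could not save hnsw snapshot after ${MAX_WRITE_ATTEMPTS} attempts`);
};
```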

hnsw doesn't allow saving metadata, so we'll have to do it some other way, including maintaining a mapping from cardID -> hnsw index. That will be stored in a new embeddings firestore collection, keyed off of cardID + embedding_space (allowing new spaces to be added in the future), like c-123-4567+embedding-ada-002. Each record will have the embedding_index, the last_updated date, a version number for the card extraction version, and a snapshot of the embedded text. Every time a card is saved and its content changes, we check whether there is an embedding record, and if there is, whether the text is equivalent. If either isn't true, we kick off an embedding request, store the result in the hnsw index, save a snapshot, and update the embedding record. The card extraction version allows us to experiment with new formats for the text to embed, including just cardPlainText (note: for content cards we'll need to include the title), but also things like including the canonical form of the concept links, as well as a date (which will inherently get a bit of nearby-date similarity overlap?). There also needs to be an operation to kick off creating new embedding entries when a new extraction version is pushed, in addition to the incremental onCardUpdated hook. Make sure the embeddings collection is not allowed to be downloaded to the client (especially since it contains the full embedded text).
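
A sketch of that embedding record as a TypeScript interface; the field names follow the description above but aren't final:

```ts
import {Timestamp} from 'firebase-admin/firestore';

// Document ID: `${cardID}+${embeddingSpace}`, e.g. 'c-123-4567+embedding-ada-002'
export interface EmbeddingRecord {
  // Position of this card's vector in the hnsw index.
  embedding_index: number;
  // When the embedding was last recomputed.
  last_updated: Timestamp;
  // Version of the card -> embedded-text extraction format, so new formats can be rolled out.
  extraction_version: number;
  // Snapshot of the exact text that was embedded, used to skip redundant re-embeds.
  embedded_text: string;
}
```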

There is also an endpoint, which anyone can hit (because it never sends back card content), that you can pass a key card ID and a k, and it will pass back an array of [CARD_ID, similarity] records of the most similar items. (You can pass -1 to mean 'literally every card'.) When you hit the endpoint, it loads up the hnsw index, looks up the embedding_index of the given card ID, fetches the embedding of that item, fetches the k most similar, and then reverses out the card_id of each one before passing it back. You can also pass the endpoint card content instead of a cardID, which is useful for computing the similarity of a card as it is being edited. In the future we can filter out any cards the given user doesn't have access to (to not leak the existence of other cards, and to ensure the list of records passed back isn't, for example, entirely unpublished cards they can't see).

The local filter for meaning will keep track of cached similarity lists for key cards (invalidating the list each time a card is edited). The filter will have a bit of a delay before it gives a result, as it fetches from the endpoint.

The content to be embedded is produced by a function that takes a card and a collection of concept cards and produces a canonical text, with the canonical form of every linked concept card's title appended at the end.
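
A minimal sketch of that canonical-text function, with a simplified card shape assumed for illustration:

```ts
// Simplified card shape for illustration; the real Card type is richer.
type SimpleCard = {
  title: string;
  body: string;
  conceptLinks: string[]; // IDs of linked concept cards
};

export const textForEmbedding = (
  card: SimpleCard,
  conceptsByID: Map<string, SimpleCard>
): string => {
  const conceptTitles = card.conceptLinks
    .map((id) => conceptsByID.get(id)?.title)
    .filter((title): title is string => Boolean(title));
  // Title first, then body, then the canonical concept titles appended at the
  // end, so different synonyms of the same concept embed near each other.
  return [card.title, card.body, ...conceptTitles].join('\n');
};
```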

@jkomoros
Owner Author

jkomoros commented Oct 26, 2023

Just use Qdrant? (open source pinecone alternative)

Yeah, just use Qdrant; there are a ton of database administration tasks that will be too annoying to do by hand. Also, running a DB inside a cloud function without duplicating the service a million times (with more resource use and the possibility of collisions) seems unlikely to work.

Add a qdrant_api_key and qdrant_url to config.SECRET.json. Document how to set them and what they do. (Warn at gulp file generation if the qdrant key is set and the openai key is not.) Also have .GENERATED. include a VECTOR_STORE_ENABLED flag.

Add a client tool that checks if the qdrant_api_key is set, and if so, during deploy checks whether the DB is configured (via collection_info) and, if not, configures it. Configuration creates the named collection (openai.com:text-embedding-ada-002, with dev- prepended in dev_mode) and then adds two indexes, on card_id and version.
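
A sketch of that deploy-time configuration step using the @qdrant/js-client-rest client (exact option names should be checked against the client version in use):

```ts
import {QdrantClient} from '@qdrant/js-client-rest';

// text-embedding-ada-002 produces 1536-dimensional vectors.
const EMBEDDING_SIZE = 1536;

export const ensureCollection = async (client: QdrantClient, devMode: boolean) => {
  const name = (devMode ? 'dev-' : '') + 'openai.com:text-embedding-ada-002';
  try {
    // Roughly the collection_info check described above.
    await client.getCollection(name);
    return; // Already configured.
  } catch {
    // Collection doesn't exist yet; fall through and create it.
  }
  await client.createCollection(name, {
    vectors: {size: EMBEDDING_SIZE, distance: 'Cosine'},
  });
  // The two payload indexes, on card_id and version.
  await client.createPayloadIndex(name, {field_name: 'card_id', field_schema: 'keyword'});
  await client.createPayloadIndex(name, {field_name: 'version', field_schema: 'integer'});
};
```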

The ID for each point is card_id+version (verify Qdrant doesn't literally require a UUID). This means we don't have to keep track of an integer index and which one to use for the next insert. The payload in the Qdrant store is structured like this:

{
  //Indexed
  "card_id": CardID,
  //The version of the content extraction, allowing adding a new one later
  //Indexed
  "version": 0,
  "content" : "<Extracted content>",
  "last_updated": timestamp
}

functions/src/embeddings.ts creates a Qdrant client, if the api_key is configured.

There are three endpoints:

  1. Re-index any missing cards. Fetches all cards, then goes through them one by one to call updateCard. An HTTPS trigger.
  2. updateCardEmbedding, which just calls updateCard. A firestore trigger. Extracts the text content (bailing early if there is no content). Then does a getPoint with the computed ID (or a scroll with a card_id + version filter if the ID has to be a UUID) to fetch the payload, and compares the text content. If the text is the same, quit. If not, compute the embedding and upsert.
  3. The query endpoint. It takes either a card_id or a card to extract from, computes the embedding (or, if it's a card_id, tries to fetch it via getPoint(with_vector)) and then does the search, passing a filter of version=${currentVersion}. Then it extracts the card_id and score and passes those back as tuples. Note that if the IDs are not UUIDs, it doesn't need to fetch the payload or vector, which allows it to request, say, the top 200 similar items reasonably quickly (how high can it go without getting too slow?).

There is also a check during deploy that asks the user if they want to trigger content reindexing (defaulting to false) (or maybe just run it every time, as long as the qdrant_key is configured?).
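
A sketch of the core of the query endpoint (number 3 above), assuming the query vector has already been computed or fetched; the function and constant names are illustrative:

```ts
import {QdrantClient} from '@qdrant/js-client-rest';

const CURRENT_VERSION = 0;

// Returns [card_id, score] tuples for the k points most similar to `vector`,
// restricted to the current extraction version.
export const similarCardIDs = async (
  client: QdrantClient,
  collection: string,
  vector: number[],
  k: number
): Promise<[string, number][]> => {
  const results = await client.search(collection, {
    vector,
    limit: k,
    filter: {must: [{key: 'version', match: {value: CURRENT_VERSION}}]},
    // Only card_id is needed to map a hit back to a card; skip everything else.
    with_payload: ['card_id'],
    with_vector: false,
  });
  return results.map((r) => [String(r.payload?.card_id), r.score]);
};
```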

@jkomoros
Owner Author

jkomoros commented Oct 28, 2023

  • Fix the error where the card ID is wrong on the card object when the embedding calculation runs
  • Include a short date of the card in the embedding (experiment)
  • Consider renaming payload.version to extraction_version (requires updating indexes)
  • Switch to v2 functions for higher timeout
  • On deploy, if qdrant_enabled, hit the reindexCardEmbeddings endpoint
  • Include card_type and card_created timestamp (for visualization)
  • title should come before body
  • date / card type should come at the end
  • set timeout of 540 seconds
  • Include date / card type in the body
  • fetchCardSimilarity should be kicked off by a similarCard with a new keycard
  • Make reindex-card-embeddings not print errors to the CLI (which can interfere with, for example, text editors running in that terminal), but to a log file
  • When a card is first saved, it gets a no-such-embedding error instead of a 'stale-embedding'. That then leads to cards having the same wrong similarity (why is it always the same wrong one?) until the next card is saved.
    • No, that's the content for an empty card. There's a race: when you save the card, it doesn't clear out the cardSimilarity when the card is updated. Or it does, and the problem is that if you save new content within 10 seconds (LAST_UPDATED_EPSILON on the server), the server happily reports the old value as not stale.
    • What happens is you create an (empty) card. It gets a generic embedding and a last_updated timestamp. If you then save it within 10 seconds (LAST_UPDATED_EPSILON) and then reach out for similarCards, it says "well, the embedding I have is new enough" and returns it, even though it will (soon) be updated. The next time the cached similarities are cleared (when the next card is created/updated), the new embedding is fetched and it works great. (See the staleness-check sketch after this list.)
  • While editing new cards, for now just always fall back to tfidf similarity (until we update)
  • Figure out how to protect against ddos kind of attacks on reindexCardEmbeddings endpoint
  • Bail earlier for when cards are changed where the change doesn't include any properties that might affect the embedding content (e.g. concept cards where all that is changed is references_inbound)
  • the card_created field in qdrant.Point is suspiciously low precision
  • Figure out how to get the endpoint for reindexCardEmbeddings given v2 is at a different URL (document, and allow gulp task to get it)
  • Add a lastUpdated to PointPayload
  • When a card is edited, if similarity is fetched just then, it will almost certainly miss the new index value. Maybe when the card is edited add a fetchCardSimilarity for, say, 10 seconds later? Ideally we'd run it immediately after it's available in qdrant, but we don't really have a callback for that.
  • While editing the card, live update the similarity every few seconds (just like the normal pipeline)
  • For cards that are below the ~500 or so similarity cutoff, take the tfidf similarity and smear it (below the lowest similarity from the qdrant pipeline)
  • Experiment with sticking the canonical text of the concepts at the end of the embedding text
  • Have some kind of loading indicator when a similarity request is ongoing so it's not as weird when it pops in and out?
  • If there are TONS of failures in reindexCardEmbedding, break. It should only survive through, say, 5 in a row. (Failures should be rare, like a card's content being too long, and not persistent, like openai being down.)
  • Consider having a clean_old_versions endpoint, or do it automatically on reindexCardEmbeddings?
  • Set the qdrant config functions values (eg cluster_url, api_key) automatically via gulp task, not manually
  • Enable Object Versioning on new buckets when being created in gulp gsutil task
  • When saving hnsw index, fail if the generation doesn't match (ideally we'd then reapply the stuff that has changed since then on the newly fetched index)
  • Use gpt-tok to clip overly long embeddings before embedding
  • make textContentForEmbeddingForCard function be smarter and more like cardPlainContent
  • Make everything work for non-content and working-notes cards
  • Actually run processCard when a card changes
  • Create the endpoint to get the most similar cards to a given card
  • Create the endpoint to get the most similar cards to a given extracted text
  • Wire in the meaning filter type locally and see how it feels
  • Figure out if everything that touches hnsw should be one cloud function that the other things just proxy to (because otherwise every instance of a cloud function that does something with the embedding store might load up a different version and stomp on each other's edits)
  • There's a bug where '?DEFAULT-INVALID-ID?' has been stored on a lot of cards for a long time, likely from refactoring card updates and diffs etc (Many cards in production have id field of ?DEFAULT-INVALID-ID? #672 )
  • Card processing function silently exits if openai key is not set (or is set to change-me sentinel)
  • Have an 'embedCards' endpoint that triggers making embeddings for every card, so it can be run to calculate and store embeddings for all cards.
  • If cardSimilarity is called on a card that exists but does not have an embedding, calculate it on the spot. (This can happen if, for example, an updateCardEmbedding run had an error.)
  • Actually store the hnsw index too
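
For reference, a small sketch of the staleness check behind the `stale-embedding` / LAST_UPDATED_EPSILON race described in the list above. The constant value and names are illustrative, not the repo's actual ones.

```ts
// Illustrative constant; the real value lives in the server code.
const LAST_UPDATED_EPSILON_MS = 1000; // slack when comparing timestamps across machines

// The embedding point counts as fresh only if it was written at (or after) the
// card's last update, within the epsilon. If this returns false, the server
// responds with a `stale-embedding` error and the client retries shortly after,
// instead of caching a similarity list computed from the old (possibly
// empty-card) embedding.
export const embeddingIsFresh = (
  cardUpdatedMillis: number,
  pointLastUpdatedMillis: number
): boolean => pointLastUpdatedMillis + LAST_UPDATED_EPSILON_MS >= cardUpdatedMillis;
```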

jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
A cheapo version of src/util:cardPlainContent.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
They'll only be actually read on the server.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
It mainly checks to see if we already have a card/embedding info stored and if so, stores that.

Not actually used yet.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
This is the first time exercising the current pipeline, and it reveals a few problems, like "invalid-card-id" somehow being in the document?

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Oct 29, 2023
.update will fail for a doc that doesn't exist.

Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
Part of #646.
jkomoros added a commit that referenced this issue Oct 29, 2023
Part of #646.

This makes them show up in VSCode while editing.
jkomoros added a commit that referenced this issue Oct 29, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
If provided, then error of code `stale-embedding` will be returned if the embedding is older than the
card.updated.

This will allow the client to determine that the embedding isn't yet updated, and try again. This will
happen, for example, right after a card is saved, before it is re-embedded and stored.

Part of #670. Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
The similarCards.last_updated flow requires cards to have their point updated every time the card.updated
timestamp updates. That updates more often than the embedding changes.

This makes it so the last_updated filter should work.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
…oint's last_updated.

This meant we erroneously thought every embedding was stale.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
…s if the embedding is stale.

This means that right after a card is updated, before the embedding has been updated, it will get the new
similarity soon after it's available, no more than DELAY_FOR_STALE milliseconds after it's available.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
If provided, then for collections that are a preview, it will render an alert icon with that text.

Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
…ards based on local similarity.

This makes it possible to detect if you're seeing the good or bad similarity in a collection.

Part of #646.
jkomoros added a commit that referenced this issue Nov 12, 2023
Before, there could be a case where a card was updated but no embedding would be updated, and you could
wait for a stale-embedding that would never come.

This is an edge case anyway.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 12, 2023
jkomoros added a commit that referenced this issue Nov 12, 2023
Technically there could be multiple cards to be similar to.

Part of #646. Part of #670.
jkomoros added a commit that referenced this issue Nov 13, 2023
jkomoros added a commit that referenced this issue Nov 14, 2023
jkomoros added a commit that referenced this issue Nov 14, 2023
There's a current bug where you create a card, and it gets a '' embedding (which returns very generic results). While editing, you see those generic results, even as you update.

In the future we'll be live fetching the actively edited card, but for now just fall back on tfidf pipeline.

Part of #646.
jkomoros added a commit that referenced this issue Nov 14, 2023
Before, it was 10 seconds, allowing a lot of slop between servers. But that broke a common case: create-a-card, and then within 10 seconds paste and save.

The client would then ask for the new similarCards, and it would still be based on the old empty content embedding, and the server would happily report it as current.

That would lead to bizarre similarity for that card (until another card was created/updated, when that cache would be blown away in the client and refetched and would actually be accurate).

Part of #646.
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
This signals "the value isn't current", like a missing entry in CardSimilarityMap would.

Part of #646.
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 16, 2023
jkomoros added a commit that referenced this issue Nov 17, 2023
If you pass a card object in via REST to similarCards, it is {seconds, nanoseconds} not a literal
timestamp.

Part of #646.
jkomoros added a commit that referenced this issue Nov 18, 2023
Before, the entire payload.content was in memory, but unnecessarily because in the common read path it's
not fetched.

Run `gulp configure-qdrant` to update configuration.

Part of #646.