Creating notebook to ingest CloudSQL database using kubernetes docs #751

Open
wants to merge 36 commits into base: main

Conversation

@german-grandas (Collaborator) commented Jul 26, 2024

The idea of this PR is to create a Jupyter notebook, similar to rag/example_notebooks/rag-kaggle-ray-sql-latest.ipynb, showing the steps required to answer questions using the Kubernetes docs instead of the Netflix reviews database.
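For orientation, here is a minimal sketch of the flow such a notebook would implement, assuming the Kubernetes docs are already downloaded to the shared /data/kubernetes-docs/ path mentioned later in this thread; the loader, splitter parameters, and embedding model are illustrative placeholders, not the notebook's actual choices:

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Kubernetes docs pre-downloaded to a shared mount (path taken from this PR).
docs = DirectoryLoader("/data/kubernetes-docs/", glob="**/*.md", loader_cls=TextLoader).load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)

# Placeholder model; the notebook's real embedding model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([s.page_content for s in splits])

# Each (id, content, embedding, metadata) row would then be inserted into the
# CloudSQL (pgvector) table that the RAG frontend queries.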

@german-grandas changed the title from "Creating notebook to ingest CloudSQL database using kubernetes" to "Creating notebook to ingest CloudSQL database using kubernetes docs" on Jul 26, 2024
@german-grandas enabled auto-merge (squash) on July 31, 2024 15:27
@imreddy13 (Collaborator)

/gcbrun

1 similar comment
@yiyinglovecoding (Collaborator)

/gcbrun

@gongmax (Collaborator) left a comment

Can you update the e2e test (i.e., test_rag) to use this notebook? After verifying this works, we also need to update the tutorial at https://github.com/GoogleCloudPlatform/ai-on-gke/blame/main/applications/rag/README.md to use it.

    - Updating test_rag.py so the test can validate answers from the Kubernetes documentation (a sketch of what such a check could look like is below).
    - Updating cloudbuild.yaml to ingest the database with the Kubernetes documentation.
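A hypothetical shape of that test_rag.py check, assuming the e2e test talks to the RAG frontend over HTTP; the endpoint path, payload keys, and expected keyword are placeholders rather than the repo's actual test contract:

import requests

def test_kubernetes_docs_answer():
    # Hypothetical endpoint and payload; the real test_rag.py wiring may differ.
    resp = requests.post("http://localhost:8080/prompt",
                         json={"prompt": "What is a Kubernetes Pod?"},
                         timeout=120)
    resp.raise_for_status()
    # The answer should now be grounded in the Kubernetes docs, not Netflix reviews.
    assert "pod" in resp.text.lower()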
@german-grandas (Collaborator, Author)

/gcbrun

17 similar comments

@gongmax (Collaborator) commented Oct 16, 2024

/gcbrun

@german-grandas (Collaborator, Author)

/gcbrun

5 similar comments

@gongmax (Collaborator) left a comment

Accessing through the frontend UI, I got:
1. Error: pg8000.exceptions.DatabaseError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42703', 'M': 'column "id" does not exist', 'P': '8', 'F': 'parse_relation.c', 'L': '3676', 'R': 'errorMissingColumn'}. This relates to the column-name consistency issue I commented on below.
2. Warning: Error: DBAPIError.__init__() missing 2 required positional arguments: 'params' and 'orig'.
Can you fix them and verify?

Besides, the whole Ray job took ~40 minutes to finish. Do you think it makes sense to follow the previous approach of using a Ray Dataset to process the data, which handles batching and distribution by itself, and to utilize GPUs?
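A rough sketch of that suggestion, assuming the chunk texts already exist as a Python list; the model name, batch size, and actor count are placeholders, and the concurrency/num_gpus arguments assume a recent Ray release:

import ray
from sentence_transformers import SentenceTransformer

class Embed:
    def __init__(self):
        # One model instance per actor, placed on a GPU worker.
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["content"]))
        return batch

chunk_texts = ["..."]  # chunks produced by the splitter earlier in the notebook
ds = ray.data.from_items([{"content": t} for t in chunk_texts])
ds = ds.map_batches(Embed, batch_size=100, num_gpus=1, concurrency=2)
rows = ds.take_all()  # Ray Data handles batching and distribution across workers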

"# the dataset has been pre-dowloaded to the GCS bucket as part of the notebook in the cell above. Ray workers will find the dataset readily mounted.\n",
"SHARED_DATASET_BASE_PATH = \"/data/kubernetes-docs/\"\n",
"\n",
"BATCH_SIZE = 100\n",
@gongmax (Collaborator) commented Oct 24, 2024

Where is the BATCH_SIZE being used?

@german-grandas (Collaborator, Author) replied

It is not required in this scenario; I removed it.

" entrypoint=\"python job.py\",\n",
" # Path to the local directory that contains the entrypoint file.\n",
" runtime_env={\n",
" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",
@gongmax (Collaborator) commented Oct 24, 2024

This seems to work, but can you help me understand why the working_dir looks like this, specifically the jovyan part?

@german-grandas (Collaborator, Author) replied

I wasn't sure, so I took the same approach as in

" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",

where the working dir is set up as /home/jovyan/rag-app. It has been updated so it can just be /home/rag-app.
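For context, jovyan is the default user in the Jupyter docker-stacks images, so /home/jovyan is simply the notebook server's home directory. A minimal sketch of the submission pattern under discussion, with the head-service address and dependency list as illustrative values only:

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-cluster-kuberay-head-svc:8265")  # address is illustrative
job_id = client.submit_job(
    entrypoint="python job.py",
    runtime_env={
        # Upload this local directory (the notebook's home) to the Ray workers.
        "working_dir": "/home/jovyan/rag-app",
        "pip": ["langchain", "sentence-transformers", "pg8000"],  # placeholder deps
    },
)
print(client.get_job_status(job_id))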

Collaborator

Regarding this, I think we prefer the interactive pattern that rag-kaggle-ray-sql-interactive.ipynb follows. It would be great if we can follow that pattern in this notebook.

applications/rag/README.md (conversation resolved)
"ray.shutdown()"
]
}
],
Collaborator

Similar to what we did before, can we add a cell to verify that the embeddings got created and stored in the database correctly?

@german-grandas (Collaborator, Author) replied

Added on line 194.
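A sketch of what such a verification cell could look like, assuming the notebook already has a SQLAlchemy engine (called engine here) connected to the CloudSQL instance; the table name is taken from the error messages quoted in this thread and may not match the final schema:

import sqlalchemy

# `engine` is assumed to be the SQLAlchemy engine created earlier in the notebook.
with engine.connect() as conn:
    count = conn.execute(sqlalchemy.text("SELECT COUNT(*) FROM rag_embeddings_db")).scalar()
    sample = conn.execute(
        sqlalchemy.text("SELECT content FROM rag_embeddings_db LIMIT 1")).fetchone()

print(f"rows in rag_embeddings_db: {count}")
print(sample)
assert count and count > 0, "no embeddings were written to the database"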

@german-grandas (Collaborator, Author)

/gcbrun

@german-grandas (Collaborator, Author)

/gcbrun

" \n",
" splits = splitter.split_documents(pages)\n",
"\n",
" chunks = []\n",
Collaborator

The variable name chunks is confusing here; it sounds more like the raw data chunks before embedding.

Comment on lines +177 to +180
" \"langchain_id\" : id,\n",
" \"content\" : page_content,\n",
" \"embedding\" : embedded_document,\n",
" \"langchain_metadata\" : file_metadata\n",
Collaborator

Can this even work? The keys of the split_data do not match the schema of TextEmbedding.
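To make the mismatch concrete: the dict keys used for the insert have to line up exactly with the target table's column names. A small sketch, with column names inferred from the runtime errors quoted in this thread rather than from the actual TextEmbedding definition:

def to_row(doc_id, page_content, embedded_document, file_metadata):
    # Keys must match the table columns exactly; "langchain_id" vs "id" is
    # precisely the kind of mismatch that produced the errors above.
    return {
        "id": doc_id,
        "content": page_content,
        "embedding": embedded_document,
        "metadata": file_metadata,
    }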

@gongmax (Collaborator) commented Oct 29, 2024

I see the error sqlalchemy.exc.DatabaseError: (pg8000.exceptions.DatabaseError) {'S': 'ERROR', 'V': 'ERROR', 'C': '23502', 'M': 'null value in column "id" of relation "rag_embeddings_db" violates not-null constraint', 'D': 'Failing row contains (null, null, null).', 's': 'public', 't': 'rag_embeddings_db', 'c': 'id', 'F': 'execMain.c', 'L': '1974', 'R': 'ExecConstraints'} when running the notebook. This may be related to my comment above about the DB schema mismatch. The same error also happened in CI, which means the e2e test here is not sufficient: even without populating the vector DB successfully, the e2e test still passes incorrectly.
