Creating notebook to ingest CloudSQL database using kubernetes docs #751
base: main
Conversation
/gcbrun
Can you update the e2e test (i.e. `test_rag`) to use this notebook? After verifying this works, we also need to update the tutorial https://github.com/GoogleCloudPlatform/ai-on-gke/blame/main/applications/rag/README.md to use it.
- Updating test_rag.py so the test can validate answers from the Kubernetes documentation.
- Updating cloudbuild.yaml to ingest the database with the Kubernetes documentation.
/gcbrun
…b.com/GoogleCloudPlatform/ai-on-gke into add/example_notebooks/kubernetes_docs
/gcbrun
… into add/example_notebooks/kubernetes_docs
/gcbrun
applications/rag/example_notebooks/rag-ray-ingest-with-kubernetes-docs.ipynb
… into add/example_notebooks/kubernetes_docs
/gcbrun
Accessing through the frontend UI, I got:
1. Error: `pg8000.exceptions.DatabaseError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42703', 'M': 'column "id" does not exist', 'P': '8', 'F': 'parse_relation.c', 'L': '3676', 'R': 'errorMissingColumn'}`. This is related to the column-name consistency issue I commented on below.
2. Warning: `Error: DBAPIError.__init__() missing 2 required positional arguments: 'params' and 'orig'`.
Can you fix them and verify?
Besides, the whole Ray job took ~40 minutes to finish. Do you think it makes sense to follow the previous approach of using a Ray Dataset to process the data, which handles batching and distribution by itself, and to utilize the GPU?
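For illustration, the batching the reviewer is describing can be sketched locally without Ray: Ray Data's `map_batches` hands each worker a fixed-size batch of records rather than one record at a time, which is what makes the embedding step parallel and GPU-friendly. This is a minimal stdlib stand-in; `embed_batch` is a hypothetical placeholder for the real GPU embedding call, not the notebook's actual code.

```python
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches, mirroring how Ray Data's map_batches
    hands each worker a batch rather than a single record."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a batched (GPU) embedding call; returns dummy vectors."""
    return [[float(len(text))] for text in batch]


docs = [f"chunk-{i}" for i in range(7)]
embeddings = [vec for batch in iter_batches(docs, 3) for vec in embed_batch(batch)]
```

With Ray Data, `iter_batches` plus the comprehension collapses into a single `ds.map_batches(embed_batch, ...)` call, and the scheduler distributes the batches across workers.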
"# the dataset has been pre-downloaded to the GCS bucket as part of the notebook in the cell above. Ray workers will find the dataset readily mounted.\n",
"SHARED_DATASET_BASE_PATH = \"/data/kubernetes-docs/\"\n",
"\n",
"BATCH_SIZE = 100\n",
Where is the `BATCH_SIZE` being used?
It is not required in this scenario; I removed it.
" entrypoint=\"python job.py\",\n",
" # Path to the local directory that contains the entrypoint file.\n",
" runtime_env={\n",
" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",
This seems to work, but can you help me understand why the `working_dir` looks like this, specifically the `jovyan` part?
I wasn't sure, so I took the same approach as this existing line:
" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",
`/home/jovyan/rag-app` has been updated, so it can just be `/home/rag-app`.
Regarding this, I think we prefer the interactive pattern that rag-kaggle-ray-sql-interactive.ipynb follows. It would be great if we could follow that pattern in this notebook.
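For context on the `jovyan` part: `jovyan` is the default user in the Jupyter Docker Stacks images, so `/home/jovyan` is simply the notebook container's home directory. One way to avoid hard-coding it is to derive the path at runtime; this is a hypothetical sketch (the `rag-app` directory name is taken from the snippet above, the rest is an assumption):

```python
from pathlib import Path

# Derive the working directory from the current user's home instead of
# hard-coding /home/jovyan, so the job submission keeps working if the
# notebook image (and hence the default user) changes.
working_dir = str(Path.home() / "rag-app")

runtime_env = {
    # Ray uploads this local directory to the workers so the entrypoint
    # script (e.g. job.py) can be found there.
    "working_dir": working_dir,
}
```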
"ray.shutdown()"
]
}
],
Similar to what we did before, can we add a cell to verify that the embeddings got created and stored in the database correctly?
Added on line 194.
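As a sketch of what such a verification cell can look like: the notebook targets CloudSQL (Postgres, via pg8000/SQLAlchemy), but the query pattern is the same. Here sqlite3 stands in so the sketch is runnable locally, and the table and column names (`text_embeddings`, `langchain_id`, `embedding`, ...) are assumptions taken from the snippets in this review, not the notebook's actual schema.

```python
import json
import sqlite3

# In-memory stand-in for the CloudSQL instance; the real notebook would run
# the same kind of queries through its pg8000/SQLAlchemy connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE text_embeddings ("
    "langchain_id TEXT, content TEXT, embedding TEXT, langchain_metadata TEXT)"
)
conn.execute(
    "INSERT INTO text_embeddings VALUES (?, ?, ?, ?)",
    ("doc-1", "What is a Pod?", json.dumps([0.1, 0.2, 0.3]), "{}"),
)

# Verification: the table is non-empty and a sampled embedding is a
# non-empty vector.
count = conn.execute("SELECT COUNT(*) FROM text_embeddings").fetchone()[0]
sample = json.loads(
    conn.execute("SELECT embedding FROM text_embeddings LIMIT 1").fetchone()[0]
)
print(count, len(sample))  # prints "1 3"
```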
… into add/example_notebooks/kubernetes_docs
/gcbrun
… into add/example_notebooks/kubernetes_docs
/gcbrun
" \n",
" splits = splitter.split_documents(pages)\n",
"\n",
" chunks = []\n",
The variable name `chunks` is confusing here; it sounds more like the raw data chunks before embedding.
" \"langchain_id\" : id,\n",
" \"content\" : page_content,\n",
" \"embedding\" : embedded_document,\n",
" \"langchain_metadata\" : file_metadata\n",
Can this even work? The keys of `split_data` do not match the schema of `TextEmbedding`. See error.
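A cheap guard against this class of bug (the `column "id" does not exist` error reported above) is to compare the payload keys against the table columns before inserting. This is a hypothetical stdlib sketch: the column set is taken from the snippet above, and `check_row_matches_schema` is an invented helper, not part of the notebook.

```python
# Columns of the TextEmbedding table, as quoted in the snippet above.
TEXT_EMBEDDING_COLUMNS = {"langchain_id", "content", "embedding", "langchain_metadata"}


def check_row_matches_schema(row: dict) -> None:
    """Raise early with a clear message instead of letting Postgres fail
    with 'column ... does not exist' at insert time."""
    missing = TEXT_EMBEDDING_COLUMNS - row.keys()
    extra = row.keys() - TEXT_EMBEDDING_COLUMNS
    if missing or extra:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )


# A row keyed by "id" instead of "langchain_id" is caught before the insert.
bad_row = {"id": 1, "content": "...", "embedding": [0.0], "langchain_metadata": {}}
try:
    check_row_matches_schema(bad_row)
    caught = False
except ValueError:
    caught = True
```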
The idea of this PR is to create a Jupyter notebook, like rag/example_notebooks/rag-kaggle-ray-sql-latest.ipynb, showing the required steps to answer questions using the Kubernetes docs instead of the Netflix reviews database.