Creating notebook to ingest CloudSQL database using kubernetes docs #751

Open
wants to merge 36 commits into base: main

Conversation

@german-grandas (Collaborator) commented Jul 26, 2024

The idea of this PR is to create a Jupyter notebook, similar to rag/example_notebooks/rag-kaggle-ray-sql-latest.ipynb, showing the steps required to answer questions using the Kubernetes docs instead of the Netflix reviews database.
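For orientation, here is a minimal sketch of the flow such a notebook would implement, assuming the Kubernetes docs are already downloaded to the shared /data/kubernetes-docs/ path mentioned later in this thread; the loader, splitter parameters, and embedding model are illustrative placeholders, not the notebook's actual choices:

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Kubernetes docs pre-downloaded to a shared mount (path taken from this PR).
docs = DirectoryLoader("/data/kubernetes-docs/", glob="**/*.md", loader_cls=TextLoader).load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)

# Placeholder model; the notebook's real embedding model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([s.page_content for s in splits])

# Each (id, content, embedding, metadata) row would then be inserted into the
# CloudSQL (pgvector) table that the RAG frontend queries.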

@german-grandas changed the title from "Creating notebook to ingest CloudSQL database using kubernetes" to "Creating notebook to ingest CloudSQL database using kubernetes docs" on Jul 26, 2024
@german-grandas enabled auto-merge (squash) on July 31, 2024 15:27
@imreddy13 (Collaborator)

/gcbrun

1 similar comment
@yiyinglovecoding (Collaborator)

/gcbrun

@gongmax (Collaborator) left a comment

Can you update the e2e test (i.e., test_rag) to use this notebook? After verifying this works, we also need to update the tutorial at https://github.com/GoogleCloudPlatform/ai-on-gke/blame/main/applications/rag/README.md to use it.

    - Updating test_rag.py so the test can validate answers from the Kubernetes documentation (a sketch of what such a check could look like is below).
    - Updating cloudbuild.yaml to ingest the database with the Kubernetes documentation.
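A hypothetical shape of that test_rag.py check, assuming the e2e test talks to the RAG frontend over HTTP; the endpoint path, payload keys, and expected keyword are placeholders rather than the repo's actual test contract:

import requests

def test_kubernetes_docs_answer():
    # Hypothetical endpoint and payload; the real test_rag.py wiring may differ.
    resp = requests.post("http://localhost:8080/prompt",
                         json={"prompt": "What is a Kubernetes Pod?"},
                         timeout=120)
    resp.raise_for_status()
    # The answer should now be grounded in the Kubernetes docs, not Netflix reviews.
    assert "pod" in resp.text.lower()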
@german-grandas (Collaborator, Author)

/gcbrun

17 similar comments

@gongmax (Collaborator) commented Oct 16, 2024

/gcbrun

@german-grandas (Collaborator, Author)

/gcbrun

5 similar comments

@gongmax (Collaborator) left a comment

Accessing through the frontend UI, I got:
1. Error: pg8000.exceptions.DatabaseError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42703', 'M': 'column "id" does not exist', 'P': '8', 'F': 'parse_relation.c', 'L': '3676', 'R': 'errorMissingColumn'}. This relates to the column-name consistency issue I commented on below.
2. Warning: Error: DBAPIError.__init__() missing 2 required positional arguments: 'params' and 'orig'.
Can you fix them and verify?

Besides, the whole Ray job took ~40 minutes to finish. Do you think it makes sense to follow the previous approach of using a Ray Dataset to process the data, which handles batching and distribution by itself, and to utilize GPUs?
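A rough sketch of that suggestion, assuming the chunk texts already exist as a Python list; the model name, batch size, and actor count are placeholders, and the concurrency/num_gpus arguments assume a recent Ray release:

import ray
from sentence_transformers import SentenceTransformer

class Embed:
    def __init__(self):
        # One model instance per actor, placed on a GPU worker.
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["content"]))
        return batch

chunk_texts = ["..."]  # chunks produced by the splitter earlier in the notebook
ds = ray.data.from_items([{"content": t} for t in chunk_texts])
ds = ds.map_batches(Embed, batch_size=100, num_gpus=1, concurrency=2)
rows = ds.take_all()  # Ray Data handles batching and distribution across workers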

"# the dataset has been pre-dowloaded to the GCS bucket as part of the notebook in the cell above. Ray workers will find the dataset readily mounted.\n",
"SHARED_DATASET_BASE_PATH = \"/data/kubernetes-docs/\"\n",
"\n",
"BATCH_SIZE = 100\n",
@gongmax (Collaborator) commented Oct 24, 2024

Where is the BATCH_SIZE being used?

@german-grandas (Collaborator, Author) replied

It is not required in this scenario; I removed it.

" entrypoint=\"python job.py\",\n",
" # Path to the local directory that contains the entrypoint file.\n",
" runtime_env={\n",
" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",
@gongmax (Collaborator) commented Oct 24, 2024

This seems to work, but can you help me understand why the working_dir looks like this, specifically the jovyan part?

@german-grandas (Collaborator, Author) replied

I wasn't sure, so I took the same approach as in

" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",

where the working dir is set up as /home/jovyan/rag-app. It has been updated so it can just be /home/rag-app.
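For context, jovyan is the default user in the Jupyter docker-stacks images, so /home/jovyan is simply the notebook server's home directory. A minimal sketch of the submission pattern under discussion, with the head-service address and dependency list as illustrative values only:

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-cluster-kuberay-head-svc:8265")  # address is illustrative
job_id = client.submit_job(
    entrypoint="python job.py",
    runtime_env={
        # Upload this local directory (the notebook's home) to the Ray workers.
        "working_dir": "/home/jovyan/rag-app",
        "pip": ["langchain", "sentence-transformers", "pg8000"],  # placeholder deps
    },
)
print(client.get_job_status(job_id))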

Collaborator

Regarding this, I think we prefer the interactive pattern that rag-kaggle-ray-sql-interactive.ipynb follows. It would be great if we can follow that pattern in this notebook.

applications/rag/README.md (conversation resolved)
"ray.shutdown()"
]
}
],
Collaborator

Similar to what we did before, can we add a cell to verify that the embeddings got created and stored in the database correctly?

@german-grandas (Collaborator, Author) replied

Added on line 194.
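A sketch of what such a verification cell could look like, assuming the notebook already has a SQLAlchemy engine (called engine here) connected to the CloudSQL instance; the table name is taken from the error messages quoted in this thread and may not match the final schema:

import sqlalchemy

# `engine` is assumed to be the SQLAlchemy engine created earlier in the notebook.
with engine.connect() as conn:
    count = conn.execute(sqlalchemy.text("SELECT COUNT(*) FROM rag_embeddings_db")).scalar()
    sample = conn.execute(
        sqlalchemy.text("SELECT content FROM rag_embeddings_db LIMIT 1")).fetchone()

print(f"rows in rag_embeddings_db: {count}")
print(sample)
assert count and count > 0, "no embeddings were written to the database"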

@german-grandas (Collaborator, Author)

/gcbrun

@german-grandas (Collaborator, Author)

/gcbrun

" \n",
" splits = splitter.split_documents(pages)\n",
"\n",
" chunks = []\n",
Collaborator

The variable name chunks is confusing here; it sounds more like the raw data chunks before embedding.

Comment on lines +177 to +180
" \"langchain_id\" : id,\n",
" \"content\" : page_content,\n",
" \"embedding\" : embedded_document,\n",
" \"langchain_metadata\" : file_metadata\n",
Collaborator

Can this even work? The keys of the split_data do not match the schema of TextEmbedding.
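To make the mismatch concrete: the dict keys used for the insert have to line up exactly with the target table's column names. A small sketch, with column names inferred from the runtime errors quoted in this thread rather than from the actual TextEmbedding definition:

def to_row(doc_id, page_content, embedded_document, file_metadata):
    # Keys must match the table columns exactly; "langchain_id" vs "id" is
    # precisely the kind of mismatch that produced the errors above.
    return {
        "id": doc_id,
        "content": page_content,
        "embedding": embedded_document,
        "metadata": file_metadata,
    }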

@gongmax (Collaborator) commented Oct 29, 2024

I see the error sqlalchemy.exc.DatabaseError: (pg8000.exceptions.DatabaseError) {'S': 'ERROR', 'V': 'ERROR', 'C': '23502', 'M': 'null value in column "id" of relation "rag_embeddings_db" violates not-null constraint', 'D': 'Failing row contains (null, null, null).', 's': 'public', 't': 'rag_embeddings_db', 'c': 'id', 'F': 'execMain.c', 'L': '1974', 'R': 'ExecConstraints'} when running the notebook. This may be related to my comment above about the DB schema mismatch. The same error also happened in CI, which means the e2e test here is not sufficient: even without populating the vector DB successfully, the e2e test still passes incorrectly.
