Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Your Scanned Documents Made Conversational Notebook #112

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions Your Scanned Documents, Made Conversational/environment.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
name: app_environment
channels:
- snowflake
dependencies:
- pillow=*
- snowflake=*
- tesserocr=*
Original file line number Diff line number Diff line change
@@ -0,0 +1,314 @@
{
"metadata": {
"kernelspec": {
"display_name": "Streamlit Notebook",
"name": "streamlit"
}
},
"nbformat_minor": 5,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"id": "a4e8daf0-929f-4249-b1ed-f8d53c07d062",
"metadata": {
"collapsed": false,
"name": "intro",
"resultHeight": 298
},
"source": "# Welcome to :snowflake: OCR and RAG with Snowflake Notebooks :notebook:\n\nTransform your images into searchable text and build a question-answering system using OCR and RAG capabilities in [Snowflake Notebooks](https://docs.snowflake.com/LIMITEDACCESS/snowsight-notebooks/ui-snowsight-notebooks-about)! ⚡️\n\nThis notebook demonstrates how to build an end-to-end application that:\n1. Performs OCR on images using Tesseract\n2. Stores and indexes the extracted text\n3. Creates a question-answering interface using Snowflake's Cortex capabilities and Streamlit\n\nTODO: Add quickstart link, youtube link\n"
},
{
"cell_type": "markdown",
"id": "26d4386b-6ae4-45e8-8cc7-cc9b3cd4b024",
"metadata": {
"name": "cell1",
"collapsed": false,
"resultHeight": 102
},
"source": "## Before you start 🛑\n\nUse the [quickstart](URL HERE) guide to prepare and load images into staging. "
},
{
"cell_type": "markdown",
"id": "2a3db564-ad2d-4c7b-8ffa-d50b14aad5fb",
"metadata": {
"collapsed": false,
"name": "setup_intro",
"resultHeight": 259
},
"source": [
"## Setting Up Your Environment 🎒\n",
"\n",
"First, we'll import the required packages and set up our Snowflake session. The notebook uses several key packages:\n",
"- `streamlit`: For creating the interactive web interface\n",
"- `tesserocr`: For optical character recognition\n",
"- `pandas`: For data manipulation\n",
"- `PIL`: For image processing\n",
"- Snowpark packages for interacting with Snowflake\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3775908f-ca36-4846-8f38-5adca39217f2",
"metadata": {
"collapsed": false,
"language": "python",
"name": "setup",
"resultHeight": 0
},
"outputs": [],
"source": "# Import python packages\nimport streamlit as st\nimport tesserocr\nimport io\nimport pandas as pd\nfrom PIL import Image\n\n# We can also use Snowpark for our analyses!\nfrom snowflake.snowpark.context import get_active_session\nfrom snowflake.snowpark.types import StringType, StructField, StructType, IntegerType\nfrom snowflake.snowpark.files import SnowflakeFile\nfrom snowflake.core import CreateMode\nfrom snowflake.core.table import Table, TableColumn\nfrom snowflake.core.schema import Schema\nfrom snowflake.core import Root\n\nsession = get_active_session()\nsession.use_schema(\"ocr_rag\")\nroot = Root(session)\ndatabase = root.databases[session.get_current_database()]"
},
{
"cell_type": "markdown",
"id": "5f5e73a2-035e-4357-a6cf-e31f004dfb43",
"metadata": {
"collapsed": false,
"name": "schema_intro",
"resultHeight": 201
},
"source": [
"## Creating the Schema and Table Structure 🏗️\n",
"\n",
"Next, we'll set up our database schema and table to store the processed documents. The `docs_chunks_table` will store:\n",
"- File paths and URLs for our images\n",
"- Extracted text chunks from OCR\n",
"- Vector embeddings for semantic search\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d68a37a7-4bcf-46fb-a060-8b9913550389",
"metadata": {
"codeCollapsed": false,
"collapsed": false,
"language": "python",
"name": "create_schema_and_table",
"resultHeight": 757
},
"outputs": [],
"source": "docs_chunks_table = Table(\n name=\"docs_chunks_table\",\n columns=[TableColumn(name=\"relative_path\", datatype=\"string\"),\n TableColumn(name=\"file_url\", datatype=\"string\"),\n TableColumn(name=\"scoped_file_url\", datatype=\"string\"),\n TableColumn(name=\"chunk\", datatype=\"string\"),\n TableColumn(name=\"chunk_vec\", datatype=\"vector(float,768)\")]\n)\ndatabase.schemas[\"ocr_rag\"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)"
},
{
"cell_type": "markdown",
"id": "ddeb07bd-2294-43b8-88ab-f2f563af2d26",
"metadata": {
"collapsed": false,
"name": "ocr_intro",
"resultHeight": 243
},
"source": [
"## OCR Processing with Tesseract 📸\n",
"\n",
"Now we'll create a User-Defined Table Function (UDTF) that:\n",
"1. Reads images from Snowflake storage\n",
"2. Processes them with Tesseract OCR\n",
"3. Returns the extracted text\n",
"\n",
"This function handles binary image data and can process multiple image formats.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e4a1347-836a-4530-8295-00a1250a2e88",
"metadata": {
"collapsed": false,
"language": "python",
"name": "create_udtf_for_ocr_on_image",
"resultHeight": 452
},
"outputs": [],
"source": "session.sql(\"DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)\").collect()\n\nclass ImageText:\n def process(self, file_url: str):\n with SnowflakeFile.open(file_url, 'rb') as f:\n buffer = io.BytesIO(f.readall())\n image = Image.open(buffer)\n text = tesserocr.image_to_text(image)\n yield (text,) # Return the full OCR text\n\noutput_schema = StructType([StructField(\"full_text\", StringType())])\n\nsession.udtf.register(\n ImageText,\n name=\"IMAGE_TEXT\",\n is_permanent=True,\n stage_location=\"@ocr_rag.images_to_ocr\",\n schema=\"ocr_rag\",\n output_schema=output_schema,\n packages=[\"tesserocr\", \"pillow\",\"snowflake-snowpark-python\"],\n replace=True\n)"
},
{
"cell_type": "markdown",
"id": "3825709e-58dd-443a-be6a-ea7fc5f83bd4",
"metadata": {
"collapsed": false,
"name": "processing_intro",
"resultHeight": 201
},
"source": [
"## Processing Images and Extracting Text 🔄\n",
"\n",
"Let's process our staged images through the OCR function. This query will:\n",
"1. Read all images from our stage\n",
"2. Run OCR on each image\n",
"3. Return the extracted text along with file information\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "533bc9e0-3ec1-4780-9165-539b788ff97e",
"metadata": {
"collapsed": false,
"language": "sql",
"name": "run_through_files_to_ocr",
"resultHeight": 439
},
"outputs": [],
"source": [
"SELECT \n",
" relative_path, \n",
" file_url, \n",
" build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,\n",
" ocr_result.full_text\n",
"FROM \n",
" directory(@ocr_rag.images_to_ocr),\n",
" TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;"
]
},
{
"cell_type": "markdown",
"id": "1ff1eaae-ac6b-4637-95ab-2c22ff4f6d50",
"metadata": {
"collapsed": false,
"name": "vectorization_intro",
"resultHeight": 201
},
"source": [
"## Text Processing and Vectorization 🔤\n",
"\n",
"Now we'll process the extracted text by:\n",
"1. Splitting it into manageable chunks\n",
"2. Creating vector embeddings using Snowflake Cortex\n",
"3. Storing the results for efficient retrieval\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a711ab9-e45f-413c-8e14-5ab239c21bf0",
"metadata": {
"collapsed": false,
"language": "sql",
"name": "split_text_and_create_vector",
"resultHeight": 112
},
"outputs": [],
"source": "INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)\nSELECT \n relative_path, \n file_url,\n scoped_file_url,\n chunk.value,\n SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value) AS chunk_vec\nFrom\n {{run_through_files_to_ocr}},\n LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text,'none', 4000, 400)) chunk;"
},
{
"cell_type": "markdown",
"id": "4ecf3aad-cbe2-46d4-a8c8-eb778b61b2cf",
"metadata": {
"collapsed": false,
"name": "qa_intro",
"resultHeight": 313
},
"source": [
"## Building the Question-Answering System 🤖\n",
"\n",
"Finally, we'll create our QA system that uses:\n",
"- Vector similarity search to find relevant context\n",
"- Mistral-7b model for generating answers\n",
"- Streamlit for the user interface\n",
"\n",
"Key parameters:\n",
"- `num_chunks`: Number of context chunks provided (default: 3)\n",
"- `model`: Language model used (default: \"mistral-7b\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "529d4b15-6235-4afe-9099-b2654495e3e9",
"metadata": {
"collapsed": false,
"language": "python",
"name": "chat_app",
"resultHeight": 130
},
"outputs": [],
"source": "num_chunks = 3 # Num-chunks provided as context. Play with this to check how it affects your accuracy\nmodel = \"mistral-7b\" # The model we decided to use\ndef create_prompt (myquestion):\n cmd = \"\"\"\n with results as\n (SELECT RELATIVE_PATH,\n VECTOR_COSINE_SIMILARITY(docs_chunks_schema.docs_chunks_table.chunk_vec,\n SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,\n chunk\n from docs_chunks_schema.docs_chunks_table\n order by similarity desc\n limit ?)\n select chunk, relative_path from results \n \"\"\"\n df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas() \n\n context_lenght = len(df_context) -1\n prompt_context = \"\"\n for i in range (0, context_lenght):\n prompt_context += df_context._get_value(i, 'CHUNK')\n prompt_context = prompt_context.replace(\"'\", \"\")\n relative_path = df_context._get_value(0,'RELATIVE_PATH')\n prompt = f\"\"\"\n 'You are an expert assistance extracting information from context provided. \n Answer the question based on the context. Be concise and do not hallucinate. \n If you don´t have the information just say so.\n Context: {prompt_context}\n Question: \n {myquestion} \n Answer: '\n \"\"\"\n cmd2 = f\"select GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) as URL_LINK from directory(@ocr_rag.images_to_ocr)\"\n df_url_link = session.sql(cmd2).to_pandas()\n url_link = df_url_link._get_value(0,'URL_LINK')\n\n return prompt, url_link, relative_path\ndef complete(myquestion, model_name):\n prompt, url_link, relative_path =create_prompt (myquestion)\n cmd = f\"\"\"\n select SNOWFLAKE.CORTEX.COMPLETE(?,?) as response\n \"\"\"\n\n df_response = session.sql(cmd, params=[model_name, prompt]).collect()\n return df_response, url_link, relative_path\ndef display_response (question, model):\n response, url_link, relative_path = complete(question, model)\n res_text = response[0].RESPONSE\n st.markdown(res_text)\n\n display_url = f\"Link to [{relative_path}]({url_link}) that may be useful\"\n st.markdown(display_url)\n#Main code\nst.title(\"Asking Questions to Your Scanned Documents with Snowflake Cortex:\")\ndocs_available = session.sql(\"ls @ocr_rag.images_to_ocr\").collect()\nquestion = st.text_input(\"Enter question\", placeholder=\"What are my documents about?\", label_visibility=\"collapsed\")\nif question:\n display_response (question, model)"
},
{
"cell_type": "markdown",
"id": "9a5a97ba-b9fa-4f0c-91f5-60bcb2d9b527",
"metadata": {
"collapsed": false,
"name": "tips_intro",
"resultHeight": 499
},
"source": [
"## Performance Tips 🚀\n",
"\n",
"To get the best results from your OCR and RAG system:\n",
"\n",
"1. Fine-tune your parameters:\n",
" - Adjust `num_chunks` based on your document length and complexity\n",
" - Experiment with different chunk sizes for optimal context\n",
" - Monitor response quality and adjust as needed\n",
"\n",
"2. Optimize image processing:\n",
" - Ensure good image quality for better OCR results\n",
" - Consider batch processing for large image sets\n",
" - Pre-process images if needed (contrast, resolution)\n",
"\n",
"3. Manage resources efficiently:\n",
" - Monitor memory usage with large document sets\n",
" - Use appropriate vector similarity thresholds\n",
" - Consider indexing strategies for large collections\n"
]
},
{
"cell_type": "markdown",
"id": "7339a252-5664-46c0-b329-a64d74fd104f",
"metadata": {
"collapsed": false,
"name": "security_intro",
"resultHeight": 370
},
"source": [
"## Security Considerations 🔒\n",
"\n",
"Important security aspects to keep in mind:\n",
"\n",
"1. Access Control:\n",
" - Ensure proper permissions on image storage\n",
" - Manage API access appropriately\n",
" - Monitor usage patterns and access logs\n",
"\n",
"2. Data Privacy:\n",
" - Consider sensitivity of extracted text\n",
" - Implement appropriate data retention policies\n",
" - Handle user queries securely\n"
]
},
{
"cell_type": "markdown",
"id": "ab4a99bb-b2d1-4600-9125-037e5bb1c558",
"metadata": {
"collapsed": false,
"name": "next_steps_intro",
"resultHeight": 540
},
"source": [
"## Going Further 🌟\n",
"\n",
"Consider extending the application with:\n",
"\n",
"1. Enhanced Features:\n",
" - Multiple language support\n",
" - Custom OCR pre-processing\n",
" - Additional document formats\n",
"\n",
"2. UI Improvements:\n",
" - Enhanced visualization of results\n",
" - Better context highlighting\n",
" - User feedback mechanisms\n",
"\n",
"3. Model Enhancements:\n",
" - Custom embedding models\n",
" - Fine-tuned language models\n",
" - Improved prompt engineering\n",
"\n",
"For more information on Snowflake's AI capabilities, visit the [Snowflake documentation](https://docs.snowflake.com/).\n"
]
}
]
}