Snowflake-Labs · jamescha-earley · Nov 19, 2024
diff --git a/Your Scanned Documents, Made Conversational/environment.yml b/Your Scanned Documents, Made Conversational/environment.yml
@@ -0,0 +1,7 @@
+name: app_environment
+channels:
+  - snowflake
+dependencies:
+  - pillow=*
+  - snowflake=*
+  - tesserocr=*
diff --git a/Your Scanned Documents, Made Conversational/your_scanned_documents_made_conversational.ipynb b/Your Scanned Documents, Made Conversational/your_scanned_documents_made_conversational.ipynb
@@ -0,0 +1,314 @@
+{
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Streamlit Notebook",
+   "name": "streamlit"
+  }
+ },
+ "nbformat_minor": 5,
+ "nbformat": 4,
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a4e8daf0-929f-4249-b1ed-f8d53c07d062",
+   "metadata": {
+    "collapsed": false,
+    "name": "intro",
+    "resultHeight": 298
+   },
+   "source": "# Welcome to :snowflake: OCR and RAG with Snowflake Notebooks :notebook:\n\nTransform your images into searchable text and build a question-answering system using OCR and RAG capabilities in [Snowflake Notebooks](https://docs.snowflake.com/LIMITEDACCESS/snowsight-notebooks/ui-snowsight-notebooks-about)! ⚡️\n\nThis notebook demonstrates how to build an end-to-end application that:\n1. Performs OCR on images using Tesseract\n2. Stores and indexes the extracted text\n3. Creates a question-answering interface using Snowflake's Cortex capabilities and Streamlit\n\nTODO: Add quickstart link, youtube link\n"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "26d4386b-6ae4-45e8-8cc7-cc9b3cd4b024",
+   "metadata": {
+    "name": "cell1",
+    "collapsed": false,
+    "resultHeight": 102
+   },
+   "source": "## Before you start 🛑\n\nUse the [quickstart](URL HERE) guide to prepare and load images into staging. "
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a3db564-ad2d-4c7b-8ffa-d50b14aad5fb",
+   "metadata": {
+    "collapsed": false,
+    "name": "setup_intro",
+    "resultHeight": 259
+   },
+   "source": [
+    "## Setting Up Your Environment 🎒\n",
+    "\n",
+    "First, we'll import the required packages and set up our Snowflake session. The notebook uses several key packages:\n",
+    "- `streamlit`: For creating the interactive web interface\n",
+    "- `tesserocr`: For optical character recognition\n",
+    "- `pandas`: For data manipulation\n",
+    "- `PIL`: For image processing\n",
+    "- Snowpark packages for interacting with Snowflake\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3775908f-ca36-4846-8f38-5adca39217f2",
+   "metadata": {
+    "collapsed": false,
+    "language": "python",
+    "name": "setup",
+    "resultHeight": 0
+   },
+   "outputs": [],
+   "source": "# Import python packages\nimport streamlit as st\nimport tesserocr\nimport io\nimport pandas as pd\nfrom PIL import Image\n\n# We can also use Snowpark for our analyses!\nfrom snowflake.snowpark.context import get_active_session\nfrom snowflake.snowpark.types import StringType, StructField, StructType, IntegerType\nfrom snowflake.snowpark.files import SnowflakeFile\nfrom snowflake.core import CreateMode\nfrom snowflake.core.table import Table, TableColumn\nfrom snowflake.core.schema import Schema\nfrom snowflake.core import Root\n\nsession = get_active_session()\nsession.use_schema(\"ocr_rag\")\nroot = Root(session)\ndatabase = root.databases[session.get_current_database()]"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f5e73a2-035e-4357-a6cf-e31f004dfb43",
+   "metadata": {
+    "collapsed": false,
+    "name": "schema_intro",
+    "resultHeight": 201
+   },
+   "source": [
+    "## Creating the Schema and Table Structure 🏗️\n",
+    "\n",
+    "Next, we'll set up our database schema and table to store the processed documents. The `docs_chunks_table` will store:\n",
+    "- File paths and URLs for our images\n",
+    "- Extracted text chunks from OCR\n",
+    "- Vector embeddings for semantic search\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d68a37a7-4bcf-46fb-a060-8b9913550389",
+   "metadata": {
+    "codeCollapsed": false,
+    "collapsed": false,
+    "language": "python",
+    "name": "create_schema_and_table",
+    "resultHeight": 757
+   },
+   "outputs": [],
+   "source": "docs_chunks_table = Table(\n    name=\"docs_chunks_table\",\n    columns=[TableColumn(name=\"relative_path\", datatype=\"string\"),\n            TableColumn(name=\"file_url\", datatype=\"string\"),\n            TableColumn(name=\"scoped_file_url\", datatype=\"string\"),\n            TableColumn(name=\"chunk\", datatype=\"string\"),\n            TableColumn(name=\"chunk_vec\", datatype=\"vector(float,768)\")]\n)\ndatabase.schemas[\"ocr_rag\"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ddeb07bd-2294-43b8-88ab-f2f563af2d26",
+   "metadata": {
+    "collapsed": false,
+    "name": "ocr_intro",
+    "resultHeight": 243
+   },
+   "source": [
+    "## OCR Processing with Tesseract 📸\n",
+    "\n",
+    "Now we'll create a User-Defined Table Function (UDTF) that:\n",
+    "1. Reads images from Snowflake storage\n",
+    "2. Processes them with Tesseract OCR\n",
+    "3. Returns the extracted text\n",
+    "\n",
+    "This function handles binary image data and can process multiple image formats.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e4a1347-836a-4530-8295-00a1250a2e88",
+   "metadata": {
+    "collapsed": false,
+    "language": "python",
+    "name": "create_udtf_for_ocr_on_image",
+    "resultHeight": 452
+   },
+   "outputs": [],
+   "source": "session.sql(\"DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)\").collect()\n\nclass ImageText:\n    def process(self, file_url: str):\n        with SnowflakeFile.open(file_url, 'rb') as f:\n            buffer = io.BytesIO(f.readall())\n        image = Image.open(buffer)\n        text = tesserocr.image_to_text(image)\n        yield (text,)  # Return the full OCR text\n\noutput_schema = StructType([StructField(\"full_text\", StringType())])\n\nsession.udtf.register(\n    ImageText,\n    name=\"IMAGE_TEXT\",\n    is_permanent=True,\n    stage_location=\"@ocr_rag.images_to_ocr\",\n    schema=\"ocr_rag\",\n    output_schema=output_schema,\n    packages=[\"tesserocr\", \"pillow\",\"snowflake-snowpark-python\"],\n    replace=True\n)"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3825709e-58dd-443a-be6a-ea7fc5f83bd4",
+   "metadata": {
+    "collapsed": false,
+    "name": "processing_intro",
+    "resultHeight": 201
+   },
+   "source": [
+    "## Processing Images and Extracting Text 🔄\n",
+    "\n",
+    "Let's process our staged images through the OCR function. This query will:\n",
+    "1. Read all images from our stage\n",
+    "2. Run OCR on each image\n",
+    "3. Return the extracted text along with file information\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "533bc9e0-3ec1-4780-9165-539b788ff97e",
+   "metadata": {
+    "collapsed": false,
+    "language": "sql",
+    "name": "run_through_files_to_ocr",
+    "resultHeight": 439
+   },
+   "outputs": [],
+   "source": [
+    "SELECT \n",
+    "    relative_path, \n",
+    "    file_url, \n",
+    "    build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,\n",
+    "    ocr_result.full_text\n",
+    "FROM \n",
+    "    directory(@ocr_rag.images_to_ocr),\n",
+    "    TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ff1eaae-ac6b-4637-95ab-2c22ff4f6d50",
+   "metadata": {
+    "collapsed": false,
+    "name": "vectorization_intro",
+    "resultHeight": 201
+   },
+   "source": [
+    "## Text Processing and Vectorization 🔤\n",
+    "\n",
+    "Now we'll process the extracted text by:\n",
+    "1. Splitting it into manageable chunks\n",
+    "2. Creating vector embeddings using Snowflake Cortex\n",
+    "3. Storing the results for efficient retrieval\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4a711ab9-e45f-413c-8e14-5ab239c21bf0",
+   "metadata": {
+    "collapsed": false,
+    "language": "sql",
+    "name": "split_text_and_create_vector",
+    "resultHeight": 112
+   },
+   "outputs": [],
+   "source": "INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)\nSELECT \n    relative_path, \n    file_url,\n    scoped_file_url,\n    chunk.value,\n    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value) AS chunk_vec\nFrom\n    {{run_through_files_to_ocr}},\n    LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text,'none', 4000, 400)) chunk;"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ecf3aad-cbe2-46d4-a8c8-eb778b61b2cf",
+   "metadata": {
+    "collapsed": false,
+    "name": "qa_intro",
+    "resultHeight": 313
+   },
+   "source": [
+    "## Building the Question-Answering System 🤖\n",
+    "\n",
+    "Finally, we'll create our QA system that uses:\n",
+    "- Vector similarity search to find relevant context\n",
+    "- Mistral-7b model for generating answers\n",
+    "- Streamlit for the user interface\n",
+    "\n",
+    "Key parameters:\n",
+    "- `num_chunks`: Number of context chunks provided (default: 3)\n",
+    "- `model`: Language model used (default: \"mistral-7b\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "529d4b15-6235-4afe-9099-b2654495e3e9",
+   "metadata": {
+    "collapsed": false,
+    "language": "python",
+    "name": "chat_app",
+    "resultHeight": 130
+   },
+   "outputs": [],
+   "source": "num_chunks = 3 # Num-chunks provided as context. Play with this to check how it affects your accuracy\nmodel = \"mistral-7b\" # The model we decided to use\ndef create_prompt (myquestion):\n    cmd = \"\"\"\n     with results as\n     (SELECT RELATIVE_PATH,\n       VECTOR_COSINE_SIMILARITY(docs_chunks_schema.docs_chunks_table.chunk_vec,\n                SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,\n       chunk\n     from docs_chunks_schema.docs_chunks_table\n     order by similarity desc\n     limit ?)\n     select chunk, relative_path from results \n     \"\"\"\n    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()      \n\n    context_lenght = len(df_context) -1\n    prompt_context = \"\"\n    for i in range (0, context_lenght):\n        prompt_context += df_context._get_value(i, 'CHUNK')\n    prompt_context = prompt_context.replace(\"'\", \"\")\n    relative_path =  df_context._get_value(0,'RELATIVE_PATH')\n    prompt = f\"\"\"\n      'You are an expert assistance extracting information from context provided. \n       Answer the question based on the context. Be concise and do not hallucinate. \n       If you don´t have the information just say so.\n      Context: {prompt_context}\n      Question:  \n       {myquestion} \n       Answer: '\n       \"\"\"\n    cmd2 = f\"select GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) as URL_LINK from directory(@ocr_rag.images_to_ocr)\"\n    df_url_link = session.sql(cmd2).to_pandas()\n    url_link = df_url_link._get_value(0,'URL_LINK')\n\n    return prompt, url_link, relative_path\ndef complete(myquestion, model_name):\n    prompt, url_link, relative_path =create_prompt (myquestion)\n    cmd = f\"\"\"\n             select SNOWFLAKE.CORTEX.COMPLETE(?,?) as response\n           \"\"\"\n\n    df_response = session.sql(cmd, params=[model_name, prompt]).collect()\n    return df_response, url_link, relative_path\ndef display_response (question, model):\n    response, url_link, relative_path = complete(question, model)\n    res_text = response[0].RESPONSE\n    st.markdown(res_text)\n\n    display_url = f\"Link to [{relative_path}]({url_link}) that may be useful\"\n    st.markdown(display_url)\n#Main code\nst.title(\"Asking Questions to Your Scanned Documents with Snowflake Cortex:\")\ndocs_available = session.sql(\"ls @ocr_rag.images_to_ocr\").collect()\nquestion = st.text_input(\"Enter question\", placeholder=\"What are my documents about?\", label_visibility=\"collapsed\")\nif question:\n    display_response (question, model)"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a5a97ba-b9fa-4f0c-91f5-60bcb2d9b527",
+   "metadata": {
+    "collapsed": false,
+    "name": "tips_intro",
+    "resultHeight": 499
+   },
+   "source": [
+    "## Performance Tips 🚀\n",
+    "\n",
+    "To get the best results from your OCR and RAG system:\n",
+    "\n",
+    "1. Fine-tune your parameters:\n",
+    "   - Adjust `num_chunks` based on your document length and complexity\n",
+    "   - Experiment with different chunk sizes for optimal context\n",
+    "   - Monitor response quality and adjust as needed\n",
+    "\n",
+    "2. Optimize image processing:\n",
+    "   - Ensure good image quality for better OCR results\n",
+    "   - Consider batch processing for large image sets\n",
+    "   - Pre-process images if needed (contrast, resolution)\n",
+    "\n",
+    "3. Manage resources efficiently:\n",
+    "   - Monitor memory usage with large document sets\n",
+    "   - Use appropriate vector similarity thresholds\n",
+    "   - Consider indexing strategies for large collections\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7339a252-5664-46c0-b329-a64d74fd104f",
+   "metadata": {
+    "collapsed": false,
+    "name": "security_intro",
+    "resultHeight": 370
+   },
+   "source": [
+    "## Security Considerations 🔒\n",
+    "\n",
+    "Important security aspects to keep in mind:\n",
+    "\n",
+    "1. Access Control:\n",
+    "   - Ensure proper permissions on image storage\n",
+    "   - Manage API access appropriately\n",
+    "   - Monitor usage patterns and access logs\n",
+    "\n",
+    "2. Data Privacy:\n",
+    "   - Consider sensitivity of extracted text\n",
+    "   - Implement appropriate data retention policies\n",
+    "   - Handle user queries securely\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ab4a99bb-b2d1-4600-9125-037e5bb1c558",
+   "metadata": {
+    "collapsed": false,
+    "name": "next_steps_intro",
+    "resultHeight": 540
+   },
+   "source": [
+    "## Going Further 🌟\n",
+    "\n",
+    "Consider extending the application with:\n",
+    "\n",
+    "1. Enhanced Features:\n",
+    "   - Multiple language support\n",
+    "   - Custom OCR pre-processing\n",
+    "   - Additional document formats\n",
+    "\n",
+    "2. UI Improvements:\n",
+    "   - Enhanced visualization of results\n",
+    "   - Better context highlighting\n",
+    "   - User feedback mechanisms\n",
+    "\n",
+    "3. Model Enhancements:\n",
+    "   - Custom embedding models\n",
+    "   - Fine-tuned language models\n",
+    "   - Improved prompt engineering\n",
+    "\n",
+    "For more information on Snowflake's AI capabilities, visit the [Snowflake documentation](https://docs.snowflake.com/).\n"
+   ]
+  }
+ ]
+}