griptape-ai · collindutter · Oct 4, 2024 · Aug 27, 2024 · Sep 5, 2024 · Aug 21, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `TavilyWebSearchDriver` to integrate Tavily's web search capabilities.
 - `ExaWebSearchDriver` to integrate Exa's web search capabilities.
 - `Workflow.outputs` to access the outputs of a Workflow.
+- `BaseFileLoader` for Loaders that load from a path.
+- `BaseLoader.fetch()` method for fetching data from a source.
+- `BaseLoader.parse()` method for parsing fetched data.
+- `BaseFileManager.encoding` to specify the encoding when loading and saving files.
+- `BaseWebScraperDriver.extract_page()` method for extracting data from an already scraped web page.
+- `TextLoaderRetrievalRagModule.chunker` for specifying the chunking strategy.
+- `file_utils.get_mime_type` utility for getting the MIME type of a file.
 
 ### Changed
 - **BREAKING**: Renamed parameters on several classes to `client`:
@@ -33,7 +40,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `model_client` on `GooglePromptDriver`.
   - `model_client` on `GoogleTokenizer`.
 - **BREAKING**: Renamed parameter `pipe` on `HuggingFacePipelinePromptDriver` to `pipeline`.
+- **BREAKING**: Update `pypdf` dependency to `^5.0.1`.
+- **BREAKING**: Update `redis` dependency to `^5.1.0`.
+- **BREAKING**: Removed `BaseFileManager.default_loader` and `BaseFileManager.loaders`.
+- **BREAKING**: Loaders no longer chunk data. Use a Chunker to chunk the data.
+- **BREAKING**: Removed `file_utils.load_file` and `file_utils.load_files`.
+- **BREAKING**: Removed `loaders-dataframe` and `loaders-audio` extras as they are no longer needed.
+- **BREAKING**: `TextLoader`, `PdfLoader`, `ImageLoader`, and `AudioLoader` now take a `str | PathLike` instead of `bytes`. Passing `bytes` is still supported but deprecated.
+- **BREAKING**: Removed `DataframeLoader`.
 - Several places where API clients are initialized are now lazy loaded.
+- `BaseVectorStoreDriver.upsert_text_artifacts` now returns a list or dictionary of upserted vector ids.
+- `LocalFileManagerDriver.workdir` is now optional.
+- `filetype` is now a core dependency.
+- `FileManagerTool` now uses `filetype` for more accurate file type detection.
+- `BaseFileLoader.load_file()` will now either return a `TextArtifact` or a `BlobArtifact` depending on whether `BaseFileManager.encoding` is set.
 - `Structure.output`'s type is now `BaseArtifact` and raises an exception if the output is `None`.
 - **BREAKING**: Update `pypdf` dependency to `^5.0.1`.
 - **BREAKING**: Update `redis` dependency to `^5.1.0`.
@@ -59,8 +79,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 - **BREAKING**: Removed `CsvRowArtifact`. Use `TextArtifact` instead.
+- **BREAKING**: Removed `DataframeLoader`.
 - **BREAKING**: Removed `MediaArtifact`, use `ImageArtifact` or `AudioArtifact` instead.
-- **BREAKING**: `CsvLoader`, `DataframeLoader`, and `SqlLoader` now return `list[TextArtifact]`.
+- **BREAKING**: `CsvLoader` and `SqlLoader` now return `ListArtifact[TextArtifact]`.
 - **BREAKING**: Removed `ImageArtifact.media_type`.
 - **BREAKING**: Removed `AudioArtifact.media_type`.
 - **BREAKING**: Removed `BlobArtifact.dir_name`.

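Note on the `BaseLoader.fetch()` and `BaseLoader.parse()` entries in the CHANGELOG hunk above: the split means a caller can either hand a Loader a source to read, or hand it bytes it already holds (as `talk_to_a_pdf_1.py` does with `PdfLoader().parse(response.content)`). The sketch below illustrates that shape only. `SketchTextLoader` is a hypothetical stand-in written with the standard library, not griptape's `BaseLoader`, and it assumes `load()` is simply `fetch()` followed by `parse()`.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


class SketchTextLoader:
    """Hypothetical stand-in for illustration; not griptape's BaseLoader."""

    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding

    def fetch(self, source: Union[str, os.PathLike]) -> bytes:
        # Step 1: read raw bytes from the source (here, a local path).
        return Path(source).read_bytes()

    def parse(self, data: bytes) -> str:
        # Step 2: turn already-fetched bytes into text.
        return data.decode(self.encoding)

    def load(self, source: Union[str, os.PathLike]) -> str:
        # Assumption: load() is fetch() followed by parse().
        return self.parse(self.fetch(source))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_bytes(b"hello loaders")

    loader = SketchTextLoader()
    from_path = loader.load(path)                 # path in, text out
    from_bytes = loader.parse(b"hello loaders")   # bytes already in memory
```

Both calls produce the same text, which is why code that previously passed `bytes` to `load()` can move to `parse()` without otherwise changing.
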
diff --git a/MIGRATION.md b/MIGRATION.md
@@ -153,6 +153,219 @@ print(image_artifact.meta["prompt"], image_artifact.meta["model"]) # Generate an
 ```
 
 
+## 0.31.X to 0.32.X
+
+### Removed `DataframeLoader`
+
+`DataframeLoader` has been removed. Use `CsvLoader.parse` or build `TextArtifact`s from the dataframe instead.
+
+#### Before
+
+```python
+DataframeLoader().load(df)
+```
+
+#### After
+```python
+# Convert the dataframe to csv bytes and parse it
+# (pandas >= 1.5 spells the keyword `lineterminator`; older releases used `line_terminator`)
+CsvLoader().parse(bytes(df.to_csv(lineterminator='\r\n', index=False), encoding='utf-8'))
+# Or build TextArtifacts from the dataframe
+[TextArtifact(row) for row in df.to_dict(orient="records")]
+```
+
+### `TextLoader`, `PdfLoader`, `ImageLoader`, and `AudioLoader` now take a `str | PathLike` instead of `bytes`
+
+#### Before
+```python
+PdfLoader().load(Path("attention.pdf").read_bytes())
+PdfLoader().load_collection([Path("attention.pdf").read_bytes(), Path("CoT.pdf").read_bytes()])
+```
+
+#### After
+```python
+PdfLoader().load("attention.pdf")
+PdfLoader().load_collection([Path("attention.pdf"), "CoT.pdf"])
+```
+
+### Removed `file_utils.load_file` and `file_utils.load_files`
+
+`griptape.utils.file_utils.load_file` and `griptape.utils.file_utils.load_files` have been removed.
+You can now pass the file path directly to the Loader.
+
+#### Before
+
+```python
+PdfLoader().load(load_file("attention.pdf"))
+PdfLoader().load_collection(list(load_files(["attention.pdf", "CoT.pdf"]).values()))
+```
+
+#### After
+```python
+PdfLoader().load("attention.pdf")
+PdfLoader().load_collection(["attention.pdf", "CoT.pdf"])
+```
+
+### Loaders no longer chunk data
+
+Loaders no longer chunk the data after loading it. If you need to chunk the data, use a [Chunker](https://docs.griptape.ai/stable/griptape-framework/data/chunkers/) after loading the data.
+
+#### Before
+
+```python
+chunks = PdfLoader().load("attention.pdf")
+vector_store.upsert_text_artifacts(
+    {
+        "griptape": chunks,
+    }
+)
+```
+
+#### After
+```python
+artifact = PdfLoader().load("attention.pdf")
+chunks = TextChunker().chunk(artifact)
+vector_store.upsert_text_artifacts(
+    {
+        "griptape": chunks,
+    }
+)
+```
+
+### Removed `MediaArtifact`
+
+`MediaArtifact` has been removed. Use `ImageArtifact` or `AudioArtifact` instead.
+
+#### Before
+
+```python
+image_media = MediaArtifact(
+    b"image_data",
+    media_type="image",
+    format="jpeg"
+)
+
+audio_media = MediaArtifact(
+    b"audio_data",
+    media_type="audio",
+    format="wav"
+)
+```
+
+#### After
+```python
+image_artifact = ImageArtifact(
+    b"image_data",
+    format="jpeg"
+)
+
+audio_artifact = AudioArtifact(
+    b"audio_data",
+    format="wav"
+)
+```
+
+### `ImageArtifact.format` is now required
+
+`ImageArtifact.format` is now a required parameter. Update any code that does not provide a `format` parameter.
+
+#### Before
+
+```python
+image_artifact = ImageArtifact(
+    b"image_data"
+)
+```
+
+#### After
+```python
+image_artifact = ImageArtifact(
+    b"image_data",
+    format="jpeg"
+)
+```
+
+### Removed `CsvRowArtifact`
+
+`CsvRowArtifact` has been removed. Use `TextArtifact` instead.
+
+#### Before
+
+```python
+artifact = CsvRowArtifact({"name": "John", "age": 30})
+print(artifact.value) # {"name": "John", "age": 30}
+print(type(artifact.value)) # <class 'dict'>
+```
+
+#### After
+```python
+artifact = TextArtifact("name: John\nage: 30")
+print(artifact.value) # name: John\nage: 30
+print(type(artifact.value)) # <class 'str'>
+```
+
+If you require storing a dictionary as an Artifact, you can use `GenericArtifact` instead.
+
+### `CsvLoader` and `SqlLoader` return types
+
+`CsvLoader` and `SqlLoader` now return a `ListArtifact[TextArtifact]` instead of `list[CsvRowArtifact]`. `DataframeLoader` has been removed.
+
+If you require a dictionary, set a custom `formatter_fn` and then parse the text to a dictionary.
+
+#### Before
+
+```python
+results = CsvLoader().load(Path("people.csv").read_text())
+
+print(results[0].value) # {"name": "John", "age": 30}
+print(type(results[0].value)) # <class 'dict'>
+```
+
+#### After
+```python
+results = CsvLoader().load(Path("people.csv").read_text())
+
+print(type(results)) # <class 'griptape.artifacts.ListArtifact'>
+print(results[0].value) # name: John\nage: 30
+print(type(results[0].value)) # <class 'str'>
+
+# Customize formatter_fn (requires `import json`)
+results = CsvLoader(formatter_fn=json.dumps).load(Path("people.csv").read_text())
+print(results[0].value) # {"name": "John", "age": 30}
+print(type(results[0].value)) # <class 'str'>
+
+dict_results = [json.loads(result.value) for result in results]
+print(dict_results[0]) # {"name": "John", "age": 30}
+print(type(dict_results[0])) # <class 'dict'>
+```
+
+### Moved `ImageArtifact.prompt` and `ImageArtifact.model` to `ImageArtifact.meta`
+
+`ImageArtifact.prompt` and `ImageArtifact.model` have been moved to `ImageArtifact.meta`.
+
+#### Before
+
+```python
+image_artifact = ImageArtifact(
+    b"image_data",
+    format="jpeg",
+    prompt="Generate an image of a cat",
+    model="DALL-E"
+)
+
+print(image_artifact.prompt, image_artifact.model) # Generate an image of a cat, DALL-E
+```
+
+#### After
+```python
+image_artifact = ImageArtifact(
+    b"image_data",
+    format="jpeg",
+    meta={"prompt": "Generate an image of a cat", "model": "DALL-E"}
+)
+
+print(image_artifact.meta["prompt"], image_artifact.meta["model"]) # Generate an image of a cat, DALL-E
+```
+
+
 ## 0.30.X to 0.31.X
 
 ### Exceptions Over `ErrorArtifact`s

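Note on the `formatter_fn` guidance in the MIGRATION.md hunk above: the round trip it describes (render each row as a string, then parse the string back into a dictionary) can be tried without the library. The sketch below uses only the standard library. `format_rows` is a hypothetical helper standing in for a loader with a `formatter_fn`, not griptape's `CsvLoader`.

```python
import csv
import io
import json
from typing import Callable


def format_rows(csv_text: str, formatter_fn: Callable[[dict], str]) -> list[str]:
    """Hypothetical helper: render each CSV row to a string with formatter_fn."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [formatter_fn(row) for row in reader]


csv_text = "name,age\nJohn,30\n"

# "key: value" rendering, one pair per line.
plain = format_rows(
    csv_text, lambda row: "\n".join(f"{key}: {value}" for key, value in row.items())
)

# JSON rendering, so each string can be parsed back into a dict.
as_json = format_rows(csv_text, json.dumps)
dict_rows = [json.loads(value) for value in as_json]
```

With `csv.DictReader`, every field is read as a string, so the recovered dictionary here holds `"30"` rather than the integer `30`. Convert types explicitly after `json.loads` if the downstream code needs numbers.
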
diff --git a/docs/examples/src/load_query_and_chat_marqo_1.py b/docs/examples/src/load_query_and_chat_marqo_1.py
@@ -1,6 +1,7 @@
 import os
 
 from griptape import utils
+from griptape.chunkers import TextChunker
 from griptape.drivers import MarqoVectorStoreDriver, OpenAiEmbeddingDriver
 from griptape.loaders import WebLoader
 from griptape.structures import Agent
@@ -25,11 +26,12 @@
 
 # Load artifacts from the web
 artifacts = WebLoader().load("https://www.griptape.ai")
+chunks = TextChunker().chunk(artifacts)
 
 # Upsert the artifacts into the vector store
 vector_store.upsert_text_artifacts(
     {
-        namespace: artifacts,
+        namespace: chunks,
     }
 )
 

diff --git a/docs/examples/src/query_webpage_1.py b/docs/examples/src/query_webpage_1.py
@@ -1,14 +1,15 @@
 import os
 
+from griptape.chunkers import TextChunker
 from griptape.drivers import LocalVectorStoreDriver, OpenAiEmbeddingDriver
 from griptape.loaders import WebLoader
 
 vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver(api_key=os.environ["OPENAI_API_KEY"]))
 
-artifacts = WebLoader(max_tokens=100).load("https://www.griptape.ai")
+artifacts = WebLoader().load("https://www.griptape.ai")
+chunks = TextChunker().chunk(artifacts)
 
-for a in artifacts:
-    vector_store.upsert_text_artifact(a, namespace="griptape")
+vector_store.upsert_text_artifacts({"griptape": chunks})
 
 results = vector_store.query("creativity", count=3, namespace="griptape")
 

diff --git a/docs/examples/src/query_webpage_astra_db_1.py b/docs/examples/src/query_webpage_astra_db_1.py
@@ -1,5 +1,6 @@
 import os
 
+from griptape.chunkers import TextChunker
 from griptape.drivers import (
     AstraDbVectorStoreDriver,
     OpenAiChatPromptDriver,
@@ -43,9 +44,9 @@
     ),
 )
 
-artifacts = WebLoader(max_tokens=256).load(input_blogpost)
-
-vector_store_driver.upsert_text_artifacts({namespace: artifacts})
+artifacts = WebLoader().load(input_blogpost)
+chunks = TextChunker().chunk(artifacts)
+vector_store_driver.upsert_text_artifacts({namespace: chunks})
 
 rag_tool = RagTool(
     description="A DataStax blog post",

diff --git a/docs/examples/src/talk_to_a_pdf_1.py b/docs/examples/src/talk_to_a_pdf_1.py
@@ -1,5 +1,6 @@
 import requests
 
+from griptape.chunkers import TextChunker
 from griptape.drivers import LocalVectorStoreDriver, OpenAiChatPromptDriver, OpenAiEmbeddingDriver
 from griptape.engines.rag import RagEngine
 from griptape.engines.rag.modules import PromptResponseRagModule, VectorStoreRetrievalRagModule
@@ -30,9 +31,10 @@
     rag_engine=engine,
 )
 
-artifacts = PdfLoader().load(response.content)
+artifacts = PdfLoader().parse(response.content)
+chunks = TextChunker().chunk(artifacts)
 
-vector_store.upsert_text_artifacts({namespace: artifacts})
+vector_store.upsert_text_artifacts({namespace: chunks})
 
 agent = Agent(tools=[rag_tool])
 

diff --git a/docs/examples/src/talk_to_a_webpage_1.py b/docs/examples/src/talk_to_a_webpage_1.py
@@ -1,3 +1,4 @@
+from griptape.chunkers import TextChunker
 from griptape.drivers import LocalVectorStoreDriver, OpenAiChatPromptDriver, OpenAiEmbeddingDriver
 from griptape.engines.rag import RagEngine
 from griptape.engines.rag.modules import PromptResponseRagModule, VectorStoreRetrievalRagModule
@@ -26,8 +27,9 @@
 )
 
 artifacts = WebLoader().load("https://en.wikipedia.org/wiki/Physics")
+chunks = TextChunker().chunk(artifacts)
 
-vector_store_driver.upsert_text_artifacts({namespace: artifacts})
+vector_store_driver.upsert_text_artifacts({namespace: chunks})
 
 rag_tool = RagTool(
     description="Contains information about physics. " "Use it to answer any physics-related questions.",

diff --git a/docs/griptape-framework/data/chunkers.md b/docs/griptape-framework/data/chunkers.md
@@ -18,3 +18,7 @@ Here is how to use a chunker:
 ```python
 --8<-- "docs/griptape-framework/data/src/chunkers_1.py"
 ```
+
+The most common use of a Chunker is to split long text into smaller chunks before inserting them into a vector database for Retrieval Augmented Generation (RAG).
+
+See [RagEngine](../../griptape-framework/engines/rag-engines.md) for more information on how to use Chunkers in RAG pipelines.
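
Note on the chunkers.md addition above: for readers unfamiliar with what "splitting a long text into smaller chunks" involves, here is a minimal sketch of the idea. `chunk_text` is a hypothetical function, not griptape's `TextChunker`. It assumes a character budget instead of a token budget and treats blank lines as paragraph boundaries, purely to keep the example short.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Hypothetical chunker: pack paragraphs into chunks of at most max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split any paragraph that cannot fit in a single chunk.
        pieces = [
            paragraph[start:start + max_chars]
            for start in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                # The piece still fits, so keep growing the current chunk.
                current = candidate
            else:
                # The current chunk is full: emit it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc"
chunks = chunk_text(sample, max_chars=10)
```

Each resulting chunk would then be embedded and upserted separately, which is the step the updated examples in this PR perform with `vector_store.upsert_text_artifacts`.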