Refactor/loaders #1116

Merged
merged 55 commits into from
Oct 4, 2024
55 commits
661bb40
Refactor Artifacts
collindutter Aug 27, 2024
8ad5b36
Artifacts are just data
collindutter Sep 5, 2024
58a8233
Improve ListArtifact
collindutter Aug 21, 2024
ccb37f2
Bring back CsvRowArtifact
collindutter Sep 6, 2024
1da1428
Fix changelog
collindutter Sep 6, 2024
a5bc8f8
Move converters into class
collindutter Sep 9, 2024
98e3aa7
Make image format required
collindutter Sep 9, 2024
0eb93c3
Raise error instead
collindutter Sep 9, 2024
69fac0b
Generate embeddings before clearing
collindutter Sep 9, 2024
dca6c78
Refactor converters
collindutter Sep 9, 2024
50dfb46
Move converter to lambda
collindutter Sep 10, 2024
48dc409
Use artifact encoding for images
collindutter Sep 10, 2024
ad80101
Appease code cov
collindutter Sep 10, 2024
beff800
Fix bad merge
collindutter Sep 11, 2024
9bb0930
Address not
collindutter Sep 11, 2024
c76e5c8
Revert image to_text as base64, move base64 to parent
collindutter Sep 11, 2024
27609b3
Fix changelog
collindutter Sep 11, 2024
f00a35d
Remove CsvRowArtifact for final time
collindutter Sep 11, 2024
83744cf
Fix feedback
collindutter Sep 13, 2024
de6e1f3
Refactor Artifacts
collindutter Aug 27, 2024
b562da0
Artifacts are just data
collindutter Sep 5, 2024
2cba4ed
Improve ListArtifact
collindutter Aug 21, 2024
72b6133
Bring back CsvRowArtifact
collindutter Sep 6, 2024
8a415a2
Artifacts are just data
collindutter Sep 5, 2024
ad19c5a
Refactor Artifacts
collindutter Aug 27, 2024
06d40a7
Standardize loaders interface
collindutter Aug 27, 2024
e5aa986
Use generics for loaders, remove instances of path
collindutter Sep 5, 2024
2931915
Start changelog
collindutter Sep 5, 2024
ebcfff4
Fix bad merge
collindutter Sep 5, 2024
8947425
Remove table artifact
collindutter Sep 5, 2024
05e4231
More changelog
collindutter Sep 5, 2024
c93f512
Type signature cleanup
collindutter Sep 5, 2024
b6a1055
Use new list artifact
collindutter Sep 5, 2024
c67ee56
More changelog
collindutter Sep 5, 2024
cd6e9af
Add migration
collindutter Sep 5, 2024
9a97695
Fix bad merge
collindutter Sep 6, 2024
3284e53
Fix bad merge
collindutter Sep 6, 2024
55ccab6
Fixed bad changelog
collindutter Sep 6, 2024
55a021a
Update docs
collindutter Sep 6, 2024
f78d29d
Fix doc links
collindutter Sep 10, 2024
eda407b
Fix bad merge
collindutter Sep 11, 2024
f2d7bd7
Fix bad merge
collindutter Sep 12, 2024
c14508b
Merge branch 'dev' into refactor/loaders
collindutter Sep 13, 2024
d940a98
Fix PR comments
collindutter Sep 16, 2024
e4af29f
Merge branch 'dev' into refactor/loaders
collindutter Sep 23, 2024
81983c2
Merge branch 'dev' into refactor/loaders
collindutter Sep 23, 2024
64760f2
Fix changelog
collindutter Sep 25, 2024
97c497b
Support deprecated loading
collindutter Sep 25, 2024
2763a38
Clean up Vector Store Driver examples
collindutter Sep 27, 2024
0bcf576
Regenerate lock file
collindutter Oct 1, 2024
27e4262
Fix changelog
collindutter Oct 3, 2024
f99d5cd
Update chunkers docs
collindutter Oct 3, 2024
0418e7c
Update doc link order
collindutter Oct 3, 2024
23090b6
Fix changelog
collindutter Oct 4, 2024
4674099
Regenerate lock filee
collindutter Oct 4, 2024
23 changes: 22 additions & 1 deletion CHANGELOG.md
@@ -12,6 +12,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `TavilyWebSearchDriver` to integrate Tavily's web search capabilities.
- `ExaWebSearchDriver` to integrate Exa's web search capabilities.
- `Workflow.outputs` to access the outputs of a Workflow.
- `BaseFileLoader` for Loaders that load from a path.
- `BaseLoader.fetch()` method for fetching data from a source.
- `BaseLoader.parse()` method for parsing fetched data.
- `BaseFileManager.encoding` to specify the encoding when loading and saving files.
- `BaseWebScraperDriver.extract_page()` method for extracting data from an already scraped web page.
- `TextLoaderRetrievalRagModule.chunker` for specifying the chunking strategy.
- `file_utils.get_mime_type` utility for getting the MIME type of a file.
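
The `fetch()` / `parse()` split listed above separates retrieving raw bytes from turning them into a result. A minimal sketch of that idea follows. It uses stand-in classes, not Griptape's real ones: `SketchLoader`, `SketchTextLoader`, and the plain `str` return type are assumptions made for illustration only.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Generic, TypeVar

S = TypeVar("S")  # source type, e.g. a file path or a URL
A = TypeVar("A")  # parsed result type


class SketchLoader(ABC, Generic[S, A]):
    """Hypothetical stand-in for a loader with separate fetch and parse steps."""

    @abstractmethod
    def fetch(self, source: S) -> bytes:
        """Retrieve raw bytes from the source."""

    @abstractmethod
    def parse(self, data: bytes) -> A:
        """Turn already-fetched bytes into a result."""

    def load(self, source: S) -> A:
        # load() is simply fetch() followed by parse()
        return self.parse(self.fetch(source))


class SketchTextLoader(SketchLoader[Path, str]):
    """Hypothetical file-based loader that decodes bytes to text."""

    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding

    def fetch(self, source: Path) -> bytes:
        return Path(source).read_bytes()

    def parse(self, data: bytes) -> str:
        return data.decode(self.encoding)
```

Because `parse()` only needs bytes, data that is already in memory (for example an HTTP response body) can skip `fetch()` entirely, which is what the updated `talk_to_a_pdf_1.py` example does with `PdfLoader().parse(response.content)`.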

### Changed
- **BREAKING**: Renamed parameters on several classes to `client`:
@@ -33,7 +40,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `model_client` on `GooglePromptDriver`.
- `model_client` on `GoogleTokenizer`.
- **BREAKING**: Renamed parameter `pipe` on `HuggingFacePipelinePromptDriver` to `pipeline`.
- **BREAKING**: Update `pypdf` dependency to `^5.0.1`.
- **BREAKING**: Update `redis` dependency to `^5.1.0`.
- **BREAKING**: Removed `BaseFileManager.default_loader` and `BaseFileManager.loaders`.
- **BREAKING**: Loaders no longer chunk data, use a Chunker to chunk the data.
- **BREAKING**: Removed `file_utils.load_file` and `file_utils.load_files`.
- **BREAKING**: Removed `loaders-dataframe` and `loaders-audio` extras as they are no longer needed.
- **BREAKING**: `TextLoader`, `PdfLoader`, `ImageLoader`, and `AudioLoader` now take a `str | PathLike` instead of `bytes`. Passing `bytes` is still supported but deprecated.
- **BREAKING**: Removed `DataframeLoader`.
- Several places where API clients are initialized are now lazy loaded.
- `BaseVectorStoreDriver.upsert_text_artifacts` now returns a list or dictionary of upserted vector ids.
- `LocalFileManagerDriver.workdir` is now optional.
- `filetype` is now a core dependency.
- `FileManagerTool` now uses `filetype` for more accurate file type detection.
- `BaseFileLoader.load_file()` will now either return a `TextArtifact` or a `BlobArtifact` depending on whether `BaseFileManager.encoding` is set.
- `Structure.output`'s type is now `BaseArtifact` and raises an exception if the output is `None`.
- **BREAKING**: Update `pypdf` dependency to `^5.0.1`.
- **BREAKING**: Update `redis` dependency to `^5.1.0`.
@@ -59,8 +79,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed
- **BREAKING**: Removed `CsvRowArtifact`. Use `TextArtifact` instead.
- **BREAKING**: Removed `DataframeLoader`.
- **BREAKING**: Removed `MediaArtifact`, use `ImageArtifact` or `AudioArtifact` instead.
- **BREAKING**: `CsvLoader`, `DataframeLoader`, and `SqlLoader` now return `list[TextArtifact]`.
- **BREAKING**: `CsvLoader` and `SqlLoader` now return `ListArtifact[TextArtifact]`.
- **BREAKING**: Removed `ImageArtifact.media_type`.
- **BREAKING**: Removed `AudioArtifact.media_type`.
- **BREAKING**: Removed `BlobArtifact.dir_name`.
213 changes: 213 additions & 0 deletions MIGRATION.md
@@ -153,6 +153,219 @@ print(image_artifact.meta["prompt"], image_artifact.meta["model"]) # Generate an
```


## 0.31.X to 0.32.X

### Removed `DataframeLoader`

`DataframeLoader` has been removed. Use `CsvLoader.parse` or build `TextArtifact`s from the dataframe instead.

#### Before

```python
DataframeLoader().load(df)
```

#### After
```python
# Convert the dataframe to CSV bytes and parse it
# (pandas 2.x spells the keyword `lineterminator`; `line_terminator` was removed)
CsvLoader().parse(bytes(df.to_csv(lineterminator='\r\n', index=False), encoding='utf-8'))
# Or build TextArtifacts from the dataframe rows
[TextArtifact(row) for row in df.to_dict(orient="records")]
```

### `TextLoader`, `PdfLoader`, `ImageLoader`, and `AudioLoader` now take a `str | PathLike` instead of `bytes`

#### Before
```python
PdfLoader().load(Path("attention.pdf").read_bytes())
PdfLoader().load_collection([Path("attention.pdf").read_bytes(), Path("CoT.pdf").read_bytes()])
```

#### After
```python
PdfLoader().load("attention.pdf")
PdfLoader().load_collection([Path("attention.pdf"), "CoT.pdf"])
```
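
The changelog notes that passing `bytes` still works but is deprecated. One way such a transition can be handled is sketched below. `read_source` is a hypothetical helper written for illustration; it is not Griptape's implementation, and the warning text is an assumption.

```python
from __future__ import annotations

import warnings
from os import PathLike
from pathlib import Path


def read_source(source: str | PathLike | bytes) -> bytes:
    """Hypothetical helper: accept a path (preferred) or raw bytes (deprecated)."""
    if isinstance(source, bytes):
        # Old call style: the caller already read the file themselves
        warnings.warn(
            "Passing bytes is deprecated; pass a file path instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return source
    # New call style: the loader reads the file from the given path
    return Path(source).read_bytes()
```

Keeping the old call style working behind a `DeprecationWarning` lets existing code run unchanged while signalling that it should move to paths.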

### Removed `file_utils.load_file` and `file_utils.load_files`

`griptape.utils.file_utils.load_file` and `griptape.utils.file_utils.load_files` have been removed.
You can now pass the file path directly to the Loader.

#### Before

```python
PdfLoader().load(load_file("attention.pdf"))  # load_file already returned bytes
PdfLoader().load_collection(list(load_files(["attention.pdf", "CoT.pdf"]).values()))
```

#### After
```python
PdfLoader().load("attention.pdf")
PdfLoader().load_collection(["attention.pdf", "CoT.pdf"])
```

### Loaders no longer chunk data

Loaders no longer chunk the data after loading it. If you need to chunk the data, use a [Chunker](https://docs.griptape.ai/stable/griptape-framework/data/chunkers/) after loading the data.

#### Before

```python
chunks = PdfLoader().load("attention.pdf")
vector_store.upsert_text_artifacts(
    {
        "griptape": chunks,
    }
)
```

#### After
```python
artifact = PdfLoader().load("attention.pdf")
chunks = TextChunker().chunk(artifact)
vector_store.upsert_text_artifacts(
    {
        "griptape": chunks,
    }
)
```
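
To make the separation concrete, here is a rough sketch of what a chunking step does once it is decoupled from loading. `chunk_text` is a hypothetical stand-in, not Griptape's `TextChunker`: it measures size in characters and splits on blank lines, both of which are assumptions chosen to keep the example dependency-free.

```python
from __future__ import annotations


def chunk_text(text: str, max_chars: int = 200, separator: str = "\n\n") -> list[str]:
    """Hypothetical chunker: pack separator-delimited pieces into chunks of at most max_chars.

    A single piece longer than max_chars is hard-split by character count.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= max_chars:
            # The piece still fits into the chunk being built
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split pieces that are too long on their own
        while len(piece) > max_chars:
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        current = piece
    if current:
        chunks.append(current)
    return chunks
```

With loading and chunking separated, the same loaded text can be chunked with different sizes or strategies without loading it again.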

### Removed `MediaArtifact`

`MediaArtifact` has been removed. Use `ImageArtifact` or `AudioArtifact` instead.

#### Before

```python
image_media = MediaArtifact(
    b"image_data",
    media_type="image",
    format="jpeg"
)

audio_media = MediaArtifact(
    b"audio_data",
    media_type="audio",
    format="wav"
)
```

#### After
```python
image_artifact = ImageArtifact(
    b"image_data",
    format="jpeg"
)

audio_artifact = AudioArtifact(
    b"audio_data",
    format="wav"
)
```

### `ImageArtifact.format` is now required

`ImageArtifact.format` is now a required parameter. Update any code that does not provide a `format` parameter.

#### Before

```python
image_artifact = ImageArtifact(
    b"image_data"
)
```

#### After
```python
image_artifact = ImageArtifact(
    b"image_data",
    format="jpeg"
)
```

### Removed `CsvRowArtifact`

`CsvRowArtifact` has been removed. Use `TextArtifact` instead.

#### Before

```python
artifact = CsvRowArtifact({"name": "John", "age": 30})
print(artifact.value) # {"name": "John", "age": 30}
print(type(artifact.value)) # <class 'dict'>
```

#### After
```python
artifact = TextArtifact("name: John\nage: 30")
print(artifact.value) # name: John\nage: 30
print(type(artifact.value)) # <class 'str'>
```

If you require storing a dictionary as an Artifact, you can use `GenericArtifact` instead.

### `CsvLoader` and `SqlLoader` return types

`CsvLoader` and `SqlLoader` now return a `ListArtifact[TextArtifact]` instead of `list[CsvRowArtifact]`. `DataframeLoader` has been removed.

If you require a dictionary, set a custom `formatter_fn` and then parse the text to a dictionary.

#### Before

```python
results = CsvLoader().load(Path("people.csv").read_text())

print(results[0].value) # {"name": "John", "age": 30}
print(type(results[0].value)) # <class 'dict'>
```

#### After
```python
results = CsvLoader().load(Path("people.csv").read_text())

print(type(results)) # <class 'griptape.artifacts.ListArtifact'>
print(results[0].value) # name: John\nage: 30
print(type(results[0].value)) # <class 'str'>

# Customize formatter_fn
results = CsvLoader(formatter_fn=lambda x: json.dumps(x)).load(Path("people.csv").read_text())
print(results[0].value) # {"name": "John", "age": 30}
print(type(results[0].value)) # <class 'str'>

dict_results = [json.loads(result.value) for result in results]
print(dict_results[0]) # {"name": "John", "age": 30}
print(type(dict_results[0])) # <class 'dict'>
```
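
The formatter idea can be tried without Griptape installed. The sketch below uses only the standard library; `default_formatter` and `rows_to_text` are hypothetical names that mimic the behavior described above and are not part of the Griptape API.

```python
from __future__ import annotations

import csv
import io
import json
from typing import Callable


def default_formatter(row: dict) -> str:
    # One "key: value" pair per line
    return "\n".join(f"{key}: {value}" for key, value in row.items())


def rows_to_text(csv_text: str, formatter_fn: Callable[[dict], str] = default_formatter) -> list[str]:
    """Hypothetical stand-in for a CSV loader: one formatted string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [formatter_fn(row) for row in reader]


csv_text = "name,age\nJohn,30\n"

# Default formatting: plain text, one key/value pair per line
print(rows_to_text(csv_text)[0])

# JSON formatting: each row can be parsed back into a dictionary
as_json = rows_to_text(csv_text, formatter_fn=json.dumps)
print(json.loads(as_json[0]))
```

Note that `csv.DictReader` yields every value as a string, so the dictionary recovered here holds `"30"` rather than the integer `30`; convert types explicitly if they matter.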

### Moved `ImageArtifact.prompt` and `ImageArtifact.model` to `ImageArtifact.meta`

`ImageArtifact.prompt` and `ImageArtifact.model` have been moved to `ImageArtifact.meta`.

#### Before

```python
image_artifact = ImageArtifact(
    b"image_data",
    format="jpeg",
    prompt="Generate an image of a cat",
    model="DALL-E"
)

print(image_artifact.prompt, image_artifact.model) # Generate an image of a cat DALL-E
```

#### After
```python
image_artifact = ImageArtifact(
    b"image_data",
    format="jpeg",
    meta={"prompt": "Generate an image of a cat", "model": "DALL-E"}
)

print(image_artifact.meta["prompt"], image_artifact.meta["model"]) # Generate an image of a cat DALL-E
```
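
If many call sites build their keyword arguments dynamically, a small adapter can ease the move to `meta`. The helper below is a hypothetical sketch, not something Griptape ships: `to_meta_kwargs` and its merge behavior are assumptions for illustration.

```python
from __future__ import annotations


def to_meta_kwargs(old_kwargs: dict) -> dict:
    """Hypothetical helper: move `prompt` and `model` keyword arguments under `meta`."""
    moved_keys = ("prompt", "model")
    # Keep every argument that is not being moved
    new_kwargs = {k: v for k, v in old_kwargs.items() if k not in moved_keys}
    # Start from any existing meta so it is merged rather than overwritten
    meta = dict(old_kwargs.get("meta") or {})
    for key in moved_keys:
        if key in old_kwargs:
            meta[key] = old_kwargs[key]
    if meta:
        new_kwargs["meta"] = meta
    return new_kwargs
```

The result could then be passed on as `ImageArtifact(b"image_data", **to_meta_kwargs(old_kwargs))`.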


## 0.30.X to 0.31.X

### Exceptions Over `ErrorArtifact`s
4 changes: 3 additions & 1 deletion docs/examples/src/load_query_and_chat_marqo_1.py
@@ -1,6 +1,7 @@
import os

from griptape import utils
from griptape.chunkers import TextChunker
from griptape.drivers import MarqoVectorStoreDriver, OpenAiEmbeddingDriver
from griptape.loaders import WebLoader
from griptape.structures import Agent
@@ -25,11 +26,12 @@

# Load artifacts from the web
artifacts = WebLoader().load("https://www.griptape.ai")
chunks = TextChunker().chunk(artifacts)

# Upsert the artifacts into the vector store
vector_store.upsert_text_artifacts(
{
namespace: artifacts,
namespace: chunks,
}
)

7 changes: 4 additions & 3 deletions docs/examples/src/query_webpage_1.py
@@ -1,14 +1,15 @@
import os

from griptape.chunkers import TextChunker
from griptape.drivers import LocalVectorStoreDriver, OpenAiEmbeddingDriver
from griptape.loaders import WebLoader

vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver(api_key=os.environ["OPENAI_API_KEY"]))

artifacts = WebLoader(max_tokens=100).load("https://www.griptape.ai")
artifacts = WebLoader().load("https://www.griptape.ai")
chunks = TextChunker().chunk(artifacts)

for a in artifacts:
vector_store.upsert_text_artifact(a, namespace="griptape")
vector_store.upsert_text_artifacts({"griptape": chunks})

results = vector_store.query("creativity", count=3, namespace="griptape")

7 changes: 4 additions & 3 deletions docs/examples/src/query_webpage_astra_db_1.py
@@ -1,5 +1,6 @@
import os

from griptape.chunkers import TextChunker
from griptape.drivers import (
AstraDbVectorStoreDriver,
OpenAiChatPromptDriver,
@@ -43,9 +44,9 @@
),
)

artifacts = WebLoader(max_tokens=256).load(input_blogpost)

vector_store_driver.upsert_text_artifacts({namespace: artifacts})
artifacts = WebLoader().load(input_blogpost)
chunks = TextChunker().chunk(artifacts)
vector_store_driver.upsert_text_artifacts({namespace: chunks})

rag_tool = RagTool(
description="A DataStax blog post",
6 changes: 4 additions & 2 deletions docs/examples/src/talk_to_a_pdf_1.py
@@ -1,5 +1,6 @@
import requests

from griptape.chunkers import TextChunker
from griptape.drivers import LocalVectorStoreDriver, OpenAiChatPromptDriver, OpenAiEmbeddingDriver
from griptape.engines.rag import RagEngine
from griptape.engines.rag.modules import PromptResponseRagModule, VectorStoreRetrievalRagModule
@@ -30,9 +31,10 @@
rag_engine=engine,
)

artifacts = PdfLoader().load(response.content)
artifacts = PdfLoader().parse(response.content)
chunks = TextChunker().chunk(artifacts)

vector_store.upsert_text_artifacts({namespace: artifacts})
vector_store.upsert_text_artifacts({namespace: chunks})

agent = Agent(tools=[rag_tool])

4 changes: 3 additions & 1 deletion docs/examples/src/talk_to_a_webpage_1.py
@@ -1,3 +1,4 @@
from griptape.chunkers import TextChunker
from griptape.drivers import LocalVectorStoreDriver, OpenAiChatPromptDriver, OpenAiEmbeddingDriver
from griptape.engines.rag import RagEngine
from griptape.engines.rag.modules import PromptResponseRagModule, VectorStoreRetrievalRagModule
@@ -26,8 +27,9 @@
)

artifacts = WebLoader().load("https://en.wikipedia.org/wiki/Physics")
chunks = TextChunker().chunk(artifacts)

vector_store_driver.upsert_text_artifacts({namespace: artifacts})
vector_store_driver.upsert_text_artifacts({namespace: chunks})

rag_tool = RagTool(
description="Contains information about physics. " "Use it to answer any physics-related questions.",
4 changes: 4 additions & 0 deletions docs/griptape-framework/data/chunkers.md
@@ -18,3 +18,7 @@ Here is how to use a chunker:
```python
--8<-- "docs/griptape-framework/data/src/chunkers_1.py"
```

The most common use of a Chunker is to split up a long text into smaller chunks for inserting into a Vector Database when doing Retrieval Augmented Generation (RAG).

See [RagEngine](../../griptape-framework/engines/rag-engines.md) for more information on how to use Chunkers in RAG pipelines.