Refactor/loaders #1116

Merged
merged 55 commits into from
Oct 4, 2024

collindutter
Member

@collindutter collindutter commented Aug 28, 2024

Describe your changes

Added

  • BaseFileLoader for Loaders that load from a path.
  • BaseLoader.fetch() method for fetching data from a source.
  • BaseLoader.parse() method for parsing fetched data.
  • BaseFileManager.encoding to specify the encoding when loading and saving files.
  • BaseWebScraperDriver.extract_page() method for extracting data from an already scraped web page.
  • TextLoaderRetrievalRagModule.chunker for specifying the chunking strategy.
  • file_utils.get_mime_type utility for getting the MIME type of a file.
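As a rough illustration of the fetch/parse split listed above, here is a minimal sketch built from stand-in classes, not Griptape's actual implementation. The names `SketchLoader` and `SketchTextFileLoader` are hypothetical, and the idea that `load()` simply chains `fetch()` and `parse()` is an assumption made for the example.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from os import PathLike
from pathlib import Path
from typing import Any


class SketchLoader(ABC):
    """Hypothetical stand-in for a loader with separate fetch and parse steps."""

    @abstractmethod
    def fetch(self, source: Any) -> Any:
        """Retrieve raw data from the source (file, URL, query, ...)."""

    @abstractmethod
    def parse(self, data: Any) -> Any:
        """Turn fetched data into a structured result."""

    def load(self, source: Any) -> Any:
        # Assumption: loading is fetching followed by parsing.
        return self.parse(self.fetch(source))


class SketchTextFileLoader(SketchLoader):
    """Hypothetical file loader that takes a path rather than bytes."""

    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding

    def fetch(self, source: str | PathLike) -> bytes:
        # Fetch step: read the raw bytes from disk.
        return Path(source).read_bytes()

    def parse(self, data: bytes) -> str:
        # Parse step: decode the bytes into text.
        return data.decode(self.encoding)


def demo() -> str:
    # Write a temporary file, then load it by path.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "notes.txt"
        path.write_text("hello loaders", encoding="utf-8")
        return SketchTextFileLoader().load(path)
```

Keeping the two steps separate means already-fetched data can be handed to `parse()` directly, without touching the source again.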

Changed

  • BREAKING: Removed BaseFileManager.default_loader and BaseFileManager.loaders.
  • BREAKING: Loaders no longer chunk data. Use a Chunker to chunk the data.
  • BREAKING: Removed file_utils.load_file and file_utils.load_files.
  • BREAKING: Removed loaders-dataframe and loaders-audio extras as they are no longer needed.
  • BREAKING: TextLoader, PdfLoader, ImageLoader, and AudioLoader now take a str | PathLike instead of bytes.
  • BREAKING: Removed DataframeLoader.
  • LocalFileManagerDriver.workdir is now optional.
  • filetype is now a core dependency.
  • FileManagerTool now uses filetype for more accurate file type detection.
  • BaseFileLoader.load_file() will now either return a TextArtifact or a BlobArtifact depending on whether BaseFileManager.encoding is set.
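The encoding rule in the last bullet can be pictured with a small sketch. The dataclasses `SketchTextArtifact` and `SketchBlobArtifact` and the helper `to_artifact` are hypothetical stand-ins, not Griptape's real artifact classes. The sketch assumes that an unset encoding is represented by `None`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SketchTextArtifact:
    """Hypothetical stand-in for a text artifact."""

    value: str
    encoding: str


@dataclass
class SketchBlobArtifact:
    """Hypothetical stand-in for a binary artifact."""

    value: bytes


def to_artifact(
    data: bytes, encoding: Optional[str]
) -> Union[SketchTextArtifact, SketchBlobArtifact]:
    # Assumption: no encoding means the bytes are returned untouched.
    if encoding is None:
        return SketchBlobArtifact(data)
    # With an encoding, the bytes are decoded into text.
    return SketchTextArtifact(data.decode(encoding), encoding)
```

Under these assumptions, a caller that wants raw bytes (an image, say) leaves the encoding unset, while a caller that wants text sets it explicitly.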

The purpose of this PR is to clean up the Loader interface and define the purpose of Loaders.

Loaders fetch data from a source and parse it into Artifacts. Loaders do not chunk data; that is the role of Chunkers.
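The division of labor described above might look like the following sketch. Both functions are hypothetical stand-ins. The chunker here uses plain fixed-size character splitting, which is a deliberately simple substitute for illustration and not how Griptape's Chunkers work.

```python
from __future__ import annotations

from typing import List


def load_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Loader step (hypothetical): parse fetched bytes into one text value."""
    return raw.decode(encoding)


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Chunker step (hypothetical): split text into pieces of at most max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


# The loader returns the whole document; chunking is a separate, optional step.
text = load_text(b"abcdefghij")
chunks = chunk_text(text, max_chars=4)
```

Because the loader no longer decides how to split its output, the same loaded text can be passed to different chunking strategies.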

We provide four top-level Loaders that load from a variety of sources:

  • BaseFileLoader: Loads data from a file.
  • WebLoader: Loads data from a web page.
  • SqlLoader: Loads data from a SQL database.
  • EmailLoader: Loads data from an email inbox.

BaseFileLoader then has subclasses that provide file-type-specific parsing logic:

  • TextLoader
  • ImageLoader
  • AudioLoader
  • CsvLoader
  • BlobLoader

In a future PR, I'd like this parsing logic to live in a new class of Driver, File Parser Driver(?), instead of Loaders.
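To make the subclass arrangement described above more concrete, here is a minimal sketch in which a shared base fetches bytes from a path and each subclass supplies only its own parsing. The classes `SketchFileLoader`, `SketchBlobLoader`, and `SketchCsvLoader` are hypothetical and only loosely mirror the loaders named in the list. The real loaders return Artifacts, which are omitted here.

```python
from __future__ import annotations

import csv
import io
import tempfile
from os import PathLike
from pathlib import Path
from typing import Any, Dict, List


class SketchFileLoader:
    """Hypothetical base: every file loader fetches bytes from a path."""

    def fetch(self, source: str | PathLike) -> bytes:
        return Path(source).read_bytes()

    def parse(self, data: bytes) -> Any:
        raise NotImplementedError

    def load(self, source: str | PathLike) -> Any:
        return self.parse(self.fetch(source))


class SketchBlobLoader(SketchFileLoader):
    def parse(self, data: bytes) -> bytes:
        # Binary data is passed through unchanged.
        return data


class SketchCsvLoader(SketchFileLoader):
    def parse(self, data: bytes) -> List[Dict[str, str]]:
        # Decode the bytes, then read one dict per CSV row.
        reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
        return [dict(row) for row in reader]


def demo_rows() -> List[Dict[str, str]]:
    # Write a small CSV file, then load it by path.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "people.csv"
        path.write_text("name,age\nAda,36\n", encoding="utf-8")
        return SketchCsvLoader().load(path)
```

In this arrangement only `parse()` varies by file type, which is the piece the note above suggests could later move into a separate driver.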

Issue ticket number and link

Closes #1102


📚 Documentation preview 📚: https://griptape--1116.org.readthedocs.build//1116/

@collindutter collindutter force-pushed the refactor/loaders branch 2 times, most recently from a568189 to ded5cd9 Compare September 4, 2024 17:05
@collindutter collindutter force-pushed the refactor/artifacts branch 11 times, most recently from 96b9752 to 9669ba8 Compare September 4, 2024 23:01
@collindutter collindutter force-pushed the refactor/loaders branch 3 times, most recently from 019ebbd to 62b6e17 Compare September 5, 2024 15:45
@collindutter collindutter force-pushed the refactor/artifacts branch 2 times, most recently from 57fa0f7 to 31ba036 Compare September 5, 2024 20:55
@collindutter collindutter force-pushed the refactor/loaders branch 4 times, most recently from 13da4e8 to 506fe6a Compare October 1, 2024 17:05
@collindutter collindutter force-pushed the refactor/loaders branch 6 times, most recently from d9c8c86 to 3a0e653 Compare October 3, 2024 17:55
dylanholmes previously approved these changes Oct 3, 2024
@collindutter collindutter merged commit 9cd4199 into dev Oct 4, 2024
15 checks passed
@collindutter collindutter deleted the refactor/loaders branch October 4, 2024 21:11
Development

Successfully merging this pull request may close these issues.

BaseTextLoader.tokenizer is using OpenAiTokenizer.DEFAULT_OPENAI_GPT_3_CHAT_MODEL as a default