#91 document > item renaming #103

Merged
merged 4 commits into from
Oct 14, 2024
2 changes: 1 addition & 1 deletion docs/concepts/operators.md
@@ -4,7 +4,7 @@ Operators in DocETL are designed for semantically processing unstructured data.

## Overview

- Datasets contain documents, where a document is an object in the JSON list, with fields and values.
- Datasets contain items, where an item is an object in the JSON list, with fields and values. An item could be a simple text chunk or a document reference.
- DocETL provides several operators, each tailored for specific unstructured data processing tasks.
- By default, operations are parallelized on your data using multithreading for improved performance.
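
To make the bullets above concrete, here is a minimal sketch of a dataset as a JSON list of items, with a per-item function applied across threads. It is an illustration only, not DocETL's implementation: the `id` and `text` fields and the `word_count` helper are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# A dataset is a JSON list; each object in the list is one item.
raw = '[{"id": 1, "text": "first chunk"}, {"id": 2, "text": "second chunk"}]'
items = json.loads(raw)


def word_count(item: dict) -> dict:
    # Hypothetical per-item operation standing in for an operator.
    return {**item, "n_words": len(item["text"].split())}


# Threads process items concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(word_count, items))
```

Each result keeps the original fields and gains one new field, which mirrors how an item flows through a per-item operation.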

4 changes: 2 additions & 2 deletions docs/concepts/pipelines.md
@@ -1,6 +1,6 @@
# Pipelines

Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks.
Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex chunk processing tasks.

## Components of a Pipeline

@@ -21,7 +21,7 @@ default_model: gpt-4o-mini

### Datasets

Datasets define the input data for your pipeline. They are collections of documents, where each document is an object in a JSON list (or row in a CSV file). Datasets are typically specified in the YAML configuration file, indicating the type and path of the data source. For example:
Datasets define the input data for your pipeline. They are collections of items/chunks, where each item/chunk is an object in a JSON list (or row in a CSV file). Datasets are typically specified in the YAML configuration file, indicating the type and path of the data source. For example:

```yaml
datasets:
4 changes: 2 additions & 2 deletions docs/examples/custom-parsing.md
@@ -283,7 +283,7 @@ While DocETL provides several built-in parsing tools, the community can always b
If the built-in tools don't meet your needs, you can create your own custom parsing tools. Here's how:

1. Define your parsing function in the `parsing_tools` section of your configuration.
2. Ensure your function takes a document (dict) as input and returns a list of documents (dicts).
2. Ensure your function takes an item (dict) as input and returns a list of items (dicts).
3. Use your custom parser in the `parsing` section of your dataset configuration.

For example:
@@ -292,7 +292,7 @@ For example:
parsing_tools:
  - name: my_custom_parser
    function_code: |
      def my_custom_parser(document: Dict) -> List[Dict]:
      def my_custom_parser(item: Dict) -> List[Dict]:
          # Your custom parsing logic here
          return [processed_data]
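
As a rough sketch of what the body of such a parser might look like under the renamed signature, the function below splits one item into one item per paragraph. The `text` field and the added `paragraph_index` field are assumptions made for illustration; the docs do not prescribe them.

```python
from typing import Dict, List


def my_custom_parser(item: Dict) -> List[Dict]:
    # Assumes the item carries its content in a "text" field (hypothetical).
    # Split on blank lines and drop empty paragraphs.
    paragraphs = [p.strip() for p in item.get("text", "").split("\n\n") if p.strip()]
    # Copy the remaining fields onto every output item so metadata is preserved.
    return [
        {**item, "text": p, "paragraph_index": i}
        for i, p in enumerate(paragraphs)
    ]
```

The contract from step 2 holds: one item (dict) goes in and a list of items (dicts) comes out, which may be empty when there is nothing to parse.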

4 changes: 2 additions & 2 deletions docs/examples/mining-product-reviews.md
@@ -16,7 +16,7 @@ Our goal is to create a pipeline that will:
2. Resolve similar themes across different games
3. Generate reports of polarizing themes common across games, supported by quotes from different game reviews

We'll be using a subset of the [STEAM review dataset](https://www.kaggle.com/datasets/andrewmvd/steam-reviews). We've created a subset that contains reviews for 500 of the most popular games, with approximately 400 reviews per game, balanced between positive and negative ratings. For each game, we concatenate all reviews into a single text for analysis---so we'll have 500 input documents, each representing a game. You can get the dataset sample [here](https://drive.google.com/file/d/1hroljsvn8m23iVsNpET8Ma7sfb1OUu_u/view?usp=drive_link).
We'll be using a subset of the [STEAM review dataset](https://www.kaggle.com/datasets/andrewmvd/steam-reviews). We've created a subset that contains reviews for 500 of the most popular games, with approximately 400 reviews per game, balanced between positive and negative ratings. For each game, we concatenate all reviews into a single text for analysis---so we'll have 500 input items/reviews, each representing a game. You can get the dataset sample [here](https://drive.google.com/file/d/1hroljsvn8m23iVsNpET8Ma7sfb1OUu_u/view?usp=drive_link).

## Pipeline Structure

@@ -284,7 +284,7 @@ This command, with `optimize: true` set for the map and resolve operations, prov

2. Blocking statements and thresholds for the resolve operation: This optimizes the theme resolution process, making it more efficient when dealing with a large number of themes across multiple games. The optimizer provided us with blocking keys of `summary` and `theme`, and a threshold of 0.596 for similarity (to get 95% recall of duplicates).
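
The idea behind a similarity threshold for blocking can be sketched as follows: only pairs whose similarity clears the threshold are kept as candidates for the more expensive comparison. This is a simplified illustration, not the optimizer's code. It assumes cosine similarity over embedding vectors, and the toy two-dimensional vectors and the `candidate_pairs` helper are made up.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float = 0.596
) -> List[Tuple[str, str]]:
    # Keep only the pairs similar enough to be worth a full comparison.
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy vectors (hypothetical): two near-identical themes and one unrelated theme.
toy = {"combat": [1.0, 0.0], "fighting": [0.9, 0.1], "soundtrack": [0.0, 1.0]}
pairs = candidate_pairs(toy)
```

With these toy vectors only the `combat` and `fighting` pair survives, so one comparison is made instead of three. A lower threshold admits more pairs (higher recall, more comparisons); a higher one admits fewer.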

These optimizations are crucial for handling the scale of our dataset, which includes 500 games with an _average_ of 66,000 tokens per game, and 12% of the documents exceeding the context length limits of the OpenAI LLMs (128k tokens).
These optimizations are crucial for handling the scale of our dataset, which includes 500 games with an _average_ of 66,000 tokens per game, and 12% of the items/reviews exceeding the context length limits of the OpenAI LLMs (128k tokens).

??? info "Optimized Pipeline"
