diff --git a/docs/concepts/operators.md b/docs/concepts/operators.md
index fc9070ef..617e4499 100644
--- a/docs/concepts/operators.md
+++ b/docs/concepts/operators.md
@@ -4,7 +4,7 @@ Operators in DocETL are designed for semantically processing unstructured data.
 
 ## Overview
 
-- Datasets contain documents, where a document is an object in the JSON list, with fields and values.
+- Datasets contain items, where an item is an object in the JSON list, with fields and values. An item here could be a simple text chunk or a document reference.
 - DocETL provides several operators, each tailored for specific unstructured data processing tasks.
 - By default, operations are parallelized on your data using multithreading for improved performance.
 
diff --git a/docs/concepts/pipelines.md b/docs/concepts/pipelines.md
index ed0e4cae..bc1d1b3b 100644
--- a/docs/concepts/pipelines.md
+++ b/docs/concepts/pipelines.md
@@ -1,6 +1,6 @@
 # Pipelines
 
-Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks.
+Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex chunk processing tasks.
 
 ## Components of a Pipeline
 
@@ -21,7 +21,7 @@ default_model: gpt-4o-mini
 
 ### Datasets
 
-Datasets define the input data for your pipeline. They are collections of documents, where each document is an object in a JSON list (or row in a CSV file). Datasets are typically specified in the YAML configuration file, indicating the type and path of the data source. For example:
+Datasets define the input data for your pipeline. They are collections of items/chunks, where each item/chunk is an object in a JSON list (or row in a CSV file). Datasets are typically specified in the YAML configuration file, indicating the type and path of the data source. For example:
 
 ```yaml
 datasets:
diff --git a/docs/examples/custom-parsing.md b/docs/examples/custom-parsing.md
index a8959f97..9b4d1b3a 100644
--- a/docs/examples/custom-parsing.md
+++ b/docs/examples/custom-parsing.md
@@ -283,7 +283,7 @@ While DocETL provides several built-in parsing tools, the community can always b
 If the built-in tools don't meet your needs, you can create your own custom parsing tools. Here's how:
 
 1. Define your parsing function in the `parsing_tools` section of your configuration.
-2. Ensure your function takes a document (dict) as input and returns a list of documents (dicts).
+2. Ensure your function takes an item (dict) as input and returns a list of items (dicts).
 3. Use your custom parser in the `parsing` section of your dataset configuration.
 
 For example:
@@ -292,7 +292,7 @@ For example:
 parsing_tools:
   - name: my_custom_parser
     function_code: |
-      def my_custom_parser(document: Dict) -> List[Dict]:
+      def my_custom_parser(item: Dict) -> List[Dict]:
           # Your custom parsing logic here
           return [processed_data]
 
diff --git a/docs/examples/mining-product-reviews.md b/docs/examples/mining-product-reviews.md
index 9e120d85..c6edf3ec 100644
--- a/docs/examples/mining-product-reviews.md
+++ b/docs/examples/mining-product-reviews.md
@@ -16,7 +16,7 @@ Our goal is to create a pipeline that will:
 2. Resolve similar themes across different games
 3. Generate reports of polarizing themes common across games, supported by quotes from different game reviews
 
-We'll be using a subset of the [STEAM review dataset](https://www.kaggle.com/datasets/andrewmvd/steam-reviews). We've created a subset that contains reviews for 500 of the most popular games, with approximately 400 reviews per game, balanced between positive and negative ratings. For each game, we concatenate all reviews into a single text for analysis---so we'll have 500 input documents, each representing a game. You can get the dataset sample [here](https://drive.google.com/file/d/1hroljsvn8m23iVsNpET8Ma7sfb1OUu_u/view?usp=drive_link).
+We'll be using a subset of the [STEAM review dataset](https://www.kaggle.com/datasets/andrewmvd/steam-reviews). We've created a subset that contains reviews for 500 of the most popular games, with approximately 400 reviews per game, balanced between positive and negative ratings. For each game, we concatenate all reviews into a single text for analysis---so we'll have 500 input items, each representing a game. You can get the dataset sample [here](https://drive.google.com/file/d/1hroljsvn8m23iVsNpET8Ma7sfb1OUu_u/view?usp=drive_link).
 
 ## Pipeline Structure
 
@@ -284,7 +284,7 @@ This command, with `optimize: true` set for the map and resolve operations, prov
 
 2. Blocking statements and thresholds for the resolve operation: This optimizes the theme resolution process, making it more efficient when dealing with a large number of themes across multiple games. The optimizer provided us with blocking keys of `summary` and `theme`, and a threshold of 0.596 for similarity (to get 95% recall of duplicates).
 
-These optimizations are crucial for handling the scale of our dataset, which includes 500 games with an _average_ of 66,000 tokens per game, and 12% of the documents exceeding the context length limits of the OpenAI LLMs (128k tokens).
+These optimizations are crucial for handling the scale of our dataset, which includes 500 games with an _average_ of 66,000 tokens per game, and 12% of the items exceeding the context length limits of the OpenAI LLMs (128k tokens).
 
 ??? info "Optimized Pipeline"