Apply suggestions from code review
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
kolchfa-aws and natebower authored Sep 16, 2024
1 parent 0c9cd3c commit 235f134
Showing 2 changed files with 14 additions and 14 deletions.
4 changes: 2 additions & 2 deletions _ml-commons-plugin/api/async-batch-ingest.md
@@ -11,7 +11,7 @@ nav_order: 35
**Introduced 2.17**
{: .label .label-purple }

Use the Asynchronous Batch Ingestion API to ingest data into your OpenSearch cluster from your files in remote file servers, such as S3 or OpenAI. For detailed configuration steps, see [Asynchronous batch ingestion]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/async-batch-ingestion/).
Use the Asynchronous Batch Ingestion API to ingest data into your OpenSearch cluster from your files on remote file servers, such as Amazon Simple Storage Service (Amazon S3) or OpenAI. For detailed configuration steps, see [Asynchronous batch ingestion]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/async-batch-ingestion/).

## Path and HTTP methods

@@ -30,7 +30,7 @@ Field | Data type | Required/Optional | Description
`ingest_fields` | Array | Optional | Lists fields from the source file that should be ingested directly into the OpenSearch index without any additional mapping.
`credential` | Object | Required | Contains the authentication information for accessing external data sources, such as Amazon S3 or OpenAI.
`data_source` | Object | Required | Specifies the type and location of the external file(s) from which the data is ingested.
`data_source.type` | String | Required | Specifies the type of the external data source. Valid values are `s3`, `openAI`.
`data_source.type` | String | Required | Specifies the type of the external data source. Valid values are `s3` and `openAI`.
`data_source.source` | Array | Required | Specifies one or more file locations from which the data is ingested. For `s3`, specify the file path to the Amazon S3 bucket (for example, `["s3://offlinebatch/output/sagemaker_batch.json.out"]`). For `openAI`, specify the file IDs for input or output files (for example, `["file-<your output file id>", "file-<your input file id>", "file-<your other file>"]`).
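
As a rough illustration of how the request fields in the table fit together, the following Python sketch assembles a request body and checks `data_source.type` against the two valid values listed above. The helper name `build_batch_ingestion_body` is hypothetical, the `index_name` key is an assumption (it does not appear in this excerpt), and the contents of `credential` are passed through as-is because their structure is not shown here.

```python
import json

# Valid values for data_source.type, as listed in the field table.
VALID_SOURCE_TYPES = {"s3", "openAI"}


def build_batch_ingestion_body(index_name, field_map, credential,
                               source_type, sources, ingest_fields=None):
    """Assemble a request body for an asynchronous batch ingestion call.

    Hypothetical helper: `index_name` is an assumed key, and `credential`
    is an opaque dict supplied by the caller.
    """
    if source_type not in VALID_SOURCE_TYPES:
        raise ValueError(f"data_source.type must be one of {sorted(VALID_SOURCE_TYPES)}")
    if not sources:
        raise ValueError("data_source.source must list at least one file location")

    body = {
        "index_name": index_name,      # assumption: target index key
        "field_map": field_map,        # index field -> JSON path in the source file
        "credential": credential,      # required; structure not shown in this excerpt
        "data_source": {
            "type": source_type,       # required
            "source": list(sources),   # required; one or more file locations
        },
    }
    # ingest_fields is optional, so include it only when provided.
    if ingest_fields:
        body["ingest_fields"] = list(ingest_fields)
    return body


if __name__ == "__main__":
    example = build_batch_ingestion_body(
        index_name="my-nlp-index",
        field_map={"id": "$.id"},
        credential={"placeholder": "<your credentials>"},
        source_type="s3",
        sources=["s3://offlinebatch/output/sagemaker_batch.json.out"],
        ingest_fields=["$.id"],
    )
    print(json.dumps(example, indent=2))
```

Validating the source type on the client side is a design choice of this sketch, not a documented requirement; the API itself remains the authority on which values it accepts.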

## Example request: Ingesting a single file
24 changes: 12 additions & 12 deletions _ml-commons-plugin/remote-models/async-batch-ingestion.md
@@ -13,9 +13,9 @@ grand_parent: Integrating ML models

[Batch ingestion]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/batch-ingestion/) configures an ingest pipeline, which processes documents one by one. For each document, batch ingestion calls an externally hosted model to generate text embeddings from the document text and then ingests the document, including text and embeddings, into an OpenSearch index.

An alternative to this real-time process, _asynchronous_ batch ingestion, ingests both documents and their embeddings generated outside of OpenSearch and stored in a remote file server, such as Amazon S3 or OpenAI. Asynchronous ingestion returns a task ID and runs asynchronously to ingest data into your k-NN cluster for neural search offline. You can use asynchronous batch ingestion together with [Batch Predict API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/batch-predict/) to perform inference asynchronously. Batch predict operation takes an input file, which contains documents, and calls an externally hosted model to generate embeddings for those documents into an output file. You can then use asynchronous batch ingestion to ingest both the input file containing documents and the output file containing their embeddings into an OpenSearch index.
An alternative to this real-time process, _asynchronous_ batch ingestion, ingests both documents and their embeddings generated outside of OpenSearch and stored on a remote file server, such as Amazon Simple Storage Service (Amazon S3) or OpenAI. Asynchronous ingestion returns a task ID and runs asynchronously to ingest data offline into your k-NN cluster for neural search. You can use asynchronous batch ingestion together with the [Batch Predict API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/batch-predict/) to perform inference asynchronously. The batch predict operation takes an input file containing documents and calls an externally hosted model to generate embeddings for those documents in an output file. You can then use asynchronous batch ingestion to ingest both the input file containing documents and the output file containing their embeddings into an OpenSearch index.

As of OpenSearch 2.17, Batch Ingestion API is verified to work with Amazon SageMaker, Amazon Bedrock, and OpenAI.
As of OpenSearch 2.17, the Asynchronous Batch Ingestion API is supported by Amazon SageMaker, Amazon Bedrock, and OpenAI.
{: .note}

## Prerequisites
@@ -30,9 +30,9 @@ Before using asynchronous batch ingestion, you must generate text embeddings usi

## Ingesting data from a single file

First, create a k-NN index where you'll ingest the data. The fields in the k-NN index represent the structure of the data in the source file.
First, create a k-NN index into which you'll ingest the data. The fields in the k-NN index represent the structure of the data in the source file.

In this example, the source file contains documents containing titles and chapters, along with their corresponding embeddings. Thus, you'll create a k-NN index with fields `id`, `chapter_embedding`, `chapter`, `title_embedding`, and `title`:
In this example, the source file holds documents containing titles and chapters, along with their corresponding embeddings. Thus, you'll create a k-NN index with the fields `id`, `chapter_embedding`, `chapter`, `title_embedding`, and `title`:

```json
PUT /my-nlp-index
@@ -83,9 +83,9 @@ PUT /my-nlp-index
```
{% include copy-curl.html %}

When using an S3 file as the source for asynchronous batch ingestion, you must map the fields in the source file to fields in the index in order to indicate where each piece of data is ingested. If no JSON path is provided for a field, that field in the k-NN index will be set to `null`.
When using an S3 file as the source for asynchronous batch ingestion, you must map the fields in the source file to fields in the index in order to indicate into which index each piece of data is ingested. If no JSON path is provided for a field, that field will be set to `null` in the k-NN index.
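
The mapping rule above can be sketched in Python as follows. This is only an illustration of the described behavior, not how OpenSearch implements it: the resolver handles just simple paths made of `.key` and `[n]` steps, and the function names and sample record keys (`content`, `out`) are hypothetical.

```python
import re

# One path step: either ".key" or "[n]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_path(record, path):
    """Resolve a simple JSON path such as "$.content[0]" against a record.

    Returns None when the path does not lead to a value.
    """
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path}")
    current = record
    pos = 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported path syntax: {path}")
        key, index = match.group(1), match.group(2)
        try:
            current = current[key] if key is not None else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
        pos = match.end()
    return current


def apply_field_map(record, field_map, index_fields):
    """Build one index document; fields without a JSON path become None (null)."""
    return {
        field: resolve_path(record, field_map[field]) if field in field_map else None
        for field in index_fields
    }


if __name__ == "__main__":
    record = {"id": 1, "content": ["A chapter", "A title"], "out": [[0.1], [0.2]]}
    field_map = {
        "chapter": "$.content[0]",
        "title": "$.content[1]",
        "chapter_embedding": "$.out[0]",
    }
    fields = ["chapter", "title", "chapter_embedding", "title_embedding"]
    # title_embedding has no JSON path in field_map, so it is set to None.
    print(apply_field_map(record, field_map, fields))
```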

In the `field_map`, provide the location where the data for each field can be found in the source file. You can also specify fields to be ingested directly into your index without making any changes to the source file by adding their JSON path to the `ingest_fields` array. For example, in the following asynchronous batch ingestion request, the element with the JSON path `$.id` from the source file is ingested directly into the `id` field of your index. To ingest this data from the Amazon S3 file, send the following request to your OpenSearch endpoint:
In the `field_map`, indicate the location of the data for each field in the source file. You can also specify fields to be ingested directly into your index without making any changes to the source file by adding their JSON paths to the `ingest_fields` array. For example, in the following asynchronous batch ingestion request, the element with the JSON path `$.id` from the source file is ingested directly into the `id` field of your index. To ingest this data from the Amazon S3 file, send the following request to your OpenSearch endpoint:

```json
POST /_plugins/_ml/_batch_ingestion
@@ -123,14 +123,14 @@ The response contains a task ID for the ingestion task:
}
```

To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once the ingestion is complete, the task `state` changes to `COMPLETED`.
To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once ingestion is complete, the task `state` changes to `COMPLETED`.
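
A minimal polling loop for this status check might look like the following Python sketch. It assumes a caller-supplied `get_task` callable (hypothetical, for example a thin wrapper around a request to the Tasks API) that returns the task as a dictionary with a `state` key. Treating `FAILED` as a terminal error state is an assumption; only `COMPLETED` is confirmed by the text above.

```python
import time


def wait_for_task(get_task, task_id, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll a task until its state is COMPLETED.

    get_task: hypothetical callable taking a task ID and returning a dict
              that contains a "state" key.
    Raises RuntimeError if the task reports FAILED (assumed error state)
    and TimeoutError if it does not complete within max_attempts polls.
    """
    for attempt in range(max_attempts):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":  # assumption: FAILED marks an unrecoverable task
            raise RuntimeError(f"task {task_id} failed: {task}")
        # Wait before the next poll, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"task {task_id} did not complete after {max_attempts} polls")


if __name__ == "__main__":
    # Simulated task responses stand in for real Tasks API calls.
    responses = iter([{"state": "CREATED"}, {"state": "RUNNING"}, {"state": "COMPLETED"}])
    final = wait_for_task(lambda _task_id: next(responses), "example-task-id", interval=0)
    print(final["state"])
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running cluster.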


## Ingesting data from multiple files

You can also ingest data from multiple files by specifying the file locations in the `source`. The following example ingests data from three OpenAI files.

The OpenAI Batch API input file is in the following format:
The OpenAI Batch API input file is formatted as follows:

```
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": [ "What is the meaning of life?", "The food was delicious and the waiter..."]}}
@@ -139,7 +139,7 @@ The OpenAI Batch API input file is in the following format:
...
```

The OpenAI Batch API output file is in the following format:
The OpenAI Batch API output file is formatted as follows:

```
{"id": "batch_req_ITKQn29igorXCAGp6wzYs5IS", "custom_id": "request-1", "response": {"status_code": 200, "request_id": "10845755592510080d13054c3776aef4", "body": {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [0.0044326545, ... ...]}, {"object": "embedding", "index": 1, "embedding": [0.002297497, ... ... ]}], "model": "text-embedding-ada-002", "usage": {"prompt_tokens": 15, "total_tokens": 15}}}, "error": null}
@@ -148,7 +148,7 @@ The OpenAI Batch API output file is in the following format:

If you have run the Batch API in OpenAI for text embedding and want to ingest the model input and output files along with some metadata into your index, send the following asynchronous ingestion request. Make sure to use `source[file-index]` to identify the file's location in the source array in the request body. For example, `source[0]` refers to the first file in the `data_source.source` array.

The following request ingests seven fields into your index: five are specified in the `field_map` section, and two in `ingest_fields`. The format follows the pattern `sourcefile.jsonPath`, indicating the JSON path for each file. In the field_map, `$.body.input[0]` is used as the JSON path to ingest data into the `question` field from the second file in the `source` array. The `ingest_fields` array lists all elements from the `source` files that will be ingested directly into your index:
The following request ingests seven fields into your index: Five are specified in the `field_map` section and two are specified in `ingest_fields`. The format follows the pattern `sourcefile.jsonPath`, indicating the JSON path for each file. In the field_map, `$.body.input[0]` is used as the JSON path to ingest data into the `question` field from the second file in the `source` array. The `ingest_fields` array lists all elements from the `source` files that will be ingested directly into your index:

```json
POST /_plugins/_ml/_batch_ingestion
@@ -173,7 +173,7 @@ POST /_plugins/_ml/_batch_ingestion
```
{% include copy-curl.html %}

In the request, make sure to define the `_id` field in the `field_map`, because it is necessary in order to map each data entry from the three different files.
In the request, make sure to define the `_id` field in the `field_map`. This is necessary in order to map each data entry from the three separate files.

The response contains a task ID for the ingestion task:

@@ -185,6 +185,6 @@ The response contains a task ID for the ingestion task:
}
```

To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once the ingestion is complete, the task `state` changes to `COMPLETED`.
To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once ingestion is complete, the task `state` changes to `COMPLETED`.

For request field descriptions, see [Asynchronous Batch Ingestion API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/async-batch-ingest/).
