Add Sycamore page (#8234)
* Create sycamore.md

Create Sycamore page

Signed-off-by: jonfritz <[email protected]>

* Update sycamore.md

Add to docs

Signed-off-by: jonfritz <[email protected]>

* Update sycamore.md

Add info to docs page

Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update sycamore.md

Correct typo

Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: jonfritz <[email protected]>

* Update sycamore.md

Updates from suggestions

Signed-off-by: jonfritz <[email protected]>

* Update index.md

Updating index with Sycamore

Signed-off-by: jonfritz <[email protected]>

* Update _tools/index.md

Signed-off-by: kolchfa-aws <[email protected]>

* Add front matter

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: jonfritz <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit e0045f9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
4 people committed Sep 16, 2024
1 parent 9736d30 commit e332838
Showing 2 changed files with 55 additions and 0 deletions.
7 changes: 7 additions & 0 deletions _tools/index.md
@@ -18,6 +18,7 @@ This section provides documentation for OpenSearch-supported tools, including:
- [OpenSearch CLI](#opensearch-cli)
- [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator)
- [OpenSearch upgrade, migration, and comparison tools](#opensearch-upgrade-migration-and-comparison-tools)
- [Sycamore](#sycamore) for AI-powered extract, transform, load (ETL) on complex documents for vector and hybrid search

For information about Data Prepper, the server-side data collector for filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization, see [Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/index/).

@@ -122,3 +123,9 @@ The OpenSearch Kubernetes Operator is an open-source Kubernetes operator that he
OpenSearch migration tools facilitate migrations to OpenSearch and upgrades to newer versions of OpenSearch. These tools can help you set up a proof-of-concept environment locally using Docker containers or deploy to AWS using a one-click deployment script. This empowers you to fine-tune cluster configurations and manage workloads more effectively before migration.

For more information about OpenSearch migration tools, see the documentation in the [OpenSearch Migration GitHub repository](https://github.com/opensearch-project/opensearch-migrations/tree/capture-and-replay-v0.1.0).

## Sycamore

[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using an [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

Check failure on line 129 in _tools/index.md (GitHub Actions / vale):
[OpenSearch.Spelling] Error: infographics. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

For more information, see [Sycamore]({{site.url}}{{site.baseurl}}/tools/sycamore/).
48 changes: 48 additions & 0 deletions _tools/sycamore.md
@@ -0,0 +1,48 @@
---
layout: default
title: Sycamore
nav_order: 210
has_children: false
---

# Sycamore

[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

Check failure on line 10 in _tools/sycamore.md (GitHub Actions / vale):
[OpenSearch.Spelling] Error: infographics. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

## Sycamore ETL pipeline structure

A Sycamore extract, transform, load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.
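To make the target of the pipeline concrete, the following sketch builds a request body for an OpenSearch index that can serve both keyword and vector queries. It is an illustration under stated assumptions, not the mapping that the Sycamore connector creates: the field names `text` and `embedding`, the helper name `build_index_body`, and the default dimension of 384 are all hypothetical, and the dimension must match whichever embedding model you choose.

```python
import json


def build_index_body(dimension=384):
    """Return an index body with a keyword-searchable field and a vector field.

    The field names and the default dimension are assumptions for this sketch.
    """
    return {
        # Enables k-NN search on the index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Full-text field used for keyword search.
                "text": {"type": "text"},
                # Vector field used for vector search.
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```
{% include copy.html %}

A body like this could be passed to the OpenSearch create index API before any documents are loaded.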

A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps:

* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets).
* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements.
* Extract metadata, filter, and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html).
* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements.
* Embed the chunks using the model of your choice.
* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes.

For an example pipeline that uses this workflow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).
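The shape of the steps above can also be sketched in plain Python. This example deliberately does not use the Sycamore API: every class and function name in it (`Document`, `Element`, `partition`, `extract_title`, `chunk`, `toy_embed`, `to_index_records`) is hypothetical, the partitioner only splits lines, and the "embedding" is a hash-based stand-in for a real model. It is meant only to show how data flows from raw documents to records that are ready for indexing.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Element:
    """One piece of a document, such as a header or a block of text."""
    type: str
    text: str


@dataclass
class Document:
    """A document plus the elements and metadata derived from it."""
    doc_id: str
    raw_text: str
    elements: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)


def partition(doc):
    """Split raw text into elements; lines starting with '#' become headers."""
    doc.elements = [
        Element("header" if line.startswith("#") else "text", line.lstrip("# ").strip())
        for line in doc.raw_text.splitlines()
        if line.strip()
    ]
    return doc


def extract_title(doc):
    """Metadata step: use the first header, if any, as the title."""
    headers = [e.text for e in doc.elements if e.type == "header"]
    doc.properties["title"] = headers[0] if headers else None
    return doc


def chunk(doc, max_chars=80):
    """Greedily merge consecutive elements, starting a new chunk when the
    merged text would exceed max_chars (an oversized element is kept whole)."""
    chunks, current = [], ""
    for element in doc.elements:
        candidate = f"{current} {element.text}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element.text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text, dims=4):
    """Stand-in for a real embedding model: a deterministic hash-based vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def to_index_records(doc):
    """Build one record per chunk with its text, metadata, and embedding."""
    return [
        {
            "_id": f"{doc.doc_id}-{i}",
            "text": text,
            "title": doc.properties.get("title"),
            "embedding": toy_embed(text),
        }
        for i, text in enumerate(chunk(doc))
    ]


# A "DocSet" here is just a list of documents.
docset = [Document("doc1", "# Install guide\nRun the installer.\nRestart the service.")]
records = [r for d in docset for r in to_index_records(extract_title(partition(d)))]
print(records)
```
{% include copy.html %}

In a real pipeline, Sycamore's partitioners, transforms, embedders, and OpenSearch connector replace these toy functions, and the final records are written to an index instead of being printed.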


## Install Sycamore

Check failure on line 30 in _tools/sycamore.md (GitHub Actions / vale):
[OpenSearch.HeadingCapitalization] 'Install Sycamore' is a heading and should be in sentence case.

We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed using extras. For example:

```bash
pip install sycamore-ai[opensearch]
```
{% include copy.html %}

By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install Sycamore with the `local-inference` extra as follows:

Check failure on line 39 in _tools/sycamore.md (GitHub Actions / vale):
[OpenSearch.Spelling] Error: Aryn. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

```bash
pip install sycamore-ai[opensearch,local-inference]
```
{% include copy.html %}
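
After installation, one way to confirm that the package is present in the active Python environment is to query its metadata using the standard library. The following is a minimal sketch: the helper name `installed_version` is hypothetical, and the only Sycamore-specific detail is the distribution name `sycamore-ai` taken from the `pip` commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "sycamore-ai" is the distribution name used in the pip commands above.
print(installed_version("sycamore-ai") or "sycamore-ai is not installed")
```
{% include copy.html %}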

## Next steps

For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).
