Add Sycamore page #8234

Merged Sep 16, 2024 (24 commits)
Changes from 14 commits
37 changes: 37 additions & 0 deletions _tools/sycamore.md
@@ -0,0 +1,37 @@
# Sycamore

[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine for Python that prepares unstructured data for retrieval-augmented generation (RAG) and semantic search. Sycamore chunks and enriches a wide range of complex document types, including reports, presentations, transcripts, and manuals. It can also extract and process embedded elements, such as tables, figures, graphs, and other infographics, and then load the data into target indexes, including vector and keyword indexes, using a connector such as the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

# Sycamore ETL pipeline structure

A Sycamore Extract, Transform, Load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.

A pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts:

* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets)
* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements
* Extract metadata, filter, and clean data with [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html)
* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements
* Embed the chunks using the model of your choice
* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the data into OpenSearch

You can see an example pipeline with this flow in [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).
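
The steps listed above can be sketched in Python roughly as follows. This is a minimal sketch rather than the notebook's code: the class names and keyword arguments reflect the Sycamore documentation at the time of writing and may differ between versions, and the host, port, index name, embedding model, and chunk size are placeholder assumptions. The metadata extraction and filtering step is omitted for brevity. `ArynPartitioner` is assumed to need an Aryn API key, and `SentenceTransformerEmbedder` is assumed to need the `local-inference` extra.

```python
def opensearch_client_args(host="localhost", port=9200):
    """Connection settings handed to the OpenSearch client (placeholder values)."""
    return {
        "hosts": [{"host": host, "port": port}],
        "http_compress": True,
        "use_ssl": True,
        "verify_certs": False,  # acceptable for local testing only
        "timeout": 120,
    }


def opensearch_index_settings(embedding_dimension=384):
    """Index body with k-NN enabled and a vector field sized for the embedder."""
    return {
        "body": {
            "settings": {"index.knn": True},
            "mappings": {
                "properties": {
                    "embedding": {
                        "type": "knn_vector",
                        "dimension": embedding_dimension,
                        "method": {"name": "hnsw", "engine": "faiss"},
                    }
                }
            },
        }
    }


def run_pipeline(pdf_paths, index_name="sycamore-demo"):
    """Read, partition, chunk, embed, and load PDFs (index_name is hypothetical)."""
    # Imported here so the helpers above can be used without Sycamore installed.
    import sycamore
    from sycamore.functions.tokenizer import HuggingFaceTokenizer
    from sycamore.transforms.embed import SentenceTransformerEmbedder
    from sycamore.transforms.merge_elements import GreedyTextElementMerger
    from sycamore.transforms.partition import ArynPartitioner

    # Assumed embedding model; it produces 384-dimensional vectors.
    model = "sentence-transformers/all-MiniLM-L6-v2"

    context = sycamore.init()
    docset = (
        context.read.binary(paths=pdf_paths, binary_format="pdf")  # read into a DocSet
        .partition(partitioner=ArynPartitioner())  # split into structured elements
        .merge(merger=GreedyTextElementMerger(HuggingFaceTokenizer(model), 512))  # build chunks
        .explode()  # promote each chunk to its own document
        .embed(embedder=SentenceTransformerEmbedder(model_name=model, batch_size=100))
    )
    docset.write.opensearch(
        os_client_args=opensearch_client_args(),
        index_name=index_name,
        index_settings=opensearch_index_settings(384),
    )
```
{% include copy.html %}

The vector field dimension in the index settings has to match the output size of the embedding model, which is why the two are set together in `run_pipeline`.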


# Install Sycamore

We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be installed via extras. For example:

```bash
# Quote the requirement so shells such as zsh do not expand the brackets.
pip install "sycamore-ai[opensearch]"
```
{% include copy.html %}

By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install the `local-inference` extra as follows:

```bash
# Quote the requirement so shells such as zsh do not expand the brackets.
pip install "sycamore-ai[opensearch,local-inference]"
```
{% include copy.html %}
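
After installation, a quick check like the one below can confirm that the expected modules are importable. It is only a sketch: it assumes that the `sycamore-ai` package is imported as `sycamore` and that the `opensearch` extra pulls in the `opensearch-py` client, which is imported as `opensearchpy`. The helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the base package and the OpenSearch extra.
ASSUMED_MODULES = ["sycamore", "opensearchpy"]

if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```
{% include copy.html %}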