From 58c15ba744ccb32edc18997a60ad0079422750e0 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Wed, 11 Sep 2024 15:58:24 -0700
Subject: [PATCH 01/23] Create sycamore.md

Create Sycamore page

Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 _tools/sycamore.md

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
new file mode 100644
index 0000000000..75dfaaab0b
--- /dev/null
+++ b/_tools/sycamore.md
@@ -0,0 +1,3 @@
+# Sycamore
+
+Sycamore is an open source, AI-powered document processing engine for ETL, RAG, LLM-based applications, and analytics on unstructured data. Sycamore can partition and enrich a wide range of document types including reports, presentations, transcripts, manuals, and more. It can analyze and chunk complex documents such as PDFs and images with embedded tables, figures, graphs, and other infographics.

From b8f4b04f6cb401a5ac68292fec6c4ce45e95b728 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Wed, 11 Sep 2024 17:27:31 -0700
Subject: [PATCH 02/23] Update sycamore.md

Add to docs

Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 75dfaaab0b..f6f80f41f2 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -1,3 +1,22 @@
-# Sycamore
+# Sycamore ETL

-Sycamore is an open source, AI-powered document processing engine for ETL, RAG, LLM-based applications, and analytics on unstructured data. Sycamore can partition and enrich a wide range of document types including reports, presentations, transcripts, manuals, and more. It can analyze and chunk complex documents such as PDFs and images with embedded tables, figures, graphs, and other infographics.
+[Sycamore](https://github.com/aryn-ai/sycamore) is an open source, AI-powered document processing engine to prepare unstructured data for RAG and semantic search. Sycamore can chunk and enrich a wide range of complex document types including reports, presentations, transcripts, manuals, and more, and it can extract and process embedded tables, figures, graphs, and other infographics. It can then load a target index, including vector and keyword indexes, using a connector (like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html)).
+
+[Visit the Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html) to get started.
+
+# Structure of an ETL Pipeline
+
+A Sycamore ETL pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts:
+
+* Read documents into a DocSet
+* Partition documents into structured JSON
+* Extract metadata and clean data
+* Create chunks
+* Embed
+* Load OpenSearch
+
+You can see an example pipeline with this flow in this notebook.
+
+
+
+# Install Sycamore

From 97cfaf686261359054672daa21cf712c286b81be Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Wed, 11 Sep 2024 23:38:43 -0700
Subject: [PATCH 03/23] Update sycamore.md

Add info to docs page

Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index f6f80f41f2..dea0bc3e6a 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -1,22 +1,35 @@
 # Sycamore ETL

-[Sycamore](https://github.com/aryn-ai/sycamore) is an open source, AI-powered document processing engine to prepare unstructured data for RAG and semantic search. Sycamore can chunk and enrich a wide range of complex document types including reports, presentations, transcripts, manuals, and more, and it can extract and process embedded tables, figures, graphs, and other infographics. It can then load a target index, including vector and keyword indexes, using a connector (like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html)).
+[Sycamore](https://github.com/aryn-ai/sycamore) is an open source, AI-powered document processing engine to prepare unstructured data for RAG and semantic search using Python. Sycamore can chunk and enrich a wide range of complex document types including reports, presentations, transcripts, manuals, and more, and it can extract and process embedded tables, figures, graphs, and other infographics. It can then load a target index, including vector and keyword indexes, using a connector (like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html)).

 [Visit the Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html) to get started.

 # Structure of an ETL Pipeline

-A Sycamore ETL pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts:
+A Sycamore ETL pipeline is a series of transformations on a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constintuent elements (e.g. a table, block of text, or header). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.

-* Read documents into a DocSet
-* Partition documents into structured JSON
-* Extract metadata and clean data
-* Create chunks
-* Embed
-* Load OpenSearch
+A pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts:

-You can see an example pipeline with this flow in this notebook.
+* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets)
+* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements
+* Extract metadata, filter, and clean data with [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html)
+* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements
+* Embed with the model of your choice
+* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) OpenSearch

+You can see an example pipeline with this flow in [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).


 # Install Sycamore
+
+We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be installed via extras. For example:
+
+```
+pip install sycamore-ai[opensearch]
+```
+
+By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install the local-inference extra as follows:
+
+```
+pip install sycamore-ai[opensearch,local-inference]
+```

From 83f99a9b12c420eeb73ae1ae09fb6c7e9eeaef67 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:00:39 -0700
Subject: [PATCH 04/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index dea0bc3e6a..00a2afcada 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -1,4 +1,4 @@
-# Sycamore ETL
+# Sycamore

 [Sycamore](https://github.com/aryn-ai/sycamore) is an open source, AI-powered document processing engine to prepare unstructured data for RAG and semantic search using Python. Sycamore can chunk and enrich a wide range of complex document types including reports, presentations, transcripts, manuals, and more, and it can extract and process embedded tables, figures, graphs, and other infographics. It can then load a target index, including vector and keyword indexes, using a connector (like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html)).
From 26556047fb913bae5e9c495fa0fe4d6db08f64ef Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:01:14 -0700
Subject: [PATCH 05/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 00a2afcada..f27ccfaa4c 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -1,6 +1,6 @@
 # Sycamore

-[Sycamore](https://github.com/aryn-ai/sycamore) is an open source, AI-powered document processing engine to prepare unstructured data for RAG and semantic search using Python. Sycamore can chunk and enrich a wide range of complex document types including reports, presentations, transcripts, manuals, and more, and it can extract and process embedded tables, figures, graphs, and other infographics. It can then load a target index, including vector and keyword indexes, using a connector (like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html)).
+[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RA)G and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

 [Visit the Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html) to get started.

From 99c47280a25de1af51bbdf10cfeac6967600a668 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:01:34 -0700
Subject: [PATCH 06/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index f27ccfaa4c..6a3eca5c87 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -2,7 +2,7 @@

 [Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RA)G and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

-[Visit the Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html) to get started.
+To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

 # Structure of an ETL Pipeline

From 71cc3a83ffa412214397cc4383ed905e88da67b5 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:01:46 -0700
Subject: [PATCH 07/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 6a3eca5c87..2414dce392 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -4,7 +4,7 @@

 To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

-# Structure of an ETL Pipeline
+# Sycamore ETL pipeline structure

 A Sycamore ETL pipeline is a series of transformations on a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constintuent elements (e.g. a table, block of text, or header). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.

From 1d15c54142baf750f43d264eac1f1622e4ad8491 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:03:30 -0700
Subject: [PATCH 08/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 2414dce392..302c513b1b 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -6,7 +6,7 @@ To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.

 # Sycamore ETL pipeline structure

-A Sycamore ETL pipeline is a series of transformations on a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constintuent elements (e.g. a table, block of text, or header). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.
+A Sycamore Extract, Transform, Load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.

 A pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts:

From 7e1035ca2950dbf75a60c8790ed964c79c883bde Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:03:41 -0700
Subject: [PATCH 09/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 302c513b1b..afb7cd9a4f 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -24,7 +24,7 @@ You can see an example pipeline with this flow in [this notebook](https://github

 We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be installed via extras. For example:

-```
+```bash
 pip install sycamore-ai[opensearch]
 ```

From 984cc5b3458350bd00790abb3b0fa3f303eeaed2 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:03:54 -0700
Subject: [PATCH 10/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index afb7cd9a4f..456f0084e0 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -28,7 +28,7 @@ We recommend installing the Sycamore library using `pip`. The connector for Open
 pip install sycamore-ai[opensearch]
 ```

-By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install the local-inference extra as follows:
+By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install the `local-inference` extra as follows:

 ```
 pip install sycamore-ai[opensearch,local-inference]
To run inference locally for partitioning or embedding, install the `local-inference` extra as follows: -``` +```bash pip install sycamore-ai[opensearch,local-inference] ``` From 731fa3cab371d06d923a866c998c97c88355cc07 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:04:48 -0700 Subject: [PATCH 12/23] Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index 60abd996f7..3b8e17f682 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -33,3 +33,4 @@ By default, Sycamore works with the Aryn Partitioning Service to process PDFs. T ```bash pip install sycamore-ai[opensearch,local-inference] ``` +{% include copy.html %} From 1088601398796dd16197c610f0de039dc51ac3d2 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:04:59 -0700 Subject: [PATCH 13/23] Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index 3b8e17f682..016705c5dc 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -27,6 +27,7 @@ We recommend installing the Sycamore library using `pip`. The connector for Open ```bash pip install sycamore-ai[opensearch] ``` +{% include copy.html %} By default, Sycamore works with the Aryn Partitioning Service to process PDFs. 
To run inference locally for partitioning or embedding, install the `local-inference` extra as follows: From 82426898bbed72c69f5714a70005a1db2fd9f0e2 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:05:54 -0700 Subject: [PATCH 14/23] Update sycamore.md Correct typo Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index 016705c5dc..ca22cec5b2 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -1,6 +1,6 @@ # Sycamore -[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RA)G and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). +[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. 
It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html). From 1cd871c67c58f7b089c507b0abfd296fcedd5a00 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:31:37 -0700 Subject: [PATCH 15/23] Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index ca22cec5b2..298b152027 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -8,7 +8,7 @@ To get started, visit the [Sycamore documentation](https://sycamore.readthedocs. A Sycamore Extract, Transform, Load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes. 
-A pipeline to prepare unstructured data for vector or hybrid search in OpenSearch generally has these parts: +A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps: * Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets) * [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements From b0c39fd76e078c6b2b4df5aed78b10e6ca441705 Mon Sep 17 00:00:00 2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:31:54 -0700 Subject: [PATCH 16/23] Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index 298b152027..3500e6a3cf 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -12,7 +12,7 @@ A typical pipeline for preparing unstructured data for vector or hybrid search i * Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets) * [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements -* Extract metadata, filter, and clean data with [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html) +* Extract metadata, filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html) * Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements * Embed with the model of your choice * [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) OpenSearch From 9d710caf5d70f888eb9fe20aebd88f54cd908d01 Mon Sep 17 00:00:00 
2001 From: jonfritz <134336691+jonfritz@users.noreply.github.com> Date: Thu, 12 Sep 2024 13:32:08 -0700 Subject: [PATCH 17/23] Update _tools/sycamore.md Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com> --- _tools/sycamore.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_tools/sycamore.md b/_tools/sycamore.md index 3500e6a3cf..ba9ae09b21 100644 --- a/_tools/sycamore.md +++ b/_tools/sycamore.md @@ -14,7 +14,7 @@ A typical pipeline for preparing unstructured data for vector or hybrid search i * [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements * Extract metadata, filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html) * Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements -* Embed with the model of your choice +* Embed the chunks using the model of your choice * [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) OpenSearch You can see an example pipeline with this flow in [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb). 
From ffa519b1ca74902397ed9f0525e7466d1a6ae9a1 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:32:48 -0700
Subject: [PATCH 18/23] Update _tools/sycamore.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index ba9ae09b21..f5864b5a8c 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -17,7 +17,7 @@ A typical pipeline for preparing unstructured data for vector or hybrid search i
 * Embed the chunks using the model of your choice
 * [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) OpenSearch

-You can see an example pipeline with this flow in [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).
+For an example pipeline that follows this flow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).


 # Install Sycamore

From 6ec3caa2d26c643bbde79411de2070e38f455d24 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 13:35:52 -0700
Subject: [PATCH 19/23] Update sycamore.md

Updates from suggestions

Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/sycamore.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index f5864b5a8c..2398c52d6a 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -15,21 +15,21 @@ A typical pipeline for preparing unstructured data for vector or hybrid search i
 * Extract metadata, filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html)
 * Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements
 * Embed the chunks using the model of your choice
-* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) OpenSearch
+* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes

 For an example pipeline that follows this flow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).


 # Install Sycamore

-We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be installed via extras. For example:
+We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed via extras. For example:

 ```bash
 pip install sycamore-ai[opensearch]
 ```
 {% include copy.html %}

-By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install the `local-inference` extra as follows:
+By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install Sycamore with the `local-inference` extra as follows:

 ```bash
 pip install sycamore-ai[opensearch,local-inference]

From b77f2bfc4a1807d3b5ef04120123523c46b5d537 Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Thu, 12 Sep 2024 14:01:48 -0700
Subject: [PATCH 20/23] Update index.md

Updating index with Sycamore

Signed-off-by: jonfritz <134336691+jonfritz@users.noreply.github.com>
---
 _tools/index.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/_tools/index.md b/_tools/index.md
index 108f10da97..b166d6d3cf 100644
--- a/_tools/index.md
+++ b/_tools/index.md
@@ -18,6 +18,7 @@ This section provides documentation for OpenSearch-supported tools, including:
 - [OpenSearch CLI](#opensearch-cli)
 - [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator)
 - [OpenSearch upgrade, migration, and comparison tools](#opensearch-upgrade-migration-and-comparison-tools)
+- [Sycamore](#sycamore) for AI-powered ETL on complex documents for vector and hybrid search

 For information about Data Prepper, the server-side data collector for filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization, see [Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/index/).

@@ -122,3 +123,9 @@ The OpenSearch Kubernetes Operator is an open-source Kubernetes operator that he
 OpenSearch migration tools facilitate migrations to OpenSearch and upgrades to newer versions of OpenSearch. These can help you can set up a proof-of-concept environment locally using Docker containers or deploy to AWS using a one-click deployment script. This empowers you to fine-tune cluster configurations and manage workloads more effectively before migration.

 For more information about OpenSearch migration tools, see the documentation in the [OpenSearch Migration GitHub repository](https://github.com/opensearch-project/opensearch-migrations/tree/capture-and-replay-v0.1.0).
+
+## Sycamore
+
+[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using an [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).
+
+To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

From a2b9c27417c41315c7827e507be23e9020545751 Mon Sep 17 00:00:00 2001
From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Date: Thu, 12 Sep 2024 19:25:39 -0400
Subject: [PATCH 21/23] Update _tools/index.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
---
 _tools/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_tools/index.md b/_tools/index.md
index b166d6d3cf..f22d65f327 100644
--- a/_tools/index.md
+++ b/_tools/index.md
@@ -128,4 +128,4 @@ For more information about OpenSearch migration tools, see the documentation in

 [Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using an [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).

-To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).
+For more information, see [Sycamore]({{site.url}}{{site.baseurl}}/tools/sycamore/).

From 3ec3dbe97111030f8d7aa5b00ff713f370da3722 Mon Sep 17 00:00:00 2001
From: Fanit Kolchina
Date: Mon, 16 Sep 2024 15:47:02 -0400
Subject: [PATCH 22/23] Add front matter

Signed-off-by: Fanit Kolchina
---
 _tools/sycamore.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index 2398c52d6a..fb681ec7d1 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -1,3 +1,10 @@
+---
+layout: default
+title: Sycamore
+nav_order: 210
+has_children: false
+---
+
 # Sycamore

 [Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html).
@@ -35,3 +42,7 @@ By default, Sycamore works with the Aryn Partitioning Service to process PDFs. T
 pip install sycamore-ai[opensearch,local-inference]
 ```
 {% include copy.html %}
+
+## Next steps
+
+For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).
\ No newline at end of file

From f3acccd7d75acef3a9dd123abc3cf1fd6d7bf35a Mon Sep 17 00:00:00 2001
From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Date: Mon, 16 Sep 2024 16:15:35 -0400
Subject: [PATCH 23/23] Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
---
 _tools/index.md | 2 +-
 _tools/sycamore.md | 18 +++++++++---------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/_tools/index.md b/_tools/index.md
index f22d65f327..c9d446a81a 100644
--- a/_tools/index.md
+++ b/_tools/index.md
@@ -18,7 +18,7 @@ This section provides documentation for OpenSearch-supported tools, including:
 - [OpenSearch CLI](#opensearch-cli)
 - [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator)
 - [OpenSearch upgrade, migration, and comparison tools](#opensearch-upgrade-migration-and-comparison-tools)
-- [Sycamore](#sycamore) for AI-powered ETL on complex documents for vector and hybrid search
+- [Sycamore](#sycamore) for AI-powered extract, transform, load (ETL) on complex documents for vector and hybrid search

 For information about Data Prepper, the server-side data collector for filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization, see [Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/index/).

diff --git a/_tools/sycamore.md b/_tools/sycamore.md
index fb681ec7d1..7ce55931ac 100644
--- a/_tools/sycamore.md
+++ b/_tools/sycamore.md
@@ -13,23 +13,23 @@ To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.
# Sycamore ETL pipeline structure -A Sycamore Extract, Transform, Load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes. +A Sycamore extract, transform, load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes. A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps: -* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets) -* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements -* Extract metadata, filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html) -* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements -* Embed the chunks using the model of your choice -* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes +* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets). +* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements. 
+* Extract metadata, filter, and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html). +* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements. +* Embed the chunks using the model of your choice. +* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes. -For an example pipeline that follows this flow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb). +For an example pipeline that uses this workflow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb). # Install Sycamore -We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed via extras. For example: +We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed using extras. For example: ```bash pip install sycamore-ai[opensearch]