Add new markdown text splitter #7

Open
wants to merge 1 commit into
base: frances/test_promptless_3
from

frances720

Description
This MR defines an ExperimentalMarkdownSyntaxTextSplitter class. The
main goal is to replicate the functionality of the original
MarkdownHeaderTextSplitter which extracts the header stack as metadata
but with one critical difference: it keeps the whitespace of the
original text intact.

This draft reimplements the MarkdownHeaderTextSplitter with a very
different algorithmic approach. Instead of marking up each line of the
text individually and aggregating the lines back together into chunks,
this method builds each chunk sequentially and applies the metadata to
each chunk. This makes the implementation simpler. However, since it's
designed to keep whitespace intact, it's not a full drop-in replacement
for the original. Since it is a radical implementation change to the
original code, I would like to get feedback on whether this is a
worthwhile replacement, should be its own class, or is not a good idea
at all.
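
To make the "build each chunk sequentially" idea concrete, here is a minimal sketch. It is not the code from this MR: the function name `split_by_headers`, the `DEFAULT_HEADERS` mapping, and the tuple return type are all hypothetical and chosen only for illustration. It assumes ATX-style headers (`#`, `##`, ...) and drops the header lines from the chunk text while leaving every other character, including blank lines and indentation, untouched.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical default mapping of header markers to metadata keys.
DEFAULT_HEADERS: Dict[str, str] = {
    "#": "Header 1",
    "##": "Header 2",
    "###": "Header 3",
}

# Matches an ATX header such as "## Title" (marker, one space, title).
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")


def split_by_headers(
    text: str, headers: Optional[Dict[str, str]] = None
) -> List[Tuple[str, Dict[str, str]]]:
    """Build chunks one at a time and attach the header stack as metadata."""
    headers = DEFAULT_HEADERS if headers is None else headers
    chunks: List[Tuple[str, Dict[str, str]]] = []
    stack: List[Tuple[int, str, str]] = []  # (level, metadata key, title)
    current = ""  # text of the chunk being built, whitespace untouched

    def flush() -> None:
        nonlocal current
        if current.strip():  # skip chunks that contain only whitespace
            chunks.append((current, {key: title for _, key, title in stack}))
        current = ""

    for line in text.splitlines(keepends=True):
        match = HEADER_RE.match(line)
        if match and match.group(1) in headers:
            flush()  # close the chunk that belongs to the previous header
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, headers[match.group(1)], match.group(2).strip()))
        else:
            current += line  # keep the line exactly as written
    flush()
    return chunks
```

Because the metadata is computed once per chunk when the chunk is closed, there is no per-line markup to aggregate afterwards, which is the simplification the description refers to.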

Note: I implemented the return_each_line parameter but I don't think
it's a necessary feature. I'd prefer to remove it.

This implementation also adds the following additional features:

- Splits out code blocks and includes the language in the "Code"
  metadata key
- Splits text on the horizontal rule --- as well
- The headers_to_split_on parameter is now optional - with sensible
  defaults that can be overridden.
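
As a rough illustration of the first two features, the sketch below splits fenced code blocks into their own chunks and breaks the text at horizontal rules. It is an assumption-laden sketch rather than the MR's implementation: `split_code_and_rules`, `FENCE_RE`, and `RULE_RE` are hypothetical names, the closing fence is assumed to be identical to the opening marker, the fence lines are kept inside the code chunk, and setext-style headings (a `---` directly under a text line) are not handled. Only the "Code" metadata key comes from the description above.

```python
import re
from typing import Dict, List, Optional, Tuple

# Opening fence: three or more backticks or tildes, then an optional language.
FENCE_RE = re.compile(r"^(`{3,}|~{3,})\s*(\S*)")
# Horizontal rule: a line made only of ***, --- or ___ (three or more).
RULE_RE = re.compile(r"^(\*\*\*+|---+|___+)\s*$")


def split_code_and_rules(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split text into chunks at code fences and horizontal rules."""
    chunks: List[Tuple[str, Dict[str, str]]] = []
    current = ""
    fence: Optional[str] = None  # opening marker while inside a code block
    language = ""

    def flush(metadata: Dict[str, str]) -> None:
        nonlocal current
        if current.strip():  # skip whitespace-only chunks
            chunks.append((current, metadata))
        current = ""

    for line in text.splitlines(keepends=True):
        if fence is not None:
            current += line  # code lines are copied verbatim
            if line.rstrip() == fence:  # closing fence ends the code chunk
                flush({"Code": language})
                fence = None
            continue
        match = FENCE_RE.match(line)
        if match:
            flush({})  # close the prose chunk before the code block
            fence, language = match.group(1), match.group(2)
            current = line
        elif RULE_RE.match(line):
            flush({})  # the rule itself is dropped, it only marks a boundary
        else:
            current += line
    # An unterminated code block is still emitted as code.
    flush({"Code": language} if fence is not None else {})
    return chunks
```

Since the indentation and blank lines inside the code chunk are preserved, a caller can pass that chunk to a language-aware splitter afterwards.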
Issue
Keeping the whitespace preserves the paragraph structure and the
formatting of the code blocks, which gives the caller much more
flexibility in how they further split the individual sections of the
resulting documents. This addresses the concerns raised by the
community in the following issues:

langchain-ai#20823
langchain-ai#19436
langchain-ai#22256
Dependencies
N/A

Twitter handle
@RyanElston

Co-authored-by: isaac hershenson <ihershenson@hmc.edu>

promptless bot commented Nov 15, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#8

Please review the suggested updates to ensure they accurately reflect your changes.

promptless bot commented Nov 15, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#9

Please review the suggested updates to ensure they accurately reflect your changes.

promptless bot commented Nov 15, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#10

Please review the suggested updates to ensure they accurately reflect your changes.

promptless bot commented Nov 15, 2024

✅ No documentation updates required

Promptless has analyzed your changes against existing documentation and determined that no updates are needed at this time.

If you believe documentation updates are needed, please update the relevant files manually.

promptless bot commented Nov 18, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#12

Please review the suggested updates to ensure they accurately reflect your changes.

promptless bot commented Nov 19, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#12

Please review the suggested updates to ensure they accurately reflect your changes.

promptless bot commented Nov 19, 2024

✅ No documentation updates required

Promptless has analyzed your changes against existing documentation and determined that no updates are needed at this time.

If you believe documentation updates are needed, please update the relevant files manually.

promptless bot commented Nov 20, 2024

✅ No documentation updates required

Promptless has analyzed your changes against existing documentation and determined that no updates are needed at this time.

If you believe documentation updates are needed, please update the relevant files manually.

promptless bot commented Dec 17, 2024

✅ No documentation updates required

Promptless has analyzed your changes against existing documentation and determined that no updates are needed at this time.

If you believe documentation updates are needed, please update the relevant files manually.

promptless bot commented Dec 17, 2024

📝 Documentation updates detected!

Promptless has analyzed your changes and created a documentation update PR. You can review the proposed documentation changes here:
#19

Please review the suggested updates to ensure they accurately reflect your changes.
