
How to split document in a smarter way #6

Open
RobinHerzog opened this issue Feb 16, 2023 · 6 comments

@RobinHerzog

Hello,

I understand that we need to split documents into smaller pieces because OpenAI cannot take the whole text as input.

However, my challenge is to cut the text in a smart way, so that it doesn't break in the middle of a sentence.

Any luck with that?

@boxabirds commented Feb 20, 2023

The simplest option looks to be the built-in SpacyTextSplitter, which uses spaCy, or maybe the NLTKTextSplitter. The former, I know for sure, has a huge footprint -- around 400 MB for the spaCy model. A nice patch to langchain would be to use sentence_splitter, which is super fast and efficient (it doesn't require a machine learning model). A little further investigation for this demo suggests the MarkdownTextSplitter might be the best bet.
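For illustration, a minimal sketch of how those built-in splitters are typically invoked (assuming a recent langchain version; `document.txt` and `README.md` are placeholder inputs):

```python
from langchain.text_splitter import MarkdownTextSplitter, SpacyTextSplitter

long_text = open("document.txt").read()    # placeholder input
markdown_text = open("README.md").read()   # placeholder input

# Sentence-aware splitting via spaCy (requires `pip install spacy` and a
# model, e.g. `python -m spacy download en_core_web_sm`).
spacy_splitter = SpacyTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = spacy_splitter.split_text(long_text)

# Markdown-aware splitting, which prefers to break on headings and
# paragraph boundaries rather than mid-sentence.
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=50)
md_chunks = md_splitter.split_text(markdown_text)
```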

@xingfanxia

second this

@alfasin commented Jun 13, 2023

I'm using RecursiveCharacterTextSplitter for generic text-splitting tasks:
https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html
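For example, something along these lines (a sketch; the separator list below is an assumption showing how to push sentence breaks to a last resort):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = open("document.txt").read()  # placeholder input

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    # Separators are tried in order, so paragraph and sentence boundaries
    # are preferred over arbitrary character-level cuts.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(long_text)
```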

@Anurag-38 commented Nov 20, 2023

> The simplest option looks to be the built-in SpacyTextSplitter, which uses spaCy, or maybe the NLTKTextSplitter. […]

Hi,
I wanted to ask about `SpacyTextSplitter`: can I use multiple separators, like a list of punctuation characters? I am giving an example below. If what I have shown is possible but I am not doing it right, please write the correct syntax.

Original working code:

```python
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator='\n', pipeline='sentencizer')
```

My expectation:

```python
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator=['\n', '=', ',', '&'], pipeline='sentencizer')
```

@arrowing commented Nov 28, 2023

> I wanted to ask about `SpacyTextSplitter`: can I use multiple separators, like a list of punctuation characters? […]

You can find the `SpacyTextSplitter` class in the LangChain code; it is implemented on top of the `TextSplitter` class. If you want to use a list of separators, you could rewrite the class along the lines of `PythonCodeTextSplitter`, which is implemented on `RecursiveCharacterTextSplitter`.

Some code like the following (a sketch; the `split_text` body is an assumption that re-splits over-long sentences on the separator list and then merges with the base class helper `_merge_splits`):

```python
from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def _make_spacy_pipeline_for_splitting(pipeline: str) -> Any:  # avoid importing spacy
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Spacy is not installed, please install it with `pip install spacy`."
        )
    if pipeline == "sentencizer":
        from spacy.lang.en import English

        sentencizer = English()
        sentencizer.add_pipe("sentencizer")
    else:
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
    return sentencizer


class SpacyTextSplitter(RecursiveCharacterTextSplitter):
    """Split text using the spaCy package.

    By default, spaCy's `en_core_web_sm` model is used. For faster but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separators: List[str] = ["\n\n"],
        pipeline: str = "en_core_web_sm",
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(separators=separators, **kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(pipeline)

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # Sketch: segment into sentences with spaCy, re-split any sentence
        # longer than chunk_size on the separator list via the parent class,
        # then merge everything into chunks of roughly chunk_size.
        pieces: List[str] = []
        for sent in self._tokenizer(text).sents:
            if len(sent.text) > self._chunk_size:
                pieces.extend(super().split_text(sent.text))
            else:
                pieces.append(sent.text)
        return self._merge_splits(pieces, " ")
```
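With a rewrite like that, the call from the question above would then look like this (hypothetical, following the sketch; `raw_text` is a placeholder):

```python
text_splitter = SpacyTextSplitter(
    separators=["\n", "=", ",", "&"],
    pipeline="sentencizer",
    chunk_size=900,
    chunk_overlap=50,
)
chunks = text_splitter.split_text(raw_text)
```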

@Anurag-38

> You can find the `SpacyTextSplitter` class in the LangChain code; it is implemented on top of the `TextSplitter` class. […]

Thanks a lot!!
