
How to split document in a smarter way #6

Open
RobinHerzog opened this issue Feb 16, 2023 · 6 comments

@RobinHerzog

Hello,

I understand that we need to split documents into smaller pieces because OpenAI cannot take the whole text as input.

However, my challenge is to cut the text in a smart way, so that it doesn't break in the middle of a sentence.

Any luck with that?

@boxabirds commented Feb 20, 2023

The simplest option looks to be the built-in SpacyTextSplitter, which uses spaCy, or maybe the NLTKTextSplitter. The former, I know for sure, has a huge footprint -- around 400 MB for the spaCy model. A nice patch to langchain would be to use sentence_splitter, which is super fast and efficient (it doesn't require a machine learning model). A little further investigation for this demo suggests the MarkdownTextSplitter might be the best bet.
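For illustration, a minimal sketch of how those built-in splitters are typically invoked (assuming a recent langchain version; `document.txt` and `README.md` are placeholder inputs):

```python
from langchain.text_splitter import MarkdownTextSplitter, SpacyTextSplitter

long_text = open("document.txt").read()    # placeholder input
markdown_text = open("README.md").read()   # placeholder input

# Sentence-aware splitting via spaCy (requires `pip install spacy` and a
# model, e.g. `python -m spacy download en_core_web_sm`).
spacy_splitter = SpacyTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = spacy_splitter.split_text(long_text)

# Markdown-aware splitting, which prefers to break on headings and
# paragraph boundaries rather than mid-sentence.
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=50)
md_chunks = md_splitter.split_text(markdown_text)
```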

@xingfanxia

second this

@alfasin commented Jun 13, 2023

I'm using RecursiveCharacterTextSplitter for generic text-splitting tasks:
https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html
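For example, something along these lines (a sketch; the separator list below is an assumption showing how to push sentence breaks to a last resort):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = open("document.txt").read()  # placeholder input

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    # Separators are tried in order, so paragraph and sentence boundaries
    # are preferred over arbitrary character-level cuts.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(long_text)
```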

@Anurag-38 commented Nov 20, 2023

> The simplest option looks to be the built-in SpacyTextSplitter, which uses spaCy, or maybe the NLTKTextSplitter. […]

Hi,
I wanted to ask about `SpacyTextSplitter`: can I use multiple separators, like a list of punctuation characters? I am giving an example below. If what I have shown is possible but I am not doing it right, please write the correct syntax.

Original working code:

```python
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator='\n', pipeline='sentencizer')
```

My expectation:

```python
text_splitter = SpacyTextSplitter(chunk_size=900, chunk_overlap=50, separator=['\n', '=', ',', '&'], pipeline='sentencizer')
```

@arrowing commented Nov 28, 2023

> I wanted to ask about `SpacyTextSplitter`: can I use multiple separators, like a list of punctuation characters? […]

You can find the `SpacyTextSplitter` class in the LangChain code; it is implemented on top of the `TextSplitter` class. If you want to use a list of separators, you could rewrite the class along the lines of `PythonCodeTextSplitter`, which is implemented on `RecursiveCharacterTextSplitter`.

Some code like the following (a sketch; the `split_text` body is an assumption that re-splits over-long sentences on the separator list and then merges with the base class helper `_merge_splits`):

```python
from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def _make_spacy_pipeline_for_splitting(pipeline: str) -> Any:  # avoid importing spacy
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Spacy is not installed, please install it with `pip install spacy`."
        )
    if pipeline == "sentencizer":
        from spacy.lang.en import English

        sentencizer = English()
        sentencizer.add_pipe("sentencizer")
    else:
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
    return sentencizer


class SpacyTextSplitter(RecursiveCharacterTextSplitter):
    """Split text using the spaCy package.

    By default, spaCy's `en_core_web_sm` model is used. For faster but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separators: List[str] = ["\n\n"],
        pipeline: str = "en_core_web_sm",
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(separators=separators, **kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(pipeline)

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # Sketch: segment into sentences with spaCy, re-split any sentence
        # longer than chunk_size on the separator list via the parent class,
        # then merge everything into chunks of roughly chunk_size.
        pieces: List[str] = []
        for sent in self._tokenizer(text).sents:
            if len(sent.text) > self._chunk_size:
                pieces.extend(super().split_text(sent.text))
            else:
                pieces.append(sent.text)
        return self._merge_splits(pieces, " ")
```
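With a rewrite like that, the call from the question above would then look like this (hypothetical, following the sketch; `raw_text` is a placeholder):

```python
text_splitter = SpacyTextSplitter(
    separators=["\n", "=", ",", "&"],
    pipeline="sentencizer",
    chunk_size=900,
    chunk_overlap=50,
)
chunks = text_splitter.split_text(raw_text)
```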

@Anurag-38

> You can find the `SpacyTextSplitter` class in the LangChain code; it is implemented on top of the `TextSplitter` class. […]

Thanks a lot!!
