# semchunk

`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than `langchain.text_splitter.RecursiveCharacterTextSplitter` (see How It Works 🔍) and over 70% faster than `semantic-text-splitter` (see the Benchmarks 📊).
## Installation

`semchunk` may be installed with `pip`:

```bash
pip install semchunk
```
## Usage

The code snippet below demonstrates how text can be chunked with `semchunk`:

```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool = True,
) -> list[str]
```
`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

`memoize` flags whether to memoise the token counter. It defaults to `True`.

This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed.
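Because `token_counter` may be any callable that maps a string to a token count, `semchunk` is not tied to a particular tokenizer. For instance, a naive whitespace-based counter (a hypothetical stand-in for `tiktoken`) works just as well:

```python
import semchunk

# Any callable that takes a string and returns a token count may be used.
word_counter = lambda text: len(text.split())

semchunk.chunk('The quick brown fox jumps over the lazy dog.', chunk_size=4, token_counter=word_counter)
# Expected: ['The quick brown fox', 'jumps over the lazy', 'dog.']
```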
## How It Works 🔍

`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size (a simplified sketch of this strategy follows the list below). In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached; and
- Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks.
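The following is a greatly simplified sketch of that split-and-merge strategy, using only whitespace as a splitter and skipping splitter reattachment; it is illustrative, not `semchunk`'s actual implementation:

```python
def chunk_simplified(text: str, chunk_size: int, token_counter) -> list[str]:
    # Base case: the text already fits within the chunk size.
    if token_counter(text) <= chunk_size:
        return [text]
    # 1. Split using the chosen splitter (here, just whitespace).
    splits = text.split()
    if len(splits) == 1:
        # Fallback: no whitespace left, so split in half by characters.
        splits = [text[:len(text) // 2], text[len(text) // 2:]]
    # 2. Recursively split any piece still over the chunk size.
    pieces = []
    for split in splits:
        pieces.extend(chunk_simplified(split, chunk_size, token_counter))
    # 3. Merge pieces back together until the chunk size is reached.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current} {piece}'.strip()
        if current and token_counter(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With a whitespace token counter (`lambda text: len(text.split())`) and `chunk_size=2`, this sketch reproduces the output of the usage example above.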
To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence (a hypothetical selection routine is sketched after the list):
- The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's `\s` character class);
- Sentence terminators (`.`, `?`, `!` and `*`);
- Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
- Sentence interrupters (`:`, `—` and `…`);
- Word joiners (`/`, `\`, `–`, `&` and `-`); and
- All other characters.
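As a hypothetical illustration of that precedence order (not `semchunk`'s actual code), a splitter-selection routine might look like this:

```python
import re

def find_splitter(text: str) -> str:
    # Whitespace splitters first: prefer the largest matching sequence.
    for pattern in (r'[\n\r]+', r'\t+', r'\s+'):
        if matches := re.findall(pattern, text):
            return max(matches, key=len)
    # Then non-whitespace splitters, in order of precedence.
    for group in ('.?!*', ';,()[]“”‘’\'"`', ':—…', '/\\–&-'):
        for splitter in group:
            if splitter in text:
                return splitter
    # All other characters: no meaningful splitter, so split anywhere.
    return ''
```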
`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
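For illustration, the effect of `memoize=True` is comparable to wrapping a token counter in `functools.lru_cache`, so that repeated substrings encountered during recursive splitting and merging are only tokenized once (a sketch of the idea, not `semchunk`'s internals):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_token_counter(text: str) -> int:
    # Stand-in for an expensive tokenizer call; results are cached per string,
    # so re-counting the same substring during chunking is effectively free.
    return len(text.split())
```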
## Benchmarks 📊

On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 24.41 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes `semantic-text-splitter` 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks, a difference of 77.35%.
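The following is a minimal sketch of how such a benchmark could be timed; it is not the actual benchmark script, and it assumes NLTK's Gutenberg Corpus has been downloaded with `nltk.download('gutenberg')`:

```python
import time

import nltk
import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')
token_counter = lambda text: len(encoder.encode(text))

# Time how long it takes to chunk every sample in the Gutenberg Corpus.
start = time.perf_counter()
for fileid in nltk.corpus.gutenberg.fileids():
    semchunk.chunk(nltk.corpus.gutenberg.raw(fileid), chunk_size=512, token_counter=token_counter)
print(f'Chunked the Gutenberg Corpus in {time.perf_counter() - start:.2f} seconds.')
```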
The code used to benchmark `semchunk` and `semantic-text-splitter` is available here.
## Licence

This library is licensed under the MIT License.