📈 performance optimisation #3
Conversation
Hey @R0bk,

For those users who do want an extra speed bump, it could be worth simply using a token counter that implements the heuristics you have developed. You could even use a wrapper function for arbitrary token counters. I will leave that decision to users, as I want to ensure the project remains independent of any one tokeniser or vocabulary.

One speed enhancement I would be able to merge is replacing the recursive calls to the chunking function with a stack-based implementation.

Two other suggestions I can give for speeding up chunking are to utilise multiprocessing if you aren't already (I've successfully used it to chunk corpora that would otherwise take hours to chunk in a matter of minutes) and to use my other Python library, persist-cache, to cache results across runs.
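For anyone who wants to try the multiprocessing suggestion, a minimal sketch is below. `count_tokens` and `chunk_document` are hypothetical stand-ins rather than this library's API; the only real constraint is that the function handed to the pool is defined at module level so it can be pickled.

```python
# Sketch: chunk a corpus in parallel across worker processes.
from multiprocessing import Pool

import tiktoken

_encoder = tiktoken.get_encoding("cl100k_base")
CHUNK_SIZE = 512


def count_tokens(text: str) -> int:
    return len(_encoder.encode(text))


def chunk_document(text: str) -> list[str]:
    # Stand-in chunker: greedily pack paragraphs up to CHUNK_SIZE tokens.
    # Swap this body out for the real chunking call you already use.
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if count_tokens(candidate) <= CHUNK_SIZE:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    corpus = ["first document ...", "second document ..."]
    with Pool() as pool:
        chunked_corpus = pool.map(chunk_document, corpus)
    print(sum(len(chunks) for chunks in chunked_corpus), "chunks produced")
```

Because each document is chunked independently, this parallelises cleanly on a machine with many cores.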
Hey @umarbutler,

Those are some good points. The first thing I actually went for was an adjustment to be stack based, but unfortunately, once I finished and verified that the results were equal, I found it had the same performance as the recursive version, if not slightly slower. The code was neater; however, the current code gains an advantage as it always splits first and then runs the tokeniser. Changing to be stack based (and with my impression of what neat code is), I had to run the tokeniser first, which added an extra pass across the entire input string and hence slowed it down. After profiling, I saw that all the time is spent either in the regex (~20%) or in the actual library call for the tokeniser (~75%), so the benefit of going stack based seemed pretty limited.

Also, on the heuristic I added, I must say it was a little bit hacky, even if performant, and you've got a good point about keeping this project independent of any one tokeniser/vocab.

Your post did inspire me to take another look, though, and see what's possible while preserving tokeniser/vocab independence. I've been able to find some reasonably big gains just from adjusting the binary search to do a simple guess for where to look next. The difference is larger at bigger batch sizes, but there is a real double-digit impact even at 512. I've put some benchmark times below and adjusted the code to also check for correctness between this build and the original to ensure there are no gaps.
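For illustration only (the actual commit may differ in detail), the "simple guess" idea looks roughly like this: when searching for how many splits fit under the token budget, place the first probe of the binary search where an assumed characters-per-token ratio predicts the boundary, rather than at the midpoint. `count_tokens` and the default `chars_per_token` ratio are assumptions for the example.

```python
from bisect import bisect_left
from itertools import accumulate

import tiktoken

_encoder = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    return len(_encoder.encode(text))


def fit_splits(splits: list[str], chunk_size: int, chars_per_token: float = 4.0) -> int:
    """Return the largest k such that ''.join(splits[:k]) is at most chunk_size tokens."""
    lo, hi = 0, len(splits)
    if hi == 0:
        return 0

    # Heuristic first probe: convert the token budget into a character budget
    # using an assumed chars-per-token ratio, then pick the prefix whose
    # cumulative length first reaches it. Correctness never depends on this guess.
    cumulative = list(accumulate(len(split) for split in splits))
    target_chars = chunk_size * chars_per_token
    probe = min(max(bisect_left(cumulative, target_chars) + 1, 1), hi)

    # Standard binary search for the boundary, except the first midpoint is the
    # guess above rather than (lo + hi + 1) // 2.
    while lo < hi:
        mid = probe if probe is not None else (lo + hi + 1) // 2
        probe = None
        if count_tokens("".join(splits[:mid])) <= chunk_size:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Because the guess only chooses the first probe, a bad ratio costs at most a few extra comparisons; correctness still comes from the binary search itself.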
You can check out the new commit here. I've been running it through some pretty large corpuses at a chunk size of 8192 and it's been performing well so far! (I've also been taking advantage of the diskcache library, but yours looks interesting; it certainly looks easier to manage. Do you know if it is faster too?)
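For reference, the diskcache pattern being referred to looks roughly like this (a sketch, not the actual setup used; the cache directory and encoding are arbitrary choices): memoise the token counter on disk so repeated splits never hit the tokeniser twice, even across runs.

```python
import tiktoken
from diskcache import Cache

_encoder = tiktoken.get_encoding("cl100k_base")
cache = Cache(".token_counts")  # arbitrary on-disk cache directory


@cache.memoize()
def count_tokens(text: str) -> int:
    # Each distinct string is tokenised once; repeats are served from disk.
    return len(_encoder.encode(text))
```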
Hey Rob,

RE diskcache, I'm not too sure; I haven't benchmarked it. I might when I have some spare time. persist-cache is unique in that it essentially uses the file system as a database, which means the biggest limitation in finding and saving keys is the OS and disk. Serialisation and hashing are super fast thanks to a custom fusion of msgpack and pickle (with msgspec being used for speedy msgpack serialisation) and xxhash. The only downside is that if you are storing millions of keys, it can take a really, really long time to delete the cache (since each key gets its own file on the disk).
The changes have just been merged into the latest release of the library.

I ran my profiler and it looks like most of the time is being spent in the token counter and tokeniser, followed by a bit of time spent in the splitting itself.

On further thought, I think your heuristics could be worked into the library. That will be my next task.
Hey @umarbutler,
You've made a really neat package here - performant and to the point.
Recently I've been running through some of the Australian regulatory landscape, and working on top of your codebase has been a big aid. The chunking was still one of the bottlenecks on my machine, so I took a pass at some further performance work.
I was able to find some ~25% performance gains (larger chunks held higher gains). Running a sweep across chunk_sizes, we get the results below (running on an M1 Max chip).
This also stretches out the lead over semantic_text_splitter by another 10% or so. It's worth noting that semantic_text_splitter seems to perform even worse for me than it did for you, though; I'm not sure if that's due to recent changes in their lib, x86 vs ARM, or noise.
I believe there's still further room for non-trivial gains: it seemed that the token_counter (using tiktoken as the example) took up about 80% of cycles. If we can reduce the number of calls to it, or the length of the average input, then double-digit gains are probably possible; a rough sketch of one such approach follows.
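As a hedged illustration (not the heuristics actually used here): when all the chunker needs is the comparison `token_count(text) <= chunk_size` rather than an exact count, cheap length-based checks can sometimes answer it without touching the tokeniser, falling back to the exact counter whenever they are not clearly decisive. The thresholds below are illustrative assumptions.

```python
import tiktoken

_encoder = tiktoken.get_encoding("cl100k_base")


def exact_token_count(text: str) -> int:
    return len(_encoder.encode(text))


def fits(text: str, chunk_size: int, max_chars_per_token: float = 6.0) -> bool:
    """Heuristically decide whether `text` is at most `chunk_size` tokens."""
    # Cheap "definitely too long" check: English prose rarely averages more
    # than ~6 characters per token, so a string longer than that budget in
    # characters almost certainly exceeds it in tokens. Heuristic, not a guarantee.
    if len(text) / max_chars_per_token > chunk_size:
        return False
    # Cheap "definitely fits" check: words in typical English average well
    # under two tokens each, so a word count under half the budget almost
    # always fits; the // 2 margin covers the slack. Again heuristic.
    if len(text.split()) <= chunk_size // 2:
        return True
    # Anything in between gets the exact (expensive) count.
    return exact_token_count(text) <= chunk_size
```

Both shortcuts are approximate, so a chunker relying on them trades a little exactness at chunk boundaries for fewer tokeniser calls, which matches the hacky-but-performant trade-off discussed above.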
Please feel free to suggest any comments or changes - thanks for making the package!