Provide arrow_counts method on volume to bypass pandas #39
Comments
This could be a function added to the |
I think it makes more sense as a |
In your example, you're deliberately not instantiating a volume, right? It could work in Volume. I'd love if the documentation was clear that it's for advanced users that only want that count out of the files. The reason: if they run Maybe call it |
Yeah, I deliberately avoided instantiating a volume there, just because that's the prototype for the method. Happy with fast_count; another naming option might be To be clear, it wouldn't be just for counts, though; it can return arrow tables with any desired columns. At a parquet backend, it would perform: read_parquet_or_feather({self.path}, columns = ['page', 'section', 'token', 'pos', 'count']), and at a json parser, it would do self._make_tokencounts_df(arrow = True)['page', 'section', 'token', 'pos', 'count']. Either of those would be faster than |
Loading from parquet or feather into pandas to create indices is time-consuming for tasks where you don't actually want the data in pandas (e.g., passing counts straight into TensorFlow or NumPy).
Using a basic benchmark of loading 52 random volumes and summing the wordcounts column, here's a comparison of two methods, timed in seconds. Using arrow.parquet.read_parquet to read straight into arrow format and summing word counts is almost 10x faster. Feather-based approaches can be another order of magnitude faster, but that's probably because they avoid unicode tasks altogether while parquet has to unpack. In real life, you have to do the unicode.
I'd propose that this method be a bit less user-oriented than the pandas ones: it would not support lowercasing, etc. It would be just a basic wrapper to pull out some columns, with the computation done elsewhere.
METHOD A
METHOD B
That is