
Provide arrow_counts method on volume to bypass pandas #39

Open
bmschmidt opened this issue Apr 15, 2021 · 4 comments

@bmschmidt
Contributor

Loading from parquet or feather into pandas to create indices is time-consuming for tasks where you don't actually want the data in pandas (e.g., passing counts straight into tensorflow or numpy).
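For example, with pyarrow alone a counts column can go straight to numpy without ever building a DataFrame. A minimal sketch (the file path is a stand-in):

    import numpy as np
    import pyarrow.feather as feather

    tab = feather.read_table("volume.feather", columns=["count"])  # stand-in path
    counts = tab["count"].to_numpy()  # ChunkedArray -> numpy, no DataFrame in between
    total = int(np.sum(counts))
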

Using a basic benchmark of loading 52 random volumes and summing the wordcount column, here's a comparison of two methods (times in seconds). Using pyarrow to read parquet straight into Arrow format and summing word counts there is almost 10x faster. Feather-based approaches can be another order of magnitude faster still, but that's probably because they avoid unicode handling altogether while parquet has to unpack strings; in real life, you have to do the unicode work.

I'd propose that this method be a bit less user-oriented than the pandas ones: no support for lowercasing, etc., just a basic wrapper to pull out some columns so computation can happen elsewhere.

METHOD A

      # assumes: from htrc_features import Volume
      v = Volume(htid, id_resolver=resolver)
      _ = v.tokenlist()['count'].sum()
feather with zstd takes 4.282841682434082
feather with lz4 takes 4.423848867416382
feather with None takes 3.9383790493011475
parquet with snappy takes 4.217001914978027
parquet with gzip takes 4.160176992416382
parquet with brotli takes 4.148128986358643
parquet with lz4 takes 4.411034345626831
parquet with zstd takes 4.238173246383667

METHOD B

        # assumes: import pyarrow.compute as pc
        z = read_parquet_or_feather(path, columns=['token', 'count'])  # format-agnostic reader
        vol_sum = pc.sum(z['count']).as_py()
        if vol_sum:
            total += vol_sum
feather with zstd takes 0.10278081893920898 counting 9222235 in 52 vols
feather with lz4 takes 0.06820297241210938 counting 9222235 in 52 vols
feather with gz takes 0.0074291229248046875 counting 0 in 0 vols
feather with None takes 0.034819841384887695 counting 9222235 in 52 vols
parquet with snappy takes 0.4386928081512451 counting 9222235 in 52 vols
parquet with gzip takes 0.5488269329071045 counting 9222235 in 52 vols
parquet with brotli takes 0.5444929599761963 counting 9222235 in 52 vols
parquet with lz4 takes 0.4098381996154785 counting 9222235 in 52 vols
parquet with zstd takes 0.4021151065826416 counting 9222235 in 52 vols
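
The loop around that snippet looks roughly like this (volume_paths and the timing scaffolding are a reconstruction, not the exact harness):

    import time
    import pyarrow.compute as pc

    start = time.time()
    total, n_vols = 0, 0
    for path in volume_paths:  # 52 random volumes in the format/compression under test
        z = read_parquet_or_feather(path, columns=['token', 'count'])
        vol_sum = pc.sum(z['count']).as_py()
        if vol_sum:
            total += vol_sum
            n_vols += 1
    print(f"takes {time.time() - start} counting {total} in {n_vols} vols")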

That is, bypassing pandas is roughly a 10x win for parquet-backed volumes and closer to 100x for uncompressed feather.

@organisciak
Collaborator

This could be a function added to the utils module, perhaps?

@bmschmidt
Contributor Author

I think it makes more sense as a volume method? It's just a different kind of tabular-format function, and as valid for JSON-backed files as anything else.

@organisciak
Collaborator

In your example, you're deliberately not instantiating a volume, right?

It could work in Volume. I'd love it if the documentation were clear that it's for advanced users who only want that count out of the files. The reason: if they run Volume.arrow_counts() and then ask for any more advanced token info, they'll still end up instantiating and caching the tokenlist. Since the primary purpose of this library is scaffolding between EF and pandas, I expect that wouldn't be an uncommon situation.

Maybe call it fast_count()? If I saw that method, I'd understand what it does, then read the docs to figure out what the catch is 😄
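
Concretely, the case I'm picturing (names hypothetical at this point):

    v = Volume(htid)
    counts = v.fast_count()   # quick, column-pruned Arrow read; nothing cached
    tl = v.tokenlist()        # pays the full parse-and-cache cost anyway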

@bmschmidt
Contributor Author

bmschmidt commented Apr 19, 2021

Yeah, deliberately avoiding instantiating a volume there, just because that's the prototype for the method.

Happy with fast_count; another naming option might be raw_counts?

To be clear, it wouldn't be just for counts; it could return Arrow tables with any desired columns. With a parquet backend, it would perform:

    read_parquet_or_feather(self.path, columns=['page', 'section', 'token', 'pos', 'count'])

and with a JSON backend, it would do

    self._make_tokencounts_df(arrow=True).select(['page', 'section', 'token', 'pos', 'count'])

Either of those would be faster than self.tokencounts_df().
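
Pulled together, a rough sketch of the method (the format attribute and the dispatch logic are guesses at the internals, not actual library code):

    TOKEN_COLS = ['page', 'section', 'token', 'pos', 'count']

    def fast_count(self, columns=TOKEN_COLS):
        """Return token counts as a pyarrow Table, bypassing pandas.

        Advanced use: no lowercasing or other tokenlist() conveniences,
        and the result is not cached on the volume.
        """
        if self.format in ('parquet', 'feather'):
            # Column-pruned read straight into Arrow.
            return read_parquet_or_feather(self.path, columns=columns)
        # JSON-backed volumes still pay the parse cost but skip the pandas index build.
        return self._make_tokencounts_df(arrow=True).select(columns)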
