Using memoryviews for decompression #96

Closed
jakirkham opened this issue May 1, 2020 · 25 comments

Comments

@jakirkham
Contributor

Currently decompression requires bytes objects here and here. This means if users have an mmap or other Python object that otherwise acts like an array, they must coerce it to a bytes object, which requires a copy. To avoid this it would be better if these functions took a memoryview (in particular uint8_t[::1]). This would still support a bytes object when passed and would still behave like an array in the code. Most importantly it would allow users to pass these other array-like objects without having to copy to a bytes object first.
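
As a rough illustration of the copy being described, here is a minimal pure-Python sketch. It uses an anonymous mmap as a stand-in for a memory-mapped chunk of compressed data (the contents are made up) and is not zfpy code. Coercing to bytes duplicates the data, while a memoryview wraps the existing buffer.

```python
import mmap

# Anonymous memory map standing in for a memory-mapped chunk of compressed
# data (hypothetical contents; a real store would map a file instead).
mm = mmap.mmap(-1, 4096)
mm[:4] = b"zfp\x00"

copied = bytes(mm)       # allocates a new 4096-byte object and copies into it
view = memoryview(mm)    # wraps the existing mapping; no copy is made

# A later write to the mapping is visible through the view but not the copy.
mm[0:1] = b"Z"
view_first = view[0]     # reads the mapping itself
copy_first = copied[0]   # reads the earlier snapshot

view.release()           # release the export before closing the mapping
mm.close()
```

After the write, `view_first` reflects the new byte and `copy_first` still holds the old one, which is what shows the view shares memory with the mapping.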

cc @halehawk @rabernat

@rabernat

rabernat commented May 1, 2020

Just wanted to leave my 👍 -- this would be a simple change with important performance benefits.

@jakirkham
Contributor Author

Also apologies if this is already known to the authors here, but this doc provides some background on using memoryviews in Cython. As Ryan notes this is a simple change (guessing ~4 lines after peeking at the code).

@lindstro
Member

lindstro commented May 1, 2020

Thanks for the suggestion. We will look into it and see what can be done.

@halehawk
Contributor

halehawk commented May 1, 2020 via email

@jakirkham
Contributor Author

jakirkham commented May 1, 2020

To be clear, we are discussing decompression, so it's not a bytes object being converted to an array-like. It's an array-like being converted to bytes that's the issue. Second, mmap is merely one array-like. There are others.

It depends on what Store users select in Zarr. If they use LMDB, this would come up for example. Potentially other backends use this. Additionally there has been interest in using memory mapping with directory storage.

Also it depends on what a user's compression/filter pipeline looks like for their data. If there are other steps that come before zfp that produce something else like a NumPy array, this may come up.

The main point here is that users turn to compression (at least in the Zarr case) because memory usage is a concern. So avoiding copies when they are not needed is important to keep memory usage to a minimum.

@halehawk
Contributor

@lindstro Do you have any update on this issue? "bytes" is not a good type for passing a buffer, since "In the case that the API only deals with byte strings, i.e. binary data or encoded text, it is best not to type the input argument as something like bytes, because that would restrict the allowed input to exactly that type and exclude both subtypes and other kinds of byte containers, e.g. bytearray objects or memory views." So if we want to pass a NumPy array to the decompress API, it always has to be converted to a bytes object first, which is inefficient for large buffers.
Could you please replace "bytes" with "uint8_t[::1]" in both decompress APIs?
I am trying to experiment with that in my fork, but I don't know how to test the modification. Could you please point me in the right direction?
@jakirkham @rabernat

@jakirkham
Contributor Author

My guess is @lindstro, et al. would accept a PR @halehawk (if you want to give it a try 😉). My guess is this is a 3 line change.

In case it helps, uint8_t is defined here. So would require a cimport from libc.stdint. Though one could also just use unsigned char if that's easier.

@jakirkham
Contributor Author

Would add that it looks like this script contains their CI build process. Maybe that provides a good starting place for building things locally?

@lindstro
Member

@halehawk Would be great if you could experiment with this and submit a PR. I assume your question about testing is directed at the numcodecs folks. If not, the zfpy tests are in zfp/tests/python on the develop branch.

@halehawk
Contributor

halehawk commented Sep 17, 2020

My guess is @lindstro, et al. would accept a PR @halehawk (if you want to give it a try 😉). My guess is this is a 3 line change.

In case it helps, uint8_t is defined here. So would require a cimport from libc.stdint. Though one could also just use unsigned char if that's easier.

@jakirkham Can I use "char *"? Can the return value of ensure_ndarray be passed as a char* to _decompress? Or do I need to get the pointer from the return value?

@jakirkham
Contributor Author

jakirkham commented Sep 17, 2020

Would use uint8_t[::1]. That will still accept bytes objects, but will also accept NumPy arrays that are 1-D uint8 contiguous arrays.

Edit: It's also possible to cast raw pointers into uint8_t[::1]. This section of the Cython docs may help.
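
For readers less familiar with what a `uint8_t[::1]` argument implies, the following pure-Python sketch mirrors the shape requirements (1-D, unsigned bytes, C-contiguous) using the built-in memoryview. It is an analogy only, not Cython and not zfpy's code; both helper names are hypothetical. It also leaves out writability, which a non-const Cython memoryview additionally requests.

```python
from array import array


def matches_uint8_1d(obj):
    """Check the layout a `uint8_t[::1]` argument implies (hypothetical helper)."""
    mv = memoryview(obj)
    return mv.ndim == 1 and mv.format == "B" and mv.c_contiguous


def as_uint8_1d(obj):
    """Reinterpret a C-contiguous buffer as flat unsigned bytes, without copying."""
    return memoryview(obj).cast("B")


doubles = array("d", [1.0, 2.0, 3.0])   # a buffer whose items are not bytes
flat = as_uint8_1d(doubles)             # same memory, viewed as 24 unsigned bytes
```

Here `matches_uint8_1d(doubles)` is false because the item format is `d`, while the cast view `flat` passes the same check and still points at the array's memory.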

@halehawk
Contributor

halehawk commented Sep 17, 2020 via email

@jakirkham
Contributor Author

Guessing that's PR ( #106 )? Thanks @halehawk! 😄 Made a couple minor suggestions.

@halehawk
Contributor

halehawk commented Sep 17, 2020 via email

@lindstro
Member

Run ctest -V to see the full output. If the tests pass on your machine but not on Travis, then we might need to run an interactive Travis session to figure out which tests fail.

@halehawk
Contributor

halehawk commented Sep 18, 2020 via email

@lindstro
Member

I've got the error log on CDash:

test_advanced_decompression_checksum (__main__.TestNumpy) ... ERROR
test_advanced_decompression_nonsquare (__main__.TestNumpy) ... ERROR
test_different_dimensions (__main__.TestNumpy) ... ERROR
test_different_dtypes (__main__.TestNumpy) ... ERROR
test_utils (__main__.TestNumpy) ... ERROR

======================================================================
ERROR: test_advanced_decompression_checksum (__main__.TestNumpy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_numpy.py", line 76, in test_advanced_decompression_checksum
    **compression_kwargs
  File "zfpy.pyx", line 249, in zfpy._decompress (/home/travis/build/LLNL/zfp/build/python/zfpy.c:252)
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (/home/travis/build/LLNL/zfp/build/python/zfpy.c:616)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (/home/travis/build/LLNL/zfp/build/python/zfpy.c:323)
BufferError: Object is not writable.

======================================================================
ERROR: test_advanced_decompression_nonsquare (__main__.TestNumpy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_numpy.py", line 104, in test_advanced_decompression_nonsquare
    out= decompressed_array,
  File "zfpy.pyx", line 249, in zfpy._decompress (/home/travis/build/LLNL/zfp/build/python/zfpy.c:252)
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (/home/travis/build/LLNL/zfp/build/python/zfpy.c:616)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (/home/travis/build/LLNL/zfp/build/python/zfpy.c:323)
BufferError: Object is not writable.

======================================================================
ERROR: test_different_dimensions (__main__.TestNumpy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_numpy.py", line 24, in test_different_dimensions
    self.lossless_round_trip(c_array)
  File "test_numpy.py", line 17, in lossless_round_trip
    decompressed_array = zfpy.decompress_numpy(compressed_array)
  File "zfpy.pyx", line 333, in zfpy.decompress_numpy (/home/travis/build/LLNL/zfp/build/python/zfpy.c:332)
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (/home/travis/build/LLNL/zfp/build/python/zfpy.c:616)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (/home/travis/build/LLNL/zfp/build/python/zfpy.c:323)
BufferError: Object is not writable.

======================================================================
ERROR: test_different_dtypes (__main__.TestNumpy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_numpy.py", line 38, in test_different_dtypes
    self.lossless_round_trip(array)
  File "test_numpy.py", line 17, in lossless_round_trip
    decompressed_array = zfpy.decompress_numpy(compressed_array)
  File "zfpy.pyx", line 333, in zfpy.decompress_numpy (/home/travis/build/LLNL/zfp/build/python/zfpy.c:332)
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (/home/travis/build/LLNL/zfp/build/python/zfpy.c:616)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (/home/travis/build/LLNL/zfp/build/python/zfpy.c:323)
BufferError: Object is not writable.

======================================================================
ERROR: test_utils (__main__.TestNumpy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_numpy.py", line 186, in test_utils
    compressed_array,
  File "zfpy.pyx", line 333, in zfpy.decompress_numpy (/home/travis/build/LLNL/zfp/build/python/zfpy.c:332)
  File "stringsource", line 616, in View.MemoryView.memoryview_cwrapper (/home/travis/build/LLNL/zfp/build/python/zfpy.c:616)
  File "stringsource", line 323, in View.MemoryView.memoryview.__cinit__ (/home/travis/build/LLNL/zfp/build/python/zfpy.c:323)
BufferError: Object is not writable.

----------------------------------------------------------------------
Ran 5 tests in 17.626s

FAILED (errors=5)
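
Every traceback in the log ends in the memoryview constructor. One likely reading, offered as an assumption rather than a confirmed diagnosis, is that the tests pass bytes objects, which export read-only buffers, to an argument typed as a non-const memoryview, which requests a writable one. The pure-Python sketch below (helper names are hypothetical) shows the read-only versus writable distinction. Python reports the problem as a TypeError at write time, whereas the Cython code in the log fails with BufferError when the buffer is acquired.

```python
def is_writable(obj):
    """Report whether obj exports a writable buffer."""
    with memoryview(obj) as mv:
        return not mv.readonly


def try_write(obj):
    """Attempt an in-place write; return the error name, or None on success."""
    with memoryview(obj) as mv:
        try:
            mv[0] = 0
        except TypeError as exc:
            return type(exc).__name__
    return None


immutable = b"\x01\x02\x03"        # bytes: read-only buffer
mutable = bytearray(immutable)     # bytearray: writable buffer, same contents
```

In Cython, declaring the argument as `const uint8_t[::1]` is the documented way to accept read-only buffers such as bytes.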

@jakirkham
Contributor Author

Right, hence my question here ( #106 (comment) ).

@halehawk
Contributor

halehawk commented Sep 18, 2020 via email

@halehawk
Contributor

halehawk commented Sep 18, 2020 via email

@halehawk
Contributor

halehawk commented Sep 18, 2020 via email

@lindstro
Member

I can't tell from the logs. You could add a line to travis.sh to find out.

@halehawk
Contributor

halehawk commented Sep 18, 2020 via email

@jakirkham
Contributor Author

Would take a look at using fused types to allow dispatching between const and non-const variants. Here's an example in Cython's tests.

@jakirkham
Contributor Author

Closing now that PR ( #106 ) is in.
