[Python] Arrow IPC 30x slower than Numpy's, and MemoryMappedFile is slower than OSFile #44121

u3Izx9ql7vW4 commented Sep 15, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I was looking around the internet for why Arrow's IPC write is essentially the speed of going to disk, came across a Stack Overflow post from 2022, and ran its benchmark on my machine. I was amazed at the disparity; the scripts extracted from the post are below.

The following script, using NumPy, took ~0.02 s on my machine.

import numpy as np
import time
import ctypes

from multiprocessing import sharedctypes

data = np.ones([1, 1, 544, 192], dtype=np.float32)

capacity = 1000 * 1 * 544 * 192 * 10

buffer = sharedctypes.RawArray(ctypes.c_uint8, capacity + 1)
ndarray = np.ndarray((capacity,), dtype=np.uint8, buffer=buffer)

cur_offset = 0

t = time.time()
for i in range(1000):
    data = np.frombuffer(data, dtype=np.uint8)
    data_size = data.shape[0]
    ndarray[cur_offset:data_size + cur_offset] = data
    cur_offset += data_size
e = time.time()

print(e - t)

The script below, using PyArrow, took ~0.8 s on my machine.

import numpy as np
import pyarrow as pa
import time
import os

data = np.ones((1, 1, 544, 992), dtype=np.float32)

tensor = pa.Tensor.from_numpy(data)

path = os.path.join(str("./"), 'pyarrow-tensor-ipc-roundtrip')
mmap = pa.create_memory_map(path, 5000000 * 1000)

s = time.time()
for i in range(1000):
    result = pa.ipc.write_tensor(tensor, mmap)
e = time.time()

print(e - s)

output_stream = pa.BufferOutputStream()

s = time.time()
for i in range(1000):
    result = pa.ipc.write_tensor(tensor, output_stream)
e = time.time()

print(e - s)

Surprisingly, the second one using BufferOutputStream is 2x slower than the first one using the memory map. I also tried replacing the path with /dev/shm/, which is backed by memory (tmpfs), and that sped things up to 0.6 s. It's as if create_memory_map isn't using memory mapping at all. In fact, if you swap out

mmap = pa.create_memory_map(path, 5000000 * 1000)

with

mmap = pa.OSFile(path, 'wb')

you'll decrease the write time by half! What's causing this?
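For completeness, here is the full OSFile variant I'm describing, as a minimal sketch (the output filename is arbitrary; the tensor and loop count are the same as in the script above):

import numpy as np
import pyarrow as pa
import time

data = np.ones((1, 1, 544, 992), dtype=np.float32)
tensor = pa.Tensor.from_numpy(data)

path = "./pyarrow-tensor-ipc-osfile"
sink = pa.OSFile(path, 'wb')  # plain OS file instead of create_memory_map()

s = time.time()
for i in range(1000):
    result = pa.ipc.write_tensor(tensor, sink)
e = time.time()
sink.close()

print(e - s)

This is the swap that halves the write time in my runs.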

os         Ubuntu 24
arrow      1.3.0
pyarrow    16.1.0

Component(s)

Python
