BUG: passing in large data object to DataFrame apply() function resulting in SegFault #6641
Thanks for reporting this @SiRumCz. Do you have a more detailed stack trace? I'm unable to reproduce the error locally, and I don't have access to the server hardware that you do.
@noloerino Hi, as I mentioned in the description, there isn't much useful information in the terminal because the program did not exit with an error code; it was killed unexpectedly. But if you think logs from Ray could help, please let me know where and how I can pull them for you, as I am not very familiar with Ray clusters.

Tested code and result:

```python
>>> n_features = 20000
>>> n_samples = 20000
>>> features = [f'f{i+1}' for i in range(n_features)]
>>> samples = [f's{i+1}' for i in range(n_samples)]
>>> df = pd.DataFrame(np.random.rand(n_samples, n_features).round(2), columns=features, index=samples)
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
>>> mat_arr = np.random.rand(n_features, n_features).round(2)
>>> df.dot(mat_arr)
Segmentation fault (core dumped)
```

I would suggest trying with Ray clusters, even on a smaller scale. It seems like the problem happens when the result is being transferred back to the Ray client after being computed on a remote node, because when I test the same code on 5000 x 5000, 10K x 10K, and 15K x 15K data, it all finishes without any error.
@SiRumCz could you run the following code and show the output here?

```python
import modin.pandas as pd
import numpy as np

n_features = 20000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]
df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples).round(2)
print(f"{df._query_compiler._modin_frame._partitions.shape=}")

from modin.config import CpuCount, NPartitions
print(f"{CpuCount.get()=}, {NPartitions.get()=}")
```

UPD: could you also show
@anmyachev Thanks for the update, here's the output:

```python
>>> print(f"{df._query_compiler._modin_frame._partitions.shape=}")
df._query_compiler._modin_frame._partitions.shape=(48, 48)
>>> from modin.config import CpuCount, NPartitions
>>> print(f"{CpuCount.get()=}, {NPartitions.get()=}")
CpuCount.get()=16, NPartitions.get()=48
```

On both of my py3.8 and py3.10 venvs, the version is
@SiRumCz it would be great if you could also provide the following information:
@anmyachev I have an update; it seems like the problem is not from

```python
import modin.pandas as pd
import numpy as np

n_features = 20000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]
df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples).round(2)
mat_df = pd.DataFrame(np.random.rand(n_features, n_features), columns=features, index=features).round(2)
res_df = df.apply(lambda x, m: m.iloc[0], axis=1, m=mat_df)  # this will still get seg fault
```

Error message:

```python
>>> df.apply(lambda x, m: m.head(1), axis=1, m=mat_df)
UserWarning: Ray Client is attempting to retrieve a 2.99 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
Segmentation fault (core dumped)
```

The problem seems to occur during or after the client requests the object over the network.

Update:

```python
>>> df.apply(lambda x: x+1)
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 ... f19987 f19988 f19989 f19990 f19991 f19992 f19993 f19994 f19995 f19996 f19997 f19998 f19999 f20000
s1 1.87 1.05 1.88 1.67 1.84 1.12 1.58 1.68 1.33 1.81 1.21 1.33 1.14 1.93 ... 1.89 1.96 1.40 1.08 1.27 1.13 1.81 1.77 1.78 1.56 1.39 1.90 1.15 1.59
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
s20000 1.28 1.24 1.66 1.27 1.47 1.07 1.06 1.24 1.84 1.57 1.90 1.36 1.01 1.08 ... 1.25 1.44 1.34 1.78 1.94 1.73 1.52 1.60 1.87 1.47 1.07 1.43 1.69 1.47
[20000 rows x 20000 columns]
>>> df.apply(lambda x, m: x+1, m=mat_df)
UserWarning: Ray Client is attempting to retrieve a 2.99 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
Segmentation fault (core dumped)
```

The matrix parameter in the apply func doesn't need to be a DataFrame; I tried with a numpy array and it has the same problem. My guess is that the param data cannot be too large, otherwise it triggers some mechanism and I get the seg fault.
@modin-project/modin-ray maybe you can tell us which direction to look in order to understand the reason for this segfault?
@anmyachev I have enabled

The problem seems to be from a package
@anmyachev I can make it work by not setting RAY_ADDRESS; Modin/Ray seems to be able to pick up the address when I am running the client code on one of the Ray nodes. However, it is worth noting that I am seeing different ports after this change: before it was on port 10001, and now 6379 is being picked up.

After:
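For context, those two ports correspond to two different ways of attaching a driver to the cluster. A minimal sketch of both (the head-node address is a placeholder, and this is an illustration rather than the exact code used here):

```python
import ray

# Mode 1: Ray Client -- the driver runs outside the cluster and talks to the
# head node's Ray Client server (default port 10001) over gRPC. This is what
# RAY_ADDRESS="ray://<head-node>:10001" selects.
# ray.init("ray://<head-node>:10001")

# Mode 2: direct attach -- run the driver on one of the cluster's own nodes
# and join the existing cluster; Ray discovers the GCS (default port 6379)
# automatically instead of starting a new local cluster.
ray.init(address="auto")

import modin.pandas as pd  # imported after ray.init so Modin reuses this cluster
```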
@SiRumCz good catch! Let's ensure (if needed) you are connecting to an existing cluster (instead of creating a new one), use

Thanks for all the information! I'm working to figure out what's wrong.
@anmyachev thanks for the advice. Just to add to the solutions: while it works, I am seeing some activity that's kind of confusing:

Although I am seeing these messages, my code is still able to give me the result.
@SiRumCz It turns out there are two modes of operation: connecting to a cluster through the Ray Client, and connecting without it. Perhaps connecting through the client does not work for Modin due to some architectural restrictions (which are mentioned in the note), but which exactly I don't know yet.
@anmyachev thanks for the explanation,
@anmyachev I mean this general note:

> Ray Client has architectural limitations and may not work as expected when using Ray for ML workloads (like Ray Tune or Ray Train). Use [Ray Jobs API](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#jobs-overview) for interactive development on ML projects.

I don't know the details yet, but maybe @modin-project/modin-ray know.
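For what it's worth, a sketch of what the Ray Jobs API route from that note could look like via Ray's job submission SDK; the address and script name are placeholders, not something taken from this thread:

```python
from ray.job_submission import JobSubmissionClient

# 8265 is the default Ray dashboard port, which also serves the Jobs API.
client = JobSubmissionClient("http://<head-node>:8265")

# Ship the current directory to the cluster and run the script there,
# so the driver runs inside the cluster and the Ray Client path is avoided.
job_id = client.submit_job(
    entrypoint="python my_modin_script.py",  # hypothetical script name
    runtime_env={"working_dir": "."},
)
print(job_id)
```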
@SiRumCz can you install the same
@anmyachev by
@SiRumCz not
@anmyachev That was my thought, but upon checking the gRPC installation page, it seems to be language specific, and for Python it is

If I am to install a grpc package, do you have any idea which one to install?
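For reference: the Python gRPC runtime is published on PyPI as `grpcio` but imported as `grpc`, and Ray depends on it directly. One way to compare versions between the client machine and the cluster nodes:

```python
# Run this in the same venv on the client and on each Ray node;
# the grpcio and ray versions should match across all of them.
import grpc
import ray

print("grpcio:", grpc.__version__)
print("ray:", ray.__version__)
```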
@SiRumCz looks like I found a pretty close issue in Ray itself: ray-project/ray#38713. There's also a segmentation fault when one tries to put a large object through

So let's assume that the environment is correct and just try to avoid connecting through the Ray Client for now.
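A minimal sketch of the pattern that linked issue describes, assuming a remote cluster reached through the Ray Client (the address is a placeholder, and `ray.put` stands in for whatever call ends up transferring the large object):

```python
import numpy as np
import ray

# Placeholder address; 10001 is the default Ray Client server port on the head node.
ray.init("ray://<head-node>:10001")

big = np.random.rand(20000, 20000)  # ~3 GiB of float64, the size reported in this issue
ref = ray.put(big)  # per ray-project/ray#38713, this can crash when sent over the Ray Client
```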
@anmyachev good find!
Blocked by ray-project/ray#38713
Modin version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest released version of Modin.
- [x] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
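A reproducer, assembled from the snippets shared in the comments above:

```python
import modin.pandas as pd
import numpy as np

n_features = 20000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]
df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples).round(2)
mat_df = pd.DataFrame(np.random.rand(n_features, n_features), columns=features, index=features).round(2)

df.dot(mat_df)  # Segmentation fault (core dumped) when connected through the Ray Client
```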
Issue Description

Getting

```
Segmentation fault (core dumped)
```

when doing matrix multiplication on 20K x 20K matrices. I have two servers, each with 16 threads, >50GB available RAM, and >300GB available storage. The code was tested on both my py3.8 and py3.10 environments, and both showed the same result.

The code does a dot() on two 20K x 20K matrices; each matrix is roughly 3GB (as shown in the object store memory on one of the nodes). The seg fault happens when calling `df.dot(mat_df)`. However, the code works on small matrices like 500 x 500.

There's not much to show on the terminal; basically it just says

```
UserWarning: Ray Client is attempting to retrieve a 3.00 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
```

and the program gets killed. I have tried increasing my object store memory and also specifying a plasma directory, and I am still getting the seg fault. The problem seems to happen after the computation has finished on the remote node, while the result is being transferred back to the client: I saw an increase to 12GB in the object store memory and around a 3GB increase in the client's main memory (when the warning message shows up), and right when the main memory increased by 3GB, the program got killed.
Expected Behavior
Should return the transformed df.
Error Logs
Installed Versions
UserWarning: Setuptools is replacing distutils.
```
INSTALLED VERSIONS
commit : 4c01f64
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-84-generic
Version : #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.24.1
ray : 2.7.1
dask : 2023.9.3
distributed : 2023.9.3
hdk : None
pandas dependencies
pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
```