SNOW-1561999: Inconsistent Results with random_split on Large DataFrames with Fixed Seed #1991

Open
robertlessmore opened this issue Jul 29, 2024 · 2 comments
Labels: bug (Something isn't working), status-triage_done (Initial triage done, will be further handled by the driver team)

Comments

@robertlessmore

  1. What version of Python are you using?

Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]

  2. What operating system and processor architecture are you using?

Windows-10-10.0.22631-SP0

  3. What are the component versions in the environment?

Name Version Build Channel

asn1crypto 1.5.1 pyhd8ed1ab_0 conda-forge
brotli-python 1.1.0 py311h12c1d0e_1 conda-forge
bzip2 1.0.8 h2466b09_7 conda-forge
ca-certificates 2024.7.4 h56e8100_0 conda-forge
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge
cffi 1.16.0 py311ha68e1ae_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
cryptography 42.0.8 py311hfd75b31_0 conda-forge
filelock 3.15.4 pyhd8ed1ab_0 conda-forge
h2 4.1.0 pyhd8ed1ab_0 conda-forge
hpack 4.0.0 pyh9f0ad1d_0 conda-forge
hyperframe 6.0.1 pyhd8ed1ab_0 conda-forge
idna 3.7 pyhd8ed1ab_0 conda-forge
intel-openmp 2024.2.0 h57928b3_980 conda-forge
libblas 3.9.0 23_win64_mkl conda-forge
libcblas 3.9.0 23_win64_mkl conda-forge
libexpat 2.6.2 h63175ca_0 conda-forge
libffi 3.4.2 h8ffe710_5 conda-forge
libhwloc 2.11.1 default_h8125262_1000 conda-forge
libiconv 1.17 hcfcfb64_2 conda-forge
liblapack 3.9.0 23_win64_mkl conda-forge
libsqlite 3.46.0 h2466b09_0 conda-forge
libxml2 2.12.7 h0f24e4e_4 conda-forge
libzlib 1.3.1 h2466b09_1 conda-forge
mkl 2024.1.0 h66d3029_694 conda-forge
numpy 2.0.1 py311h35ffc71_0 conda-forge
openssl 3.3.1 h2466b09_2 conda-forge
packaging 24.1 pyhd8ed1ab_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
pycparser 2.22 pyhd8ed1ab_0 conda-forge
pyjwt 2.8.0 pyhd8ed1ab_1 conda-forge
pyopenssl 24.2.1 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyh0701188_6 conda-forge
python 3.11.9 h631f459_0_cpython conda-forge
python_abi 3.11 4_cp311 conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pyyaml 6.0.1 py311ha68e1ae_1 conda-forge
requests 2.32.3 pyhd8ed1ab_0 conda-forge
setuptools 71.0.4 pyhd8ed1ab_0 conda-forge
snowflake-connector-python 3.11.0 py311hcf9f919_0 conda-forge
snowflake-snowpark-python 1.20.0 py311h1ea47a8_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
tbb 2021.12.0 hc790b64_3 conda-forge
tk 8.6.13 h5226925_1 conda-forge
tomlkit 0.13.0 pyha770c72_0 conda-forge
typing-extensions 4.12.2 hd8ed1ab_0 conda-forge
typing_extensions 4.12.2 pyha770c72_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
ucrt 10.0.22621.0 h57928b3_0 conda-forge
urllib3 2.2.2 pyhd8ed1ab_1 conda-forge
vc 14.3 h8a93ad2_20 conda-forge
vc14_runtime 14.40.33810 ha82c5b3_20 conda-forge
vs2015_runtime 14.40.33810 h3bf8584_20 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
win_inet_pton 1.1.0 pyhd8ed1ab_6 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
yaml 0.2.5 h8ffe710_2 conda-forge
zstandard 0.23.0 py311h53056dc_0 conda-forge
zstd 1.5.6 h0ea2cb4_0 conda-forge

  4. What did you do?

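# utils is a local helper module (not part of Snowpark); it returns a Snowpark Session
from utils import get_session
session = get_session.session()

# session.range(1, 10**7) excludes the end value, so df_range has 10**7 - 1 rows
df_range = session.range(1, 10**7).to_df("a")
# Split the same DataFrame twice with identical weights and seed
train_1, test_1 = df_range.random_split([0.5, 0.5], seed=42)
train_2, test_2 = df_range.random_split([0.5, 0.5], seed=42)

# Count the rows shared by each pair of parts across the two runs
train_train = train_1.join(train_2, on="a", how="inner")
train_test = train_1.join(test_2, on="a", how="inner")
test_train = test_1.join(train_2, on="a", how="inner")
test_test = test_1.join(test_2, on="a", how="inner")

print(train_train.count(), train_test.count(), test_train.count(), test_test.count())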


output:
2819263 2181740 2181520 2817476
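
A quick check on these numbers, as a sketch that assumes every row of df_range lands in exactly one part of each split: the four counts should then sum to the row count of session.range(1, 10**7), which is 10**7 - 1, and a reproducible split would put every row in the same part both times. The variable names below are illustrative only.

```python
# Counts reported above, keyed by (part in first split, part in second split).
counts = {
    ("train", "train"): 2819263,
    ("train", "test"): 2181740,
    ("test", "train"): 2181520,
    ("test", "test"): 2817476,
}

# Every row should appear in exactly one of the four join results.
total = sum(counts.values())

# Rows that landed in the same part in both runs.
same_side = counts[("train", "train")] + counts[("test", "test")]
agreement = same_side / total

print(total)                # 9999999
print(round(agreement, 4))  # 0.5637
```

So the two runs cover all 10**7 - 1 rows, but only about 56% of the rows are assigned to the same part both times. A reproducible split would give 100%.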

  5. What did you expect to see?
    I expected train_1 and train_2 to be identical, because both were created by splitting the same DataFrame with the same fixed seed. Consequently, I expected the inner joins train_test and test_train to be empty, since test_2 and test_1 should contain none of the rows in train_1 and train_2, respectively. However, the resulting counts show that random_split does not produce consistent splits for a fixed seed on a DataFrame of this size (roughly 10**7 rows).
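
To make the expected property concrete, here is a minimal plain-Python sketch of a split that is reproducible by construction. It is not how random_split is implemented, and stable_split is a hypothetical helper introduced only for illustration: each key is assigned to a part from a hash of the seed and the key, so the assignment cannot depend on row order or on how the data is scanned.

```python
import hashlib


def stable_split(keys, train_fraction=0.5, seed=42):
    """Hypothetical helper: assign each key to train or test from a hash of (seed, key)."""
    train, test = set(), set()
    for key in keys:
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        # Map the first 8 bytes of the digest to a number in [0, 1).
        u = int.from_bytes(digest[:8], "big") / 2**64
        (train if u < train_fraction else test).add(key)
    return train, test


# A smaller range than in the report, to keep the sketch fast.
keys = range(1, 10**5)
train_1, test_1 = stable_split(keys)
train_2, test_2 = stable_split(keys)

# Cross overlaps are empty and both runs agree exactly.
print(len(train_1 & test_2), len(test_1 & train_2))  # 0 0
print(train_1 == train_2 and test_1 == test_2)       # True
```

With a split like this, the train_test and test_train joins from the reproduction above would be empty, which is the behavior I expected from a fixed seed.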
@robertlessmore added the bug (Something isn't working) and needs triage (Initial RCA is required) labels on Jul 29, 2024
@github-actions bot changed the title from "Inconsistent Results with random_split on Large DataFrames with Fixed Seed" to "SNOW-1561999: Inconsistent Results with random_split on Large DataFrames with Fixed Seed" on Jul 29, 2024
@sfc-gh-sghosh self-assigned this on Aug 13, 2024
@sfc-gh-sghosh

Hello @robertlessmore,

Thanks for raising the issue. We are looking into it and will post an update.

Regards,
Sujan

@sfc-gh-sghosh added the status-triage_done (Initial triage done, will be further handled by the driver team) and status-triage (Issue is under initial triage) labels and removed the status-triage_done and needs triage (Initial RCA is required) labels on Aug 13, 2024
@sfc-gh-sghosh

Hello @robertlessmore,

Yes, we are able to reproduce the issue with large datasets. With small datasets the issue does not occur and random_split works as expected. We will work on a fix.

Regards,
Sujan

@sfc-gh-sghosh added the status-triage_done (Initial triage done, will be further handled by the driver team) label and removed the status-triage (Issue is under initial triage) label on Aug 14, 2024