Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMBER - in "simplesolvated" directory all files are not compressed #68

Closed
DrDomenicoMarson opened this issue Sep 24, 2022 · 7 comments
Closed
Labels

Comments

@DrDomenicoMarson
Copy link
Contributor

All the files neded for testing are compressed as .bz2 files, except the ones inside the simplesolvated directory. Is there a reason behind this? We could spare some space compressing also these files.

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Sep 25, 2022

@orbeckst Maybe you could comment on this as well.

I'm leaning towards having all of them as uncompressed files. I mean yes bz2 will make the file smaller so save some time in downloading the test files but these time is then spent on unzipping the files and writing it to a local file before reading it. At the end of the day, they are not going to drastically decrease the CI time.

Considering that we are pulling the files from alchemtest/archive/master.zip, file > bz2 > zip might not really shrink much space compared with file > zip.

On the other hand, saving the files as bz2 would mean that the user has to do the unzipping, which seems like extra effort to me to be honest.

@DrDomenicoMarson
Copy link
Contributor Author

I LOVE this suggestion :) I was afraid you preferred to have all compressed files, hence the issue, but I completely agree that's much easier to create and check tests with uncompressed files!

@orbeckst
Copy link
Member

I'd like to keep them compressed (or compresse them if not done so already) so that the footprint of the test package is small.

We never uncompress files to disk. We decompress on the fly with anyopen() https://github.com/alchemistry/alchemlyb/blob/cb965ad049556dc97445acc16cbb21a2bf3f7f38/src/alchemlyb/parsing/util.py#L20

Do you have timing information that shows a marked difference between compressed and uncompressed test files?

@DrDomenicoMarson
Copy link
Contributor Author

Ok, I'll compress the files in current and future PRs then!

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Oct 2, 2022

@orbeckst So I did a test with bz2

import time
from alchemlyb.parsing.amber import extract_u_nk
from alchemtest.amber import load_bace_example
file = load_bace_example().data['solvated']['decharge'][0]
a = time.time()
for i in range(1000):
    extract_u_nk(file, 298)
b = time.time()
print(b-a)

gives 43.86949706077576

with bz2.open(file, "rt") as bz_file:
    text = bz_file.read()
    with open('test.out', 'w') as f:
        f.write(text)
a = time.time()
for i in range(1000):
    extract_u_nk('test.out', 298)
b = time.time()
print(b-a)

gives 24.963331937789917
So running bz2 will make it 75% slower.

With regard to the zip file size, I have created a branch where all bz2 files have been unzipped (https://github.com/xiki-tempula/alchemtest/tree/feat_unzip), the zip file has a size of 82.9 MB which is slightly larger than the original zip file of 73.9 MB but I don't think it is too bad?

Plus, it will avoid #71 (comment)

@orbeckst
Copy link
Member

I agree that the difference in download size is not an issue for alchemlyb where we take the zip. Maybe that's even true of whl etc (not sure if they are compressed). However, when installed, the difference is sizable

 81M	0.6.0.tar.gz
 94M	alchemtest-0.6.0             # as dowloaded
304M	alchemtest-0.6.0-unpacked    # manually unpacked all gz and bz2 files

so in order to keep the installed footprint down, I'd really like to keep datafiles compressed. (This is not important for CI but for users/developers.)

The data that you showed #68 (comment) indicate overhead and that's interesting. I don't know if loading the data is the bottleneck for most tests — it might well be — and if this is the case then unpacking the data files might cut CI time by 25% - 50%. However, I think that the GROMACS reader will not be affected as much as it uses pandas.read_csv(), which is already much faster than our manually written ASCII readers (but I don't know how much it is affected by compression).

I ran pytest -v --durations=20 (on my laptop with SSD) to see where we spend time (we can add --durations to the alchemlyb CI for more information): here are the 20 slowest tests:

============================================================ slowest 20 durations =============================================================
34.28s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_read
32.75s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_read
9.33s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_convergence_method
9.30s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_convergence_method
7.18s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_convergence_nosample_u_nk
5.16s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_benzene::test_read
4.61s call     src/alchemlyb/tests/test_visualisation.py::test_plot_dF_state
2.83s setup    src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_amber::test_summary
2.74s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_manual_ABFE::test_estimator_method
2.72s call     src/alchemlyb/tests/test_workflow_ABFE.py::Test_automatic_ABFE::test_estimator_method
1.99s setup    src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f5]
1.96s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_with_potential_energy
1.88s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_with_total_energy
1.85s call     src/alchemlyb/tests/test_visualisation.py::test_plot_ti_dhdl
1.81s setup    src/alchemlyb/tests/test_fep_estimators.py::TestBAR::test_bar[X_delta_f1]
1.79s call     src/alchemlyb/tests/parsing/test_gmx.py::test_u_nk_without_energy
1.78s setup    src/alchemlyb/tests/test_fep_estimators.py::TestAutoMBAR::test_mbar[X_delta_f5]
1.73s setup    src/alchemlyb/tests/test_fep_estimators.py::TestAutoMBAR::test_mbar[X_delta_f8]
1.72s setup    src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f8]
1.72s call     src/alchemlyb/tests/test_fep_estimators.py::TestMBAR::test_mbar[X_delta_f4]
=============================================== 383 passed, 3141 warnings in 238.44s (0:03:58) ================================================

Probably anything labelled setup is related to reading files (possibly also related to alchemistry/alchemlyb#206 ) but I cannot discern from the output if we're reading GMX or AMBER in these tests.

@xiki-tempula is your main argument for uncompressing that we want to make CI run faster?

If so, then the question is by how much it would really speed up things. And then one has to decide what tradeoffs one wants to make.

@orbeckst
Copy link
Member

I'll close this issue as "won't fix" for now. Please chime in if you want to re-open the discussion.

@orbeckst orbeckst closed this as not planned Won't fix, can't repro, duplicate, stale Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants