# Possibility of including a feature for dealing with dhdl files from extended simulations #101
**@orbeckst** commented:

Dealing with corrupted files is a thorny issue because you suddenly have to assume a lot of things. As much as "doing things correctly automatically" is nice when it works, the danger is that things can go wrong in so many different ways, and if you then "correct" something incorrectly, your users might never know. Therefore, for alchemlyb I at least favor failing early and letting the user clean up first, using their specific knowledge. Thus, for the incomplete dhdl files I'd rather just have a check that fails cleanly with a sensible error message. That would be a good PR.

As for the overlapping times: do I understand you correctly in that duplicated data are used? The first step would be to detect the situation and then raise an error. That would be a good, separate PR. Once the detection works, we can discuss ways to deal with it. But I'd start with failing first.

What's your opinion @dotsdl @davidlmobley @mrshirts?
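For concreteness, here is a minimal sketch of the kind of fail-early check being proposed, assuming the parsed data arrive as a pandas DataFrame indexed the way the alchemlyb parsers produce them; the function name and error wording are hypothetical, not an existing alchemlyb API:

```python
import pandas as pd

def check_duplicate_frames(df: pd.DataFrame) -> None:
    """Hypothetical fail-early check: raise if any full index entry
    (e.g. a (time, lambda) tuple) occurs more than once, as happens
    when continuation segments overlap in time."""
    dup = df.index.duplicated()
    if dup.any():
        raise ValueError(
            f"Duplicate frames detected (first duplicate index: {df.index[dup][0]}); "
            "the input files may come from overlapping continuation runs."
        )
```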
**@wehs7661** commented:

Hi @orbeckst, yes, I believe so. Say that the first dhdl file ends at 1584 ps and the second (extended) dhdl file starts from 1562 ps; the frames from 1562 ps to 1584 ps then appear twice if we simply concatenate the parsed files.
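A toy illustration of the duplication, with invented `dHdl` values and times in ps:

```python
import pandas as pd

# Two dhdl segments whose time ranges overlap (values are made up)
part1 = pd.DataFrame({"dHdl": [0.1, 0.2, 0.3]},
                     index=pd.Index([1560.0, 1562.0, 1584.0], name="time"))
part2 = pd.DataFrame({"dHdl": [0.25, 0.35]},
                     index=pd.Index([1562.0, 1584.0], name="time"))

combined = pd.concat([part1, part2])
print(combined.index.duplicated().any())  # True: 1562 ps and 1584 ps occur twice
```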
**@orbeckst** commented:

I think @dotsdl does some deduplication somewhere already. In any case, that would be an easy thing to add to the preprocessors, and then you could just include it in your pipeline.
**@dotsdl** commented:

Hi @wehs7661! Re: deduplication. This is something I believe is best handled at the user-script/application level. The forthcoming refactor of the subsampling module (#98) will make subsamplers explicitly raise an exception if they encounter duplicate time values in a set of samples corresponding to the same alchemical state.

I would recommend that after you concatenate your samples, you deduplicate on the index with something like this:

```python
u_nk = pd.concat([extract_u_nk(xvg, T=300) for xvg in xvg_files])
u_nk = u_nk.loc[~u_nk.index.duplicated(keep='last')]
```

This will keep the last occurrence of rows with exactly the same index values and drop all the others; all rows with a unique index value are kept.

As for parsing errors due to file corruption, I think the best we can do is add more sensible error messages for failed parsing (again echoing @orbeckst). There are so many ways for errors like this to happen in the wild that it is hard for `alchemlyb` to anticipate all of them.
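As a quick sanity check after the snippet above (my addition, not part of the original comment), one can assert that no duplicated index entries remain:

```python
# After keep='last' deduplication, every index entry should be unique
assert not u_nk.index.duplicated().any()
```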
**@wehs7661** commented:

Thank you @dotsdl! Handling the issue externally at the user level now makes sense to me.
**@orbeckst** commented:

@dotsdl, would it be worthwhile to add a preprocessor that essentially just does `u_nk.loc[~u_nk.index.duplicated(keep='last')]`, so that users don't have to touch our data structures too much in an uncontrolled manner? It could also serve as a reference/best-practice implementation, and curious users could look at the source to learn how to do it. If that's a sensible idea we can open an issue; otherwise I am closing this one.
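Such a preprocessor could be as small as the following sketch; the name `dedup_frames` is hypothetical and not an existing alchemlyb function:

```python
import pandas as pd

def dedup_frames(df: pd.DataFrame, keep: str = "last") -> pd.DataFrame:
    """Hypothetical preprocessor: drop rows whose full index value is
    duplicated. keep='last' prefers data from the latest continuation
    segment, matching the snippet above."""
    return df.loc[~df.index.duplicated(keep=keep)]
```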
(I tagged it "invalid" because we don't have "wont fix" as a tag – it does not mean that it wasn't a valid question.)
**@dotsdl** commented:

We can add this as a preprocessor, yes. I quite like this philosophy of making these things easy for our data structures, which double as reference implementations for some of these common operations.
---

Original issue description (from **@wehs7661**):

Dear alchemlyb developers,

First I want to thank you all for your hard work in developing this pretty user-friendly package. Today I was using `alchemlyb` to analyze the dhdl files of a replica-exchange simulation. Since I was running long simulations, I extended the simulation of each replica several times. However, I found that this can cause two problems when parsing the GROMACS dhdl files.

First, when parsing one of the files, I got a `ValueError`. It happened because the last line of the file was incomplete, as the simulation had been killed by a timeout: the end of the last line read `-1.5258789e-` instead of `-1.5258789e-5`, so converting the last string of the line to a float failed when `dtype` was specified as `np.float64` (see line 265 in `_extract_dataframe`).
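Until such a check exists in the parser, one external workaround is to drop a truncated final data line before parsing. The sketch below assumes the truncation shows up exactly as the float-conversion failure described above; none of this is alchemlyb API:

```python
def trim_incomplete_last_line(path: str) -> None:
    """Drop the last line of an xvg file if its fields do not all parse
    as floats (e.g. a line ending in '-1.5258789e-' after a timeout)."""
    with open(path) as f:
        lines = f.readlines()
    if not lines:
        return
    last = lines[-1]
    if last.startswith(("#", "@")):  # xvg comment/metadata lines are fine
        return
    try:
        [float(field) for field in last.split()]
    except ValueError:
        with open(path, "w") as f:
            f.writelines(lines[:-1])  # rewrite the file without the bad line
```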
In addition, it seems that the GROMACS parser currently cannot deal with overlapping time frames when a simulation is extended. Say the simulation of the first replica was killed by a timeout and the last time frame in `system_dhdl.xvg` was 1592 ps, but the corresponding `.cpt` file had only been updated up to 1562 ps, since the `.cpt` file is written only every 15 minutes. If we then run `gmx mdrun` with the `-cpi` option to extend the simulation, the dhdl file of the extended run, `system_dhdl.part0002.xvg`, will start from 1562 ps rather than 1592 ps. In this situation, when we use `dHdl_coul = pd.concat([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']])` or `u_nk_coul = pd.concat([extract_u_nk(xvg, T=300) for xvg in files['Coulomb']])`, neither `extract_dHdl` nor `extract_u_nk` is able to discard the overlapping time frames (from 1562 ps to 1592 ps) in `system_dhdl.xvg` in favor of the data for those frames in `system_dhdl.part0002.xvg`.
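One way to detect such an overlap externally before concatenating is sketched below. It assumes, as for alchemlyb's GROMACS parser output, that time is the first index level, and it reuses the file names from the scenario above:

```python
from alchemlyb.parsing.gmx import extract_dHdl

# File names as in the scenario described above
files = {'Coulomb': ['system_dhdl.xvg', 'system_dhdl.part0002.xvg']}

segments = [extract_dHdl(xvg, T=300) for xvg in files['Coulomb']]
for prev, nxt in zip(segments, segments[1:]):
    t_prev_end = prev.index.get_level_values(0)[-1]   # last time in earlier segment
    t_next_start = nxt.index.get_level_values(0)[0]   # first time in later segment
    if t_next_start <= t_prev_end:
        print(f"Overlap: a segment starts at {t_next_start} ps but the "
              f"previous one ends at {t_prev_end} ps")
```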
While both problems can obviously be solved externally with another Python script that modifies the dhdl files to discard the incomplete lines and the duplicated time frames, I'm wondering whether it would be worth addressing these issues internally in `alchemlyb` instead. After all, this situation happens a lot when users extend their simulations.

Thanks a lot in advance!