Possibility of including a feature of dealing with dhdl files from extended simulations #101

Closed
wehs7661 opened this issue May 14, 2020 · 8 comments

@wehs7661 (Contributor)

Dear alchemlyb developers,
First, I want to thank you all for your hard work in developing this very user-friendly package. Today I was using alchemlyb to analyze the dhdl files of a replica-exchange simulation. Since I was running long simulations, I extended the simulation of each replica several times. However, I found that this can cause two problems when parsing the GROMACS dhdl files.

Specifically, when parsing one of the files, I got the error shown below. It happened because the last line of the file was incomplete: the simulation was killed by a timeout while the line was being written, so the last value was -1.5258789e- instead of -1.5258789e-5. This leads to a ValueError when the string is converted to a float with dtype np.float64 (see line 265 in _extract_dataframe).

TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 133, in extract_dHdl
    df = _extract_dataframe(xvg, headers)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 267, in _extract_dataframe
    float_precision='high')
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f     
    return _read(filepath_or_buffer, kwds)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read        
    data = parser.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read        
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '-1.5258789e-'
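
As a concrete illustration, a truncated final line like this could be stripped externally before parsing, e.g. with something like the following. This is only a sketch: the function name is made up, and it assumes the usual .xvg layout in which header/legend lines start with '#' or '@' and data lines are whitespace-separated floats.

def drop_incomplete_last_line(xvg):
    """Drop the final line of an .xvg file if it was cut off mid-write."""
    with open(xvg) as f:
        lines = f.readlines()
    last = lines[-1]
    # header/legend lines start with '#' or '@'; a truncated value such as
    # '-1.5258789e-' in a data line makes the float() conversion below fail
    if not last.startswith(('#', '@')):
        try:
            [float(x) for x in last.split()]
        except ValueError:
            with open(xvg, 'w') as f:
                f.writelines(lines[:-1])
    return xvg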

In addition, it seems that the GROMACS parser currently cannot deal with overlapping time frames when a simulation is extended. Specifically, say the simulation of the first replica was ended by a timeout and the last time frame in system_dhdl.xvg was 1592 ps, but the corresponding .cpt file had only been updated up to 1562 ps, since the .cpt file is written only every 15 minutes. If we then run gmx mdrun with the -cpi option to extend the simulation, the dhdl file of the extended simulation, system_dhdl.part0002.xvg, will start from 1562 ps rather than 1592 ps. In this situation, when we use dHdl_coul = pd.concat([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']]) or u_nk_coul = pd.concat([extract_u_nk(xvg, T=300) for xvg in files['Coulomb']]), extract_dHdl and extract_u_nk are not able to discard the data for the overlapping time frames (from 1562 ps to 1592 ps) in system_dhdl.xvg and use the data for those time frames from system_dhdl.part0002.xvg instead.

While both problems can of course be solved externally with a separate Python script that modifies the dhdl files so that incomplete lines and duplicated time frames are discarded, I'm wondering whether it would be worthwhile to address these issues inside alchemlyb instead. After all, this situation comes up a lot when users extend their simulations.

Thanks a lot in advance!

@orbeckst (Member)

Dealing with corrupted files is a thorny issue because you suddenly have to assume a lot of things. As much as “doing things correctly automatically” is nice when it works, the danger is that things can go wrong in so many different ways and if you then “correct” something incorrectly, your users might never know. Therefore, for alchemlyb I at least favor failing early and then letting the user clean up first, using their specific knowledge.

Thus for the incomplete dhdl files I’d rather just have a check that fails cleanly with a sensible error message. That would be a good PR.

As for the overlapping-times part: do I understand you correctly that duplicated data are used? The first step would be to detect the situation and then raise an error. That would be a good, separate PR. Once the detection works, we can discuss ways to deal with it.

But I’d start with failing first.
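
As a rough sketch, the detection could be as simple as something like the following (the function name is made up and is not part of alchemlyb, and I'm assuming the standard dHdl/u_nk MultiIndex with a 'time' level):

def raise_on_duplicate_times(df):
    """Fail early if the same (time, lambda) index entry appears more than once."""
    dup = df.index.duplicated()
    if dup.any():
        times = df.index[dup].get_level_values('time').unique()
        raise ValueError(
            "duplicate time frames found (possibly from a simulation that was "
            "extended from an older checkpoint), e.g. at t = {}".format(list(times[:5])))
    return df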

What’s your opinion @dotsdl @davidlmobley @mrshirts ?

@wehs7661 (Contributor, Author)

Hi @orbeckst,
Thank you for your reply!

As for the overlapping-times part: do I understand you correctly that duplicated data are used?

Yes, I believe so. Say the first dhdl file ends at 1584 ps and the second (extended) dhdl file starts at 1562 ps; then the situation in the following figure occurs if we use u_nk = pd.concat([extract_u_nk(xvg, T=300) for xvg in xvg_files]):
[figure: the concatenated u_nk DataFrame, in which the time frames from 1562 ps to 1584 ps appear twice]
In the example above, the time frames from 1562 ps to 1584 ps are duplicated since both files contain them. As with the first problem, this can easily be fixed externally by the user with an additional Python script that discards the time frames from 1562 to 1584 ps in the first file and then appends the data of the second file, so I was just wondering whether it would be worthwhile to include a method that detects duplicated time frames and deals with this issue. Solving the second problem in this way would automatically solve the first one as well, since the incomplete last line (if any) would also be discarded.
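
For concreteness, such an external script could look roughly like this. It is only a sketch: the function name is made up, and it assumes the plain .xvg layout where header/legend lines start with '#' or '@' and the first column of each data line is the time in ps.

def trim_overlap(first_xvg, second_xvg):
    """Drop data lines in first_xvg whose time is >= the first time in second_xvg."""
    def first_time(path):
        with open(path) as f:
            for line in f:
                if not line.startswith(('#', '@')):
                    return float(line.split()[0])

    restart_time = first_time(second_xvg)
    with open(first_xvg) as f:
        lines = f.readlines()
    kept = []
    for line in lines:
        if line.startswith(('#', '@')):
            kept.append(line)
            continue
        try:
            t = float(line.split()[0])
        except (ValueError, IndexError):
            continue  # drop lines that cannot be parsed at all
        if t < restart_time:
            kept.append(line)
        # in the timeout scenario the truncated final line lies in the
        # overlap region, so it is dropped here as well
    with open(first_xvg, 'w') as f:
        f.writelines(kept)

After trimming, pd.concat([extract_u_nk(xvg, T=300) for xvg in xvg_files]) no longer produces duplicated time frames.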

@orbeckst (Member)

I think @dotsdl already does some de-duplication somewhere. In any case, that would be an easy thing to add to the preprocessors, and then you can just include it in your pipeline.

@dotsdl (Member) commented May 18, 2020

Hi @wehs7661!


Re: deduplication. This is something I believe is best handled at the user-script/application level. The forthcoming refactor of the subsampling module (#98) will make subsamplers explicitly raise an exception if they encounter duplicate time values in a set of samples corresponding to the same lambda value(s). De-duplication is hard for us to do well because, depending on the user's data, the right thing to do can vary substantially. As @orbeckst said, it can easily result in the library giving results that do not reflect the user's actual intention.

I would recommend that after you concatenate your samples, you deduplicate on the index with something like this answer:

u_nk = pd.concat([extract_u_nk(xvg, T=300) for xvg in xvg_files])
u_nk = u_nk.loc[~u_nk.index.duplicated(keep='last')]

This keeps the last occurrence of rows with exactly the same index values and drops the other occurrences; rows with a unique index value are all kept.
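
For instance, on a toy frame with one duplicated (time, lambda) entry (all names and values made up):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1560.0, 0.5), (1562.0, 0.5), (1562.0, 0.5), (1564.0, 0.5)],
    names=['time', 'fep-lambda'])
u = pd.DataFrame({'dHdl': [0.1, 0.2, 0.3, 0.4]}, index=idx)

u.loc[~u.index.duplicated(keep='last')]
# keeps (1560.0, 0.5), the second (1562.0, 0.5) row (value 0.3), and (1564.0, 0.5)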


As for parsing errors due to file corruption, I think the best we can do is add more sensible error messages for failed parsing (again echoing @orbeckst). There are so many ways for errors like this to happen in the wild that it is hard to build alchemlyb's parsers to do anything other than raise an error saying there was a problem with the file that they did not know how to handle. This is precisely why the underlying pandas.read_csv parser also fails loudly instead of silently proceeding and producing questionable results.

@wehs7661 (Contributor, Author)

Thank you @dotsdl! Dealing with the issue externally, on the user's side, now makes sense to me. u_nk = u_nk.loc[~u_nk.index.duplicated(keep='last')] is also a short and clean way to deduplicate the time frames.

@orbeckst (Member)

@dotsdl would it be worthwhile to add a preprocessor that essentially just does

u_nk.loc[~u_nk.index.duplicated(keep='last')]

so that users don't have to touch our data structures too much in an uncontrolled manner? It could also serve as a reference/best-practice implementation and curious users can look at the source to learn how to do it. – If that's a sensible idea we can open an issue.

Otherwise I am closing this one.

@orbeckst (Member)

(I tagged it "invalid" because we don't have "wont fix" as a tag – it does not mean that it wasn't a valid question.)

@dotsdl (Member) commented May 19, 2020

We can add this as a preprocessor, yes. I quite like the philosophy of making these things easy to do with our data structures, with implementations that double as reference examples of some pandas-fu.
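
As a rough sketch, such a preprocessor could be as simple as the following (the name dedup is made up and is not an existing alchemlyb function):

def dedup(df, keep='last'):
    """Drop rows whose full index (time and lambda values) is duplicated,
    keeping one occurrence per index value; with keep='last' that is the
    data from the most recent continuation of the simulation."""
    return df.loc[~df.index.duplicated(keep=keep)]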
