Introduced the "extract" function to all the parsers, with relative tests #240

Merged
Changes from all commits
23 commits
1c18bbb
merged the extract functions
DrDomenicoMarson Sep 23, 2022
4e62919
amber merged parsers pass all old tests
DrDomenicoMarson Sep 23, 2022
bc77e98
added extract to gmx parser, tests passed
DrDomenicoMarson Sep 23, 2022
745690e
added extract to gomc
DrDomenicoMarson Sep 23, 2022
fe34171
added extract to namd parser
DrDomenicoMarson Sep 23, 2022
d897c69
added test for the new extract function
DrDomenicoMarson Sep 23, 2022
bd81563
removed redundant test for T
DrDomenicoMarson Sep 23, 2022
3d6a827
added test for extract for gmx
DrDomenicoMarson Sep 23, 2022
c7d58c6
cleaned amber tests
DrDomenicoMarson Sep 23, 2022
57dafa5
added extract test for gomc and namd
DrDomenicoMarson Sep 23, 2022
e592104
added to CHANGES
DrDomenicoMarson Sep 23, 2022
954effe
removed support for python<3.6
DrDomenicoMarson Sep 25, 2022
a102a69
removed type hint to be consistent with the rest
DrDomenicoMarson Sep 25, 2022
d4b2355
fixed the docstring of extract for sphinx
DrDomenicoMarson Sep 25, 2022
b879407
fix testing extract of gomc parser
DrDomenicoMarson Sep 25, 2022
b7d18d5
added # pragma: no cover
DrDomenicoMarson Sep 25, 2022
107b282
commented two lines in units.py that ruined sphinx
DrDomenicoMarson Sep 25, 2022
df95656
added description for extract in docs
DrDomenicoMarson Sep 25, 2022
99f3f69
removed a redundant try/except in the test
DrDomenicoMarson Sep 25, 2022
448c648
uncommented the docstring of units.py
DrDomenicoMarson Sep 25, 2022
fcbe559
just to force docs rebuilding
DrDomenicoMarson Sep 26, 2022
eaf8f3d
Added versionadded:: 1.0.0
DrDomenicoMarson Sep 27, 2022
71020ee
fixed versionadded
DrDomenicoMarson Sep 27, 2022
3 changes: 3 additions & 0 deletions CHANGES
@@ -45,6 +45,9 @@ Fixes
- substitute the any_none() function with a check "if None in" in the AMBER parser (issue #236, PR #237)
- For ABFE workflow, dHdl will only be read when a TI estimator is chosen.
Similarly, u_nk will only be read when FEP estimators are chosen. (PR #231)
- All parsers now have an 'extract(file, T)' method that returns a dict with both
  "dHdl" and "u_nk" data (or None). When using this function, the AMBER parser reads
  the file just once, extracting all data in a single pass. (issue #222, PR #240)

07/22/2022 xiki-tempula, IAlibay, dotsdl, orbeckst, ptmerz

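The CHANGES entry above describes a single dict-returning entry point; a minimal consumer-side sketch of that contract follows. The stub dict and the `handle_extracted` helper are illustrative stand-ins — a real call such as `alchemlyb.parsing.amber.extract(outfile, T)` needs an engine output file and returns pandas objects:

```python
def handle_extracted(extracted):
    """Consume the {'u_nk': ..., 'dHdl': ...} dict returned by an extract() call.

    Either entry may be None: a file without MBAR energies yields no 'u_nk',
    and a file without dV/dl records yields no 'dHdl'.
    """
    results = {}
    if extracted["u_nk"] is not None:
        results["n_u_nk_frames"] = len(extracted["u_nk"])
    if extracted["dHdl"] is not None:
        results["n_dHdl_frames"] = len(extracted["dHdl"])
    return results

# Stub result standing in for a real parser call.
fake = {"u_nk": [0.1, 0.2, 0.3], "dHdl": None}
print(handle_extracted(fake))  # {'n_u_nk_frames': 3}
```

Checking each entry against `None` before use is the pattern the new API expects, since a single output file need not contain both kinds of data.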
2 changes: 1 addition & 1 deletion docs/api_principles.rst
@@ -69,7 +69,7 @@ The library is structured as follows, following a similar style to
│   ├── abfe.py


* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Moreover, an `extract` function is provided, which returns a dict containing both derivatives and reduced potentials. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.preprocessing` submodule features functions for subsampling timeseries, as may be desired before feeding them to an estimator. So far, these are limited to `slicing`, `statistical_inefficiency`, and `equilibrium_detection` functions, many of which make use of subsampling schemes available from :mod:`pymbar`. These functions are written in such a way that they can be easily composed as parts of complex processing pipelines.
* The :mod:`~alchemlyb.estimators` module features classes *a la* **scikit-learn** that can be initialized with parameters that determine their behavior and then "trained" on a `fit` method. MBAR, BAR, and thermodynamic integration (TI) as the major methods are all implemented. Correct error estimates require the use of time series with independent samples.
* The :mod:`~alchemlyb.convergence` submodule features convenience functions/classes for doing convergence analysis using a given dataset and a chosen estimator.
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.amber.rst
@@ -18,3 +18,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.amber.extract_dHdl
.. autofunction:: alchemlyb.parsing.amber.extract_u_nk
.. autofunction:: alchemlyb.parsing.amber.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gmx.rst
@@ -37,3 +37,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gmx.extract_dHdl
.. autofunction:: alchemlyb.parsing.gmx.extract_u_nk
.. autofunction:: alchemlyb.parsing.gmx.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gomc.rst
@@ -13,3 +13,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gomc.extract_dHdl
.. autofunction:: alchemlyb.parsing.gomc.extract_u_nk
.. autofunction:: alchemlyb.parsing.gomc.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.namd.rst
@@ -31,3 +31,4 @@ API Reference
This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.namd.extract_u_nk
.. autofunction:: alchemlyb.parsing.namd.extract
17 changes: 17 additions & 0 deletions src/alchemlyb/parsing/__init__.py
@@ -14,3 +14,20 @@ def wrapper(outfile, T, *args, **kwargs):
dataframe.attrs['energy_unit'] = 'kT'
return dataframe
return wrapper


def _init_attrs_dict(func):
'''Add temperature and energy units to the parsed dataframes.

The temperature is added to the dataframe as dataframe.attrs['temperature']
and the energy unit is initiated as dataframe.attrs['energy_unit'] = 'kT'.
'''
@wraps(func)
def wrapper(outfile, T, *args, **kwargs):
dict_with_df = func(outfile, T, *args, **kwargs)
for k in dict_with_df.keys():
if dict_with_df[k] is not None:
dict_with_df[k].attrs['temperature'] = T
dict_with_df[k].attrs['energy_unit'] = 'kT'
return dict_with_df
return wrapper
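The decorator added above stamps metadata onto every non-None entry of the parser's result dict. The sketch below mirrors that logic so it can be exercised standalone; `_Frame` and `toy_extract` are hypothetical stand-ins for the pandas objects (which carry an `attrs` dict since pandas 1.0) and for a real parser:

```python
from functools import wraps

def _init_attrs_dict(func):
    """Stamp temperature and energy-unit metadata onto each non-None
    entry of the {'u_nk': ..., 'dHdl': ...} dict returned by func."""
    @wraps(func)
    def wrapper(outfile, T, *args, **kwargs):
        dict_with_df = func(outfile, T, *args, **kwargs)
        for k in dict_with_df:
            if dict_with_df[k] is not None:
                dict_with_df[k].attrs['temperature'] = T
                dict_with_df[k].attrs['energy_unit'] = 'kT'
        return dict_with_df
    return wrapper

class _Frame:
    """Minimal stand-in for a pandas DataFrame/Series with an attrs dict."""
    def __init__(self):
        self.attrs = {}

@_init_attrs_dict
def toy_extract(outfile, T):
    # A real parser would read the file here; this stub returns fixed data.
    return {"u_nk": _Frame(), "dHdl": None}

out = toy_extract("md.out", 300.0)
print(out["u_nk"].attrs)  # {'temperature': 300.0, 'energy_unit': 'kT'}
```

Keeping the metadata logic in one decorator means every engine's `extract` gets consistent `attrs` without repeating the bookkeeping, and `None` entries pass through untouched.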
191 changes: 97 additions & 94 deletions src/alchemlyb/parsing/amber.py
@@ -1,6 +1,6 @@
"""Parsers for extracting alchemical data from `Amber <http://ambermd.org>`_ output files.

Most of the file parsing parts are inherited from
Some of the file parsing parts are adapted from
`alchemical-analysis`_.

.. _alchemical-analysis: https://github.com/MobleyLab/alchemical-analysis
@@ -14,13 +14,15 @@
import numpy as np

from .util import anyopen
from . import _init_attrs
from . import _init_attrs_dict
from ..postprocessors.units import R_kJmol, kJ2kcal

logger = logging.getLogger("alchemlyb.parsers.Amber")

k_b = R_kJmol * kJ2kcal

_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def convert_to_pandas(file_datum):
"""Convert the data structure from numpy to pandas format"""
@@ -40,11 +42,6 @@ def convert_to_pandas(file_datum):
return df


DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def _pre_gen(it, first):
"""A generator that returns first first if it exists."""

@@ -58,7 +55,7 @@ def _pre_gen(it, first):
return


class SectionParser(object):
class SectionParser():
"""
A simple parser to extract data values from sections.
"""
@@ -129,14 +126,11 @@ def extract_section(self, start, end, fields, limit=None, extra='',
def __iter__(self):
return self

def next(self):
def __next__(self):
"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
return next(self.fileh)

# make compatible with python 3.6
__next__ = next

def close(self):
"""Close the filehandle."""
self.fileh.close()
@@ -148,11 +142,11 @@ def __exit__(self, typ, value, traceback):
self.close()


class FEData(object):
class FEData():
"""A simple struct container to collect data from individual files."""

__slots__ = ['clambda', 't0', 'dt', 'T', 'ntpr', 'gradients',
'component_gradients', 'mbar_energies',
'mbar_energies',
'have_mbar', 'mbar_lambdas', 'mbar_lambda_idx']

def __init__(self):
@@ -162,7 +156,6 @@ def __init__(self):
self.T = -1.0
self.ntpr = -1
self.gradients = []
self.component_gradients = []
self.mbar_energies = []
self.have_mbar = False
self.mbar_lambdas = []
@@ -206,13 +199,12 @@ def file_validation(outfile):
mbar_ndata = int(nstlim / mbar_ndata)
mbar_lambdas = _process_mbar_lambdas(secp)
file_datum.mbar_lambdas = mbar_lambdas
clambda_str = '%6.4f' % clambda
clambda_str = f'{clambda:6.4f}'

if clambda_str not in mbar_lambdas:
logger.warning('WARNING: lambda %s not contained in set of '
'MBAR lambdas: %s\nNot using MBAR.',
clambda_str, ', '.join(mbar_lambdas))

have_mbar = False
else:
mbar_nlambda = len(mbar_lambdas)
@@ -240,9 +232,9 @@ def file_validation(outfile):
file_datum.have_mbar = have_mbar
return file_datum

@_init_attrs
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.
@_init_attrs_dict
def extract(outfile, T):
"""Return reduced potentials `u_nk` and gradients `dH/dl` from Amber outputfile.

Parameters
----------
@@ -254,61 +246,94 @@ def extract_u_nk(outfile, T):

Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).
Dict
A dictionary with keys of 'u_nk', which is a pandas DataFrame of reduced potentials for each
alchemical state (k) for each frame (n), and 'dHdl', which is a Series of dH/dl
as a function of time for this lambda window.


.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.

.. versionadded:: 1.0.0
"""

beta = 1/(k_b * T)

file_datum = file_validation(outfile)
if not file_validation(outfile): # pragma: no cover
return None
if not file_validation(outfile):
return {"u_nk": None, "dHdl": None}

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)

if not file_datum.have_mbar:
raise Exception('ERROR: No MBAR energies found! Cannot parse file.')
finished = False
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
high_E_cnt = 0
nensec = 0
old_nstep = -1
for line in secp:
if line.startswith('MBAR Energy analysis'):
if " A V E R A G E S O V E R" in line:
_ = secp.skip_after('^|=========================================')
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
elif line.startswith('MBAR Energy analysis') and file_datum.have_mbar:
mbar = secp.extract_section('^MBAR', '^ ---', file_datum.mbar_lambdas,
extra=line)

if None in mbar:
if None in mbar: # pragma: no cover
continue

E_ref = mbar[file_datum.mbar_lambda_idx]

for lmbda, E in enumerate(mbar):
if E > 0.0:

reference_energy = mbar[file_datum.mbar_lambda_idx]
for lmbda, energy in enumerate(mbar):
if energy > 0.0:
high_E_cnt += 1

file_datum.mbar_energies[lmbda].append(beta * (E - E_ref))
file_datum.mbar_energies[lmbda].append(beta * (energy - reference_energy))
elif line == ' 5. TIMINGS\n':
finished = True
break

if high_E_cnt:
logger.warning('%i MBAR energ%s > 0.0 kcal/mol',
high_E_cnt, 'ies are' if high_E_cnt > 1 else 'y is')

time = [file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]
if not finished: # pragma: no cover
logger.warning('WARNING: file %s is a prematurely terminated run' % outfile)

if file_datum.have_mbar:
mbar_time = [
file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]

mbar_df = pd.DataFrame(
file_datum.mbar_energies,
index=np.array(file_datum.mbar_lambdas, dtype=np.float64),
columns=pd.MultiIndex.from_arrays(
[mbar_time, np.repeat(file_datum.clambda, len(mbar_time))], names=['time', 'lambdas'])
).T
else:
logger.info('WARNING: No MBAR energies found! "u_nk" entry will be None')
mbar_df = None

if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any dV/dl data' % outfile)
dHdl_df = None
else:
logger.info('Read %s DV/DL data points in file %s' % (nensec, outfile))
dHdl_df = convert_to_pandas(file_datum)
dHdl_df['dHdl'] *= beta

return {"u_nk": mbar_df, "dHdl": dHdl_df}

return pd.DataFrame(file_datum.mbar_energies,
columns=pd.MultiIndex.from_arrays([time, np.repeat(file_datum.clambda, len(time))],
names=['time', 'lambdas']),
index=np.array(file_datum.mbar_lambdas, dtype=np.float64)).T

@_init_attrs
def extract_dHdl(outfile, T):
"""Return gradients ``dH/dl`` from Amber TI outputfile.

@@ -330,56 +355,34 @@ def extract_dHdl(outfile, T):
the constants used by the corresponding MD engine.

"""
beta = 1/(k_b * T)
extracted = extract(outfile, T)
return extracted['dHdl']

file_datum = file_validation(outfile)
if not file_validation(outfile):
return None

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.

finished = False
comps = []
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
nensec = 0
old_nstep = -1
into_average_section = False
for line in secp:
if 'DV/DL, AVERAGES OVER' in line \
or " A V E R A G E S O V E R" in line:
into_average_section = True
if line.startswith(' NSTEP') and into_average_section:
_ = secp.skip_lines(1)
into_average_section = False
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None \
and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
if line == ' 5. TIMINGS\n':
finished = True
break
if not finished: # pragma: no cover
logger.warning(' WARNING: prematurely terminated run')
if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any DV/DL data',
outfile)
logger.info(f'Read {nensec} DV/DL data points')
# at this step we get info stored in the FEData object for a given amber out file
file_datum.component_gradients.extend(comps)
# convert file_datum to the pandas format to make it identical to alchemlyb output format
df = convert_to_pandas(file_datum)
df['dHdl'] *= beta
return df
Parameters
----------
outfile : str
Path to Amber .out file to extract data from.
T : float
Temperature in Kelvin at which the simulations were performed;
needed to generated the reduced potential (in units of kT)

Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).


.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.

"""
extracted = extract(outfile, T)
return extracted['u_nk']


def _process_mbar_lambdas(secp):
@@ -410,7 +413,7 @@ def _process_mbar_lambdas(secp):
if 'total' in line:
data = line.split()
mbar_lambdas.extend(data[2:])
else:
else: # pragma: no cover
mbar_lambdas.extend(line.split())

return mbar_lambdas
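The diff above turns `extract_u_nk` and `extract_dHdl` into thin wrappers over a single-pass `extract`. A stdlib-only sketch of that delegation pattern (the function bodies are toy stand-ins, not the real AMBER parsing logic):

```python
def extract(outfile, T):
    """Single pass over the output file, collecting both datasets at once.

    Toy stand-in: the real AMBER parser walks NSTEP/DV/DL records and
    MBAR energy sections in one read of the file.
    """
    u_nk = ["frame0", "frame1"]   # placeholder reduced-potential rows
    dHdl = ["grad0", "grad1"]     # placeholder gradient rows
    return {"u_nk": u_nk, "dHdl": dHdl}

def extract_u_nk(outfile, T):
    """Legacy entry point: same signature as before, now delegating."""
    return extract(outfile, T)["u_nk"]

def extract_dHdl(outfile, T):
    """Legacy entry point: same signature as before, now delegating."""
    return extract(outfile, T)["dHdl"]
```

The legacy functions keep their public signatures, so existing callers are unaffected; callers that need both datasets should call `extract` directly so the file is parsed only once instead of twice.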