Introduced the "extract" function to all the parsers, with relative tests (#240)

Ok, I hope not to have broken anything with this PR :-)
I added the extract(file, T) function to all the existing parsers, with a corresponding test for each parser.

The only parser heavily affected by this PR is the AMBER parser, as mentioned in issue #222.
Here I had to make several adjustments; on my machine all tests pass.
I also took the opportunity to make a couple of small cosmetic changes to the amber parser code (like removing the unnecessary `class SomeThing(object)` inheritance).
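A minimal sketch of how the new extract(file, T) API can be consumed — the stand-in parser and file name below are illustrative only, not the real alchemlyb code:

```python
import pandas as pd

def extract(outfile, T):
    """Stand-in with the same return contract as the new parsers' extract(file, T):
    a dict with 'u_nk' and 'dHdl' entries, each a pandas object or None."""
    # A real parser would read `outfile` once and fill both entries;
    # here we just return a made-up dHdl frame and no u_nk.
    dHdl = pd.DataFrame({'dHdl': [1.2, 1.1]})
    return {'u_nk': None, 'dHdl': dHdl}

data = extract('some_file.out', T=300.0)
if data['dHdl'] is not None:
    print(len(data['dHdl']))   # number of dH/dl frames
if data['u_nk'] is None:
    print('no MBAR energies in this file')
```

Compared with calling extract_dHdl and extract_u_nk separately, a parser implementing this contract can read the file once and hand back both datasets.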
DrDomenicoMarson authored Sep 27, 2022
1 parent cb965ad commit 48a719b
Showing 17 changed files with 283 additions and 124 deletions.
3 changes: 3 additions & 0 deletions CHANGES
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,9 @@ Fixes
- substitute the any_none() function with a check "if None in" in the AMBER parser (issue #236, PR #237)
- For ABFE workflow, dHdl will only be read when a TI estimator is chosen.
Similarly, u_nk will only be read when FEP estimators are chosen. (PR #231)
- All parsers now have an 'extract(file, T)' method that returns a dict with both
  "dHdl" and "u_nk" data (or None). When using this function, the AMBER parser reads the
  file just once, extracting all data in a single pass. (issue #222, PR #240)

07/22/2022 xiki-tempula, IAlibay, dotsdl, orbeckst, ptmerz

Expand Down
2 changes: 1 addition & 1 deletion docs/api_principles.rst
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ The library is structured as follows, following a similar style to
│   ├── abfe.py

* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Moreover, an `extract` function is provided, which returns a dict containing both derivatives and reduced potentials. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.preprocessing` submodule features functions for subsampling timeseries, as may be desired before feeding them to an estimator. So far, these are limited to `slicing`, `statistical_inefficiency`, and `equilibrium_detection` functions, many of which make use of subsampling schemes available from :mod:`pymbar`. These functions are written in such a way that they can be easily composed as parts of complex processing pipelines.
* The :mod:`~alchemlyb.estimators` module features classes *a la* **scikit-learn** that can be initialized with parameters that determine their behavior and then "trained" on a `fit` method. MBAR, BAR, and thermodynamic integration (TI) as the major methods are all implemented. Correct error estimates require the use of time series with independent samples.
* The :mod:`~alchemlyb.convergence` submodule features convenience functions/classes for doing convergence analysis using a given dataset and a chosen estimator.
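The reduced potentials mentioned above are raw energies scaled by β = 1/(k_B·T). A self-contained sketch of that conversion, with k_B derived from :mod:`scipy.constants` the same way the AMBER parser does (the temperature and energies below are made up):

```python
from scipy.constants import R, calorie

# Boltzmann constant in kcal/(mol K): R in J/(mol K) -> kJ -> kcal.
k_b = R / 1000.0 / calorie     # calorie = 4.184, so k_b is ~0.001987
T = 300.0                      # illustrative temperature in K
beta = 1.0 / (k_b * T)

# Reduced potential relative to a reference state: u = beta * (E - E_ref).
E, E_ref = -10.0, -12.5        # made-up energies in kcal/mol
u = beta * (E - E_ref)
print(round(u, 3))
```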
Expand Down
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.amber.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.amber.extract_dHdl
.. autofunction:: alchemlyb.parsing.amber.extract_u_nk
.. autofunction:: alchemlyb.parsing.amber.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gmx.rst
Original file line number Diff line number Diff line change
Expand Up @@ -37,3 +37,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gmx.extract_dHdl
.. autofunction:: alchemlyb.parsing.gmx.extract_u_nk
.. autofunction:: alchemlyb.parsing.gmx.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gomc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gomc.extract_dHdl
.. autofunction:: alchemlyb.parsing.gomc.extract_u_nk
.. autofunction:: alchemlyb.parsing.gomc.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.namd.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,4 @@ API Reference
This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.namd.extract_u_nk
.. autofunction:: alchemlyb.parsing.namd.extract
17 changes: 17 additions & 0 deletions src/alchemlyb/parsing/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,20 @@ def wrapper(outfile, T, *args, **kwargs):
dataframe.attrs['energy_unit'] = 'kT'
return dataframe
return wrapper


def _init_attrs_dict(func):
    '''Add temperature and energy units to each parsed dataframe in the returned dict.
    The temperature is added to each dataframe as dataframe.attrs['temperature']
    and the energy unit is initialized as dataframe.attrs['energy_unit'] = 'kT'.
'''
@wraps(func)
def wrapper(outfile, T, *args, **kwargs):
dict_with_df = func(outfile, T, *args, **kwargs)
for k in dict_with_df.keys():
if dict_with_df[k] is not None:
dict_with_df[k].attrs['temperature'] = T
dict_with_df[k].attrs['energy_unit'] = 'kT'
return dict_with_df
return wrapper
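The _init_attrs_dict decorator above tags every non-None frame in the returned dict with metadata. The same pattern in a self-contained sketch (the decorated parser is a made-up stand-in, not part of alchemlyb):

```python
from functools import wraps
import pandas as pd

def init_attrs_dict(func):
    """Attach temperature and energy-unit metadata to every frame in the dict."""
    @wraps(func)
    def wrapper(outfile, T, *args, **kwargs):
        dict_with_df = func(outfile, T, *args, **kwargs)
        for df in dict_with_df.values():
            if df is not None:
                df.attrs['temperature'] = T
                df.attrs['energy_unit'] = 'kT'
        return dict_with_df
    return wrapper

@init_attrs_dict
def fake_extract(outfile, T):
    # Stand-in parser returning the same {"u_nk": ..., "dHdl": ...} shape.
    return {"u_nk": pd.DataFrame({"x": [1.0]}), "dHdl": None}

result = fake_extract("ignored.out", 300.0)
print(result["u_nk"].attrs)   # {'temperature': 300.0, 'energy_unit': 'kT'}
```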
191 changes: 97 additions & 94 deletions src/alchemlyb/parsing/amber.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
"""Parsers for extracting alchemical data from `Amber <http://ambermd.org>`_ output files.
Most of the file parsing parts are inherited from
Some of the file parsing parts are adapted from
`alchemical-analysis`_.
.. _alchemical-analysis: https://github.com/MobleyLab/alchemical-analysis
Expand All @@ -14,13 +14,15 @@
import numpy as np

from .util import anyopen
from . import _init_attrs
from . import _init_attrs_dict
from ..postprocessors.units import R_kJmol, kJ2kcal

logger = logging.getLogger("alchemlyb.parsers.Amber")

k_b = R_kJmol * kJ2kcal

_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def convert_to_pandas(file_datum):
"""Convert the data structure from numpy to pandas format"""
Expand All @@ -40,11 +42,6 @@ def convert_to_pandas(file_datum):
return df


DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def _pre_gen(it, first):
"""A generator that returns first first if it exists."""

Expand All @@ -58,7 +55,7 @@ def _pre_gen(it, first):
return


class SectionParser(object):
class SectionParser():
"""
A simple parser to extract data values from sections.
"""
Expand Down Expand Up @@ -129,14 +126,11 @@ def extract_section(self, start, end, fields, limit=None, extra='',
def __iter__(self):
return self

def next(self):
def __next__(self):
"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
return next(self.fileh)

# make compatible with python 3.6
__next__ = next

def close(self):
"""Close the filehandle."""
self.fileh.close()
Expand All @@ -148,11 +142,11 @@ def __exit__(self, typ, value, traceback):
self.close()


class FEData(object):
class FEData():
"""A simple struct container to collect data from individual files."""

__slots__ = ['clambda', 't0', 'dt', 'T', 'ntpr', 'gradients',
'component_gradients', 'mbar_energies',
'mbar_energies',
'have_mbar', 'mbar_lambdas', 'mbar_lambda_idx']

def __init__(self):
Expand All @@ -162,7 +156,6 @@ def __init__(self):
self.T = -1.0
self.ntpr = -1
self.gradients = []
self.component_gradients = []
self.mbar_energies = []
self.have_mbar = False
self.mbar_lambdas = []
Expand Down Expand Up @@ -206,13 +199,12 @@ def file_validation(outfile):
mbar_ndata = int(nstlim / mbar_ndata)
mbar_lambdas = _process_mbar_lambdas(secp)
file_datum.mbar_lambdas = mbar_lambdas
clambda_str = '%6.4f' % clambda
clambda_str = f'{clambda:6.4f}'

if clambda_str not in mbar_lambdas:
        logger.warning('WARNING: lambda %s not contained in set of '
                       'MBAR lambdas: %s\nNot using MBAR.',
                       clambda_str, ', '.join(mbar_lambdas))

have_mbar = False
else:
mbar_nlambda = len(mbar_lambdas)
Expand Down Expand Up @@ -240,9 +232,9 @@ def file_validation(outfile):
file_datum.have_mbar = have_mbar
return file_datum

@_init_attrs
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.
@_init_attrs_dict
def extract(outfile, T):
"""Return reduced potentials `u_nk` and gradients `dH/dl` from Amber outputfile.
Parameters
----------
Expand All @@ -254,61 +246,94 @@ def extract_u_nk(outfile, T):
Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).
Dict
A dictionary with keys of 'u_nk', which is a pandas DataFrame of reduced potentials for each
alchemical state (k) for each frame (n), and 'dHdl', which is a Series of dH/dl
as a function of time for this lambda window.
.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.
.. versionadded:: 1.0.0
"""

beta = 1/(k_b * T)

file_datum = file_validation(outfile)
if not file_validation(outfile): # pragma: no cover
return None
if not file_validation(outfile):
return {"u_nk": None, "dHdl": None}

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)

if not file_datum.have_mbar:
raise Exception('ERROR: No MBAR energies found! Cannot parse file.')
finished = False
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
high_E_cnt = 0
nensec = 0
old_nstep = -1
for line in secp:
if line.startswith('MBAR Energy analysis'):
if " A V E R A G E S O V E R" in line:
_ = secp.skip_after('^|=========================================')
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
elif line.startswith('MBAR Energy analysis') and file_datum.have_mbar:
mbar = secp.extract_section('^MBAR', '^ ---', file_datum.mbar_lambdas,
extra=line)

if None in mbar:
if None in mbar: # pragma: no cover
continue

E_ref = mbar[file_datum.mbar_lambda_idx]

for lmbda, E in enumerate(mbar):
if E > 0.0:

reference_energy = mbar[file_datum.mbar_lambda_idx]
for lmbda, energy in enumerate(mbar):
if energy > 0.0:
high_E_cnt += 1

file_datum.mbar_energies[lmbda].append(beta * (E - E_ref))
file_datum.mbar_energies[lmbda].append(beta * (energy - reference_energy))
elif line == ' 5. TIMINGS\n':
finished = True
break

if high_E_cnt:
logger.warning('%i MBAR energ%s > 0.0 kcal/mol',
high_E_cnt, 'ies are' if high_E_cnt > 1 else 'y is')

time = [file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]
if not finished: # pragma: no cover
logger.warning('WARNING: file %s is a prematurely terminated run' % outfile)

if file_datum.have_mbar:
mbar_time = [
file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]

mbar_df = pd.DataFrame(
file_datum.mbar_energies,
index=np.array(file_datum.mbar_lambdas, dtype=np.float64),
columns=pd.MultiIndex.from_arrays(
[mbar_time, np.repeat(file_datum.clambda, len(mbar_time))], names=['time', 'lambdas'])
).T
else:
logger.info('WARNING: No MBAR energies found! "u_nk" entry will be None')
mbar_df = None

if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any dV/dl data' % outfile)
dHdl_df = None
else:
logger.info('Read %s DV/DL data points in file %s' % (nensec, outfile))
dHdl_df = convert_to_pandas(file_datum)
dHdl_df['dHdl'] *= beta

return {"u_nk": mbar_df, "dHdl": dHdl_df}

return pd.DataFrame(file_datum.mbar_energies,
columns=pd.MultiIndex.from_arrays([time, np.repeat(file_datum.clambda, len(time))],
names=['time', 'lambdas']),
index=np.array(file_datum.mbar_lambdas, dtype=np.float64)).T
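The u_nk frame assembled in extract above ends up, after the transpose, with a (time, lambdas) MultiIndex on the rows and one column per lambda state. A self-contained illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the FEData fields.
mbar_lambdas = ['0.0000', '0.5000', '1.0000']          # states k
clambda = 0.5                                          # lambda of this window
mbar_energies = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]   # one row per state, one entry per frame
mbar_time = [1.0, 2.0]                                 # time of each frame n

u_nk = pd.DataFrame(
    mbar_energies,
    index=np.array(mbar_lambdas, dtype=np.float64),
    columns=pd.MultiIndex.from_arrays(
        [mbar_time, np.repeat(clambda, len(mbar_time))],
        names=['time', 'lambdas'])
).T

print(u_nk.shape)               # (2, 3): one row per frame, one column per state
print(list(u_nk.index.names))   # ['time', 'lambdas']
```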

@_init_attrs
def extract_dHdl(outfile, T):
"""Return gradients ``dH/dl`` from Amber TI outputfile.
Expand All @@ -330,56 +355,34 @@ def extract_dHdl(outfile, T):
the constants used by the corresponding MD engine.
"""
beta = 1/(k_b * T)
extracted = extract(outfile, T)
return extracted['dHdl']

file_datum = file_validation(outfile)
if not file_validation(outfile):
return None

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.
finished = False
comps = []
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
nensec = 0
old_nstep = -1
into_average_section = False
for line in secp:
if 'DV/DL, AVERAGES OVER' in line \
or " A V E R A G E S O V E R" in line:
into_average_section = True
if line.startswith(' NSTEP') and into_average_section:
_ = secp.skip_lines(1)
into_average_section = False
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None \
and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
if line == ' 5. TIMINGS\n':
finished = True
break
if not finished: # pragma: no cover
logger.warning(' WARNING: prematurely terminated run')
if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any DV/DL data',
outfile)
logger.info(f'Read {nensec} DV/DL data points')
# at this step we get info stored in the FEData object for a given amber out file
file_datum.component_gradients.extend(comps)
# convert file_datum to the pandas format to make it identical to alchemlyb output format
df = convert_to_pandas(file_datum)
df['dHdl'] *= beta
return df
Parameters
----------
outfile : str
Path to Amber .out file to extract data from.
T : float
Temperature in Kelvin at which the simulations were performed;
        needed to generate the reduced potential (in units of kT)
Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).
.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.
"""
extracted = extract(outfile, T)
return extracted['u_nk']


def _process_mbar_lambdas(secp):
Expand Down Expand Up @@ -410,7 +413,7 @@ def _process_mbar_lambdas(secp):
if 'total' in line:
data = line.split()
mbar_lambdas.extend(data[2:])
else:
else: # pragma: no cover
mbar_lambdas.extend(line.split())

return mbar_lambdas
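_process_mbar_lambdas above collects the lambda values listed in the MBAR section of the output; on the line containing 'total', the first two tokens are skipped before the values. A self-contained sketch of that loop (the sample lines are invented, not real AMBER output):

```python
def process_mbar_lambdas(lines):
    """Collect MBAR lambda strings, skipping the leading count/'total' tokens."""
    mbar_lambdas = []
    for line in lines:
        if 'total' in line:
            data = line.split()
            mbar_lambdas.extend(data[2:])    # drop the two leading tokens
        else:
            mbar_lambdas.extend(line.split())
    return mbar_lambdas

sample = ['3 total:  0.0000 0.5000', '1.0000']   # invented sample lines
print(process_mbar_lambdas(sample))              # ['0.0000', '0.5000', '1.0000']
```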