Introduced the "extract" function to all the parsers, with relative tests (#240)

Ok, I hope not to have broken anything with this PR :-)
I added the extract(file, T) function to all the existing parsers, with a corresponding test for each parser.

The only parser heavily affected by this PR is the AMBER parser, as mentioned in issue #222.
Here I had to make several adjustments; on my machine all tests pass.
I also took the opportunity to make a couple of small cosmetic changes to the amber parser code (like removing the unnecessary `class SomeThing(object)` inheritance).
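A minimal sketch of how the new extract(file, T) API can be consumed — the stand-in parser and file name below are illustrative only, not the real alchemlyb code:

```python
import pandas as pd

def extract(outfile, T):
    """Stand-in with the same return contract as the new parsers' extract(file, T):
    a dict with 'u_nk' and 'dHdl' entries, each a pandas object or None."""
    # A real parser would read `outfile` once and fill both entries;
    # here we just return a made-up dHdl frame and no u_nk.
    dHdl = pd.DataFrame({'dHdl': [1.2, 1.1]})
    return {'u_nk': None, 'dHdl': dHdl}

data = extract('some_file.out', T=300.0)
if data['dHdl'] is not None:
    print(len(data['dHdl']))   # number of dH/dl frames
if data['u_nk'] is None:
    print('no MBAR energies in this file')
```

Compared with calling extract_dHdl and extract_u_nk separately, a parser implementing this contract can read the file once and hand back both datasets.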
DrDomenicoMarson authored Sep 27, 2022
1 parent cb965ad commit 48a719b
Showing 17 changed files with 283 additions and 124 deletions.
3 changes: 3 additions & 0 deletions CHANGES
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,9 @@ Fixes
- substitute the any_none() function with a check "if None in" in the AMBER parser (issue #236, PR #237)
- For ABFE workflow, dHdl will only be read when a TI estimator is chosen.
Similarly, u_nk will only be read when FEP estimators are chosen. (PR #231)
- All parsers now have an 'extract(file, T)' method that returns a dict with both
  "dHdl" and "u_nk" data (or None). When using this function, the AMBER parser reads the
  file just once, extracting all data in a single pass. (issue #222, PR #240)

07/22/2022 xiki-tempula, IAlibay, dotsdl, orbeckst, ptmerz

Expand Down
2 changes: 1 addition & 1 deletion docs/api_principles.rst
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ The library is structured as follows, following a similar style to
│   ├── abfe.py

* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.parsing` submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an `extract_u_nk` function for extracting reduced potentials (needed for MBAR), as well as an `extract_dHdl` function for extracting derivatives required for thermodynamic integration. Moreover, an `extract` function is provided, which returns a dict containing both derivatives and reduced potentials. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All `extract\_*` functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (:class:`pandas.DataFrame` for reduced potentials, :class:`pandas.Series` for derivatives).
* The :mod:`~alchemlyb.preprocessing` submodule features functions for subsampling timeseries, as may be desired before feeding them to an estimator. So far, these are limited to `slicing`, `statistical_inefficiency`, and `equilibrium_detection` functions, many of which make use of subsampling schemes available from :mod:`pymbar`. These functions are written in such a way that they can be easily composed as parts of complex processing pipelines.
* The :mod:`~alchemlyb.estimators` module features classes *a la* **scikit-learn** that can be initialized with parameters that determine their behavior and then "trained" on a `fit` method. MBAR, BAR, and thermodynamic integration (TI) as the major methods are all implemented. Correct error estimates require the use of time series with independent samples.
* The :mod:`~alchemlyb.convergence` submodule features convenience functions/classes for doing convergence analysis using a given dataset and a chosen estimator.
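The reduced potentials mentioned above are raw energies scaled by β = 1/(k_B·T). A self-contained sketch of that conversion, with k_B derived from :mod:`scipy.constants` the same way the AMBER parser does (the temperature and energies below are made up):

```python
from scipy.constants import R, calorie

# Boltzmann constant in kcal/(mol K): R in J/(mol K) -> kJ -> kcal.
k_b = R / 1000.0 / calorie     # calorie = 4.184, so k_b is ~0.001987
T = 300.0                      # illustrative temperature in K
beta = 1.0 / (k_b * T)

# Reduced potential relative to a reference state: u = beta * (E - E_ref).
E, E_ref = -10.0, -12.5        # made-up energies in kcal/mol
u = beta * (E - E_ref)
print(round(u, 3))
```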
Expand Down
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.amber.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.amber.extract_dHdl
.. autofunction:: alchemlyb.parsing.amber.extract_u_nk
.. autofunction:: alchemlyb.parsing.amber.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gmx.rst
Original file line number Diff line number Diff line change
Expand Up @@ -37,3 +37,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gmx.extract_dHdl
.. autofunction:: alchemlyb.parsing.gmx.extract_u_nk
.. autofunction:: alchemlyb.parsing.gmx.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.gomc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,4 @@ This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.gomc.extract_dHdl
.. autofunction:: alchemlyb.parsing.gomc.extract_u_nk
.. autofunction:: alchemlyb.parsing.gomc.extract
1 change: 1 addition & 0 deletions docs/parsing/alchemlyb.parsing.namd.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,4 @@ API Reference
This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.namd.extract_u_nk
.. autofunction:: alchemlyb.parsing.namd.extract
17 changes: 17 additions & 0 deletions src/alchemlyb/parsing/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,20 @@ def wrapper(outfile, T, *args, **kwargs):
dataframe.attrs['energy_unit'] = 'kT'
return dataframe
return wrapper


def _init_attrs_dict(func):
    '''Add temperature and energy units to each parsed dataframe in the returned dict.
    The temperature is added to each dataframe as dataframe.attrs['temperature']
    and the energy unit is initialized as dataframe.attrs['energy_unit'] = 'kT'.
'''
@wraps(func)
def wrapper(outfile, T, *args, **kwargs):
dict_with_df = func(outfile, T, *args, **kwargs)
for k in dict_with_df.keys():
if dict_with_df[k] is not None:
dict_with_df[k].attrs['temperature'] = T
dict_with_df[k].attrs['energy_unit'] = 'kT'
return dict_with_df
return wrapper
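The _init_attrs_dict decorator above tags every non-None frame in the returned dict with metadata. The same pattern in a self-contained sketch (the decorated parser is a made-up stand-in, not part of alchemlyb):

```python
from functools import wraps
import pandas as pd

def init_attrs_dict(func):
    """Attach temperature and energy-unit metadata to every frame in the dict."""
    @wraps(func)
    def wrapper(outfile, T, *args, **kwargs):
        dict_with_df = func(outfile, T, *args, **kwargs)
        for df in dict_with_df.values():
            if df is not None:
                df.attrs['temperature'] = T
                df.attrs['energy_unit'] = 'kT'
        return dict_with_df
    return wrapper

@init_attrs_dict
def fake_extract(outfile, T):
    # Stand-in parser returning the same {"u_nk": ..., "dHdl": ...} shape.
    return {"u_nk": pd.DataFrame({"x": [1.0]}), "dHdl": None}

result = fake_extract("ignored.out", 300.0)
print(result["u_nk"].attrs)   # {'temperature': 300.0, 'energy_unit': 'kT'}
```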
191 changes: 97 additions & 94 deletions src/alchemlyb/parsing/amber.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
"""Parsers for extracting alchemical data from `Amber <http://ambermd.org>`_ output files.
Most of the file parsing parts are inherited from
Some of the file parsing parts are adapted from
`alchemical-analysis`_.
.. _alchemical-analysis: https://github.com/MobleyLab/alchemical-analysis
Expand All @@ -14,13 +14,15 @@
import numpy as np

from .util import anyopen
from . import _init_attrs
from . import _init_attrs_dict
from ..postprocessors.units import R_kJmol, kJ2kcal

logger = logging.getLogger("alchemlyb.parsers.Amber")

k_b = R_kJmol * kJ2kcal

_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def convert_to_pandas(file_datum):
"""Convert the data structure from numpy to pandas format"""
Expand All @@ -40,11 +42,6 @@ def convert_to_pandas(file_datum):
return df


DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'


def _pre_gen(it, first):
"""A generator that returns first first if it exists."""

Expand All @@ -58,7 +55,7 @@ def _pre_gen(it, first):
return


class SectionParser(object):
class SectionParser():
"""
A simple parser to extract data values from sections.
"""
Expand Down Expand Up @@ -129,14 +126,11 @@ def extract_section(self, start, end, fields, limit=None, extra='',
def __iter__(self):
return self

def next(self):
def __next__(self):
"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
return next(self.fileh)

# make compatible with python 3.6
__next__ = next

def close(self):
"""Close the filehandle."""
self.fileh.close()
Expand All @@ -148,11 +142,11 @@ def __exit__(self, typ, value, traceback):
self.close()


class FEData(object):
class FEData():
"""A simple struct container to collect data from individual files."""

__slots__ = ['clambda', 't0', 'dt', 'T', 'ntpr', 'gradients',
'component_gradients', 'mbar_energies',
'mbar_energies',
'have_mbar', 'mbar_lambdas', 'mbar_lambda_idx']

def __init__(self):
Expand All @@ -162,7 +156,6 @@ def __init__(self):
self.T = -1.0
self.ntpr = -1
self.gradients = []
self.component_gradients = []
self.mbar_energies = []
self.have_mbar = False
self.mbar_lambdas = []
Expand Down Expand Up @@ -206,13 +199,12 @@ def file_validation(outfile):
mbar_ndata = int(nstlim / mbar_ndata)
mbar_lambdas = _process_mbar_lambdas(secp)
file_datum.mbar_lambdas = mbar_lambdas
clambda_str = '%6.4f' % clambda
clambda_str = f'{clambda:6.4f}'

if clambda_str not in mbar_lambdas:
        logger.warning('WARNING: lambda %s not contained in set of '
                       'MBAR lambdas: %s\nNot using MBAR.',
                       clambda_str, ', '.join(mbar_lambdas))

have_mbar = False
else:
mbar_nlambda = len(mbar_lambdas)
Expand Down Expand Up @@ -240,9 +232,9 @@ def file_validation(outfile):
file_datum.have_mbar = have_mbar
return file_datum

@_init_attrs
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.
@_init_attrs_dict
def extract(outfile, T):
"""Return reduced potentials `u_nk` and gradients `dH/dl` from Amber outputfile.
Parameters
----------
Expand All @@ -254,61 +246,94 @@ def extract_u_nk(outfile, T):
Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).
Dict
A dictionary with keys of 'u_nk', which is a pandas DataFrame of reduced potentials for each
alchemical state (k) for each frame (n), and 'dHdl', which is a Series of dH/dl
as a function of time for this lambda window.
.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.
.. versionadded:: 1.0.0
"""

beta = 1/(k_b * T)

file_datum = file_validation(outfile)
if not file_validation(outfile): # pragma: no cover
return None
if not file_validation(outfile):
return {"u_nk": None, "dHdl": None}

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)

if not file_datum.have_mbar:
raise Exception('ERROR: No MBAR energies found! Cannot parse file.')
finished = False
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
high_E_cnt = 0
nensec = 0
old_nstep = -1
for line in secp:
if line.startswith('MBAR Energy analysis'):
if " A V E R A G E S O V E R" in line:
_ = secp.skip_after('^|=========================================')
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
elif line.startswith('MBAR Energy analysis') and file_datum.have_mbar:
mbar = secp.extract_section('^MBAR', '^ ---', file_datum.mbar_lambdas,
extra=line)

if None in mbar:
if None in mbar: # pragma: no cover
continue

E_ref = mbar[file_datum.mbar_lambda_idx]

for lmbda, E in enumerate(mbar):
if E > 0.0:

reference_energy = mbar[file_datum.mbar_lambda_idx]
for lmbda, energy in enumerate(mbar):
if energy > 0.0:
high_E_cnt += 1

file_datum.mbar_energies[lmbda].append(beta * (E - E_ref))
file_datum.mbar_energies[lmbda].append(beta * (energy - reference_energy))
elif line == ' 5. TIMINGS\n':
finished = True
break

if high_E_cnt:
logger.warning('%i MBAR energ%s > 0.0 kcal/mol',
high_E_cnt, 'ies are' if high_E_cnt > 1 else 'y is')

time = [file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]
if not finished: # pragma: no cover
logger.warning('WARNING: file %s is a prematurely terminated run' % outfile)

if file_datum.have_mbar:
mbar_time = [
file_datum.t0 + (frame_index + 1) * file_datum.dt * file_datum.ntpr
for frame_index in range(len(file_datum.mbar_energies[0]))]

mbar_df = pd.DataFrame(
file_datum.mbar_energies,
index=np.array(file_datum.mbar_lambdas, dtype=np.float64),
columns=pd.MultiIndex.from_arrays(
[mbar_time, np.repeat(file_datum.clambda, len(mbar_time))], names=['time', 'lambdas'])
).T
else:
logger.info('WARNING: No MBAR energies found! "u_nk" entry will be None')
mbar_df = None

if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any dV/dl data' % outfile)
dHdl_df = None
else:
logger.info('Read %s DV/DL data points in file %s' % (nensec, outfile))
dHdl_df = convert_to_pandas(file_datum)
dHdl_df['dHdl'] *= beta

return {"u_nk": mbar_df, "dHdl": dHdl_df}

return pd.DataFrame(file_datum.mbar_energies,
columns=pd.MultiIndex.from_arrays([time, np.repeat(file_datum.clambda, len(time))],
names=['time', 'lambdas']),
index=np.array(file_datum.mbar_lambdas, dtype=np.float64)).T
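The u_nk frame assembled in extract above ends up, after the transpose, with a (time, lambdas) MultiIndex on the rows and one column per lambda state. A self-contained illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the FEData fields.
mbar_lambdas = ['0.0000', '0.5000', '1.0000']          # states k
clambda = 0.5                                          # lambda of this window
mbar_energies = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]   # one row per state, one entry per frame
mbar_time = [1.0, 2.0]                                 # time of each frame n

u_nk = pd.DataFrame(
    mbar_energies,
    index=np.array(mbar_lambdas, dtype=np.float64),
    columns=pd.MultiIndex.from_arrays(
        [mbar_time, np.repeat(clambda, len(mbar_time))],
        names=['time', 'lambdas'])
).T

print(u_nk.shape)               # (2, 3): one row per frame, one column per state
print(list(u_nk.index.names))   # ['time', 'lambdas']
```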

@_init_attrs
def extract_dHdl(outfile, T):
"""Return gradients ``dH/dl`` from Amber TI outputfile.
Expand All @@ -330,56 +355,34 @@ def extract_dHdl(outfile, T):
the constants used by the corresponding MD engine.
"""
beta = 1/(k_b * T)
extracted = extract(outfile, T)
return extracted['dHdl']

file_datum = file_validation(outfile)
if not file_validation(outfile):
return None

if not np.isclose(T, file_datum.T, atol=0.01):
msg = f'The temperature read from the input file ({file_datum.T:.2f} K)'
msg += f' is different from the temperature passed as parameter ({T:.2f} K)'
logger.error(msg)
raise ValueError(msg)
def extract_u_nk(outfile, T):
"""Return reduced potentials `u_nk` from Amber outputfile.
finished = False
comps = []
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
nensec = 0
old_nstep = -1
into_average_section = False
for line in secp:
if 'DV/DL, AVERAGES OVER' in line \
or " A V E R A G E S O V E R" in line:
into_average_section = True
if line.startswith(' NSTEP') and into_average_section:
_ = secp.skip_lines(1)
into_average_section = False
elif line.startswith(' NSTEP'):
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None \
and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
if line == ' 5. TIMINGS\n':
finished = True
break
if not finished: # pragma: no cover
logger.warning(' WARNING: prematurely terminated run')
if not nensec: # pragma: no cover
logger.warning('WARNING: File %s does not contain any DV/DL data',
outfile)
logger.info(f'Read {nensec} DV/DL data points')
# at this step we get info stored in the FEData object for a given amber out file
file_datum.component_gradients.extend(comps)
# convert file_datum to the pandas format to make it identical to alchemlyb output format
df = convert_to_pandas(file_datum)
df['dHdl'] *= beta
return df
Parameters
----------
outfile : str
Path to Amber .out file to extract data from.
T : float
Temperature in Kelvin at which the simulations were performed;
        needed to generate the reduced potential (in units of kT)
Returns
-------
u_nk : DataFrame
Reduced potential for each alchemical state (k) for each frame (n).
.. versionchanged:: 0.5.0
The :mod:`scipy.constants` is used for parsers instead of
the constants used by the corresponding MD engine.
"""
extracted = extract(outfile, T)
return extracted['u_nk']


def _process_mbar_lambdas(secp):
Expand Down Expand Up @@ -410,7 +413,7 @@ def _process_mbar_lambdas(secp):
if 'total' in line:
data = line.split()
mbar_lambdas.extend(data[2:])
else:
else: # pragma: no cover
mbar_lambdas.extend(line.split())

return mbar_lambdas
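_process_mbar_lambdas above collects the lambda values listed in the MBAR section of the output; on the line containing 'total', the first two tokens are skipped before the values. A self-contained sketch of that loop (the sample lines are invented, not real AMBER output):

```python
def process_mbar_lambdas(lines):
    """Collect MBAR lambda strings, skipping the leading count/'total' tokens."""
    mbar_lambdas = []
    for line in lines:
        if 'total' in line:
            data = line.split()
            mbar_lambdas.extend(data[2:])    # drop the two leading tokens
        else:
            mbar_lambdas.extend(line.split())
    return mbar_lambdas

sample = ['3 total:  0.0000 0.5000', '1.0000']   # invented sample lines
print(process_mbar_lambdas(sample))              # ['0.0000', '0.5000', '1.0000']
```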