Merge pull request #12 from EPCCed/training-implementation

Training implementation
EPCCed · May 1, 2024 · 2ddb538 · 2ddb538
2 parents 71b346e + 406f63c
commit 2ddb538
Show file tree

Hide file tree

Showing 29 changed files with 590 additions and 515 deletions.
diff --git a/docs/data-generation.md b/docs/data-generation.md
@@ -1,6 +1,6 @@
 # Data Generation
 
-Following the structure given in the [general data generation](ML_training.md) case, this page describes our implementation for the Hasegawa-Wakatani example.
+Following the structure given in the [general data generation](ML_training.md) case, this page describes the data generation phase of our implementation for the Hasegawa-Wakatani example.
 
 1. Fine- and coarse-grained resolutions. We chose:
  - 1024x1024 for fine-grained simulations; and
@@ -30,24 +30,30 @@ Following the structure given in the [general data generation](ML_training.md) c
     cmake --build build --target hasegawa-wakatani
     ```
 
-    Before simulating the training data, a burn-in run must be conducted at the desired resolution. For an example of this, see [fine_init.sh](../files/data-generation/fine_init.sh). Edit `<account>` on line 9 and `x01` in lines containing paths to match your `$WORK` and desired `/scratch` locations and submit via `sbatch fine_init.sh`.
+    Before simulating the training data, a burn-in run must be conducted at the desired resolution. For an example of this, see [fine_init.sh](../files/1-data-generation/fine_init.sh). Edit `<account>` on line 9 and `x01` in lines containing paths to match your `$WORK` and desired `/scratch` locations and submit via `sbatch fine_init.sh`.
 
-    Following that, we run a number of sequentially trajectories to generate fine-grained ground-truth data. See [fine_trajectories.sh](../files/data-generation/fine_trajectories.sh)
+    Following that, we run a number of sequentially trajectories to generate fine-grained ground-truth data. See [fine_trajectories.sh](../files/1-data-generation/fine_trajectories.sh)
 
     The initial simulation produces "restart files", `/scratch/space1/x01/data/my-scratch-data/initial/data/BOUT.restart.*.nc` from which a simulation can be continued. Those, as well as the input file (`/scratch/space1/x01/data/my-scratch-data/initial/data/BOUT.inp` should be placed in `/scratch/space1/x01/data/my-scratch-data/0`.
 
     Edit line 17 (`for TRAJ_INDEX in {1..10}`) to give the desired number of trajectories. Edit `<account>` on line 9 and `x01` in lines containing paths for your project and submit via `sbatch fine_trajectories.sh`.
 
 3. Coarsen selected simulation snapshots.
 
-    Fine-grained data must be coarsened to match the desired coarse-grained resolution. This can be done via interpolation for a general solution. Files in [files/coarsening](../files/coarsening) perform this task. Submit `submit-resize.sh` via `sbatch submit-resize.sh`.
+    Fine-grained data must be coarsened to match the desired coarse-grained resolution. This can be done via interpolation for a general solution. Files in [files/2-coarsening](../files/2-coarsening) perform this task. Submit `submit-resize.sh` via `sbatch submit-resize.sh`.
 
     _Note: this operates on one trajectory at a time and will therefore need to be repeated for each trajectory run in step 2.
 
 4. Single-timestep coarse simulations.
 
-    With the previous step having extracted fine-grained data for each time step (and each trajectory for which it was repeated), we now need to run a single-timestep coarse-grained simulation. To do this, see [files/coarse_simulations](../files/coarse_simulations/). Submitting [run_coarse_sims.sh](../files/coarse_simulations/run_coarse_sims.sh) will run a single step simulation for each coarsened timestep created in the previous step.
+    With the previous step having extracted fine-grained data for each time step (and each trajectory for which it was repeated), we now need to run a single-timestep coarse-grained simulation. To do this, see [files/3-coarse_simulations](../files/3-coarse_simulations/). Submitting [run_coarse_sims.sh](../files/3-coarse_simulations/run_coarse_sims.sh) will run a single step simulation for each coarsened timestep created in the previous step.
 
-Subsequent steps: calculating the error; reformatting data for ingestion into TensorFlow; and model training are covered in [ML model training implementation](training_implementation.md).
+5. Generating training data.
+
+    We now have all of the data required to train the ML models, but not in the format we require. Files in [files/4-training-data](../files/4-training-data) perform this task. Edit [files/4-training-data/](../files/4-training-data/sub_gen_training_nc.sh) and [files/4-training-data/gen_training_nc.py](../files/4-training-data/gen_training_nc.py) so that the paths work with your setup.
+
+    _Note: paths are hardcoded in [files/4-training-data/gen_training_nc.py](../files/4-training-data/gen_training_nc.py), not read in from the command line.
+
+Subsequent steps: calculating the error and model training are covered in [ML model training implementation](training_implementation.md).
 
 
diff --git a/docs/training_implementation.md b/docs/training_implementation.md
@@ -0,0 +1,11 @@
+# ML Model Training
+
+Following on from the [data generation phase](data-generation.md) of our implementation for the Hasegawa-Wakatani example, this page describes how we train our ML models.
+
+1. Error calculation.
+
+    We are at the stage of having fine-grained simulation trajectories, and from those, extracted data for each timestep, coarsened that data, and run single-timestep coarse-grained simulations.
+
+    Now, the task is to take the difference between timestep 1 and timestep 0 of those coarse-grained simulations.
+
+    UNFINISHED - email [email protected] if you get this far!!
diff --git a/files/data-generation/fine_init.sh → files/1-data-generation/fine_init.sh b/files/data-generation/fine_init.sh → files/1-data-generation/fine_init.sh
diff --git a/files/data-generation/fine_trajectories.sh → files/1-data-generation/fine_trajectories.sh b/files/data-generation/fine_trajectories.sh → files/1-data-generation/fine_trajectories.sh
diff --git a/files/coarsening/resize_trajectory.py → files/2-coarsening/resize_trajectory.py b/files/coarsening/resize_trajectory.py → files/2-coarsening/resize_trajectory.py
diff --git a/files/coarsening/restart.py → files/2-coarsening/restart.py b/files/coarsening/restart.py → files/2-coarsening/restart.py
diff --git a/files/coarsening/submit-resize.sh → files/2-coarsening/submit-resize.sh b/files/coarsening/submit-resize.sh → files/2-coarsening/submit-resize.sh
diff --git a/files/coarse_simulations/coarse_BOUT.inp → files/3-coarse_simulations/coarse_BOUT.inp b/files/coarse_simulations/coarse_BOUT.inp → files/3-coarse_simulations/coarse_BOUT.inp
diff --git a/files/coarse_simulations/run_coarse_sims.sh → ...s/3-coarse_simulations/run_coarse_sims.sh b/files/coarse_simulations/run_coarse_sims.sh → ...s/3-coarse_simulations/run_coarse_sims.sh
diff --git a/files/4-training-data/gen_training_nc.py b/files/4-training-data/gen_training_nc.py
@@ -0,0 +1,52 @@
+#!/usr/env/python
+# Script to generate training .nc files from BOUT coarse sim files
+# Based on traj_netcdf.ipynb
+import sys
+import numpy as np
+import xarray as xr
+from xbout import open_boutdataset
+from tqdm import tqdm, trange
+
+basedir = '//scratch/space1/x01/data/my-scratch-data'
+outdir = '/scratch/space1/x01/data/my-scratch-data/training/training_nc'
+
+def read_traj(traj):
+    dvort0 = []
+    dvort1 = []
+    dn0 = []
+    dn1 = []
+    for i in trange(0, 1001):
+        ds = open_boutdataset(
+                f'{basedir}/trajectory_{traj}/{i}/coarse_sim/BOUT.dmp.*.nc', 
+                info=False)
+        dvort0.append(ds['vort'][0,:,:,:])
+        dvort1.append(ds['vort'][1,:,:,:])
+        dn0.append(ds['n'][0,:,:,:])
+        dn1.append(ds['n'][1,:,:,:])
+    tvort0 = xr.concat(dvort0[1:], 't')
+    tn0 = xr.concat(dn0[1:], 't')
+    tvort1 = xr.concat(dvort1[:1001], 't')
+    tn1 = xr.concat(dn1[:1001], 't')
+    d0 = xr.merge([tvort0,tn0])
+    d1 = xr.merge([tvort1,tn1])
+    return d0, d1
+
+def clean(ds):
+    if 'metadata' in ds.attrs:
+        del ds.attrs['metadata']
+    if 'options' in ds.attrs:
+        del ds.attrs['options']
+    for variable in ds.variables.values():
+        if 'metadata' in variable.attrs:
+            del variable.attrs['metadata']
+        if 'options' in variable.attrs:
+            del variable.attrs['options']
+
+traj = sys.argv[1]
+d0, d1 = read_traj(traj)
+clean(d0)
+clean(d1)
+d0.to_netcdf(f'{outdir}/gt_traj_{traj}.nc')
+d1.to_netcdf(f'{outdir}/sim_traj_{traj}.nc')
+#err=d0-d1
+#err.to_netcdf(f'{outdir}/err_traj_{traj}.nc')
diff --git a/files/4-training-data/sub_gen_training_nc.sh b/files/4-training-data/sub_gen_training_nc.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+# #SBATCH --exclusive
+#SBATCH --time=01:00:00
+#SBATCH --partition=standard
+#SBATCH --qos=standard
+#SBATCH --account=<account>
+
+eval "$(/work/x01/x01/$USER/miniconda3/bin/conda shell.bash hook)"
+conda activate boutsmartsim
+
+TRAJECTORY=1
+
+python gen_training_nc.py $TRAJECTORY
diff --git a/files/5-training/README.md b/files/5-training/README.md
@@ -0,0 +1 @@
+Training pipeline for training the error correction ML model.
diff --git a/files/5-training/data_read.py b/files/5-training/data_read.py
@@ -0,0 +1,69 @@
+""" Functions to load/augment training dataset """
+import tensorflow as tf
+import numpy as np
+import netCDF4 as nc
+
+from typing import List, Tuple, Dict
+
+def extract_array_data(file_path: str, args) -> np.ndarray:
+  dataset = nc.Dataset(file_path, 'r')
+
+  # number of ghost cells in x dimension
+  gx = 2
+  # extract vorticity and density without ghost cells and remove unit y direction
+  vort_array = np.squeeze(dataset.variables['vort'][:,gx:-gx,:,:])
+  dens_array = np.squeeze(dataset.variables['n'][:,gx:-gx,:,:])
+  dataset.close()
+
+  if args.vort_only:
+    flow_image = np.stack([vort_array], axis=-1)
+  elif args.dens_only:
+    flow_image = np.stack([dens_array], axis=-1)
+  else:
+    flow_image = np.stack([vort_array, dens_array], axis=-1)
+  return flow_image
+
+def translate_augmentation(fields: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]:
+  coarse_image, error_image = fields['coarse'], fields['error']
+  #commented out for testing
+  #if coarse_image.shape != error_image.shape:
+   # raise ValueError(f"Coarse grained data and error should be same shape (got {coarse_image.shape} and {error_image.shape} respectively).")
+  shape = tf.shape(coarse_image)
+  nx, nz = shape[0], shape[1]
+  shift_x = tf.random.uniform(shape=[], minval=0, maxval=nx-1, dtype=tf.int32)
+  shift_z = tf.random.uniform(shape=[], minval=0, maxval=nz-1, dtype=tf.int32)
+
+  # apply same shift to coarse snapshot and error
+  coarse_shifted = tf.roll(coarse_image, shift_x, 0)
+  coarse_shifted = tf.roll(coarse_shifted, shift_z, 1)
+  error_shifted = tf.roll(error_image, shift_x, 0)
+  error_shifted = tf.roll(error_shifted, shift_z, 1)
+  return {'coarse': coarse_shifted, 'error': error_shifted}
+
+def data_generator(ground_truth_file_names: List[str], coarse_grained_file_names: List[str], args):
+  for gt_file, cg_file in zip(ground_truth_file_names, coarse_grained_file_names):
+    raw_data_gt = extract_array_data(gt_file, args)
+    raw_data_cg = extract_array_data(cg_file, args)[1:]
+    error = raw_data_gt - raw_data_cg
+
+    # Reshape tensors to have dynamic dimensions
+    raw_data_cg = tf.convert_to_tensor(raw_data_cg, dtype=tf.float64)
+    error = tf.convert_to_tensor(error, dtype=tf.float64)
+    for i in range(raw_data_cg.shape[0]):
+      yield {'coarse': raw_data_cg[i], 'error': error[i]}
+
+def generate_augmented_dataset(
+  ground_truth_file_names: List[str],
+  coarse_grained_file_names: List[str],
+  args,
+) -> tf.data.Dataset:
+  if args.vort_only or args.dens_only:
+    channels = 1
+  else:
+    channels = 2
+  dataset = tf.data.Dataset.from_generator(
+    lambda: data_generator(ground_truth_file_names, coarse_grained_file_names, args),
+    output_signature={'coarse': tf.TensorSpec(shape=(None, None, channels), dtype=tf.float64),
+    'error': tf.TensorSpec(shape=(None, None, channels), dtype=tf.float64)}
+    )
+  return dataset.map(translate_augmentation)
diff --git a/files/5-training/model.py b/files/5-training/model.py
@@ -0,0 +1,82 @@
+from tensorflow.keras.layers import Input, Normalization, Conv2D
+from tensorflow.keras import Model
+from tensorflow import keras
+from typing import Tuple
+
+import padding
+
+'''
+preprocessing:
+
+- "cyclic" padding along the flow direction (wall dimension is 0-padded):
+    the input padded with   2 columns on both sides that 'wrap' around (see np.pad('wrap'))
+                            two rows of 0 on top and bottom
+    ! needs to be reapplied after every convolution layer
+'''
+
+'''
+TODO/preprocessing:
+
+- rescale input to [-1, 1]:
+    To rescale an input in the [0, 255] range to be in the [-1, 1] range, 
+        you would pass scale=1./127.5, offset=-1.
+        in general: scale = scaled_max/(max * .5) 
+                    offset = min + scaled_min
+    keras.layers.Rescaling(scale, offset=0.0, **kwargs)
+
+    keras seems not to support min-max scaling with variable min/max, so will normalise instead
+'''
+
+def kochkov_cnn(image_shape: Tuple[int]) -> keras.Model:
+  """ Todo: need a more automated way of padding e.g. if we adjust filter size. """
+  model = keras.Sequential()
+
+  model.add(Input(shape=image_shape))
+
+  #1
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+#  model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters=64, kernel_size=3, padding ='valid', activation='relu'))
+
+  #2
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+#  model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters =64, kernel_size =3, padding ='valid', activation='relu'))
+
+  #3
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+#  model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters =64, kernel_size =3, padding ='valid', activation='relu'))
+
+  #4
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+#  model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters =64, kernel_size =3, padding ='valid', activation='relu'))
+
+  #5
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+#  model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters =64, kernel_size =3, padding ='valid', activation='relu'))
+
+  #6
+  # pad
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+ # model.add(keras.layers.ZeroPadding2D(padding=(1,0)))
+
+  model.add(Conv2D (filters =64, kernel_size =3, padding ='valid', activation='relu'))
+
+  # output
+  model.add(padding.CyclicPadding2D(padding=(1,1)))
+
+  model.add(Conv2D(filters=image_shape[2], kernel_size =3, padding ='valid', activation='linear'))
+  return model
diff --git a/files/5-training/padding.py b/files/5-training/padding.py
@@ -0,0 +1,75 @@
+import tensorflow as tf
+
+# from keras.engine.base_layer import Layer
+# from keras.engine.input_spec import InputSpec
+# from keras.utils import conv_utils
+from tensorflow.keras.layers import Layer
+from tensorflow.keras.layers import InputSpec
+from tensorflow.python.keras.utils import conv_utils
+
+# some ideas here:
+#  https://stackoverflow.com/questions/54911015/keras-convolution-layer-on-images-coming-from-circular-cyclic-domain
+
+class CyclicPadding2D(Layer):
+    def __init__(self, padding=(1, 1), data_format=None, **kwargs):
+      super().__init__(**kwargs)
+      self.data_format = conv_utils.normalize_data_format(data_format)
+      if len(padding) != 2:
+        raise ValueError('`padding` should have two elements. '
+                         f'Received: {padding}.')
+      self.padding = padding
+      self.input_spec = InputSpec(ndim=4)
+
+    def get_config(self):
+        config = super().get_config()
+        config.update({
+            "padding": self.padding,
+            "data_format": self.data_format,
+        })
+        return config
+
+    def compute_output_shape(self, input_shape):
+        input_shape = tf.TensorShape(input_shape).as_list()
+        if self.data_format == 'channels_first':
+          if input_shape[2] is not None:
+            rows = input_shape[2] + 2 * self.padding[0]
+          else:
+            rows = None
+          if input_shape[3] is not None:
+            cols = input_shape[3] + 2 * self.padding[1]
+          else:
+            cols = None
+          return tf.TensorShape(
+            [input_shape[0], input_shape[1], rows, cols])
+        elif self.data_format == 'channels_last':
+          if input_shape[1] is not None:
+              rows = input_shape[1] + 2 * self.padding[0]
+          else:
+              rows = None
+          if input_shape[2] is not None:
+              cols = input_shape[2] + 2 * self.padding[1]
+          else:
+              cols = None
+          return tf.TensorShape([input_shape[0], rows, cols, input_shape[3]])
+
+    def call(self, inputs):
+        tensor = inputs
+        ndim = len(inputs.shape)
+        for ax, pd in enumerate(self.padding):
+            if self.data_format == "channels_last":
+                #(batch, rows, cols, channels)
+                axis = 1 + ax
+            elif self.data_format == "channels_first":
+                #(batch, channels, rows, cols)
+                axis = 2 + ax
+            else:
+                return
+            sl_start = [slice(None, pd) if i == axis else slice(None) for i in range(ndim)]
+            sl_end = [slice(-pd, None) if i == axis else slice(None) for i in range(ndim)]
+            tensor = tf.concat([
+                tensor[sl_end],
+                tensor,
+                tensor[sl_start],
+            ], axis)
+
+        return tensor
diff --git a/files/5-training/submit-training.sh b/files/5-training/submit-training.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+#
+#SBATCH --partition=gpu
+#SBATCH --qos=gpu
+#SBATCH --gres=gpu:1
+#SBATCH --time=48:00:00
+#SBATCH --account=x01
+
+CUDA_VERSION=11.6
+CUDNN_VERSION=8.6.0-cuda-${CUDA_VERSION}
+TENSORRT_VERSION=8.4.3.1-u2
+
+module load intel-20.4/compilers
+module load nvidia/cudnn/${CUDNN_VERSION}
+module load nvidia/tensorrt/${TENSORRT_VERSION}
+module load nvidia/nvhpc
+
+conda activate boutsmartsim
+
+cd /path/to/SiMLInt/files/training
+
+# choose appropriate parameters here
+python training.py --epochs 100 --batch-size 32 --learning-rate 0.0001