
Revise episodes 1 through 9 for readability #13

Open: wants to merge 44 commits into base: main

Changes from 35 commits

Commits (44)
4dd0470
introduction.md: readability pass 1
wmotion Jun 30, 2023
8b66c41
Merge pull request #1 from wmotion/readability
wmotion Jun 30, 2023
c82bf08
introduction.md: breaks code lines at end of sentences
wmotion Jun 30, 2023
c4b991f
introduction.md: readability pass 2
wmotion Jun 30, 2023
e30edc7
Merge branch 'linebreaks' into readability
wmotion Jun 30, 2023
18ba47d
introduction.md: merge recent edits for readability
wmotion Jun 30, 2023
1a8c9db
benchmarking.md: readability pass 1
wmotion Jun 30, 2023
956b8c2
Merge benchmarking.md readability pass 1
wmotion Jun 30, 2023
1024ef6
computing-pi.md: readability pass 1
wmotion Jun 30, 2023
390aa98
Merge computing-pi from readability into main
wmotion Jun 30, 2023
b0e9ca6
threads-and-processes.md: readability pass 1
wmotion Jun 30, 2023
e1113f2
Merge threads-and-processes from readability into main
wmotion Jun 30, 2023
a6d8652
delayed-evaluation: readability pass 1
wmotion Jun 30, 2023
e144c94
delayed-evaluation.md: from readability to main
wmotion Jun 30, 2023
a1c3497
map-and-reduce.md: readability pass 1
wmotion Jul 1, 2023
3edd68c
map-and-reduce.md: merge readability into main
wmotion Jul 1, 2023
d1286d0
exercise-with-fractals.md: readability pass 1
wmotion Jul 1, 2023
4535ca4
exercise-with-fractals.md: merge readability into main
wmotion Jul 1, 2023
920252f
extra-asyncio.md: readability pass 1
wmotion Jul 1, 2023
4908d78
extra-asyncio.md: merge branch 'readability' into main
wmotion Jul 1, 2023
e431cee
extra-external-c.md: readability pass 1
wmotion Jul 1, 2023
6f57683
extra-external-c.md: merge branch 'readability' into main
wmotion Jul 1, 2023
c8b1b7a
introduction.md: readability pass 2
wmotion Jul 1, 2023
d6c585d
benchmarking.md: readability pass 2
wmotion Jul 1, 2023
ca9bce9
computing.md: readability pass 2
wmotion Jul 1, 2023
597ae7f
thread-and-processes.md: readability pass 2
wmotion Jul 1, 2023
f66ca8f
delayed-evaluation.md: readability pass 2
wmotion Jul 1, 2023
05b6fb0
thread-and-processes.md: readability pass 2
wmotion Jul 1, 2023
f88c0d0
map-and-reduce.md: readability pass 2
wmotion Jul 1, 2023
f068d69
exercise-with-fractals.md: readability pass 2
wmotion Jul 1, 2023
1699299
extra-asyncio.md: readability pass 2
wmotion Jul 1, 2023
dc83178
extra-external-c.md: readability pass 2
wmotion Jul 1, 2023
7816c23
introduction.md: review code snippets
wmotion Jul 2, 2023
dd10bac
benchmarking.md: revise code snippets
wmotion Jul 2, 2023
0b65df2
threads-and-processes.md: revise snippet code
wmotion Jul 3, 2023
1cf958c
Merge pull request #10 from esciencecenter-digital-skills/main
wmotion Jul 3, 2023
4ebee2b
delayed-evaluation.md: revise code snippets
wmotion Jul 3, 2023
37607a2
map-and-reduce.md: revise snipped code
wmotion Jul 3, 2023
92619e0
benchmarking.md: remark implemented
wmotion Jul 3, 2023
cda9ca2
computing-pi.md: implement remarks, with rephrasing
wmotion Jul 3, 2023
6fb3c66
delayed-evaluation.md: implement remarks
wmotion Jul 3, 2023
ecd90ce
map-and-reduce.md: implement remarks
wmotion Jul 3, 2023
fa7eeac
threads-and-processes.md: implement remaks
wmotion Jul 3, 2023
bd779a5
exercise-with-fractals.md: revise snippet code (part I)
wmotion Jul 3, 2023
87 changes: 42 additions & 45 deletions episodes/benchmarking.md
@@ -5,23 +5,23 @@ exercises: 20
---

:::questions
- How do we know our program ran faster?
- How do we learn about efficiency?
- How do we know whether our program ran faster in parallel?
- How do we appraise efficiency?
:::

:::objectives
- View performance on system monitor
- Find out how many cores your machine has
- Use `%time` and `%timeit` line-magic
- Use a memory profiler
- Plot performance against number of work units
- Understand the influence of hyper-threading on timings
- View performance on system monitor.
- Find out how many cores your machine has.
- Use `%time` and `%timeit` line-magic.
- Use a memory profiler.
- Plot performance against number of work units.
- Understand the influence of hyper-threading on timings.
:::


# A first example with Dask
We will get into creating parallel programs in Python later. First let's see a small example. Open
your system monitor (this will differ among specific operating systems), and run the following code examples.
We will create parallel programs in Python later. First let's see a small example. Open
your System Monitor (the application will vary between specific operating systems), and run the following code examples:

```python
# Summation making use of numpy:
@@ -39,57 +39,55 @@ result = work.compute()

:::callout
## Try a heavy enough task
It could be that a task this small does not register on your radar. Depending on your computer you will have to raise the power to ``10**8`` or ``10**9`` to make sure that it runs long enough to observe the effect. But be careful and increase slowly. Asking for too much memory can make your computer slow to a crawl.
A task this small may not register on your radar. On your computer you may have to raise the problem size gradually to ``10**8`` or ``10**9`` before the run lasts long enough to observe the effect. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl.
:::
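
The code block above is collapsed in this diff view. A minimal sketch of the comparison it sets up, assuming the same `10**7` problem size (the exact code in the episode may differ slightly), is:

```python
import numpy as np
import dask.array as da

# Eager NumPy summation: the whole array is allocated at once
result = np.arange(10**7).sum()

# Lazy Dask summation: a task graph is built and only runs on .compute()
work = da.arange(10**7).sum()
result = work.compute()
```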

![System monitor](fig/system-monitor.jpg){alt="screenshot of system monitor"}

How can we test this in a more practical way? In Jupyter we can use some line magics, small "magic words" preceded
How can we monitor this more conveniently? In Jupyter we can use some line magics, small "magic words" preceded
by the symbol `%%` that modify the behaviour of the cell.

```python
%%time
np.arange(10**7).sum()
```

The `%%time` line magic checks how long it took for a computation to finish. It does nothing to
change the computation itself. In this it is very similar to the `time` shell command.
The `%%time` line magic checks how long it took for a computation to finish. It does not affect how the computation is performed. In this regard it is very similar to the `time` shell command.

If run the chunk several times, we will notice a difference in the times.
If we run the chunk several times, we will notice variability in the reported times.
How can we trust this timer, then?
A possible solution will be to time the chunk several times, and take the average time as our valid measure.
The `%%timeit` line magic does exactly this in a concise an comfortable manner!
`%%timeit` first measures how long it takes to run a command one time, then
repeats it enough times to get an average run-time. Also, `%%timeit` can measure run times without
the time it takes to setup a problem, measuring only the performance of the code in the cell.
This way we can trust the outcome better.
The `%%timeit` line magic does exactly this in a concise and convenient manner!
`%%timeit` first measures how long it takes to run a command once, then
repeats it enough times to get an average run-time. Also, `%%timeit` can discount the overhead of setting up a problem, measuring only the performance of the code in the cell.
So this outcome is more trustworthy.

```python
%%timeit
np.arange(10**7).sum()
```
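
One way to see the setup-exclusion mentioned above (an addition, not part of the episode): in cell mode, the statement written on the `%%timeit` line itself is treated as setup code, executed but not timed:

```python
%%timeit data = np.arange(10**7)
# the statement on the %%timeit line above is setup: executed but not timed
data.sum()
```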

If you want to store the output of `%%timeit` in a Python variable, you can do so with the `-o` flag.
You can store the output of `%%timeit` in a Python variable using the `-o` flag:

```python
time = %timeit -o np.arange(10**7).sum()
print(f"Time taken: {time.average:.4f}s")
print(f"Time taken: {time.average:.4f} s")
```
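
The object returned via `-o` is IPython's `TimeitResult`; besides `average` it also exposes, for instance, the best run and the spread across runs (attribute names as provided by IPython, shown here as a small sketch):

```python
print(f"Best run:  {time.best:.4f} s")
print(f"Std. dev.: {time.stdev:.4f} s over {time.repeat} runs of {time.loops} loops")
```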

Note that this does not tell you anything about memory consumption or efficiency.
Note that this metric does not tell you anything about memory consumption or efficiency.

# Memory profiling
- The act of systematically testing performance under different conditions is called **benchmarking**.
- Analysing what parts of a program contribute to the total performance, and identifying possible bottlenecks is **profiling**.
- **Benchmarking** is the action of systematically testing performance under different conditions.
- **Profiling** is the analysis of which parts of a program contribute to the total performance, and the identification of possible bottlenecks.

We will use the [`memory_profiler` package](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
We will use the package [`memory_profiler`](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
It can be installed executing the code below in the console:

~~~sh
pip install memory_profiler
~~~

In Jupyter, type the following lines to compare the memory usage of the serial and parallel versions of the code presented above (again, change the value of `10**7` to something higher if needed):
The memory usage of the serial and parallel versions of the code will differ. In Jupyter, type the following lines to see the effect for the code presented above (again, increase the baseline value `10**7` if needed):

```python
import numpy as np
@@ -112,37 +110,37 @@ memory_dask = memory_usage(sum_with_dask, interval=0.01)
# Plot results
plt.plot(memory_numpy, label='numpy')
plt.plot(memory_dask, label='dask')
plt.xlabel('Time step')
plt.ylabel('Memory / MB')
plt.xlabel('Interval counter [-]')
plt.ylabel('Memory usage [MiB]')
plt.legend()
plt.show()
```
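
The middle of the block above is collapsed in the diff; judging from the visible `memory_usage(sum_with_dask, interval=0.01)` line, it defines two argument-free helpers and profiles each of them. A sketch of that part (the helper bodies are assumptions) is:

```python
import numpy as np
import dask.array as da
from memory_profiler import memory_usage

def sum_with_numpy():
    # Eager: allocates the full array in memory
    np.arange(10**7).sum()

def sum_with_dask():
    # Lazy and chunked: only small pieces are in memory at any time
    work = da.arange(10**7).sum()
    work.compute()

memory_numpy = memory_usage(sum_with_numpy, interval=0.01)
memory_dask = memory_usage(sum_with_dask, interval=0.01)
```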

The figure should be similar to the one below:
The plot should be similar to the one below:

![Memory performance](fig/memory.png){alt="showing very high peak for numpy, and constant low line for dask"}

:::challenge
## Exercise (plenary)
Why is the Dask solution more memory efficient?
Why is the Dask solution more memory-efficient?

::::solution
## Solution
Chunking! Dask chunks the large array, such that the data is never entirely in memory.
Chunking! Dask chunks the large array so that the data is never entirely in memory.
::::
:::

:::callout
## Profiling from Dask
Dask has several option to do profiling from Dask itself. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
Dask has several built-in options for profiling. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
:::
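
As a sketch of what those built-in diagnostics look like with the local schedulers (based on the linked documentation; `bokeh` is assumed to be installed for the visualization step):

```python
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler

work = da.arange(10**7).sum()

# Record per-task timings and CPU/memory usage while computing
with Profiler() as prof, ResourceProfiler(dt=0.01) as rprof:
    work.compute()

prof.visualize()    # opens an interactive bokeh plot of the task stream
rprof.visualize()   # CPU and memory usage over time
```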

# Using many cores
Using more cores for a computation can decrease the run time. The first question is of course: how many cores do I have? See the snippets below to find out:
Using more cores for a computation can decrease the run time. The first question is of course: how many cores do I have? See the snippet below to find this out:

:::callout
## Find out how many cores your machine has
The number of cores can be found from Python by executing:
## Find out the number of cores in your machine
The number of cores can be found from Python upon executing:

```python
import psutil
@@ -152,13 +150,12 @@ print(f"The number of physical/logical cores is {N_physical_cores}/{N_logical_cores}")
```
:::
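
As a cross-check that does not require `psutil` (an addition, not in the episode), the standard library can report the core count as well; note that it does not distinguish physical from logical cores:

```python
import os
import multiprocessing

# Both of these count *logical* cores, i.e. hyper-threads included
print(os.cpu_count())
print(multiprocessing.cpu_count())
```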

Usually the number of logical cores is higher than the number of physical course. This is due to *hyper-threading*,
Usually the number of logical cores is higher than the number of physical cores. This is due to *hyper-threading*,
which enables each physical CPU core to execute several threads at the same time. Even with simple examples,
performance may scale unexpectedly. There are many reasons for this, hyper-threading being one of them.
See the ensuing example.

See for instance the example below:

On a machine with 4 physical and 8 logical cores doing this (admittedly oversimplistic) benchmark:
On a machine with 4 physical and 8 logical cores, this admittedly over-simplistic benchmark:

```python
x = []
@@ -167,7 +164,7 @@ for n in range(1, 9):
x.append(time_taken.average)
```

Gives the following result:
gives the result:

```python
import pandas as pd
@@ -179,13 +176,13 @@ data.set_index("n").plot()

:::discussion
## Discussion
Why is the runtime increasing if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the number of physical cores you have.
Why does the runtime increase if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the physical cores you have.
:::
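
A practical consequence, sketched here as an addition to the episode (`num_workers` is the scheduler option assumed), is to cap the worker pool at the number of physical cores:

```python
import dask.array as da
import psutil

n_physical = psutil.cpu_count(logical=False)

work = da.arange(10**7).sum()
# Avoid oversubscribing hyper-threads: use one worker per physical core
result = work.compute(num_workers=n_physical)
```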

:::keypoints
- It is often non-trivial to understand performance
- Memory is just as important as speed
- Measuring is knowing
- Understanding performance is often non-trivial.
- Memory is just as important as speed.
- To measure is to know.
:::


76 changes: 36 additions & 40 deletions episodes/computing-pi.md
@@ -12,27 +12,26 @@ exercises: 30

:::objectives
- Rewrite a program in a vectorized form.
- Understand the difference between data and task-based parallel programming.
- Understand the difference between data-based and task-based parallel programming.
- Apply `numba.jit` to accelerate Python.
:::

# Parallelizing a Python application
In order to recognize the advantages of parallelization we need an algorithm that is easy to parallelize, but still complex enough to take a few seconds of CPU time.
To not scare away the interested reader, we need this algorithm to be understandable and, if possible, also interesting.
We chose a classical algorithm for demonstrating parallel programming: estimating the value of number π.
To recognize the advantages of parallelism we need an algorithm that is easy to parallelize, complex enough to take a few seconds of CPU time, understandable, and, if possible, interesting enough not to scare away the interested learner.
Estimating the value of number $\pi$ is a classical problem to demonstrate parallel programming.

The algorithm we present is one of the classical examples of the power of Monte-Carlo methods.
This is an umbrella term for several algorithms that use random numbers to approximate exact results.
We chose this algorithm because of its simplicity and straightforward geometrical interpretation.
The algorithm we present is a classical demonstration of the power of Monte Carlo methods.
This is a category of algorithms using random numbers to approximate exact results.
This approach is simple and has a straightforward geometrical interpretation.

We can compute the value of π using a random number generator. We count the points falling inside the blue circle M compared to the green square N.
Then π is approximated by the ratio 4M/N.
We can compute the value of $\pi$ using a random number generator. We count the points falling inside the blue circle M compared to the green square N.
The ratio 4M/N then approximates $\pi$.

![Computing Pi](fig/calc_pi_3_wide.svg){alt="the area of a unit sphere contains a multiple of pi"}
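
As a quick sanity check of the formula (illustrative numbers only): the circle covers a fraction $\pi/4$ of the square, so the counts should behave roughly like this:

```python
# The circle covers pi/4 ≈ 0.785 of the square, so for N samples
# we expect M ≈ 0.785 * N points inside, and 4*M/N ≈ pi.
N = 10**6
M = 785_398          # ≈ pi/4 * N
print(4 * M / N)     # 3.141592
```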

:::challenge
## Challenge: Implement the algorithm
Use only standard Python and the function `random.uniform`. The function should have the following
Use only standard Python and the method `random.uniform`. The function should have the following
interface:

```python
@@ -46,7 +45,7 @@ def calc_pi(N):
return ...
```

Also make sure to time your function!
Also, make sure to time your function!

::::solution
## Solution
@@ -75,11 +74,11 @@ def calc_pi(N):
::::
:::
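
The solution block is collapsed in this diff. A plain-Python version that satisfies the stated interface (a sketch, not necessarily the episode's exact code) is:

```python
import random

def calc_pi(N):
    """Estimate pi from N points drawn uniformly from the square [-1, 1] x [-1, 1]."""
    M = 0
    for _ in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1:
            M += 1
    return 4 * M / N
```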

Before we start to parallelize this program, we need to do our best to make the inner function as
efficient as we can. We show two techniques for doing this: *vectorization* using `numpy` and
Before we parallelize this program, the inner function must be as
efficient as we can make it. We show two techniques for doing this: *vectorization* using `numpy`, and
*native code generation* using `numba`.

We first demonstrate a Numpy version of this algorithm.
We first demonstrate a Numpy version of this algorithm:

```python
import numpy as np
@@ -94,10 +93,10 @@ def calc_pi_numpy(N):

This is a **vectorized** version of the original algorithm. It nicely demonstrates **data parallelization**,
where a **single operation** is replicated over collections of data.
It contrasts to **task parallelization**, where **different independent** procedures are performed in
parallel (think for example about cutting the vegetables while simmering the split peas).
It contrasts with **task parallelization**, where **different independent** procedures are performed in
parallel (think, for example, about cutting the vegetables while simmering the split peas).
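
For reference, the vectorized version under discussion (collapsed in the diff above) can be sketched as follows; the exact sampling calls are assumptions:

```python
import numpy as np

def calc_pi_numpy(N):
    # Draw all N points in one go and count those inside the unit circle
    pts = np.random.uniform(-1, 1, (2, N))
    M = np.count_nonzero((pts ** 2).sum(axis=0) < 1)
    return 4 * M / N
```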

If we compare with the 'naive' implementation above, we see that our new one is much faster:
This implementation is much faster than the 'naive' implementation above:

```python
%timeit calc_pi_numpy(10**6)
@@ -110,15 +109,15 @@ If we compare with the 'naive' implementation above, we see that our new one is
:::discussion
## Discussion: is this all better?
What is the downside of the vectorized implementation?
- It uses more memory
- It is less intuitive
- It is a more monolithic approach, i.e. you cannot break it up in several parts
- It uses more memory.
- It is less intuitive.
- It is a more monolithic approach, i.e., you cannot break it up in several parts.
:::

:::challenge
## Challenge: Daskify
Write `calc_pi_dask` to make the Numpy version parallel. Compare speed and memory performance with
the Numpy version. NB: Remember that dask.array mimics the numpy API.
Write `calc_pi_dask` to make the Numpy version parallel. Compare its speed and memory performance with
the Numpy version. NB: Remember that the API of `dask.array` mimics that of Numpy.

::::solution
## Solution
@@ -143,7 +142,7 @@ def calc_pi_dask(N):
:::
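
The solution is collapsed here as well; a sketch of a Daskified version (the chunk size is an arbitrary assumption) is:

```python
import dask.array as da

def calc_pi_dask(N):
    # Same vectorized recipe, but with lazy, chunked Dask arrays
    pts = da.random.uniform(-1, 1, size=(2, N), chunks=(2, N // 10))
    M = da.count_nonzero((pts ** 2).sum(axis=0) < 1)
    return 4 * M.compute() / N
```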

# Using Numba to accelerate Python code
Numba makes it easier to create accelerated functions. You can use it with the decorator `numba.jit`.
Numba makes it easier to create accelerated functions. You can activate it with the decorator `numba.jit`.
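
The decorated function timed in the collapsed block below is presumably a plain summation loop; a sketch consistent with the later `sum_range_numba` calls is:

```python
import numba

@numba.jit
def sum_range_numba(a):
    """Sum the integers in range(a) with an explicit loop."""
    x = 0
    for i in range(a):
        x += i
    return x
```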

```python
import numba
@@ -167,7 +166,7 @@ Let's time three versions of the same test. First, native Python iterators:
190 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Now with Numpy:
Second, with Numpy:

```python
%timeit np.arange(10**7).sum()
@@ -177,7 +176,7 @@ Now with Numpy:
17.5 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

And with Numba:
Third, with Numba:

```python
%timeit sum_range_numba(10**7)
@@ -187,27 +186,24 @@ And with Numba:
162 ns ± 0.885 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

Numba is 100x faster in this case! It gets this speedup with "just-in-time" compilation (JIT)—compiling the Python
function into machine code just before it is called (that's what the `@numba.jit` decorator stands for).
Not every Python and Numpy feature is supported, but a function may be a good candidate for Numba if it is written
with a Python for-loop over a large range of values, as with `sum_range_numba()`.
Numba is a hundred times faster in this case! It gets this speedup with "just-in-time" compilation (JIT), that is, by compiling the Python
function into machine code just before it is called, as the `@numba.jit` decorator indicates.
Numba does not support every Python and Numpy feature, but functions written with a for-loop over a large number of iterations, such as our `sum_range_numba()`, are good candidates.

:::callout
## Just-in-time compilation speedup

The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. In
subsequent calls, the function could be much faster. You may also see this warning when using `timeit`:
The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. The function can then be much faster in subsequent calls. Also, `timeit` may show this warning:

`The slowest run took 14.83 times longer than the fastest. This could mean that an intermediate result is being cached.`

Why does this happen?
On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the
already-compiled function. The compiled function can *only* be reused if it is called with the same argument types
(int, float, etc.).
function previously compiled. The compiled function can *only* be reused if the types of its arguments (int, float, and the like) are the same as at the point of compilation.

See this example where `sum_range_numba` is timed again, but now with a float argument instead of int:
See this example, where `sum_range_numba` is timed once again with a float argument instead of an int:
```python
%time sum_range_numba(10.**7)
%time sum_range_numba(10**7)
%time sum_range_numba(10.**7)
```
```output
@@ -248,17 +244,17 @@ def calc_pi_numba(N):
:::
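
One way to avoid paying the compilation cost on the first call (an addition, not in the episode) is to give Numba an explicit signature, so the function is compiled eagerly at decoration time and only accepts that argument type:

```python
import numba

# Eager compilation: the signature pins the argument and return types up front
@numba.jit("int64(int64)", nopython=True)
def sum_range_eager(a):
    x = 0
    for i in range(a):
        x += i
    return x
```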

:::callout
## Measuring == knowing
## Measuring = knowing
Always profile your code to see which parallelization method works best.
:::

:::callout
## `numba.jit` is not a magical command to solve are your problems
Using numba to accelerate your code often outperforms other methods, but it is not always trivial to rewrite your code so that you can use numba with it.
## `numba.jit` is not a magical command to solve your problems
Accelerating your code with Numba often outperforms other methods, but rewriting code to reap the benefits of Numba is not always trivial.
:::

:::keypoints
- Always profile your code to see which parallelization method works best
- Always profile your code to see which parallelization method works best.
- Vectorized algorithms are both a blessing and a curse.
- Numba can help you speed up code
- Numba can help you speed up code.
:::