
Revise episodes 1 through 9 for readability #13

Open
wants to merge 44 commits into base: main

Changes from all commits (44 commits)
4dd0470
introduction.md: readability pass 1
wmotion Jun 30, 2023
8b66c41
Merge pull request #1 from wmotion/readability
wmotion Jun 30, 2023
c82bf08
introduction.md: breaks code lines at end of sentences
wmotion Jun 30, 2023
c4b991f
introduction.md: readability pass 2
wmotion Jun 30, 2023
e30edc7
Merge branch 'linebreaks' into readability
wmotion Jun 30, 2023
18ba47d
introduction.md: merge recent edits for readability
wmotion Jun 30, 2023
1a8c9db
benchmarking.md: readability pass 1
wmotion Jun 30, 2023
956b8c2
Merge benchmarking.md readability pass 1
wmotion Jun 30, 2023
1024ef6
computing-pi.md: readability pass 1
wmotion Jun 30, 2023
390aa98
Merge computing-pi from readability into main
wmotion Jun 30, 2023
b0e9ca6
threads-and-processes.md: readability pass 1
wmotion Jun 30, 2023
e1113f2
Merge threads-and-processes from readability into main
wmotion Jun 30, 2023
a6d8652
delayed-evaluation: readability pass 1
wmotion Jun 30, 2023
e144c94
delayed-evaluation.md: from readability to main
wmotion Jun 30, 2023
a1c3497
map-and-reduce.md: readability pass 1
wmotion Jul 1, 2023
3edd68c
map-and-reduce.md: merge readability into main
wmotion Jul 1, 2023
d1286d0
exercise-with-fractals.md: readability pass 1
wmotion Jul 1, 2023
4535ca4
exercise-with-fractals.md: merge readability into main
wmotion Jul 1, 2023
920252f
extra-asyncio.md: readability pass 1
wmotion Jul 1, 2023
4908d78
extra-asyncio.md: merge branch 'readability' into main
wmotion Jul 1, 2023
e431cee
extra-external-c.md: readability pass 1
wmotion Jul 1, 2023
6f57683
extra-external-c.md: merge branch 'readability' into main
wmotion Jul 1, 2023
c8b1b7a
introduction.md: readability pass 2
wmotion Jul 1, 2023
d6c585d
benchmarking.md: readability pass 2
wmotion Jul 1, 2023
ca9bce9
computing.md: readability pass 2
wmotion Jul 1, 2023
597ae7f
thread-and-processes.md: readability pass 2
wmotion Jul 1, 2023
f66ca8f
delayed-evaluation.md: readability pass 2
wmotion Jul 1, 2023
05b6fb0
thread-and-processes.md: readability pass 2
wmotion Jul 1, 2023
f88c0d0
map-and-reduce.md: readability pass 2
wmotion Jul 1, 2023
f068d69
exercise-with-fractals.md: readability pass 2
wmotion Jul 1, 2023
1699299
extra-asyncio.md: readability pass 2
wmotion Jul 1, 2023
dc83178
extra-external-c.md: readability pass 2
wmotion Jul 1, 2023
7816c23
introduction.md: review code snippets
wmotion Jul 2, 2023
dd10bac
benchmarking.md: revise code snippets
wmotion Jul 2, 2023
0b65df2
threads-and-processes.md: revise snippet code
wmotion Jul 3, 2023
1cf958c
Merge pull request #10 from esciencecenter-digital-skills/main
wmotion Jul 3, 2023
4ebee2b
delayed-evaluation.md: revise code snippets
wmotion Jul 3, 2023
37607a2
map-and-reduce.md: revise snipped code
wmotion Jul 3, 2023
92619e0
benchmarking.md: remark implemented
wmotion Jul 3, 2023
cda9ca2
computing-pi.md: implement remarks, with rephrasing
wmotion Jul 3, 2023
6fb3c66
delayed-evaluation.md: implement remarks
wmotion Jul 3, 2023
ecd90ce
map-and-reduce.md: implement remarks
wmotion Jul 3, 2023
fa7eeac
threads-and-processes.md: implement remaks
wmotion Jul 3, 2023
bd779a5
exercise-with-fractals.md: revise snippet code (part I)
wmotion Jul 3, 2023
87 changes: 42 additions & 45 deletions episodes/benchmarking.md
@@ -5,23 +5,23 @@ exercises: 20
---

:::questions
- How do we know our program ran faster?
- How do we learn about efficiency?
- How do we know whether our program ran faster in parallel?
- How do we appraise efficiency?
:::

:::objectives
- View performance on system monitor
- Find out how many cores your machine has
- Use `%time` and `%timeit` line-magic
- Use a memory profiler
- Plot performance against number of work units
- Understand the influence of hyper-threading on timings
- View performance on system monitor.
- Find out how many cores your machine has.
- Use `%time` and `%timeit` line-magic.
- Use a memory profiler.
- Plot performance against number of work units.
- Understand the influence of hyper-threading on timings.
:::


# A first example with Dask
We will get into creating parallel programs in Python later. First let's see a small example. Open
your system monitor (this will differ among specific operating systems), and run the following code examples.
We will create parallel programs in Python later. First, let's look at a small example. Open
your System Monitor (the application will vary between operating systems), and run the following code examples:

```python
# Summation making use of numpy:
@@ -39,57 +39,55 @@ result = work.compute()

:::callout
## Try a heavy enough task
It could be that a task this small does not register on your radar. Depending on your computer you will have to raise the power to ``10**8`` or ``10**9`` to make sure that it runs long enough to observe the effect. But be careful and increase slowly. Asking for too much memory can make your computer slow to a crawl.
Your system monitor may not register so small a task. On your computer you may have to raise the problem size gradually to ``10**8`` or ``10**9`` so that the run lasts long enough to observe the effect. But be careful and increase slowly! Asking for too much memory can slow your computer to a crawl.
:::

![System monitor](fig/system-monitor.jpg){alt="screenshot of system monitor"}
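For reference, a self-contained version of the two summations could look like the sketch below (`da` is the usual alias for `dask.array`; the problem size matches the `10**7` used above):

```python
import numpy as np
import dask.array as da

# Summation making use of numpy: the full array is built in memory
result_numpy = np.arange(10**7).sum()

# Summation making use of dask: the work is described lazily in chunks
# and only executed when .compute() is called
work = da.arange(10**7).sum()
result_dask = work.compute()
```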

How can we test this in a more practical way? In Jupyter we can use some line magics, small "magic words" preceded
How can we monitor this more conveniently? In Jupyter we can use some line magics, small "magic words" preceded
by the symbol `%%` that modify the behaviour of the cell.

```python
%%time
np.arange(10**7).sum()
```

The `%%time` line magic checks how long it took for a computation to finish. It does nothing to
change the computation itself. In this it is very similar to the `time` shell command.
The `%%time` line magic checks how long it took for a computation to finish. It does not affect how the computation is performed. In this regard it is very similar to the `time` shell command.

If run the chunk several times, we will notice a difference in the times.
If we run the chunk several times, we will notice variability in the reported times.
How can we trust this timer, then?
A possible solution will be to time the chunk several times, and take the average time as our valid measure.
The `%%timeit` line magic does exactly this in a concise an comfortable manner!
`%%timeit` first measures how long it takes to run a command one time, then
repeats it enough times to get an average run-time. Also, `%%timeit` can measure run times without
the time it takes to setup a problem, measuring only the performance of the code in the cell.
This way we can trust the outcome better.
The `%%timeit` line magic does exactly this in a concise and convenient manner!
`%%timeit` first measures how long it takes to run a command once, then
repeats it enough times to get an average run-time. Moreover, `%%timeit` can exclude the time spent setting up the problem and measure only the performance of the code in the cell.
This makes the outcome more trustworthy.

```python
%%timeit
np.arange(10**7).sum()
```

If you want to store the output of `%%timeit` in a Python variable, you can do so with the `-o` flag.
You can store the output of `%%timeit` in a Python variable using the `-o` flag:

```python
time = %timeit -o np.arange(10**7).sum()
print(f"Time taken: {time.average:.4f}s")
print(f"Time taken: {time.average:.4f} s")
```

Note that this does not tell you anything about memory consumption or efficiency.
Note that this metric does not tell you anything about memory consumption or efficiency.

# Memory profiling
- The act of systematically testing performance under different conditions is called **benchmarking**.
- Analysing what parts of a program contribute to the total performance, and identifying possible bottlenecks is **profiling**.
- **Benchmarking** is the action of systematically testing performance under different conditions.
- **Profiling** is the analysis of which parts of a program contribute to the total performance, and the identification of possible bottlenecks.

We will use the [`memory_profiler` package](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
We will use the package [`memory_profiler`](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
It can be installed by executing the code below in the console:

~~~sh
pip install memory_profiler
~~~

In Jupyter, type the following lines to compare the memory usage of the serial and parallel versions of the code presented above (again, change the value of `10**7` to something higher if needed):
The memory usage of the serial and parallel versions of a program will differ. In Jupyter, type the following lines to see the effect for the code presented above (again, increase the base value of `10**7` if needed):

```python
import numpy as np
@@ -112,37 +110,37 @@ memory_dask = memory_usage(sum_with_dask, interval=0.01)
# Plot results
plt.plot(memory_numpy, label='numpy')
plt.plot(memory_dask, label='dask')
plt.xlabel('Time step')
plt.ylabel('Memory / MB')
plt.xlabel('Interval counter [-]')
plt.ylabel('Memory usage [MiB]')
plt.legend()
plt.show()
```
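Putting these pieces together, a complete version of the comparison might read as follows (the bodies of the two helper functions are assumptions based on the summations shown earlier):

```python
import numpy as np
import dask.array as da
import matplotlib.pyplot as plt
from memory_profiler import memory_usage

def sum_with_numpy():
    # Builds the entire array in memory before summing
    np.arange(10**7).sum()

def sum_with_dask():
    # Sums chunk by chunk, so the full array never resides in memory at once
    work = da.arange(10**7).sum()
    work.compute()

# Sample memory usage every 0.01 s while each function runs
memory_numpy = memory_usage(sum_with_numpy, interval=0.01)
memory_dask = memory_usage(sum_with_dask, interval=0.01)

# Plot results
plt.plot(memory_numpy, label='numpy')
plt.plot(memory_dask, label='dask')
plt.xlabel('Interval counter [-]')
plt.ylabel('Memory usage [MiB]')
plt.legend()
plt.show()
```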

The figure should be similar to the one below:
The plot should be similar to the one below:

![Memory performance](fig/memory.png){alt="showing very high peak for numpy, and constant low line for dask"}

:::challenge
## Exercise (plenary)
Why is the Dask solution more memory efficient?
Why is the Dask solution more memory-efficient?

::::solution
## Solution
Chunking! Dask chunks the large array, such that the data is never entirely in memory.
Chunking! Dask chunks the large array so that the data is never entirely in memory.
::::
:::

:::callout
## Profiling from Dask
Dask has several option to do profiling from Dask itself. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
Dask has several built-in options for profiling. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
:::

# Using many cores
Using more cores for a computation can decrease the run time. The first question is of course: how many cores do I have? See the snippets below to find out:
Using more cores for a computation can decrease the run time. The first question is of course: how many cores do I have? See the snippet below to find this out:

:::callout
## Find out how many cores your machine has
The number of cores can be found from Python by executing:
## Find out the number of cores in your machine
The number of cores can be found from Python upon executing:

```python
import psutil
@@ -152,13 +150,12 @@ print(f"The number of physical/logical cores is {N_physical_cores}/{N_logical_co
```
:::
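A self-contained version of this check could be (assuming `psutil` is installed, for instance with `pip install psutil`):

```python
import psutil

# Physical cores are independent hardware cores; logical cores also count
# the extra hardware threads added by hyper-threading
N_physical_cores = psutil.cpu_count(logical=False)
N_logical_cores = psutil.cpu_count(logical=True)
print(f"The number of physical/logical cores is {N_physical_cores}/{N_logical_cores}")
```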

Usually the number of logical cores is higher than the number of physical course. This is due to *hyper-threading*,
Usually the number of logical cores is higher than the number of physical cores. This is due to *hyper-threading*,
which enables each physical CPU core to execute several threads at the same time. Even with simple examples,
performance may scale unexpectedly. There are many reasons for this, hyper-threading being one of them.
See the example below.

See for instance the example below:

On a machine with 4 physical and 8 logical cores doing this (admittedly oversimplistic) benchmark:
On a machine with 4 physical and 8 logical cores, this admittedly over-simplistic benchmark:

```python
x = []
@@ -167,7 +164,7 @@ for n in range(1, 9):
x.append(time_taken.average)
```

Gives the following result:
gives the result:

```python
import pandas as pd
@@ -179,13 +176,13 @@ data.set_index("n").plot()

:::discussion
## Discussion
Why is the runtime increasing if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the number of physical cores you have.
Why does the runtime increase if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the number of physical cores you have.
:::
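In practice, a simple way to avoid this effect is to cap the number of workers at the number of physical cores, as in the sketch below (here the workload is the Dask summation from the first example, and `num_workers` is a keyword accepted by Dask's local schedulers):

```python
import psutil
import dask.array as da

# Use at most one worker per physical core
n_physical = psutil.cpu_count(logical=False)

work = da.arange(10**7).sum()
result = work.compute(num_workers=n_physical)
```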

:::keypoints
- It is often non-trivial to understand performance
- Memory is just as important as speed
- Measuring is knowing
- Understanding performance is often non-trivial.
- Memory is just as important as speed.
- To measure is to know.
:::


77 changes: 35 additions & 42 deletions episodes/computing-pi.md
@@ -12,27 +12,26 @@ exercises: 30

:::objectives
- Rewrite a program in a vectorized form.
- Understand the difference between data and task-based parallel programming.
- Understand the difference between data-parallel and task-parallel programming.
- Apply `numba.jit` to accelerate Python.
:::

# Parallelizing a Python application
In order to recognize the advantages of parallelization we need an algorithm that is easy to parallelize, but still complex enough to take a few seconds of CPU time.
To not scare away the interested reader, we need this algorithm to be understandable and, if possible, also interesting.
We chose a classical algorithm for demonstrating parallel programming: estimating the value of number π.
In order to recognize the advantages of parallelism, we need an algorithm that is easy to parallelize, complex enough to take a few seconds of CPU time, understandable and, if possible, interesting, so as not to scare away the interested learner.
Estimating the value of the number $\pi$ is a classical problem for demonstrating parallel programming.

The algorithm we present is one of the classical examples of the power of Monte-Carlo methods.
This is an umbrella term for several algorithms that use random numbers to approximate exact results.
We chose this algorithm because of its simplicity and straightforward geometrical interpretation.
The algorithm we present is a classical demonstration of the power of Monte Carlo methods.
This is a category of algorithms using random numbers to approximate exact results.
This approach is simple and has a straightforward geometrical interpretation.

We can compute the value of π using a random number generator. We count the points falling inside the blue circle M compared to the green square N.
Then π is approximated by the ratio 4M/N.
We can estimate $\pi$ with a random number generator: we draw $N$ points uniformly inside the green square and count the number $M$ of points that also fall inside the blue circle.
The ratio $4M/N$ then approximates $\pi$.
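In other words, for points drawn uniformly over the square $[-1, 1] \times [-1, 1]$:

$$\frac{M}{N} \approx \frac{\text{area of circle}}{\text{area of square}} = \frac{\pi \cdot 1^2}{2 \cdot 2} = \frac{\pi}{4}, \qquad \text{so} \qquad \pi \approx \frac{4M}{N}.$$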

![Computing Pi](fig/calc_pi_3_wide.svg){alt="the area of a unit sphere contains a multiple of pi"}

:::challenge
## Challenge: Implement the algorithm
Use only standard Python and the function `random.uniform`. The function should have the following
Use only standard Python and the method `random.uniform`. The function should have the following
interface:

```python
@@ -46,7 +45,7 @@ def calc_pi(N):
return ...
```

Also make sure to time your function!
Also, make sure to time your function!

::::solution
## Solution
@@ -75,11 +74,11 @@ def calc_pi(N):
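A minimal pure-Python implementation consistent with this interface could be the following sketch (one of several possible solutions):

```python
import random

def calc_pi(N):
    """Computes the value of pi using N random points."""
    M = 0
    for i in range(N):
        # Take a sample from the square [-1, 1] x [-1, 1]
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # Count the sample if it falls inside the unit circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N
```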
::::
:::

Before we start to parallelize this program, we need to do our best to make the inner function as
efficient as we can. We show two techniques for doing this: *vectorization* using `numpy` and
Before we parallelize this program, the inner function must be as
efficient as we can make it. We show two techniques for doing this: *vectorization* using `numpy`, and
*native code generation* using `numba`.

We first demonstrate a Numpy version of this algorithm.
We first demonstrate a Numpy version of this algorithm:

```python
import numpy as np
@@ -92,12 +91,9 @@ def calc_pi_numpy(N):
return 4 * M / N
```

This is a **vectorized** version of the original algorithm. It nicely demonstrates **data parallelization**,
where a **single operation** is replicated over collections of data.
It contrasts to **task parallelization**, where **different independent** procedures are performed in
parallel (think for example about cutting the vegetables while simmering the split peas).
This is a **vectorized** version of the original algorithm. A problem written in vectorized form becomes amenable to **data parallelization**, where a single operation is replicated over a large collection of data. Data parallelism contrasts with **task parallelism**, where different independent procedures are performed in parallel. An example of task parallelism is the pea-soup recipe in the introduction.

If we compare with the 'naive' implementation above, we see that our new one is much faster:
This implementation is much faster than the 'naive' implementation above:

```python
%timeit calc_pi_numpy(10**6)
@@ -110,15 +106,15 @@ If we compare with the 'naive' implementation above, we see that our new one is
:::discussion
## Discussion: is this all better?
What is the downside of the vectorized implementation?
- It uses more memory
- It is less intuitive
- It is a more monolithic approach, i.e. you cannot break it up in several parts
- It uses more memory.
- It is less intuitive.
- It is a more monolithic approach, i.e., you cannot break it up in several parts.
:::

:::challenge
## Challenge: Daskify
Write `calc_pi_dask` to make the Numpy version parallel. Compare speed and memory performance with
the Numpy version. NB: Remember that dask.array mimics the numpy API.
Write `calc_pi_dask` to make the Numpy version parallel. Compare its speed and memory performance with
the Numpy version. NB: Remember that the API of `dask.array` mimics that of Numpy.

::::solution
## Solution
@@ -143,7 +139,7 @@ def calc_pi_dask(N):
:::

# Using Numba to accelerate Python code
Numba makes it easier to create accelerated functions. You can use it with the decorator `numba.jit`.
Numba makes it easier to create accelerated functions. You can activate it with the decorator `numba.jit`.
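For instance, a JIT-compiled summation over a range could look like the sketch below (the exact function body behind the timings that follow is an assumption):

```python
import numba

@numba.jit
def sum_range_numba(a):
    """Compute the sum of the integers in the range [0, a)."""
    x = 0
    for i in range(a):
        x += i
    return x
```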

```python
import numba
@@ -167,7 +163,7 @@ Let's time three versions of the same test. First, native Python iterators:
190 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Now with Numpy:
Second, with Numpy:

```python
%timeit np.arange(10**7).sum()
@@ -177,7 +173,7 @@ Now with Numpy:
17.5 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

And with Numba:
Third, with Numba:

```python
%timeit sum_range_numba(10**7)
@@ -187,27 +183,24 @@ And with Numba:
162 ns ± 0.885 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

Numba is 100x faster in this case! It gets this speedup with "just-in-time" compilation (JIT)—compiling the Python
function into machine code just before it is called (that's what the `@numba.jit` decorator stands for).
Not every Python and Numpy feature is supported, but a function may be a good candidate for Numba if it is written
with a Python for-loop over a large range of values, as with `sum_range_numba()`.
Numba is about a hundred times faster in this case! It achieves this speedup with "just-in-time" (JIT) compilation, that is, by compiling the Python
function into machine code just before it is called (this is what the `@numba.jit` decorator indicates).
Numba does not support every Python and Numpy feature, but functions written with a Python for-loop over a large number of iterations, like our `sum_range_numba()`, are good candidates.

:::callout
## Just-in-time compilation speedup

The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. In
subsequent calls, the function could be much faster. You may also see this warning when using `timeit`:
The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. The function can then be much faster in subsequent calls. Also, `timeit` may issue this warning:

`The slowest run took 14.83 times longer than the fastest. This could mean that an intermediate result is being cached.`

Why does this happen?
On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the
already-compiled function. The compiled function can *only* be reused if it is called with the same argument types
(int, float, etc.).
function previously compiled. The compiled function can *only* be reused if the types of its arguments (int, float, and the like) are the same as at the point of compilation.

See this example where `sum_range_numba` is timed again, but now with a float argument instead of int:
See this example, where `sum_range_numba` is timed once again with a float argument instead of an int:
```python
%time sum_range_numba(10.**7)
%time sum_range_numba(10**7)
%time sum_range_numba(10.**7)
```
```output
@@ -248,17 +241,17 @@ def calc_pi_numba(N):
:::

:::callout
## Measuring == knowing
## Measuring = knowing
Always profile your code to see which parallelization method works best.
:::

:::callout
## `numba.jit` is not a magical command to solve are your problems
Using numba to accelerate your code often outperforms other methods, but it is not always trivial to rewrite your code so that you can use numba with it.
## `numba.jit` is not a magical command to solve your problems
Accelerating your code with Numba often outperforms other methods, but rewriting code to reap the benefits of Numba is not always trivial.
:::

:::keypoints
- Always profile your code to see which parallelization method works best
- Always profile your code to see which parallelization method works best.
- Vectorized algorithms are both a blessing and a curse.
- Numba can help you speed up code
- Numba can help you speed up code.
:::