From 4dd0470b470444512293a7651c0bd2ac7849334e Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 16:25:11 +0200
Subject: [PATCH 01/32] introduction.md: readability pass 1

Linguistic interventions to improve readability, while affecting the
current meanings as little as possible.
---
 episodes/introduction.md | 115 +++++++++++++++++++--------------------
 1 file changed, 57 insertions(+), 58 deletions(-)

diff --git a/episodes/introduction.md b/episodes/introduction.md
index 61d7223..a2dd5ad 100644
--- a/episodes/introduction.md
+++ b/episodes/introduction.md
@@ -1,5 +1,5 @@
 ---
-title: 'introduction'
+title: 'Introduction'
 teaching: 20
 exercises: 5
 ---
@@ -9,7 +9,7 @@ exercises: 5
 - What problems are we solving, and what are we **not** discussing?
 - Why do we use Python?
 - What is parallel programming?
-- Why can it be hard to write a parallel program?
+- Why can writing a parallel program be hard?

::::::::::::::::::::::::::::::::::::::::::::::::

@@ -25,15 +25,14 @@ exercises: 5
:::callout
## What problems are we solving?

-Ask around what problems participants encountered: "Why did you sign up?". Specifically: what is the domain science related task that you want to parallelize?
+Ask around what problems participants encountered: "Why did you sign up?". Specifically: which task in your field of expertise do you want to parallelize?
:::
Most problems will fit in one of two categories:
- I wrote this code in Python and it is not fast enough.
-- I run this code on my laptop, but the target problem size is bigger than the RAM.
+- I run this code on my laptop, but the target size of the problem is bigger than its RAM.

-In this course we will show several possible ways of speeding up your program and making it ready
-to function in parallel. We will be introducing the following modules:
+In this course we show several ways of speeding up your program and making it run in parallel. We introduce the following modules:

1. `threading` allows different parts of your program to run concurrently on a single computer (with shared memory)
3. `dask` makes scalable parallel computing easy
4. `numba` speeds up your Python functions by translating them to optimized machine code
5. `memory_profile` monitors memory performance
6. `asyncio` Python's native asynchronous programming

-FIXME: Actually explain functional programming & distributed programming
-More importantly, we will show how to change the design of a program to fit parallel paradigms. This
+FIXME: Actually explain functional programming & distributed programming.
+More importantly, we show how to change the design of a program to fit parallel paradigms. This
often involves techniques from **functional programming**.

:::callout
## What we won't talk about
-In this course we will not talk about **distributed programming**. This is a huge can of worms. It is easy to
-show simple examples, but depending on the problem, solutions will be wildly different. Dask has a
-lot of functionality to help you in setting up for running on a network. The important bit is that,
-once you have made your code suitable for parallel computing, you'll have the right mind-set to get
-it to work in a distributed environment.
+In this course we will not talk about **distributed programming**.
+This is a huge can of worms.
+It is easy to show simple examples, but solutions for particular problems will be wildly different.
+Dask has a lot of functionality to help you set up runs on a network.
+The important bit is that, once you have made your code suitable for parallel computing, you will have the right mind-set to get it to work in a distributed environment.
:::

# Overview and rationale

FIXME: update this to newer lesson content organisation.

-This is an advanced course. Why is it advanced? We (hopefully) saw in the discussion that although
-many of your problems share similar characteristics, it is the detail that will determine the
-solution. We all need our algorithms, models, analysis to run in a way that many hands make light
-work. When such a situation arises with a group of people, we start with a meeting discussing who
-does what, when do we meet again to sync up, etc. After a while you can get the feeling that all you
-do is be in meetings. We will see that there are several abstractions that can make our life easier.
-In large parts this course will use Dask to illustrate these abstractions.
-
-- Vectorized instructions: tell many workers to do the same work on a different piece of data. This
-  is where `dask.array` and `dask.dataframe` come in. We will illustrate this model of working by
-computing the number Pi later on.
-- Map/filter/reduce: this is a method where we combine different functionals to create a larger
-  program. We will use `dask.bag` to count the number of unique words in a novel using this
-formalism.
-- Task-based parallelization: this may be the most generic abstraction as all the others can be expressed
+This is an advanced course.
+Why is it advanced?
+We (hopefully) saw in the discussion that, although many of your problems share similar characteristics, the details will determine the solution.
+We all need our algorithms, models, analysis to run so that many hands make light work.
+When such a situation arises with a group of people, we start with a meeting discussing who does what, when do we meet again to sync up, and so on.
+After a while you can get the feeling that all you do is attend meetings.
+We will see that several abstractions can make our life easier.
+This course illustrates these abstractions, making ample use of Dask.
+
+- Vectorized instructions: tell many workers to do the same work on a different piece of data.
+  This is where `dask.array` and `dask.dataframe` come in.
+  We will illustrate this model of working by computing the number $\pi$ later on.
+- Map/filter/reduce: this methodology combines different functionals to create a larger program.
+  We implement this formalism when using `dask.bag` to count the number of unique words in a novel.
+- Task-based parallelization: this may be the most generic abstraction, as all the others can be expressed
 in terms of tasks or workflows. This is `dask.delayed`.

# Why Python?
-Python is one of most widely used languages to do scientific data analysis, visualization, and even modelling and simulation.
-The popularity of Python is mainly due to the two pillars of a friendly syntax together with the availability of many high-quality libraries.
+Python is one of the most widely used languages for scientific data analysis, visualization, and even modelling and simulation.
+The popularity of Python is mainly due to the two pillars of a friendly syntax and the availability of many high-quality libraries.

:::callout
## It's not all good news
-The flexibility that Python offers comes with a few downsides though:
-- Python code typically doesn’t perform as fast as lower-level implementations in C/C++ or Fortran.
- It is not trivial to parallelize Python code to work efficiently on many-core architectures.
+The flexibility of Python comes with a few downsides though:
+- Python code typically does not perform as fast as lower-level implementations in C/C++ or Fortran.
+- Parallelizing Python code to work efficiently on many-core architectures is not trivial.

-This workshop addresses both these issues, with an emphasis on being able to run Python code efficiently (in parallel) on multiple cores.
+This workshop addresses both issues, with an emphasis on efficiently running parallel Python code on multiple cores.
:::

# What is parallel computing?

## Dependency diagrams

-Suppose we have a computation where each step **depends** on a previous one. We can represent this situation like in the figure below, known as a dependency diagram:
+Suppose we have a computation where each step **depends** on a previous one. We can represent this situation in the schematic below, known as a dependency diagram:

![Serial computation](fig/serial.png){alt="boxes and arrows in sequential configuration"}

-In these diagrams the inputs and outputs of each function are represented as rectangles. The inward and outward arrows indicate their flow. Note that the output of one function can become the input of another one. The diagram above is the typical diagram of a **serial computation**. If you ever used a loop to update a value, you used serial computation.
+In these diagrams rectangles represent the inputs and outputs of each function. The inward and outward arrows indicate their flow. Note that the output of one function can become the input of another one. The diagram above is the typical diagram of a **serial computation**. If you ever used a loop to update a value, you used serial computation.

-If our computation involves **independent work** (that is, the results of the application of each function are independent of the results of the application of the rest), we can structure our computation like this:
+If our computation involves **independent work** (that is, the results of each function are independent of the results of the application of the rest), we can structure our computation as follows:

![Parallel computation](fig/parallel.png){alt="boxes and arrows with two parallel pipe lines"}

-This scheme corresponds to a **parallel computation**.
+This scheme represents a **parallel computation**.

### How can parallel computing improve our code execution speed?

Nowadays, most personal computers have 4 or 8 processors (also known as cores). In the diagram above, we can assign each of the three functions to one core, so they can be performed simultaneously.

:::callout
-## Do 8 processors work 8 times faster than one?
-It may be tempting to think that using eight cores instead of one would multiply the execution speed by eight. For now, it's ok to use this a first approximation to reality. Later in the course we'll see that things are actually more complicated than that.
+## Do eight processors work eight times as fast as one?
+It may be tempting to think that using eight cores instead of one would increase the execution speed eightfold. For now, it is ok to use this as a first approximation to reality. Later in the course we see that things are actually more complicated.
:::

## Parallelizable and non-parallelizable tasks
-Some tasks are easily parallelizable while others inherently aren't. However, it might not always be immediately apparent that a task is parallelizable.
+Some tasks are easily parallelizable while others inherently are not. However, it might not always be immediately apparent that a task is parallelizable.
+Some tasks are easily parallelizable while others inherently are not. However, it might not always be immediately apparent that a task is parallelizable. Let us consider the following piece of code. @@ -132,13 +131,13 @@ print(y) # Print output 10 ``` -Note that each successive loop uses the result of the previous loop. In that way, it is dependent on the previous +Note that each successive loop uses the result of the previous loop. In that way, it depends on the previous loop. The following dependency diagram makes that clear: ![serial execution](fig/serial.svg){alt="boxes and arrows"} Although we are performing the loops in a serial way in the snippet above, -nothing avoids us from performing this calculation in parallel. +nothing prevents us from performing this calculation in parallel. The following example shows that parts of the computations can be done independently: ```python @@ -161,19 +160,19 @@ print(result) ![parallel execution](fig/parallel.svg){alt="boxes and arrows"} -The technique for parallelising sums like this is called **chunking**. +**Chunking** is the technique for parallelizing operations like these sums. -There is a subclass of algorithms where the subtasks are completely independent. These kinds of algorithms are known as [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel), or more friendly: naturally or delightfully parallel. +There is a subclass of algorithms where the subtasks are completely independent. These kinds of algorithms are known as [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) or, more friendly, naturally or delightfully parallel. -An example of this kind of problem is squaring each element in a list, which can be done like so: +An example of this kind of problem is squaring each element in a list, which can be done as follows: ```python y = [n**2 for n in x] ``` -Each task of squaring a number is independent of all the other elements in the list. +Each task of squaring a number is independent of all other elements in the list. -It is important to know that some tasks are fundamentally non-parallelizable. An example of such an **inherently serial** algorithm could be the computation of the fibonacci sequence using the formula `Fn=Fn-1 + Fn-2`. Each output here depends on the outputs of the two previous loops. +It is important to know that some tasks are fundamentally non-parallelizable. An example of such an **inherently serial** algorithm is the computation of the Fibonacci sequence using the formula `Fn=Fn-1 + Fn-2`. Each output depends on the outputs of the two previous loops. :::challenge ## Challenge: Parallellizable and non-parallellizable tasks @@ -184,7 +183,7 @@ Please write your answers in the collaborative document. ::::solution Answers may differ. An ubiquitous example of a naturally parallel problem is a parameter scan, where you need to evaluate some model for N different configurations of input parameters. -One set of problems that are very hard to parallelize are solving time-dependent models where every state depends on the previous one. Even in those cases there are attempts to do parallel computation, but those require fundamentally different algorithms to solve. +Time-dependent models are a category of problems very hard to parallelize, since every state depends on the previous one(s). The attempts to parallelize those cases require fundamentally different algorithms. 
In many cases fully paralellizable algorithms may be a bit less efficient per CPU cycle than their single threaded brethren.
::::
:::

:::callout
## Problems versus Algorithms
-Often, the parallelizability of a problem depends on its specific implementation. For instance, in our first example of a non-parallelizable task, we mentioned the calculation of the Fibonacci sequence. However, there exists a [closed form expression to compute the n-th Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number#Closed-form_expression).
+Often, the parallelizability of a problem depends on its specific implementation. For instance, in our first example of a non-parallelizable task, we mentioned the calculation of the Fibonacci sequence. Conveniently, a [closed form expression to compute the n-th Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number#Closed-form_expression) exists.

-Last but not least, don't let the name demotivate you: if your algorithm happens to be embarrassingly parallel, that's good news! The "embarrassingly" refers to the feeling of "this is great!, how did I not notice before?!"
+Last but not least, do not let the name discourage you: if your algorithm happens to be embarrassingly parallel, that's good news! The "embarrassingly" evokes the feeling of "This is great! How did I not notice it before?!"
:::

:::challenge
-## Challenge: Parallelised Pea Soup
+## Challenge: Parallelized Pea Soup
We have the following recipe:

-1. (1 min) Pour water into a soup pan, add the split peas and bay leaf and bring it to boil.
-2. (60 min) Remove any foam using a skimmer and let it simmer under a lid for about 60 minutes.
+1. (1 min) Pour water into a soup pan, add the split peas and bay leaf, and bring it to boil.
+2. (60 min) Remove any foam using a skimmer, and let it simmer under a lid for about 60 minutes.
 3. (15 min) Clean and chop the leek, celeriac, onion, carrot and potato.
-4. (20 min) Remove the bay leaf, add the vegetables and simmer for 20 more minutes. Stir the soup occasionally.
+4. (20 min) Remove the bay leaf, add the vegetables, and simmer for 20 more minutes. Stir the soup occasionally.
 5. (1 day) Leave the soup for one day. Reheat before serving and add a sliced smoked sausage (vegetarian options are also welcome). Season with pepper and salt.

-Imagine you're cooking alone.
+Imagine you are cooking alone.

- Can you identify potential for parallelisation in this recipe?
-- And what if you are cooking with the help of a friend help? Is the soup done any faster?
+- And what if you are cooking with a friend's help? Is the soup done any faster?
- Draw a dependency diagram.

::::solution
:::: ::: -## Shared vs. Distributed memory +## Shared vs. distributed memory FIXME: add text ![Shared vs. Distributed memory architecture: the crucial difference is the bandwidth to shared memory](fig/memory-architecture.svg){alt="diagram"}
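While the text of this section is still marked FIXME, its gist can be sketched from Python (an illustrative sketch only, using nothing beyond the standard library): threads share one memory space, whereas processes each work on their own copy of it:

```python
import multiprocessing as mp
import threading

counter = [0]  # plain Python object living in this process's memory

def increment():
    counter[0] += 1

def main():
    t = threading.Thread(target=increment)
    t.start()
    t.join()
    print(counter[0])  # 1: the thread shares our memory

    p = mp.Process(target=increment)
    p.start()
    p.join()
    print(counter[0])  # still 1: the child process incremented its own copy

if __name__ == "__main__":
    main()
```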
From c82bf08b1d225022de64712c4c0c9820cdc3c74f Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 16:33:56 +0200
Subject: [PATCH 02/32] introduction.md: breaks code lines at end of sentences

This makes the text much easier to read and improve.
---
 episodes/introduction.md | 68 +++++++++++++++++++++++++++-------------
 1 file changed, 46 insertions(+), 22 deletions(-)

diff --git a/episodes/introduction.md b/episodes/introduction.md
index a2dd5ad..724b443 100644
--- a/episodes/introduction.md
+++ b/episodes/introduction.md
@@ -25,14 +25,16 @@ exercises: 5
 :::callout
 ## What problems are we solving?

-Ask around what problems participants encountered: "Why did you sign up?". Specifically: which task in your field of expertise do you want to parallelize?
+Ask around what problems participants encountered: "Why did you sign up?".
+Specifically: which task in your field of expertise do you want to parallelize?
 :::
 Most problems will fit in one of two categories:
 - I wrote this code in Python and it is not fast enough.
 - I run this code on my laptop, but the target size of the problem is bigger than its RAM.

-In this course we show several ways of speeding up your program and making it run in parallel. We introduce the following modules:
+In this course we show several ways of speeding up your program and making it run in parallel.
+We introduce the following modules:

 1. `threading` allows different parts of your program to run concurrently on a single computer (with shared memory)
 3. `dask` makes scalable parallel computing easy
 4. `numba` speeds up your Python functions by translating them to optimized machine code
 5. `memory_profile` monitors memory performance
 6. `asyncio` Python's native asynchronous programming

 FIXME: Actually explain functional programming & distributed programming.
-More importantly, we show how to change the design of a program to fit parallel paradigms. This
-often involves techniques from **functional programming**.
+More importantly, we show how to change the design of a program to fit parallel paradigms.
+This often involves techniques from **functional programming**.

:::callout
## What we won't talk about
...

# What is parallel computing?

## Dependency diagrams

-Suppose we have a computation where each step **depends** on a previous one. We can represent this situation in the schematic below, known as a dependency diagram:
+Suppose we have a computation where each step **depends** on a previous one.
+We can represent this situation in the schematic below, known as a dependency diagram:

![Serial computation](fig/serial.png){alt="boxes and arrows in sequential configuration"}

-In these diagrams rectangles represent the inputs and outputs of each function. The inward and outward arrows indicate their flow. Note that the output of one function can become the input of another one. The diagram above is the typical diagram of a **serial computation**. If you ever used a loop to update a value, you used serial computation.
+In these diagrams rectangles represent the inputs and outputs of each function.
+The inward and outward arrows indicate their flow.
+Note that the output of one function can become the input of another one.
+The diagram above is the typical diagram of a **serial computation**.
+If you ever used a loop to update a value, you used serial computation.
If our computation involves **independent work** (that is, the results of each function are independent of the results of the application of the rest), we can structure our computation as follows:

![Parallel computation](fig/parallel.png){alt="boxes and arrows with two parallel pipe lines"}

This scheme represents a **parallel computation**.

### How can parallel computing improve our code execution speed?

-Nowadays, most personal computers have 4 or 8 processors (also known as cores). In the diagram above, we can assign each of the three functions to one core, so they can be performed simultaneously.
+Nowadays, most personal computers have 4 or 8 processors (also known as cores).
+In the diagram above, we can assign each of the three functions to one core, so they can be performed simultaneously.

:::callout
## Do eight processors work eight times as fast as one?
-It may be tempting to think that using eight cores instead of one would increase the execution speed eightfold. For now, it is ok to use this as a first approximation to reality. Later in the course we see that things are actually more complicated.
+It may be tempting to think that using eight cores instead of one would increase the execution speed eightfold.
+For now, it is ok to use this as a first approximation to reality.
+Later in the course we see that things are actually more complicated.
:::

## Parallelizable and non-parallelizable tasks
-Some tasks are easily parallelizable while others inherently are not. However, it might not always be immediately apparent that a task is parallelizable.
+Some tasks are easily parallelizable while others inherently are not.
+However, it might not always be immediately apparent that a task is parallelizable.

Let us consider the following piece of code.

```python
x = [1, 2, 3, 4] # Write input

y = 0 # Initialize output

for i in range(len(x)):
  y += x[i] # Add each element to the output variable

print(y) # Print output

10
```

-Note that each successive loop uses the result of the previous loop. In that way, it depends on the previous
-loop. The following dependency diagram makes that clear:
+Note that each successive loop uses the result of the previous loop.
+In that way, it depends on the previous loop.
+The following dependency diagram makes that clear:

![serial execution](fig/serial.svg){alt="boxes and arrows"}

-Although we are performing the loops in a serial way in the snippet above, nothing prevents us from performing this calculation in parallel.
+Although we are performing the loops in a serial way in the snippet above, nothing prevents us from performing this calculation in parallel.
The following example shows that parts of the computations can be done independently:

```python
print(result)
```

![parallel execution](fig/parallel.svg){alt="boxes and arrows"}

**Chunking** is the technique for parallelizing operations like these sums.

-There is a subclass of algorithms where the subtasks are completely independent. These kinds of algorithms are known as [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) or, more friendly, naturally or delightfully parallel.
+There is a subclass of algorithms where the subtasks are completely independent.
+These kinds of algorithms are known as [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) or, more friendly, naturally or delightfully parallel.

An example of this kind of problem is squaring each element in a list, which can be done as follows:

```python
y = [n**2 for n in x]
```

Each task of squaring a number is independent of all other elements in the list.

-It is important to know that some tasks are fundamentally non-parallelizable. An example of such an **inherently serial** algorithm is the computation of the Fibonacci sequence using the formula `F(n) = F(n-1) + F(n-2)`. Each output depends on the outputs of the two previous steps.
+It is important to know that some tasks are fundamentally non-parallelizable.
+An example of such an **inherently serial** algorithm is the computation of the Fibonacci sequence using the formula `F(n) = F(n-1) + F(n-2)`.
+Each output depends on the outputs of the two previous steps.
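As a minimal illustration (one of several ways to write it), every iteration needs the two results that come just before it, so no iteration can start before the previous one has finished:

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, computed serially."""
    f = [0, 1]
    for i in range(2, n):
        f.append(f[i - 1] + f[i - 2])  # needs the two previous outputs
    return f[:n]
```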
:::challenge
## Challenge: Parallellizable and non-parallellizable tasks
-Can you think of a task in your domain that is parallelizable? Can you also think of one that is fundamentally non-parallelizable?
+Can you think of a task in your domain that is parallelizable?
+Can you also think of one that is fundamentally non-parallelizable?

Please write your answers in the collaborative document.

::::solution
-Answers may differ. An ubiquitous example of a naturally parallel problem is a parameter scan, where you need to evaluate some model for N different configurations of input parameters.
+Answers may differ.
+A ubiquitous example of a naturally parallel problem is a parameter scan, where you need to evaluate some model for N different configurations of input parameters.

-Time-dependent models are a category of problems very hard to parallelize, since every state depends on the previous one(s). The attempts to parallelize those cases require fundamentally different algorithms.
+Time-dependent models are a category of problems very hard to parallelize, since every state depends on the previous one(s).
+The attempts to parallelize those cases require fundamentally different algorithms.

In many cases fully paralellizable algorithms may be a bit less efficient per CPU cycle than their single threaded brethren.
::::
:::

:::callout
## Problems versus Algorithms
-Often, the parallelizability of a problem depends on its specific implementation. For instance, in our first example of a non-parallelizable task, we mentioned the calculation of the Fibonacci sequence. Conveniently, a [closed form expression to compute the n-th Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number#Closed-form_expression) exists.
+Often, the parallelizability of a problem depends on its specific implementation.
+For instance, in our first example of a non-parallelizable task, we mentioned the calculation of the Fibonacci sequence.
+Conveniently, a [closed form expression to compute the n-th Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number#Closed-form_expression) exists.

-Last but not least, do not let the name discourage you: if your algorithm happens to be embarrassingly parallel, that's good news! The "embarrassingly" evokes the feeling of "This is great! How did I not notice it before?!"
+Last but not least, do not let the name discourage you: if your algorithm happens to be embarrassingly parallel, that's good news!
+The "embarrassingly" evokes the feeling of "This is great! How did I not notice it before?!"
:::

:::challenge
## Challenge: Parallelized Pea Soup
We have the following recipe:

1. (1 min) Pour water into a soup pan, add the split peas and bay leaf, and bring it to boil.
2. (60 min) Remove any foam using a skimmer, and let it simmer under a lid for about 60 minutes.
3. (15 min) Clean and chop the leek, celeriac, onion, carrot and potato.
-4. (20 min) Remove the bay leaf, add the vegetables, and simmer for 20 more minutes. Stir the soup occasionally.
+4. (20 min) Remove the bay leaf, add the vegetables, and simmer for 20 more minutes.
+   Stir the soup occasionally.
-5. (1 day) Leave the soup for one day. Reheat before serving and add a sliced smoked sausage (vegetarian options are also welcome). Season with pepper and salt.
+5. (1 day) Leave the soup for one day.
+   Reheat before serving and add a sliced smoked sausage (vegetarian options are also welcome).
+   Season with pepper and salt.

Imagine you are cooking alone.

- Can you identify potential for parallelisation in this recipe?
- And what if you are cooking with a friend's help? Is the soup done any faster?
- Draw a dependency diagram.

::::solution
::::
:::

## Shared vs. distributed memory
FIXME: add text

![Shared vs. Distributed memory architecture: the crucial difference is the bandwidth to shared memory](fig/memory-architecture.svg){alt="diagram"}

::::::::::::::::::::::::::::::::::::: keypoints

- Programs are parallelizable if you can identify independent tasks.
...

From c4b991fa9e26de522c0da896d56d334cd31dee4b Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 16:37:58 +0200
Subject: [PATCH 03/32] introduction.md: readability pass 2

Enforce consistent (not necessarily correct) punctuation.
Apply occasional edits.
---
 episodes/introduction.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/episodes/introduction.md b/episodes/introduction.md
index a2dd5ad..ff43595 100644
--- a/episodes/introduction.md
+++ b/episodes/introduction.md
@@ -34,11 +34,11 @@ Most problems will fit in one of two categories:
 In this course we show several ways of speeding up your program and making it run in parallel.
 We introduce the following modules:

-1. `threading` allows different parts of your program to run concurrently on a single computer (with shared memory)
-3. `dask` makes scalable parallel computing easy
-4. `numba` speeds up your Python functions by translating them to optimized machine code
-5. `memory_profile` monitors memory performance
-6. `asyncio` Python's native asynchronous programming
+1. `threading` allows different parts of your program to run concurrently on a single computer (with shared memory).
+2. `dask` makes scalable parallel computing easy.
+3. `numba` speeds up your Python functions by translating them to optimized machine code.
+4. `memory_profiler` monitors memory performance.
+5. `asyncio`, Python's native asynchronous programming.

FIXME: Actually explain functional programming & distributed programming.
More importantly, we show how to change the design of a program to fit parallel paradigms. This
often involves techniques from **functional programming**.
@@ ...
-Let us consider the following piece of code.
+Let us consider the following piece of code:
@@ ...
-Answers may differ. An ubiquitous example of a naturally parallel problem is a parameter scan, where you need to evaluate some model for N different configurations of input parameters.
+Answers may vary. A ubiquitous example of a naturally parallel problem is a parameter scan, where you need to evaluate some model for N different configurations of input parameters.

Time-dependent models are a category of problems very hard to parallelize, since every state depends on the previous one(s).
The attempts to parallelize those cases require fundamentally different algorithms.

From 1a8c9db40bbd734ad7502c50c3a3885709ee2b05 Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 17:24:08 +0200
Subject: [PATCH 04/32] benchmarking.md: readability pass 1

Linguistic interventions to improve readability, affecting the current
meanings as little as possible.
---
 episodes/benchmarking.md | 69 +++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 36 deletions(-)

diff --git a/episodes/benchmarking.md b/episodes/benchmarking.md
index c6f9cf8..50f6dd6 100644
--- a/episodes/benchmarking.md
+++ b/episodes/benchmarking.md
@@ -5,8 +5,8 @@ exercises: 20
 ---

:::questions
-- How do we know our program ran faster?
-- How do we learn about efficiency?
+- How do we know whether our program ran faster in parallel?
+- How do we appraise efficiency?
:::

:::objectives
@@ -20,8 +20,8 @@ exercises: 20

# A first example with Dask

-We will get into creating parallel programs in Python later. First let's see a small example. Open
-your system monitor (this will differ among specific operating systems), and run the following code examples.
+We will create parallel programs in Python later. First, let's see a small example. Open
+your system monitor (this will vary between operating systems), and run the following code examples.

```python
# Summation making use of numpy:
result = np.arange(10**7).sum()
```

```python
# The same summation, but using dask to parallelize the code.
# NB: the API of dask arrays mimics that of numpy
import dask.array as da
work = da.arange(10**7).sum()
result = work.compute()
```

:::callout
## Try a heavy enough task
-It could be that a task this small does not register on your radar. Depending on your computer you will have to raise the power to ``10**8`` or ``10**9`` to make sure that it runs long enough to observe the effect. But be careful and increase slowly. Asking for too much memory can make your computer slow to a crawl.
+Your radar may not detect so small a task. On your computer, you may have to gradually raise the problem size to ``10**8`` or ``10**9`` to observe the effect in a long enough run. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl.
:::

![System monitor](fig/system-monitor.jpg){alt="screenshot of system monitor"}

-How can we test this in a more practical way? In Jupyter we can use some line magics, small "magic words" preceded
+How can we monitor this more conveniently? In Jupyter we can use some line magics, small "magic words" preceded
by the symbol `%%` that modify the behaviour of the cell.

```python
%%time
np.arange(10**7).sum()
```

-The `%%time` line magic checks how long it took for a computation to finish. It does nothing to
-change the computation itself. In this it is very similar to the `time` shell command.
+The `%%time` line magic checks how long it took for a computation to finish. It does not affect how the computation is performed. In this regard it is very similar to the `time` shell command.

-If run the chunk several times, we will notice a difference in the times.
+If we run the chunk several times, we will notice variability in the reported times.

How can we trust this timer, then?
A possible solution is to time the chunk several times and take the average time as our measure.
-The `%%timeit` line magic does exactly this in a concise an comfortable manner!
-`%%timeit` first measures how long it takes to run a command one time, then
-repeats it enough times to get an average run-time.
Also, `%%timeit` can measure run times without
-the time it takes to setup a problem, measuring only the performance of the code in the cell.
-This way we can trust the outcome better.
+The `%%timeit` line magic does exactly this in a concise and convenient manner!
+`%%timeit` first measures how long it takes to run a command once, then
+repeats it enough times to get an average run-time. Also, `%%timeit` can measure run times discounting the overhead of setting up a problem and measuring only the performance of the code in the cell.
+So this outcome is more trustworthy.

```python
%%timeit
np.arange(10**7).sum()
```

-If you want to store the output of `%%timeit` in a Python variable, you can do so with the `-o` flag.
+You can store the output of `%%timeit` in a Python variable using the `-o` flag:

```python
time = %timeit -o np.arange(10**7).sum()
print(f"Time taken: {time.average:.4f}s")
```

-Note that this does not tell you anything about memory consumption or efficiency.
+Note that this metric does not tell you anything about memory consumption or efficiency.

# Memory profiling
-- The act of systematically testing performance under different conditions is called **benchmarking**.
-- Analysing what parts of a program contribute to the total performance, and identifying possible bottlenecks is **profiling**.
+- **Benchmarking** is the action of systematically testing performance under different conditions.
+- **Profiling** is the analysis of which parts of a program contribute to the total performance, and the identification of possible bottlenecks.

-We will use the [`memory_profiler` package](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
+We will use the package [`memory_profiler`](https://github.com/pythonprofilers/memory_profiler) to track memory usage.
It can be installed by executing the code below in the console:

~~~sh
pip install memory_profiler
~~~

-In Jupyter, type the following lines to compare the memory usage of the serial and parallel versions of the code presented above (again, change the value of `10**7` to something higher if needed):
+The memory usage of the serial and parallel versions of a code will vary. In Jupyter, type the following lines to see the effect in the code presented above (again, increase the baseline value `10**7` if needed):

```python
import numpy as np
import dask.array as da
from memory_profiler import memory_usage
import matplotlib.pyplot as plt

def sum_with_numpy():
    # Serial implementation
    np.arange(10**7).sum()

def sum_with_dask():
    # Parallel implementation
    work = da.arange(10**7).sum()
    work.compute()

memory_numpy = memory_usage(sum_with_numpy, interval=0.01)
memory_dask = memory_usage(sum_with_dask, interval=0.01)

# Plot results
plt.plot(memory_numpy, label='numpy')
plt.plot(memory_dask, label='dask')
plt.xlabel('Interval counter [-]')
plt.ylabel('Memory utilization [MiB]')
plt.legend()
plt.show()
```

-The figure should be similar to the one below:
+The plot should be similar to the one below:

![Memory performance](fig/memory.png){alt="showing very high peak for numpy, and constant low line for dask"}

:::challenge
## Exercise (plenary)
-Why is the Dask solution more memory efficient?
+Why is the Dask solution more memory-efficient?

::::solution
## Solution
-Chunking! Dask chunks the large array, such that the data is never entirely in memory.
+Chunking! Dask chunks the large array so that the data is never entirely in memory.
::::
:::

:::callout
## Profiling from Dask
-Dask has several option to do profiling from Dask itself. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
+Dask has several built-in options for profiling. See the [dask documentation](https://docs.dask.org/en/latest/diagnostics-local.html) for more information.
:::

# Using many cores

Using more cores for a computation can decrease the run time.
The first question is of course: how many cores do I have? See the snippet below to find this out:

:::callout
## Find out the number of cores in your machine
The number of cores can be found from Python by executing:

```python
import psutil
N_physical_cores = psutil.cpu_count(logical=False)
N_logical_cores = psutil.cpu_count(logical=True)
print(f"The number of physical/logical cores is {N_physical_cores}/{N_logical_cores}")
```
:::

-Usually the number of logical cores is higher than the number of physical course. This is due to *hyper-threading*,
+Usually the number of logical cores is higher than the number of physical cores. This is due to *hyper-threading*,
which enables each physical CPU core to execute several threads at the same time. Even with simple examples, performance may scale unexpectedly. There are many reasons for this, hyper-threading being one of them.
+See the following example.

-See for instance the example below:
-
-On a machine with 4 physical and 8 logical cores doing this (admittedly oversimplistic) benchmark:
+On a machine with 4 physical and 8 logical cores, this admittedly oversimplistic benchmark:

```python
x = []
for n in range(1, 9):
    time_taken = %timeit -o
    x.append(time_taken.average)
```

-Gives the following result:
+gives the result:

```python
import pandas as pd
data = pd.DataFrame({"n": range(1, 9), "t": x})
data.set_index("n").plot()
```

:::discussion
## Discussion
-Why is the runtime increasing if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the number of physical cores you have.
+Why is the runtime increasing if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the number of physical cores you have.
:::

:::keypoints
-- It is often non-trivial to understand performance
-- Memory is just as important as speed
-- Measuring is knowing
+- Understanding performance is often non-trivial.
+- Memory is just as important as speed.
+- To measure is to know.
:::

From 1024ef6ba3bbbab3e5dba6844ee586b478243ba6 Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 18:37:28 +0200
Subject: [PATCH 05/32] computing-pi.md: readability pass 1

---
 episodes/computing-pi.md | 68 +++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/episodes/computing-pi.md b/episodes/computing-pi.md
index b0f8908..b30fa4c 100644
--- a/episodes/computing-pi.md
+++ b/episodes/computing-pi.md
@@ -12,27 +12,26 @@ exercises: 30

:::objectives
- Rewrite a program in a vectorized form.
-- Understand the difference between data and task-based parallel programming.
+- Understand the difference between data-based and task-based parallel programming.
- Apply `numba.jit` to accelerate Python.
:::

# Parallelizing a Python application

-In order to recognize the advantages of parallelization we need an algorithm that is easy to parallelize, but still complex enough to take a few seconds of CPU time.
-To not scare away the interested reader, we need this algorithm to be understandable and, if possible, also interesting.
-We chose a classical algorithm for demonstrating parallel programming: estimating the value of number π.
+In order to recognize the advantages of parallelism we need an algorithm that is easy to parallelize, complex enough to take a few seconds of CPU time, understandable, and interesting enough not to scare away the interested learner.
+Estimating the value of the number π is a classical problem for demonstrating parallel programming.

-The algorithm we present is one of the classical examples of the power of Monte-Carlo methods.
-This is an umbrella term for several algorithms that use random numbers to approximate exact results.
-We chose this algorithm because of its simplicity and straightforward geometrical interpretation.
+The algorithm we present is a classical demonstration of the power of Monte Carlo methods.
+This is a category of algorithms using random numbers to approximate exact results.
+This approach is simple and has a straightforward geometrical interpretation.

We can compute the value of π using a random number generator. We count the points falling inside the blue circle M compared to the green square N.
-Then π is approximated by the ratio 4M/N.
+The ratio 4M/N then approximates π.

![Computing Pi](fig/calc_pi_3_wide.svg){alt="the area of a unit sphere contains a multiple of pi"}

:::challenge
## Challenge: Implement the algorithm
-Use only standard Python and the function `random.uniform`. The function should have the following
+Use only standard Python and the function `random.uniform`. The function should have the following
interface:

```python
import random
def calc_pi(N):
    """Computes the value of pi using N random samples."""
    ...
    return ...
```

-Also make sure to time your function!
+Also, make sure to time your function!

::::solution
## Solution

```python
import random

def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N

%timeit calc_pi(10**6)
```
::::
:::

-Before we start to parallelize this program, we need to do our best to make the inner function as
-efficient as we can. We show two techniques for doing this: *vectorization* using `numpy` and
+Before we parallelize this program, the inner function must be as
+efficient as we can make it. We show two techniques for doing this: *vectorization* using `numpy`, and
*native code generation* using `numba`.

-We first demonstrate a Numpy version of this algorithm.
+We first demonstrate a Numpy version of this algorithm:

```python
import numpy as np

def calc_pi_numpy(N):
    # Simulate impact coordinates
    pts = np.random.uniform(-1, 1, (2, N))
    # Count number of impacts inside the circle
    M = np.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N
```

This is a **vectorized** version of the original algorithm. It nicely demonstrates **data parallelization**, where a **single operation** is replicated over collections of data.
-It contrasts to **task parallelization**, where **different independent** procedures are performed in
-parallel (think for example about cutting the vegetables while simmering the split peas).
+It contrasts with **task parallelization**, where **different independent** procedures are performed in
+parallel (think, for example, about cutting the vegetables while simmering the split peas).

-If we compare with the 'naive' implementation above, we see that our new one is much faster:
+This implementation is much faster than the 'naive' implementation above:

```python
%timeit calc_pi_numpy(10**6)
```

```output
25.2 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

What is the downside of the vectorized implementation?

:::challenge
## Challenge: Daskify
-Write `calc_pi_dask` to make the Numpy version parallel. Compare speed and memory performance with
-the Numpy version. NB: Remember that dask.array mimics the numpy API.
+Write `calc_pi_dask` to make the Numpy version parallel. Compare its speed and memory performance with
+the Numpy version. NB: Remember that the API of `dask.array` mimics that of Numpy.

::::solution
## Solution

```python
import dask.array as da

def calc_pi_dask(N):
    # Simulate impact coordinates
    pts = da.random.uniform(-1, 1, (2, N))
    # Count number of impacts inside the circle
    M = da.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N
```
::::
:::

# Using Numba to accelerate Python code
-Numba makes it easier to create accelerated functions. You can use it with the decorator `numba.jit`.
+Numba makes it easier to create accelerated functions.
You can activate it with the decorator `numba.jit`.

```python
import numba

@numba.jit
def sum_range_numba(a):
    """Compute the sum of the numbers in the range [0, a)."""
    x = 0
    for i in range(a):
        x += i
    return x
```

Let's time three versions of the same test. First, native Python iterators:

```python
%timeit sum(range(10**7))
```

```output
190 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

-Now with Numpy:
+Then, with Numpy:

```python
%timeit np.arange(10**7).sum()
```

```output
17.5 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

-And with Numba:
+Finally, with Numba:

```python
%timeit sum_range_numba(10**7)
```

```output
162 ns ± 0.885 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

-Numba is 100x faster in this case! It gets this speedup with "just-in-time" compilation (JIT)—compiling the Python
-function into machine code just before it is called (that's what the `@numba.jit` decorator stands for).
-Not every Python and Numpy feature is supported, but a function may be a good candidate for Numba if it is written
-with a Python for-loop over a large range of values, as with `sum_range_numba()`.
+Numba is a hundred times faster in this case! It gets this speedup with "just-in-time" compilation (JIT), that is, compiling the Python
+function into machine code just before it is called, as the `@numba.jit` decorator indicates.
+Numba does not support every Python and Numpy feature, but functions written with a for-loop with a large number of iterations, like our `sum_range_numba()`, are good candidates.

:::callout
## Just-in-time compilation speedup

-The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. In
-subsequent calls, the function could be much faster. You may also see this warning when using `timeit`:
+The first time you call a function decorated with `@numba.jit`, you may see little or no speedup. The function can then be much faster in subsequent calls. Also, `timeit` may throw this warning:

`The slowest run took 14.83 times longer than the fastest. This could mean that an intermediate result is being cached.`

-Why does this happen? On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the
-already-compiled function. The compiled function can *only* be reused if it is called with the same argument types
-(int, float, etc.).
+Why does this happen?
On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the
function previously compiled. The compiled function can *only* be reused if the types of its arguments (int, float, and the like) are the same as at the point of compilation.

See this example, where `sum_range_numba` is timed once again with a float argument instead of an int:

```python
%time sum_range_numba(10**7)
%time sum_range_numba(10.**7)
```
```output
CPU times: user 58.3 ms, sys: 3.27 ms, total: 61.6 ms
Wall time: 60.9 ms
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.87 µs
```
:::

:::challenge
## Challenge: Numbify `calc_pi`
Create a Numba version of `calc_pi`. Time it.

::::solution
## Solution

```python
@numba.jit
def calc_pi_numba(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N

%timeit calc_pi_numba(10**6)
```
::::
:::

:::callout
-## Measuring == knowing
+## Measuring = knowing
Always profile your code to see which parallelization method works best.
:::

:::callout
-## `numba.jit` is not a magical command to solve are your problems
-Using numba to accelerate your code often outperforms other methods, but it is not always trivial to rewrite your code so that you can use numba with it.
+## `numba.jit` is not a magical command to solve your problems
+Accelerating your code with Numba often outperforms other methods, but rewriting code to reap the benefits of Numba is not always trivial.
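One hedged illustration of this point: when a function already spends almost all of its time inside optimized Numpy routines, there is little interpreter overhead left for Numba to remove, so the gain is often modest. A sketch to try for yourself:

```python
import numba
import numpy as np

def sum_of_squares_numpy(a):
    return (a * a).sum()   # already executes in optimized C inside Numpy

@numba.njit
def sum_of_squares_numba(a):
    total = 0.0
    for x in a:            # explicit loop: the pattern Numba compiles well
        total += x * x
    return total

a = np.random.random(10**7)
# Time both versions, e.g. with %timeit; they are typically in the same
# ballpark, because the Numpy version was never interpreter-bound.
```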
:::

:::keypoints
-- Always profile your code to see which parallelization method works best
+- Always profile your code to see which parallelization method works best.
- Vectorized algorithms are both a blessing and a curse.
-- Numba can help you speed up code
+- Numba can help you speed up code.
:::

From b0e9ca6ef655d220d62049acd72f5b65013df83d Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Fri, 30 Jun 2023 20:07:29 +0200
Subject: [PATCH 06/32] threads-and-processes.md: readability pass 1

---
 episodes/threads-and-processes.md | 94 +++++++++++++++----------------
 1 file changed, 46 insertions(+), 48 deletions(-)

diff --git a/episodes/threads-and-processes.md b/episodes/threads-and-processes.md
index 57cd521..be1465c 100644
--- a/episodes/threads-and-processes.md
+++ b/episodes/threads-and-processes.md
@@ -1,5 +1,5 @@
 ---
-title: 'Threads and processes'
+title: 'Threads and Processes'
 teaching: 60
 exercises: 30
 ---

:::objectives
- Understand the GIL.
-- Understand the difference between the python `threading` and `multiprocessing` library
+- Understand the difference between the `threading` and `multiprocessing` libraries in Python.
:::

# Threading
-Another possibility for parallelization is to use the `threading` module.
-This module is built into Python. In this section, we'll use it to estimate pi
-once again.
+Another way of parallelizing code is to use the `threading` module.
+This module is built into Python. We will use it to estimate $\pi$
+once again in this section.

-Using threading to speed up your code:
+An example of using threading to speed up your code is:

```python
from threading import (Thread)

%%time
n = 10**7
t1 = Thread(target=calc_pi, args=(n//2,))
t2 = Thread(target=calc_pi, args=(n//2,))

t1.start()
t2.start()

t1.join()
t2.join()
```

:::discussion
## Discussion: where's the speed-up?
While mileage may vary, parallelizing `calc_pi`, `calc_pi_numpy` and `calc_pi_numba` this way will
-not give the expected speed-up. `calc_pi_numba` should give *some* speed-up, but nowhere near the
-ideal scaling over the number of cores. This is because Python only allows one thread to access the
-interperter at any given time, a feature also known as the Global Interpreter Lock.
+not give the theoretical speed-up. `calc_pi_numba` should give *some* speed-up, but nowhere near the
+ideal scaling for the number of cores. This is because, at any given time, Python only allows one thread to access the
+interpreter, a feature also known as the Global Interpreter Lock.
:::

## A few words about the Global Interpreter Lock
The Global Interpreter Lock (GIL) is an infamous feature of the Python interpreter.
It both guarantees inner thread sanity, making programming in Python safer, and prevents us from using multiple cores from
a single Python instance.
-When we want to perform parallel computations, this becomes an obvious problem.
-There are roughly two classes of solutions to circumvent/lift the GIL:
+This becomes an obvious problem when we want to perform parallel computations.
+Roughly speaking, there are two classes of solutions to circumvent/lift the GIL:

-- Run multiple Python instances: `multiprocessing`
-- Have important code outside Python: OS operations, C++ extensions, cython, numba
+- Run multiple Python instances using `multiprocessing`.
+- Keep important code outside Python using OS operations, C++ extensions, Cython, Numba.

The downside of running multiple Python instances is that we need to share program state between different processes.
To this end, you need to serialize objects. Serialization entails converting a Python object into a stream of bytes
that can then be sent to the other process or, for example, stored to disk. This is typically done using `pickle`, `json`, or
similar, and creates a large overhead.
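As a rough sketch of what that round trip looks like with `pickle` (the byte stream, and the time to convert it, grow with the size of your data):

```python
import pickle

data = list(range(10**6))           # some program state to share
payload = pickle.dumps(data)        # Python object -> stream of bytes
restored = pickle.loads(payload)    # stream of bytes -> Python object
print(f"shipping {len(payload)} bytes between processes")
```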
The alternative is to bring parts of our code outside Python.
Numpy has many routines that are largely situated outside of the GIL.
-The only way to know for sure is trying out and profiling your application.
+Trying out and profiling your application is the only way to know for sure.

-To write your own routines that do not live under the GIL there are several options: fortunately `numba` makes this very easy.
+To write your own routines not subject to the GIL, there are several options; fortunately, `numba` makes this very easy.

-We can force the GIL off in Numba code by setting `nogil=True` in the `numba.jit` decorator.
+We can switch the GIL off in Numba code by setting `nogil=True` inside the `numba.jit` decorator.

```python
@numba.jit(nopython=True, nogil=True)
def calc_pi_nogil(N):
    M = 0
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1:
            M += 1
    return 4 * M / N
```

The `nopython` argument forces Numba to compile the code without referencing any Python objects,
-while the `nogil` argument enables lifting the GIL during the execution of the function.
+while the `nogil` argument disables the GIL during the execution of the function.

:::callout
## Use `nopython=True` or `@numba.njit`
-It's generally a good idea to use `nopython=True` with `@numba.jit` to make sure the entire
-function is running without referencing Python objects, because that will dramatically slow
-down most Numba code. There's even a decorator that has `nopython=True` by default: `@numba.njit`
+It is generally a good idea to use `nopython=True` with `@numba.jit` to make sure the entire
+function runs without referencing Python objects, since such references dramatically slow
+down most Numba code. The decorator `@numba.njit` even has `nopython=True` by default.
:::

Now we can run the benchmark again, using `calc_pi_nogil` instead of `calc_pi`.

:::challenge
## Exercise: try threading on a Numpy function
-Many Numpy functions unlock the GIL. Try to sort two randomly generated arrays using `numpy.sort` in parallel.
+Many Numpy functions unlock the GIL. Try to sort two randomly generated arrays using `numpy.sort` in parallel.

::::solution
## Solution

```python
rnd1 = np.random.random(high)
rnd2 = np.random.random(high)
%timeit -n 10 -r 10 np.sort(rnd1)
```

```python
%%timeit -n 10 -r 10
t1 = Thread(target=np.sort, args=(rnd1, ))
t2 = Thread(target=np.sort, args=(rnd2, ))

t1.start()
t2.start()

t1.join()
t2.join()
```
::::
:::

# Multiprocessing
-Python also allows for using multiple processes for parallelisation
+Python also enables parallelisation with multiple processes
via the `multiprocessing` module. It implements an API that is
-superficially similar to threading:
+superficially similar to threading:

```python
from multiprocessing import Process

def calc_pi(N):
    ...

if __name__ == '__main__':
    # Required by Windows
    n = 10**7
    p1 = Process(target=calc_pi, args=(n//2,))
    p2 = Process(target=calc_pi, args=(n//2,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()
```

-However under the hood processes are very different from threads. A
-new process is created by creating a fresh "copy" of the python
-interpreter, that includes all the resources associated to the parent.
There are three different ways of doing this (*spawn*, *fork*, and
-*forkserver*), which depends on the platform.
+However, under the hood, processes are very different from threads. A
+new process is created by creating a fresh "copy" of the Python
+interpreter that includes all the resources associated to the parent.
+There are three different ways of doing this (*spawn*, *fork*, and
+*forkserver*), whose availability depends on the platform. We will use *spawn* as
it is available on all platforms. You can read more about the others
in the [Python documentation](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
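A minimal sketch of requesting the *spawn* method explicitly through a context (standard library only; `greet` is a placeholder function):

```python
import multiprocessing as mp

def greet():
    print("hello from a child process")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # independent of the platform default
    p = ctx.Process(target=greet)
    p.start()
    p.join()
```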
-As creating a process is resource intensive, multiprocessing is
-beneficial under limited circumstances - namely, when the resource
+Since creating a process is resource-intensive, multiprocessing is
+beneficial under limited circumstances --- namely, when the resource
utilisation (or runtime) of a function is *measureably* larger than
the overhead of creating a new process.

:::callout
## Protect your script
When using multiprocessing, protect the entry point of your program:

```python
if __name__ == "__main__":
    calc_pi(10**9)
```
:::

-The non-intrusive and safe way of starting a new process is acquire a
-`context`, and working within the context. This ensures your
+The non-intrusive and safe way of starting a new process is to acquire a
+`context` and work within that context. This ensures that your
application does not interfere with any other processes that might be
in use.

```python
import multiprocessing as mp

def calc_pi(N):
    ...

if __name__ == '__main__':
    # Use a context to create processes
    ctx = mp.get_context("spawn")
    ...
```

:::callout
## Sharing state
-It is also possible to share state between processes. The simpler
-of the several ways is to use shared memory via `Value` or `Array`.
+It is also possible to share state between processes. The simplest way is to use shared memory via `Value` or `Array`.
You can access the underlying value using the `.value` property.
-Note, in case you want to do an operation that is not atomic (cannot
-be done in one step, e.g. using the `+=` operator), you should
-explicitly acquire a lock before performing the operation:
+Note that you should
+explicitly acquire a lock before performing an operation that is not atomic (i.e., one that cannot
+be done in one step, e.g., using the `+=` operator):

```python
with var.get_lock():
    var.value += 1
```

-Since Python 3.8, you can also create a numpy array backed by a
+Since Python 3.8, you can also create a Numpy array backed by a
shared memory buffer
([`multiprocessing.shared_memory.SharedMemory`](https://docs.python.org/3/library/multiprocessing.shared_memory.html)),
which can then be accessed from separate processes *by name*
-(including separate interactive shells!).
+(including from separate interactive shells!).
:::

## Process pool

The `Pool` API provides a pool of worker processes that can execute
-tasks. Methods of the pool object offer various convenient ways to
+tasks. Methods of the `Pool` object offer various convenient ways to
implement data parallelism in your program. The most convenient way
to create a pool object is with a context manager, either using the
toplevel function `multiprocessing.Pool`, or by calling the `.Pool()`
-method on the context. With the pool object, tasks can be submitted
+method on the context. With the `Pool` object, tasks can be submitted
by calling methods like `.apply()`, `.map()`, `.starmap()`, or their
`.*_async()` versions.

:::challenge
## Exercise: adapt the original exercise to submit tasks to a pool
- Use the original `calc_pi` function (without the queue)
- Submit batches of different sample size (different values of `N`).
-- As mentioned earlier, creating a new process has overhead. Try a
-wide range of sample sizes and check if runtime scaling supports
-that claim.
+- As mentioned earlier, creating a new process entails overhead.
Try a
+wide range of sample sizes and check whether the runtime scaling supports that claim.

::::solution
## Solution
@@ -304,8 +302,8 @@ if __name__ == "__main__":
```
:::

:::keypoints
-- If we want the most efficient parallelism on a single machine, we need to circumvent the GIL.
-- If your code releases the GIL, threading will be more efficient than multiprocessing.
-- If your code does not release the GIL, some of your code is still in Python, and you're wasting precious compute time!
+- If we want the most efficient parallelism on a single machine, we need to work around the GIL.
+- If your code can release the GIL, threading will be more efficient than multiprocessing.
+- If your code keeps the GIL, some of your code is still in Python and you are wasting precious compute time!
:::

From a6d865270634cdadc8bae17080b98bf452436bf4 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @c2"
Date: Fri, 30 Jun 2023 22:51:09 +0200
Subject: [PATCH 07/32] delayed-evaluation: readability pass 1

---
 episodes/delayed-evaluation.md | 60 +++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/episodes/delayed-evaluation.md b/episodes/delayed-evaluation.md
index 22ed9fe..789b671 100644
--- a/episodes/delayed-evaluation.md
+++ b/episodes/delayed-evaluation.md
@@ -1,5 +1,5 @@
---
-title: 'Delayed evaluation'
+title: 'Delayed Evaluation'
teaching: 10
exercises: 2
---
@@ -10,12 +10,12 @@ exercises: 2
:::

:::objectives
-- Understand the abstraction of delayed evaluation
-- Use the `visualize` method to create dependency graphs
+- Understand the abstraction of delayed evaluation.
+- Use the `visualize` method to create dependency graphs.
:::

-[Dask](https://dask.org/) is one of the many tools available for parallelizing Python code in a comfortable way. We've seen a basic example of `dask.array` in a previous episode. Now, we will focus on the `delayed` and `bag` sub-modules. Dask has a lot of other useful components, such as `dataframe` and `futures`, but we are not going to cover them in this lesson.
+[Dask](https://dask.org/) is one of many convenient tools available for parallelizing Python code. We have seen a basic example of `dask.array` in a previous episode. Now, we will focus on the `delayed` and `bag` sub-modules. Dask has other useful components that we do not cover in this lesson, such as `dataframe` and `futures`.

See an overview below:

| Dask module | Abstraction | Keywords | Covered |
|---|---|---|---|
| `dask.array` | `numpy` | Numerical analysis | ✔️ |
| `dask.bag` | `itertools` | Map, filter, groupby, reduce | ✔️ |
| `dask.delayed` | functions | Anything that doesn't fit the above | ✔️ |
| `dask.dataframe` | `pandas` | Generic data analysis | ❌ |
| `dask.futures` | `concurrent.futures` | Control execution, low-level | ❌ |

# Dask Delayed
-A lot of the functionality in Dask is based on top of a concept known as *delayed evaluation*. Because this concept is so very important in understanding how Dask functions, we will go a bit deeper into `dask.delayed`.
+A lot of the functionality in Dask is based on an important concept known as *delayed evaluation*. Because this concept is central to how Dask works, we will go a bit deeper into `dask.delayed`.

-By using `dask.delayed` we change the strategy by which our computation is evaluated. Normally in a computer, you expect commands to be run when you ask for them, and then when the job is complete, you can give the next command. When we use delayed evaluation, we don't wait around to formulate the next command. Instead we create the dependency graph of our complete computation without actually doing any work. When we know the full dependency graph, we can see which jobs can be done in parallel and give those to different workers.
+`dask.delayed` changes the strategy by which our computation is evaluated. 
Normally, you expect that a computer runs commands when you ask for them, and that you can give the next command when the current job is complete. With delayed evaluation we do not wait before formulating the next command. Instead, we create the dependency graph of our complete computation without actually doing any work. Once we build the full dependency graph, we can see which jobs can be done in parallel and attribute those to different workers. -To express a computation in this world, we need to handle future objects *as if they're already there*. These objects may be refered to as *futures* or *promises*. +To express a computation in this world, we need to handle future objects *as if they're already there*. These objects may be referred to as either *futures* or *promises*. :::callout -Python has support for working with futures in several libraries, each time slightly different. The main difference between Python futures and Dask delayed objects is that futures are added to a queue from the first moment you define them, while delayed objects are silent until you ask to compute. We will refer to these 'live' futures as futures, and 'dead' futures (like delayed) as **promises**. +Several Python libraries provide slightly different support for working with futures. The main difference between Python futures and Dask delayed objects is that futures are added to a queue at the point of definition, while delayed objects are silent until you ask to compute. We will refer to such 'live' futures as futures and to 'dead' futures (including the delayed) as **promises**. ::: ~~~python @@ -52,8 +52,7 @@ def add(a, b): return a + b ~~~ -A `delayed` function stores the requested function call inside a **promise**. The function is not actually executed yet, instead we -are *promised* a value that can be computed later. +A `delayed` function stores the requested function call inside a **promise**. The function is not actually executed yet, and we are *promised* a value that can be computed later. ~~~python x_p = add(1, 2) @@ -68,12 +67,12 @@ type(x_p) [out]: dask.delayed.Delayed ~~~ -> ## Note -> It is often a good idea to suffix variables that you know are promises with `_p`. That way you +> ## Note on notation +> It is a good idea to suffix with `_p` variables that are promises. That way you > keep track of promises versus immediate values. {: .callout} -Only when we evaluate the computation, do we get an output. +Only when we ask to evaluate the computation do we get an output: ~~~python x_p.compute() @@ -83,7 +82,7 @@ x_p.compute() [out]: 3 ~~~ -From `Delayed` values we can create larger workflows and visualize them. +From `Delayed` values we can create larger workflows and visualize them: ~~~python x_p = add(1, 2) @@ -104,7 +103,7 @@ y_p = add(x_p, 3) z_p = add(x_p, -3) ``` -Visualize and compute `y_p` and `z_p` separately, how often is `x_p` evaluated? +Visualize and compute `y_p` and `z_p` separately. How often is `x_p` evaluated? Now change the workflow: @@ -115,7 +114,7 @@ z_p = add(x_p, y_p) z_p.visualize(rankdir="LR") ``` -We pass the yet uncomputed promise `x_p` to both `y_p` and `z_p`. Now, only compute `z_p`, how often do you expect `x_p` to be evaluated? Run the workflow to check your answer. +We pass the not-yet-computed promise `x_p` to both `y_p` and `z_p`. If you only compute `z_p`, how often do you expect `x_p` to be evaluated? Run the workflow to check your answer. 
::::solution ## Solution @@ -128,11 +127,11 @@ z_p.compute() 3 + 6 = 9 [out]: 9 ``` -The computation of `x_p` (1 + 2) appears only once. This should teach you to procrastinate calling `compute` as long as you can. +The computation of `x_p` (1 + 2) appears only once. This should convince you to procrastinate the call `compute` as long as you can. :::: ::: -We can also make a promise by directly calling `delayed` +We can also make a promise by directly calling `delayed`: ~~~python N = 10**7 @@ -143,7 +142,7 @@ It is now possible to call `visualize` or `compute` methods on `x_p`. :::callout ## Decorators -In Python the decorator syntax is equivalent to passing a function through a function adapter (a.k.a. a higher order function or a functional). This adapter can change the behaviour of the function in many ways. The statement, +In Python the decorator syntax is equivalent to passing a function through a function adapter, also known as a higher order function or a functional. This adapter can change the behaviour of the function in many ways. The statement ```python @delayed @@ -163,7 +162,7 @@ sqr = delayed(sqr) :::callout ## Variadic arguments -In Python you can define functions that take arbitrary number of arguments: +In Python you can define functions taking arbitrary number of arguments: ```python def add(*args): @@ -172,7 +171,7 @@ def add(*args): add(1, 2, 3, 4) # => 10 ``` -You can use tuple-unpacking to pass a sequence of arguments: +You can then use tuple-unpacking to pass a sequence of arguments: ```python numbers = [1, 2, 3, 4] @@ -180,7 +179,7 @@ add(*numbers) # => 10 ``` ::: -We can build new primitives from the ground up. An important function that you will find in many different places where non-standard evaluation strategies are involved is `gather`. We can implement `gather` as follows: +We can build new primitives from the ground up. An important function frequently found where non-standard evaluation strategies are involved is `gather`. We can implement `gather` as follows: ~~~python @delayed @@ -191,7 +190,7 @@ def gather(*args): :::challenge ## Challenge: understand `gather` Can you describe what the `gather` function does in terms of lists and promises? -hint: Suppose I have a list of promises, what does `gather` allow me to do? +Hint: Suppose I have a list of promises, what does `gather` enable me to do? ::::solution ## Solution @@ -199,7 +198,7 @@ It turns a list of promises into a promise of a list. ::: :::: -We can visualize what `gather` does by this small example. +This small example shows what `gather` does: ~~~python x_p = gather(*(add(n, n) for n in range(10))) # Shorthand for gather(add(1, 1), add(2, 2), ...) @@ -209,25 +208,26 @@ x_p.visualize() ![a gather pattern](fig/dask-gather-example.svg) {.output alt="boxes and arrows"} -Computing the result, +Computing the result ~~~python x_p.compute() ~~~ +gives ~~~output [out]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] ~~~ :::challenge -## Challenge: design a `mean` function and calculate pi -Write a `delayed` function that computes the mean of its arguments. Use it to esimates pi several times and returns the mean of the results. +## Challenge: design a `mean` function and calculate $\pi$ +Write a `delayed` function that computes the mean of its arguments. Use it to estimate $\pi$ several times and have it return the mean of the intermediate results. ```python >>> mean(1, 2, 3, 4).compute() 2.5 ``` -Make sure that the entire computation is contained in a single promise. 
+Ensure that the entire computation is contained in a single promise.

::::solution
## Solution
@@ -257,9 +257,9 @@ pi_p.compute()
::::
:::

-You may not seed a significant speedup. This is because `dask delayed` uses threads by default and our native Python implementation of `calc_pi` does not circumvent the GIL. With for example the numba version of `calc_pi` you should see a more significant speedup.
+You may not see a significant speed-up. This is because `dask delayed` uses threads by default, and our native Python implementation of `calc_pi` does not circumvent the GIL. You should see a more significant speed-up with the Numba version of `calc_pi`, for example.

-In practice you may not need to use `@delayed` functions too often, but it does offer ultimate flexibility. You can build complex computational workflows in this manner, sometimes replacing shell scripting, make files and the likes.
+In practice, you may not need to use `@delayed` functions frequently, but they do offer ultimate flexibility. You can build complex computational workflows in this manner, sometimes replacing shell scripting, make files, and suchlike.

:::keypoints
- We can change the strategy by which a computation is evaluated.

From a1c34977597233747d3756b4f7a785cb339701bc Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @c2"
Date: Sat, 1 Jul 2023 08:00:11 +0200
Subject: [PATCH 08/32] map-and-reduce.md: readability pass 1

---
 episodes/map-and-reduce.md | 43 +++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/episodes/map-and-reduce.md b/episodes/map-and-reduce.md
index 6ee2d08..291d7cb 100644
--- a/episodes/map-and-reduce.md
+++ b/episodes/map-and-reduce.md
@@ -1,30 +1,30 @@
---
-title: 'Map and reduce'
+title: 'Map and Reduce'
teaching: 60
exercises: 30
---

:::questions
-- What abstractions does Dask offer?
-- What programming patterns exist in the parallel universe?
+- Which abstractions does Dask offer?
+- Which programming patterns exist in the parallel universe?
:::

:::objectives
-- Recognize `map`, `filter` and `reduce` patterns
+- Recognize `map`, `filter` and `reduction` patterns
- Create programs using these building blocks
- Use the `visualize` method to create dependency graphs
:::

-In computer science *bags* refer to unordered collections of data. In Dask, a `bag` is a collection that is chunked internally. When you perform operations on a bag, these operations are automatically parallelized over the chunks inside the bag.
+In computer science *bags* are unordered collections of data. In Dask, a `bag` is a collection that gets chunked internally. Operations on a bag are automatically parallelized over the chunks inside the bag.

Dask bags let you compose functionality using several primitive patterns: the most important of these are `map`, `filter`, `groupby`, `flatten`, and `reduction`.

:::discussion
## Discussion
Open the [Dask documentation on bags](https://docs.dask.org/en/latest/bag-api.html).
Discuss the `map`, `filter`, `flatten` and `reduction` methods.

-In this set of operations `reduction` is rather special. All other operations on bags could be written in terms of a reduction.
+In this set of operations `reduction` is rather special, because all other operations on bags can be written in terms of a reduction. 
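To make this point concrete, here is a minimal plain-Python sketch (our own illustration, not code from the Dask API) of how `map` and `filter` can be written as folds, i.e. reductions, over a sequence:

```python
from functools import reduce

def map_as_reduction(f, xs):
    # Fold the sequence, applying f to each element and appending the result to the accumulator.
    return reduce(lambda acc, x: acc + [f(x)], xs, [])

def filter_as_reduction(pred, xs):
    # Fold the sequence, keeping only the elements that satisfy the predicate.
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])

assert map_as_reduction(lambda x: x * x, [1, 2, 3]) == [1, 4, 9]
assert filter_as_reduction(lambda x: x % 2 == 0, [1, 2, 3, 4]) == [2, 4]
```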
::: Operations on this level can be distinguished in several categories: @@ -32,15 +32,15 @@ Operations on this level can be distinguished in several categories: - **map** (N to N) applies a function *one-to-one* on a list of arguments. This operation is **embarrassingly parallel**. - **filter** (N to <N) selects a subset from the data. -- **reduce** (N to 1) computes an aggregate from a sequence of data; if the operation permits it - (summing, maximizing, etc) this can be done in parallel by reducing chunks of data and then +- **reduction** (N to 1) computes an aggregate from a sequence of data; if the operation permits it + (summing, maximizing, etc), this can be done in parallel by reducing chunks of data and then further processing the results of those chunks. - **groupby** (1 bag to N bags) groups data in subcategories. - **flatten** (N bags to 1 bag) combine many bags into one. -Let's see an example of it in action: +Let's see examples of them in action. -First, let's create the `bag` containing the elements we want to work with (in this case, the numbers from 0 to 5). +First, let's create the `bag` containing the elements we want to work with. In this case, the numbers from 0 to 5. ~~~python import dask.bag as db @@ -51,8 +51,7 @@ bag = db.from_sequence(['mary', 'had', 'a', 'little', 'lamb']) ### Map -To illustrate the concept of `map`, we'll need a mapping function. -In the example below we'll just use a function that squares its argument: +A function that squares its argument is a mapping function that illustrates the concept of `map`: ~~~python # Create a function for mapping @@ -78,9 +77,9 @@ bag.map(f).visualize() ### Filter -To illustrate the concept of `filter`, it is useful to have a function that returns a boolean. -In this case, we'll use a function that returns `True` if the argument contains the letter 'a', -and `False` if it doesn't. +A function returning a boolean is a useful illustration of the concept of `filter`. +In this case, we use a function returning `True` if the argument contains the letter 'a', +and `False` if it does not. ~~~python # Return True if x is even, False if not @@ -96,7 +95,7 @@ bag.filter(pred).compute() :::challenge ## Difference between `filter` and `map` -Without executing it, try to forecast what would be the output of `bag.map(pred).compute()`. +Forecast the output of `bag.map(pred).compute()` without executing it. ::::solution ## Solution @@ -119,9 +118,9 @@ bag.reduction(count_chars, sum).visualize() :::challenge ## Challenge: consider `pluck` -We previously discussed some generic operations on bags. In the documentation, lookup the `pluck` method. How would you implement this if `pluck` wasn't there? +We previously discussed some generic operations on bags. In the documentation, lookup the `pluck` method. How would you implement `pluck` if it was not there? -hint: Try `pluck` on some example data. +Hint: Try `pluck` on some example data. ```python from dask import bags as db @@ -148,7 +147,7 @@ bag.map(partial(getitem, "name")).compute() :::: ::: -FIXME: find replacement for word counting example +FIXME: find replacement for word counting example. :::challenge ## Challenge: Dask version of Pi estimation @@ -182,10 +181,10 @@ estimate.compute() :::callout ## Note -By default Dask runs a bag using multi-processing. This alleviates problems with the GIL, but also means a larger overhead. +By default Dask runs a bag using multiprocessing. This alleviates problems with the GIL, but also entails a larger overhead. 
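If the default does not suit your workload, a sketch of how you could select a different scheduler yourself (assuming a recent Dask version, where `compute` accepts a `scheduler` keyword):

```python
import dask.bag as db

bag = db.from_sequence(range(6))
squared = bag.map(lambda x: x * x)

squared.compute()                         # default scheduler for bags: processes
squared.compute(scheduler="threads")      # threads: cheaper, but subject to the GIL
squared.compute(scheduler="synchronous")  # single-threaded: handy for debugging
```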
::: :::keypoints -- Use abstractions to keep programs manageable +- Use abstractions to keep programs manageable. ::: From d1286d05a65dabf82997810b7e1ea4908d1dc667 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 08:20:22 +0200 Subject: [PATCH 09/32] exercise-with-fractals.md: readability pass 1 --- episodes/exercise-with-fractals.md | 33 +++++++++++++++--------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/episodes/exercise-with-fractals.md b/episodes/exercise-with-fractals.md index 833dc07..a00eb54 100644 --- a/episodes/exercise-with-fractals.md +++ b/episodes/exercise-with-fractals.md @@ -1,16 +1,15 @@ ---- title: 'Exercise with Fractals' teaching: 10 exercises: 50 --- :::questions -- Can we try a real problem now? +- Can we tackle a real problem now? ::: :::objectives -- Create a strategy to parallelise existing code -- Apply previous lessons +- Create a strategy to parallelize existing code. +- Apply previous lessons. ::: # The Mandelbrot and Julia fractals @@ -27,7 +26,7 @@ fractal](https://en.wikipedia.org/wiki/Mandelbrot_fractal). :::callout ## Complex numbers -Complex numbers are a special representation of rotations and scalings in the two-dimensional plane. Multiplying two complex numbers is the same as taking a point, rotate it by an angle $\phi$ and scale it by the absolute value. Multiplying with a number $z \in \mathbb{C}$ by 1 preserves $z$. Multiplying a point at $i = (0, 1)$ (having a positive angle of 90 degrees and absolute value 1), rotates it anti-clockwise by 90 degrees. Then you might see that $i^2 = (-1, 0)$. The funny thing is, that we can treat $i$ as any ordinary number, and all our algebra still works out. This is actually nothing short of a miracle! We can write a complex number +Complex numbers are a special representation of rotations and scalings in the two-dimensional plane. Multiplying two complex numbers is the same as taking a point, rotate it by an angle $\phi$ and scale it by the absolute value. Multiplying with a number $z \in \mathbb{C}$ by 1 preserves $z$. Multiplying a point at $i = (0, 1)$ (having a positive angle of 90 degrees and absolute value 1), rotates it anti-clockwise by 90 degrees. Then you might see that $i^2 = (-1, 0)$. The funny thing is that we can treat $i$ as any ordinary number, and all our algebra still works out. This is actually nothing short of a miracle! We can write a complex number $$z = x + iy,$$ @@ -78,8 +77,8 @@ ax.set_xlabel("$\Re(c)$") ax.set_ylabel("$\Im(c)$") ``` -Things become really loads of fun when we start to zoom in. We can play around with the `center` and -`extent` values (and necessarily `max_iter`) to control our window. +Things become really loads of fun when we zoom in. We can play around with the `center` and +`extent` values, and necessarily `max_iter`, to control our window. ```python max_iter = 1024 @@ -93,11 +92,11 @@ When we zoom in on the Mandelbrot fractal, we get smaller copies of the larger s :::challenge ## Exercise -Make this into an efficient parallel program. What kind of speed-ups do you get? +Turn this into an efficient parallel program. What kind of speed-ups do you get? ::::solution ### Create a `BoundingBox` class -We start with a naive implementation. It may be convenient to define a `BoundingBox` class in a separate module `bounding_box.py`. We'll add methods to this class later on. +We start with a naive implementation. It may be convenient to define a `BoundingBox` class in a separate module `bounding_box.py`. 
We add methods to this class later on. ``` {.python file="src/mandelbrot/bounding_box.py"} from dataclasses import dataclass @@ -156,7 +155,7 @@ def plot_fractal(box: BoundingBox, values: np.ndarray, ax=None): ::::solution ## Some solutions -The main approach with Python will be: use Numba to make this fast. Then there are two ways to parallelize: let Numba parallelize the function, or do a manual domain decomposition and use one of many ways in Python to run things multi-threaded. There is a third way: create a vectorized function and parallelize using `dask.array`. This last option is almost always slower than `@njit(parallel=True)` or domain decomposition. +The natural approach with Python is to speed this up with Numba. Then, there are three ways to parallelize: first, let Numba parallelize the function; second, do a manual domain decomposition and use one of the many Python ways to run things multi-threaded; third, create a vectorized function and parallelize using `dask.array`. This last option is almost always slower than `@njit(parallel=True)` and domain decomposition. ``` {.python file="src/mandelbrot/__init__.py"} @@ -207,7 +206,7 @@ def compute_mandelbrot( ``` ### Numba `parallel=True` -We can parallelize loops directly with Numba. Pass the flag `parallel=True` and use `prange` to create the loop. Here it is even more important to obtain the result array outside the context of Numba, or the result will be slower than the serial version. +We can parallelize loops directly with Numba. Pass the flag `parallel=True` and use `prange` to create the loop. Here, it is even more important to obtain the result array outside the context of Numba, otherwise the result will be slower than the serial version. ``` {.python file="src/mandelbrot/numba_parallel.py"} from typing import Optional @@ -247,7 +246,7 @@ def compute_mandelbrot(box: BoundingBox, max_iter: int, ::::solution ## Domain splitting -We split the computation into a set of sub-domains. The `BoundingBox.split()` method is designed such that if we deep-map the resulting list-of-lists, we can recombine the results using `numpy.block()`. +We split the computation into a set of sub-domains. The `BoundingBox.split()` method is designed so that, if we deep-map the resulting list-of-lists, we can recombine the results using `numpy.block()`. ``` {.python #bounding-box-methods} def split(self, n): @@ -308,7 +307,7 @@ def compute_mandelbrot(box: BoundingBox, max_iter: int, ::::solution ## Numba vectorize -Another solution is to use Numba's `@guvectorize` decorator. The speed-up (on my machine) is not as dramatic as with the domain-decomposition though. +Another solution is to use Numba's `@guvectorize` decorator. The speed-up (on my machine) is not as dramatic as with the domain decomposition though. ``` {.python #bounding-box-methods} def grid(self): @@ -458,12 +457,12 @@ If we take the center of the last image, we get the following rendering of the J :::challenge ## Generalize -Can you generalize your Mandelbrot code, such that you can compute both the Mandelbrot and the Julia sets in an efficient manner, while reusing as much of the code? +Can you generalize your Mandelbrot code to compute both the Mandelbrot and the Julia sets efficiently, while reusing as much code as possible? 
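As a hint, consider factoring out the escape-time loop that both fractals share. A sketch under assumed names (`escape_time` is our own suggestion, not part of the lesson code):

```python
import numba

@numba.njit(nogil=True)
def escape_time(z: complex, c: complex, max_iter: int) -> int:
    # Shared inner loop: iterate z -> z**2 + c until |z| > 2 or max_iter is reached.
    for k in range(max_iter):
        z = z**2 + c
        if (z * z.conjugate()).real > 4.0:
            return k
    return max_iter

# Mandelbrot: z starts at 0 and c runs over the grid.
# Julia: z runs over the grid and c is one fixed constant.
```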
:::

:::keypoints
-- Actually making code faster is not always straight forward
-- Easy one-liners *can* get you 80% of the way
-- Writing clean, modular code often makes it easier to parallelise later on
+- Actually making code faster is not always straightforward.
+- Easy one-liners *can* get you 80% of the way.
+- Writing clean and modular code often makes parallelization easier later on.
:::

From 920252ff81566ac02a75db23df415ad5d3bd30c3 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @c2"
Date: Sat, 1 Jul 2023 12:58:46 +0200
Subject: [PATCH 10/32] extra-asyncio.md: readability pass 1

---
 episodes/extra-asyncio.md | 82 +++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/episodes/extra-asyncio.md b/episodes/extra-asyncio.md
index 5fa3f81..6abd3c1 100644
--- a/episodes/extra-asyncio.md
+++ b/episodes/extra-asyncio.md
@@ -1,4 +1,4 @@
---
-title: 'asyncio'
+title: 'Asyncio'
teaching: 30
exercises: 10
@@ -6,7 +6,7 @@ exercises: 10

:::questions
- What is Asyncio?
-- When is asyncio usefull?
+- When is Asyncio useful?
:::

:::objectives
@@ -16,18 +16,18 @@
:::

# Introduction to Asyncio
-Asyncio stands for "asynchronous IO", and as you might have guessed it has little to do with either asynchronous work or doing IO. In general, asynchronous is an adjective describing objects or events that are not coordinated in time. In fact, the `asyncio` system is more like a carefully tuned set of gears running a multitude of tasks *as if* you have a lot of OS threads running. In the end they are all powered by the same crank. The gears in `asyncio` are called **coroutines**, its teeth moving other coroutines wherever you find the `await` keyword.
+Asyncio stands for "asynchronous IO" and, as you might have guessed, has little to do with either asynchronous work or doing IO. In general, the adjective asynchronous describes objects or events not coordinated in time. In fact, the `asyncio` system is more like a set of gears carefully tuned to run a multitude of tasks *as if* a lot of OS threads were running. In the end, they are all powered by the same crank. The gears in `asyncio` are called **coroutines** and their teeth move other coroutines wherever you find the `await` keyword.

-The main application for `asyncio` is hosting back-ends for web services, where a lot of tasks may be waiting on each other, while the server still needs to be responsive to new events. In that respect, `asyncio` is a little bit outside the domain of computational science. Nevertheless, you may encounter async code in the wild, and you *can* do parallelism with `asyncio` if you want a higher level abstraction but don't want to depend on `dask` or a similar alternative.
+The main application for `asyncio` is hosting back-ends for web services, where a lot of tasks may be waiting for each other while the server remains responsive to new events. In that regard, `asyncio` is a little bit outside the domain of computational science. Nevertheless, you may encounter Asyncio code in the wild, and you *can* do parallelism with Asyncio if you want higher-level abstraction without `dask` or a similar alternative.

-Many modern programming languages have features that are very similar to `asyncio`.
+Many modern programming languages do have features very similar to `asyncio`.

## Run-time
-The main point of `asyncio` is that it offers a different formalism for doing work than what you're used to from functions. To see what that means, we need to understand functions a bit better. 
+The distinctive point of `asyncio` is a formalism for carrying out work that is different from usual function. We need to look deeper into functions to appreciate the distinction. ### Call stacks -A function call is best understood in terms of a stack based system. When you call a function, you give it its arguments and forget for the moment what you were doing. Or rather, whatever you were doing, push it on a stack and forget about it. Then, with a clean sheet (called a stack frame), you start working on the given arguments until you arrive at a result. This result is what you remember, when you go back to the stack to see what you needed it for in the first place. -In this manner, every function call pushes a frame to the stack, and every return statement, we pop back to the previous. +A function call is best understood in terms of a stack-based system. When calling a function, you give it its arguments and temporarily forget what you were doing. Or, rather, you push on a stack and forget whatever you were doing. Then, you start working with the given arguments on a clean sheet (called a stack frame) until you obtain the function result. You remember only this result when you return to the stack and recall what you originally needed it for. +In this manner, every function call pushes a frame onto the stack and every return statement has us popping back to the previous frame. [![](https://mermaid.ink/img/pako:eNp1kL1Ow0AQhF9luSYgHApEdUUogpCoKRCSm8U3JpbtXXM_NlGUd-dMYgqkdKvb-Wb25mAqdTDWBHwlSIWnhj8996UQcRWbkSNoy10HPz-dpvVmc_ucJK9VLG01dY72mmjowAHEEiZ46mFp2nGkJlB9_X3zOBssWLZYn8wsvSMUFHd_YNY_3N9dinubLScOv8TgMTaawhm9GPFCTmUVqRWdCpqwGkGCMYeFQVsIfaBWj6uZF81f1nm30BCXk3TpxeFfM6YwPXzPjctFHmZJafJ1PUpj8-jYt6Up5Zh1nKK-7qUyNvqEwqTBZZ9z6cbW3AUcfwB5sYta?type=png)](https://mermaid.live/edit#pako:eNp1kL1Ow0AQhF9luSYgHApEdUUogpCoKRCSm8U3JpbtXXM_NlGUd-dMYgqkdKvb-Wb25mAqdTDWBHwlSIWnhj8996UQcRWbkSNoy10HPz-dpvVmc_ucJK9VLG01dY72mmjowAHEEiZ46mFp2nGkJlB9_X3zOBssWLZYn8wsvSMUFHd_YNY_3N9dinubLScOv8TgMTaawhm9GPFCTmUVqRWdCpqwGkGCMYeFQVsIfaBWj6uZF81f1nm30BCXk3TpxeFfM6YwPXzPjctFHmZJafJ1PUpj8-jYt6Up5Zh1nKK-7qUyNvqEwqTBZZ9z6cbW3AUcfwB5sYta) @@ -43,14 +43,14 @@ sequenceDiagram ``` -Crucially, when we pop back, we forget about the stack frame inside the function. This way, there is always a single concious stream of thought. Function calls can be evaluated by a single active agent. +Crucially, when we pop back, we also forget the stack frame inside the function. This way of doing always keeps a single conscious stream of thought. Function calls can be evaluated by a single active agent. ### Coroutines :::instructor -This section goes rather in depth on coroutines. This is meant to grow the correct mental model about what's going on with `asyncio`. +This section goes rather in depth on coroutines. This is meant to nurture the correct mental model about what's going on with `asyncio`. ::: -When working with coroutines, things are a bit different. When a result is returned from a coroutine, the coroutine keeps existing, its context is not forgotten. Coroutines exist in Python in several forms, the simplest being a **generator**. The following generator produces all integers (if you wait long enough): +Working with coroutines changes things a bit. The coroutine keeps on existing and its context is not forgotten when a coroutine returns a result. Python has several forms of coroutines and the simplest is a **generator**. 
For example, the following generator produces all integers (if you wait long enough): ```python def integers(): @@ -60,7 +60,7 @@ def integers(): a += 1 ``` -Then +Then: ```python for i in integers(): @@ -69,7 +69,7 @@ for i in integers(): break ``` -or +or: ```python from itertools import islice @@ -99,7 +99,7 @@ sequenceDiagram :::challenge ## Challenge: generate all even numbers -Can you write a generator that generates all even numbers? Try to reuse `integers()`. Extra: Can you generate the Fibonacci numbers? +Can you write a generator for all even numbers? Reuse `integers()`. Extra: Can you generate the Fibonacci series? ::::solution ```python @@ -116,7 +116,7 @@ def even_integers(): return (i for i in integers() if i % 2 == 0) ``` -For the Fibonacci numbers: +For the Fibonacci series: ```python def fib(): @@ -128,9 +128,9 @@ def fib(): :::: ::: -The generator gives away control, passing a value back, expecting, maybe, if faith has it, that control will be passed back to it in the future. The keyword `yield` applies in all its meanings: control is yielded, and we have a yield in terms of harvesting a crop. +The generator gives away control, passing a value back and expecting to receive control one more time, if faith has it. All meanings of the keyword `yield` apply here: the coroutine yields and produces a yield, as if we were harvesting a crop. -A generator conceptually only has one-way traffic: we get output. We can also use `yield` the other way around: it can be used to send information to a coroutine. For instance: we can have a coroutine that prints whatever you send to it. +Conceptually, a generator entails one-way traffic only: we get output. However, we can use `yield` also to send information to a coroutine. For example, this coroutine prints whatever you send to it: ```python def printer(): @@ -162,19 +162,19 @@ def printer(): :::: ::: -In practice, the `send` form of coroutines is hardly ever used. Cases where you'd need it are rare, and chances are noone will understand your code. Where it was needed before, its use is now largely superceded by `asyncio`. +In practice, the send-form of coroutines is hardly ever used. Cases for needing it are infrequent, and chances are that nobody will understand your code. Asyncio has largely superseded this usage. -Now that you have seen coroutines, it is a small step towards `asyncio`. The idea is that you can use coroutines to build a collaborative multi-threading environment. In most modern operating systems, execution threads are given some time, and then when the OS needs to do something else, control is taken away pre-emptively. In **collaborative multi-tasking**, every worker knows it is part of a collaborative, and it voluntarily yields control to the scheduler. With coroutines and `yield` you should be able to see that it is possible to create such a system, but it is not so straight forward, especially when you start to consider the propagation of exceptions. +The working of `asyncio` is only a small step away from that of coroutines. The intuition is that you can use coroutines to build a collaborative multi-threading environment. Most modern operating systems assign some time to execution threads and take control back pre-emptively to do something else. In **collaborative multi-tasking**, every worker knows to be part of a collaborative environment and yields control to the scheduler voluntarily. 
Creating such a system with coroutines and `yield` is possible in principle, but is not straightforward especially owing to the propagation of exceptions.

## Syntax
-While `asyncio` itself is a library in standard Python, this library is actually a core component for using the associated async syntax. There are two keywords here: `async` and `await`.
+`asyncio` itself is a library in standard Python and is a core component for actually using the associated async syntax. Two keywords are especially relevant here: `async` and `await`.

-`async` Is a modifier keyword that modifies the behaviour of any subsequent syntax to behave in a manner that is consistent with the asynchronous run-time.
+`async` is a modifier keyword that makes any subsequent syntax behave consistently with the asynchronous run-time.

-`await` Is used inside a coroutine to wait for another coroutine to yield a result. Effectively, control is passed back to the scheduler, which may decide to give back control when a result is present.
+`await` is used inside a coroutine to wait until another coroutine yields a result. Effectively, the scheduler takes control again and may decide to return it when a result is present.

# A first program
-Jupyter understands asynchronous code, so you can `await` futures in any cell.
+Jupyter understands asynchronous code, so you can `await` futures in any cell:

```python
import asyncio

async def counter(name):
    for i in range(5):
        print(f"{name} {i:03}")
        await asyncio.sleep(0.2)

await counter("Venus")
```

```output
Venus 000
Venus 001
Venus 002
Venus 003
Venus 004
```

-We can have coroutines work concurrently when we `gather` two coroutines.
+We can have coroutines work concurrently when we `gather` them:

```python
await asyncio.gather(counter("Earth"), counter("Moon"))
```

```output
Earth 000
Moon 000
Earth 001
Moon 001
Earth 002
Moon 002
Earth 003
Moon 003
Earth 004
Moon 004
```

-Note that, although the Earth counter and Moon counter seem to operate at the same time, in actuality they are alternated by the scheduler and still running in a single thread! If you work outside the confines of Jupyter, you need to make sure to create an asynchronous main function and run it using `asyncio.run`. A typical program will look like this:
+Note that, although the Earth counter and Moon counter seem to operate at the same time, the scheduler is actually alternating them in a single thread! If you work outside of Jupyter, you need an asynchronous main function and must run it using `asyncio.run`. A typical program will look like this:

```python
import asyncio

async def main():
    ...

if __name__ == "__main__":
    asyncio.run(main())
```

-Asyncio, just like we saw with Dask, is contagious. Once you have async code at some low level, higher level code also needs to be async: [it's turtles all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)! You may be tempted to do `asyncio.run` somewhere from the middle of your normal code to interact with the asyncronous parts. This can get you into trouble though, when you get multiple active asyncio run-times. While it is in principle possible to mix asyncio and classic code, it is in general considered bad practice to do so.
+Asyncio is as contagious as Dask. Any higher-level code must be async once you have some async code at low level: [it's turtles all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)! You may be tempted to implement `asyncio.run` in the middle of your code and interact with the asynchronous parts. Multiple active Asyncio run-times will get you into trouble, though. Mixing Asyncio and classic code is possible in principle, but is considered bad practice. 
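To see why a second run-time is a problem, consider this small sketch: `asyncio.run` raises a `RuntimeError` when called while an event loop is already running, so nested coroutines should be awaited instead.

```python
import asyncio

async def inner():
    await asyncio.sleep(0.1)
    return 42

async def outer():
    # asyncio.run(inner())  # RuntimeError: cannot be called from a running event loop
    return await inner()    # correct: stay inside the single run-time

if __name__ == "__main__":
    print(asyncio.run(outer()))  # the only asyncio.run() in the program
```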
## Timing asynchronous code -While Jupyter works very well with `asyncio`, one thing that doesn't work is line or cell-magic. We'll have to write our own timer. +Jupyter works very well with `asyncio` except for the line magics and cell magics. We must then write our own timer. :::instructor -It may be best to let participants copy paste this snippet from the collaborative document. You may want to explain what a context manager is, but don't overdo it. This is advanced code and may scare off novices. +It may be best to have participants copy and paste this snippet from the collaborative document. You may want to explain what a context manager is, but don't overdo it. This is advanced code and may scare off novices. ::: ``` {.python #async-timer} @@ -270,10 +270,10 @@ print(f"that took {t.time} seconds") that took 0.20058414503000677 seconds ``` -These few snippets of code require advanced Python knowledge to understand. Rest assured that both classic coroutines and `asyncio` are a large topic to cover, and we're not going to cover all of it. At least, we can now time the execution of our code! +Understanding these few snippets of code requires advanced knowledge of Python. Rest assured that both classic coroutines and `asyncio` are a large topic that we cannot cover completely. However, we can time the execution of our code now! ## Compute $\pi$ again -As a reminder, here is our Numba code to compute $\pi$. +As a reminder, here is our Numba code for computing $\pi$: ``` {.python #calc-pi-numba} import random @@ -294,7 +294,7 @@ def calc_pi(N): return 4 * M / N ``` -We can send work to another thread with `asyncio.to_thread`. +We can send this work to another thread with `asyncio.to_thread`: ```python async with timer() as t: @@ -303,7 +303,7 @@ async with timer() as t: :::challenge ## Gather multiple outcomes -We've seen that we can gather multiple coroutines using `asyncio.gather`. Now gather several `calc_pi` computations, and time them. +We have already seen that `asyncio.gather` gathers multiple coroutines. Here, gather several `calc_pi` computations and time them. ::::solution ```python @@ -323,7 +323,7 @@ async def calc_pi_split(N, M): return sum(lst) / M ``` -Now, see if we get a speed up. +and then verify the speed-up we get: ``` {.python #async-calc-pi-main} async with timer() as t: @@ -352,14 +352,14 @@ that took 0.5876454019453377 seconds ``` # Working with `asyncio` outside Jupyter -Jupyter already has an asyncronous loop running for us. If you want to run scripts outside Jupyter you should write an asynchronous main function and call it using `asyncio.run`. +Jupyter already has an asynchronous loop running for us. In order to run scripts outside Jupyter you should write an asynchronous main function and call it using `asyncio.run`. :::challenge ## Compute $\pi$ in a script -Collect what we have done so far to compute $\pi$ in parallel into a script and run it. +Collect in a script what we have done so far to compute $\pi$ in parallel, and run it. ::::solution -Make sure that you create an `async` main function, and run it using `asyncio.run`. Create a small module called `calc_pi`. +Ensure that you create an `async` main function, and run it using `asyncio.run`. Create a small module called `calc_pi`. ``` {.python file="src/calc_pi/__init__.py"} # file: calc_pi/__init__.py @@ -374,7 +374,7 @@ Put the Numba code in a separate file `calc_pi/numba.py`. <> ``` -Put the async timer function in a separate file `async_timer.py`. 
+Put the `async_timer` function in a separate file `async_timer.py`. ``` {.python file="src/async_timer.py"} # file: async_timer.py @@ -399,13 +399,13 @@ if __name__ == "__main__": asyncio.run(main()) ``` -You may run this using `python -m calc_pi.async_pi`. +You may run this script using `python -m calc_pi.async_pi`. :::: ::: :::challenge ## Efficiency -Play with different subdivisions for `calc_pi_split` such that `M*N` remains constant. How much overhead do you see? +Play with different subdivisions for `calc_pi_split` keeping `M*N` constant. How much overhead do you see? ::::solution ``` {.python file="src/calc_pi/granularity.py"} @@ -440,7 +440,7 @@ if __name__ == "__main__": ![timings](fig/asyncio-timings.svg){alt="a dip at njobs=10 and overhead ~0.1ms per task"} -The work takes about 0.1s more when using 1000 tasks, so assuming that overhead scales linearly with the amount of tasks, we can learn that the overhead is around 0.1ms per task. +The work takes about 0.1 s more when using 1000 tasks. So, assuming that overhead is distributed uniformly among the tasks, we observe that the overhead is around 0.1 ms per task. :::: ::: @@ -449,6 +449,6 @@ The work takes about 0.1s more when using 1000 tasks, so assuming that overhead - Use `await` to call coroutines. - Use `asyncio.gather` to collect work. - Use `asyncio.to_thread` to perform CPU intensive tasks. -- Inside a script: always make an asynchronous `main` function, and run it with `asyncio.run`. +- Inside a script: always create an asynchronous `main` function, and run it with `asyncio.run`. ::: From e431cee70ef68719f4347d3f4ac0bc459df72a37 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 14:50:37 +0200 Subject: [PATCH 11/32] extra-external-c.md: readability pass 1 --- episodes/extra-external-c.md | 87 ++++++++++++++++++------------------ 1 file changed, 44 insertions(+), 43 deletions(-) diff --git a/episodes/extra-external-c.md b/episodes/extra-external-c.md index 962172b..6f3bf5d 100644 --- a/episodes/extra-external-c.md +++ b/episodes/extra-external-c.md @@ -1,27 +1,26 @@ --- -title: "Calling external C and C++ libraries from Python" +title: "Calling External C and C++ Libraries from Python" teaching: 60 exercises: 30 --- :::questions -- What are some of my options in calling C and C++ libraries from Python code? +- Which options are available to call from Python code C and C++ libraries? - How does this work together with Numpy arrays? - How do I use this in multiple threads while lifting the GIL? ::: :::objectives - Compile and link simple C programs into shared libraries. -- Call these library from Python and time its executions. -- Compare the performance with Numba decorated Python code. +- Call these libraries from Python and time their executions. +- Compare the performance with Numba-decorated Python code. - Bypass the GIL when calling these libraries from multiple threads simultaneously. ::: # Calling C and C++ libraries ## Simple example using either pybind11 or ctypes -External C and C++ libraries can be called from Python code using a number of options, using e.g. Cython, CFFI, pybind11 and ctypes. -We will discuss the last two, because they require the least amount of boilerplate, for simple cases - -for more complex examples that may not be the case. Consider this simple C program, test.c, which adds up consecutive numbers: +External C and C++ libraries can be called from Python code using a number of options, e.g., Cython, CFFI, pybind11 and ctypes. 
+We will discuss the last two because simple cases require the least amount of boilerplate. This may not be the case with more complex examples. Consider this simple C program, `test.c`, which adds up consecutive numbers: ~~~c #include @@ -46,8 +45,7 @@ PYBIND11_MODULE(test_pybind, m) { ~~~ -You can easily compile and link it into a shared object (.so) file. First you need pybind11. You can install it in -a number of ways, like pip, but I prefer creating virtual environments using pipenv. +You can easily compile and link it into a shared object (`*.so`) file with `pybind11`. You can install that in several ways, like `pip`; I prefer creating virtual environments using `pipenv`: ~~~bash pip install pipenv @@ -57,7 +55,7 @@ pipenv shell c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` test.c -o test_pybind.so ~~~ -which generates a `test_pybind.so` shared object which you can call from a iPython shell, like this: +which generates a shared object `test_pybind.so`, which can be called from a iPython shell as follows: ~~~python %import test_pybind @@ -66,7 +64,7 @@ which generates a `test_pybind.so` shared object which you can call from a iPyth %brute_force_sum=sum_range(high) ~~~ -Now you might want to check the output, by comparing with the well-known formula for the sum of consecutive integers. +Now you might want to check and compare the output with the well-known formula for the sum of consecutive integers: ~~~python %sum_from_formula=high*(high-1)//2 %sum_from_formula @@ -74,9 +72,9 @@ Now you might want to check the output, by comparing with the well-known formula %difference ~~~ -Give this script a suitable name, like `call_C_libraries.py`. -The same thing can be done using ctypes instead of pybind11, but requires slightly more boilerplate -on the Python side of the code and slightly less on the C side. test.c will be just the algorithm: +Give this script a suitable name, such as `call_C_libraries.py`. +The same thing can be done using `ctypes` instead of `pybind11`, but the coding requires slightly more boilerplate +on the Python side and slightly less on the C side. The program `test.c` will just contain the algorithm: ~~~c long long sum_range(long long high) @@ -91,15 +89,15 @@ long long sum_range(long long high) } ~~~ -Compile and link using +Compile and link with: ~~~bash gcc -O3 -g -fPIC -c -o test.o test.c ld -shared -o libtest.so test.o ~~~ -which generates a libtest.so file. +which generates a `libtest.so` file. -You will need some extra boilerplate: +You then need some extra boilerplate: ~~~python %import ctypes @@ -111,7 +109,7 @@ You will need some extra boilerplate: %brute_force_sum=sum_range(high) ~~~ -Again, you can compare with the formula for the sum of consecutive integers. +Again, you can compare the result with the formula for the sum of consecutive integers: ~~~python %sum_from_formula=high*(high-1)/2 %sum_from_formula @@ -129,8 +127,8 @@ Now we can time our compiled `sum_range` C library, e.g. from the iPython interf 2.69 ms ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ~~~ -If you compare with the Numba timing from [chapter 3](computing-pi.md), you will see that the C library for `sum_range` is faster than -the numpy computation but significantly slower than the `numba.jit` decorated function. +If you compare with the Numba timing from [Episode 3](computing-pi.md), you will see that the C library for `sum_range` is faster than +the numpy computation but significantly slower than the `numba.jit`-decorated function. 
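For reference, a sketch of the Numba counterpart that this comparison assumes (mirroring the loop in the C function; the exact version in the earlier episode may differ slightly):

```python
import numba

@numba.jit
def sum_range_numba(a: int):
    # Brute-force sum of the integers below a, compiled by Numba.
    x = 0
    for i in range(a):
        x += i
    return x

sum_range_numba(10**7)  # the first call includes compilation time; time a later call
```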
:::challenge ## C versus Numba @@ -154,7 +152,7 @@ long long conditional_sum_range(long long to) ::::solution ## Solution -Insert a line `if i%3==0:` in the code for `sum_range_numba` and rename it to `conditional_sum_range_numba`. +Insert a line `if i%3==0:` in the code for `sum_range_numba` and rename it to `conditional_sum_range_numba`: ~~~python @numba.jit @@ -166,7 +164,7 @@ def conditional_sum_range_numba(a: int): return x ~~~ -Let's check how fast it runs. +Let's check how fast it runs: ~~~ %timeit conditional_sum_range_numba(10**7) @@ -176,7 +174,7 @@ Let's check how fast it runs. 8.11 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ~~~ -Compare this with the run time for the C code for conditional_sum_range. +Compare this with the run time of the C code for conditional_sum_range. Compile and link in the usual way, assuming the file name is still `test.c`: ~~~bash @@ -199,14 +197,14 @@ conditional_sum_range.restype = ctypes.c_longlong 7.62 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ~~~ -This shows that for this slightly more complicated example the C code is somewhat faster than the Numba decorated Python code. +The C code is somewhat faster than the Numba-decorated Python code for this slightly more complicated example. :::: ::: -## Passing Numpy arrays to C libraries. -Now let us consider a more complex example. Instead of computing the sum of numbers up to a certain upper limit, let us -compute that for an array of upper limits. This will return an array of sums. How difficult is it to modify our C and Python code -to get this done? Well, you just need to replace `&sum_range` by `py::vectorize(sum_range)`: +## Passing Numpy arrays to C libraries +Now let us consider a more complex example. Instead of computing the sum of numbers up to an upper limit, let us +compute that for an array of upper limits. This operation will return an array of sums. How difficult is it to modify our C and Python code +to get this done? You just need to replace `&sum_range` with `py::vectorize(sum_range)`: ~~~c PYBIND11_MODULE(test_pybind, m) { @@ -216,8 +214,9 @@ PYBIND11_MODULE(test_pybind, m) { } ~~~ -Now let's see what happens if we pass `test_pybind.so` an array instead of an integer. +Now let's see what happens if we pass to `test_pybind.so` an array instead of an integer. +The code: ~~~python %import test_pybind %sum_range=test_pybind.sum_range @@ -231,7 +230,7 @@ gives array([ 0, 0, 1, 3, 6, 10, 15, 21, 28, 36]) ~~~ -It does not crash! Instead, it returns an array which you can check to be correct by subtracting the previous sum from each sum (except the first): +It does not crash! You can check that the array is upon subtracting the previous sum from each sum (except the first): ~~~python %out=sum_range(ys) @@ -244,10 +243,10 @@ which gives array([0, 1, 2, 3, 4, 5, 6, 7, 8]) ~~~ -the elements of `ys` - except the last - as you would expect. +that is, the elements of `ys` except the last, as expected. # Call the C library from multiple threads simultaneously. -We can quickly show you how the C library compiled using pybind11 can be run multithreaded. try the following from an iPython shell: +We can show that a C library compiled using `pybind11` can be run as multithreaded. Try the following from an iPython shell: ~~~python %high=int(1e9) @@ -258,8 +257,8 @@ We can quickly show you how the C library compiled using pybind11 can be run mul 274 ms ± 1.03 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each) ~~~ -Now try a straightforward parallellisation of 20 calls of `sum_range`, over two threads, so 10 calls per thread. -This should take about ```10 * 274ms = 2.74s``` if parallellisation were running without overhead. Let's try: +Now try a straightforward parallelization of 20 calls of `sum_range` over two threads, hence at 10 calls per thread. +This should take about ```10 * 274 ms = 2.74 s``` for a parallelization free of overheads. Running: ~~~python import threading as T @@ -278,14 +277,14 @@ def timer(): timer() ~~~ -This gives +gives ~~~ Time elapsed = 5.59s ~~~ -i.e. more than twice the time we would expect. What actually happened is that `sum_range` was run sequentially instead of parallelly. -We need to add a single declaration to test.c: `py::call_guard()`: +i.e., more than twice the time we expected. In fact, the `sum_range` was run sequentially instead of parallelly. +We then need to add a single declaration to `test.c`: `py::call_guard()`: ~~~c PYBIND11_MODULE(test_pybind, m) { @@ -295,7 +294,7 @@ PYBIND11_MODULE(test_pybind, m) { } ~~~ -like this: +as follows: ~~~c PYBIND11_MODULE(test_pybind, m) { @@ -311,7 +310,9 @@ Now compile again: c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` test.c -o test_pybind.so ~~~ -Reimport the rebuilt shared object - this can only be done by quitting and relaunching the iPython interpreter - and time again. +Reimport the rebuilt shared object (only possible after quitting and relaunching the iPython interpreter) and time again. + +This code: ~~~python import test_pybind @@ -335,7 +336,7 @@ def timer(): timer() ~~~ -This gives: +gives: ~~~output Time elapsed = 2.81s @@ -344,9 +345,9 @@ Time elapsed = 2.81s as you would expect for two `sum_range` modules running in parallel. :::keypoints -- Multiple options are available in calling external C and C++ libraries and that the best choice can depend on the complexity of your problem. +- Multiple options are available to call external C and C++ libraries, and the best choice depends on the complexity of your problem. - Obviously, there is an extra compile and link step, but you will get a much faster execution compared to pure Python. -- Also, the GIL will be circumvented in calling these libaries. -- Numba might also offer you the speedup you want with even less effort. +- Also, the GIL will be circumvented in calling these libraries. +- Numba might also offer you the speed-up you want with even less effort. ::: From c8b1b7a763219fdcf48cb447db967ac703e1d0d9 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 16:51:16 +0200 Subject: [PATCH 12/32] introduction.md: readability pass 2 --- episodes/introduction.md | 45 ++++++++++++++++++++-------------------- 1 file changed, 22 insertions(+), 23 deletions(-) diff --git a/episodes/introduction.md b/episodes/introduction.md index acfa34f..7131fbb 100644 --- a/episodes/introduction.md +++ b/episodes/introduction.md @@ -9,15 +9,15 @@ exercises: 5 - What problems are we solving, and what are we **not** discussing? - Why do we use Python? - What is parallel programming? -- Why can writing a parallel program be hard? +- Why can writing a parallel program be challenging? :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives -- Recognize serial and parallel patterns -- Identify problems that can be parallelized -- Understand a dependency diagram +- Recognize serial and parallel patterns. +- Identify problems that can be parallelized. 
+- Understand a dependency diagram. :::::::::::::::::::::::::::::::::::::::::::::::: @@ -25,11 +25,11 @@ exercises: 5 :::callout ## What problems are we solving? -Ask around what problems participants encountered: "Why did you sign up?". -Specifically: which task in your field of expertise you want to parallelize? +Ask around what problems participants encountered: "Why did you sign up?" +Specifically: "Which task in your field of expertise do you want to parallelize?" ::: -Most problems will fit in one of two categories: +Most problems will fit in either category: - I wrote this code in Python and it is not fast enough. - I run this code on my laptop, but the target size of the problem is bigger than its RAM. @@ -40,9 +40,9 @@ We introduce the following modules: 3. `dask` makes scalable parallel computing easy. 4. `numba` speeds up your Python functions by translating them to optimized machine code. 5. `memory_profile` monitors memory performance. -6. `asyncio` Python's native asynchronous programming. +6. `asyncio` is Python's native asynchronous programming. -FIXME: Actually explain functional programming & distributed programming. +FIXME: Actually explain functional programming and distributed programming. More importantly, we show how to change the design of a program to fit parallel paradigms. This often involves techniques from **functional programming**. @@ -52,7 +52,7 @@ In this course we will not talk about **distributed programming**. This is a huge can of worms. It is easy to show simple examples, but solutions for particular problems will be wildly different. Dask has a lot of functionalities to help you set up runs on a network. -The important bit is that, once you have made your code suitable for parallel computing, you will have the right mind-set to get it to work in a distributed environment. +The important bit is that, once you have made your code suitable for parallel computing, you will have the right mindset to get it to work in a distributed environment. ::: # Overview and rationale @@ -62,30 +62,29 @@ This is an advanced course. Why is it advanced? We (hopefully) saw in the discussion that, although many of your problems share similar characteristics, the details will determine the solution. We all need our algorithms, models, analysis to run so that many hands make light work. -When such a situation arises with a group of people, we start with a meeting discussing who does what, when do we meet again to sync up, and so on. +When such a situation arises in a group of people, we start with a meeting discussing who does what, when do we meet again and sync up, and so on. After a while you can get the feeling that all you do is to be in meetings. We will see that several abstractions can make our life easier. This course illustrates these abstractions making ample use of Dask. - Vectorized instructions: tell many workers to do the same work on a different piece of data. This is where `dask.array` and `dask.dataframe` come in. - We will illustrate this model of working by computing the number $\pi$ later on. + We illustrate this model of working by computing the number $\pi$ later on. - Map/filter/reduce: this methodology combines different functionals to create a larger program. We implement this formalism when using `dask.bag` to count the number of unique words in a novel. -- Task-based parallelization: this may be the most generic abstraction, as all the others can be expressed - in terms of tasks or workflows. This is `dask.delayed`. 
+- Task-based parallelization: this may be the most generic abstraction, as all the others can be expressed in terms of tasks or workflows. This is `dask.delayed`.
 
 # Why Python?
 Python is one of the most widely used languages for scientific data analysis, visualization, and even modelling and simulation.
-The popularity of Python is mainly due to the two pillars of a friendly syntax and the availability of many high-quality libraries.
+The popularity of Python is mainly due to two pillars: a friendly syntax and the availability of many high-quality libraries.
 
 :::callout
 ## It's not all good news
-The flexibility of Python comes with a few downsides though:
+The flexibility of Python comes with a few downsides, though:
 - Python code typically does not perform as fast as lower-level implementations in C/C++ or Fortran.
 - Parallelizing Python code to work efficiently on many-core architectures is not trivial.
 
-This workshop addresses both issues, with an emphasis on efficiently running parallel Python code on multiple cores.
+This workshop addresses both issues, with an emphasis on running parallel Python code efficiently on multiple cores.
 :::
 
 # What is parallel computing?
@@ -103,7 +102,7 @@ Note that the output of one function can become the input of another one.
 The diagram above is the typical diagram of a **serial computation**.
 If you ever used a loop to update a value, you used serial computation.
 
-If our computation involves **independent work** (that is, the results of each function are independent of the results of the application of the rest), we can structure our computation as follows:
+If our computation involves **independent work** (that is, the results of each function are independent of the results of applying the rest), we can structure our computation as follows:
 
 ![Parallel computation](fig/parallel.png){alt="boxes and arrows with two parallel pipe lines"}
 
@@ -116,13 +115,13 @@ In the diagram above, we can assign each of the three functions to one core, so
 
 :::callout
 ## Do eight processors work eight times as fast as one?
-It may be tempting to think that using eight cores instead of one would increase the execution speed eigthfold.
-For now, it is ok to use this as a first approximation to reality.
+It may be tempting to think that using eight cores instead of one would increase the execution speed eightfold.
+For now, it is OK to use this as a first approximation to reality.
 Later in the course we see that things are actually more complicated.
 :::
 
 ## Parallelizable and non-parallelizable tasks
-Some tasks are easily parallelizable while others inherently are not.
+Some tasks are easily parallelizable, while others are inherently not.
 However, it might not always be immediately apparent that a task is parallelizable.
 
 Let us consider the following piece of code:
@@ -213,7 +212,7 @@ For instance, in our first example of a non-parallelizable task, we mentioned th
 Conveniently, a [closed form expression to compute the n-th Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number#Closed-form_expression) exists.
 
 Last but not least, do not let the name discourage you: if your algorithm happens to be embarrassingly parallel, that's good news!
-The "embarrassingly" evokes the feeling of "this is great!, how did I not notice before?!"
+The adverb "embarrassingly" evokes the feeling of "this is great! How did I not notice before?!"
 :::
 
 :::challenge
@@ -226,7 +225,7 @@ We have the following recipe:
 4. 
(20 min) Remove the bay leaf, add the vegetables, and simmer for 20 more minutes. Stir the soup occasionally. 6. (1 day) Leave the soup for one day. - Reheat before serving and add a sliced smoked sausage (vegetarian options are also welcome). + Re-heat before serving and add a sliced smoked sausage (vegetarian options are also welcome). Season with pepper and salt. Imagine you are cooking alone. From d6c585da926c2eb66133cf0c953fdc7c471a6e2f Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 16:57:13 +0200 Subject: [PATCH 13/32] benchmarking.md: readability pass 2 --- episodes/benchmarking.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/episodes/benchmarking.md b/episodes/benchmarking.md index 50f6dd6..f927ea6 100644 --- a/episodes/benchmarking.md +++ b/episodes/benchmarking.md @@ -10,18 +10,18 @@ exercises: 20 ::: :::objectives -- View performance on system monitor -- Find out how many cores your machine has -- Use `%time` and `%timeit` line-magic -- Use a memory profiler -- Plot performance against number of work units -- Understand the influence of hyper-threading on timings +- View performance on system monitor. +- Find out how many cores your machine has. +- Use `%time` and `%timeit` line-magic. +- Use a memory profiler. +- Plot performance against number of work units. +- Understand the influence of hyper-threading on timings. ::: # A first example with Dask We will create parallel programs in Python later. First let's see a small example. Open -your system monitor (this will vary between specific operating systems), and run the following code examples. +your System Monitor (the application will vary between specific operating systems), and run the following code examples: ```python # Summation making use of numpy: @@ -39,7 +39,7 @@ result = work.compute() :::callout ## Try a heavy enough task -Your radar may not detect so small a task. In your computer, you may have to gradually raise the problem size to ``10**8`` or ``10**9`` to observe the effect in a long enough run. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl. +Your radar may not detect so small a task. In your computer you may have to gradually raise the problem size to ``10**8`` or ``10**9`` to observe the effect in long enough a run. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl. ::: ![System monitor](fig/system-monitor.jpg){alt="screenshot of system monitor"} @@ -54,12 +54,12 @@ np.arange(10**7).sum() The `%%time` line magic checks how long it took for a computation to finish. It does not affect how the computation is performed. In this regard it is very similar to the `time` shell command. -If run the chunk several times, we will notice variability in the reported times. +If we run the chunk several times, we will notice variability in the reported times. How can we trust this timer, then? A possible solution will be to time the chunk several times, and take the average time as our valid measure. -The `%%timeit` line magic does exactly this in a concise and conveninet manner! +The `%%timeit` line magic does exactly this in a concise and convenient manner! `%%timeit` first measures how long it takes to run a command once, then -repeats it enough times to get an average run-time. Also, `%%timeit` can measure run times discountinh the overhead of setting up a problem and measuring only the performance of the code in the cell. 
+repeats it enough times to get an average run-time. Also, `%%timeit` can measure run times discounting the overhead of setting up a problem and measuring only the performance of the code in the cell. So this outcome is more trustworthy. ```python @@ -140,7 +140,7 @@ Using more cores for a computation can decrease the run time. The first question :::callout ## Find out the number of cores in your machine -The number of cores can be found from Python executing: +The number of cores can be found from Python upon executing: ```python import psutil @@ -155,7 +155,7 @@ which enables each physical CPU core to execute several threads at the same time performance may scale unexpectedly. There are many reasons for this, hyper-threading being one of them. See the ensuing example. -On a machine with 4 physical and 8 logical cores, this admittedly oversimplistic benchmark: +On a machine with 4 physical and 8 logical cores, this admittedly over-simplistic benchmark: ```python x = [] @@ -176,13 +176,13 @@ data.set_index("n").plot() :::discussion ## Discussion -Why is the runtime increasing if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than physical cores you have. +Why does the runtime increase if we add more than 4 cores? This has to do with **hyper-threading**. On most architectures it does not make much sense to use more workers than the physical cores you have. ::: :::keypoints - Understanding performance is often non-trivial. - Memory is just as important as speed. -- To measure is to know. +- To measure is to know. ::: From ca9bce9ffa30dcdc5c7c46178e34953dfc87725e Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:02:25 +0200 Subject: [PATCH 14/32] computing.md: readability pass 2 --- episodes/computing-pi.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/episodes/computing-pi.md b/episodes/computing-pi.md index b30fa4c..e6b2642 100644 --- a/episodes/computing-pi.md +++ b/episodes/computing-pi.md @@ -18,14 +18,14 @@ exercises: 30 # Parallelizing a Python application In order to recognize the advantages of parallelism we need an algorithm that is easy to parallelize, complex enough to take a few seconds of CPU time, understandable, and also interesting not to scare away the interested learner. -Estimating the value of number π is a classical problem to demonstrate parallel programming. +Estimating the value of number $\pi$ is a classical problem to demonstrate parallel programming. The algorithm we present is a classical demonstration of the power of Monte Carlo methods. This is a category of algorithms using random numbers to approximate exact results. This approach is simple and has a straightforward geometrical interpretation. -We can compute the value of π using a random number generator. We count the points falling inside the blue circle M compared to the green square N. -The ratio 4M/N then approximates π. +We can compute the value of $\pi$ using a random number generator. We count the points falling inside the blue circle M compared to the green square N. +The ratio 4M/N then approximates $\pi$. ![Computing Pi](fig/calc_pi_3_wide.svg){alt="the area of a unit sphere contains a multiple of pi"} @@ -109,15 +109,15 @@ This implementation is much faster than the 'naive' implementation above: :::discussion ## Discussion: is this all better? What is the downside of the vectorized implementation? 
-- It uses more memory
-- It is less intuitive
-- It is a more monolithic approach, i.e. you cannot break it up in several parts
+- It uses more memory.
+- It is less intuitive.
+- It is a more monolithic approach, i.e., you cannot break it up into several parts.
 :::
 
 :::challenge
 ## Challenge: Daskify
 Write `calc_pi_dask` to make the Numpy version parallel. Compare its speed and memory performance with
-the Numpy version. NB: Remember that API of `dask.array` mimics that of the Numpy.
+the Numpy version. NB: Remember that the API of `dask.array` mimics that of Numpy.
 
 ::::solution
 ## Solution
@@ -166,7 +166,7 @@ Let's time three versions of the same test. First, native Python iterators:
 190 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
 ```
 
-Then, with Numpy:
+Second, with Numpy:
 
 ```python
 %timeit np.arange(10**7).sum()
@@ -176,7 +176,7 @@ Then, with Numpy:
 17.5 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 ```
 
-Finally, with Numba:
+Third, with Numba:
 
 ```python
 %timeit sum_range_numba(10**7)
@@ -186,7 +186,7 @@ Finally, with Numba:
 162 ns ± 0.885 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
 ```
 
-Numba is hundredfold faster in this case! It gets this speedup with "just-in-time" compilation (JIT) —, that is, compiling the Python
+Numba is a hundred times faster in this case! It gets this speedup with "just-in-time" compilation (JIT) — that is, compiling the Python
 function into machine code just before it is called, as the `@numba.jit` decorator indicates. Numba does not support every Python and Numpy feature, but functions
 written with a for-loop with a large number of iterations, like in our `sum_range_numba()`, are good candidates.
@@ -199,7 +199,7 @@ The first time you call a function decorated with `@numba.jit`, you may see no o
 
 Why does this happen?
 On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the
-function previously compiled. The compiled function can *only* be reused if the types of its arguments (int, float, and the like) are the same as the point of compilation.
+function previously compiled. The compiled function can *only* be reused if the types of its arguments (int, float, and the like) are the same as at the point of compilation.
 See this example, where `sum_range_numba` is timed once again with a float argument instead of an int:
 
 ```python

From 597ae7f430cb1bb0b978348814038dca440719d8 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @c2"
Date: Sat, 1 Jul 2023 17:09:23 +0200
Subject: [PATCH 15/32] threads-and-processes.md: readability pass 2

---
 episodes/threads-and-processes.md | 36 +++++++++++++++----------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/episodes/threads-and-processes.md b/episodes/threads-and-processes.md
index be1465c..ce3ae70 100644
--- a/episodes/threads-and-processes.md
+++ b/episodes/threads-and-processes.md
@@ -40,9 +40,9 @@ t2.join()
 
 :::discussion
 ## Discussion: where's the speed-up?
-While mileage may vary, parallelizing `calc_pi`, `calc_pi_numpy` and `calc_pi_numba` this way will
+While mileage may vary, parallelizing `calc_pi`, `calc_pi_numpy` and `calc_pi_numba` in this way will
 not give the theoretical speed-up. `calc_pi_numba` should give *some* speed-up, but nowhere near the
-ideal scaling for the number of cores. This is because, at any given time, Python only allows one thread to access the
+ideal scaling for the number of cores. 
This is because, at any given time, Python only allows a single thread to access the interperter, a feature also known as the Global Interpreter Lock. ::: @@ -64,7 +64,7 @@ The alternative is to bring parts of our code outside Python. Numpy has many routines that are largely situated outside of the GIL. Trying out and profiling your application is the only way to know for sure. -To write your own routines not subjected to the GIL there are several options: fortunately, `numba` makes this very easy. +There are several options to make your own routines not subjected to the GIL: fortunately, `numba` makes this very easy. We can force off the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator. @@ -120,7 +120,7 @@ t2.join() # Multiprocessing Python also enable parallelisation with multiple processes -via the `multiprocessing` module. It implements an API that is +via the `multiprocessing` module. It implements an API that is seemingly similar to threading: ```python @@ -141,11 +141,11 @@ if __name__ == '__main__': p2.join() ``` -However, under the hood, processes are very different from threads. A -new process is created by creating a fresh "copy" of the Python +However, under the hood, processes are very different from threads. A +new process is created by generating a fresh "copy" of the Python interpreter that includes all the resources associated to the parent. There are three different ways of doing this (*spawn*, *fork*, and -*forkserver*), whose availability depends on the platform. We will use *spawn* as +*forkserver*), whose availability depends on the platform. We will use *spawn* as it is available on all platforms. You can read more about the others in the [Python documentation](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods). @@ -156,7 +156,7 @@ the overhead of creating a new process. :::callout ## Protect process creation with an `if`-block -A module should be safely importable. Any code that creates +A module should be safely importable. Any code that creates processes, pools, or managers should be protected with: ```python if __name__ == "__main__": @@ -165,9 +165,9 @@ if __name__ == "__main__": ::: The non-intrusive and safe way of starting a new process is to acquire a -`context` and work within that context. This ensures that your +`context` and work within that context. This ensures that your application does not interfere with any other processes that might be -in use. +in use: ```python import multiprocessing as mp @@ -183,9 +183,9 @@ if __name__ == '__main__': ## Passing objects and sharing state We can pass objects between processes by using `Queue`s and `Pipe`s. Multiprocessing queues behave similarly to regular queues: -- FIFO: first in, first out -- `queue_instance.put()` to add -- `queue_instance.get()` to retrieve +- FIFO: first in, first out. +- `queue_instance.put()` to add. +- `queue_instance.get()` to retrieve. :::challenge ## Exercise: reimplement `calc_pi` to use a queue to return the result @@ -250,19 +250,19 @@ which can then be accessed from separate processes *by name* ## Process pool The `Pool` API provides a pool of worker processes that can execute -tasks. Methods of the `Pool` object offer various convenient ways to -implement data parallelism in your program. The most convenient way +tasks. Methods of the `Pool` object offer various convenient ways to +implement data parallelism in your program. 
The most convenient way to create a pool object is with a context manager, either using the toplevel function `multiprocessing.Pool`, or by calling the `.Pool()` -method on the context. With the `Pool` object, tasks can be submitted +method on the context. With the `Pool` object, tasks can be submitted by calling methods like `.apply()`, `.map()`, `.starmap()`, or their `.*_async()` versions. :::challenge ## Exercise: adapt the original exercise to submit tasks to a pool -- Use the original `calc_pi` function (without the queue) +- Use the original `calc_pi` function (without the queue). - Submit batches of different sample size (different values of `N`). -- As mentioned earlier, creating a new process entails overheads. Try a +- As mentioned earlier, creating a new process entails overheads. Try a wide range of sample sizes and check if the runtime scales in keeping with that claim. ::::solution From f66ca8f2d69338d852d563d3a7618213bc54820f Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:17:16 +0200 Subject: [PATCH 16/32] delayed-evaluation.md: readability pass 2 --- episodes/delayed-evaluation.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/episodes/delayed-evaluation.md b/episodes/delayed-evaluation.md index 789b671..4ad6889 100644 --- a/episodes/delayed-evaluation.md +++ b/episodes/delayed-evaluation.md @@ -28,21 +28,21 @@ See an overview below: | `dask.futures` | `concurrent.futures` | Control execution, low-level | ❌ | # Dask Delayed -A lot of the functionalities in Dask is based on an important concept known as *delayed evaluation*. We will then go a bit deeper into `dask.delayed`. +A lot of the functionalities in Dask is based on an important concept known as *delayed evaluation*. Hence we go a bit deeper into `dask.delayed`. `dask.delayed` changes the strategy by which our computation is evaluated. Normally, you expect that a computer runs commands when you ask for them, and that you can give the next command when the current job is complete. With delayed evaluation we do not wait before formulating the next command. Instead, we create the dependency graph of our complete computation without actually doing any work. Once we build the full dependency graph, we can see which jobs can be done in parallel and attribute those to different workers. To express a computation in this world, we need to handle future objects *as if they're already there*. These objects may be referred to as either *futures* or *promises*. :::callout -Several Python libraries provide slightly different support for working with futures. The main difference between Python futures and Dask delayed objects is that futures are added to a queue at the point of definition, while delayed objects are silent until you ask to compute. We will refer to such 'live' futures as futures and to 'dead' futures (including the delayed) as **promises**. +Several Python libraries provide slightly different support for working with futures. The main difference between Python futures and Dask-delayed objects is that futures are added to a queue at the point of definition, while delayed objects are silent until you ask to compute. We will refer to such 'live' futures as futures proper, and to 'dead' futures (including the delayed) as **promises**. ::: ~~~python from dask import delayed ~~~ -The `delayed` decorator builds a dependency graph from function calls. 
+The `delayed` decorator builds a dependency graph from function calls: ~~~python @delayed @@ -52,13 +52,13 @@ def add(a, b): return a + b ~~~ -A `delayed` function stores the requested function call inside a **promise**. The function is not actually executed yet, and we are *promised* a value that can be computed later. +A `delayed` function stores the requested function call inside a **promise**. The function is not actually executed yet, and we get a value *promised*, which can be computed later. ~~~python x_p = add(1, 2) ~~~ -We can check that `x_p` is now a `Delayed` value. +We can check that `x_p` is now a `Delayed` value: ~~~python type(x_p) @@ -68,7 +68,7 @@ type(x_p) ~~~ > ## Note on notation -> It is a good idea to suffix with `_p` variables that are promises. That way you +> It is good practice to suffix with `_p` variables that are promises. That way you > keep track of promises versus immediate values. {: .callout} @@ -103,7 +103,7 @@ y_p = add(x_p, 3) z_p = add(x_p, -3) ``` -Visualize and compute `y_p` and `z_p` separately. How often is `x_p` evaluated? +Visualize and compute `y_p` and `z_p` separately. How many times is `x_p` evaluated? Now change the workflow: @@ -114,7 +114,7 @@ z_p = add(x_p, y_p) z_p.visualize(rankdir="LR") ``` -We pass the not-yet-computed promise `x_p` to both `y_p` and `z_p`. If you only compute `z_p`, how often do you expect `x_p` to be evaluated? Run the workflow to check your answer. +We pass the not-yet-computed promise `x_p` to both `y_p` and `z_p`. If you only compute `z_p`, how many times do you expect `x_p` to be evaluated? Run the workflow to check your answer. ::::solution ## Solution @@ -213,7 +213,9 @@ Computing the result ~~~python x_p.compute() ~~~ + gives + ~~~output [out]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] ~~~ @@ -264,6 +266,6 @@ In practice, you may not need to use `@delayed` functions frequently, but they d :::keypoints - We can change the strategy by which a computation is evaluated. - Nothing is computed until we run `compute()`. -- By using delayed evaluation, Dask knows which jobs can be run in parallel. +- With delayed evaluation Dask knows which jobs can be run in parallel. - Call `compute` only once at the end of your program to get the best results. ::: From 05b6fb02f22c03109db07d0fe12e784d6e7e7ab9 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:22:49 +0200 Subject: [PATCH 17/32] thread-and-processes.md: readability pass 2 --- episodes/threads-and-processes.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/episodes/threads-and-processes.md b/episodes/threads-and-processes.md index ce3ae70..6aacb78 100644 --- a/episodes/threads-and-processes.md +++ b/episodes/threads-and-processes.md @@ -43,7 +43,7 @@ t2.join() While mileage may vary, parallelizing `calc_pi`, `calc_pi_numpy` and `calc_pi_numba` in this way will not give the theoretical speed-up. `calc_pi_numba` should give *some* speed-up, but nowhere near the ideal scaling for the number of cores. This is because, at any given time, Python only allows a single thread to access the -interperter, a feature also known as the Global Interpreter Lock. +interpreter, a feature also known as the Global Interpreter Lock. ::: ## A few words about the Global Interpreter Lock @@ -56,7 +56,7 @@ Roughly speaking, there are two classes of solutions to circumvent/lift the GIL: - Run multiple Python instances using `multiprocessing`. - Keep important code outside Python using OS operations, C++ extensions, Cython, Numba. 
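Sharing program state between processes, which the next paragraph discusses, requires serializing objects. A rough sketch of that cost, so the claim is concrete (the array size is arbitrary and `pickle` is one of the serializers the text names):

```python
import pickle
import numpy as np

# Any object sent to another process must first be serialized into bytes
data = np.random.random(10**6)

payload = pickle.dumps(data)      # the byte stream that crosses the process boundary
restored = pickle.loads(payload)  # deserialized again on the receiving side

print(f"payload size: {len(payload) / 1e6:.1f} MB")
assert np.array_equal(data, restored)
```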
-The downside of running multiple Python instances is that we need to share program state between different processes. +The downside of running multiple Python instances is that we need to share the program state between different processes. To this end, you need to serialize objects. Serialization entails converting a Python object into a stream of bytes that can then be sent to the other process or, for example, stored to disk. This is typically done using `pickle`, `json`, or similar, and creates a large overhead. @@ -66,7 +66,7 @@ Trying out and profiling your application is the only way to know for sure. There are several options to make your own routines not subjected to the GIL: fortunately, `numba` makes this very easy. -We can force off the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator. +We can force off the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator: ```python @numba.jit(nopython=True, nogil=True) @@ -85,7 +85,7 @@ while the `nogil` argument disables the GIL during the execution of the function :::callout ## Use `nopython=True` or `@numba.njit` -It is generally a good idea to use `nopython=True` with `@numba.jit` to make sure the entire +It is generally good practice to use `nopython=True` with `@numba.jit` to make sure the entire function is running without referencing Python objects, because that will dramatically slow down most Numba code. The decorator `@numba.njit` even has `nopython=True` by default. ::: From f88c0d06c586791849f9e959e15cba52c2cba5f2 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:26:01 +0200 Subject: [PATCH 18/32] map-and-reduce.md: readability pass 2 --- episodes/map-and-reduce.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/episodes/map-and-reduce.md b/episodes/map-and-reduce.md index 291d7cb..0347402 100644 --- a/episodes/map-and-reduce.md +++ b/episodes/map-and-reduce.md @@ -10,12 +10,12 @@ exercises: 30 ::: :::objectives -- Recognize `map`, `filter` and `reduction` patterns -- Create programs using these building blocks -- Use the `visualize` method to create dependency graphs +- Recognize `map`, `filter` and `reduction` patterns. +- Create programs using these building blocks. +- Use the `visualize` method to create dependency graphs. ::: -In computer science *bags* are unordered collections of data. In Dask, a `bag` is a collection that get chunked internally. Operations on a bag are automatically parallelized over the chunks inside the bag. +In computer science *bags* are unordered collections of data. In Dask, a `bag` is a collection that gets chunked internally. Operations on a bag are automatically parallelized over the chunks inside the bag. Dask bags let you compose functionality using several primitive patterns: the most important of these are `map`, `filter`, `groupby`, `flatten`, and `reduction`. 
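Before each primitive is treated separately below, a minimal sketch of how they compose (the sample words match the bag defined in this episode; the chained style is one idiomatic option, not the only one):

~~~python
import dask.bag as db

# A small bag, chunked into two partitions
bag = db.from_sequence(['mary', 'had', 'a', 'little', 'lamb'], npartitions=2)

# map -> filter -> reduction, applied chunk-wise and combined automatically
result = (bag
          .map(len)                   # length of every word
          .filter(lambda n: n > 1)    # keep words longer than one letter
          .sum())                     # reduce the remaining lengths to one number

print(result.compute())  # 4 + 3 + 6 + 4 = 17
~~~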
@@ -51,7 +51,7 @@ bag = db.from_sequence(['mary', 'had', 'a', 'little', 'lamb']) ### Map -A function that squares its argument is a mapping function that illustrates the concept of `map`: +A function squaring its argument is a mapping function that illustrates the concept of `map`: ~~~python # Create a function for mapping From f068d695372022e8051d8e5650389720f2bc448e Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:33:54 +0200 Subject: [PATCH 19/32] exercise-with-fractals.md: readability pass 2 --- episodes/exercise-with-fractals.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/episodes/exercise-with-fractals.md b/episodes/exercise-with-fractals.md index a00eb54..247bda0 100644 --- a/episodes/exercise-with-fractals.md +++ b/episodes/exercise-with-fractals.md @@ -30,14 +30,14 @@ Complex numbers are a special representation of rotations and scalings in the tw $$z = x + iy,$$ -remember that $i^2 = -1$ and act as if everything is normal! +remember that $i^2 = -1$, and act as if everything is normal! ::: -The Mandelbrot set is the set of complex numbers $$c \in \mathbb{C}$$ for which the iteration, +The Mandelbrot set is the set of complex numbers $$c \in \mathbb{C}$$ for which the iteration $$z_{n+1} = z_n^2 + c,$$ -converges, starting iteration at $z_0 = 0$. We can visualize the Mandelbrot set by plotting the +converges, starting from iteration at $z_0 = 0$. We can visualize the Mandelbrot set by plotting the number of iterations needed for the absolute value $|z_n|$ to exceed 2 (for which it can be shown that the iteration always diverges). @@ -78,7 +78,7 @@ ax.set_ylabel("$\Im(c)$") ``` Things become really loads of fun when we zoom in. We can play around with the `center` and -`extent` values, and necessarily `max_iter`, to control our window. +`extent` values, and necessarily `max_iter`, to control our window: ```python max_iter = 1024 @@ -155,7 +155,7 @@ def plot_fractal(box: BoundingBox, values: np.ndarray, ax=None): ::::solution ## Some solutions -The natural approach with Python is to speed this up with Numba. Then, there are three ways to parallelize: first, let Numba parallelize the function; second, do a manual domain decomposition and use one of the many Python ways to run things multi-threaded; third, create a vectorized function and parallelize using `dask.array`. This last option is almost always slower than `@njit(parallel=True)` and domain decomposition. +The natural approach with Python is to speed this up with Numba. Then, there are three ways to parallelize: first, letting Numba parallelize the function; second, doing a manual domain decomposition and using one of the many Python ways to run multi-threaded things; third, creating a vectorized function and parallelizing it using `dask.array`. This last option is almost always slower than `@njit(parallel=True)` and domain decomposition. ``` {.python file="src/mandelbrot/__init__.py"} @@ -167,7 +167,7 @@ The natural approach with Python is to speed this up with Numba. Then, there are When we port the core Mandelbrot function to Numba, we need to keep some best practices in mind: - Don't pass composite objects other than Numpy arrays. -- Avoid acquiring memory inside a Numba function; create an array in Python, then pass it to the Numba function. +- Avoid acquiring memory inside a Numba function; rather, create an array in Python and then pass it to the Numba function. - Write a Pythonic wrapper around the Numba function for easy use. 
``` {.python file="src/mandelbrot/numba_serial.py"} @@ -260,9 +260,9 @@ def split(self, n): for j in range(n)] ``` -To perform the computation in parallel, lets go ahead and chose the most difficult path: `asyncio`. There are other ways to do this, setting up a number of threads, or use Dask. However, `asyncio` is available to us in Python natively. In the end, the result is very similar to what we would get using `dask.delayed`. +To perform the computation in parallel, let's go ahead and choose the most difficult path: `asyncio`. There are other ways to do this, like setting up a number of threads or using Dask. However, `asyncio` is available in Python natively. In the end, the result is very similar to what we would get using `dask.delayed`. -This may seem as a lot of code, but remember: we only used Numba to compile the core part and then used Asyncio to parallelize. The progress bar is a bit of flutter and the semaphore is only there to throttle the computation to fewer cores. Even then, this solution is by far the most extensive, but also the fastest. +This may seem as a lot of code, but remember: we only use Numba to compile the core part and then Asyncio to parallelize. The progress bar is a bit of flutter and the semaphore is only there to throttle the computation to fewer cores. Even then, this solution is the most extensive by far but also the fastest. ``` {.python file="src/mandelbrot/domain_splitting.py"} from typing import Optional @@ -307,7 +307,7 @@ def compute_mandelbrot(box: BoundingBox, max_iter: int, ::::solution ## Numba vectorize -Another solution is to use Numba's `@guvectorize` decorator. The speed-up (on my machine) is not as dramatic as with the domain decomposition though. +Another solution is to use Numba's `@guvectorize` decorator. The speed-up (on my machine) is not as dramatic as with the domain decomposition, though. ``` {.python #bounding-box-methods} def grid(self): @@ -451,13 +451,13 @@ for j in range(height): result[j, i] = k ``` -If we take the center of the last image, we get the following rendering of the Julia set: +If we take the centre of the last image, we get the following rendering of the Julia set: ![Example of a Julia set](fig/julia-1.png){alt="colorful rendering of a Julia set"} :::challenge ## Generalize -Can you generalize your Mandelbrot code to compute both the Mandelbrot and the Julia sets efficiently, while reusing as much code as possible? +Can you generalize your Mandelbrot code to compute both the Mandelbrot and the Julia sets efficiently, while reusing as much code as possible? ::: :::keypoints From 16992990599db8c39e99c72172b449b7079fa081 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:45:40 +0200 Subject: [PATCH 20/32] extra-asyncio.md: readability pass 2 --- episodes/extra-asyncio.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/episodes/extra-asyncio.md b/episodes/extra-asyncio.md index 6abd3c1..be7ff52 100644 --- a/episodes/extra-asyncio.md +++ b/episodes/extra-asyncio.md @@ -10,7 +10,7 @@ exercises: 10 ::: :::objectives -- Understand the difference between a coroutine and a function. +- Understand the difference between a function and a coroutine. - Know the rudimentary basics of `asyncio`. - Perform parallel computations in `asyncio`. ::: @@ -18,15 +18,15 @@ exercises: 10 # Introduction to Asyncio Asyncio stands for "asynchronous IO" and, as you might have guessed, has little to do with either asynchronous work or doing IO. 
In general, the adjective asynchronous describes objects or events not coordinated in time. In fact, the `asyncio` system is more like a set of gears carefully tuned to run a multitude of tasks *as if* a lot of OS threads were running. In the end, they are all powered by the same crank. The gears in `asyncio` are called **coroutines** and its teeth move other coroutines wherever you find the `await` keyword. -The main application for `asyncio` is hosting back-ends for web services, where a lot of tasks may be waiting for each other while the server remains responsive to new events. In that regard, `asyncio` is a little bit outside the domain of computational science. Nevertheless, you may encounter Asyncio code in the wild, and you *can* do parallelism with Asyncio if you want higher-level abstraction without `dask` or a similar alternative. +The main application for `asyncio` is hosting back-ends for web services, where a lot of tasks may be waiting for each other while the server remains responsive to new events. In that regard, `asyncio` is a little bit outside the domain of computational science. Nevertheless, you may encounter Asyncio code in the wild, and you *can* do parallelism with Asyncio if you want higher-level abstraction without `dask` or similar alternatives. Many modern programming languages do have features very similar to `asyncio`. ## Run-time -The distinctive point of `asyncio` is a formalism for carrying out work that is different from usual function. We need to look deeper into functions to appreciate the distinction. +The distinctive point of `asyncio` is a formalism for carrying out work that is different from usual functions. We need to look deeper into functions to appreciate the distinction. ### Call stacks -A function call is best understood in terms of a stack-based system. When calling a function, you give it its arguments and temporarily forget what you were doing. Or, rather, you push on a stack and forget whatever you were doing. Then, you start working with the given arguments on a clean sheet (called a stack frame) until you obtain the function result. You remember only this result when you return to the stack and recall what you originally needed it for. +A function call is best understood in terms of a stack-based system. When calling a function, you give it its arguments and temporarily forget what you were doing. Or, rather, you push on a stack and forget whatever you were doing. Then, you start working with the given arguments on a clean sheet (called a stack frame) until you obtain the function result. When you return to the stack, you remember only this result and recall what you originally needed it for. In this manner, every function call pushes a frame onto the stack and every return statement has us popping back to the previous frame. 
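A minimal sketch of this push-and-pop behaviour, ahead of the diagram that follows (the function names are invented for the illustration):

```python
def inner():
    # runs in a fresh stack frame, on a clean sheet
    return 42

def outer():
    value = inner()   # pushes a frame for inner() and suspends this one
    return value + 1  # by now inner()'s frame is forgotten; only its result remains

print(outer())  # 43
```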
[![](https://mermaid.ink/img/pako:eNp1kL1Ow0AQhF9luSYgHApEdUUogpCoKRCSm8U3JpbtXXM_NlGUd-dMYgqkdKvb-Wb25mAqdTDWBHwlSIWnhj8996UQcRWbkSNoy10HPz-dpvVmc_ucJK9VLG01dY72mmjowAHEEiZ46mFp2nGkJlB9_X3zOBssWLZYn8wsvSMUFHd_YNY_3N9dinubLScOv8TgMTaawhm9GPFCTmUVqRWdCpqwGkGCMYeFQVsIfaBWj6uZF81f1nm30BCXk3TpxeFfM6YwPXzPjctFHmZJafJ1PUpj8-jYt6Up5Zh1nKK-7qUyNvqEwqTBZZ9z6cbW3AUcfwB5sYta?type=png)](https://mermaid.live/edit#pako:eNp1kL1Ow0AQhF9luSYgHApEdUUogpCoKRCSm8U3JpbtXXM_NlGUd-dMYgqkdKvb-Wb25mAqdTDWBHwlSIWnhj8996UQcRWbkSNoy10HPz-dpvVmc_ucJK9VLG01dY72mmjowAHEEiZ46mFp2nGkJlB9_X3zOBssWLZYn8wsvSMUFHd_YNY_3N9dinubLScOv8TgMTaawhm9GPFCTmUVqRWdCpqwGkGCMYeFQVsIfaBWj6uZF81f1nm30BCXk3TpxeFfM6YwPXzPjctFHmZJafJ1PUpj8-jYt6Up5Zh1nKK-7qUyNvqEwqTBZZ9z6cbW3AUcfwB5sYta) @@ -47,7 +47,7 @@ Crucially, when we pop back, we also forget the stack frame inside the function. ### Coroutines :::instructor -This section goes rather in depth on coroutines. This is meant to nurture the correct mental model about what's going on with `asyncio`. +This section goes rather in depth into coroutines. This is meant to nurture the correct mental model about what goes on with `asyncio`. ::: Working with coroutines changes things a bit. The coroutine keeps on existing and its context is not forgotten when a coroutine returns a result. Python has several forms of coroutines and the simplest is a **generator**. For example, the following generator produces all integers (if you wait long enough): @@ -99,7 +99,7 @@ sequenceDiagram :::challenge ## Challenge: generate all even numbers -Can you write a generator for all even numbers? Reuse `integers()`. Extra: Can you generate the Fibonacci series? +Can you write a generator for all even numbers? Reuse `integers()`. Extra: Can you generate the Fibonacci sequence? ::::solution ```python @@ -128,7 +128,7 @@ def fib(): :::: ::: -The generator gives away control, passing a value back and expecting to receive control one more time, if faith has it. All meanings of the keyword `yield` apply here: the coroutine yields and produces a yield, as if we were harvesting a crop. +The generator gives away control, passing back a value and expecting to receive control one more time, if faith has it. All meanings of the keyword `yield` apply here: the coroutine yields control and produces a yield, as if we were harvesting a crop. Conceptually, a generator entails one-way traffic only: we get output. However, we can use `yield` also to send information to a coroutine. For example, this coroutine prints whatever you send to it: @@ -164,7 +164,7 @@ def printer(): In practice, the send-form of coroutines is hardly ever used. Cases for needing it are infrequent, and chances are that nobody will understand your code. Asyncio has largely superseded this usage. -The working of `asyncio` is only a small step away from that of coroutines. The intuition is that you can use coroutines to build a collaborative multi-threading environment. Most modern operating systems assign some time to execution threads and take control back pre-emptively to do something else. In **collaborative multi-tasking**, every worker knows to be part of a collaborative environment and yields control to the scheduler voluntarily. Creating such a system with coroutines and `yield` is possible in principle, but is not straightforward especially owing to the propagation of exceptions. +The working of `asyncio` is only a small step farther than that of coroutines. The intuition is to use coroutines to build a collaborative multi-threading environment. 
Most modern operating systems assign some time to execution threads and take back control pre-emptively to do something else. In **collaborative multi-tasking**, every worker knows to be part of a collaborative environment and yields control to the scheduler voluntarily. Creating such a system with coroutines and `yield` is possible in principle, but is not straightforward especially owing to the propagation of exceptions. ## Syntax `asyncio` itself is a library in standard Python and is a core component for actually using the associated async syntax. Two keywords are especially relevant here: `async` and `await`. @@ -229,10 +229,10 @@ if __name__ == "__main__": asyncio.run(main) ``` -Asyncio is as contagious as Dask. Any higher-level code must be async once you have some async code at low level: [it's turtles all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)! You may be tempted to implement `asyncio.run` in the middle of your code and interact with the asynchronous parts. Multiple active Asyncio run-times will get you into troubles, though. Mixing Asyncio and classic code is possible in principle, but is considered bad practice. +Asyncio is as contagious as Dask. Any higher-level code must be async once you have some async low-level code: [it's turtles all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)! You may be tempted to implement `asyncio.run` in the middle of your code and interact with the asynchronous parts. Multiple active Asyncio run-times will get you into troubles, though. Mixing Asyncio and classic code is possible in principle, but is considered bad practice. ## Timing asynchronous code -Jupyter works very well with `asyncio` except for the line magics and cell magics. We must then write our own timer. +Jupyter works very well with `asyncio` except for line magics and cell magics. We must then write our own timer. :::instructor It may be best to have participants copy and paste this snippet from the collaborative document. You may want to explain what a context manager is, but don't overdo it. This is advanced code and may scare off novices. @@ -270,7 +270,7 @@ print(f"that took {t.time} seconds") that took 0.20058414503000677 seconds ``` -Understanding these few snippets of code requires advanced knowledge of Python. Rest assured that both classic coroutines and `asyncio` are a large topic that we cannot cover completely. However, we can time the execution of our code now! +Understanding these few snippets of code requires advanced knowledge of Python. Rest assured that both classic coroutines and `asyncio` are complex topics that we cannot cover completely. However, we can time the execution of our code now! ## Compute $\pi$ again As a reminder, here is our Numba code for computing $\pi$: @@ -323,7 +323,7 @@ async def calc_pi_split(N, M): return sum(lst) / M ``` -and then verify the speed-up we get: +and then verify the speed-up that we get: ``` {.python #async-calc-pi-main} async with timer() as t: @@ -405,7 +405,7 @@ You may run this script using `python -m calc_pi.async_pi`. :::challenge ## Efficiency -Play with different subdivisions for `calc_pi_split` keeping `M*N` constant. How much overhead do you see? +Play with different subdivisions in `calc_pi_split` keeping `M*N` constant. How much overhead do you see? 
::::solution ``` {.python file="src/calc_pi/granularity.py"} @@ -440,7 +440,7 @@ if __name__ == "__main__": ![timings](fig/asyncio-timings.svg){alt="a dip at njobs=10 and overhead ~0.1ms per task"} -The work takes about 0.1 s more when using 1000 tasks. So, assuming that overhead is distributed uniformly among the tasks, we observe that the overhead is around 0.1 ms per task. +The work takes about 0.1 s more when using 1000 tasks. So, assuming that the total overhead is distributed uniformly among the tasks, we observe that the overhead is around 0.1 ms per task. :::: ::: From dc83178e6a3cb7e304d5a098675cb980729c22d4 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @c2" Date: Sat, 1 Jul 2023 17:54:22 +0200 Subject: [PATCH 21/32] extra-external-c.md: readability pass 2 --- episodes/extra-external-c.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/episodes/extra-external-c.md b/episodes/extra-external-c.md index 6f3bf5d..810e9db 100644 --- a/episodes/extra-external-c.md +++ b/episodes/extra-external-c.md @@ -5,7 +5,7 @@ exercises: 30 --- :::questions -- Which options are available to call from Python code C and C++ libraries? +- Which options are available to call from Python C and C++ libraries? - How does this work together with Numpy arrays? - How do I use this in multiple threads while lifting the GIL? ::: @@ -127,14 +127,14 @@ Now we can time our compiled `sum_range` C library, e.g. from the iPython interf 2.69 ms ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ~~~ -If you compare with the Numba timing from [Episode 3](computing-pi.md), you will see that the C library for `sum_range` is faster than -the numpy computation but significantly slower than the `numba.jit`-decorated function. +If you contrast with the Numba timing in [Episode 3](computing-pi.md), you will see that the C library for `sum_range` is faster than +the Numpy computation but significantly slower than the `numba.jit`-decorated function. :::challenge ## C versus Numba Check if the Numba version of this conditional `sum range` function outperforms its C counterpart. -Insprired by [a blog by Christopher Swenson](https://caswenson.com/2009_06_13_bypassing_the_python_gil_with_ctypes.html). +Inspired by [a blog by Christopher Swenson](https://caswenson.com/2009_06_13_bypassing_the_python_gil_with_ctypes.html). ~~~c long long conditional_sum_range(long long to) @@ -230,7 +230,7 @@ gives array([ 0, 0, 1, 3, 6, 10, 15, 21, 28, 36]) ~~~ -It does not crash! You can check that the array is upon subtracting the previous sum from each sum (except the first): +It does not crash! You can check that the array is correct upon subtracting the previous sum from each sum (except the first): ~~~python %out=sum_range(ys) @@ -243,7 +243,7 @@ which gives array([0, 1, 2, 3, 4, 5, 6, 7, 8]) ~~~ -that is, the elements of `ys` except the last, as expected. +which are the elements of `ys` except the last, as expected. # Call the C library from multiple threads simultaneously. We can show that a C library compiled using `pybind11` can be run as multithreaded. Try the following from an iPython shell: @@ -310,7 +310,7 @@ Now compile again: c++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` test.c -o test_pybind.so ~~~ -Reimport the rebuilt shared object (only possible after quitting and relaunching the iPython interpreter) and time again. +Import again the rebuilt shared object (only possible after quitting and relaunching the iPython interpreter), and time again. 
This code: @@ -346,7 +346,7 @@ as you would expect for two `sum_range` modules running in parallel. :::keypoints - Multiple options are available to call external C and C++ libraries, and the best choice depends on the complexity of your problem. -- Obviously, there is an extra compile and link step, but you will get a much faster execution compared to pure Python. +- Obviously, there is an extra compile-and-link step, but the execution will be much faster than pure Python. - Also, the GIL will be circumvented in calling these libraries. - Numba might also offer you the speed-up you want with even less effort. ::: From 7816c235887b61e77e6b04d07e1387b63e3939ef Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Sun, 2 Jul 2023 16:11:03 +0200 Subject: [PATCH 22/32] introduction.md: review code snippets --- episodes/introduction.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/episodes/introduction.md b/episodes/introduction.md index 7131fbb..1761b30 100644 --- a/episodes/introduction.md +++ b/episodes/introduction.md @@ -151,7 +151,7 @@ Although we are performing the loops in a serial way in the snippet above, nothi The following example shows that parts of the computations can be done independently: ```python -x = [1, 2, 4, 4] +x = [1, 2, 3, 4] chunk1 = x[:2] chunk2 = x[2:] @@ -178,7 +178,11 @@ These kinds of algorithms are known as [embarrassingly parallel](https://en.wiki An example of this kind of problem is squaring each element in a list, which can be done as follows: ```python +x = [1, 2, 3, 4] + y = [n**2 for n in x] + +print(y) ``` Each task of squaring a number is independent of all other elements in the list. From dd10bac5c71fb5d58cc936102082d8cb50be83ca Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Sun, 2 Jul 2023 18:28:35 +0200 Subject: [PATCH 23/32] benchmarking.md: revise code snippets See help(memory_usage) for the memory being measured in MiB. --- episodes/benchmarking.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/episodes/benchmarking.md b/episodes/benchmarking.md index f927ea6..85cb16b 100644 --- a/episodes/benchmarking.md +++ b/episodes/benchmarking.md @@ -71,7 +71,7 @@ You can store the output of `%%timeit` in a Python variable using the `-o` flag: ```python time = %timeit -o np.arange(10**7).sum() -print(f"Time taken: {time.average:.4f}s") +print(f"Time taken: {time.average:.4f} s") ``` Note that this metric does not tell you anything about memory consumption or efficiency. @@ -110,8 +110,8 @@ memory_dask = memory_usage(sum_with_dask, interval=0.01) # Plot results plt.plot(memory_numpy, label='numpy') plt.plot(memory_dask, label='dask') -plt.xlabel('Time step') -plt.ylabel('Memory / MB') +plt.xlabel('Interval counter [-]') +plt.ylabel('Memory usage [MiB]') plt.legend() plt.show() ``` From 0b65df2d3687c37e708d8a865498562e754bd0e3 Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Mon, 3 Jul 2023 08:53:03 +0200 Subject: [PATCH 24/32] threads-and-processes.md: revise snippet code Fill gaps. Fix NameError with variable `high`. 
--- episodes/threads-and-processes.md | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/episodes/threads-and-processes.md b/episodes/threads-and-processes.md index 6aacb78..5cbbd70 100644 --- a/episodes/threads-and-processes.md +++ b/episodes/threads-and-processes.md @@ -69,6 +69,8 @@ There are several options to make your own routines not subjected to the GIL: fo We can force off the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator: ```python +import random + @numba.jit(nopython=True, nogil=True) def calc_pi_nogil(N): M = 0 @@ -99,8 +101,10 @@ Many Numpy functions unlock the GIL. Try and sort two randomly generated arrays ::::solution ## Solution ```python -rnd1 = np.random.random(high) -rnd2 = np.random.random(high) +n = 10**7 +rnd1 = np.random.random(n) +rnd2 = np.random.random(n) + %timeit -n 10 -r 10 np.sort(rnd1) ``` @@ -124,12 +128,24 @@ via the `multiprocessing` module. It implements an API that is seemingly similar to threading: ```python +import random from multiprocessing import Process +# function in plain Python def calc_pi(N): - ... + M = 0 + for i in range(N): + # Simulate impact coordinates + x = random.uniform(-1, 1) + y = random.uniform(-1, 1) + + # True if impact happens inside the circle + if x**2 + y**2 < 1.0: + M += 1 + return (4 * M / N, N) # result, iterations if __name__ == '__main__': + n = 10**7 p1 = Process(target=calc_pi, args=(n,)) p2 = Process(target=calc_pi, args=(n,)) @@ -211,11 +227,14 @@ def calc_pi(N, que): if __name__ == "__main__": + ctx = mp.get_context("spawn") que = ctx.Queue() + n = 10**7 p1 = ctx.Process(target=calc_pi, args=(n, que)) p2 = ctx.Process(target=calc_pi, args=(n, que)) + p1.start() p2.start() From 4ebee2b344ceef2c14510e0aa65511eb2442c889 Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Mon, 3 Jul 2023 13:33:16 +0200 Subject: [PATCH 25/32] delayed-evaluation.md: revise code snippets --- episodes/delayed-evaluation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/delayed-evaluation.md b/episodes/delayed-evaluation.md index 4ad6889..a47836b 100644 --- a/episodes/delayed-evaluation.md +++ b/episodes/delayed-evaluation.md @@ -49,7 +49,7 @@ The `delayed` decorator builds a dependency graph from function calls: def add(a, b): result = a + b print(f"{a} + {b} = {result}") - return a + b + return result ~~~ A `delayed` function stores the requested function call inside a **promise**. The function is not actually executed yet, and we get a value *promised*, which can be computed later. @@ -201,7 +201,7 @@ It turns a list of promises into a promise of a list. This small example shows what `gather` does: ~~~python -x_p = gather(*(add(n, n) for n in range(10))) # Shorthand for gather(add(1, 1), add(2, 2), ...) +x_p = gather(*(delayed(add)(n, n) for n in range(10))) # Shorthand for gather(add(1, 1), add(2, 2), ...) x_p.visualize() ~~~ From 37607a2a5fa86904b6790e692eae59fe73e2b30a Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Mon, 3 Jul 2023 15:26:01 +0200 Subject: [PATCH 26/32] map-and-reduce.md: revise snipped code --- episodes/map-and-reduce.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/episodes/map-and-reduce.md b/episodes/map-and-reduce.md index 0347402..38d2174 100644 --- a/episodes/map-and-reduce.md +++ b/episodes/map-and-reduce.md @@ -82,7 +82,7 @@ In this case, we use a function returning `True` if the argument contains the le and `False` if it does not. 
~~~python -# Return True if x is even, False if not +# Return True if x contains the letter 'a', else False def pred(x): return 'a' in x @@ -123,7 +123,7 @@ We previously discussed some generic operations on bags. In the documentation, l Hint: Try `pluck` on some example data. ```python -from dask import bags as db +from dask import bag as db data = [ { "name": "John", "age": 42 }, @@ -168,7 +168,8 @@ def calc_pi(N): # take a sample x = random.uniform(-1, 1) y = random.uniform(-1, 1) - if x*x + y*y < 1.: M+=1 + if x*x + y*y < 1.: + M += 1 return 4 * M / N bag = dask.bag.from_sequence(repeat(10**7, 24)) From 92619e09d48778b3fb7e261dcdd2e314116fa407 Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Mon, 3 Jul 2023 15:34:32 +0200 Subject: [PATCH 27/32] benchmarking.md: remark implemented --- episodes/benchmarking.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/benchmarking.md b/episodes/benchmarking.md index 85cb16b..0f3b21d 100644 --- a/episodes/benchmarking.md +++ b/episodes/benchmarking.md @@ -39,7 +39,7 @@ result = work.compute() :::callout ## Try a heavy enough task -Your radar may not detect so small a task. In your computer you may have to gradually raise the problem size to ``10**8`` or ``10**9`` to observe the effect in long enough a run. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl. +Your system monitor may not detect so small a task. In your computer you may have to gradually raise the problem size to ``10**8`` or ``10**9`` to observe the effect in long enough a run. But be careful and increase slowly! Asking for too much memory can make your computer slow to a crawl. ::: ![System monitor](fig/system-monitor.jpg){alt="screenshot of system monitor"} From cda9ca24c26d220296ec3827cee12fe6b797dc41 Mon Sep 17 00:00:00 2001 From: Giordano Lipari Date: Mon, 3 Jul 2023 15:38:24 +0200 Subject: [PATCH 28/32] computing-pi.md: implement remarks, with rephrasing --- episodes/computing-pi.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/episodes/computing-pi.md b/episodes/computing-pi.md index e6b2642..063c086 100644 --- a/episodes/computing-pi.md +++ b/episodes/computing-pi.md @@ -91,10 +91,7 @@ def calc_pi_numpy(N): return 4 * M / N ``` -This is a **vectorized** version of the original algorithm. It nicely demonstrates **data parallelization**, -where a **single operation** is replicated over collections of data. -It contrasts with **task parallelization**, where **different independent** procedures are performed in -parallel (think, for example, about cutting the vegetables while simmering the split peas). +This is a **vectorized** version of the original algorithm. A problem written in a vectorized form becomes amenable to **data parallelization**, where each single operation is replicated over a large collection of data. Data parallelism contrasts with **task parallelism**, where different independent procedures are performed in parallel. An example of task parallelism is the pea-soup recipe in the introduction. 
 This implementation is much faster than the 'naive' implementation above:

From 6fb3c660d692af9e871004712829467f3be3910d Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Mon, 3 Jul 2023 15:45:43 +0200
Subject: [PATCH 29/32] delayed-evaluation.md: implement remarks

---
 episodes/delayed-evaluation.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/episodes/delayed-evaluation.md b/episodes/delayed-evaluation.md
index a47836b..6fe80d4 100644
--- a/episodes/delayed-evaluation.md
+++ b/episodes/delayed-evaluation.md
@@ -28,9 +28,9 @@ See an overview below:
 | `dask.futures` | `concurrent.futures` | Control execution, low-level | ❌ |
 
 # Dask Delayed
-A lot of the functionalities in Dask is based on an important concept known as *delayed evaluation*. Hence we go a bit deeper into `dask.delayed`.
+Much of the functionality in Dask is based on an important concept known as *delayed evaluation*. Hence we go a bit deeper into `dask.delayed`.
 
-`dask.delayed` changes the strategy by which our computation is evaluated. Normally, you expect that a computer runs commands when you ask for them, and that you can give the next command when the current job is complete. With delayed evaluation we do not wait before formulating the next command. Instead, we create the dependency graph of our complete computation without actually doing any work. Once we build the full dependency graph, we can see which jobs can be done in parallel and attribute those to different workers.
+`dask.delayed` changes the strategy by which our computation is evaluated. Normally, you expect that a computer runs commands when you ask for them, and that you can give the next command when the current job is complete. With delayed evaluation we do not wait before formulating the next command. Instead, we create the dependency graph of our complete computation without actually doing any work. Once we build the full dependency graph, we can see which jobs can be done in parallel and have those scheduled to different workers.
 
 To express a computation in this world, we need to handle future objects *as if they're already there*. These objects may be referred to as either *futures* or *promises*.
@@ -179,7 +179,7 @@ add(*numbers)  # => 10
 ```
 :::
 
-We can build new primitives from the ground up. An important function frequently found where non-standard evaluation strategies are involved is `gather`. We can implement `gather` as follows:
+We can build new primitives from the ground up. An important function found frequently where non-standard evaluation strategies are involved is `gather`. We can implement `gather` as follows:
 
 ~~~python
 @delayed
@@ -229,7 +229,7 @@ Write a `delayed` function that computes the mean of its arguments. Use it to es
 2.5
 ```
 
-Ensure that the entire computation is contained in a single promise.
+Make sure that the entire computation is contained in a single promise.

From ecd90ce8e06f18de33131b7c23dac5ea2c53dc22 Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Mon, 3 Jul 2023 15:48:22 +0200
Subject: [PATCH 30/32] map-and-reduce.md: implement remarks

---
 episodes/map-and-reduce.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/map-and-reduce.md b/episodes/map-and-reduce.md
index 38d2174..b1f7654 100644
--- a/episodes/map-and-reduce.md
+++ b/episodes/map-and-reduce.md
@@ -77,7 +77,7 @@ bag.map(f).visualize()
 
 ### Filter
 
-A function returning a boolean is a useful illustration of the concept of `filter`.
+We need a predicate, that is, a function returning either true or false, to illustrate the concept of `filter`.
 In this case, we use a function returning `True` if the argument contains the letter 'a',
 and `False` if it does not.
 

From fa7eeacc6a72bf10535778a2407ea5a41cb0da8c Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Mon, 3 Jul 2023 16:00:26 +0200
Subject: [PATCH 31/32] threads-and-processes.md: implement remarks

Default behaviour of Numba is left for further consideration.
---
 episodes/threads-and-processes.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/episodes/threads-and-processes.md b/episodes/threads-and-processes.md
index 5cbbd70..fded45a 100644
--- a/episodes/threads-and-processes.md
+++ b/episodes/threads-and-processes.md
@@ -66,7 +66,7 @@ Trying out and profiling your application is the only way to know for sure.
 
 There are several options to make your own routines not subject to the GIL: fortunately, `numba` makes this very easy.
 
-We can force off the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator:
+We can unlock the GIL in Numba code by setting `nogil=True` inside the `numba.jit` decorator:
 
 ```python
 import random
@@ -123,9 +123,9 @@ t2.join()
 ```
 :::
 
 # Multiprocessing
-Python also enable parallelisation with multiple processes
+Python also enables parallelisation with multiple processes
 via the `multiprocessing` module. It implements an API that is
-seemingly similar to threading:
+superficially similar to threading:
 
 ```python
 import random

From bd779a516f23cd39e0ec9298071eb92c51988091 Mon Sep 17 00:00:00 2001
From: Giordano Lipari
Date: Mon, 3 Jul 2023 19:55:47 +0200
Subject: [PATCH 32/32] exercise-with-fractals.md: revise code snippets (part I)

---
 episodes/exercise-with-fractals.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/episodes/exercise-with-fractals.md b/episodes/exercise-with-fractals.md
index 247bda0..4bc5114 100644
--- a/episodes/exercise-with-fractals.md
+++ b/episodes/exercise-with-fractals.md
@@ -49,14 +49,14 @@ We may compute the Mandelbrot as follows:
 max_iter = 256
 width = 256
 height = 256
-center = -0.8+0.0j
-extent = 3.0+3.0j
+center = -0.8 + 0.0j
+extent = 3.0 + 3.0j
 scale = max((extent / width).real, (extent / height).imag)
 
 result = np.zeros((height, width), int)
 for j in range(height):
     for i in range(width):
-        c = center + (i - width // 2 + (j - height // 2)*1j) * scale
+        c = center + (i - width // 2 + 1j * (j - height // 2)) * scale
         z = 0
         for k in range(max_iter):
             z = z**2 + c
@@ -72,7 +72,7 @@ fig, ax = plt.subplots(1, 1, figsize=(10, 10))
 plot_extent = (width + 1j * height) * scale
 z1 = center - plot_extent / 2
 z2 = z1 + plot_extent
-ax.imshow(result**(1/3), origin='lower', extent=(z1.real, z2.real, z1.imag, z2.imag))
+ax.imshow(result**(1 / 3), origin='lower', extent=(z1.real, z2.real, z1.imag, z2.imag))
 ax.set_xlabel("$\Re(c)$")
 ax.set_ylabel("$\Im(c)$")
 ```
@@ -82,8 +82,8 @@ Things become really loads of fun when we zoom in. We can play around with the `
 
 ```python
 max_iter = 1024
-center = -1.1195+0.2718j
-extent = 0.005+0.005j
+center = -1.1195 + 0.2718j
+extent = 0.005 + 0.005j
 ```
 
 When we zoom in on the Mandelbrot fractal, we get smaller copies of the larger set!
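The remaining hunks touch two modules that import a `BoundingBox` helper defined elsewhere in the episode; the helper itself does not appear in this patch. As a reading aid, here is a minimal sketch of what such a class could look like. The field names are inferred from how `box` is used below, so they are an assumption rather than the episode's actual definition:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    width: int        # number of pixels along the real axis (assumed field)
    height: int       # number of pixels along the imaginary axis (assumed field)
    center: complex   # centre of the viewport in the complex plane
    extent: complex   # size of the viewport in the complex plane (assumed field)

    @property
    def scale(self) -> float:
        # Distance between neighbouring pixels, mirroring the scale
        # formula used in the script above.
        return max((self.extent / self.width).real,
                   (self.extent / self.height).imag)
```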
@@ -135,7 +135,7 @@ matplotlib.use(backend="Agg")
 from matplotlib import pyplot as plt
 import numpy as np
 
-from .bounding_box import BoundingBox
+from bounding_box import BoundingBox
 
 def plot_fractal(box: BoundingBox, values: np.ndarray, ax=None):
     if ax is None:
@@ -146,7 +146,7 @@ def plot_fractal(box: BoundingBox, values: np.ndarray, ax=None):
     z1 = box.center - plot_extent / 2
     z2 = z1 + plot_extent
     ax.imshow(values, origin='lower', extent=(z1.real, z2.real, z1.imag, z2.imag),
-              cmap=matplotlib.colormaps["jet"])
+              cmap=matplotlib.colormaps["viridis"])
     ax.set_xlabel("$\Re(c)$")
     ax.set_ylabel("$\Im(c)$")
     return fig, ax
@@ -175,7 +175,7 @@ from typing import Any, Optional
 import numba  # type:ignore
 import numpy as np
 
-from .bounding_box import BoundingBox
+from bounding_box import BoundingBox
 
 
 @numba.njit(nogil=True)
@@ -184,11 +184,11 @@ def compute_mandelbrot_numba(
         scale: complex, max_iter: int):
     for j in range(height):
         for i in range(width):
-            c = center + (i - width // 2 + (j - height // 2) * 1j) * scale
-            z = 0.0+0.0j
+            c = center + (i - width // 2 + 1j * (j - height // 2)) * scale
+            z = 0.0 + 0.0j
             for k in range(max_iter):
                 z = z**2 + c
-                if (z*z.conjugate()).real >= 4.0:
+                if (z * z.conjugate()).real >= 4.0:
                     break
             result[j, i] = k
     return result
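Finally, a hypothetical driver for the revised kernel, assuming the `BoundingBox` sketch above. The kernel's full parameter list is not visible in these hunks, so the argument order below is a guess and may differ from the episode's actual signature:

```python
import numpy as np

# Hypothetical usage: BoundingBox is the sketch above, and
# compute_mandelbrot_numba is the kernel revised in this patch.
# The argument order (result, width, height, center, scale, max_iter)
# is inferred from the visible diff context.
box = BoundingBox(width=256, height=256,
                  center=-0.8 + 0.0j, extent=3.0 + 3.0j)
result = np.zeros((box.height, box.width), np.int64)
compute_mandelbrot_numba(result, box.width, box.height,
                         box.center, box.scale, 256)
```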