diff --git a/07performance/index.md b/07performance/index.md index 2b1a815c4..ece2108a5 100644 --- a/07performance/index.md +++ b/07performance/index.md @@ -8,19 +8,19 @@ This week we'll be introducing some concepts for producing high performance prog Even though parallelism can help us improve our throughput, single core optimisation is still vital for producing good performance on ordinary machines or for maximising the work that each core can do in a parallel program. This week we'll talk about: -1. [_Why_ and _when_ we should optimise](sec00Motivation) -2. [Complexity and algorithm analysis](sec01Complexity) +1. [_Why_ and _when_ we should optimise](sec00Motivation.html) +2. [Complexity and algorithm analysis](sec01Complexity.html) - How the time and space usage of algorithms scales with the size of the input. - Big-O notation. - How to determine complexity and examples with some common algorithms. - How does complexity impact our choices as software designers? -3. [Memory management and caching](sec02Memory) +3. [Memory management and caching](sec02Memory.html) - Memory bound problems. - How memory is structured in a typical machine. - Speed of different kinds of memory access. - Cache structure and operation. - Writing algorithms to effectively exploit the cache. -4. [Compiler Optimisation](sec03Optimisation) +4. [Compiler Optimisation](sec03Optimisation.html) - Automated optimisation by the compiler. - Compiler flags for optimisation. - Examples of optimisations, pros and cons. diff --git a/07performance/sec01Complexity.md b/07performance/sec01Complexity.md index 7d4727821..200f8cfd1 100644 --- a/07performance/sec01Complexity.md +++ b/07performance/sec01Complexity.md @@ -6,131 +6,131 @@ Estimated Reading Time: 45 minutes # Computational Complexity -Computational complexity is a major field in computer science, and we will only very briefly touch on the subject here. It is foundational in understanding computation as an idea, and underpins many areas of modern computing such as the theory of cryptography. For our purposes it will be enough to get a flavour of how we can study algorithms for their complexity properties, but you may wish to consult the reference texts for this week (or many other standard computer science texts) if you are interested in understanding complexity in a deeper and more rigorous way. +Computational complexity is a major field in computer science, and we will only very briefly touch on the subject here. It is foundational in understanding computation as an idea, and underpins many areas of modern computing such as the theory of cryptography. For our purposes it will be enough to get a flavour of how we can study algorithms for their complexity properties, but you may wish to consult the reference texts for this week (or many other standard computer science texts) if you are interested in understanding complexity in a deeper and more rigorous way. -**Complexity tells us how the time and space usage of solutions to a problem changes with respect to the size of its input.** +**Complexity tells us how the time and space usage of solutions to a problem changes with respect to the size of its input.** -Most commonly in numerical computing scenarios we are concerned with the time complexity of an algorithm, although there are occasions when you may have to worry about space complexity as well. 
+Most commonly in numerical computing scenarios we are concerned with the time complexity of an algorithm, although there are occasions when you may have to worry about space complexity as well.

## Intuitive Algorithm Analysis

-When analysing an algorithm for its complexity, we are interested in how the time (or space requirements) of an algorithm scale as the input gets larger. We can define this notion in a formal way (see below!) but you can understand the key aspects of algorithmic performance based on a few intuitive concepts. 
+When analysing an algorithm for its complexity, we are interested in how the time (or space requirements) of an algorithm scale as the input gets larger. We can define this notion in a formal way (see below!) but you can understand the key aspects of algorithmic performance based on a few intuitive concepts.

In terms of the input getting larger, this could mean:

-- The size of a number $n$ for a function $f(n)$. A good example of this would be the time taken to find the prime factors of a number as the number gets larger. 
-- The number of elements in a container. For example, the time to sort a list of $n$ elements, or the time taken to look up a key-value pair in a map / dictionary. 
-- The number of dimensions of an $n$-dimensional space. For example in statistical sampling methods where we sample over many variables, we will be interested in how the algorithm performs as the number of dimensions increases. 
+- The size of a number $n$ for a function $f(n)$. A good example of this would be the time taken to find the prime factors of a number as the number gets larger.
+- The number of elements in a container. For example, the time to sort a list of $n$ elements, or the time taken to look up a key-value pair in a map / dictionary.
+- The number of dimensions of an $n$-dimensional space. For example in statistical sampling methods where we sample over many variables, we will be interested in how the algorithm performs as the number of dimensions increases.
-nThe size of a matrix: this example is a little unusual. Sometimes, particularly for a square ($n \times n$) matrix, this is expressed just by $n$, even though the number of elements in the matrix (and therefore the total size of the input) is actually $n^2$. On algorithms designed to work on non-square $n \times m$ matrices, you may have complexity in terms of _both_ $n$ and $n$.
- - Adding two square matrices of the same size together is usually described as $O(n^2)$ with $n$ referring tp just one dimension of the matrix, whereas adding two lists of the same size is usually described as $O(n)$ with $n$ referring to the total number of elements, even though in both cases there is one operation per element of data. This difference is purely because of the way the input size is labelled in these two cases, so watch out for what people mean by $n$ when they tell you something is $O(g(n))$!
- - The general matrix case for addition would usually be written $O(nm)$.
+- The size of a matrix: this example is a little unusual. Sometimes, particularly for a square ($n \times n$) matrix, this is expressed just by $n$, even though the number of elements in the matrix (and therefore the total size of the input) is actually $n^2$. On algorithms designed to work on non-square $n \times m$ matrices, you may have complexity in terms of _both_ $n$ and $m$.
+ - Adding two square matrices of the same size together is usually described as $O(n^2)$ with $n$ referring to just one dimension of the matrix, whereas adding two lists of the same size is usually described as $O(n)$ with $n$ referring to the total number of elements, even though in both cases there is one operation per element of data. This difference is purely because of the way the input size is labelled in these two cases, so watch out for what people mean by $n$ when they tell you something is $O(g(n))$!
+ - The general matrix case for addition would usually be written $O(nm)$. -The "time" for an algorithm is based on the number of elementary steps that the algorithm has to undertake. +The "time" for an algorithm is based on the number of elementary steps that the algorithm has to undertake. -We generally write the complexity in terms of "Big-O" notation, e.g. $T(n)$ (the time as a function of $n$) is $O(f(n))$. For example if our function printed out the elements of a list, the number of steps we take is proportional to the number of elements in the list (linear scaling), so $T(n)$ is $O(n)$. +We generally write the complexity in terms of "Big-O" notation, e.g. $T(n)$ (the time as a function of $n$) is $O(f(n))$. For example if our function printed out the elements of a list, the number of steps we take is proportional to the number of elements in the list (linear scaling), so $T(n)$ is $O(n)$. When talking about complexity, we only want to capture information about how the time or space scales as $n$ becomes large i.e. asymptotically as $n \rightarrow \infty$. As a result, we only care about dominant terms. Furthermore, we are only interested in the functional form of the scaling, not the absolute amount of time, so any constant factors are ignored. -- Any cubic is $O(n^3)$, any quadratic is $O(n^2)$ etc. regardless of lower order polynomial terms. -- $\log(n)$ is subdominant to $n$. +- Any cubic is $O(n^3)$, any quadratic is $O(n^2)$ etc. regardless of lower order polynomial terms. +- $\log(n)$ is subdominant to $n$. - Constant overheads (additive constants to the time) are ignored as they don't scale with $n$. -- A function is $O(1)$ if it doesn't scale with $n$ at all. +- A function is $O(1)$ if it doesn't scale with $n$ at all. - $O(1) < O(log(n)) < O(n) < O(n log(n)) < O(n^2) < O(n^3) < O(2^n)$ -- Algorithms whose time complexity is $O(p(n))$, where $p(n)$ is a polynomial, or better are called _polynomial time algorithms_. +- Algorithms whose time complexity is $O(p(n))$, where $p(n)$ is a polynomial, or better are called _polynomial time algorithms_. We can also understand algorithms made of smaller parts, for example: -- If an algorithm calculates $f(n)$ which is $O(n^3)$ then $g(n)$ which is $O(n^2)$, then the complexity of the algorithm is $O(n^3)$ since calculating $g(n)$ will become subdominant. -- If we make $n$ calls to a function $f(n)$, and $f(n)$ is $O(g(n))$, then the complexity is $O(n g(n))$. For example, making $n$ calls to a quadric-scaling function would lead to a cubic, i.e. $O(n^3)$, algorithm. - - Nested loops and recursions are key areas of your program to look at to see if complexity is piling up! +- If an algorithm calculates $f(n)$ which is $O(n^3)$ then $g(n)$ which is $O(n^2)$, then the complexity of the algorithm is $O(n^3)$ since calculating $g(n)$ will become subdominant. +- If we make $n$ calls to a function $f(n)$, and $f(n)$ is $O(g(n))$, then the complexity is $O(n g(n))$. For example, making $n$ calls to a quadric-scaling function would lead to a cubic, i.e. $O(n^3)$, algorithm. + - Nested loops and recursions are key areas of your program to look at to see if complexity is piling up! - Recursions or other kinds of branching logic can lead to recurrence relations: the time to calculate a problem can be expressed in terms of the time to calculate a smaller problem. This recurrence relation is directly linked to the complexity: - - If $T(n) \sim 2 \times T(\frac{n}{2})$ then $T(n)$ is linear. 
- - If $T(n) \sim 4 \times T(\frac{n}{2})$ then $T(n)$ is quadric. - - If $T(n) \sim k + T(\frac{n}{2})$ for constant $k$ then $T(n)$ is logarithmic. - - See the [Master Theorem](https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms)) for more on this if you're interested! - - This is related to our intuition about how the time scales with $n$: if we know that a algorithm is $O(n^2)$ for example, we then know that as the input gets sufficiently large, the time taken to calculate the output will quadruple as $n$ doubles. + - If $T(n) \sim 2 \times T(\frac{n}{2})$ then $T(n)$ is linear. + - If $T(n) \sim 4 \times T(\frac{n}{2})$ then $T(n)$ is quadric. + - If $T(n) \sim k + T(\frac{n}{2})$ for constant $k$ then $T(n)$ is logarithmic. + - See the [Master Theorem](https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms)) for more on this if you're interested! + - This is related to our intuition about how the time scales with $n$: if we know that a algorithm is $O(n^2)$ for example, we then know that as the input gets sufficiently large, the time taken to calculate the output will quadruple as $n$ doubles. ## Algorithm Analysis: Sorting Algorithm Examples -Let's take a look at two sorting algorithms and try to understand something about their complexity. +Let's take a look at two sorting algorithms and try to understand something about their complexity. ### **Insertion Sort** -Insertion sort is one of the easiest sorting methods. We'll sort from smallest to largest. We'll show an _out of place_ implementation because it is easier to visualise, but this can be performed _in place_ (i.e. changing the elements of the list as we go instead of putting them in a new list). +Insertion sort is one of the easiest sorting methods. We'll sort from smallest to largest. We'll show an _out of place_ implementation because it is easier to visualise, but this can be performed _in place_ (i.e. changing the elements of the list as we go instead of putting them in a new list). -1. It starts with an unsorted list, and in our case an empty buffer to place the elements into. This buffer will store the sorted list at the end and has the property that *it is correctly sorted at every step*. Elements are placed into the output list one by one. Since it starts empty, the first element can go straight in. +1. It starts with an unsorted list, and in our case an empty buffer to place the elements into. This buffer will store the sorted list at the end and has the property that _it is correctly sorted at every step_. Elements are placed into the output list one by one. Since it starts empty, the first element can go straight in. -![image](img/insert_sort_1.jpg) + ![image](img/insert_sort_1.jpg) -2. When placing the next element we compare with each element in the list in turn (starting with the end of the list) to see if it is smaller or larger. If the element is smaller (like here) then we can insert our element. +2. When placing the next element we compare with each element in the list in turn (starting with the end of the list) to see if it is smaller or larger. If the element is smaller (like here) then we can insert our element. -![image](img/insert_sort_2.jpg) + ![image](img/insert_sort_2.jpg) -3. We then try to insert the next element, again comparing with the last element of the output list. If we find a value that is bigger than the one we want to insert then we need to move along to the next element. +3. 
We then try to insert the next element, again comparing with the last element of the output list. If we find a value that is bigger than the one we want to insert then we need to move along to the next element.

-![image](img/insert_sort_4.jpg)
+ ![image](img/insert_sort_4.jpg)

-4. Once we've found one smaller we know where to insert the new element. All the elements in the output list to the right will need to be shifted along to make room. 
+4. Once we've found one smaller we know where to insert the new element. All the elements in the output list to the right will need to be shifted along to make room.

-![image](img/insert_sort_3.jpg)
+ ![image](img/insert_sort_3.jpg)

-5. In order to insert this in the right place, we have to move the larger element along one. (This would also mean moving along any elements to the right of this one!) 
+5. In order to insert this in the right place, we have to move the larger element along one. (This would also mean moving along any elements to the right of this one!)

-![image](img/insert_sort_5.jpg)
+ ![image](img/insert_sort_5.jpg)

-6. We can then insert our input element in the space and move on to the next element in the unsorted list. Again we compare until we find a smaller value; in this case we get all the way to the start of the list. Now all of the other elements need to be shifted along in order to insert this element. 
+6. We can then insert our input element in the space and move on to the next element in the unsorted list. Again we compare until we find a smaller value; in this case we get all the way to the start of the list. Now all of the other elements need to be shifted along in order to insert this element.

-![image](img/insert_sort_6.jpg)
+ ![image](img/insert_sort_6.jpg)

-7. We continue until the all elements have been inserted into the output list. Since the output list is always sorted, as soon as we have inserted all the elements we are done. 
+7. We continue until all the elements have been inserted into the output list. Since the output list is always sorted, as soon as we have inserted all the elements we are done.

What is the time complexity of this? Well, the performance of this algorithm depends on the pattern of the input! There are always $n$ insertions but:

-- If the input is already sorted, then every step involves just one comparison. (In the case of an out of place algorithm, an insertion/copy; for an in place sort then this is skipped.) This is therefore linear in the size of the list i.e. $O(n)$. 
-- If the output is reverse sorted, then every step involves comparing with every element of the output list and shifting every element as well! The output list grows by one each time so this is proportional to $1 + 2 + 3 + ... = \frac{n^2 + n}{2}$ and therefore $O(n^2)$. 
+- If the input is already sorted, then every step involves just one comparison. (In the case of an out of place algorithm, an insertion/copy; for an in place sort then this is skipped.) This is therefore linear in the size of the list i.e. $O(n)$.
+- If the input is reverse sorted, then every step involves comparing with every element of the output list and shifting every element as well! The output list grows by one each time so this is proportional to $1 + 2 + 3 + \dots + n = \frac{n^2 + n}{2}$ and therefore $O(n^2)$.

-These are the best case and worse case scenarios. In practice, the average case is still $O(n^2)$, and in general we are usually most concerned with our _worst case_ complexity. Nevertheless, if you have _nearly sorted_ data you can have good performance for an algorithm like this, and it is important to understand how your algorithms are impacted by patterns in the data that you are working on. 
+These are the best and worst case scenarios. In practice, the average case is still $O(n^2)$, and in general we are usually most concerned with our _worst case_ complexity. Nevertheless, if you have _nearly sorted_ data you can have good performance for an algorithm like this, and it is important to understand how your algorithms are impacted by patterns in the data that you are working on.
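+To make the walkthrough above concrete, here is a minimal out-of-place sketch. This code is not part of the original notes: the function name `InsertionSort` and the use of `std::vector<double>` are illustrative assumptions. On already sorted input the inner loop never has to walk back, matching the $O(n)$ best case; on reverse sorted input every element is compared against the whole output buffer and everything is shifted, matching the $O(n^2)$ worst case.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Out-of-place insertion sort, mirroring the steps above: elements are
+// inserted one at a time into an output buffer that is kept sorted throughout.
+std::vector<double> InsertionSort(const std::vector<double> &input)
+{
+    std::vector<double> output;
+    output.reserve(input.size());
+    for (double value : input)
+    {
+        // Walk back from the end of the sorted output until we find an
+        // element that is not bigger than the value we want to insert.
+        std::size_t pos = output.size();
+        while (pos > 0 && output[pos - 1] > value)
+        {
+            --pos;
+        }
+        // insert() shifts the larger elements along by one to make room,
+        // exactly as in steps 4 and 5 of the walkthrough.
+        output.insert(output.begin() + pos, value);
+    }
+    return output;
+}
+```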
### **Merge Sort**

Merge sort is a "divide and conquer" algorithm: the idea is that we can easily build a sorted list by merging two shorter sorted lists each containing half of the elements.

-1. Merging two sorted lists is linear in the size of the list. At each step we compare the head of the two lists. 
+1. Merging two sorted lists is linear in the size of the list. At each step we compare the head of the two lists.

-![image](img/merge_sort_1.jpg)
+ ![image](img/merge_sort_1.jpg)

-2. Whichever is smaller goes in the output list and then we make the next compare the next pair and insert the smaller and so on. 
+2. Whichever is smaller goes in the output list, and then we compare the next pair and insert the smaller of those, and so on.

-![image](img/merge_sort_2.jpg)
+ ![image](img/merge_sort_2.jpg)

-3. This is obviously done after $n-1$ comparisons and $n$ insertions so this part is $O(n)$ in the combined size of the lists. 
+3. This is done after at most $n-1$ comparisons and $n$ insertions, so this part is $O(n)$ in the combined size of the lists.

-Naturally a list with only one element is sorted, and so we can turn this into a recursive sorting algorithm using the list with one element as a base case. 
+Naturally a list with only one element is sorted, and so we can turn this into a recursive sorting algorithm using the list with one element as a base case.

1. The list is recursively divided until you have single elements. Then each of these is merged pair-wise to make sorted lists of size 2. (Merging all the lists involves $O(n)$ operations just as discussed above.)

-![image](img/merge_sort_7.jpg)
+ ![image](img/merge_sort_7.jpg)

-2. These lists are merged pairwise at each level of recursion until we have the fully sorted list. 
+2. These lists are merged pairwise at each level of recursion until we have the fully sorted list.

-![!image](img/merge_sort_9.jpg)
+ ![image](img/merge_sort_9.jpg)

-The complexity of this algorithm does not depend on any patterns in the list: a sorted, reverse sorted, or unsorted list still involves the same number of comparisons and merges. 
+The complexity of this algorithm does not depend on any patterns in the list: a sorted, reverse sorted, or unsorted list still involves the same number of comparisons and merges.

-Each round of merging takes $O(n)$ operations, so we need to know how many rounds there are. We split the list in half at each level of recursion, and we can only do this $\sim log_2(n)$ times before we have lists of size 1, therefore the total complexity is $O(n log(n))$. 
+Each round of merging takes $O(n)$ operations, so we need to know how many rounds there are. We split the list in half at each level of recursion, and we can only do this $\sim log_2(n)$ times before we have lists of size 1, therefore the total complexity is $O(n log(n))$.

- $O(n log(n))$ is actually optimal asymptotic behaviour for comparison sorting a list!
-- $O(n log(n))$ is sometimes called "quasi-linear". +- $O(n log(n))$ is sometimes called "quasi-linear". - We can also approach this problem using a recurrence relation: $T(n) \sim O(n) + 2T(\frac{n}{2})$. ### *Comparison* - Merge sort has better asymptotic behaviour than insertion sort in the average and worst case. -- Insertion sort performance can depend strongly on the data. -- Merge sort has the same best, average, and worst case complexity so is very predictable. -- Merge sort has higher overheads due to all those recursive function calls. -- As a result, we know that merge sort will eventually beat insertion sort as long as $n$ becomes large enough, but insertion sort may provide better performance for small lists. - - Another popular algorithm is _quicksort_, which has $O(n log(n))$ behaviour in the average case (and is usually faster than merge sort), but $O(n^2)$ behaviour in the worst case. Selecting the best algorithms is not always obvious! +- Insertion sort performance can depend strongly on the data. +- Merge sort has the same best, average, and worst case complexity so is very predictable. +- Merge sort has higher overheads due to all those recursive function calls. +- As a result, we know that merge sort will eventually beat insertion sort as long as $n$ becomes large enough, but insertion sort may provide better performance for small lists. + - Another popular algorithm is _quicksort_, which has $O(n log(n))$ behaviour in the average case (and is usually faster than merge sort), but $O(n^2)$ behaviour in the worst case. Selecting the best algorithms is not always obvious! ## The Complexity of a Problem: Matrix Multiplication @@ -138,7 +138,7 @@ Each round of merging takes $O(n)$ operations, so we need to know how many round Let's take as an example the problem of matrix multiplication, an extremely common operation in scientific computing. What is the complexity of matrix multiplication? What algorithms are available to us and how do they get used in practice? -### The Naïve Algorithm +### The Naïve Algorithm We can define the elements of a product of two matrices as @@ -149,35 +149,35 @@ $C_{ij} = A_{ik} B_{kj}$. The simplest way to implement this is to iterate over $i$ and $j$, and for each element in the product matrix you perform a dot product between the $i$ row of $A$ and $j$ column of $B$ (which iterates over $k$). - Assuming our matrices are $N \times N$: - - There are $N^2$ elements to do this for in the product matrix. - - Calculating each element requires $N$ multiplications. - - Total number of multiplications is therefore $N^3$. - - We can see this immediately because we have nested iterations over $i$, $j$, and $k$, each of which has $N$ values. - - Nothing else (e.g. memory read/writes) exceeds $N^3$ behaviour either. ($N^2$ writes, up to $N^3$ reads.) + - There are $N^2$ elements to do this for in the product matrix. + - Calculating each element requires $N$ multiplications. + - Total number of multiplications is therefore $N^3$. + - We can see this immediately because we have nested iterations over $i$, $j$, and $k$, each of which has $N$ values. + - Nothing else (e.g. memory read/writes) exceeds $N^3$ behaviour either. ($N^2$ writes, up to $N^3$ reads.) -So from a simple solution we can see that matrix multiplication is $O(n^3)$. It can't be _worse_ than asymptotically $n^3$, but it could be _better_, since we haven't shown that it is bounded from below by $n^3$. +So from a simple solution we can see that matrix multiplication is $O(n^3)$. 
It can't be _worse_ than asymptotically $n^3$, but it could be _better_, since we haven't shown that it is bounded from below by $n^3$. -We _can_ say that matrix multiplication must be bounded from below by $n^2$ (written as $\Omega(n^2)$ when it's a _lower_ bound; this is also defined formally below) since it needs to fill out $n^2$ values in the output matrix! +We _can_ say that matrix multiplication must be bounded from below by $n^2$ (written as $\Omega(n^2)$ when it's a _lower_ bound; this is also defined formally below) since it needs to fill out $n^2$ values in the output matrix! -### (Asymptotically) Better Matrix Multiplication +### (Asymptotically) Better Matrix Multiplication The most common improved matrix multiplication algorithm is the _Strassen algorithm_. We won't go into the details of how the linear algebra works here, but we simply note: - The Strassen algorithm divides each matrix into four (roughly equal) sub-matrices. -- Normally a matrix multiplication could be calculated using these sub-matrices by calculating 8 small matrix products. This wouldn't save time: the naïve method is $O(n^3)$ and each matrix has size $\sim \frac{n}{2}$, so each sub-matrix multiplication is 8 times as fast, but there are also 8 matrix multiplications to do! -- By combining some of these sub-matrices through additions and subtractions (which are $O(n^2)$) we can actually express the output matrix with just 7 small matrix products! (Again, some addition and subtraction required to build the output matrix after the products are calculated.) -- The additions and subtractions are negligible because there are a fixed number and they are $O(n^2)$ and therefore sub-dominant. +- Normally a matrix multiplication could be calculated using these sub-matrices by calculating 8 small matrix products. This wouldn't save time: the naïve method is $O(n^3)$ and each matrix has size $\sim \frac{n}{2}$, so each sub-matrix multiplication is 8 times as fast, but there are also 8 matrix multiplications to do! +- By combining some of these sub-matrices through additions and subtractions (which are $O(n^2)$) we can actually express the output matrix with just 7 small matrix products! (Again, some addition and subtraction required to build the output matrix after the products are calculated.) +- The additions and subtractions are negligible because there are a fixed number and they are $O(n^2)$ and therefore sub-dominant. - Doing this recursively for sub-matrices as well leads to an algorithm which is $\sim O(n^{2.8...})$. - - Extra: To prove this you can write the number of operations as a recurrence relation $T(n) = 7T(\frac{n}{2}) + O(n^2)$ and apply the [Master Theorem](https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms)). Again the recommended texts for this week are a good place to start! + - Extra: To prove this you can write the number of operations as a recurrence relation $T(n) = 7T(\frac{n}{2}) + O(n^2)$ and apply the [Master Theorem](https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms)). Again the recommended texts for this week are a good place to start! This may seem like a small gain, but it becomes increasingly important as matrices get large, and large matrix multiplication can be a bottleneck for many numerical codes! We should bear in mind: -- This algorithm has more overheads than normal matrix multiplication. Asymptotic improvement does not mean it's better for all sizes of problems! 
For small matrices, the additional overheads of Strassen multiplication make it slower than the straightforward method. As a result, most implementations actually transition to regular matrix multiplication when the sub-matrices get to be small enough in size that this becomes more efficient. This kind of behavioural change is very common when optimising for performance. -- The Strassen algorithm is less numerically stable than the simple method, although not so much so as to be an issue in typical applications. This is again a common trade off with code optimised for speed, and something to bear in mind if you know that you have to deal with unstable edge cases. +- This algorithm has more overheads than normal matrix multiplication. Asymptotic improvement does not mean it's better for all sizes of problems! For small matrices, the additional overheads of Strassen multiplication make it slower than the straightforward method. As a result, most implementations actually transition to regular matrix multiplication when the sub-matrices get to be small enough in size that this becomes more efficient. This kind of behavioural change is very common when optimising for performance. +- The Strassen algorithm is less numerically stable than the simple method, although not so much so as to be an issue in typical applications. This is again a common trade off with code optimised for speed, and something to bear in mind if you know that you have to deal with unstable edge cases. -So is this the best known performance for matrix multiplication? Not quite! The best known algorithm for matrix multiplication is $\sim O(n^{2.37...})$, but this is unusable in practice. It is an example of what is called a _galactic algorithm_; algorithms which have better asymptotic behaviour, but whose overheads are so large that they actually perform worse for any input that could physically fit on any plausible machine. It is a good reminder that asymptotic behaviour isn't everything! +So is this the best known performance for matrix multiplication? Not quite! The best known algorithm for matrix multiplication is $\sim O(n^{2.37...})$, but this is unusable in practice. It is an example of what is called a _galactic algorithm_; algorithms which have better asymptotic behaviour, but whose overheads are so large that they actually perform worse for any input that could physically fit on any plausible machine. It is a good reminder that asymptotic behaviour isn't everything! ## Formal Definition: "Big O" Notation @@ -188,47 +188,47 @@ $f(n)$ is $O(g(n))$ if and only if there exists some minimum value $n_0$ and som Let's break down what this means: - **A function $f(n)$ is $O(g(n))$ if, as $n$ tends to infinity, $f(n)$ is bounded from above by $g(n)$ multiplied by an arbitrary constant.** -- Because we only require that the comparison holds for all inputs larger than some arbitrarily large value $n_0$, we are looking at _asymptotic behaviour_. We say that $g(n)$ is an **asymptotic upper bound** on $f(n)$. - - Practically speaking this means we are generally only interested in _leading terms_. For a quadratic $an^2 + bn + c$, the linear and constant terms will always be overwhelmed by the quadratic term in the asymptotic case, and so are irrelevant. All quadratics are $O(n^2)$, regardless of the values of $a$, $b$, and $c$ (as long as $a \ne 0$). 
- - For example: $\mid an^2 + bn + c \mid \le (a+1)n^2$ will always hold for $n$ large enough so that $n^2 > \mid bn + c \mid$, which we know must exist for any coefficients $b$ and $c$. +- Because we only require that the comparison holds for all inputs larger than some arbitrarily large value $n_0$, we are looking at _asymptotic behaviour_. We say that $g(n)$ is an **asymptotic upper bound** on $f(n)$. + - Practically speaking this means we are generally only interested in _leading terms_. For a quadratic $an^2 + bn + c$, the linear and constant terms will always be overwhelmed by the quadratic term in the asymptotic case, and so are irrelevant. All quadratics are $O(n^2)$, regardless of the values of $a$, $b$, and $c$ (as long as $a \ne 0$). + - For example: $\mid an^2 + bn + c \mid \le (a+1)n^2$ will always hold for $n$ large enough so that $n^2 > \mid bn + c \mid$, which we know must exist for any coefficients $b$ and $c$. - We have an arbitrary constant factor $\alpha$, so constant factors in $f(n)$ are irrelevant. $n^2$ and $50n^2$ are both $O(n^2)$. We are only interested in the **way that the output scales**, not the actual size. This will be a very important point to remember when applying algorithms in practice! - For this definition to make sense and be useful, the function $g(n)$ must be asymptotically non-negative. -- This can also be written treating $O(g(n))$ as the set of functions that asymptotically bounded by $g(n)$. - - $f(n) \in O(g(n)) \iff \exists m, \alpha . \forall n > m . \mid f(n) \mid \le \alpha g(n)$. - - Texts can vary quite a bit in notation e.g. $f(n) \in g(n)$ or $f(n) = g(n)$ (a bit of an abuse of notation since it is not an equivalence relation, but fairly common in practice), and in mathematical formality. +- This can also be written treating $O(g(n))$ as the set of functions that asymptotically bounded by $g(n)$. + - $f(n) \in O(g(n)) \iff \exists m, \alpha . \forall n > m . \mid f(n) \mid \le \alpha g(n)$. + - Texts can vary quite a bit in notation e.g. $f(n) \in g(n)$ or $f(n) = g(n)$ (a bit of an abuse of notation since it is not an equivalence relation, but fairly common in practice), and in mathematical formality. -We can also write $f(n)$ is $\Omega(g(n))$ if $g(n)$ is an **asymptotic lower bound** of $f(n)$. This is a reciprocal relationship with $O(n)$ i.e. +We can also write $f(n)$ is $\Omega(g(n))$ if $g(n)$ is an **asymptotic lower bound** of $f(n)$. This is a reciprocal relationship with $O(n)$ i.e. -$f(n)$ is $O(g(n))$ if and only if $g(n)$ is $\Omega(f(n))$. +$f(n)$ is $O(g(n))$ if and only if $g(n)$ is $\Omega(f(n))$. There is also a stronger condition: $f(n)$ is $\Theta(g(n))$ if and only if $f(n)$ is $O(g(n))$ **and** $g(n)$ is $O(f(n))$. -- This means that $g(n)$ is an **asymptotically tight bound** on $f(n)$, because as $n$ tends to infinity $f(n)$ is bounded by a constant multiple of $g(n)$, and $g(n)$ is bounded by a constant multiple of $f(n)$. -- Put another way, there are two constants $\alpha$ and $\beta$ for which, as $n$ tends to infinity, $f(n) \ge \alpha g(n)$ and $f(n) \le \beta g(n)$. As such, $f(n)$ is bounded from above and below by multiples of $g(n)$. - - $f(n)$ is $\Theta(g(n))$ if and only if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$. +- This means that $g(n)$ is an **asymptotically tight bound** on $f(n)$, because as $n$ tends to infinity $f(n)$ is bounded by a constant multiple of $g(n)$, and $g(n)$ is bounded by a constant multiple of $f(n)$. 
+- Put another way, there are two constants $\alpha$ and $\beta$ for which, as $n$ tends to infinity, $f(n) \ge \alpha g(n)$ and $f(n) \le \beta g(n)$. As such, $f(n)$ is bounded from above and below by multiples of $g(n)$. + - $f(n)$ is $\Theta(g(n))$ if and only if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$. This relationship is symmetric: $f(n)$ is $\Theta(g(n))$ if and only if $g(n)$ is $\Theta(f(n))$. -Asymptotically tight bounds are particularly useful for understanding the behaviour as a function scales. -Take our quadratic example from before: $an^2 + bn + c \in O(n^2)$ is clearly true, but looking at the definition of $O(g(n))$ we can also say that $an^2 + bn + c \in O(n^3)$, since $n^3$ is asymptotically larger than any quadratic function and therefore acts as an upper bound. The same would be true of many functions which grow faster than quadratics! Likewise $O(n^2)$ includes anything that grows slower than a quadratic, such as a linear function. To say that our quadratic is $\Theta(n^2)$ is much stricter, as it says that our function grows as fast as $n^2$ and no faster than $n^2$ (up to multiplicative constants). +Asymptotically tight bounds are particularly useful for understanding the behaviour as a function scales. +Take our quadratic example from before: $an^2 + bn + c \in O(n^2)$ is clearly true, but looking at the definition of $O(g(n))$ we can also say that $an^2 + bn + c \in O(n^3)$, since $n^3$ is asymptotically larger than any quadratic function and therefore acts as an upper bound. The same would be true of many functions which grow faster than quadratics! Likewise $O(n^2)$ includes anything that grows slower than a quadratic, such as a linear function. To say that our quadratic is $\Theta(n^2)$ is much stricter, as it says that our function grows as fast as $n^2$ and no faster than $n^2$ (up to multiplicative constants). ## Why use Big-O Notation for Algorithms and Computational Problems? -For algorithmic analysis, the functions that we are interested in are the time and space usage of an algorithm as a function of its input size. +For algorithmic analysis, the functions that we are interested in are the time and space usage of an algorithm as a function of its input size. -- Time usage is usually understood in terms of the number of "steps" that an algorithm needs to reach a result. How exactly that translates into time in the real world depends on how long each kind of operation takes to do (e.g. memory read, comparison, additions etc.), but these are multiplicative factors. - - Note that sometimes things which might appear to be a simple step are more complex. For example, if performing additions with arbitrary precision integers then the time it takes to perform the addition will vary with the size of the number! If using fixed precision then this is not an issue because you know that e.g. a standard `int` is 4 bytes, and so even if the addition is optimised in some way to add smaller numbers quicker they are still bounded by the time it would take to operate on 4 bytes. -- Space is usually a bit easier to understand as we can reason more directly about the amount of information that we have to store. - - When analysing space complexity we do not include the input itself, just any memory that must be allocated for working on. +- Time usage is usually understood in terms of the number of "steps" that an algorithm needs to reach a result. How exactly that translates into time in the real world depends on how long each kind of operation takes to do (e.g. 
memory read, comparison, additions etc.), but these are multiplicative factors. + - Note that sometimes things which might appear to be a simple step are more complex. For example, if performing additions with arbitrary precision integers then the time it takes to perform the addition will vary with the size of the number! If using fixed precision then this is not an issue because you know that e.g. a standard `int` is 4 bytes, and so even if the addition is optimised in some way to add smaller numbers quicker they are still bounded by the time it would take to operate on 4 bytes. +- Space is usually a bit easier to understand as we can reason more directly about the amount of information that we have to store. + - When analysing space complexity we do not include the input itself, just any memory that must be allocated for working on. - What we mean by the "input size" can be ambiguous. Traditionally it can be more rigorously defined in terms of tape size on a Turing machine (which we we won't have time to cover!) or bits on a computer, but in practice people may typically reason with the kinds of intuitive values mentioned before which would correspond with the input size. -Big-O, $\Omega$, and $\Theta$ all capture information about algorithm performance without knowing too much detail about how it is physically performed on a computer: things like the exact amount of time for particular operations, the differences in how memory is divided up for reading and writing, and so on get absorbed into multiplicative factors or additive overheads. Big-O notation captures something more fundamental about the way that problems scale. Even things like modern CPUs doing multiple arithmetic operations in parallel don't affect the computational complexity of an algorithm, since there are still a fixed number of operations than can happen concurrently and therefore this can't contribute more than a constant factor. +Big-O, $\Omega$, and $\Theta$ all capture information about algorithm performance without knowing too much detail about how it is physically performed on a computer: things like the exact amount of time for particular operations, the differences in how memory is divided up for reading and writing, and so on get absorbed into multiplicative factors or additive overheads. Big-O notation captures something more fundamental about the way that problems scale. Even things like modern CPUs doing multiple arithmetic operations in parallel don't affect the computational complexity of an algorithm, since there are still a fixed number of operations than can happen concurrently and therefore this can't contribute more than a constant factor. -Take for example a trivial summation example: +Take for example a trivial summation example: ```cpp= double SumVector(const vector &v) @@ -242,20 +242,19 @@ double SumVector(const vector &v) } ``` -- We're interested here in how we scale with the number of elements in the list, so we'll call this $n$. -- There is one addition operation for each element in the list, so $n$ operations total. Time complexity is $\Theta(n)$ i.e. _linear_. -- Regardless of the size of the list, we only allocate one `double` (`sum`) for this function, so the space complexity is $\Theta(1)$ i.e. _constant_. +- We're interested here in how we scale with the number of elements in the list, so we'll call this $n$. +- There is one addition operation for each element in the list, so $n$ operations total. Time complexity is $\Theta(n)$ i.e. _linear_. 
+- Regardless of the size of the list, we only allocate one `double` (`sum`) for this function, so the space complexity is $\Theta(1)$ i.e. _constant_. For **algorithmic analysis** we can often determine asymptotically tight bounds because we know exactly how an algorithm will behave. ## Summary of Complexity in Practice -- It's good to be aware of the complexity bounds on problems and algorithms that you will be working with, like matrix multiplication, matrix inversion, data lookups in different kinds of structures etc. - - Functions in the standard C++ library will often have known complexity given in the documentation, for example accessing an [ordered map](https://cplusplus.com/reference/map/map/operator[]/) is $O(log(n))$ and accessing an [unordered map](https://cplusplus.com/reference/unordered_map/unordered_map/operator[]/) is $O(1)$ in the average case and $O(n)$ in the worst case. -- Be aware when you are writing code of how your program scales. Think about things like nested loops and recursions and how their contributions combine. -- Algorithms with high complexity can become bottlenecks for your programs if their inputs become large. -- More efficient algorithms can often be a trade-off between time and accuracy or other desirable properties. +- It's good to be aware of the complexity bounds on problems and algorithms that you will be working with, like matrix multiplication, matrix inversion, data lookups in different kinds of structures etc. + - Functions in the standard C++ library will often have known complexity given in the documentation, for example accessing an [ordered map](https://cplusplus.com/reference/map/map/operator[]/) is $O(log(n))$ and accessing an [unordered map](https://cplusplus.com/reference/unordered_map/unordered_map/operator[]/) is $O(1)$ in the average case and $O(n)$ in the worst case. +- Be aware when you are writing code of how your program scales. Think about things like nested loops and recursions and how their contributions combine. +- Algorithms with high complexity can become bottlenecks for your programs if their inputs become large. +- More efficient algorithms can often be a trade-off between time and accuracy or other desirable properties. - Algorithms with good asymptotic performance often perform less well on small problems; don't waste time or sacrifice accuracy to use these kinds of methods on small data where they won't benefit! - Some algorithms have good average case complexity but very bad performance in special cases. Make sure that you're aware of any patterns that you expect in your data and that your chosen algorithm is not impacted by them. - **Well implemented methods often switch between algorithms depending on the size and nature of the data input to get the best performance.** - \ No newline at end of file diff --git a/07performance/sec02Memory.md b/07performance/sec02Memory.md index 1a5b99c20..a717b9a24 100644 --- a/07performance/sec02Memory.md +++ b/07performance/sec02Memory.md @@ -6,16 +6,16 @@ Estimated Reading Time: 30 minutes # Memory -Managing memory efficiently can be an important part of achieving peak performance. In this section we'll talk a bit about the basic model of how data is stored and accessed, and what this means for software development. +Managing memory efficiently can be an important part of achieving peak performance. In this section we'll talk a bit about the basic model of how data is stored and accessed, and what this means for software development. 
## Memory Bound Problems When considering the efficiency of solving a problem on a computer, two classifications can sometimes be useful: -- **Compute bound** problems are those for which the main work or primary bottleneck is the number of compute steps required to complete the algorithm. For example, performing lots of arithmetic operations on a piece of data. -- **Memory bound** problems are those for which our main concern is the time spent accessing (reading or writing) memory. For example, copying or overwriting a large piece of data. +- **Compute bound** problems are those for which the main work or primary bottleneck is the number of compute steps required to complete the algorithm. For example, performing lots of arithmetic operations on a piece of data. +- **Memory bound** problems are those for which our main concern is the time spent accessing (reading or writing) memory. For example, copying or overwriting a large piece of data. -A straight-forward example of a memory bound problem would be a matrix transposition, $M^T_{ij} = M_{ji}$. This problem doesn't require any direct calculations to be done on the elements themselves, just to read the elements from one location and place them in another. +A straight-forward example of a memory bound problem would be a matrix transposition, $M^T_{ij} = M_{ji}$. This problem doesn't require any direct calculations to be done on the elements themselves, just to read the elements from one location and place them in another. To keep things simple, let's look at this "out of place" matrix transpose: @@ -33,63 +33,62 @@ void Transpose(vector> &A, vector> &B) } ``` -- "Out of place" means that the result is placed in a new matrix rather than over-writing the original. +- "Out of place" means that the result is placed in a new matrix rather than over-writing the original. - This algorithm is almost entirely composed of memory read/write operations! -- In order to understand how to optimise a memory bound problem like this, we have to first understand the structure of memory in our machine. +- In order to understand how to optimise a memory bound problem like this, we have to first understand the structure of memory in our machine. ## The Memory Hierarchy Data is stored in a computer in different forms and different places. Generally, the bigger your memory store, the slower it is to access! This trade-off is typical for developing on many architectures from PCs to specialised accelerated hardware. On a typical computer you're likely to have: - **Persistent Memory** - - This is the largest store of data and includes things like your hard disk. Writing to this kind of memory has the highest overheads, but the data remains intact without power and is typically on the scale of tens of GB or more. + - This is the largest store of data and includes things like your hard disk. Writing to this kind of memory has the highest overheads, but the data remains intact without power and is typically on the scale of tens of GB or more. - **System Memory** - - This includes on-chip RAM and ROM. - - ROM is permanent, and read-only, and generally not of much interest to software developers since it handles things like instructions for basic I/O and loading the operating system: things that we don't (and can't) mess with! - - RAM is generally volatile memory, meaning that it requires power to be maintained: if you turn off your computer you will typically lose what is in RAM. Usually on the scale of a few GB. Contains the current program and data in use. 
- - Stack: Stack memory is usually $\lesssim 10$ MB, assigned by the operating system when the program is launched. Stack memory cannot be increased while the program is running. Contains data for currently open scopes i.e. the function currently being executed and any hierarchy of functions which are calling it and therefore have not terminated yet. Very large pieces of data or very deep call trees (e.g. excessively deep recursion) can cause a _stack overflow_, where the stack runs out of memory. Stack memory is generally faster than Heap memory. - - Heap: The Heap is a larger pool of memory, also assigned by the operating system at the program launch, but the heap size allocated to a program can grow dynamically as long as there is enough space left in RAM. Memory access tends to be slower than for the stack, but can be used for larger datasets. - - Cache: Very small pieces of memory designed to be very fast. Cache structure is hardware dependent, but three levels of caching is typical, ranging from kB to MB (with the smallest cache layer being fastest to access). Cache memory stores chunks of data from locations accessed recently in memory. + - This includes on-chip RAM and ROM. + - ROM is permanent, and read-only, and generally not of much interest to software developers since it handles things like instructions for basic I/O and loading the operating system: things that we don't (and can't) mess with! + - RAM is generally volatile memory, meaning that it requires power to be maintained: if you turn off your computer you will typically lose what is in RAM. Usually on the scale of a few GB. Contains the current program and data in use. + - Stack: Stack memory is usually $\lesssim 10$ MB, assigned by the operating system when the program is launched. Stack memory cannot be increased while the program is running. Contains data for currently open scopes i.e. the function currently being executed and any hierarchy of functions which are calling it and therefore have not terminated yet. Very large pieces of data or very deep call trees (e.g. excessively deep recursion) can cause a _stack overflow_, where the stack runs out of memory. Stack memory is generally faster than Heap memory. + - Heap: The Heap is a larger pool of memory, also assigned by the operating system at the program launch, but the heap size allocated to a program can grow dynamically as long as there is enough space left in RAM. Memory access tends to be slower than for the stack, but can be used for larger datasets. + - Cache: Very small pieces of memory designed to be very fast. Cache structure is hardware dependent, but three levels of caching is typical, ranging from kB to MB (with the smallest cache layer being fastest to access). Cache memory stores chunks of data from locations accessed recently in memory. This structure has implications for our understanding of memory bound problems: -- Problems which use very large datasets may be more likely to be memory bound, as larger data stores are less efficient to access. Accesses to large stores should be minimised, and data being worked on should be moved to faster memory. +- Problems which use very large datasets may be more likely to be memory bound, as larger data stores are less efficient to access. Accesses to large stores should be minimised, and data being worked on should be moved to faster memory. - Algorithms which jump around erratically in memory also result in slower memory accesses. 
Architectures are usually optimised for sequential or localised accesses to some extent, for example memory addresses close to those recently accessed are more likely to be in the cache. -- Working on a piece of data as fully as possible before accessing another location will limit the number of memory accesses over all. +- Working on a piece of data as fully as possible before accessing another location will limit the number of memory accesses over all. -## Cache structure +## Cache structure -It's not always possible to limit the number of memory accesses that we make, but we may be able to make choices about our memory access patterns to maximise our usage of our fastest memory. In this case, we'll consider an example where our object is stored entirely in on-chip RAM, but we want to make effective use of the cache. First, we'll need to understand a bit about how the cache works. +It's not always possible to limit the number of memory accesses that we make, but we may be able to make choices about our memory access patterns to maximise our usage of our fastest memory. In this case, we'll consider an example where our object is stored entirely in on-chip RAM, but we want to make effective use of the cache. First, we'll need to understand a bit about how the cache works. -- When data is requested, the cache is checked first to see if it is present. If it is, it can take the data straight from the cache, which is much faster than going into RAM. -- If there are multiple cache levels, it searches from the smallest (and fastest) cache to the largest (and slowest). -- If it's not is any cache (a cache "miss") it will fetch the value from RAM. -- When data is looked up in system memory, that data is stored in the cache. - - If the cache is full or there is data already in the location that the system wants to cache the new data, then some data in the cache is overwritten. (Where the data is cached and therefore which data is overwritten depends on the cache-mapping strategy and will be hardware dependent.) -- Data is added to the cache in blocks of a fixed size (which is hardware dependent). If the variable we wanted to read is smaller than this block size then some neighbouring data will end up in the cache as well. - - If we read an element from an array or vector for example, which store data contiguously, that means that some surrounding elements will also end up in the cache. - - We can then read close by elements from the cache quickly without system memory accesses until you reach the limits of the copied block! +- When data is requested, the cache is checked first to see if it is present. If it is, it can take the data straight from the cache, which is much faster than going into RAM. +- If there are multiple cache levels, it searches from the smallest (and fastest) cache to the largest (and slowest). +- If it's not is any cache (a cache "miss") it will fetch the value from RAM. +- When data is looked up in system memory, that data is stored in the cache. + - If the cache is full or there is data already in the location that the system wants to cache the new data, then some data in the cache is overwritten. (Where the data is cached and therefore which data is overwritten depends on the cache-mapping strategy and will be hardware dependent.) +- Data is added to the cache in blocks of a fixed size (which is hardware dependent). If the variable we wanted to read is smaller than this block size then some neighbouring data will end up in the cache as well. 
+ - If we read an element from an array or vector for example, which store data contiguously, that means that some surrounding elements will also end up in the cache. + - We can then read close by elements from the cache quickly without system memory accesses until you reach the limits of the copied block! -Taking advantage of these blocks of memory in the cache is the key to writing efficient memory bound algorithms: if we use them wisely we can avoid a lot of calls to system memory and replace them with much quicker calls to the cache. +Taking advantage of these blocks of memory in the cache is the key to writing efficient memory bound algorithms: if we use them wisely we can avoid a lot of calls to system memory and replace them with much quicker calls to the cache. -### Using the Cache effectively +### Using the Cache effectively We know now that: -- Reading the same memory address (e.g. accessing the same variable), or reading nearby memory addresses (e.g. elements in a vector) is faster than jumping around in memory. - - This suggests that we should break problems down into sizes that will fit in the cache, and then work on them until we don't need that data any more before moving on (if we can). -- The structure and size of the cache, and the size of the blocks loaded into the cache from memory, are all system dependent. - - This suggests that over-optimising for the cache is a bad idea: if we design code especially for the specifications of the cache on our machine, it will not be very portable to other machines! We should try to make algorithms that will exploit the cache well but are ideally not dependent on the exact size. +- Reading the same memory address (e.g. accessing the same variable), or reading nearby memory addresses (e.g. elements in a vector) is faster than jumping around in memory. + - This suggests that we should break problems down into sizes that will fit in the cache, and then work on them until we don't need that data any more before moving on (if we can). +- The structure and size of the cache, and the size of the blocks loaded into the cache from memory, are all system dependent. + - This suggests that over-optimising for the cache is a bad idea: if we design code especially for the specifications of the cache on our machine, it will not be very portable to other machines! We should try to make algorithms that will exploit the cache well but are ideally not dependent on the exact size. An algorithm which exploits the cache but which does not depend on the exact details of the cache is called a _cache oblivious algorithm_. Some good patterns for cache oblivious algorithms include: - Tiling: breaking the problem into small chunks. -- Recursion can be a good way to make your solution cache oblivious. Recursion expresses the solution in terms of solutions to smaller sub-problems, down to a base case. The cache will start to be used effectively once the size of the sub-problems start to fit inside the cache, which means you don't have to tune the algorithm to the size of the cache to take advantage of it. -- Stencil algorithms are algorithms which calculate a value at a data point based on the values around it in a grid (common in simulations of e.g. flows) and fit naturally into efficient memory structures provided the stencil moves sensibly through memory. +- Recursion can be a good way to make your solution cache oblivious. Recursion expresses the solution in terms of solutions to smaller sub-problems, down to a base case. 
+- Stencil algorithms are algorithms which calculate a value at a data point based on the values around it in a grid (common in simulations of e.g. flows) and fit naturally into efficient memory structures provided the stencil moves sensibly through memory.
- Rearrange data in memory to fit your access patterns. For example, a matrix may be stored with elements in the same row next to each other (row major order) _or_ with elements in the same column next to each other (column major order). Accessing memory in sequence will take advantage of your cache well regardless of the size of your cache.
-
-## Efficiently Cached Algorithm Example: Matrix Transposition
+## Efficiently Cached Algorithm Example: Matrix Transposition
Let's take a look again at our example of a memory bound problem, matrix transposition, and see how it can be impacted by good and bad use of the cache. Let's start with our simple matrix transpose code and see how it might behave:
@@ -106,61 +105,62 @@ void Transpose(vector<vector<double>> &A, vector<vector<double>> &B)
    }
}
```
+
We'll assume that our matrices are in row major order, so rows in each matrix are contiguous in memory, and we will be focusing just on reading the data from the source matrix and ignoring the writes to the output matrix, since the output matrix will be filled in order, so that part of the algorithm is already cache efficient. (If they were in column major order the logic would be the same except exchanging write for read: reading the source matrix would be cache efficient, but writing the output matrix would be inefficient.)
-This is an illustrative example using a single cache of very small capacity; we won't concern ourselves with the exact cache-mapping strategy since this varies, but will just fill in our cache in order. In the diagrams _red_ blocks will be blocks in system memory but not in the cache, and _blue_ blocks are data which are also stored in the cache.
+This is an illustrative example using a single cache of very small capacity; we won't concern ourselves with the exact cache-mapping strategy since this varies, but will just fill in our cache in order. In the diagrams _red_ blocks will be blocks in system memory but not in the cache, and _blue_ blocks are data which are also stored in the cache.
1. The first element we read is `A[0][0]`. Our cache is empty at the moment so this results in a cache miss.
-![image](img/CacheTranspose1.png)
+ ![image](img/CacheTranspose1.png)
-2. The block of data containing `A[0][0]` is therefore read from RAM and copied into the cache, which now also contains `A[0][1]` and `A[0][2]` etc.
+2. The block of data containing `A[0][0]` is therefore read from RAM and copied into the cache, which now also contains `A[0][1]` and `A[0][2]` etc.
-![image](img/CacheTranspose2.png)
+ ![image](img/CacheTranspose2.png)
-3. The next value we read is `A[1][0]`. This also results in a cache miss if the matrix is too large for more than one row to fit in a single block in the cache (which is likely as cache blocks are very small).
+3. The next value we read is `A[1][0]`. This also results in a cache miss if the matrix is too large for more than one row to fit in a single block in the cache (which is likely as cache blocks are very small).
-![image](img/CacheTranspose3.png)
+ ![image](img/CacheTranspose3.png)
4. The block containing `A[1][0]`, `A[1][1]` ... is copied to the cache.
-![image](img/CacheTranspose4.png)
+ ![image](img/CacheTranspose4.png)
-5. This sequence of cache misses and copies continues and eventually the cache is filled.
+5. This sequence of cache misses and copies continues and eventually the cache is filled.
-![image](img/CacheTranspose5.png)
+ ![image](img/CacheTranspose5.png)
-6. When we try to read the next element, we once again have a cache miss, but now in order to add it into the cache we must replace an earlier entry.
+6. When we try to read the next element, we once again have a cache miss, but now in order to add it into the cache we must replace an earlier entry.
-![image](img/CacheTranspose6.png)
+ ![image](img/CacheTranspose6.png)
-7. Eventually we will have read through the entire first column, and will start on the second column to read `A[0][1]`. This was added into our cache in step 2, but if the matrix is sufficiently large (or if there are clashes because of the cache-mapping strategy) then by the time we return to read this value it will have been overwritten in the cache, resulting in yet another cache miss!
+7. Eventually we will have read through the entire first column, and will start on the second column to read `A[0][1]`. This was added into our cache in step 2, but if the matrix is sufficiently large (or if there are clashes because of the cache-mapping strategy) then by the time we return to read this value it will have been overwritten in the cache, resulting in yet another cache miss!
-![image](img/CacheTranspose7.png)
+ ![image](img/CacheTranspose7.png)
-8. This process continues on for the whole matrix, in this case missing the cache and making a call to system memory for every single element. Since this problem is clearly memory bound, this will have a large impact on the performance of this algorithm by slowing down all of our memory accesses.
+8. This process continues for the whole matrix, in this case missing the cache and making a call to system memory for every single element. Since this problem is clearly memory bound, this will have a large impact on the performance of this algorithm by slowing down all of our memory accesses.
-![image](img/CacheTranspose8.png)
+ ![image](img/CacheTranspose8.png)
We can solve this problem by dividing our matrix up into smaller sub-matrices which do fit into the cache. In this toy example where we only have four slots in our cache, we'll just transpose the $4 \times 4$ sub-matrix $A_{0...3, 0...3}$. (In reality you can store more than this in a cache but then the diagram would get very cluttered indeed!)
1. The algorithm will start as before, with a series of cache misses.
-![image](img/CacheTranspose9.png)
+ ![image](img/CacheTranspose9.png)
2. However, in this case we stop moving down the column before we overwrite any existing matrix data in our cache. So when we come to read `A[0][1]` it is still present in the cache! In fact the rest of this small matrix is cached, so we proceed with 12 cache hits after our first four cache misses: a major improvement in memory performance!
-![image](img/CacheTranspose10.png)
+ ![image](img/CacheTranspose10.png)
-3. We can then repeat this process for each small sub matrix within our main matrix and achieve the same ratio of hits to misses throughout.
+3. We can then repeat this process for each small sub-matrix within our main matrix and achieve the same ratio of hits to misses throughout.
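+The tiled access pattern sketched in the diagrams above can be written down fairly directly. The following is a rough illustration rather than part of the lecture code: the function name `TransposeTiled` and the `blockSize` parameter are invented for this sketch, and it assumes the same `vector<vector<double>>` matrix representation as the simple transpose above, with `B` already sized to hold the transpose.
+
+```cpp
+#include <algorithm>
+#include <cstddef>
+#include <vector>
+
+using std::vector;
+
+// Transpose A (n x m) into B (m x n) one blockSize x blockSize tile at a time,
+// so the rows of A touched by a tile are reused from the cache before being evicted.
+void TransposeTiled(const vector<vector<double>> &A, vector<vector<double>> &B,
+                    std::size_t blockSize)
+{
+    const std::size_t n = A.size();                      // number of rows of A
+    const std::size_t m = A.empty() ? 0 : A[0].size();   // number of columns of A
+    for (std::size_t ii = 0; ii < n; ii += blockSize)
+    {
+        for (std::size_t jj = 0; jj < m; jj += blockSize)
+        {
+            // Transpose one tile: only a small range of rows is revisited here,
+            // so their cached blocks are still resident when we read them again.
+            for (std::size_t i = ii; i < std::min(ii + blockSize, n); ++i)
+            {
+                for (std::size_t j = jj; j < std::min(jj + blockSize, m); ++j)
+                {
+                    B[j][i] = A[i][j];
+                }
+            }
+        }
+    }
+}
+```
+
+Note that `blockSize` does not have to match the cache size exactly; it just needs to be small enough that a tile's worth of rows fits in the cache at once, so a modest value (say 16 or 32) is a reasonable starting point. A fully cache oblivious version would replace the fixed `blockSize` with recursion on sub-matrices, as described earlier.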
## Optional note: Virtual Memory and Paging
-Memory addresses used by pointers in C++ are in fact pointers to _virtual memory_: an abstract model of memory, but not the _physical_ memory addresses themselves. This is important because the memory used by a program is actually set by the operating system (remember that your program is assigned stack and heap memory at the start), so in order for our program to work regardless of what memory space we're given we can't refer to explicit physical memory addresses. Instead it has to refer to these virtual memory addresses which are then translated into the real memory addresses by the OS. This can have some consequences because addresses which are contiguous in _virtual memory_ are not necessarily contiguous in physical memory!
+Memory addresses used by pointers in C++ are in fact pointers to _virtual memory_: an abstract model of memory, but not the _physical_ memory addresses themselves. This is important because the memory used by a program is actually set by the operating system (remember that your program is assigned stack and heap memory at the start), so in order for our program to work regardless of what memory space we're given, we can't refer to explicit physical memory addresses. Instead the program has to refer to these virtual memory addresses, which are then translated into the real memory addresses by the OS. This can have some consequences because addresses which are contiguous in _virtual memory_ are not necessarily contiguous in physical memory!
-Memory (in RAM or on disk) is generally _paged_, which means stored in blocks of a particular size (4kB is common). Pages in virtual memory can be translated into pages in physical memory, with some overhead to resolve the page location and usually some latency to access it (which will depend on the kind of memory you are accessing). If a an area of virtual memory, for example storage of a vector, crosses more than one page, these pages may not be contiguous in physical memory (even if they are in virtual memory).
+Memory (in RAM or on disk) is generally _paged_, which means stored in blocks of a particular size (4kB is common). Pages in virtual memory can be translated into pages in physical memory, with some overhead to resolve the page location and usually some latency to access it (which will depend on the kind of memory you are accessing). If an area of virtual memory, for example the storage of a vector, crosses more than one page, these pages may not be contiguous in physical memory (even if they are in virtual memory).
-If your data is not well aligned with the pages, then you can end up doing unnecessary additional work to resolve extra pages. Similar to how cache efficient algorithms work, some algorithms (such as B-trees, see the _Introduction to Algorithms_ book in the recommended texts for a great discussion of these!)) which deal with very large data on disk will work with one page at a time to minimise hopping from page to page. Sometimes alignment is even more important, as some accelerated devices require memory to be aligned with the pages in order to be streamed to / from the device. If the memory is not aligned, it can be copied into a new, aligned memory location which is expensive for large datasets. Page resolutions can also be made more efficient if we can force memory to be allocated contiguously in _physical memory_, which can also be useful for streaming to such devices.
+If your data is not well aligned with the pages, then you can end up doing unnecessary additional work to resolve extra pages. Similar to how cache efficient algorithms work, some algorithms which deal with very large data on disk (such as B-trees; see the _Introduction to Algorithms_ book in the recommended texts for a great discussion of these!) will work with one page at a time to minimise hopping from page to page. Sometimes alignment is even more important, as some accelerated devices require memory to be aligned with the pages in order to be streamed to / from the device. If the memory is not aligned, it can be copied into a new, aligned memory location, which is expensive for large datasets. Page resolutions can also be made more efficient if we can force memory to be allocated contiguously in _physical memory_, which can also be useful for streaming to such devices.
-If strictly necessary, these memory properties can be forced by using OS specific commands, although standard C++ does have methods for declaring aligned memory.
-If you are interested, [here is an example for FPGAs](https://xilinx.github.io/Vitis-Tutorials/2022-1/build/html/docs/Hardware_Acceleration/Introduction/runtime_sw_design.html), a kind of accelerated hardware, with a discussion of these concepts and how to address them.
\ No newline at end of file
+If strictly necessary, these memory properties can be forced by using OS-specific commands, although standard C++ does have methods for declaring aligned memory.
+If you are interested, [here is an example for FPGAs](https://xilinx.github.io/Vitis-Tutorials/2022-1/build/html/docs/Hardware_Acceleration/Introduction/runtime_sw_design.html), a kind of accelerated hardware, with a discussion of these concepts and how to address them.
diff --git a/08openmp/04_cache_performance.md b/08openmp/04_cache_performance.md
index 0d753cd80..0749cbc23 100644
--- a/08openmp/04_cache_performance.md
+++ b/08openmp/04_cache_performance.md
@@ -1,22 +1,34 @@
---
-title: "Cache Perforance in Shared Memory"
+title: "Cache Performance in Shared Memory"
---
-# Cache Performance in Shared Memory
+# Cache Performance in Shared Memory
-The need for cache efficiency hasn't gone away just because we've started parallelising things; in fact, it may be more important than ever! Generally for distributed systems we just need to worry about the cache efficiency of each process in isolation, but if memory is shared then that means our cache gets shared too. The way that our cache behaves when shared is a little different though, so we'll need to re-think how we do things a bit.
+The need for cache efficiency hasn't gone away just because we've started parallelising things; in fact, it may be more important than ever! Generally for distributed systems we just need to worry about the cache efficiency of each process in isolation, but if memory is shared then that means our cache gets shared too. The way that our cache behaves when shared is a little different though, so we'll need to re-think how we do things a bit.
As always with the memory system, things will be system dependent, but on a typical CPU:
-- General RAM and the largest cache level will be shared between cores which share memory i.e. there is just one physical RAM and one large physical cache (in my case just the L3 cache) which is accessed by all cores. (Not all cores necessarily access memory with equal bandwidth or latency though.)
-- Each core will have its own copy of the smallest cache level(s), in my case it's the L1 and L2 caches.
-- This keeps access to the small caches quick, but also enforces the need to consistency between the copies of the caches.
- - If I have two cores $C_1$ and $C_2$ which both store a copy of variable `x` in their L1 cache, then when the want to read `x` from memory they just read it from the cache and not from RAM. Likewise when they want to _write_ to `x`, they write to the cache but not RAM.
- - If $C_1$ changes `x`, it will change the value of `x` in _its own cache_, but not in the $C_2$ cache.
- - In order for $C_2$ to read the correct value of `x` after the update, it has to find out about the change somehow.
- - This mechanism will be system dependent but typically it will involve something written to a special shared area (possibly part of the L3 cache) when $C_1$ updates `x`. $C_2$ needs to check this and if `x` has been changed it needs to get the new value of `x` which will need to be copied over, incurring additional overheads.
-- Remember that the cache stores data in blocks of a given size, called "cache lines". The cache lines on my machine for example are 64 bytes or 8 doubles wide.
-- If two L1 caches on different cores both store the same cache line, and one of the cores changes _any value in that line_, then **the entire cache line is invalidated for the other core**.
- - This is extremely important as it means that even if the two cores never operate on the same values, if the values that they operate on are next to each other in memory and stored in the same cache line, then they will still invalidate one another's caches and cause lookups to have to be made.
+
+- General RAM and the largest cache level will be shared between cores which share memory i.e. there
+  is just one physical RAM and one large physical cache (in my case just the L3 cache) which is
+  accessed by all cores. (Not all cores necessarily access memory with equal bandwidth or latency
+  though.)
+- Each core will have its own copy of the smallest cache level(s), in my case it's the L1 and L2 caches.
+- This keeps access to the small caches quick, but also enforces the need for consistency between the
+  copies of the caches.
+
+  - If I have two cores $C_1$ and $C_2$ which both store a copy of variable `x` in their L1 cache, then when they want to read `x` from memory they just read it from the cache and not from RAM. Likewise when they want to _write_ to `x`, they write to the cache but not RAM.
+  - If $C_1$ changes `x`, it will change the value of `x` in _its own cache_, but not in the $C_2$ cache.
+  - In order for $C_2$ to read the correct value of `x` after the update, it has to find out about the change somehow.
+  - This mechanism will be system dependent but typically it will involve something written to a
+    special shared area (possibly part of the L3 cache) when $C_1$ updates `x`. $C_2$ needs to
+    check this and if `x` has been changed it needs to get the new value of `x` which will need to
+    be copied over, incurring additional overheads.
+
+- Remember that the cache stores data in blocks of a given size, called "cache lines". The cache lines on my machine for example are 64 bytes or 8 doubles wide.
+- If two L1 caches on different cores both store the same cache line, and one of the cores changes
+  _any value in that line_, then **the entire cache line is invalidated for the other core**.
+
+  - This is extremely important: it means that even if the two cores never operate on the same values, if the values that they operate on are next to each other in memory and stored in the same cache line, then they will still invalidate one another's caches and force additional lookups to be made.
As a very simple example, let's look at a possible manual implementation of the reduction code shown in last week's class. I'll do this example with no compiler optimisations to prevent the compiler optimising away any memory read/write operations, so we have full control! As a reminder, the basic code looks like this:
@@ -31,12 +43,12 @@ using namespace std;
int main()
{
    const int num_threads = omp_get_max_threads();
-
+
    double sum = 0.;
    Timer timer;
-#pragma omp parallel for
+#pragma omp parallel for
    for(long i=0; i