[feature request] support multiple simultaneous right-hand-sides #3
This is a very interesting idea -- I'm afraid I won't have time to address it for a while (NeurIPS deadline!), but I wanted to leave some comments now.
Excellent; thanks, and best wishes on your NeurIPS push!
Been thinking about this some more... Every operation in that tree I mentioned is another einsum operation, so I think all we need to do is define …. Figuring out the optimal tree structure, on the other hand, definitely seems like a difficult problem. Numbering your equations:
We're looking for the largest subset of summed variables shared by the largest number of outputs. I don't know yet if there's an algorithm that does this in better than exponential time. Do we sum out klm first and use it for 1, 2, and 5, or do we sum out klmn first and use it for 1 and 2? Are we going for best space efficiency or best time efficiency? And the result depends on the actual dimensions of the inputs, which are not known by the equation at compile time.
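Not an answer to the complexity question, but here is a brute-force illustration of the search just described (exponential in the number of variables, as feared). The variable set `ijklmn` and the three example outputs are invented for illustration and are not tied to the equations numbered above:

```python
from itertools import combinations

# Brute-force search: for each candidate subset of variables, count how many
# outputs would sum it out; keep the largest subsets shared by >= 2 outputs.
all_vars = set('ijklmn')                      # variables appearing in the inputs
outputs = [set('in'), set('ij'), set('ik')]   # free variables of each requested output

# the summed variables of an output are everything it does not keep
summed = [all_vars - out for out in outputs]

shared = []
for size in range(len(all_vars), 0, -1):
    for cand in combinations(sorted(all_vars), size):
        users = [i for i, s in enumerate(summed) if set(cand) <= s]
        if len(users) >= 2:
            shared.append((''.join(cand), users))
    if shared:
        break  # stop at the largest subset size shared by two or more outputs

print(shared)  # [('jlm', [0, 2]), ('klm', [0, 1]), ('lmn', [1, 2])]
```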
Excellent; thanks for the reply! Also, sorry for the distraction from NeurIPS, and sorry for the long brain-dump here; I might not be thinking about this for a while, so I want to make sure I'll be able to jog my memory when needed. I don't expect you to even read this until after your paper is submitted. :)

**My original idea**

Originally I actually wasn't even thinking of being clever about how to reuse the summing work (even though that does seem like a good idea, since multiple outputs will typically need to sum out shared sets of variables---that sounds like an extra bonus!). I was really just expecting/hoping that at least the product computation could be reused (but in the code I don't actually see a full product followed by summing out---I expect this is because of some combination of: (1) tricks to save memory, (2) tricks to avoid redundant multiplication, (3) tricks to avoid redundant summing, and (4) me not looking closely enough :). Is it true that we can typically reuse the multiplication work between outputs?

**Clever summation reuse**

However, considering the optimal summing problem for a moment, and assuming that there is a dense, full product tensor sitting around somewhere before the summing happens, we can imagine a graph where nodes are variable subsets and there is an edge to subset A from subset B whenever A is a proper subset of B---the weight on the edge being the amount of work needed to do the summing. (Ignore the fact that this depends on actual dimension sizes, which aren't known at compile time; maybe the right answer would be to recompute the optimal tree before each einsum or something; just ignore it for now.) The goal, then, is to find a min-cost tree that touches the variable subsets included in our output factors (and the root node that has all variables). This sounds like exactly the Steiner Tree Problem (on graphs), which is NP-Hard (even if our graph were polynomially sized!). Stack Overflow suggests some approximations---including, ironically, a modified belief propagation approach (I say ironically because my interest here comes from factor graphs; see below). A rough sketch of this subset-lattice framing is included below.

**Factor graphs**

However, your discussion of variable elimination order encourages me, and highlights that Einstein notation is essentially a concise way to specify a factor graph along with a query for marginal inference. (Which also means that einsum is NP-Hard in the general case.) Allowing multiple right-hand sides lets us specify more marginal inference queries at once, and also allows us to represent "batched" (or "conditional") factor graphs (the dimensions shared by all queries do not contribute to the treewidth of the factor graph). [Einsum with multiple right-hand sides technically describes slightly more: it allows some factors to come "uncollapsed" (multiple dimensions for the same variable) and it allows describing the order of the output dimensions, but both of those just amount to a small amount of pre- and post-processing; I'll ignore that below.] In the example above, we were "lucky" to have a tree, but that is not the case in general. For the general factor graph, marginal inference is NP-Hard (see citations in this UAI 2008 paper), but it is polynomial in the treewidth of the factor graph. Again, the question boils down to what products to store all at once and in what order to sum out the variables.
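To make the subset-lattice / Steiner-tree framing concrete, here is a rough sketch. Everything in it is an illustrative assumption: the variable set, the dimension sizes, the crude edge-cost model (cost of summing B down to A taken as the number of elements of B), and the choice to treat the lattice as undirected so that networkx's Steiner-tree approximation can be applied.

```python
from itertools import combinations
import math
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

dim = {'i': 10, 'j': 10, 'k': 10, 'l': 10}   # assumed dimension sizes
all_vars = frozenset(dim)

def size(vars_):
    return math.prod(dim[v] for v in vars_)

# nodes: all nonempty variable subsets; an (undirected) edge joins A and B
# whenever A is a proper subset of B, weighted by a crude summing cost.
G = nx.Graph()
subsets = [frozenset(c) for r in range(1, len(all_vars) + 1)
           for c in combinations(sorted(all_vars), r)]
for b in subsets:
    for a in subsets:
        if a < b:
            G.add_edge(b, a, weight=size(b))  # read every element of B once

# terminals: the full-product node plus the variable sets of the desired outputs
terminals = [all_vars, frozenset('il'), frozenset('ik'), frozenset('jl')]
tree = steiner_tree(G, terminals, weight='weight')
print(sorted(tuple(sorted(s)) for s in tree.nodes))
```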
Let's consider the following cases. (Things might get more complicated if there is special support for sparse tensors, but here I'm assuming that all input tensors are dense. Also, let's only consider output factors that are a subset of some input factor---this is without loss of generality as long as it is possible to always include a no-op factor touching the variables of some desired output factor; however, such a factor would necessarily depend on the semiring and it might impact computational complexity, so let's not consider it for now.)
No worries, we made the NeurIPS deadline without a hitch. That just leaves me here at home recovering from oral surgery, so I'm happy to take a look now. The multiplication code is here: https://github.com/bdusell/semiring-einsum/blob/master/torch_semiring_einsum/extend.py#L150-L156

Your hunch is correct that the memory-saving technique complicates the re-use of the products. The memory-saving strategy I employ is based on the insight that you don't need to store the entire tensor of products (which can easily become extremely large) before doing the reduction. Instead, I break the input tensors into smaller slices, essentially doing a ….

Now, can the mini-product tensor be re-used in some way? My hunch at first was no, but now I'm thinking this is the exact right direction to go in... we line up the dimensions and do the product, but then we perform a different mini-sum for each equation that gets rid of different variables, and then we accumulate each sum in a different tensor for each equation. This way you re-use the products for all outputs, but unlike the tree approach, you don't allocate any intermediate tensors whose sizes are proportional to the full inputs. I think this is exactly what you were looking for originally. Moreover, any tree optimizations can now be applied to the mini-sum, but with the bonus that the memory usage is constant because we're working on slices limited by `block_size`. It also makes optimizing at compile time easier, since the slice sizes are more predictable.
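To make the "shared mini-product, different mini-sums, separate accumulators" idea concrete, here is a rough sketch in plain PyTorch. This is not the library's internal code; the equation `ij,jk,kl`, the three example outputs (`il`, `ik`, `jl`), and the blocking scheme are all just assumptions for illustration.

```python
import torch

def multi_output_einsum_blocked(a, b, c, block_size=10):
    """Blocked 'ij,jk,kl' with three outputs: ->il, ->ik, ->jl."""
    I, J = a.shape
    _, K = b.shape
    _, L = c.shape
    out_il = torch.zeros(I, L)
    out_ik = torch.zeros(I, K)
    out_jl = torch.zeros(J, L)
    for j0 in range(0, J, block_size):
        for k0 in range(0, K, block_size):
            aj = a[:, j0:j0 + block_size]                      # I x Bj
            bjk = b[j0:j0 + block_size, k0:k0 + block_size]    # Bj x Bk
            ck = c[k0:k0 + block_size, :]                      # Bk x L
            # one shared mini-product over the slice: I x Bj x Bk x L
            prod = (aj[:, :, None, None]
                    * bjk[None, :, :, None]
                    * ck[None, None, :, :])
            # a different mini-sum per output, each accumulated separately
            out_il += prod.sum(dim=(1, 2))                         # sum j, k
            out_ik[:, k0:k0 + block_size] += prod.sum(dim=(1, 3))  # sum j, l
            out_jl[j0:j0 + block_size, :] += prod.sum(dim=(0, 2))  # sum i, k
    return out_il, out_ik, out_jl
```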
Foregoing the tree optimizations and computing the mini-summation for each output directly from the mini-product might be faster anyway due to the parallelism, and definitely more memory-efficient, if not as eco-friendly. Definitely the first thing we should try.
That all sounds great! As you point out, the idea of separately accumulating into different output tensors seems perfectly consistent (and you are right that that would satisfy my initial desire). The only additional complication (and the thing that threw me off originally) is that you currently have deliberately structured the product slices so that all reduced dimensions come last (which, I think, is required to do the summation), so that may mean an extra permutation of dimensions for each extra output (but that's not a big deal, I think). This ticket was really just about (1) allowing an interface that accepts multiple right-hand sides and (2) if possible, computing the marginals for each of the three input factors while only multiplying the summation time by three, rather than multiplying the product time and the summation time each by three, so I think your plan already handles both of these.

(Tangentially, I think that all of that BP and junction tree stuff continues to be applicable as a way of avoiding the need to even slice over the entire full product. For example, the BP approach for the example above would perform the einsum 'ij,jk->ik' and then pass the result into a second einsum 'ik,kl->il', which achieves the same answer with 2 full product tensors of 100x100x100, which amounts to only 2000 10x10x10 slices. Much better than 10000 10x10x10x10 slices! Do you follow me?)

Anyway, no pressure on any of this, and thanks very much for your library, attention, and comments.
Ah, I think the comment above the line of code you linked to is actually out of date -- I changed it a while back so that ….

I don't understand the BP approach yet, but are you saying it would be able to avoid creating an intermediate tensor of size I x K? That would be extremely useful if so. Do you think this is also related to #4?

This library was definitely born out of necessity, and without it much of my current research would be impossible. I hope that by accommodating new, interesting use cases like this, we can spread the joy to more users and open up more research possibilities.
Great; that sounds right.
We won't avoid IxK (which is already used in the input factors), nor IxL (which is used in the output), but we will avoid needing slices of IxJxKxL. Even with your slicing trick, you have to represent the final sum, but you only have to represent a fixed-size slice of the intermediate (full) product at any given time. Even so, the size of that slice (in terms of number of elements) grows exponentially with the number of variables in the entire formula, and the number of slices you need to iterate over grows with the product of the dimension sizes of all variables. With tree-structured factorizations, the BP version can avoid ever storing even a slice of the full product. Instead, the size and number of product slices that do need to be stored can be linear in the size and number of input factors. The junction-tree stuff is a way of representing any factorization as a tree of factor clusters (no magic here of course: if the factorization has high treewidth, then the dense representation of some of the clusters will explode exponentially).

To see a pretty trivial example of the difference:

```python
import torch
import torch_semiring_einsum as tse

# three 100x100 matrices
a, b, c = [torch.rand(100,100) for _ in range(3)]
# formula to compute all at once
eq_full = tse.compile_equation('ij,jk,kl->il')
# formula analogous to separating the two matrix multiplies
eq_part_out, eq_part_in = tse.compile_equation('ik,kl->il'), tse.compile_equation('ij,jk->ik')
# block_size 1
tse.einsum(eq_full, a,b,c, block_size=1)
# %timeit: 654 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# block_size 10
tse.einsum(eq_full, a,b,c, block_size=10)
# %timeit: 395 ms ± 9.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# "BP" version with block_size 1
tse.einsum(eq_part_out, tse.einsum(eq_part_in, a,b, block_size=1), c, block_size=1)
# %timeit: 11.7 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# "BP" version with block_size 10
tse.einsum(eq_part_out, tse.einsum(eq_part_in, a,b, block_size=10), c, block_size=10)
# %timeit: 6.17 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# same answers:
torch.isclose(tse.einsum(eq_full, a,b,c, block_size=10), tse.einsum(eq_part_out, tse.einsum(eq_part_in, a,b, block_size=10), c, block_size=10)).all()
# output: tensor(True)
# of course, native pytorch matrix multiply still wins handily (as does pytorch einsum, but with no option for a semiring)
a@b@c
# %timeit: 50.4 µs ± 2.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
torch.einsum('ik,kl->il', torch.einsum('ij,jk->ik', a,b), c)
# %timeit: 88.2 µs ± 4.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
torch.einsum('ij,jk,kl->il', a,b,c)
# %timeit: 86.8 µs ± 7.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
torch.isclose(tse.einsum(eq_full, a,b,c, block_size=10), a@b@c).all()
# output: tensor(True)
```

Is that clear?
Yes; I'm not well versed in that, but I do see connections made on pages 55 and 57 of the Eisner and Blatz paper that Tim cited.
Thanks for the great and thorough examples! To clarify, for the full equation:

Iterations: O(JK)

With the BP version this becomes:

Iterations: O(J + K)
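For a rough sense of scale, assuming (as in the discussion above) that the iteration count scales with the blocked summed dimensions, and plugging in the sizes from the earlier 100x100 example:

```python
from math import ceil

J = K = 100
block_size = 10
joint = ceil(J / block_size) * ceil(K / block_size)  # 'ij,jk,kl->il' sums j and k together
bp = ceil(J / block_size) + ceil(K / block_size)     # two einsums, each sums one variable
print(joint, bp)  # 100 20
```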
I have a situation where many of my formulas come in sets like the following:
It seems like there should be a substantial amount of work that could be shared between these, but I might be misunderstanding the way that you are doing the reducing.
If there is a possibility to support more efficient bulk work (or maybe even if there isn't), I'd suggest allowing the above set of formulas to be replaced with the following (which should indicate that `*einsum` should return a tuple of tensors with the respective results):

@bdusell would you please comment on the following when you get a chance:
Thanks!
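For concreteness, here is a purely hypothetical sketch of the requested calling convention, written as a thin wrapper over the existing single-output API, so it shares no work between outputs and only illustrates the interface. The name `multi_einsum` and the comma-separated right-hand-side syntax are inventions for illustration, not part of the library.

```python
import torch
import torch_semiring_einsum as tse

def multi_einsum(equation, *args, block_size):
    """Return a tuple of results, one per comma-separated right-hand side."""
    lhs, rhs = equation.split('->')
    return tuple(
        tse.einsum(tse.compile_equation(f'{lhs}->{out}'), *args, block_size=block_size)
        for out in rhs.split(',')
    )

a, b, c = [torch.rand(100, 100) for _ in range(3)]
out_il, out_ik, out_jl = multi_einsum('ij,jk,kl->il,ik,jl', a, b, c, block_size=10)
```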