gallocr: fix reallocation of shared tensors #999
Conversation
The idea of ggml-alloc is that you have a graph, then you allocate the intermediate tensors in a way that minimizes memory usage, then evaluate it as many times as you want, and then you throw it away. I do not consider this a bug, it is just not the way it is intended to be used. If you have some tensors that persist through multiple graphs, they should be allocated separately in a static buffer. If the graph changes and you need to reallocate it, then you need to throw away all the tensors from the previous graph. It may be possible to hack it to detect this case in particular, but ultimately that works against the design of this component, and it is bound to create more problems.
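For context, a minimal sketch of that intended workflow (assuming a single backend; the function and variable names are illustrative and error handling is omitted): persistent tensors go into a static buffer, while the intermediate tensors of a graph are allocated by a graph allocator and discarded together with it.

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Minimal sketch of the intended ggml-alloc workflow: persistent tensors live
// in a static buffer, the graph's intermediate tensors are allocated by a
// gallocr and only live as long as that graph.
static void example(ggml_backend_t backend, struct ggml_context * ctx_static, struct ggml_context * ctx_graph) {
    // persistent tensors (weights, state that survives across graphs)
    // are allocated once in a static backend buffer
    ggml_backend_buffer_t buf_static = ggml_backend_alloc_ctx_tensors(ctx_static, backend);

    // intermediate tensors of a graph are handled by the graph allocator
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));

    struct ggml_cgraph * gf = ggml_new_graph(ctx_graph);
    // ... build the graph with ops over the static tensors ...

    ggml_gallocr_alloc_graph(galloc, gf);        // plan + allocate the intermediates
    for (int i = 0; i < 10; i++) {
        ggml_backend_graph_compute(backend, gf); // evaluate as many times as needed
    }

    // when the graph is no longer needed, throw the allocations away;
    // a different graph would get a fresh allocation plan
    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(buf_static);
}
```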
My interpretation of what you're saying is that the data pointers assigned by ggml-alloc are what is thrown away. The code from my end that is causing problems without the changes in this PR is as follows:
This code is first called for the forward graph, then for the backward graph, then a second time for the forward graph, then a second time for the backward graph. If a larger graph is allocated for the forward graph and there is no reallocation for the first backward graph, the issue occurs on the first backward pass.

When I talk about "shared" tensors I do not mean tensors where the data or pointers are assumed to be consistent. I merely mean tensors that appear in multiple graphs and are thus indirectly passed multiple times to the graph allocator.

For me the bottom line is this: my expectation is that repeatedly allocating graphs which share tensors should not result in conflicting allocations.
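For reference, a stripped-down illustration of that call pattern (this is not the code from #988; `build_forward` and `build_backward` are hypothetical stand-ins for the user code that constructs the graphs):

```c
// Illustration only: two graphs that share tensors are allocated alternately
// with the same graph allocator. gb reuses tensors that also appear in gf, so
// those tensors already carry data pointers from the previous allocation when
// the allocator plans the next graph.
static void train_loop(struct ggml_context * ctx, ggml_gallocr_t galloc,
                       ggml_backend_t backend, int n_steps) {
    struct ggml_cgraph * gf = build_forward (ctx);     // hypothetical helper
    struct ggml_cgraph * gb = build_backward(ctx, gf); // hypothetical helper

    for (int step = 0; step < n_steps; step++) {
        ggml_gallocr_alloc_graph(galloc, gf);       // forward pass allocation
        ggml_backend_graph_compute(backend, gf);

        ggml_gallocr_alloc_graph(galloc, gb);       // backward pass allocation:
        ggml_backend_graph_compute(backend, gb);    // shared tensors may conflict
                                                    // with newly placed tensors
    }
}
```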
If you use a new instance without freeing the previous one, that would be fine, since the tensors will be allocated in the buffer of the previous instance.

Ok, but you say: this is fine, but what does it hurt to add a check for tensors in the graph that are already allocated?
If these tensors are only temporary and you don't care about the data, the way your problem is usually handled is by rebuilding the graph and starting from scratch. Usually this is a very fast operation, so there is little reason not to do it. Is there any reason not to do that here?
The construction of the forward graph is fundamentally done in user code. The current design for the optimization interface has the user pass an input and an output tensor to define the forward graph. If the graph needs to be reconstructed by GGML code at will, the user would instead need to pass a function for constructing the forward graph. And especially if the interface needs to be C compliant, I think that would just be needlessly cumbersome. If each instance of
That should be feasible.
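Purely to illustrate the alternative being argued against, a hypothetical sketch of what a callback-based C interface could look like (none of these names exist in ggml):

```c
// Hypothetical sketch: if GGML were allowed to rebuild the forward graph at
// will, the optimization interface would have to take a user callback plus an
// opaque userdata pointer instead of a pair of input/output tensors.
typedef struct ggml_tensor * (*ggml_opt_build_forward_fn)(
    struct ggml_context * ctx,       // context to build the graph in
    struct ggml_tensor  * inputs,    // batch inputs provided by the optimizer
    void                * userdata); // user state needed to rebuild the graph

struct ggml_opt_params_sketch {
    ggml_opt_build_forward_fn build_forward; // called whenever a rebuild is needed
    void                    * userdata;
};
```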
@slaren, I'd like to chime in here with a related but different issue. In stable-diffusion.cpp with Winograd conv2d, I need to first allocate and load some tensors (filter weights) from the model file, and then do a transform on these tensors. After the transform, the pre-transformed weights can be discarded and the transformed ones will be kept and later used for the conv2d op. Is there a way in ggml-alloc to do this?
Why don't you transform the tensors as you load them? You need to do that in a ggml graph?
Yes, the transform is done by a CUDA op in graph compute. I don't want to hack the weight loading process just for this purpose.
@slaren do you think the following code in
The tensors have been distributed to two contexts.

@bssrdf Just create a temporary context for the transformation and free the context once you're done?
Yes. It may be a good idea to add a function to ggml-backend to "free" or "reset" a tensor to make this more explicit and more resilient to changes in the future. Also, don't forget to set
@bssrdf you could do that with ggml-alloc by tagging the original weights as inputs, and the converted tensors as outputs. However that will still require loading all the weights data before running the graph. I think that doing it weight by weight as they are loaded would be the most efficient way to do this. Also, it is possible to create a ggml backend buffer type that converts the tensors on the fly as they are loaded (it is done in the AMX and CANN backends to change the layout of some types, for example), but that may not be the best idea if this is not specific to one backend.
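A minimal sketch of the input/output tagging idea, assuming both the raw and the converted weights are managed by the graph allocator; the transform op here is `ggml_mul_mat` as a stand-in, since the actual Winograd transform lives in stable-diffusion.cpp:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch of tagging the raw weight as an input and the converted weight as an
// output so ggml-alloc handles both correctly within the transform graph.
static struct ggml_tensor * transform_weight(
        struct ggml_context * ctx,
        ggml_gallocr_t        galloc,
        ggml_backend_t        backend,
        struct ggml_tensor  * w_raw,      // weight as loaded from the model file
        struct ggml_tensor  * transform)  // stand-in for the Winograd transform matrix
{
    struct ggml_tensor * w_t = ggml_mul_mat(ctx, transform, w_raw);

    ggml_set_input(w_raw);  // ggml-alloc must not overwrite it before it is used
    ggml_set_output(w_t);   // ggml-alloc keeps this allocation, it is not reused
                            // for other nodes in the graph

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, w_t);

    ggml_gallocr_alloc_graph(galloc, gf);     // allocate the graph's tensors
    ggml_backend_graph_compute(backend, gf);  // run the transform once

    return w_t; // the raw weight's memory can now be reused by later graphs
}
```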
Alright, thank you for the help.
I noticed that as well but
It wouldn't create a memory leak because the backends are responsible for keeping a list of allocated extras, since there is no call to free a specific tensor, only to reset entire buffers.
You're right, thank you.
This came to my mind as well. I will give it a try.
Thank you for the suggestions.
While working on #988 I found what I believe to be a bug in the graph allocator. I'm spinning the fix out into a dedicated PR because the problem is simple but time-consuming to track down.

When repeatedly allocating different graphs with ggml_backend_sched I found that eventually unexpected data would be written to some tensors. My understanding is that this problem is caused by some of the tensors being shared between the graphs. These tensors are already in the graph allocator but are not being considered when allocating a new graph that contains additional tensors, thus leading to conflicting allocations. A solution that I've found to work is, when allocating a new graph, to check the new graph for any tensors whose allocations are owned by the allocator and to explicitly invalidate the data of those tensors. Those tensors will then simply be reallocated (but in such a way that they don't conflict with any new tensors).
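Roughly, the approach can be sketched as follows (a simplified sketch, not the actual patch; `gallocr_owns` stands in for whatever bookkeeping the allocator uses to decide whether it owns a tensor's allocation):

```c
// Simplified sketch of the approach described above: before planning a new
// graph, clear the data pointers of any tensor whose allocation is owned by
// this allocator, so it is treated like a fresh tensor and reallocated without
// conflicting with the new graph's tensors.
static void invalidate_owned_tensors(ggml_gallocr_t galloc, struct ggml_cgraph * graph) {
    for (int i = 0; i < graph->n_nodes; i++) {
        struct ggml_tensor * node = graph->nodes[i];
        if (gallocr_owns(galloc, node)) {   // hypothetical ownership check
            node->data   = NULL;            // invalidate the stale allocation
            node->buffer = NULL;            // will be assigned again on alloc
        }
    }
}
```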