
gallocr: fix reallocation of shared tensors #999

Conversation

JohannesGaessler (Collaborator)

While working on #988 I found what I believe to be a bug in the graph allocator. I'm spinning the fix out into a dedicated PR because the problem is simple but time-consuming to track down.

When repeatedly allocating different graphs with ggml_backend_sched I found that eventually unexpected data would be written to some tensors. My understanding is that this problem is caused by some of the tensors being shared between the graphs. These tensors are already known to the graph allocator but are not considered when allocating a new graph that contains additional tensors, leading to conflicting allocations. A solution that I've found to work is, when allocating a new graph, to check the new graph for any tensors whose allocations are owned by the allocator and to explicitly invalidate the data of those tensors. Those tensors will then simply be reallocated (but in such a way that they don't conflict with any new tensors).
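Sketched out, the fix looks roughly like this (gallocr_owns() is a hypothetical stand-in for "this allocator assigned the tensor's current address", not an actual function in ggml-alloc.c):

for (int i = 0; i < graph->n_nodes; i++) {
    struct ggml_tensor * node = graph->nodes[i];
    // an address previously assigned by this allocator is stale by now:
    if (node->data != nullptr && gallocr_owns(galloc, node)) {
        node->data   = nullptr; // drop the stale address
        node->buffer = nullptr; // the tensor will be given a fresh, non-conflicting slot
    }
}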

@slaren (Collaborator) commented Oct 29, 2024

The idea of ggml-alloc is that you have a graph, then you allocate the intermediate tensors in a way that minimizes memory usage, then evaluate it as many times as you want, and then you throw it away. I do not consider this a bug; it is just not the way it is intended to be used. If you have some tensors that persist through multiple graphs, they should be allocated separately in a static buffer. If the graph changes and you need to reallocate it, then you need to throw away all the tensors from the previous graph. It may be possible to hack it to detect this case in particular, but ultimately it is working against the design of this component, and it is bound to create more problems.
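In other words, the intended usage pattern is roughly this (a sketch; ctx_static, backend and graph are placeholder names, error handling omitted):

// persistent tensors (weights, inputs) live in a static buffer:
ggml_backend_buffer_t buf_static = ggml_backend_alloc_ctx_tensors(ctx_static, backend);

// intermediate tensors are allocated per graph:
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(galloc, graph); // plan and allocate the intermediates
// ... evaluate the graph as many times as needed ...
// if the graph changes: rebuild it and allocate again; at that point all
// intermediate tensors from the previous graph are invalidated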

@JohannesGaessler (Collaborator, Author)

> The idea of ggml-alloc is that you have a graph, then you allocate the intermediate tensors in a way that minimizes memory usage, then evaluate it as many times as you want, and then you throw it away.

My interpretation of what you're saying is that the data pointers assigned by ggml-alloc are what is thrown away but that the ggml_gallocr object itself should be reusable (since that is how it is used in ggml_backend_sched_alloc_graph).

The code from my end that is causing problems without the changes in this PR is as follows:

if (opt_ctx->allocated_graph != graph) {
    ggml_backend_sched_reset(opt_ctx->backend_sched); // deallocate the previous graph
    ggml_backend_sched_alloc_graph(opt_ctx->backend_sched, graph);
    opt_ctx->allocated_graph = graph;
    if (graph != opt_ctx->gf) {
        ggml_graph_reset(graph); // zero the gradients
    }
}
ggml_backend_sched_graph_compute(opt_ctx->backend_sched, graph);

This code is first called for the forward graph, then for the backward graph, then a second time for the forward graph, and then a second time for the backward graph. The graph inputs are allocated statically in a separate context. The graph outputs are either explicitly retrieved via ggml_backend_tensor_get or implicitly used by the graph to modify the weights (the weights are also statically allocated).

  • The first forward pass works as expected: no intermediate tensors are allocated initially, so all intermediate tensors get new allocations.
  • The first backward pass works as expected: the backward graph is larger than the forward graph, so all intermediate tensors again get new allocations.
  • For the second forward pass some of the intermediate tensors are already allocated (since they were also in the backward graph). Those tensors are neither reallocated nor considered when placing the intermediate tensors that appear only in the forward graph (which by chance does not cause issues here).
  • For the second backward pass some of the intermediate tensors are already allocated from the forward graph and are likewise neither reallocated nor considered. The unallocated intermediate tensors that appear only in the backward graph are given new allocations that conflict with the allocations already made during the forward pass. When the backward graph is executed the second time, tensors that occupied separate memory regions during the first execution now share a memory region and overwrite each other in a way that makes the outputs incorrect.

If a graph larger than the backward graph is allocated for the forward pass, so that the first backward graph does not trigger a reallocation, the issue already occurs on the first backward pass.

If all intermediate tensors are unconditionally reallocated upon each call to ggml_backend_sched_alloc_graph, the issue is fixed.

When I talk about "shared" tensors I do not mean tensors where the data or pointers are assumed to be consistent. I merely mean tensors that appear in multiple graphs and are thus indirectly passed multiple times to ggml_gallocr_reserve. The first time those tensors do not have any allocations and are thus given new ones. The second time those tensors already have (invalid) allocations and are thus treated as externally allocated because there is no check for whether or not preexisting allocations are owned.

For me the bottom line is this: my expectation for ggml_backend_sched_reset is that after calling it, a further call to ggml_backend_sched_alloc_graph will yield the same result as with a newly created ggml_backend_sched object. But the internal state of ggml_backend_sched depends on the internal state of ggml_gallocr, and the internal state of ggml_gallocr in turn effectively includes the pointers in ggml_tensor, since those are used for bookkeeping. That is why I think the correct way to fix the problem is to initially clear the pointers used for bookkeeping and to let ggml_gallocr assign new ones.

@slaren (Collaborator) commented Oct 30, 2024

> For me the bottom line is this: my expectation for ggml_backend_sched_reset is that after calling it, a further call to ggml_backend_sched_alloc_graph will yield the same result as with a newly created ggml_backend_sched object.

If you use a new instance without freeing the previous one, that would be fine since the tensors will be allocated in the buffer of the previous ggml_backend_sched instance (I will just say ggml_gallocr from this point, since that's the component that handles these allocations). If you free the first instance, then it will result in a use-after-free since some tensors will still contain references to the first ggml_gallocr buffer that has now been freed.

The way to look at ggml_gallocr is as a temporary buffer from which tensors are allocated. Once you reset it or allocate a new graph with it, all previous allocations made from this buffer are freed. If you have tensors that were allocated with this ggml_gallocr and you reset it, they are now effectively dangling pointers: they point to a buffer that either does not exist anymore or has been repurposed for other allocations, and thus they cannot be used anymore.

You can also look at it as a memory pool in which tensors are allocated and which is reset whenever a new graph is allocated. All tensors previously allocated from this pool are effectively invalidated at that point. I don't think the behavior you are observing is unexpected if you look at it this way, and it was designed to be used in this way.

Ok, you might say, this is all fine, but what does it hurt to add a check for tensors in the graph that are allocated in the ggml_gallocr buffer and to "reset" them? There are several issues with this:

  • Is this really intuitive behavior? You are throwing away the data in these tensors and treating them as if they were never allocated, but they are allocated and they may contain data.
  • You can only detect this if the tensor was allocated in the last graph allocated by this ggml_gallocr; if you allocate or reserve another graph in between, there will be no record in the hash table of this tensor being allocated anymore. The ggml_gallocr buffer may also be reallocated to make space for the new graph, so that is not a reliable way to tell whether a tensor was allocated by this ggml_gallocr either. This is very fragile: it supports exactly this one case, but it will fail in many others.
  • The fact that the hash table still contains the tensors from the previous allocation is just an implementation detail, but this would make it part of the API. Want to change the way tensors are tracked internally to a more optimal implementation? You can't anymore, because the API now needs to support this behavior consistently.

If these tensors are only temporary and you don't care about the data, the way your problem is usually handled is by rebuilding the graph and starting from scratch. Usually this is a very fast operation, so there is little reason not to do it. Is there any reason not to do that here?

@JohannesGaessler (Collaborator, Author)

> If these tensors are only temporary and you don't care about the data, the way your problem is usually handled is by rebuilding the graph and starting from scratch. Usually this is a very fast operation, so there is little reason not to do it. Is there any reason not to do that here?

The construction of the forward graph is fundamentally done in user code. The current design for the optimization interface has the user pass an input and an output tensor to define the forward graph. If the graph needed to be reconstructed by GGML code at will, the user would instead need to pass a function for constructing the forward graph, and especially if the interface needs to be C-compliant I think that would be needlessly cumbersome.

If each instance of ggml_tensor is supposed to only ever be allocated exactly once, I think a better solution would be to construct the forward and backward graphs as usual but to extend ggml_graph_cpy so that it can make deep copies instead of shallow copies. The tensors would then be entirely disjoint between the forward and the backward pass.

@slaren (Collaborator) commented Oct 30, 2024

> If each instance of ggml_tensor is supposed to only ever be allocated exactly once, I think a better solution would be to construct the forward and backward graphs as usual but to extend ggml_graph_cpy so that it can make deep copies instead of shallow copies. The tensors would then be entirely disjoint between the forward and the backward pass.

That should be feasible. ggml_backend_graph_copy does something like this, although it also copies the tensor data of the tensors that are already allocated. It is intended to be used to copy all the tensors in a graph to a different backend, but it could be adapted to reuse the allocated tensors. It may be easier to re-implement it entirely in C++ and use a std::map to keep track of the tensors, however.
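The copy step could look roughly like this (a sketch, not an existing ggml function; it ignores views and tensor flags):

#include <cstring>
#include <map>

static ggml_tensor * graph_dup_tensor(ggml_context * ctx, ggml_tensor * t,
                                      std::map<ggml_tensor *, ggml_tensor *> & copies) {
    if (t == nullptr) {
        return nullptr;
    }
    if (t->buffer != nullptr) {
        return t; // already allocated (e.g. a weight): reuse instead of copying
    }
    auto it = copies.find(t);
    if (it != copies.end()) {
        return it->second; // already duplicated via another path through the graph
    }
    ggml_tensor * c = ggml_dup_tensor(ctx, t); // same type and shape, no data
    copies[t] = c; // record before recursing so shared sources map to one copy
    c->op = t->op;
    memcpy(c->op_params, t->op_params, sizeof(t->op_params));
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        c->src[i] = graph_dup_tensor(ctx, t->src[i], copies);
    }
    return c;
}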

@bssrdf (Contributor) commented Oct 31, 2024

@slaren, I'd like to chime in here with a related but different issue.

In stable-diffusion.cpp with Winograd conv2d, I need to first allocate and load some tensors (filter weights) from the model file and then apply a transform to these tensors. After the transform, the pre-transform weights can be discarded; the transformed ones are kept and later used for the conv2d op. Is there a way in ggml to achieve this (potentially with ggml_gallocr)?

@slaren (Collaborator) commented Oct 31, 2024

Why don't you transform the tensors as you load them? Do you need to do that in a ggml graph?

@bssrdf (Contributor) commented Oct 31, 2024

> Why don't you transform the tensors as you load them? Do you need to do that in a ggml graph?

Yes, the transform is done by a CUDA op during graph compute. I don't want to hack the weight loading process just for this purpose.

@JohannesGaessler (Collaborator, Author)

@slaren do you think the following code in ggml-opt.cpp with no modifications to ggml-alloc.c would be acceptable?

if (opt_ctx->allocated_graph != graph) {
    ggml_backend_sched_reset(opt_ctx->backend_sched);
    // invalidate the addresses from any previous allocation so that
    // the graph allocator treats all compute tensors as unallocated:
    for (struct ggml_tensor * t = ggml_get_first_tensor(opt_ctx->ctx_compute); t != nullptr; t = ggml_get_next_tensor(opt_ctx->ctx_compute, t)) {
        t->data   = nullptr;
        t->buffer = nullptr;
    }

    ggml_backend_sched_alloc_graph(opt_ctx->backend_sched, graph);
    opt_ctx->allocated_graph = graph;
    if (graph != opt_ctx->gf) {
        ggml_graph_reset(graph); // zero the gradients
    }
}
ggml_backend_sched_graph_compute(opt_ctx->backend_sched, graph);
ggml_backend_sched_graph_compute(opt_ctx->backend_sched, graph);

The tensors have been distributed across two contexts, ctx_static and ctx_compute.

@bssrdf Just create a temporary context for the transformation and free the context once you're done?
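For example (a sketch; ctx_tmp, n_tmp_tensors and backend are illustrative names):

// throwaway context for the pre-transform weights, metadata only:
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * n_tmp_tensors,
    /*.mem_buffer =*/ nullptr,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx_tmp = ggml_init(params);
// ... create the raw weight tensors in ctx_tmp ...
ggml_backend_buffer_t buf_tmp = ggml_backend_alloc_ctx_tensors(ctx_tmp, backend);
// ... load the raw weights, run the transform graph whose outputs live in a
//     statically allocated context ...
ggml_backend_buffer_free(buf_tmp); // discard the pre-transform weights
ggml_free(ctx_tmp);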

@slaren (Collaborator) commented Oct 31, 2024

Yes. It may be a good idea to add a function to ggml-backend to "free" or "reset" a tensor to make this more explicit and more resilient to changes in the future.

Also, don't forget to set t->extra to nullptr as well.
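Something like this, as a hypothetical ggml-backend addition (not an existing function):

// mark a tensor as deallocated so that a graph allocator
// will assign it fresh storage on the next allocation:
static void ggml_backend_tensor_reset(struct ggml_tensor * t) {
    t->data   = nullptr;
    t->buffer = nullptr;
    t->extra  = nullptr;
}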

@slaren (Collaborator) commented Oct 31, 2024

@bssrdf you could do that with ggml-alloc by tagging the original weights as inputs and the converted tensors as outputs. However, that will still require loading all the weights data before running the graph; I think that doing it weight by weight as they are loaded would be the most efficient way to do this. It is also possible to create a ggml backend buffer type that converts the tensors on the fly as they are loaded (this is done in the AMX and CANN backends to change the layout of some types, for example), but that may not be the best idea if this is not specific to one backend.
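For the tagging, the existing flags in ggml.h would be used (w_raw and w_transformed are illustrative names):

ggml_set_input(w_raw);          // tag the original weight as a graph input
ggml_set_output(w_transformed); // tag the converted weight as a graph output
                                // so its allocation is not reused mid-graph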

@JohannesGaessler (Collaborator, Author)

Alright, thank you for the help.

@JohannesGaessler (Collaborator, Author)

> Also, don't forget to set t->extra to nullptr as well.

I noticed that as well, but ggml_tensor.extra cannot simply be set to a null pointer because that would create a memory leak. As far as I can tell it is only used for split buffers in the CUDA backend (which in their current form would probably not make sense for training), so I think it will be fine to just assert that the pointer is null.

@slaren (Collaborator) commented Oct 31, 2024

It wouldn't create a memory leak because the backends are responsible for keeping a list of allocated extras, since there is no call to free a specific tensor, only to reset entire buffers.

@JohannesGaessler (Collaborator, Author)

You're right, thank you.

@bssrdf (Contributor) commented Oct 31, 2024
> […]
> @bssrdf Just create a temporary context for the transformation and free the context once you're done?

This came to my mind as well. I will give it a try.

@bssrdf (Contributor) commented Oct 31, 2024

> @bssrdf you could do that with ggml-alloc by tagging the original weights as inputs and the converted tensors as outputs. […]

Thank you for the suggestions.
