
ggml/examples: add backend support for numerical optimization #949

Merged (36 commits into ggerganov:master) on Sep 20, 2024

Conversation

JohannesGaessler (Collaborator)

The ultimate goal of this PR is to add backend support for numerical optimization, namely Adam and L-BFGS. Right now the corresponding computations are done outside any of the GGML graphs, so only a single thread is used and only the CPU backend is compatible. I think the correct way to remedy this is to make the optimizers part of the GGML compute graphs. This would also fix some allocation issues around the extra tensors that the optimization code allocates to hold persistent optimizer state.

As of right now this PR contains my WIP version that only supports stochastic gradient descent and the CPU backend. Training is ~3x faster than on master (but the overall rate of convergence is worse than with fully featured Adam).

The overall design that I envision is that the optimizer is specified when creating the backward graph. If no optimizer is specified, calculate the gradients without touching the weights. If an optimizer is specified, apply it to all parameters after the gradients have been calculated by adding an extra GGML op on top (this could probably be optimized to overwrite gradients that are no longer needed). During backward graph creation, also specify any extra tensors needed for the optimizer so they can be correctly allocated for all backends. Functions like ggml_opt would then mainly call the backward graph in a loop and check for convergence. One potential issue is that the convergence logic would require calls to ggml_backend_tensor_get, which would make ggml.c depend on ggml_backend.c (which it currently does not). If that is a problem the optimization code could maybe be moved to a new file like ggml-algo.c.
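To make the intended flow concrete, here is a minimal sketch in user code. It assumes ctx, backend, a loss tensor, n_iter, and eps already exist; build_optimizer_ops and its enum argument are hypothetical names for the step that appends the optimizer ops, not part of the current API:

// build the forward graph up to the loss tensor
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, GGML_DEFAULT_GRAPH_SIZE, /*grads =*/ true);
ggml_build_forward_expand(gf, loss);

// build the backward graph; hypothetically, specifying an optimizer here
// would append one update op per parameter after the gradients
struct ggml_cgraph * gb = ggml_graph_dup(ctx, gf);
ggml_build_backward_expand(ctx, gf, gb, /*keep =*/ true);
build_optimizer_ops(ctx, gb, OPTIMIZER_ADAM); // hypothetical helper

for (int iter = 0; iter < n_iter; iter++) {
    ggml_backend_graph_compute(backend, gb); // forward + backward + optimizer step

    // the convergence check is where ggml_backend_tensor_get would be needed:
    float loss_val;
    ggml_backend_tensor_get(loss, &loss_val, 0, sizeof(loss_val));
    if (loss_val < eps) {
        break;
    }
}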

If there are issues with my design please let me know early.

@JohannesGaessler (Collaborator, author)

One potential issue is that the convergence logic would require calls to ggml_backend_tensor_get which would make ggml.c depend on ggml_backend.c (which it currently does not).

Actually, a much bigger issue is that for ggml_backend_graph_compute a pointer to a backend is needed.

@slaren (Collaborator) commented Sep 5, 2024

I don't think that's a problem. ggml-backend was designed to not require many changes to the core ggml code, but since then I think it has become the standard way to use ggml, and it doesn't make much sense to maintain the subset of the API that only works with the CPU backend. We should move all the CPU backend code to a separate file, and make all the core ggml functions explicitly compatible with ggml-backend.

The design looks good to me. Something to consider is that to support multiple GPUs and fallback to CPU for unimplemented ops in the backends, it is necessary to use ggml_backend_sched.

@JohannesGaessler (Collaborator, author)

I forgot: the current code also has an extension to the GGML backend interface with memset_tensor in order to clear specific tensors (since right now I think the only way to do it would be to allocate zeroed memory and invoke set_tensor).
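For context, a sketch of what such an interface extension could look like; the exact names and signature in this PR may differ:

// new function pointer in the backend buffer interface (signature illustrative):
// void (*memset_tensor)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
//                       uint8_t value, size_t offset, size_t size);

// this would allow clearing e.g. a gradient tensor directly on the backend,
// instead of uploading a zeroed host buffer via set_tensor:
ggml_backend_tensor_memset(grad, 0, 0, ggml_nbytes(grad)); // hypothetical public wrapper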

@JohannesGaessler (Collaborator, author)

The Adam optimizer needs to know the current iteration since it does a warmup. I'm currently passing this information via ggml_tensor.op_params, but the downside of this approach is that the iteration is duplicated across all tensors. At the same time I don't think it would be a good idea to add global state to the forward pass when right now all relevant information is encapsulated in ggml_tensor.
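For illustration, duplicating the iteration into each optimizer tensor might look roughly like this; the exact parameter layout is an assumption, not necessarily what this PR does:

// when building the optimizer ops, write the current iteration into the
// parameter block of each ggml_opt_step_adam tensor:
const int32_t iter = current_iteration; // assumed to be tracked by the caller
memcpy(opt_step->op_params, &iter, sizeof(iter));

// inside the op's compute function, read it back for the Adam bias correction:
int32_t it;
memcpy(&it, dst->op_params, sizeof(it));
const float beta1h = 1.0f/(1.0f - powf(0.9f, it)); // warmup-corrected first-moment scale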

@JohannesGaessler (Collaborator, author)

I pushed a working prototype for CUDA MNIST training/evaluation (fully connected only). Compared to PyTorch the training on my RTX 3090 is ~45x faster (1.25s vs. 56.58s) but with such a small model you're basically just measuring overhead. The CUDA evaluation is actually slower than the CPU evaluation, presumably because the model is too small to make GPU acceleration worthwhile given the additional overhead.

One issue that I still have is how to handle the combination of GGUF+backends other than CPU. Right now I'm allocating a temporary context that just stores the data in RAM but it feels kind of clunky. Is there a better way to do this?

@slaren (Collaborator) commented Sep 7, 2024

One issue that I still have is how to handle the combination of GGUF+backends other than CPU. Right now I'm allocating a temporary context that just stores the data in RAM but it feels kind of clunky. Is there a better way to do this?

Check the way the magika example does this: make a no_alloc gguf context, call ggml_backend_alloc_ctx_tensors, then load the data from file using gguf_get_tensor_offset. This way at least the whole file does not need to be loaded into memory.
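A sketch of that pattern, assuming a backend handle and a filename fname; error handling omitted:

// load metadata only; tensors are created in ctx but their data is not allocated
struct ggml_context * ctx = NULL;
struct gguf_init_params params = {
    /*.no_alloc =*/ true,
    /*.ctx      =*/ &ctx,
};
struct gguf_context * gguf = gguf_init_from_file(fname, params);

// allocate all tensors of ctx in a buffer of the target backend
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

// stream each tensor's data from the file into backend memory, one tensor at a time
FILE * f = fopen(fname, "rb");
for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
    const size_t nbytes = ggml_nbytes(t);
    void * data = malloc(nbytes); // staging buffer for a single tensor
    const int i = gguf_find_tensor(gguf, t->name);
    fseek(f, gguf_get_data_offset(gguf) + gguf_get_tensor_offset(gguf, i), SEEK_SET);
    fread(data, 1, nbytes, f);
    ggml_backend_tensor_set(t, data, 0, nbytes);
    free(data);
}
fclose(f);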

Review comment on src/ggml.c (outdated), lines 18774 to 18783:
for (int i = 0; i < gf->n_nodes; i++) {
    struct ggml_tensor * node = gf->nodes[i];

    // for each parameter tensor, append an Adam step to the backward graph
    if (node->flags & GGML_TENSOR_FLAG_PARAM) {
        GGML_PRINT_DEBUG("%s: found root node %p\n", __func__, (void *) node);
        struct ggml_tensor * opt_step = ggml_opt_step_adam(ctx, node, 1.0f, 0.001f, 0.9f, 0.999f, 1e-8f);
        ggml_build_forward_expand(gb, opt_step);
    }
}

@ggerganov (Owner) commented Sep 8, 2024

The overall design that I envision is that the optimizer is specified when creating the backwards graph. If no optimizer is specified, calculate the gradients without touching the weights. If an optimizer is specified, apply it to all parameters after the gradients have been calculated by adding an extra GGML op on top (could probably be optimized to overwrite gradients that are no longer needed).

Purely from an API point of view, it might be better to have separate calls that expand the graph with an optimizer computation (e.g. ggml_build_opt or something similar) that can optionally be called after ggml_build_backward.
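Something like the following, where ggml_build_opt is the hypothetical name suggested above, not an existing function:

struct ggml_cgraph * gb = ggml_graph_dup(ctx, gf);
ggml_build_backward_expand(ctx, gf, gb, /*keep =*/ true);
ggml_build_opt(ctx, gb, GGML_OPT_ADAM); // optional extra step, hypothetical API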

@JohannesGaessler (Collaborator, author)

Do you have an opinion on what to do with the current ggml_opt API? If we keep it, the addition of tensors for optimization could be done in ggml_opt_init.

@ggerganov (Owner)

I'm not really sure how good the design of the existing ggml_opt API is. I think we can afford to change it significantly, since it is not really adopted by other projects. We could even implement a new API in parallel and, once we know which one is better, remove the other.

ggml_opt_init adding the optimization graph sounds OK to me. The question is whether there would be use cases where you would want to create an optimizer but not immediately "apply" it to a graph. Maybe you might want to apply the same optimizer to multiple graphs? If there is some case like this, then a separate ggml_build_opt-like step might make sense.

@JohannesGaessler (Collaborator, author)

I think it's definitely better to have a separate call for adding the optimizer. That way gradient accumulation can be implemented relatively easily by defining one graph that calculates just the gradients and one that also invokes the optimizer.
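As a sketch of how the two graphs could be driven (names illustrative; gb_grad and gb_opt are assumed to have been built beforehand):

// gb_grad: backward graph that only accumulates gradients
// gb_opt:  the same graph plus the optimizer ops that update the weights
static void train_logical_batch(ggml_backend_t backend,
        struct ggml_cgraph * gb_grad, struct ggml_cgraph * gb_opt, int n_accum) {
    for (int j = 0; j < n_accum; j++) {
        if (j < n_accum - 1) {
            ggml_backend_graph_compute(backend, gb_grad); // accumulate only
        } else {
            ggml_backend_graph_compute(backend, gb_opt);  // accumulate + apply optimizer
        }
    }
}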

@@ -234,6 +234,7 @@ extern "C" {
GGML_API void ggml_backend_tensor_alloc(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * addr);
GGML_API void ggml_backend_view_init(struct ggml_tensor * tensor);

GGML_API bool ggml_backend_load_from_gguf(const char * fname, struct ggml_context * ctx_ggml, struct gguf_context * ctx_gguf);
@ggerganov (Owner)

Is this temporary? Seems like it does not belong to ggml-backend. More like a utility function in user code.

@JohannesGaessler (Collaborator, author)

What API should we use long-term for loading data from GGUF? I was thinking that, since the pattern for tensors created in program code is initialization -> backend allocation -> data setting, this would be the equivalent way to do it for GGUF.

@slaren (Collaborator)

There was some talk in llama.cpp about moving some of the loading code to ggml, including mmap support, so that other ggml applications can benefit from it. I am not sure how that API should look, though. It may be good to add this as a first step, but most likely it will need a different API to be able to achieve all the goals.

@JohannesGaessler (Collaborator, author)

I think for ggml_opt_step_adam the parameter sched is not needed. It's essentially a way to adjust the learning rate via a callback, but I think something like this should be done one level further up rather than in the tensors.

@JohannesGaessler (Collaborator, author)

I get comparable results between PyTorch and GGML in terms of training loss when I add the following two modifications: disable dataset shuffling for PyTorch and set the GGML physical batch size to 1000. The latter is a bug since by definition the physical batch size should have no effect beyond differences in rounding error. For the dataset shuffling I would have intuitively expected that this is only relevant for generalization but it seems that it also improves the rate at which the model gets better on the training set.

@JohannesGaessler (Collaborator, author) commented Sep 16, 2024

I figured out the problem: I incorrectly assumed that the ggml_tensor.grad pointers would be constant for my implementation. So while the original gradients are being used as input for the GGML_ADD tensors that are eventually used as gradients, they are never incremented and thus remain zero. The accumulation steps prior to the last one are effectively just being discarded. I think the fix will be to do in-place additions in ggml_compute_backward.

@ggerganov (Owner)

I figured out the problem: I incorrectly assumed that the ggml_tensor.grad pointers would be constant for my implementation. So while the original gradients are being used as input for the GGML_ADD tensors that are eventually used as gradients, they are never incremented and thus remain zero. The accumulation steps prior to the last one are effectively just being discarded. I think the fix will be to do in-place additions in ggml_compute_backward.

So if I understand correctly, the following call is basically a noop atm:

if ((iex0 + model.nbatch_physical) % model.nbatch_logical != 0) {
// For the first nbatch_logical/nbatch_physical - 1 iterations, only calculate gradients and accumulate them:
ggml_backend_graph_compute(model.backend, gb_grad);
} else {

The reason is that ggml_backend_graph_compute(model.backend, gb_opt); ends up using the gradients from the gb_opt graph, which so far haven't been updated. Instead we have been updating the gradients of the gb_grad graph.

I tried your idea, which I think is simply to set the inplace = true in ggml_add_or_set:

diff --git a/src/ggml.c b/src/ggml.c
index de61438..483a3b2 100644
--- a/src/ggml.c
+++ b/src/ggml.c
@@ -18129,7 +18129,7 @@ static struct ggml_tensor * ggml_add_or_set(struct ggml_context * ctx, struct gg
     if (ggml_hash_contains(zero_table, a)) {
         return b;
     } else {
-        return ggml_add_impl(ctx, a, b, false);
+        return ggml_add_impl(ctx, a, b, true);
     }
 }

But it seems we are still missing something, as the training accuracy dropped:

mnist_model_train: epoch 29 start...done, took 0.58s, train_loss=0.182481, train_acc=95.08%, val_loss=0.166181+-0.027298, train_acc=96.07+-0.35%

@JohannesGaessler (Collaborator, author)

To clarify the problem, I've pushed a WIP fix that works but has bad performance. The original gradients are initialized with zero and need to be incremented with the sum tensors after each accumulation step to get correct results. Unrelated to the accumulation problem, there are also two other issues: the wrong graph was being copied for gb_opt, and the execution of the forward graph is not needed because the backward graphs include all of its tensors (the latter only matters for performance).

@JohannesGaessler (Collaborator, author)

I tried your idea, which I think is simply to set the inplace = true in ggml_add_or_set:

The problem is the upper branch where the tensor is in the zero table. In that case there needs to be an in-place addition instead of a replacement. But so far I have not been able to make that work so there is likely still some other issue.

@JohannesGaessler (Collaborator, author)

Sorry, the supposed fix had two bugs that happened to cancel each other out.

@JohannesGaessler (Collaborator, author)

I pushed a proper fix. The correct handling of gradient accumulation needs some extra bookkeeping to track the gradients of parameters and whether they should be accumulated; I added a new tensor flag for this.
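Conceptually, ggml_add_or_set would then distinguish the two cases roughly as follows; the flag name is made up here, see the PR diff for the actual one:

// hypothetical flag marking gradients of parameters, whose zero-initialized
// accumulators must never be replaced by the summand:
if (ggml_hash_contains(zero_table, a) && !(a->flags & GGML_TENSOR_FLAG_ACCUM)) {
    return b;                              // plain gradient, first contribution: just take it
} else {
    return ggml_add_impl(ctx, a, b, true); // parameter gradient: accumulate in place
}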

@JohannesGaessler (Collaborator, author)

Actually, now that I think about it, it would maybe be better to do this via a hash set instead of via tensor modification, since whether or not a gradient should be accumulated is a property of the compute graph rather than of the gradient tensor. But by that logic the existing code in ggml_build_backward that modifies the tensors is also bad.

Also: there was some inconsistent use of ggml_cgraph.nodes[i]->grad vs. ggml_cgraph.grads[i] that was causing problems.

@ggerganov (Owner) left a review:

Going back to an earlier comment by @slaren:

Something to consider is that to support multiple GPUs and fallback to CPU for unimplemented ops in the backends, it is necessary to use ggml_backend_sched.

Should we attempt to do this within this PR or after we merge the existing changes? Do we see any obstacles to achieving this?

Overall, I think the changes are quite good. I'm not familiar with other training codebases, so I'm not sure if we are missing something obvious from a functionality perspective.

@JohannesGaessler (Collaborator, author)

My priorities: while ggml_backend_sched would be nice to have, I think it's more important to properly define datasets with functionality such as data shuffling and asynchronous data pre-loading (long-term probably also GGUF support for very large datasets that don't fit in RAM). Using such datasets I would then write a higher-level API that trains a feed-forward neural network given a dataset and compute graph (and optionally labels) as input. In that high-level API I would then start using ggml_backend_sched.

Right now I have a prototype for a dataset in user space.

@slaren (Collaborator) commented Sep 20, 2024

The issue with ggml_backend_sched was more relevant when the plan was to pass a backend to the opt functions. Now that they are ggml ops, it is entirely up to the user code whether to use the scheduler or not.

@JohannesGaessler merged commit e7b2390 into ggerganov:master on Sep 20, 2024. 4 checks passed.
@slaren (Collaborator) commented Sep 20, 2024

Re asynchronous data loading: this may already be obvious, but you should look at ggml_backend instances as streams, and thus if you want to upload data while something else is running, this should be done by creating a new ggml_backend instance and using ggml_backend_tensor_set_async. I will make changes that will make this distinction more clear in the future by adding new objects to represent backends and backend devices, and eventually the current ggml_backend objects will be renamed to something like ggml_backend_stream.
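A sketch of what that could look like with the CUDA backend; t_input_next, data_next, and gb are assumed to exist, and whether two instances on the same device map to separate streams is backend-specific, per the above:

// one instance for compute, a second one that serves as a copy "stream"
ggml_backend_t backend        = ggml_backend_cuda_init(0);
ggml_backend_t backend_upload = ggml_backend_cuda_init(0); // same device

// start uploading the next batch without blocking the compute stream
ggml_backend_tensor_set_async(backend_upload, t_input_next, data_next, 0, ggml_nbytes(t_input_next));

// meanwhile, evaluate the current graph
ggml_backend_graph_compute(backend, gb);

// ensure the upload has finished before the next iteration reads t_input_next
ggml_backend_synchronize(backend_upload);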
