
ggml/ex: ref. CEL, ggml_backend_sched for MNIST #976

Merged

Conversation

JohannesGaessler
Copy link
Collaborator

This PR refactors the CPU implementation of cross entropy loss to avoid false sharing (from the partial sums being in the same cache line) as well as potential issues with the loop variables being int. It also adds ggml_backend_sched support for the MNIST example.
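
The false-sharing fix described above can be illustrated with a small sketch (this is an illustration, not the actual ggml code): each thread's partial sum gets its own cache-line-sized slot, so concurrent writes never contend on the same line, and the reduction loop uses int64_t rather than int.

```c
// Hypothetical sketch of the false-sharing fix: pad each per-thread
// partial sum out to a full cache line and use int64_t loop counters.
#include <stdint.h>

#define CACHE_LINE_SIZE 64  // assumption: 64-byte cache lines (typical on x86)

typedef struct {
    float sum;                                   // per-thread partial sum
    char  pad[CACHE_LINE_SIZE - sizeof(float)];  // pad to a full cache line
} padded_partial_sum;

// Reduce the per-thread partials into a single value after the
// parallel section has finished.
float reduce_partials(const padded_partial_sum * partials, int64_t n_threads) {
    float total = 0.0f;
    for (int64_t i = 0; i < n_threads; i++) {  // int64_t, not int
        total += partials[i].sum;
    }
    return total;
}
```

With the padding, `sizeof(padded_partial_sum)` is exactly one cache line, so `partials[i]` and `partials[i+1]` can never share a line.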

JohannesGaessler force-pushed the mnist-backend-sched branch 2 times, most recently from 9216056 to caf425b on September 30, 2024.
@JohannesGaessler
Copy link
Collaborator Author

There seems to be an issue with CPU MNIST training where the validation loss is much too low.

@JohannesGaessler
Copy link
Collaborator Author

I think I figured out the problem:

enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
    if (!sched->is_reset && !sched->is_alloc) {
        ggml_backend_sched_reset(sched);
    }

    if (!sched->is_alloc) {
        if (!ggml_backend_sched_alloc_graph(sched, graph)) {
            return GGML_STATUS_ALLOC_FAILED;
        }
    }

    return ggml_backend_sched_compute_splits(sched);
}

In mnist-common.cpp the call ggml_backend_sched_alloc_graph(model.backend_sched, gb_opt) allocates a graph that does the forward pass, the backward pass, and the optimizer step. But because this sets is_alloc to true, the above code in ggml-backend.c then essentially discards whatever graph you pass it and executes the graph that was used for allocation. I assume this is not the intended behavior; what do you think would be the correct way to fix this?

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

You need to call ggml_backend_sched_reset when changing graphs.
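
The suggested call pattern would look roughly like this (a sketch only; `model.backend_sched` and `gf` are placeholder names from the MNIST example context, and error handling is omitted):

```c
// Hypothetical fragment: reset the scheduler before switching from the
// previously allocated graph (e.g. gb_opt) to a different one (gf).
ggml_backend_sched_reset(model.backend_sched);            // clear the previous allocation
ggml_backend_sched_alloc_graph(model.backend_sched, gf);  // allocate the new graph
ggml_backend_sched_graph_compute(model.backend_sched, gf);
```

The reset clears is_alloc, so the compute call no longer falls through to the stale graph that was used for allocation.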

@JohannesGaessler
Copy link
Collaborator Author

Adding calls to ggml_backend_sched_reset breaks CUDA training. I'm not sure why; it probably has to do with my code expecting persistent tensor data after resetting and allocating a different graph. How about this: extend the logic of ggml_backend_sched_graph_compute such that, when a subgraph of the allocated graph is passed (i.e. the nodes of the subgraph are a prefix of the nodes of the allocated graph), only the nodes in the subgraph are executed.

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

I don't think that's feasible without adding too much complexity, ggml_backend_sched_graph_compute doesn't work with the original graph directly, it works with a heavily processed version of it split into multiple parts per-backend. Can you point me to the code that does that?

@JohannesGaessler
Copy link
Collaborator Author

The instances where I think a reallocation could maybe be causing problems:

  • When adding gradients the model weights become nodes rather than leaves and thus can potentially be re-allocated (I think). The optimizer momenta are leaves and should not be causing problems.
  • ggml_graph_reset sets the gradients of loss tensors to 1 (once, initially), if those gradients are re-allocated then the gradients of the entire graph are essentially undefined.
  • When using gradient accumulation the gradients need persistent data between calls to gb_grad and gb_opt.

Do you think it would be feasible to allocate separate graphs for the forward pass, the backward pass, and the optimizer step?

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

Everything can be reallocated between graph evaluations, the goal of ggml-alloc (the ggml_backend_sched is just using ggml-alloc internally) is to reuse the same buffers between different graph evaluations without requiring new allocations. Fundamentally it is not possible to use tensors from a previous graph evaluation in the next graph with the same ggml-alloc, since the tensors of the new graph will overwrite the tensors of the previous graph (or worse, the buffer they were allocated in may have been freed if ggml-alloc needs to increase its buffer size).

I can see two ways to handle this:

  • Pre-allocate the tensors that need to be kept in a static buffer. Generally to do this you would use a different ggml_context to create these tensors and then use ggml_backend_alloc_ctx_tensors to allocate them.
  • Flag all the tensors that need to be kept as graph outputs, and use a different instance of ggml-alloc for each graph. You can use the outputs of a graph in the next graph, as long as each graph uses a different instance of ggml-alloc. This will work with the base ggml-alloc with a single backend, but in complex scenarios with ggml_backend_sched where tensors may be copied between backends it is a lot harder to ensure that this will work in every case.
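
The first option might be sketched as follows (hypothetical fragment; `n_persistent`, `ne`, and `backend` are placeholder names, and the context is never freed here):

```c
// Hypothetical sketch: tensors that must survive between graph
// evaluations (e.g. accumulated gradients) are created in their own
// ggml_context and backed by a static buffer, outside ggml-alloc.
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * n_persistent,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,  // data lives in the backend buffer below
};
struct ggml_context * ctx_static = ggml_init(params);

struct ggml_tensor * grad_acc = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, ne);

// Allocate all tensors of ctx_static in one static backend buffer;
// ggml-alloc will never move or overwrite them.
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_static, backend);
```

Because these tensors live in their own buffer, resetting or reallocating the scheduler between graphs cannot clobber them.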

@JohannesGaessler
Copy link
Collaborator Author

separate graphs for the forward pass, the backward pass, and the optimizer step

Actually, that won't work. Both the forward and backward pass need to be in the same graph because the backward pass may need to use data from the forward pass that cannot be overwritten until then.

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

I think you could have a separate graph and ggml-alloc (or sched) for the optimizer only. To do so, you would have to flag all the inputs to the optimizer as graph outputs.
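
A sketch of that flagging step (hypothetical; `weights` and `n_weights` are placeholder names, and this assumes gradients are reachable via the tensor's grad pointer as in the ggml of that time):

```c
// Hypothetical sketch: before allocating the forward+backward graph,
// flag every tensor the optimizer step will read (the weights and
// their gradients) as graph outputs so ggml-alloc keeps them alive
// until the separate optimizer graph has run.
for (int i = 0; i < n_weights; i++) {
    ggml_set_output(weights[i]);
    ggml_set_output(weights[i]->grad);
}
```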

@JohannesGaessler
Copy link
Collaborator Author

I was not able to nail down the exact problem with moderate effort. For now I'll just revert the MNIST changes; once I have a higher-level API for training I'll adapt and expand test-opt (comparing tensor changes against analytical expectations) and debug the use of ggml_backend_sched with that.

@JohannesGaessler JohannesGaessler merged commit 7cdb891 into ggerganov:master Oct 2, 2024
4 checks passed