
ggml/ex: ref. CEL, ggml_backend_sched for MNIST #976

Merged

Conversation

JohannesGaessler
Copy link
Collaborator

This PR refactors the CPU implementation of cross entropy loss to avoid false sharing (from the partial sums being in the same cache line) as well as potential issues with the loop variables being int. It also adds ggml_backend_sched support for the MNIST example.
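
The false-sharing fix described above can be illustrated with a small sketch (this is an illustration, not the actual ggml code): each thread's partial sum gets its own cache-line-sized slot, so concurrent writes never contend on the same line, and the reduction loop uses int64_t rather than int.

```c
// Hypothetical sketch of the false-sharing fix: pad each per-thread
// partial sum out to a full cache line and use int64_t loop counters.
#include <stdint.h>

#define CACHE_LINE_SIZE 64  // assumption: 64-byte cache lines (typical on x86)

typedef struct {
    float sum;                                   // per-thread partial sum
    char  pad[CACHE_LINE_SIZE - sizeof(float)];  // pad to a full cache line
} padded_partial_sum;

// Reduce the per-thread partials into a single value after the
// parallel section has finished.
float reduce_partials(const padded_partial_sum * partials, int64_t n_threads) {
    float total = 0.0f;
    for (int64_t i = 0; i < n_threads; i++) {  // int64_t, not int
        total += partials[i].sum;
    }
    return total;
}
```

With the padding, `sizeof(padded_partial_sum)` is exactly one cache line, so `partials[i]` and `partials[i+1]` can never share a line.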

JohannesGaessler force-pushed the mnist-backend-sched branch 2 times, most recently from 9216056 to caf425b on September 30, 2024.
@JohannesGaessler
Copy link
Collaborator Author

There seems to be an issue with CPU MNIST training where the validation loss is much too low.

@JohannesGaessler
Copy link
Collaborator Author

I think I figured out the problem:

enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
    if (!sched->is_reset && !sched->is_alloc) {
        ggml_backend_sched_reset(sched);
    }

    if (!sched->is_alloc) {
        if (!ggml_backend_sched_alloc_graph(sched, graph)) {
            return GGML_STATUS_ALLOC_FAILED;
        }
    }

    return ggml_backend_sched_compute_splits(sched);
}

In mnist-common.cpp the call ggml_backend_sched_alloc_graph(model.backend_sched, gb_opt) allocates a graph that does the forward pass, the backward pass, and the optimizer step. But because this sets is_alloc to true, the above code in ggml-backend.c then essentially discards whatever graph you pass it and executes the graph that was used for allocation. I assume this is not the intended behavior; what do you think would be the correct way to fix this?

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

You need to call ggml_backend_sched_reset when changing graphs.
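
The suggested call pattern would look roughly like this (a sketch only; `model.backend_sched` and `gf` are placeholder names from the MNIST example context, and error handling is omitted):

```c
// Hypothetical fragment: reset the scheduler before switching from the
// previously allocated graph (e.g. gb_opt) to a different one (gf).
ggml_backend_sched_reset(model.backend_sched);            // clear the previous allocation
ggml_backend_sched_alloc_graph(model.backend_sched, gf);  // allocate the new graph
ggml_backend_sched_graph_compute(model.backend_sched, gf);
```

The reset clears is_alloc, so the compute call no longer falls through to the stale graph that was used for allocation.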

@JohannesGaessler
Copy link
Collaborator Author

Adding calls to ggml_backend_sched_reset breaks CUDA training. I'm not sure why; it probably has to do with my code expecting persistent tensor data after resetting and allocating a different graph. How about this: extend the logic of ggml_backend_sched_graph_compute such that, when a subgraph of the allocated graph is passed (i.e. the nodes of the subgraph are a prefix of the nodes of the allocated graph), only the nodes in the subgraph are executed.

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

I don't think that's feasible without adding too much complexity, ggml_backend_sched_graph_compute doesn't work with the original graph directly, it works with a heavily processed version of it split into multiple parts per-backend. Can you point me to the code that does that?

@JohannesGaessler
Copy link
Collaborator Author

The instances where I think a reallocation could maybe be causing problems:

  • When adding gradients the model weights become nodes rather than leaves and thus can potentially be re-allocated (I think). The optimizer momenta are leaves and should not be causing problems.
  • ggml_graph_reset sets the gradients of loss tensors to 1 (once, initially), if those gradients are re-allocated then the gradients of the entire graph are essentially undefined.
  • When using gradient accumulation the gradients need persistent data between calls to gb_grad and gb_opt.

Do you think it would be feasible to allocate separate graphs for the forward pass, the backward pass, and the optimizer step?

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

Everything can be reallocated between graph evaluations, the goal of ggml-alloc (the ggml_backend_sched is just using ggml-alloc internally) is to reuse the same buffers between different graph evaluations without requiring new allocations. Fundamentally it is not possible to use tensors from a previous graph evaluation in the next graph with the same ggml-alloc, since the tensors of the new graph will overwrite the tensors of the previous graph (or worse, the buffer they were allocated in may have been freed if ggml-alloc needs to increase its buffer size).

I can see two ways to handle this:

  • Pre-allocate the tensors that need to be kept in a static buffer. Generally to do this you would use a different ggml_context to create these tensors and then use ggml_backend_alloc_ctx_tensors to allocate them.
  • Flag all the tensors that need to be kept as graph outputs, and use a different instance of ggml-alloc for each graph. You can use the outputs of a graph in the next graph, as long as each graph uses a different instance of ggml-alloc. This will work with the base ggml-alloc with a single backend, but in complex scenarios with ggml_backend_sched where tensors may be copied between backends it is a lot harder to ensure that this will work in every case.
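
The first option might be sketched as follows (hypothetical fragment; `n_persistent`, `ne`, and `backend` are placeholder names, and the context is never freed here):

```c
// Hypothetical sketch: tensors that must survive between graph
// evaluations (e.g. accumulated gradients) are created in their own
// ggml_context and backed by a static buffer, outside ggml-alloc.
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * n_persistent,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,  // data lives in the backend buffer below
};
struct ggml_context * ctx_static = ggml_init(params);

struct ggml_tensor * grad_acc = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, ne);

// Allocate all tensors of ctx_static in one static backend buffer;
// ggml-alloc will never move or overwrite them.
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_static, backend);
```

Because these tensors live in their own buffer, resetting or reallocating the scheduler between graphs cannot clobber them.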

@JohannesGaessler
Copy link
Collaborator Author

separate graphs for the forward pass, the backward pass, and the optimizer step

Actually, that won't work. Both the forward and backward pass need to be in the same graph because the backward pass may need to use data from the forward pass that cannot be overwritten until then.

@slaren
Copy link
Collaborator

slaren commented Oct 1, 2024

I think you could have a separate graph and ggml-alloc (or sched) for the optimizer only. To do so, you would have to flag all the inputs to the optimizer as graph outputs.
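
A sketch of that flagging step (hypothetical; `weights` and `n_weights` are placeholder names, and this assumes gradients are reachable via the tensor's grad pointer as in the ggml of that time):

```c
// Hypothetical sketch: before allocating the forward+backward graph,
// flag every tensor the optimizer step will read (the weights and
// their gradients) as graph outputs so ggml-alloc keeps them alive
// until the separate optimizer graph has run.
for (int i = 0; i < n_weights; i++) {
    ggml_set_output(weights[i]);
    ggml_set_output(weights[i]->grad);
}
```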

@JohannesGaessler
Copy link
Collaborator Author

I was not able to nail down the exact problem with moderate effort. For now I'll just revert the MNIST changes; once I have a higher-level API for training I'll adapt and expand test-opt (comparing tensor changes against analytical expectations) and debug the use of ggml_backend_sched with that.

@JohannesGaessler JohannesGaessler merged commit 7cdb891 into ggerganov:master Oct 2, 2024
4 checks passed