implement simple memory "working space" #138
Comments
I'd rather see a full clean-up of the data structures being used to pass data among different kernels of the same producer, and across different producers, first.
I fully agree that we will eventually need some more dynamic approach for the device memory. Let me ask for a clarification to see if I understood correctly:
Does "stream" refer to the EDM stream in all these cases? So essentially each EDProducer in the beginning of
Not really: to the CUDA stream, but I think it applies to the EDM stream as well.
Yes, in reality there is no need to communicate anything back to anybody: an EDProducer just receives a pointer to the arena and its size. A utility class will "help" to allocate memory for the required data structures (just to avoid trivial errors of byte counting).
So this is only to avoid calling
Yes. My understanding is that it is expensive. If not, there is no need for such an old-school solution.
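For context, that cost can be checked directly with a micro-benchmark along these lines (a minimal sketch, not from this thread; the iteration count and buffer size are arbitrary):

```cuda
// Time repeated cudaMalloc/cudaFree calls to see whether per-event
// allocations are actually expensive.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // warm up the CUDA context so the first allocation does not skew the timing
  void *warmup = nullptr;
  cudaMalloc(&warmup, 1);
  cudaFree(warmup);

  constexpr int kIterations = 1000;
  constexpr size_t kBytes = 32 * 1024 * 1024;  // a "working space"-sized chunk

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIterations; ++i) {
    void *ptr = nullptr;
    cudaMalloc(&ptr, kBytes);
    cudaFree(ptr);
  }
  auto stop = std::chrono::high_resolution_clock::now();

  auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::printf("average cudaMalloc+cudaFree: %.1f us\n", double(us) / kIterations);
  return 0;
}
```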
Definitely NOT. Data shared across different EDM modules are "GPU specific" event data.
OK, understood, thanks.
Thanks for the clarifications. Below I'm thinking out loud.

Currently each EDProducer has a CUDA stream for each EDM stream. In #100 a "chain of modules" in an EDM stream share a CUDA stream. So there the assumption "only one module per EDM stream doing GPU work" may not hold anymore. Well, actually it does not hold in the current system either, as there is nothing preventing two independent EDProducers (doing GPU work) in a single EDM stream from being run in parallel. But do we actually have to tie the "workspace" to EDM/CUDA streams? Couldn't we (rather easily) go one step further and provide

In the long term I'm a bit concerned about the very different allocation mechanisms between the "workspace" and the "products to event". I'm sure we can manage it, but it is an additional source of easy mistakes.

In the context of #100 another downside is that naively it prevents the "streaming mode" (if we will ever really make use of it...) because the EDProducer should not "return" before all the kernels have finished. A possible way to overcome this limitation would be to enqueue a callback function for the release of the workspace onto the CUDA stream after the kernels. I believe these details could even be abstracted behind
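That callback idea could look roughly like this (a sketch only; `Workspace` and `releaseWorkspace()` are hypothetical names, not an existing API):

```cuda
// Enqueue a host callback after the kernels so the workspace is released only
// once all work queued on the CUDA stream has completed, letting the
// EDProducer return without synchronising.
#include <cuda_runtime.h>

struct Workspace {
  void *ptr;
  size_t size;
};

// hypothetical: return the workspace to a per-stream pool for reuse
void releaseWorkspace(Workspace *ws) { /* ... */ }

void CUDART_CB releaseCallback(cudaStream_t stream, cudaError_t status, void *data) {
  releaseWorkspace(static_cast<Workspace *>(data));
}

void produce(cudaStream_t stream, Workspace *ws) {
  // ... launch kernels that use ws->ptr on `stream` ...
  cudaStreamAddCallback(stream, releaseCallback, ws, 0);
  // the producer can now return; the workspace is released asynchronously
  // when the stream reaches the callback
}
```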
Rather than reinventing a (simple) memory allocator, what about reusing something like https://github.com/FelixPetriconi/AllocatorBuilder ?
As said in my original posting: if one finds something that suits our needs and fits our framework, I am not against it.
I took a look at a couple of options.
So far I haven't encountered anything that sounds like a perfect fit for us. That may be because I don't know (or have a wrong idea of) what exactly we want. Some random thoughts below.
I think we should test the "CUB caching allocator". I will try to implement it in one of my "unit" tests...
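For reference, using it looks roughly like this (a usage sketch; the bin parameters are illustrative, not a recommendation):

```cuda
// cub::CachingDeviceAllocator hands out cached blocks per size bin; a block
// freed on the host side stays associated with its CUDA stream and can be
// reused immediately by that stream, or by another stream once the work
// queued up to the free point has completed.
#include <cstddef>
#include <cub/util_allocator.cuh>

// bins grow geometrically by a factor 2, from 2^8 = 256 B up to 2^30 = 1 GiB;
// requests larger than the largest bin are not cached
cub::CachingDeviceAllocator allocator(/*bin_growth=*/2,
                                      /*min_bin=*/8,
                                      /*max_bin=*/30,
                                      /*max_cached_bytes=*/std::size_t(1) << 31);

void exampleUse(cudaStream_t stream, std::size_t bytes) {
  void *d_tmp = nullptr;
  allocator.DeviceAllocate(&d_tmp, bytes, stream);  // reuses a cached block if one is available
  // ... launch kernels using d_tmp on `stream` ...
  allocator.DeviceFree(d_tmp);  // returns the block to the cache, still tied to `stream`
}
```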
@VinInn
I already started to prototype during the hackathon (an API within CUDAService and its use in raw2cluster). If there is interest, I can share the API already before completing the prototype (when I get a decent Internet connection).
On 10 September 2018 10:15:17 CEST, Vincenzo Innocente wrote:
> I think we should test the "CUB caching allocator".
> From the description it behaves as a typical allocator in limited memory.
> It will suffer from "random-time garbage collection": constant latency is not a real requirement for us though.
> I will try to implement it in one of my "unit" tests...
@makortel, |
My experiment with the
Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at its code.

For large memory chunks (bigger than the largest bin):
For small memory chunks (up to the size of the largest bin):
Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):
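A rough sketch of that pattern, assuming hypothetical `allocate()`/`deallocate()` wrappers around the caching allocator (not necessarily the original pseudocode):

```cuda
// Because work queued on a CUDA stream executes in order, the host-side
// "free" can be issued right after the kernels are queued, without waiting
// for them to finish.
#include <cstddef>
#include <cuda_runtime.h>

// hypothetical wrappers: a real version would go through the caching allocator
void *allocate(std::size_t bytes, cudaStream_t stream);
void deallocate(void *ptr, cudaStream_t stream);

__global__ void fillKernel(float *buf, std::size_t n) {
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 0.f;
}

void produce(cudaStream_t stream, std::size_t n) {
  auto *d_tmp = static_cast<float *>(allocate(n * sizeof(float), stream));
  unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  fillKernel<<<blocks, 256, 0, stream>>>(d_tmp, n);
  // host-side "free" right after queuing the kernel: the block becomes
  // reusable by this CUDA stream immediately, and by other streams only once
  // the queued work has reached this point
  deallocate(d_tmp, stream);
}
```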
Here
If the allocator is replaced by direct calls to
I think we should settle on the semantics we want, and then update the allocator to make it consistent with them. I can think of three options:
I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.
A few more things to consider:
Here is an attempt at sketching a possible behaviour of the
The key points are that
I came across Umpire (https://github.com/LLNL/Umpire), which is a "resource management library that allows the discovery, provision, and management of memory on next-generation architectures". I took a quick look, but am not really convinced (e.g. I didn't see any notes about asynchronous copies).
On 26 Dec, 2018, at 8:26 PM, Matti Kortelainen wrote:
> I came across Umpire (https://github.com/LLNL/Umpire), which is a "resource management library that allows the discovery, provision, and management of memory on next-generation architectures". I took a quick look, but am not really convinced (e.g. I didn't see any notes about asynchronous copies).

One may enquire with people we know at LLNL...

v.
I noticed that cutorch (https://github.com/torch/cutorch/) is using a caching allocator for both device and pinned host memory. The exact logic is different from CUB's, but their allocator also considers device/host memory in use until all operations queued on a CUDA stream at the point of the host-side free have finished.
I do like that it is supposed to be a drop-in replacement for
Hmm, they write
which makes me wonder about the failure of #205. If cutorch's caching allocators use a similar logic to CUB's caching allocator, keeping the host-side-freed memory reserved until a CUDA event recorded on a stream at the point of the host-side free has occurred (they actually go a bit beyond CUB's allocator by having a mechanism to do the same for all CUDA streams that read from the memory block, in addition to the one that was associated at the time of allocation; we'd actually need that too for non-
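The common core of both allocators is roughly this kind of bookkeeping (a simplified, single-threaded sketch of the CUB-style single-associated-stream logic, not the actual code of either library):

```cuda
// At host-side free an event is recorded on the associated stream; the block
// is handed out again only to the same stream, or to another stream once
// that event has completed.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

struct CachedBlock {
  void *ptr;
  std::size_t bytes;
  cudaStream_t associated_stream;
  cudaEvent_t ready_event;  // recorded at host-side free
};

std::vector<CachedBlock> cache;  // single-threaded sketch, no locking

CachedBlock cachedAllocate(std::size_t bytes, cudaStream_t stream) {
  for (auto it = cache.begin(); it != cache.end(); ++it) {
    bool sameStream = (it->associated_stream == stream);
    bool streamPassedFree = (cudaEventQuery(it->ready_event) == cudaSuccess);
    if (it->bytes >= bytes && (sameStream || streamPassedFree)) {
      CachedBlock block = *it;
      cache.erase(it);
      block.associated_stream = stream;  // the block is now tied to this stream
      return block;
    }
  }
  CachedBlock block{nullptr, bytes, stream, nullptr};
  cudaMalloc(&block.ptr, bytes);  // cache miss: fall back to a real allocation
  cudaEventCreateWithFlags(&block.ready_event, cudaEventDisableTiming);
  return block;
}

void cachedFree(const CachedBlock &block) {
  // the stream may still have queued work using block.ptr; record an event so
  // the block is not handed to a different stream before that work completes
  cudaEventRecord(block.ready_event, block.associated_stream);
  cache.push_back(block);
}
```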
Adding here from #306 (comment) by @fwyzard
Interesting, thanks for sharing. Do you know if there is any documentation beyond the (rather terse) README (and code, of course)?
Unfortunately no, it was mentioned only en passant.
I haven't had time to investigate further at all, but ArrayFire (https://github.com/arrayfire/arrayfire) appears to have some sort of memory manager.
There is now some more information on the RAPIDS Memory Manager. It could be interesting to give it a try at some point.
Just to document here as well, @fwyzard gave it a try. The results were, for the caching allocator:

and for
Many GPU algorithms require a global data structure as a "working space".
In some cases these data structures are used to communicate between the various kernels that compose a more complex algorithm (encapsulated in an EDProducer).
At the moment these data structures are allocated at the beginning of the job by each EDProducer.
It should be easy to create just one arena, large enough for the greediest algorithm, and then allocate those data structures in it. The arena will be local to each stream (as the current data structures are).
Concurrent access is not possible as kernels are sequential in each stream.
No host-to-device or device-to-host memcpy will be supported (even though it should be safe, as again any previous activity must have finished before those operations can be scheduled on the stream).
The interface would be trivial: an "init" (or clear, or acquire) method that zeroes the allocated-memory counter, and an "alloc(nBytes)" method that returns a pointer (8-byte aligned?) and increments the counter.
It throws bad_alloc if the preallocated working space is too small.
In principle the counter can be local; the only global quantities are the pointer to the working space and its size.
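A minimal sketch of such an interface (illustrative names, not an actual implementation):

```cuda
// The arena wraps a preallocated device buffer: clear() resets the counter
// and alloc() hands out 8-byte-aligned slices of the buffer.
#include <cstddef>
#include <new>  // std::bad_alloc

class Workspace {
public:
  Workspace(void *buffer, std::size_t capacity)
      : buffer_(static_cast<char *>(buffer)), capacity_(capacity), used_(0) {}

  // "init" / "clear" / "acquire": forget all previous allocations
  void clear() { used_ = 0; }

  // hand out the next nBytes, rounded up to 8-byte alignment
  void *alloc(std::size_t nBytes) {
    std::size_t aligned = (nBytes + 7) & ~std::size_t(7);
    if (used_ + aligned > capacity_)
      throw std::bad_alloc();  // preallocated working space is too small
    void *ptr = buffer_ + used_;
    used_ += aligned;
    return ptr;
  }

private:
  char *buffer_;          // pointer to the preallocated device arena
  std::size_t capacity_;  // size of the arena
  std::size_t used_;      // local counter of bytes handed out so far
};
```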
This scheme can go wrong only if we allow independent EDProducers that launch kernels on the same stream to be scheduled concurrently (on different CPU threads). This is in principle safe in itself, as the memcpys and the kernels will happily queue in the CUDA stream while the parent EDProducers continue their async activity. In case of a shared working space, however, they would overwrite each other's data structures (unless each algorithm is made of just one kernel and no memcpy is allowed in the working space).
Any more complex solution would immediately require a fully fledged malloc with a garbage collector, etc.