implement simple memory "working space" #138
Comments
I'd rather see a full clean-up of the data structures being used to pass data among different kernels of the same producer, and across different producers, first.
I fully agree that we will eventually need some more dynamic approach for the device memory. Let me ask for a clarification to see if I understood correctly:
Does "stream" refer to the EDM stream in all these cases? So essentially each EDProducer in the beginning of
Not really: to the CUDA stream, but I think it applies to the EDM stream as well.
Yes, in reality there is no need to communicate anything back to anybody: an EDProducer just receives a pointer to the arena and its size. A utility class will "help" to allocate memory for the required data structures (just to avoid trivial errors of byte counting).
So this is only to avoid calling
Yes. My understanding is that it is expensive. If not, there is no need for such an old-school solution.
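For context, that cost can be checked directly with a micro-benchmark along these lines (a minimal sketch, not from this thread; the iteration count and buffer size are arbitrary):

```cuda
// Time repeated cudaMalloc/cudaFree calls to see whether per-event
// allocations are actually expensive.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // warm up the CUDA context so the first allocation does not skew the timing
  void *warmup = nullptr;
  cudaMalloc(&warmup, 1);
  cudaFree(warmup);

  constexpr int kIterations = 1000;
  constexpr size_t kBytes = 32 * 1024 * 1024;  // a "working space"-sized chunk

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIterations; ++i) {
    void *ptr = nullptr;
    cudaMalloc(&ptr, kBytes);
    cudaFree(ptr);
  }
  auto stop = std::chrono::high_resolution_clock::now();

  auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::printf("average cudaMalloc+cudaFree: %.1f us\n", double(us) / kIterations);
  return 0;
}
```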
Definitely NOT. Data shared across different EDM modules are "GPU specific" event data.
OK, understood, thanks.
Thanks for the clarifications. Below I'm thinking out loud.

Currently each EDProducer has a CUDA stream for each EDM stream. In #100 a "chain of modules" in an EDM stream share a CUDA stream. So there the assumption "only one module per EDM stream doing GPU work" may not hold anymore. Well, actually it does not hold in the current system either, as there is nothing preventing two independent EDProducers (doing GPU work) in a single EDM stream from being run in parallel. But do we actually have to tie the "workspace" to EDM/CUDA streams? Couldn't we (rather easily) go one step further and provide

In the long term I'm a bit concerned about the very different allocation mechanisms between the "workspace" and the "products to event". I'm sure we can manage it, but it is an additional source of easy mistakes.

In the context of #100 another downside is that naively it prevents the "streaming mode" (if we will ever really make use of it...) because the EDProducer should not "return" before all the kernels have finished. A possible way to overcome this limitation would be to enqueue a callback function for the release of the workspace onto the CUDA stream after the kernels. I believe these details could even be abstracted behind
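That callback idea could look roughly like this (a sketch only; `Workspace` and `releaseWorkspace()` are hypothetical names, not an existing API):

```cuda
// Enqueue a host callback after the kernels so the workspace is released only
// once all work queued on the CUDA stream has completed, letting the
// EDProducer return without synchronising.
#include <cuda_runtime.h>

struct Workspace {
  void *ptr;
  size_t size;
};

// hypothetical: return the workspace to a per-stream pool for reuse
void releaseWorkspace(Workspace *ws) { /* ... */ }

void CUDART_CB releaseCallback(cudaStream_t stream, cudaError_t status, void *data) {
  releaseWorkspace(static_cast<Workspace *>(data));
}

void produce(cudaStream_t stream, Workspace *ws) {
  // ... launch kernels that use ws->ptr on `stream` ...
  cudaStreamAddCallback(stream, releaseCallback, ws, 0);
  // the producer can now return; the workspace is released asynchronously
  // when the stream reaches the callback
}
```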
Rather than reinventing a (simple) memory allocator, what about reusing something like https://github.com/FelixPetriconi/AllocatorBuilder ?
As said in my original posting: if one finds something that suits our needs and fits our framework, I am not against it.
I took a look at a couple of options.
So far I haven't encountered anything that sounds like a perfect fit for us. That may be because I don't know (or have a wrong idea of) what exactly we want. Some random thoughts below.
I think we should test the "CUB caching allocator". I will try to implement it in one of my "unit" tests...
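For reference, using it looks roughly like this (a usage sketch; the bin parameters are illustrative, not a recommendation):

```cuda
// cub::CachingDeviceAllocator hands out cached blocks per size bin; a block
// freed on the host side stays associated with its CUDA stream and can be
// reused immediately by that stream, or by another stream once the work
// queued up to the free point has completed.
#include <cstddef>
#include <cub/util_allocator.cuh>

// bins grow geometrically by a factor 2, from 2^8 = 256 B up to 2^30 = 1 GiB;
// requests larger than the largest bin are not cached
cub::CachingDeviceAllocator allocator(/*bin_growth=*/2,
                                      /*min_bin=*/8,
                                      /*max_bin=*/30,
                                      /*max_cached_bytes=*/std::size_t(1) << 31);

void exampleUse(cudaStream_t stream, std::size_t bytes) {
  void *d_tmp = nullptr;
  allocator.DeviceAllocate(&d_tmp, bytes, stream);  // reuses a cached block if one is available
  // ... launch kernels using d_tmp on `stream` ...
  allocator.DeviceFree(d_tmp);  // returns the block to the cache, still tied to `stream`
}
```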
@VinInn
I already started to prototype during the hackathon (an API within CUDAService and its use in raw2cluster). If there is interest, I can share the API already before completing the prototype (when I get a decent Internet connection).
On 10 September 2018 10:15:17 CEST, Vincenzo Innocente wrote:
> I think we should test the "CUB caching allocator".
> From the description it behaves as a typical allocator in limited memory.
> It will suffer from "random-time garbage collection": constant latency is not a real requirement for us though.
> I will try to implement it in one of my "unit" tests...
@makortel, |
My experiment with the
Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at its code.

For large memory chunks (bigger than the largest bin):
For small memory chunks (up to the size of the largest bin):
Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):
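A rough sketch of that pattern, assuming hypothetical `allocate()`/`deallocate()` wrappers around the caching allocator (not necessarily the original pseudocode):

```cuda
// Because work queued on a CUDA stream executes in order, the host-side
// "free" can be issued right after the kernels are queued, without waiting
// for them to finish.
#include <cstddef>
#include <cuda_runtime.h>

// hypothetical wrappers: a real version would go through the caching allocator
void *allocate(std::size_t bytes, cudaStream_t stream);
void deallocate(void *ptr, cudaStream_t stream);

__global__ void fillKernel(float *buf, std::size_t n) {
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 0.f;
}

void produce(cudaStream_t stream, std::size_t n) {
  auto *d_tmp = static_cast<float *>(allocate(n * sizeof(float), stream));
  unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
  fillKernel<<<blocks, 256, 0, stream>>>(d_tmp, n);
  // host-side "free" right after queuing the kernel: the block becomes
  // reusable by this CUDA stream immediately, and by other streams only once
  // the queued work has reached this point
  deallocate(d_tmp, stream);
}
```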
Here
If the allocator is replaced by direct calls to
I think we should settle on the semantics we want, and then update the allocator to make it consistent with them. I can think of three options:
I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.
A few more things to consider:
Here is an attempt at sketching a possible behaviour of the
The key points are that
I came across Umpire (https://github.com/LLNL/Umpire), which is a "resource management library that allows the discovery, provision, and management of memory on next-generation architectures". I took a quick look, but am not really convinced (e.g. I didn't see any notes about asynchronous copies).
On 26 Dec, 2018, at 8:26 PM, Matti Kortelainen wrote:
> I came across Umpire (https://github.com/LLNL/Umpire), which is a "resource management library that allows the discovery, provision, and management of memory on next-generation architectures". I took a quick look, but am not really convinced (e.g. I didn't see any notes about asynchronous copies).

One may enquire with people we know at LLNL...

v.
I noticed that cutorch (https://github.com/torch/cutorch/) is using a caching allocator for both device and pinned host memory. The exact logic is different from CUB's, but their allocator also considers device/host memory in use until all operations queued on a CUDA stream at the point of the host-side free have finished.
I do like that it is supposed to be a drop-in replacement for
Hmm, they write
which makes me wonder about the failure of #205. If cutorch's caching allocators use a similar logic to CUB's caching allocator, keeping the host-side-freed memory reserved until a CUDA event recorded on a stream at the point of the host-side free has occurred (they actually go a bit beyond CUB's allocator by having a mechanism to do the same for all CUDA streams that read from the memory block, in addition to the one that was associated at the time of allocation; we'd actually need that too for non-
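The common core of both allocators is roughly this kind of bookkeeping (a simplified, single-threaded sketch of the CUB-style single-associated-stream logic, not the actual code of either library):

```cuda
// At host-side free an event is recorded on the associated stream; the block
// is handed out again only to the same stream, or to another stream once
// that event has completed.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

struct CachedBlock {
  void *ptr;
  std::size_t bytes;
  cudaStream_t associated_stream;
  cudaEvent_t ready_event;  // recorded at host-side free
};

std::vector<CachedBlock> cache;  // single-threaded sketch, no locking

CachedBlock cachedAllocate(std::size_t bytes, cudaStream_t stream) {
  for (auto it = cache.begin(); it != cache.end(); ++it) {
    bool sameStream = (it->associated_stream == stream);
    bool streamPassedFree = (cudaEventQuery(it->ready_event) == cudaSuccess);
    if (it->bytes >= bytes && (sameStream || streamPassedFree)) {
      CachedBlock block = *it;
      cache.erase(it);
      block.associated_stream = stream;  // the block is now tied to this stream
      return block;
    }
  }
  CachedBlock block{nullptr, bytes, stream, nullptr};
  cudaMalloc(&block.ptr, bytes);  // cache miss: fall back to a real allocation
  cudaEventCreateWithFlags(&block.ready_event, cudaEventDisableTiming);
  return block;
}

void cachedFree(const CachedBlock &block) {
  // the stream may still have queued work using block.ptr; record an event so
  // the block is not handed to a different stream before that work completes
  cudaEventRecord(block.ready_event, block.associated_stream);
  cache.push_back(block);
}
```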
Adding here from #306 (comment) by @fwyzard
Interesting, thanks for sharing. Do you know if there is any documentation beyond the (rather terse) README (and code, of course)?
Unfortunately no, it was mentioned only en passant.
I haven't had time to investigate further at all, but ArrayFire (https://github.com/arrayfire/arrayfire) appears to have some sort of memory manager.
There is now some more information on the RAPIDS Memory Manager. It could be interesting to give it a try at some point.
Just to document here as well, @fwyzard gave it a try. The results were, for the caching allocator:

and for
Many GPU algorithms require a global data structure as a "working space".
In some cases these data structures are used to communicate between the various kernels that compose a more complex algorithm (encapsulated in an EDProducer).
At the moment these data structures are allocated at the beginning of the job by each EDProducer.
It should be easy to create just one arena, large enough for the greediest algorithm, and then allocate those data structures in it. The arena will be local to each stream (as the current data structures are).
Concurrent access is not possible as kernels are sequential in each stream.
No host-to-device or device-to-host memcpy will be supported (even though it should be safe, as again any previous activity must have finished before those operations can be scheduled on the stream).
The interface would be trivial: an "init" (or clear, or acquire) method that zeroes the allocated-memory counter, and an "alloc(nBytes)" method that returns a pointer (8-byte aligned?) and increments the counter.
It throws bad_alloc if the preallocated working space is too small.
In principle the counter can be local; the only global quantities are the pointer to the working space and its size.
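A minimal sketch of such an interface (illustrative names, not an actual implementation):

```cuda
// The arena wraps a preallocated device buffer: clear() resets the counter
// and alloc() hands out 8-byte-aligned slices of the buffer.
#include <cstddef>
#include <new>  // std::bad_alloc

class Workspace {
public:
  Workspace(void *buffer, std::size_t capacity)
      : buffer_(static_cast<char *>(buffer)), capacity_(capacity), used_(0) {}

  // "init" / "clear" / "acquire": forget all previous allocations
  void clear() { used_ = 0; }

  // hand out the next nBytes, rounded up to 8-byte alignment
  void *alloc(std::size_t nBytes) {
    std::size_t aligned = (nBytes + 7) & ~std::size_t(7);
    if (used_ + aligned > capacity_)
      throw std::bad_alloc();  // preallocated working space is too small
    void *ptr = buffer_ + used_;
    used_ += aligned;
    return ptr;
  }

private:
  char *buffer_;          // pointer to the preallocated device arena
  std::size_t capacity_;  // size of the arena
  std::size_t used_;      // local counter of bytes handed out so far
};
```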
This scheme can go wrong only if we allow independent EDProducers that launch kernels on the same stream to be scheduled concurrently (on different CPU threads). This is in principle safe in itself, as the memcpys and the kernels will happily queue in the CUDA stream while the parent EDProducers continue their async activity. In case of a shared working space, however, they would overwrite each other's data structures (unless each algorithm is made of just one kernel and no memcpy is allowed in the working space).
Any more complex solution would immediately require a fully fledged malloc with a garbage collector, etc.