Global Atomic Operations on Multi-word Data #15
Re "Communication is not bad", I think your example was interesting but I'm not sure I'd describe it that way. I mean, you're saying that an unbounded amount of communication under contention kills performance. And it's probably worse to have unbounded number of compare-exchange when communication is involved than it is locally, because the latency on the operation is higher (so it might be more likely another operation changes the value). My main question here is - what are these bit-maps of all memory you mentioned? Nothing about the descriptor table idea implies to me that we need bitmaps... but perhaps I have not thought through some issue. |
The bitmaps are mainly to keep track of which descriptors are in use and which are not. The descriptors themselves do not refer to the actual address space, but are indices into an array on some locale. The bitmap provides a low (but essential) overhead way of keeping track of which indices can be used or reused. Bitmaps may not be needed if, instead of Chapel arrays, we use just raw pointers to memory.
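To illustrate the idea, here is a minimal sketch of such a bitmap index allocator (the names, the fixed capacity, and the lack of synchronization are my own simplifications, not the actual implementation):

```chapel
use BitOps;

// Hypothetical sketch: track which descriptor indices are free.
// Each 64-bit word covers 64 slots; a set bit means the slot is free.
config const numSlots = 1024;
const nWords = numSlots / 64;
var bitmap : [0..#nWords] uint;
bitmap = max(uint);  // all slots start out free

// Claim the first free index, or return -1 if the table is full.
// (Not concurrency-safe; a real table needs atomics or a lock here.)
proc acquireIndex() : int {
  for w in 0..#nWords {
    if bitmap[w] != 0 {
      const bit = ctz(bitmap[w]):int;  // lowest set (free) bit
      bitmap[w] &= ~(1:uint << bit);   // mark it in-use
      return w*64 + bit;
    }
  }
  return -1;
}

// Return an index to the pool so it can be reused.
proc releaseIndex(idx : int) {
  bitmap[idx/64] |= 1:uint << (idx%64);
}
```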
I would expect a pattern like this:

```chapel
class C { ... }

proc test() {
  var a = new C(1);
  var b = new C(2);
  var x: atomic C;
  var result: bool;
  x.write(a);
  result = x.compareExchange(a, b);
  assert(result);
  assert(x.read() == b);
  delete a;
  delete b;
}
```

I would imagine that would mean you would need to perform this …
Truth be told, under more ideal cases it would be better if it looked like this:

```chapel
class C { ... }

var gdt = new GDT(C);

proc test() {
  // Maybe have parameters passed be used to reflectively call the constructor of C?
  var (idxA, a) = gdt.create(1);
  var (idxB, b) = gdt.create(2);
  var x: atomic uint;
  var result: bool;
  x.write(idxA);
  result = x.compareExchange(idxA, idxB);
  assert(result);
  assert(gdt.read(idxB) == b);
}
```

Currently it's ugly, but either the runtime or compiler will need to keep track of the actual index of the object, or the user has to. I'm not sure if you can hash the actual pointer address to an index, but maybe something similar can be done to remove the need to use the index and support just using the raw object. I can experiment with this as well.
Right, it's clear that would be more convenient to implement, but I'm not so sure I'd consider it a full solution to the problem. After all, we are trying to make a productive parallel language, so we want to minimize concerns for our users. Anyway, as far as implementing goes, yeah, keep going and experiment with what it would take to get my example pattern to work. Thanks!
Here is a document detailing how I plan to implement global atomics in a fashion that can provide reasonable semantics (i.e., no need to carry descriptors, a wide-pointer compression strategy, global atomics, etc.). I wish to start on it this weekend (as afterward I need to satisfy Engin's requirements for the Queue), so by then I should know whether it works or not. As well, the …
Also, @mppf, I believe it was mentioned before that Linux only allows a 47-bit virtual address space, but I can't verify that 100%. If that is the case, as a workaround I can always just shave off the top 17 bits for the locale id (which only allows 131K locales, but it could be an even more temporary-but-immediate solution, taking literally next to no effort). I just can't make assumptions without knowing how Chapel currently handles things. gbt1 mentioned that the network tops out at 128K nodes, but I can see that increasing easily within a few years (at which point my solution won't work). However, if both are "fine for now" solutions, then it's sufficient to just have the top 17 bits be the locale id and the lower 47 bits be the virtual address. Edit: It turns out, as seen here, that my solution is rather overly complicated: "Most operating systems and applications will not need such a large address space for the foreseeable future, so implementing such wide virtual addresses would simply increase the complexity and cost of address translation with no real benefit." Absolutely crazy... so I can actually just compress it as I see fit. Or rather, my solution isn't overly complicated, but it is one that is resistant to change.
I think it's important that we have a strategy for my above Chapel code example to work, even if it's not the first thing we implement. I don't understand why you need a "Zone" strategy at all. Let's back up a bit and talk through the problems that really need to be solved here. What I was imagining was: (a) compress a wide pointer into a 64-bit value where possible, and (b) otherwise map a 64-bit value to a node and address through some data structure. See https://github.com/chapel-lang/chapel/blob/master/runtime/include/chpl-wide-ptr-fns.h for some existing code that does wide-pointer compression, though that wouldn't work on very large systems. (I.e., I'm asserting that (a) is already implemented and you could steal/reuse that implementation.) Anyway, for (b), could we propagate information about a mapping from index to node in a dynamic way, only when needed? For example, an "obvious" strategy for (b) - that is, a straw-man - is a global array mapping each index to its wide pointer.
For terminology, let's call the thing we actually store in an atomic class variable an "atomic instance descriptor". How would that work with my example?

```chapel
class C { ... }

proc test() {
  var a = new C(1); // works as it does now
  var b = new C(2); // as it does now
  var x: atomic C;  // atomic C is 64 bits in a format we are inventing: "atomic instance descriptor"
  var result: bool;
  x.write(a);
  result = x.compareExchange(a, b);
  // compareExchange will:
  // 1) Convert a and b to atomic instance descriptors (possibly by compression,
  //    possibly by searching in or updating some data structures)
  // 2) Perform the 64-bit atomic compareExchange operation on the descriptors
  assert(result);
  assert(x.read() == b);
  // read() will:
  // 1) atomically read a 64-bit atomic instance descriptor
  // 2) unpack the atomic instance descriptor into a 128-bit wide pointer
  //    (possibly by reading it directly, possibly with some data structure)
  delete a;
  // delete a could remove the atomic instance descriptor for a from some data structure
  delete b;
  // delete b could remove the atomic instance descriptor for b from some data structure
}
```

So, for the example to be efficient, we need reasonable ways to: (i) convert a class instance into an atomic instance descriptor, and (ii) convert an atomic instance descriptor back into a wide pointer. For (i), we could conceivably store the atomic instance descriptor in the class instance. For (ii), we could literally use a global array as I was describing above. It could even be block distributed... Or, maybe the best solution here uses a limited number of atomic descriptors per locale, so that the format would be 32 bits of locale number/node id, and then a 32-bit offset into that locale's individual table. That's not really that different from using a Block-distributed array, but it might add some flexibility. In particular, you could imagine using a skip list (or something) instead of an array, to make inserting/freeing more efficient. Or, what if these arrays were per-pthread, so that there isn't even any locking overhead in managing the data structures? Whatever pthread is doing (i) would update its table. Is it a significant limitation to only allow 2^32 atomic instance descriptors at a given time? I'm not sure. A sketch of the 32/32 format is below.
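As a concrete sketch of that 32/32 format (the helper names are mine, and the per-locale table is a stand-in for whatever structure ends up being used):

```chapel
// Hypothetical encoding: upper 32 bits hold the node id, lower 32 bits
// an offset into that locale's private descriptor table.
inline proc packDescriptor(node : uint, offset : uint) : uint {
  return (node << 32) | (offset & 0xFFFFFFFF);
}

inline proc descriptorNode(descr : uint) : uint {
  return descr >> 32;
}

inline proc descriptorOffset(descr : uint) : uint {
  return descr & 0xFFFFFFFF;
}

// read() would unpack the 64-bit word and fetch from the owning
// locale's table, e.g. (assuming some per-locale 'table' array):
//   const d = x.read();
//   var obj : C;
//   on Locales[descriptorNode(d):int] do
//     obj = table[descriptorOffset(d):int];
```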
Related ideas:
Even though these are related, I don't think it's a good idea to tie this effort to those other components unless we really, really need to.
I believe I should reiterate the point of Zones. Zones make an assumption about how many distinct upper-32-bit address prefixes are actually in use (see the dangerous assumption noted below). Assuming all locales cache each other's zone base offsets and their respective mappings, we can easily do this without intervention from the runtime or compiler. Note again, this was from the perspective of using raw pointers rather than Chapel arrays. To skip to the good part, we'll assume the same `a`, `b`, and `x` as in your example:

```chapel
x.write(a);
// write will:
// 1) Descriptor becomes (a.localeId << 40) | zoneId(a) << 32 | a.addr & 0xFFFFFFFF
// 2) This is written to some atomic uint field.
// Note: zoneId performs the 'magic' in terms of compression... that's certainly possible but a WIP

result = x.compareExchange(a, b);
// compareExchange will:
// 1) Convert a and b to atomic instance descriptors (possibly by compression,
//    possibly by searching in or updating some data structures)
// 2) Perform the 64-bit atomic compareExchange operation on the descriptors

assert(result);
assert(x.read() == b);
// read() will:
// 1) atomically read a 64-bit atomic instance descriptor
// 2) Unpack by doing the following:
//      var descr = ...;
//      var localeId = descr >> 40;
//      var addr = (zone((descr >> 32) & 0xFF)) | descr & 0xFFFFFFFF;
//      return ...; /* Handle conversion back to wide pointer with localeId and addr */
// Note: zone performs the 'magic' in terms of decompression... while it is a WIP, what it does
// is return a 64-bit base that the offset (lower 32 bits) is added to.
```

The above again makes a relatively dangerous assumption: only 256 unique most-significant-32-bit address prefixes per locale.
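Here is a sketch of one way `zoneId`/`zone` could work under that assumption (an associative map from the upper 32 address bits to a small id, capped at 256 entries per locale; every name is mine, and synchronization plus the cross-locale exchange of mappings is omitted):

```chapel
// Hypothetical per-locale zone table. A "zone" is a distinct upper-32-bit
// address prefix; we hand out small ids for each prefix as it is seen.
var zoneBases : [0..#256] uint;   // zone id -> 64-bit base address
var knownBases : domain(uint);    // set of prefixes seen so far
var idOfBase : [knownBases] int;  // prefix -> zone id (associative array)
var nextZone = 0;

proc zoneId(addr : uint) : int {
  const base = addr & 0xFFFFFFFF00000000;
  if !knownBases.member(base) {
    // Out of zone ids? The adaptive scheme would fall back to the
    // descriptor table here rather than asserting.
    assert(nextZone < 256);
    knownBases += base;
    idOfBase[base] = nextZone;
    zoneBases[nextZone] = base;
    nextZone += 1;
  }
  return idOfBase[base];
}

proc zone(id : int) : uint {
  // 64-bit base; OR it with the descriptor's low 32 bits to rebuild addr
  return zoneBases[id];
}
```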
It's good to hear that it's been done before; that definitely makes my job easier and also makes more sense.
This is definitely viable if we do what you said...
This would be acceptable. If we had each type keep track of its descriptor, and each type have a specific …
Makes any remote operation (such as a
That's an interesting one... would this also allow certain NUMA-friendly/aware optimizations? Perhaps you could just have one for each NUMA node if that's the case. However, if there is one per pthread, especially on hyperthreaded systems with tons of processors per node and tons of nodes, then I can foresee it being an issue. Having too many …
I can't really make sense of that one...
Which operations need to access this data structure, and how do they access it? I don't object to the zone idea exactly, I just think it needs to fit into the framework I was describing & should be evaluated among some of these other ideas.
That's my fault, I mean that
Now that's another thing that's my fault. Truthfully, I've never implemented a skip list before, so when I heard "list" I thought "linked list", not that it could be array-based. I rescind what I said; a skip list seems to be the ticket then.
Gotcha. The thing is, when it comes to the runtime or compiler, I have no idea what is and isn't possible. As well, I've never developed applications for HPC clusters (and this is my first time delving into distributed computing), so I don't have any input to give on whether something is enough or what is expected of a sample user program. However, to give what little input I can now...
This is one thing I can agree with. I honestly can't say whether it's good enough even in the present, but that's all I can assume.
I honestly don't know, but my opinion would be that if a node exceeds the set maximum threshold, it should just be a runtime error, and the user should try another workaround. Of course, a fallback for this could be to just use the …
This could be doable for my project by just using it in the queue. If a data structure can't exceed that limit, then I doubt anything else can.
As well, to reiterate: the best thing I can do at the moment is implement it as an actual module, and then analyze how it can be integrated into the language itself. Currently, a way to hash pointers to actual indices is needed, and the only way to prove a solution works is to build an early prototype for it (at least IMO). The solution I agree with is having fields in class instances for its descriptor, unique to an …
Right, but when you're doing such a prototype implementation, it's important to know what a plausible API for the compiler to use will be.
I can think of an adaptive 4-phase strategy...

**NumLocales ≤ 2^17**

Normal compression of wide pointers (17 bits for locale, 47 bits for offset, based on the assumption that only 47 address bits are ever used), which allows 131K locales and the entire usable address space. This should be sufficient and causes no performance loss whatsoever, as barely any translation is needed. Note that I don't leave a bit for marking/tagging, but one can be added; if you need to pack it into a 64-bit word, that would change the condition to NumLocales ≤ 2^16. (A sketch of this packing follows below.)

**NumLocales < 2^32**

We use the Zone compression strategy. The more nodes you have, the fewer bits you have for the zone (and if you exceed the zone limit, we can adapt into the third phase). In this case, if you have exactly 2^18 NumLocales, then you can reserve the rest for zones, meaning it's more likely that you can continue using this strategy. As mentioned before, Zones just map the most significant 32 bits of the address to a zone id as they are encountered, which incurs only minor performance losses for caching another locale's zone mappings.

**NumLocales = 2^32, or ZoneId bits exceeded**

We use the global descriptor table strategy. Any allocations that exceed the zone limit are added here instead. There is a significant penalty in terms of managing everything, but what else can you do?

**NumLocales > 2^32**

Just fall back to using a …

Would you agree to something like this? Edit: Added a 4th phase.
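To make the first phase concrete, here is a minimal sketch of the 17/47 split (assuming, as stated above, that only the low 47 address bits are ever meaningful; the names are mine):

```chapel
// Phase 1: 17 bits of locale id, 47 bits of virtual address.
param addrBits = 47;
param addrMask = (1:uint << addrBits) - 1;

inline proc pack(localeId : uint, addr : uint) : uint {
  return (localeId << addrBits) | (addr & addrMask);
}

inline proc unpackLocale(packed : uint) : uint {
  return packed >> addrBits;
}

inline proc unpackAddr(packed : uint) : uint {
  return packed & addrMask;
}
```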
Okay, now I found out that I can't really create wide pointers from Chapel... I can't call any of the runtime functions that expect or return `wide_ptr_t`.
I am not sure if I am understanding it right... but yes, you can cast references to uint or something similar using compiler primitives. And once you have it cast, hash it as if it were just an integer and not a reference. If it is a wide reference (which I think is the actual question you are asking), I think you can have a hash function that takes both the locale id and the address as input. (There are …
You should be able to call the functions in runtime/include/chpl-wide-ptr-fns.h, but you'll have to use the foreign-function interface to do so, including telling Chapel that wide_ptr_t is an extern type. Additionally, there are compiler primitives to get the locale or address portion of a wide pointer: _wide_get_locale, _wide_get_node, _wide_get_addr. See for example …
I would expect that these compiler primitives are all that you need for an initial attempt. I don't understand why compiler adjustments would make your proposal useless.
I did it! I have done as you asked, Michael: I created the descriptor-table approach and have an MWE similar to the ones you desired. I even embedded everything within a single construct:

```chapel
class C { var c : int; }

var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue, since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However, if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// Safe because we only use the pointer itself.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
```

I have also tested it in a multi-locale environment and can prove that it works (although the test doesn't have much coverage yet, aside from ensuring that the objects returned are consistent). So in essence, I have proof of correctness for this proof of concept. Next up is performance, for which I need to measure each operation (compared to …
### Global Atomic Object

I present the following:

```chapel
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
```

There is no retrying, just a one-and-done atomic operation (which I count as one). It should be noted again that with pointer compression and a smaller number of nodes, …

### Performance

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.

#### Time

This shows the runtime of the program. As can be seen, raw atomic operations grossly … Imagine if I had attempted to implement some MCAS operation to update both locations using … As great as …

#### Operations Per Second

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they …

### Improvements

Some improvements to the …

#### Pointer Compression for NumLocales < 2^17

This is something I still want to do, but it'd be great if I could have a good example to do it from Chapel. That's all I'd need to be able to start and complete it. Currently it only uses descriptor tables, which use a skip list for allocation/deallocation and a segmented memory pool that recycles nodes (and is indexed directly during a …

#### Wait-Free/Lock-Free Descriptor Table Implementation

We use a …

#### Low-Level C/Runtime-Specific Optimization

From using …
Cool! The Chapel type system doesn't normally allow ptr-int casting, but we can work around that with some special functions. The example below should have everything you need to get started on the pointer-compression part.

```chapel
proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
return __primitive("cast", uint, ptr);
}
proc uint_to_c_void_ptr(u:uint)
{
return __primitive("cast", c_void_ptr, u);
}
proc test(obj:object)
{
var loc = __primitive("_wide_get_locale", obj);
var node = chpl_nodeFromLocaleID(loc);
var subloc = chpl_sublocFromLocaleID(loc);
var raddr = __primitive("_wide_get_addr", obj);
// TODO: reasonable checking for pointers/nodes
// too big to fit into this format.
// TODO: configurable format
// Let's "pack" the wide pointer
var uraddr = c_void_ptr_to_uint(raddr);
var unode = node:uint;
var packed:uint = 0;
packed = uraddr & 0x0000ffffffffffff;
packed |= unode << 48;
writef("node %xi raddr %016xi\n", unode, uraddr);
writef("packed to %016xi\n", packed);
// Let's "unpack" the wide pointer.
var got_uraddr = packed & 0x0000ffffffffffff;
var got_unode = packed >> 48;
var got_raddr = uint_to_c_void_ptr(got_uraddr);
writef("unpacked to node %xi raddr %016xi\n", got_unode, got_uraddr);
// check pointer value matches.
assert(raddr == got_raddr);
// check node matches
assert(obj.locale.id == got_unode);
}
class C {
var x:int;
}
for loc in Locales {
on loc {
var c = new C(10);
test(c);
}
}
```

For the other improvements, I'd expect the wait-free/lock-free version to be a big win, but implementing it in C vs. Chapel won't matter much.
You might also need a way to go from a node and a local pointer to a wide pointer. I can probably help you with that part too, but I'm hoping you can make progress without it for the moment.
I greatly appreciate the examples; they are definitely handy for handling compression and decompression. However, I really do need to be able to obtain the object, which means I do need to create the wide pointer. If there were a way to copy a dummy wide pointer by value, then modify the copy, that could work, so long as the compiler allows it. The issue is that throwing "TODO - don't like type-punning record/union" is a huge roadblock, because this would work if it were allowed. I'm assuming it throws a compiler warning because GCC will throw a compiler warning, and any compiler warning gets treated as a compiler error?
Nevermind, realized that this was it. I managed to side-step the compiler a bit by using the following:

```chapel
extern type wide_ptr_t;
extern type c_nodeid_t;
extern proc chpl_return_wide_ptr_node(c_nodeid_t, c_void_ptr) : wide_ptr_t;
proc c_void_ptr_to_uint(ptr:c_void_ptr)
{
return __primitive("cast", uint, ptr);
}
proc uint_to_c_void_ptr(u:uint)
{
return __primitive("cast", c_void_ptr, u);
}
class C {
var x:int;
}
proc test(obj:object)
{
var loc = __primitive("_wide_get_locale", obj);
var node = chpl_nodeFromLocaleID(loc);
var subloc = chpl_sublocFromLocaleID(loc);
var raddr = __primitive("_wide_get_addr", obj);
// TODO: reasonable checking for pointers/nodes
// too big to fit into this format.
// TODO: configurable format
// Let's "pack" the wide pointer
var uraddr = c_void_ptr_to_uint(raddr);
var unode = node:uint;
var packed:uint = 0;
packed = uraddr & 0x0000ffffffffffff;
packed |= unode << 48;
writef("node %xi raddr %016xi\n", unode, uraddr);
writef("packed to %016xi\n", packed);
// Let's "unpack" the wide pointer.
var got_uraddr = packed & 0x0000ffffffffffff;
var got_unode = packed >> 48;
var got_raddr = uint_to_c_void_ptr(got_uraddr);
writef("unpacked to node %xi raddr %016xi\n", got_unode, got_uraddr);
// check pointer value matches.
assert(raddr == got_raddr);
// check node matches
assert(obj.locale.id == got_unode);
// Start here for new stuff... we create the wide pointer from decompressed information
var wideptr = chpl_return_wide_ptr_node(got_unode, got_raddr);
// Note: We fool the compiler into creating a wide pointer by using an
// 'on here' type of statement. I indexed into 'Locales' because I don't want the
// compiler to 'optimize' it away...
var newObj : C;
on Locales[here.id] do newObj = new C();
// Warning: If this is *not* a wide pointer, it will overwrite part of the stack.
// If it is, it works. Dangerous, but only way to do so at the moment.
c_memcpy(c_ptrTo(newObj), c_ptrTo(wideptr), 16);
assert(obj == newObj);
}
for loc in Locales {
on loc {
var c = new C(10);
test(c);
}
}
```
@LouisJenkinsCS - looks like an OK workaround for now. At some point, I think that somebody (possibly me) should add the reverse of __primitive("_wide_get_addr", obj) -- namely, a primitive that creates a wide pointer of a particular type from a local void* and a locale.
FWIW, I have implemented something like that for my work. It can be used with a C pointer or a wide reference (I implemented support for narrow references as well, but it was useless for me, thus I commented it out). You can read …
It's been done. Not only does it perform the compression strategy for locales that fit within 16 bits, it also maintains manageable/respectable performance for locale counts that need between 17 and 32 bits. It is as fast as atomics (in this case it was faster, but that is likely due to environmental differences such as scheduling, network traffic, etc.). I jumped straight to 32 locales because there's no point rerunning everything, as it's clear that …

**32 Locales**
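For reference, a sketch of how the two paths might be selected on a write (the tag bit and the `registerDescriptor` helper are my own assumptions; the thread above leaves the differentiation scheme open):

```chapel
// Hypothetical hybrid encode: compress when the node id fits in 16 bits
// and the address in 47, otherwise fall back to a descriptor-table index.
// The high bit is used as a tag to tell the two formats apart.
param tagBit = 1:uint << 63;
param addrBits = 47;

// Stub for illustration; a real implementation would insert the wide
// pointer into the descriptor table and return its slot index.
proc registerDescriptor(node : uint, addr : uint) : uint {
  return 0;
}

proc encode(node : uint, addr : uint) : uint {
  if node < (1:uint << 16) && addr < (1:uint << addrBits) then
    return (node << addrBits) | addr;  // compressed: tag bit stays 0
  else
    return tagBit | registerDescriptor(node, addr);
}

proc isDescriptor(word : uint) : bool {
  return (word & tagBit) != 0;
}
```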
@LouisJenkinsCS - cool, let's work towards putting this into a PR that can go onto master.
Sure, I can handle something like that. Should I do the following...
If it doesn't have to be today, I'll be able to refactor it into something more presentable (currently it's more of something I coded in 2 days).
@mppf -- Should this be a standard module or an internal one?
Doesn't have to be today. You'll also need to socialize the basic idea (design, proposed interface, plans for compiler support that you won't implement but maybe I will). Edit: socialize the idea by creating an issue, PR message, or email summarizing it. Send email to chapel-developers with a link or with the description.
Below is a rough draft for the PR proposal I want to make tomorrow afternoon. Let me know what you think. The code still needs to be refactored tomorrow, but I'd think I need a good opener.

### Global Atomic Object

#### Introduction

In any budding language, especially one that boasts to be a productive … In this proposal I suggest a solution that is not only viable but also scalable.

#### Details

Below, I go over some details that are present in the current implementation.

##### Pointer Compression

This is the common case, but certainly not a novel concept in and of itself, nor to …

##### Descriptor Tables

This could really be a new design pattern, but the idea, while novel in its application, …

#### Performance

I compare GlobalAtomicObject (using descriptor tables) performance against 'Sync Var', …

```chapel
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
```

There is no retrying, just a one-and-done atomic operation (which I count as one). It should be noted again that with pointer compression and a smaller number of nodes, …

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.

##### Time

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly … Imagine if we had attempted to implement some MCAS operation to update both locations using …

##### Operations Per Second

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they …

#### Usage

Sample usage of `GlobalAtomicObject`:

```chapel
class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;
// Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);
// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
class C { ... }
proc test() {
var a = new C(1);
var b = new C(2);
var x: atomic C;
var result:bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
delete a;
delete b;
}
```

Where …

#### Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

##### Pointer Compression + Descriptor Tables

It may be possible to combine normal pointer compression for the lower locales that fit within 16 bits … See the next idea for how to differentiate.

##### Zone Compression

A theoretical model where we use a more dynamic type of compression which attempts … The benefit here is that we get to keep using 64-bit pointer compression, and the amount … "What happens when we run out of zones?" - We use the descriptor … "How do we differentiate between Zone Compression, Pointer Compression, and Descriptor Tables?" …

##### Caching Descriptors into Objects

This is a change that isn't theoretical, but it requires changes to the language itself.

##### Concurrent-Safe Implementation

The table utilizes a simplistic skip-list implementation on a segmented memory pool. If someone managed to make everything concurrent, it could improve performance further …

Edit: Am fixing this up for the official issue.
You don't have to get feedback on a PR message before opening it. Your PR reviewer will ask you to update it if they think it's necessary. Note that the PR message becomes the merge commit message, so it goes into git history, more or less. So I think capturing any "gotchas" in the code, or your software design at a high level, is important. The "Details" part seems to do that well. Since you already have it here: my initial thought is that this is a bit too long for a PR message. If you want to make it more concise, the "Introduction", "Performance" and "Usage" headers might go away, or can be summarized with a couple of sentences. Usage details can also be summarized into documentation in the code. That being said, @mppf can override any comments I have here :)
> up to 64-bit values

Some hardware supports 128-bit CAS; it's just that I don't think we want to require it. I skimmed over the rest. My suggestion would be to make that into a GitHub issue and simply link to that issue from any PR message you make to do the implementation. It's advisable to ask for feedback on the issue before you actually open a PR. Thanks!
Decided to go ahead and create the issue describing the new construct here, but the official PR is a bit of a ways off; hopefully by Friday at this point.
PR officially up: chapel-lang/chapel#6717
@e-kayrakli
@mppf
This is of importance to both the project and to Chapel as a whole (or so I believe).
I believe that this early prototype of the Global Descriptor Table (GDT), formerly
referred to as GlobalAtomicObject, demonstrates the potential power of global atomics.
There are a few key things here which were misconceptions that I've had and been told
since the beginning...
### Communication Is NOT Bad

Communication, even if it is per-operation, is not bad, but it's not good either. The true bottleneck is contention. I've tested time and time again, and each time, an algorithm containing some contention-causing operation, one which may cause an unbounded number of failed operations resulting in an unbounded number of communications, has failed horribly in terms of performance. For example, operations like this are bad...
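A sketch of the kind of operation meant here (with `f` standing for whatever update is computed from the two values; both names are placeholders of mine):

```chapel
// Anti-pattern: an unbounded retry loop over remote atomics. Every failed
// CAS forces x and y to be re-read across the network, and contention
// makes failure (and therefore further communication) more likely.
while true {
  const oldX = x.read();       // remote read
  const oldY = y.read();       // remote read
  const newX = f(oldX, oldY);  // hypothetical update using both values
  if x.compareExchangeStrong(oldX, newX) then break;
}
```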
Such a loop will cause an unbounded number of communications. Reading `x` and `y` is relatively expensive, and since the `CAS` operation not only causes remote contention but also forces both `x` and `y` to be read again, performance drops. Now, algorithms which are guaranteed to succeed are guaranteed to scale very well; that is, only wait-free algorithms can see scalability. Code that uses `fetchAdd` and `exchange` works wonderfully, which gives me hope that global data structures with very loose guarantees are possible and useful.
### Global Descriptor Table

Next is the new and novel idea of the `GDT`, the Global Descriptor Table. A 64-bit word is used instead of a 128-bit wide pointer, allowing us to take advantage of the benefits of network atomics. In essence, we encode the locale number in the upper 32 bits and the actual index in the lower 32 bits. Currently, an array is used, but in reality (perhaps with runtime support, if not already available) it's possible that a descriptor could be used directly like a normal pointer. Currently there is a large amount of overhead in needing to keep a bitmap of usable memory, and it cannot be resized without synchronizing all accesses to it (as Chapel domains and arrays cannot be resized in a lock-free way).

Currently, it has been tested with simple `exchange` operations on class instances remotely across all locales, versus needing to acquire a sync variable to do the same. As stated above, a `compareExchangeStrong` kills performance, but that has nothing to do with the `GDT`, only with network atomic contention, so it's kept simple. It works and it scales. The below graph shows time to complete the same number of operations (100,000 in this case). It shows the overall runtime as the average of 4 trials (discarding the first warm-up), and that while `sync` grows near-exponentially, `GDT` remains linear.

Now, to get a better look at it, here are the same results in operations per second.
### Implications

I believe this is some very valuable information, which is why I include Michael as well. Not only is the implementation very immature (there are so many optimizations that can be made that there's no saying how much more this can scale), it also surpasses the only existing way to perform atomics on global class instances. As well, this opens up some very unconventional means of concurrency, the first being the `DSMLock` (WIP) that is based on DSM-Synch (Distributed Shared Memory Combined Synchronization) from the publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). As I can confirm that this scales, I can possibly even make `DSMLock` scale in such a way that global lock-based algorithms can scale (not just wait-free ones). Extremely exciting!

Edit: If this is extended to the runtime, I can imagine having the entire global address space chunked up into 4GB (2^32) zones, with 2^8 zones on each of 2^24 locales. With 256 zones of 4GB, that's a 1TB address space per locale, with 16M+ locales (2^8 zones × 2^32 bytes = 2^40 bytes = 1TB per locale; 8 zone bits + 24 locale bits leave 32 bits of offset in a 64-bit word).