How can I tell the Rust compiler `&mut [u8]` has changed after a DMA operation #537

Hi all!

I'm trying to force the Rust compiler to assume the contents of a `&mut [u8]` have changed after a DMA operation. In the std library, calls to things like `std::fs::File::read` take a `&mut [u8]`, and after the call the contents of that slice have changed. The Rust compiler knows it cannot assume the contents of that buffer after the call to `read`.

I need to do something similar with DMA. After a DMA operation has completed (and hardware has written bytes to memory), the Rust compiler must assume `buf` has changed.

My current solution (below) is to use `asm!`. Will that do the trick? Is there a more portable way?

Something like:
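(A minimal sketch of that idea; the DMA programming itself is elided and the names are illustrative. Note that the comments below conclude that holding a `&mut [u8]` across the hardware write is itself UB, independent of this trick.)

```rust
use core::arch::asm;

fn dma_read(buf: &mut [u8]) {
    let ptr = buf.as_mut_ptr();
    // ... program the DMA controller with `ptr` and wait for it to finish ...
    unsafe {
        // The asm body is only a comment, but because the block lacks
        // `options(nomem)`, the compiler must assume it may have written
        // through `ptr`, so `buf` is reloaded afterwards.
        asm!("/* DMA wrote through {0} */", in(reg) ptr, options(nostack, preserves_flags));
    }
}
```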
Comments
There are a few extra complexities which I am not completely sure of:
Why do you think the …
I'd agree with the above, with the caveat that strict provenance is not a ruleset enforced on Rust code by the language, only a coding standard. Though I agree that using …
As @korran mentioned, the …
I am aware of this and very much looking forward to when …
Yes, I'm aware of this as well, but I omitted it for brevity's sake. That being said, there is another issue (#347) that talks about how compiler fences may not work in a case like this.
Yes, I was running into issues (I did not completely debug the disassembly) where Rust would assume the …
volatile isn't allowed to access the memory of any address other than the one being accessed, so the volatile read checking for the DMA complete isn't allowed to manipulate any address other than the pin check address with each read. You should probably pass a pointer to the buffer into asm after the DMA, and that should make LLVM assume the memory is now changed. The actual asm doesn't need to do anything at all, it can just be a comment. LLVM isn't allowed to check what you wrote in your asm block, so it won't know.
The memory a mutable reference points to must not be aliased, which means it must not be read or written by anything else. This includes DMA operations. So, your code quite simply has UB, and the fix is to use raw pointers or shared references with …
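(A sketch of the raw-pointer shape of that fix; `start_dma` and `dma_done` are invented stand-ins for the real MMIO register accesses:)

```rust
use core::slice;

// Hypothetical stand-ins for the real MMIO register accesses.
unsafe fn start_dma(_dst: *mut u8, _len: usize) { /* program the controller */ }
unsafe fn dma_done() -> bool { /* poll a status register */ true }

/// The buffer is only touched through a raw pointer while the hardware
/// may be writing to it, so no `&mut` (unique, non-aliased) claim is
/// live during the transfer.
unsafe fn dma_fill<'a>(buf: *mut u8, len: usize) -> &'a mut [u8] {
    start_dma(buf, len);
    while !dma_done() {} // poll; proper synchronization is still needed here
    // Only re-create the `&mut [u8]` once the hardware is done with it.
    slice::from_raw_parts_mut(buf, len)
}
```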
Thanks @RalfJung!
That makes sense, but in that case, why does …
When you give an …
But the DMA is also writing to a pointer derived from the mutable reference.
Oh I see... hm, well, not really, since the provenance chain is broken. "Derived from" refers to provenance. Once there is an integer, it's not "derived from" any more. But the provenance is getting exposed. So what's missing is the right kind of fence, right?

So IMO the proper fix is, yet again, to make (sufficiently small) volatile accesses relaxed atomic, and tell people to add these fences. But that's not something you can do in your code. Absent that, there are probably hacks involving inline asm, as has been suggested above...
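(For concreteness, the exposing step looks something like this with the strict-provenance APIs; `DMA_SRC_REG` is an invented register address, and on older toolchains the conversion would be a plain `as usize` cast:)

```rust
use core::ptr;

const DMA_SRC_REG: *mut usize = 0x4000_0000 as *mut usize; // invented address

unsafe fn program_dma_source(buf: *mut u8) {
    // The pointer-to-integer conversion exposes the pointer's provenance;
    // the hardware's later writes are not "derived from" any Rust pointer,
    // but they are allowed to use the exposed provenance.
    let addr: usize = buf.expose_provenance();
    ptr::write_volatile(DMA_SRC_REG, addr);
}
```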
Personally, I'd normally tell people to do the entire DMA sequence in asm when possible. Just pass the pointer to the buffer to asm, then set all the registers and spin loop within that asm. Don't move control flow back to Rust until the DMA is over, if possible. Then the asm block as a whole becomes "a slightly weird memcpy", which is easy to reason about. A pointer goes in, and when the asm is done there's new data in the pointed-to area; we're all used to that. This specific example usage code does seem to allow for that, so that's what I'd suggest here.
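(A sketch of that "whole transfer in one asm block" shape for an ARMv7 target; the controller base address and register layout are invented:)

```rust
#[cfg(target_arch = "arm")]
unsafe fn dma_copy_blocking(dst: *mut u8, src: *const u8, len: usize) {
    const DMA_BASE: usize = 0x4000_0000; // hypothetical controller base
    core::arch::asm!(
        "str {src}, [{base}, #0x00]", // source address register
        "str {dst}, [{base}, #0x04]", // destination address register
        "str {len}, [{base}, #0x08]", // transfer length register
        "movs {tmp}, #1",
        "str {tmp}, [{base}, #0x0c]", // start the transfer
        "2:",
        "ldr {tmp}, [{base}, #0x10]", // poll the status register
        "cmp {tmp}, #0",
        "beq 2b",                     // spin until the DMA is done
        "dmb",                        // hardware barrier before Rust reads `dst`
        base = in(reg) DMA_BASE,
        src = in(reg) src,
        dst = in(reg) dst,
        len = in(reg) len,
        tmp = out(reg) _,
        // no `nomem`: the block reads `src` and writes `dst`
        options(nostack),
    );
}
```

A pointer goes in; when the block finishes there is new data in the pointed-to area, exactly like a slightly weird `memcpy`.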
That issue is about compiler fences, which are not relevant here. You need an atomic fence or equivalent to synchronize with the DMA operation since it's effectively another thread.
It sounds like the issues you are seeing are from this lack of synchronization, rather than anything to do with the mutable reference. What you need, as others have mentioned, is a volatile and atomic Acquire read, but Rust doesn't provide primitives for this, so you need to do it in asm.
That approach is not a general solution. You often want to do other things while DMA is running in the background (otherwise, why off-load it, why not just bit-bang?). In particular, it is not compatible with an async executor such as embassy.
It should be sufficient to use one inline asm block that logically does "release fence and then write to DMA address", and another inline asm block that does "wait until DMA is done and then acquire" (or "test if DMA is done, and acquire if yes"). Then you can have normal Rust code in between.
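(Sketched for an ARMv7 target, with an invented register layout; `dmb` stands in for the release/acquire barriers:)

```rust
#[cfg(target_arch = "arm")]
mod dma_sync {
    use core::arch::asm;

    const DMA_START_REG: usize = 0x4000_000c; // hypothetical "go" register
    const DMA_DONE_REG: usize = 0x4000_0010;  // hypothetical status register

    /// "Release fence, then write to the DMA start register."
    pub unsafe fn start() {
        asm!(
            "dmb",                // release: prior buffer writes become visible first
            "movs {one}, #1",
            "str {one}, [{reg}]", // kick off the transfer
            reg = in(reg) DMA_START_REG,
            one = out(reg) _,
            options(nostack),
        );
    }

    /// "Test whether the DMA is done, and acquire."
    pub unsafe fn poll() -> bool {
        let done: u32;
        asm!(
            "ldr {done}, [{reg}]", // read the status register
            "dmb",                 // acquire: later buffer reads happen after
                                   // (unconditional, which is conservative but sound)
            done = out(reg) done,
            reg = in(reg) DMA_DONE_REG,
            options(nostack),
        );
        done != 0
    }
}
```

Normal Rust code (or an async executor) can then run between `start()` and a `poll()` that returns `true`.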
There's no way that can land without major upheaval of existing embedded codebases. Many are on platforms that don't define atomic operations, and even if they did, they could not afford that code size cost. There must be a solution that works on systems that don't define any atomic intrinsics and are consistently single-threaded. Perhaps this means two sets of volatile operations, one that works as it does today and another that interacts with atomic fences?
@kupiakos on single-core platforms, a (relaxed) atomic access can just be compiled to a regular boring load or store. In fact that's true on multi-core platforms as well (at least under the current C++ memory model). So I don't know why you think this is a problem?
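(One way to check this claim for a given target is to compare the codegen of a relaxed atomic load against a plain load, e.g. in a compiler explorer; on Arm both come out as a single `ldr`:)

```rust
use core::sync::atomic::{AtomicU32, Ordering};

pub fn relaxed_load(x: &AtomicU32) -> u32 {
    x.load(Ordering::Relaxed) // e.g. a single `ldr` on Arm
}

pub fn plain_load(x: &u32) -> u32 {
    *x // the same single load
}
```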
It's a problem if a volatile access tries to generate atomic intrinsics and they're not defined for the platform, causing a compile failure for Tier-3 targets. If it can properly identify when atomics aren't available and explicitly avoid using them, then that shouldn't affect us. Then, my concern is with the rather unexpected automatic conflation of atomics (concurrency) and volatility (ensuring order of operations and preventing elision). If Rust is going to extend a raw pointer operation to do more than what the equivalent C would do, we should provide a fallback for the C-like behavior.
The last time I tried to use atomics on a no-atomic target, just after that was introduced to the compiler, there was an unfortunate amount of code generated. The situation might have improved since then, but an inspection of what the compiler actually does on several common older targets is definitely worth doing.
Something seems seriously wrong if a relaxed atomic access needs intrinsics of any sort. On which architecture is this more complicated than a single load/store instruction? But anyway there's a cop-out here, which is that I said "sufficiently small" volatile accesses must be atomic. The threshold for "sufficient" will be platform-dependent, and some platforms can define it in such a way that no volatile accesses at all are atomic.
These things are closely related though. Volatile is meant for using the memory to communicate with another component, i.e., it only ever makes sense to use volatile on memory that is in some sense "shared" with something else. That something else runs concurrently with the current thread. And atomics are all about properly modeling concurrent memory accesses. In particular, atomics are also all about ensuring a certain order of operations -- the (in)famous "happens-before" order. And it's exactly that kind of ordering which we currently cannot express with volatile accesses, as the example in the OP shows. Atomics have all the vocabulary we need to fill this gap; it would be pointless to try to reinvent this. In my view, a non-atomic volatile access makes very little sense.

(Note that I mean "atomic" here in the sense of "participating in the concurrency memory model". Volatile accesses can of course in general tear, so they are closer to "atomic bytewise memcpy". But actually, of course, one often does not want tearing for volatile accesses, and yet we do not currently make any guarantee in that direction -- yet another reason to make sufficiently small volatile accesses atomic.)
The vast majority of MMIO register accesses do not affect the contents of memory. Registers that control and monitor DMA actions are the exception. If atomic volatile accesses induce the compiler optimizer to be more conservative with reloading CPU registers from RAM, many embedded developers will complain VERY loudly about the additional code bloat; we often jump through many hoops just to save 100 bytes of instruction memory. We only want to pay the cost of atomic-volatile accesses when working with DMA MMIO registers.
They are still communicating with some concurrently running system. It doesn't matter whether there are any sort of persistent "contents" there.
Volatile accesses already need to be reloaded from RAM each time, that's what makes them volatile... I'm afraid you are going to have to spell out your concern in more detail. Please skip the rhetoric and give us some technical details. :) I have now said multiple times that a relaxed atomic access has exactly the same cost (in the final assembly) as a non-atomic access, the difference is entirely in the memory model. (The "relaxed" part is important!) Until you provide any evidence to the contrary I will claim that the effects you are describing are therefore impossible. (I'm totally willing to believe that there's some potential for trouble here, but you will need to get a bit more specific.)
Sure, but these non-DMA-related MMIO accesses won't affect (for example) the contents of the stack, unlike a DMA operation.
If this is true globally, then I don't have a concern. I'm sure the load/store itself is lightweight; it's how the atomic access influences the compiler optimizer's decision whether to reload (for example) unrelated local variables from the stack or re-use the ones in the registers that concerns me. I think the only way to ease my fears is to run an experiment and see how making every volatile access atomic influences the code size of representative firmware.
I don't like this explanation because volatile is clearly also about ordering.
This implies that you can use volatile operations for communication between threads, it's just very limited because you can never share any information via non-volatile accesses, and the resulting program will not be portable because of the tearing guarantees. Hence why you almost always want atomics. I think it's an interesting question as to whether we need all four combinations of atomic/non-atomic and volatile/non-volatile accesses, or whether we can make volatile imply atomic to some degree. My view is that while you probably could add some atomicity guarantees to volatile accesses, it is simpler to understand if they are separate. Also:
What memory ordering would these volatile accesses have? The DMA case requires at least "Acquire" level on the final volatile read.
DMA operations coordinated with volatile operations may only read/write the stack when there are fences to induce synchronization. Without fences, optimizations should be largely unaffected. But that is indeed hard to say without actually trying it.
I would say they are primarily about being an observable effect. That indeed implies that the outside world (e.g. the hardware on the other end of that MMIO register) can observe the order of two accesses, but we don't currently let the program itself observe that order in any way. This is a quite surprising state of affairs that many people are confused by. That's why I think everything gets less confusing if we make volatile accesses be at least "bytewise atomic memcpy" (with relaxed ordering). Additionally we need platform-specific tearing guarantees, where non-tearing accesses are then entirely equivalent to relaxed atomic accesses. We don't yet have atomic bytewise memcpy though, so today the best we can do is to make small accesses truly relaxed atomic, and larger accesses stay non-atomic.
Not really, since we say racy volatile accesses are UB.
They are relaxed. The DMA case then needs an acquire fence after the loop, so it's fine for the accesses to be relaxed.
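(Concretely, the pattern being described would look like this; the register address is invented, and since today's `read_volatile` is not atomic, this illustrates the proposed semantics rather than currently guaranteed-sound code:)

```rust
use core::ptr::read_volatile;
use core::sync::atomic::{fence, Ordering};

const DMA_DONE_REG: *const u32 = 0x4000_0010 as *const u32; // hypothetical

unsafe fn wait_for_dma() {
    // Each poll is a volatile read (relaxed atomic, under the proposal).
    while read_volatile(DMA_DONE_REG) == 0 {}
    // The acquire fence then establishes happens-before with the DMA
    // engine's writes, so subsequent reads of the buffer see them.
    fence(Ordering::Acquire);
}
```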
I agree this is nonsensical: when using volatile to access hardware, accesses are always racy. And the program can't possibly distinguish a volatile access to hardware from a volatile access to another thread in the program. In fact, a common pattern that absolutely must work would be setting a value with a volatile write, and then waiting for it to be cleared by hardware. I believe it's impossible for a compiler to both faithfully treat volatile accesses as side-effectful, and also use knowledge that a racy volatile access cannot occur in order to optimize a program in any way. So I agree that it makes sense for volatile accesses to be similar to an "elementwise atomic access" (where element size is platform-dependent). However, I don't necessarily think that precludes the idea of operations that are both atomic and volatile, where ordering can be specified, etc.
That means we need to say which ordering they have (to make them fit into the concurrency model), and "relaxed" is the only reasonable choice IMO.
Oh sure, we could have more APIs where one can do volatile acquire operations or something like that. But I think there's very little need for that -- release/acquire operations can be very reasonably approximated with relaxed operations + fences, which only leaves volatile …

Anyway, I'm not saying we shouldn't do that. All I am saying is we should fix our existing …
Hm, is there any weird thing that people want to do that would be incompatible with that? Or any architecture where the actual ISA isn't relaxed by default for loads/stores?

So, I may be way off here, but how would that work with memory-mapped write-combining memory? For example, GPU memory is generally mapped as write-combining / non-temporal (in APIs such as Vulkan). Even on x86, which is... interesting.
There has to still be some heavy platform-dependent aspect to volatile reads, which defines what exactly is being "acquired" by a …

This is not surprising: in formal models of release/acquire operations, we think of memory as recording all past stores that ever happened. A read operation picks some previous store whose value it will return (subject to a lower bound: you are guaranteed not to see a store that's "too old", which models the concept of which things you have "synchronized with"), and that store has attached metadata about which things can be "acquired" through it. For volatile reads, that store was done by the hardware (or whatever), so it is entirely up to the hardware to define what the things "acquired" through this store are.
I guess a different approach would be to not use …
Actually, yeah, I don't think even "per-byte atomic" is upheld entirely by any normal access to specifically WT/UC memory on x86 at all (fully atomic definitely isn't, for sure). WC memory also comes with the caveat of needing a machine fence to actually do an acquire operation. The reason I don't think it's even per-byte atomic is that I don't believe WT memory is fully coherent. WC and WT memory types are incredibly common for MMIO on x86, as is UC typing (which makes reads incoherent around writes on other CPUs, unless you use hardware fences). You thus often can't do language-level atomic operations on MMIO memory, so demanding volatile be effectively an atomic operation isn't going to fly.
The problem is that only atomic operations are allowed to race, or else you get UB. It seems like atomic is too strong but volatile is currently too weak. At the very least, if a program only accesses some memory location via volatile accesses, that should not be UB, even if there is no synchronisation. |
In the LLVM memory model for concurrent operations, they seem to list volatile as taking priority over the rest of the access stuff. Is it only Rust that's calling it a data race? Can we change Rust's rules without having to even poke LLVM about all this?
The catch, though, is DMA here, since you can't link a fence to a volatile operation. I think the only real way is inline asm (or volatile atomics, provided the memory type cooperates).
In which sense is it not upheld? Saying that volatile accesses are atomic is not intended to require anything new from targets. A volatile read is still basically an FFI call. However, currently the "FFI call" is severely limited in which state it is allowed to depend on or mutate, and in particular it cannot mutate the acquire-fence clock that determines which happens-before relationships are established by the next acquire fence. By making the operation atomic, we are giving this "FFI call" a few more options, so that it can update the acquire-fence clock. (And correspondingly, the "FFI call" that happens on a volatile write must be permitted to depend on the current thread's release-fence clock, which it currently is not.)

(These clocks are pieces of per-thread state that are completely implicit in the typical graph-based "axiomatic" memory model, but made explicit when translating them into a more operational form, such as the "promising semantics" model. This explains e.g. why a …)
It says the result is "target-dependent", which is pretty useless. Where does it spell out how the targets define it? Maybe it's UB. It also says "It does not generally provide cross-thread synchronization", whatever that means -- but it does sound like these are not actually atomic operations, e.g. "volatile read; acquire fence" doesn't work as intended for the DMA example above. Generally the LLVM memory model is fairly poorly documented, has no formal model I am aware of, and does some rather unconventional things. I don't think it is suitable for direct use by Rust at this point.
I'm a big fan of this direction. To me, the point of volatile is for things where the act of reading/writing does something, and thus writing the same value twice in a row right next to each other (or reading twice right in a row) can't be optimized in the obvious way. Anything that doesn't need that shouldn't be volatile.

For something like a VRAM buffer that does work like memory, and thus coalescing reads/writes like that is completely reasonable, treating it the same way we do something like shared memory between user-mode processes (or even just shared memory in the same process) and using atomics feels like a better fit anyway.
They may not link up with fences properly or at all, and they aren't necessarily coherent.
Yeah, that's fine. They already don't satisfy tons of properties of regular memory. The key point is to give them the option to interact with the concurrency model -- which, in my understanding, requires marking them as relaxed atomic in LLVM.
But doesn't that apply to the DMA example?
I'm not sure this reasoning holds. "They don't interact with atomic rules because they don't interact with atomic rules" is an interesting tautology, but I don't think it holds. One property lost by WT memory is write-write coherence, which is observable here, because it is the order in which the writes occur. I don't think we can make them atomic without that property existing.
I think we can. You keep ignoring my arguments. LLVM can already not assume that volatile accesses behave like normal memory accesses. Nothing about that changes just because we add an …

Please show concrete evidence of any issue. I just don't buy your hypotheticals.
My assumption is that if we provide atomicity and coherence, a user could rely on the order of writes between threads, because those writes are observable behaviour. This is not the case with WT memory, even with a clear inter-thread happens-before. What the proposal amounts to is: "Yeah, these operations are atomic, but you the user can't actually rely on any atomicity properties. They just are, because we say so." At least with a proper atomic (or atomic-volatile), we could say "Yeah, it's UB on memory types that don't support atomic operations", but saying that for all volatile operations is a non-starter.
That seems far-fetched to me. Users already cannot rely on volatile reads/writes behaving like normal non-atomic reads/writes since there might not even be any memory there. Why would they suddenly forget about that lesson when we let volatile accesses have atomicity effects? Users that work on non-standard memory are hardly going to be surprised that they get non-standard semantics. They will be happy that we give them any way at all to use that memory!