-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crashes occurring on multiple objects when changing fftsettings #386
Comments
I got it to crash with 1) audio on and 2) wiggle the FFT size. it is not all the time:
this is annoying. but it was the same as the OP which are 100% crashing - for instance, sines~ every time... |
this does it:
|
On MSVC Debug build both patches fail in same glamorous way, crash when spamming the third FFT parameter
|
I am sad to say that Windows has a better report - on Mac I cannot get it to print that assert string... |
@weefuzzy might cry, @AlexHarker might too. It seems all maxAttributes have a similar behaviour... |
This didn't used to happen, right? All that assiduous testing of help files on Windows would've caught it. So, is there a release version where this doesn't happen? If so, isolating the breaking change would be The Way (not least, for working out if this is a flucoma-max bug or a flucoma-core bug) |
@tremblap - can you repro on Mac? |
I'm going to bet this is threading issue. Look at this point in the code: As far as I can see the member variable of the struct are set incorrectly at this point (and yet to be clipped) - if the audio thread reads at this point it will read incorrectly. @weefuzzy - am I wrong? I would note also that because these settings should be read atomically, but are not written atomically there is the possibility of reading them in a non-related state. Welcome to the horror of multi-threading - I will be happy to collect my prize now (BTW if I am correct then running in SIAI and changing the values in the high-priority thread only should never crash). |
Actually I think that would only be correct if v were the object state, but I think v should be a thread-local variable - the last bit (about the three values not writing atomically) is probably still true, but each element should still be atomic so it's hard to see how slicing could occur on the FFT size in this context... The copy into the client parameters should be happening here if I understand the code correctly: flucoma-max/source/include/FluidMaxWrapper.hpp Line 1033 in 6de4328
This might be a good place to check for spurious values (or just above it). If a can ever have spurious values then it would be a case of working back from there. |
Yeah, if it were that foundational, I feel like it would have bitten already. (Still, not ruling out threading badness, just on principle). The last time there were problems of this sort (cascades of assertion deaths due to FFT size stuff) it was because object-initialization-time maxima weren't getting set properly, so algos would end up allocating too little space (which then gets picked up later by the So, another thing to check is what |
2 bulk answers: @weefuzzy no it didn't use to happen but I wonder if we tested that extensively in debug on windows. I mean, this needs serious wiggling in debug to crash, hence the team finding it (we roll that way now to find more of these and understand the framework using debugger to spy on states) 1 question: should we recompile in debug on the last public release(s) until it stops crashing as recommended or are you further down the rabbit hole? |
I'm nowhere near the rabbit hole, and will only be able to issue vague advice, which is:
|
I was implying @AlexHarker might have been near the whole, not you dear @weefuzzy - sorry if that wasn't clear. vague advice is most welcome in all cases. |
I am only near the parts @tremblap. I would suggest some kind of console output at the point in the max wrapper above that can identify if there are spurious values there. If we discover yes we look at which values of the proposed value/client are spurious at that point and work back. I can't repro on my current build/machine/Max version and I'd advise that we try to find a patch that auto-crashes rather than through lengthy wiggling. |
E.g. metro to "wiggle" |
I'd also suggest Max bug info here to make sure we are testing the same versions of Max and flucoma. |
I can get consistent (but different) assert crashes if I build in debug mode and I'm pretty sure this is due to a non-correct object state due to threading issues, although I haven't fully traced it through. Here's my theory - in the line identified above the FFTParams object is moved/copied (and I suspect in practice it is the latter). Each 64bit is likely copied atomically, so the first few values might (for instance) make it through the copy before the audio thread asks to read the values. At this point perhaps the trackers have not been copied and thus it looks like there's no resizing to do - but the values eventually update and by the time we are processing the larger frame size is in action. There's no easy (cost-free) fix if this is the case, and this is unlikely to be a new bug if I'm right. It could also affect any other parameters where what needs to be written is larger than a single 64 bit piece of POD. |
OK - two things. I am still getting 100% consistent eigen_assert() issues with fluid.sines that are not the crashes above. However, if I comment out that assert I can get the two crashes @lewardo got above after a bit of wiggling (manual or automated) I must now retract my previous retraction, because: flucoma-max/source/include/FluidMaxWrapper.hpp Line 1007 in 6de4328
So in the setter for the FFTParam attribute the work is done directly on the client object, which is very much not threadsafe (in addition to the threadsafety issue from the copy, which in effect might not actually copy...). It looks like there are at least two bugs I'm seeing, one of which (the main one reported here) points to some fairly nasty parameter threadsafety problems. And whilst @weefuzzy says:
I don't actually think this is necessarily true. Threading bugs are by their nature non-deterministic and can commonly be subject to machine architecture etc., so they could exhibit more readily on some machines than others (we are not testing on the same OS, nor processors here) and they can be very occasional. When I first starting tracing a framelib threading issue last year it was taking 30 mins - 2 hours before I'd see a crash (until I devised some faster crashing code). So - let's try to get to the bottom of the threading question. @MattS6464 / @lewardo / @tremblap - please run the patch using a debug build (not an optimised one which won't have the asserts) below in two modes: 1 - overdrive on but schedular in audio interrupt off Then post here to say whether 1/2 crash - if it's a yes for 1 and a no for 2 then it's almost certain we have a threadsafety issue.
|
So far I can get 1 to crash but not 2...
|
Looks like I durped that last year, from inspecting the blame. Come what may, that So, there's always been the potential for threading badness because the wrapper doesn't do anything more robust than the same spinlock that the SDK examples use, and you're quite right that there's no pain free fix. My non-thought-through idea was that using a FIFO design similar to scsynth would be nicer (as in easier to reason about), but never had time to see that through. |
auto& should be a reference I think, but I'm not sure - I'd have to check that in the debugger. I have just seen the spin 'locks' and I'm afraid they are woefully insufficient - if that's in the SDK then it is incredibly disappointing as an example because it will only protect you some of the time. The setter object can pass the spin lock and then the audio thread increment the counter and both threads are then executing at the same time - the "lock: in this case does nothing (because it only checks that you aren't processing when it starts - it doesn't prevent the audio thread entering shared memory space after that). In theory there might also be low/high priority thread conflicts possible, but that may need addressing separately. The solutions are: 1 - to devise a proper locking scheme, which will by necessity require waiting in both threads (there is no reasonable chance to rearchitect the way parameters work for non-locking code - there's just too much baked into the design at this point. 2 - to queue parameters between threads (something like the above). 1 is the easier route. Doubtless some people will have opinions about the efficiency of such an approach, but personally I do not see how a not blocking route could work here (in the convolution objects I just don't process audio when certain things are changing but not processing audio just because any parameter is changing is not an option). Therefore IMO proper locks with fallback to sleeping or thread yielding should be employed. I would expect the performance loss to be minimal (or zero) - unless you are bombarding the objects with parameter changes. Often in audio discussions about threading I see people paranoid about locking, but then using solutions that don't guarantee coherent state, which means potential undefined behaviour. I can take the hand-rolled thread locks that I use elsewhere that are lightweight and tend to spin with microsleeps where needed and mock up something to test if that is helpful. This should be max only, because SC parameters are modified in the audio thread, pd is single-threaded and the CLI is also. |
(pretty sure looking in the debugger that a& is the parameter in the Max object). |
Given that mClient.process() is called in only 3 places I'm also pretty sure that thread locking could be used to solve all potential thread conflicts fairly simply. The steps would be:
|
FWIW - the spinlocks in the SDK actually use two small POD variables each of which is only written on one thread, but read on the other to create at least a slightly more robust lock than in flucoma - it's a valid field that really protects the data and that can only be set when not in perform (to make sure that it doesn't get set invalid or otherwise in the middle of an audio routine). |
I wasn't clear. I meant the ampersand should be removed (not that it makes things more thread safe, just less hilarious). MeanwhileThese are all shortcomings I was aware of but never had time to address, partly because once the dataset related objects come into the mix, with their arbitrary messages and shared state, everything gains a new gloss of horror. But thanks for laying them out clearly. My strong preference is for a queue. Strong enough that it may be read as insistence. I know that this isn't a quick fix to this issue.
I don't want to lose sight of someone actually reporting on the state of the objects as they hit the assertions. Please. |
ok thanks for the morning read :) I ran Alex's code but with stftpass instead (to remove potential issues unique to sines~) and a wider range of stft param changes. The code is below. It doesn't crash in overdrive+audio interrupt but it does after a certain time in overdrive without audio interrupt.
@MattS6464 and @lewardo are you getting the same result? |
I'm on the case, I'm setting up the debugger to spy on their state as we speak |
@tremblap's patch made my build crash automatically after a few seconds, before I could set up the debugger so I'll go ahead and run it again with the debugger attached to the process and report back |
This seems to me a meaningless thing to do - the state of the object is not coherent due to the threading issues. If you fix that (as I have done locally for testing) you never hit the asserts - what are you hoping to learn @weefuzzy that we don't already know?
I'm not sure how exactly how you would build a queue here because the parameters are different types, although I guess you could maybe queue the atoms and service that in the correct place, although what the correct place is would be unclear, and you'd need to update something immediately to make sure the attributes read immediately. Could I suggest that we test the locking route (which I already have coded) and look at performance, rather than leave crashes in for an unspecified amount of time. Oh course, **if the performance is insufficient then the queue could be looked at ** (and windows will likely perform differently to Mac so both need heavy testing), but I have done a lot of threading and often use locks simply because if you make them lock for short amounts of time they are easy to reason about and easier to gt right that fancier non-locking or non-blocking methods. |
Whether, in addition to these concurrency issues, the management of max sizes set at initialisation time is also screwed up. Re queue: I would enqueue a whole parameter set or its underlying tuple, I would think. But the type problems do indeed remain for the threading issues that pertain to all the dataset objects (though, thankfully, with less involvement of the audio thread thus far). I'm obviously not going to stop anyone trying something out, but am strongly icky about anything that implies the audio thread would sleep, and dubious about the idea that this is easy to reason about. |
Ah, so this is a different assert to the one on |
They don't appear to be.
It is far far easier to reason about that than non-blocking arguments or queues. Mutexes in audio thread exist in loads of audio software you have used, it's just that there is a modern trend towards the terms "non-blocking" or "lock-free" being considered essential, without the added knowledge of how to keep complex overall state coherent, which is a much harder problem than sending single streams of values in a non-blocking fashion. That would be better discussed on a call - given that it is something I have strong feelings about. The worst results of that are non-working code with the notion that it is "better". FrameLib locks in all threads at some points (whenever it needs to reach out to the core memory allocator) - the trick is to lock for short amounts of time (say around writing a few pieces of POD) in non-audio threads, which means those threads are blocked for a minimal amount of time. There is no point in queuing the whole state, in the sense of a queue with more than one slot (as you should always have the last state), but that is one option (a queue of length 1). It doesn't, however necessarily solve data races between low and high-priority threads. The harder part is actually that you need to transfer that state between threads, and to be honest I'd likely write a lock to do that - It's possible I could also mock that up sometime soonish in some lock-free way, but it involves a lot more thought. |
You will see different asserts/crashes until the threading is fixed - the state of the object is non-deterministic. |
@AlexHarker so you're saying tracking the individual failed asserts in each case would not lead to any global fix, until the state has been made more robust and detreministic when being accessed between threads ? |
OK - please can we get some testing on this: https://github.com/AlexHarker/flucoma-max/tree/threadsafe-parameters @lewardo / @MattS6464 - checking that it doesn't crash would be good (it doesn't crash here) @tremblap - what you would be well placed to do is to compile in release (not debug) and thens stress test performance of both audio objects and dataset objects (for now). At the moment I'm not locking against non-audio thread processing as this is a minimal set of changes to fix the crashing, but there are some other places we should add locks I think. |
I'm on it, I'll let you know what happens on that branch build, is there one specific object that I can compile so I don't have to recompile everything ? |
That is my position, yes. @weefuzzy may still have concerns, but I cannot personally see how we can reason about state if it is non-deterministic. |
BTW - the reason I use my own thread locks (rather than generic mutexs) is that it's then controllable how long they spin, periodically check (busy wait) and then fall to sleep checking on wake - a generic lock might make use of thread signalling or something more heavy duty. Never sleeping/yielding is actually bad because it can prevent threads from progressing do to lock contention. |
Yes - fluid.sines~ or any other FFT based audio object (which is then the one you use for testing) |
With |
the |
@lewardo - can I get the assert please or crashing line? - I suspect what you are now seeing is: but I'd need to know what the assert/crash is to be sure. |
it was I can get a screenshot of the crash if I rerun it if you need it, let me know if you need the extra verbosity |
Can you rerun and screenshot. That's not the one I was hoping for sadly... |
update: now the same error as flucoma/flucoma-core#236, on @AlexHarker's |
I can just retitle this issue if that's better and close the other two once we've confirmed they are also threading-related? |
maybe refrain before closing #387, both @MattS6464 and myself have found that this (@AlexHarker's branch) fix in conjunction with the flucoma/flucoma-core#238 pr doesnt fix #387 and it still instacrashes on both windows and mac |
My thinking is that, title notwithstanding, the opening posts of this issue just describe symptoms and there's quite a bit of scattered reading to do before one gets to causes. Because this is a pretty consequential change (IMO), I think future maintainers will be grateful for having a more succinct account of the issue to refer to (not least because we know similarly ameliorative work on dataset objects remains as a treat for these future maintainers, so understanding the history of whatever PR addresses this problem will be A Good Thing) |
I will make. new issue once we are well into the way to a fix and that one will be more succinct and less ranty (from me at least). |
Crashes occur in the help files of
fluid.melbands~
,fluid.sinefeature~
andfluid.mfcc~
. To recreate the crash, enter the help file for any of these objects, turn on audio, then rapidly increase the third parameter of fftsettings. This is occurring on a debut/test build.The text was updated successfully, but these errors were encountered: