Instacrash with fluid.mlpclassifier when trying to fit something #364

Open
rconstanzo opened this issue Apr 22, 2023 · 19 comments

@rconstanzo

As mentioned on the Discourse thread, I got an (unrepeated) crash when trying to fit some data with fluid.mlpclassifier.

I'm attaching the isolated bit of the patch in question, along with the data/labels I was using at the time, plus the crash report.

[screenshot attachment: Screenshot 2023-04-22 at 2 50 07 PM]

This is, I believe, the crash-y bit:

12  fluid.libmanipulation         	       0x1320628dc Eigen::DenseStorage<double, -1, -1, -1, 1>::resize(long, long, long) + 80
13  fluid.libmanipulation         	       0x1322763d8 Eigen::internal::product_evaluator<Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0>, 8, Eigen::DenseShape, Eigen::DenseShape, double, double>::product_evaluator(Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const&) + 108
14  fluid.libmanipulation         	       0x132276048 void Eigen::internal::call_dense_assignment_loop<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Transpose<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const, Eigen::Replicate<Eigen::Matrix<double, -1, 1, 0, -1, 1>, 1, -1> const> >, Eigen::internal::assign_op<double, double> >(Eigen::Matrix<double, -1, -1, 0, -1, -1>&, Eigen::Transpose<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const, Eigen::Replicate<Eigen::Matrix<double, -1, 1, 0, -1, 1>, 1, -1> const> > const&, Eigen::internal::assign_op<double, double> const&) + 40
15  fluid.libmanipulation         	       0x132275b34 fluid::algorithm::NNLayer::forward(Eigen::Ref<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, Eigen::Ref<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >) const + 136
16  fluid.libmanipulation         	       0x132275880 fluid::algorithm::MLP::forward(Eigen::Ref<Eigen::Array<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, Eigen::Ref<Eigen::Array<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, long, long) const + 344
17  fluid.libmanipulation         	       0x1322743bc fluid::algorithm::SGD::train(fluid::algorithm::MLP&, fluid::FluidTensorView<double, 2ul>, fluid::FluidTensorView<double, 2ul>, long, long, double, double, double) + 2060
18  fluid.libmanipulation         	       0x13229ec8c fluid::client::mlpclassifier::MLPClassifierClient::fit(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) + 1748
19  fluid.libmanipulation         	       0x1322b29f8 auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>)::operator()('lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) const + 96
20  fluid.libmanipulation         	       0x1322b27e0 fluid::client::Message<auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >::operator()(auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) const + 80
21  fluid.libmanipulation         	       0x1322b255c _ZNK5fluid6client10MessageSetINSt3__15tupleIJNS0_7MessageIZNS0_11makeMessageINS0_13MessageResultIdEENS0_13mlpclassifier19MLPClassifierClientEJNS0_15SharedClientRefIKNS0_7dataset13DataSetClientEEENSA_IKNS0_8labelset14LabelSetClientEEEEEEDaPKcMT0_FT_DpT1_EEUlRS9_SE_SI_E_S7_S9_JSE_SI_EEENS4_IZNS5_INS6_IvEES9_JSE_NSA_ISG_EEEEESJ_SL_SR_EUlSS_SE_SW_E_SV_S9_JSE_SW_EEENS4_IZNS5_INS6_INS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEEEES9_JNS2_10shared_ptrIKNS0_13BufferAdaptorEEEEEESJ_SL_SR_EUlSS_S19_E_S15_S9_JS19_EEENS4_IZNS5_ISV_NS0_10DataClientINS8_17MLPClassifierDataEEEJEEESJ_SL_SR_EUlRS1E_E_SV_S1E_JEEENS4_IZNS0_11makeMessageINS6_IlEES1E_JEEESJ_SL_MSM_KFSN_SP_EEUlS1F_E_S1J_S1E_JEEES1N_NS4_IZNS5_INS6_INS3_IJNSZ_IcS11_N9foonathan6memory13std_allocatorIcNS_17FallbackAllocatorEEEEENS_11FluidTensorIlLm1EEEllddldEEEEES9_JS14_EEESJ_SL_SR_EUlSS_S14_E_S1X_S9_JS14_EEENS4_IZNS5_IS15_S1E_JEEESJ_SL_SR_EUlS1F_E_S15_S1E_JEEENS4_IZNS5_ISV_S1E_JS14_EEESJ_SL_SR_EUlS1F_S14_E_SV_S1E_JS14_EEES1Z_EEEE6invokeILm0EJRNS0_24NRTSharedInstanceAdaptorIS9_E12SharedClientERSE_RSI_EEEDcDpOT0_ + 144
22  fluid.libmanipulation         	       0x1322b1ef8 decltype(auto) fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >::invoke<0ul, fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>&, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>&>(fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>&, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>&) + 360
23  fluid.libmanipulation         	       0x1322b17a0 void fluid::client::FluidMaxWrapper<fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> > >::invokeMessageImpl<0ul, 0ul, 1ul>(fluid::client::FluidMaxWrapper<fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> > >*, symbol*, long, atom*, std::__1::integer_sequence<unsigned long, 0ul, 1ul>) + 172

crashbits.zip

@rconstanzo
Author

Got a second crash (also with a really long @hiddenlayers network).

Also, I've narrowed down when it happens: it seems I get the crash when I pause the training (i.e. toggle off the toggle in the loop) and then toggle it back on. The instacrash happens when toggling it back on.

crash2.zip

@tremblap
Member

A few observations:

  • it is a huge network. Have you PCA'd the datasets first to reduce the number of dimensions? I'm asking because I get the spinning wheel of death here when I start the patch, but no crash (see the sketch after this list);
  • the first crash seems Chromium-related;
  • the second crash is memory-allocation related, but maybe not FluCoMa...
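
For illustration, the dimension reduction being suggested would look something like this in SuperCollider (a hedged sketch mirroring the SC session further down this thread; the FluidPCA usage, the placeholder path and numDimensions: 20 are assumptions, not the actual settings):

~raw = FluidDataSet(s).read("/path/to/dataset.json");     // source dataset (placeholder path)
~reduced = FluidDataSet(s);                               // destination for the reduced data
~pca = FluidPCA(s, numDimensions: 20);                    // keep e.g. 20 principal components
~pca.fitTransform(~raw, ~reduced, { "PCA done".postln });
// then fit the MLP classifier on ~reduced instead of ~raw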

On my machine there is no memory leak after running it for 15 minutes without a crash. I thought of checking this since both crashes are linked to memory allocation... and I start and stop it, to no avail.

So that brings us to how we can help you help us help you: are you set up for compilation? If so, you would gain two things:

  • in dev mode, you could use objects that are a little more explicit when they crash (line numbers in the code), which helps us volunteer coders know where to look for problems in our ginormous code base;
  • in gig mode, you could have super-optimised versions of the objects, tailored to your actual hardware.

This comes at the expense of having to compile, and perhaps of some confusion over which version you are actually using. I have scripts to swap them at OS level, but that might not be exciting for you. In all cases, I'm happy to help.

Anyway, as I am unable to reproduce it, we are stalled. Let us know if you find something more reproducible.

@tremblap
Member

Now running for 45 minutes in 'test' compile mode, starting and stopping and resetting: still no crash.

@rconstanzo
Author

I'll see if I can get it to crash again.

I'm not saying the network is useful; I was just testing different structures to see what type/direction/style was better (maybe changing structures often, via the attrui, is a component of this?).

It could just be coincidence, but happening twice with the same object/process seems unlikely.

@tremblap
Member

If you don't mind, try this version of the object (keep the other one you have for real-life use), so if it crashes we'll know better whether it is fluid.verse-related and where from...
fluid.libmanipulation.mxo.zip

@tremblap
Member

tremblap commented May 4, 2023

@rconstanzo any more crash with my magic custom compile?

@rconstanzo
Author

I was in the UK teaching; I'll give it a test now. I haven't gotten any new crashes since, though (but I haven't been testing super long network structures either).

@tremblap
Member

tremblap commented Oct 9, 2024

any luck on this?

@rconstanzo
Author

I'll test the patch above a bit more, but I'm not actively using that large structure in any patches (partly from being scared off by this issue).

@rconstanzo
Author

rconstanzo commented Oct 9, 2024

Turns out I hadn't tested the new version you posted.

Running the above patch again (with delay 500 on my laptop, since it's much slower than my desktop), I got crashes with both the old and new versions of libmanipulation. And both pretty quickly too (as in, I let it do about 5 ticks of crunching, toggled off, then toggled back on to an instacrash).

Here are both crash reports (both are the same kind of thing as above):
crash with default.zip

crash with new.zip

I will add, perhaps unhelpfully, that the patch does not crash in a beta version of Max...

@tremblap
Member

OK, reading the log, it might be a memory thing again, only in Max. I'll recode it in debug SC and see if I can crash it at all. Stay tuned.

@tremblap
Member

OK, I'm running it in Max first, and I cannot make it crash... even with the default!

But, as I look at your code: you know that you are running in the high-priority thread, right? "But I put a deferlow", you will say... but not at the right place! [delay 60] promotes the post-process bang back to the scheduler!

To deal with Max's (awful) threading promotion and demotion, I am usually very disciplined now (thanks to @weefuzzy) and put the defer right after the potential promotion objects, [delay] in this instance (see the sketch below).
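
To make that ordering concrete, here is a rough patching sketch (illustrative only; everything other than [delay 60] and [deferlow] is a placeholder for whatever drives the training loop):

post-process bang
    |
[delay 60]      <- promotes the bang back to the scheduler (high-priority) thread
    |
[deferlow]      <- demote here, downstream of the delay, so the next fit starts at low priority
    |
next fit / counter logic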

I will run it in SC in the hope that I manage to crash it, but I think it might still be a Max memory/threading thing that we are discussing in another bug (and many others, actually) with @AlexHarker, only happening in Max.

@tremblap
Member

Also, a few NN hints: momentum that low is not helpful; it is a huge network; and reset does not reset the network, clear does (as per the manual).

@rconstanzo
Author

My thinking with the defer is that I wanted to wait until the process was "done" before moving on, with little interest in what thread was passed forward elsewhere (i.e. my intention was not to start the process in low priority, but to end it there).

Also, these specific settings aren't really working well (for a variety of reasons), but it shouldn't crash regardless.

I have managed to get the patch to crash on both of my computers (Intel and ARM).

@tremblap
Member

Yet it doesn't crash here; let's see if SC crashes.

@tremblap
Member

[mxj WhichThread] is your friend if you are in doubt. Always try to send long jobs to the low-priority thread.

@tremblap
Member

OK, I'm running it in Max (no crash so far) and SC (no crash so far) at the same time; my i9 is in full leafblower mode 🤣
I'll leave it for an hour, clearing the network once in a while.

sc code:

// load the dataset and labelset from the attached crash files
x = FluidDataSet(s).read("/Users/pa/Downloads/crashbits/dataset.json").print
y = FluidLabelSet(s).read("/Users/pa/Downloads/crashbits/labelset.json").print
// same huge, symmetric hidden-layer structure as the Max patch
z = FluidMLPClassifier(s, hiddenLayers: [95,85,75,65,55,45,35,25,15,25,35,45,55,65,75,85,95], activation: 3, learnRate: 0.1, momentum: 0.1, validation: 0, maxIter: 100)
z.fit(x, y, {|loss| loss.postln})

// refit in a loop: each fit hangs on a Condition until the previous one reports its loss
fork{var cond = Condition.new; {z.fit(x, y, {|loss| loss.postln; cond.unhang}); cond.hang}.loop}
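
For reference, the "clearing the network once in a while" step would look something like this in the session above (a minimal sketch; it assumes FluidMLPClassifier exposes the same clear message in SC as the Max object, and as noted earlier it is clear, not reset, that reinitialises the trained weights):

z.clear;                             // forget the trained weights
z.fit(x, y, {|loss| loss.postln});   // training restarts from freshly initialised weights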

@rconstanzo
Author

rconstanzo commented Oct 10, 2024

Wait, are you clear-ing each time you restart?

I was getting a crash by not doing that. As in, I would toggle on the crunching for a bit, then toggle it off, then toggle it back on again (boom crash).

I don't think I ever got a crash from just letting it run.

@tremblap
Member

OK: removing the [defer] I added, then starting and stopping, this crashes, in Max only. This points to @AlexHarker's investigations of thread safety in Max, which is piling up many other instacrashes... The good news for you, for now, is that if you put a defer in the right place, you get a stable patch. A solution for the bigger issue is going to happen, don't worry.
