Cryptic errors when CNMFE params are incompatible #1185 (Open)

kushalkolar opened this issue Oct 11, 2023 · 10 comments

@kushalkolar (Collaborator)

I was troubleshooting the mesmerize-core CI and ran into this; it's reproducible on the CNMFE demo notebook by setting gSig = (10, 10) and gSiz = (41, 41), with everything else unchanged. These params are quite far from ideal for this movie, but it's weird that a pickling error is thrown:

The reason is that gSiz > rf, but this type of error is very cryptic. PR incoming...
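
For illustration, here is a hypothetical fail-fast guard that would surface this directly (the parameter names follow CaImAn's params object, but the check itself is a sketch, not the contents of the PR):

# Hypothetical guard, not the actual PR: raise a readable error up front
# instead of letting the workers die with a pickling error.
gSiz = (41, 41)  # bounding-box half-size, as set above
rf = 20          # patch half-size; illustrative value

if rf is not None and max(gSiz) > rf:
    raise ValueError(
        f"gSiz={gSiz} exceeds the patch half-size rf={rf}; "
        "increase rf or decrease gSig/gSiz"
    )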

Side note: dview.map_async(cnmf_patches, args_in).get(4294967) passes the magic number 4294967 (which happens to be 2^32 / 1000, truncated); curious why that's the chunk size?

---------------------------------------------------------------------------
MaybeEncodingError                        Traceback (most recent call last)
Cell In[32], line 2
      1 cnm = cnmf.CNMF(n_processes=n_processes, dview=dview, Ain=Ain, params=opts)
----> 2 cnm.fit(images)

File ~/repos/CaImAn/caiman/source_extraction/cnmf/cnmf.py:594, in CNMF.fit(self, images, indices)
    589 if not isinstance(images, np.memmap):
    590     raise Exception(
    591         'You need to provide a memory mapped file as input if you use patches!!')
    593 self.estimates.A, self.estimates.C, self.estimates.YrA, self.estimates.b, self.estimates.f, \
--> 594     self.estimates.sn, self.estimates.optional_outputs = run_CNMF_patches(
    595         images.filename, self.dims + (T,), self.params,
    596         dview=self.dview, memory_fact=self.params.get('patch', 'memory_fact'),
    597         gnb=self.params.get('init', 'nb'), border_pix=self.params.get('patch', 'border_pix'),
    598         low_rank_background=self.params.get('patch', 'low_rank_background'),
    599         del_duplicates=self.params.get('patch', 'del_duplicates'),
    600         indices=indices)
    602 self.estimates.bl, self.estimates.c1, self.estimates.g, self.estimates.neurons_sn = None, None, None, None
    603 logging.info("merging")

File ~/repos/CaImAn/caiman/source_extraction/cnmf/map_reduce.py:250, in run_CNMF_patches(file_name, shape, params, gnb, dview, memory_fact, border_pix, low_rank_background, del_duplicates, indices)
    248 if dview is not None:
    249     if 'multiprocessing' in str(type(dview)):
--> 250         file_res = dview.map_async(cnmf_patches, args_in).get(4294967)
    251     else:
    252         try:

File /usr/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f2db6da91d0>'. Reason: 'PicklingError("Can't pickle <class '_flapack.error'>: import of module '_flapack' failed")'
kushalkolar changed the title from "Weird errors when CNMFE params are quite off" to "Cryptic errors when CNMFE params are incompatible" on Oct 11, 2023
@kushalkolar (Collaborator, Author) commented Oct 11, 2023

And there's another one I'm trying to figure out!

Traceback (most recent call last):
  File "/home/kushal/repos/mesmerize-core/mesmerize_core/algorithms/cnmfe.py", line 89, in run_algo
    cnm = cnm.fit(images)
          ^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/cnmf.py", line 594, in fit
    self.estimates.sn, self.estimates.optional_outputs = run_CNMF_patches(
                                                         ^^^^^^^^^^^^^^^^^
  File "/home/kushal/repos/CaImAn/caiman/source_extraction/cnmf/map_reduce.py", line 476, in run_CNMF_patches
    B_tot = np.array(B_tot, dtype=np.float32)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: float() argument must be a string or a real number, not 'coo_matrix'

EDIT: So this happens if nb=0 is not explicitly specified. Weird because our tests were passing in March without this being explicitly specified.
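
For anyone hitting this, a sketch of setting it explicitly, following the CNMFParams pattern from the demo notebooks (the method_init value is the usual CNMF-E one; all other params omitted here):

from caiman.source_extraction.cnmf.params import CNMFParams

# CNMF-E uses no global background components; setting nb=0 explicitly
# avoids the coo_matrix TypeError above.
opts = CNMFParams(params_dict={"method_init": "corr_pnr", "nb": 0})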

@EricThomson (Contributor) commented Oct 11, 2023

I have often wondered about that 4_294_967. It would be good to document that line of code, in particular that magic number. I finally took the time to research it -- looking up stuff here:
https://docs.python.org/3/library/multiprocessing.html
https://superfastpython.com/multiprocessing-pool-map_async/

For people reading this (e.g., me) in the future: map_async() is a non-blocking function (as opposed to map()). get() returns the result, and its argument says how long to wait, in seconds, before raising a multiprocessing.TimeoutError. Setting it that large (basically 50 days!) effectively says "go forever". The thing is, we could just omit the argument entirely and the behavior would be the same. Or we could be more explicit and write timeout=XX so it is clearer.
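
A minimal standalone demo of that behavior (pool size, worker function, and timeout values are illustrative):

from multiprocessing import Pool, TimeoutError

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        # map_async() returns an AsyncResult immediately (non-blocking),
        # unlike pool.map(), which blocks until all results are ready.
        result = pool.map_async(square, range(10))
        try:
            # get() blocks until the results arrive; the optional timeout
            # (in seconds) raises multiprocessing.TimeoutError if exceeded.
            print(result.get(timeout=5))
        except TimeoutError:
            print("workers did not finish within 5 s")
        # result.get() with no argument waits forever, which is effectively
        # what .get(4294967) (~49.7 days) does in map_reduce.py.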

Anyway, glad you brought this up, because that has always confused me, but I never took the time to actually research it; I assumed it was some memory management thing 😄

@pgunn (Member) commented Oct 11, 2023

Probably worth removing the argument, now that we understand what it is.

@kushalkolar (Collaborator, Author)

Ah ok! It's timeout, not chunk_size! 😄

Maybe the reason they used that number is that 4294967 is 2^32 / 1000, i.e. 2^32 milliseconds expressed in seconds, and the original author treated that as the maximum timeout that could be specified without overflow?

Maybe that timeout exists for a reason; perhaps there's actually some delay in getting the output from map_async, and the arg was put there to make sure the result is eventually retrieved.

I've sometimes heard of people who've had CNMF hang indefinitely, maybe this could've been the source?

Anyways, maybe we want to keep that arg but make it something reasonable, like 1 minute? Or a few minutes, to be safe? run_CNMF_patches is deeply nested, so I kinda don't want to benchmark that thing to figure out what the real get() times are :)

@EricThomson (Contributor) commented Oct 12, 2023

> I kinda don't want to benchmark that thing to figure out what the real get() times are

🤣 I will be the benchmarking hero.

jk, nope: it will always be a guessing game. That's probably what they did, too: picked the most unreasonably long time so it would never get a complaint 😄 Unless Pat does some AWS magic profiling with it.

@kushalkolar (Collaborator, Author) commented Oct 12, 2023

I found an issue where it does indeed hang here, but it's from 2018, so I'm not sure how applicable it is: #301

Note: that person's real issue was bizarre (Python 2 and 3 mixed, how is that even possible?). But maybe get() will just hang in weird situations.

Anyways, multiprocessing will throw a TimeoutError if the timeout is exceeded, so we could do something reasonable like 5 minutes and leave it at that?
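
Roughly something like this (a self-contained sketch with a stand-in worker and pool; the 5-minute value is illustrative, not a committed change):

from multiprocessing import Pool, TimeoutError

PATCH_TIMEOUT = 300  # illustrative: 5 minutes

def work(x):  # stand-in for cnmf_patches
    return x + 1

if __name__ == "__main__":
    with Pool(2) as dview:  # stand-in for CaImAn's multiprocessing dview
        async_res = dview.map_async(work, range(8))
        try:
            file_res = async_res.get(timeout=PATCH_TIMEOUT)
        except TimeoutError:
            # Fail loudly instead of hanging indefinitely.
            raise RuntimeError(
                f"no patch results within {PATCH_TIMEOUT}s; something is "
                "probably wrong (crashed or deadlocked worker?)"
            ) from None
        print(file_res)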

@pgunn (Member) commented Oct 12, 2023

I don't think a shorter timeout really solves anything (probably); it'd just give us a chance to spit out some kind of "I don't know what happened, but things broke in this here code" message. It's probably always a bug if we pick a timeout and it gets exceeded.

@kushalkolar (Collaborator, Author)

Yeah, rather than hanging, it'd basically be an indicator of "this is taking much longer than it should, something is probably wrong", versus the user having to do a manual keyboard interrupt because they noticed it was running for too long.

@kushalkolar (Collaborator, Author)

So gSiz > rf is now caught in #1186, but we still need to decide on a timeout.

@ethanbb (Contributor) commented Aug 10, 2024

Looking back at the error:

TypeError: float() argument must be a string or a real number, not 'coo_matrix'

It seems this happens if nb != 0 AND low_rank_background is False (the default is True):

# in map_reduce.py
if count_bgr == 0:
    # corresponds to nb == 0
    # OK
    ...
elif low_rank_background is None:
    # OK
    ...
elif low_rank_background:
    # OK
    ...
else:
    ...
    # error happens here (lines 456-457); B_tot starts as a scipy.sparse.csc_matrix:
    B_tot /= nB
    B_tot = np.array(B_tot, dtype=np.float32)

Maybe scipy.sparse used to support that, I don't know. But to fix it, assuming a dense array is actually necessary there, we should just replace that last line with B_tot = B_tot.toarray().astype(np.float32).
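
A minimal repro of the failure mode and the proposed fix (behavior as of recent numpy/scipy releases; older versions may differ):

import numpy as np
from scipy.sparse import coo_matrix

B_tot = coo_matrix(np.ones((4, 3)))

# np.array() does not densify scipy sparse matrices; with a float dtype it
# tries to coerce the matrix object itself and raises the TypeError above.
try:
    np.array(B_tot, dtype=np.float32)
except TypeError as e:
    print(e)  # float() argument must be a string or a real number, not 'coo_matrix'

# Proposed fix: densify explicitly, then cast.
B_dense = B_tot.toarray().astype(np.float32)
print(B_dense.dtype, B_dense.shape)  # float32 (4, 3)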
