Hi,
I am getting an error when training a multi-animal top-down model on an Apple M2 Ultra:
2024-05-22 11:44:09.809458: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
Any ideas how to solve this? Help is much appreciated.
Philipp
Hi @pbrec,
The logs you pasted are actually all just warnings and can be safely ignored, but I imagine there's likely an error that starts with [...]
This has been an issue for a little while now (#1100), but unfortunately we still don't have a fix. Due to how Apple implemented support for TensorFlow on Apple Silicon, some operations behave differently than on other platforms, and they seem to do so in a way that breaks our top-down models.
The error occurs when an entire batch of images (4 by default) has no detected centroids. If you're in the early stages of training a model, this might be because your centroid model isn't performing well enough and could be improved with more labeled data. If you have a good centroid model, it might be that you have frames where there really shouldn't be any detections (e.g., animals leave the FOV or are all simultaneously occluded), in which case there's not much to be done.
All other model types seem to work, so a potential workaround is to try bottom-up if it makes sense for your data. Another is to increase the batch size at inference time, which lowers the chance of getting a batch with no centroids.
We are in the process of transitioning away from TensorFlow, so unfortunately we won't have the bandwidth to fix TensorFlow-specific issues for the time being, but let us know if you have any questions or need help with workarounds!
Cheers,
Talmo
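For a rough feel of why a larger batch size helps, here is a minimal sketch. It is not SLEAP code: the function name `prob_all_empty` and the example rate of empty frames are made up for illustration, and it assumes frames are independent, which is optimistic because frames without animals tend to come in consecutive runs.

```python
def prob_all_empty(p_empty: float, batch_size: int) -> float:
    """Chance that every frame in one batch has no detected centroid.

    Assumes each frame is empty independently with probability p_empty.
    This is an idealization: real empty frames usually cluster in time.
    """
    if not 0.0 <= p_empty <= 1.0:
        raise ValueError("p_empty must be between 0 and 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return p_empty ** batch_size


if __name__ == "__main__":
    p_empty = 0.2  # hypothetical fraction of frames with no detections
    for batch_size in (4, 8, 16, 32):
        p = prob_all_empty(p_empty, batch_size)
        print(f"batch_size={batch_size:>2}: P(all frames empty) = {p:.3e}")
```

Under these assumptions the chance of an all-empty batch shrinks geometrically with batch size, so this only reduces how often the error is hit. A long stretch of empty frames can still fill a whole batch.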
Hi @pbrec,
Oh man, it's looking like this is a totally new issue we hadn't run into before! I can't even find anything on this particular error other than an unanswered post on Apple's dev forums from last year...
So sorry about that! We would generally troubleshoot by asking for some additional information to reproduce the bug, but as I said, we'll be transitioning away from TensorFlow, which should bypass this issue.
In the meantime though, let us know if you need help with a workaround!
Talmo