Error during training on Apple M2 Ultra #1781

pbrec · 2024-05-22T16:02:49Z

Hi,

I am getting an error when training a multi-animal top-down model on an Apple M2 Ultra:

2024-05-22 11:44:09.809458: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-22 11:44:09.809751: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
2024-05-22 11:44:10.034870: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2024-05-22 11:44:10.702358: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 1080 } dim { size: 1080 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" model: "0" num_cores: 24 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 32 } dim { size: 32 } dim { size: 1 } } }

Any ideas how to solve this?

Help is much appreciated.

Philipp

talmo · 2024-05-22T21:29:40Z

Maintainer

Hi @pbrec,

The logs you pasted are actually all just informational messages and warnings and can be safely ignored, but I imagine there's likely an error a few lines down that starts with "(0) INTERNAL: Missing 0-th output from".
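To make that distinction easier to check on a long log, here is a minimal sketch that sorts lines by the severity letter TensorFlow prints after the timestamp (I, W, E or F). The helper names `classify` and `non_routine_lines` are made up for illustration and are not part of SLEAP or TensorFlow.

```python
import re

# TensorFlow's C++ log lines look like:
#   "<date> <time>: <LEVEL> <file>:<line>] <message>"
# where LEVEL is I (info), W (warning), E (error) or F (fatal).
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} [\d:.]+: ([IWEF]) ")

LEVEL_NAMES = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def classify(line):
    """Return the severity name of a TensorFlow log line, or "other"."""
    match = LOG_LINE.match(line)
    return LEVEL_NAMES[match.group(1)] if match else "other"


def non_routine_lines(log_text):
    """Keep non-empty lines that are not plain info or warning messages."""
    return [
        line
        for line in log_text.splitlines()
        if line.strip() and classify(line) not in ("info", "warning")
    ]
```

Anything that comes back from `non_routine_lines` (Python tracebacks, failed assertions, E or F lines) is what is worth pasting into a report.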

This has been an issue for a little while now (#1100), but unfortunately we still don't have a fix. Due to how Apple implemented support for TensorFlow on Apple Silicon, some operations behave differently than on other platforms, and they seem to do so in a way that breaks our top-down models.

The error occurs when an entire batch of images (4 by default) has no detected centroids. If you're in the early stages of training a model, this might be because your centroid model isn't performing well enough and could be improved with more labeled data. If you have a good centroid model, it might be that you have frames where there really shouldn't be any detections (e.g., animals leave the FOV or are all simultaneously occluded), in which case there's not much to be done.

All other model types seem to work, so a potential workaround is to try bottom-up if it makes sense for your data.

Another is to try increasing the batch size at inference time so that you decrease the chances that you have a batch with no centroids.
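As a rough illustration of why a larger batch helps, here is a small sketch. It assumes each frame is empty (no centroid detected) independently with the same probability, which is a simplification. The function name `prob_empty_batch` and the 10% figure are made up for the example.

```python
def prob_empty_batch(empty_fraction, batch_size):
    """Probability that every frame in one batch has no detected centroid.

    Assumes frames are empty independently with probability
    `empty_fraction`, so the result is empty_fraction ** batch_size.
    """
    if not 0.0 <= empty_fraction <= 1.0:
        raise ValueError("empty_fraction must be between 0 and 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return empty_fraction ** batch_size


# Hypothetical example: 10% of frames have no detections.
for batch_size in (4, 8, 16):
    print(batch_size, prob_empty_batch(0.1, batch_size))
```

In real videos empty frames tend to come in runs (e.g., when the animals leave the FOV), so the true chance of an all-empty batch is higher than this estimate, and a larger batch only reduces the risk rather than removing it.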

We are in the process of transitioning away from TensorFlow, so unfortunately we won't have the bandwidth to fix TensorFlow-specific issues for the time being, but let us know if you have any questions or need help with workarounds!

Cheers,

Talmo

3 replies

pbrec May 22, 2024
Author

Hi Talmo,
Thanks for the fast response. I actually do not find the Missing 0-th output error. It seems like the error is somewhere else, since it also crashes when I try to train the bottom-up model. Here are the reports just before the error message occurs:

2024-05-22 17:43:18.450777: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
2024-05-22 17:44:18.530552: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
/AppleInternal/Library/BuildRoots/91a344b1-f985-11ee-b563-fe8bc7981bff/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:903: failed assertion `Error: Input feed tensor not found in placeholders, tensor corresponds to operation: mps_placeholder'
/Users/pbrand/mambaforge3/envs/sleap/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Run Path: /Users/pbrand/science/tracking/sleap/models/240522_174310.multi_instance.n=20

talmo May 22, 2024
Maintainer

Hi @pbrec,

Oh man, it looks like this is a totally new issue that we haven't run into before! I can't even find anything on this particular error other than an unanswered post on Apple's dev forums from last year...

So sorry about that! We would generally troubleshoot by asking for some additional information to reproduce the bug, but as I said, we'll be transitioning away from TensorFlow, which should bypass this issue.

In the meantime though, let us know if you need help with a workaround!

Talmo

Answer selected by pbrec

pbrec May 23, 2024
Author

Hi @talmo,

That is bad luck I guess. I would be grateful if you could help me find a workaround. Please let me know what type of information you'd need from my side.

Thank you,
Philipp
