realloc()-invalid next size error during repopulate() in 3D models #709

Open
tiannh7 opened this issue Dec 24, 2024 · 5 comments

@tiannh7

tiannh7 commented Dec 24, 2024

Hi,

I am encountering a recurring issue when running 3D models using Underworld2. The error seems to be related to memory allocation (realloc()), as detailed below. Interestingly, this issue only occurs in 3D simulations and never in 2D models.

When running a 3D model and calling repopulate() as part of the simulation workflow, the program crashes intermittently with the following error:

[2024-12-24 01:05:42] Swarm repopulating with voronoi swarm, style is default. previous total count is 20067275
realloc(): invalid next size
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x...]
[1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x...]
[2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x...]
[3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x...]
[4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x...]
[5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x...]
[6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa4c7c)[0x...]
[7] /usr/lib/x86_64-linux-gnu/libc.so.6(realloc+0x122)[0x...]
[8] libPICellerator.so(_DVCWeights_UpdateBchain+0x2d2)[0x...]
[9] libPICellerator.so(_DVCWeights_CreateVoronoi3D+0x1a0)[0x...]
[10] libPICellerator.so(_PCDVC_Calculate3D+0x29c)[0x...]
[11] libPICellerator.so(_PCDVC_Calculate+0x35)[0x...]
[12] libPICellerator.so(WeightsCalculator_CalculateAll+0x11f)[0x...]
...

This error occurs unpredictably, sometimes after hundreds of steps and sometimes after fewer or more. It seems to stem from a small omission whose effects accumulate until the crash. I have tried different parameters for repopulate() (instead of the default settings) and run the model on different machines, but the issue persists regardless of the parameters used.

I suspect the problem is in functions such as _DVCWeights_UpdateBchain or _DVCWeights_CreateVoronoi3D, which resize memory with realloc(). If a write lands beyond the bounds of an allocated block, it can corrupt the allocator's bookkeeping metadata stored next to that block, which is the kind of damage glibc reports as "invalid next size".
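To make the suspected failure mode concrete, here is a minimal sketch of a chunked, bounds-checked append. Everything in it (`Chain`, `chain_push`, `CHAIN_INC`) is hypothetical and only loosely modelled on the idea of growing a buffer in fixed increments; it is not Underworld's actual code. The point is that capacity must be checked before every write. If a write is ever performed without that check, it lands past the end of the block and a later realloc() may abort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk size, analogous in spirit to DVC_INC (assumption). */
#define CHAIN_INC 150

/* Hypothetical growable array; not Underworld's real data structure. */
typedef struct {
    int   *data;
    size_t count;     /* entries in use */
    size_t capacity;  /* entries allocated */
} Chain;

static void chain_init(Chain *c)
{
    c->data = NULL;
    c->count = 0;
    c->capacity = 0;
}

/* Append one value, growing by CHAIN_INC entries when the buffer is full.
 * Returns 0 on success, -1 if realloc fails (the chain is left unchanged). */
static int chain_push(Chain *c, int value)
{
    if (c->count == c->capacity) {
        size_t new_cap = c->capacity + CHAIN_INC;
        int *p = realloc(c->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        c->data = p;
        c->capacity = new_cap;
    }
    /* Invariant: there is always room for the write below. */
    assert(c->count < c->capacity);
    c->data[c->count++] = value;
    return 0;
}

static void chain_free(Chain *c)
{
    free(c->data);
    chain_init(c);
}
```

A code path that writes several entries after a single capacity check would break the invariant above. That is one way, though not the only one, to end up with this kind of heap corruption.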

Environment:
Underworld2 version: v2.15.0b0 release
Python version: 3.11.3
OS: Ubuntu 22.04

@julesghub
Member

Hi @tiannh7 ,
Thanks for the report. Invalid next size for realloc looks suspicious.
Interesting that the error occurs with different population parameters. I would expect a change to the population parameters would change when/how this error manifests.
Are you using the UWGeodynamics layer or straight underworld? Be aware the population interface is different between the two.

A debug idea: you could reduce the size of the 3D model's mesh, which would naturally decrease the particle population. What happens if you do this?

Another thing to check for is 'unexpected' flow occurring just before the error. An example of this is a cell without a particle, i.e. an 'empty' cell. This is a condition Underworld can't deal with. Is there outflow in the model, or are there stagnation points?

Let me know.

@tiannh7
Author

tiannh7 commented Jan 7, 2025

Hi @julesghub,

Thank you for your response and for looking into this issue. I'd like to provide some updates and clarifications based on your suggestions:

  1. Population Parameters: Indeed, using different population parameters does seem to affect when the error occurs. For example, even when I explicitly call repopulate() at every step, the crash still happens, though the specific step varies (e.g., step 300, step 600, or others).

  2. Underworld2 vs. UWGeodynamics: I am using a lightly wrapped version of Underworld2, not UWGeodynamics. So the population interface differences you mentioned should not be relevant in this case.

  3. Empty Cells: I can confirm that "empty" cells are not the issue here. If such a situation occurred, Underworld would typically raise a different error (e.g., "This can occur when there are no particles found in a given element"). That specific scenario does not apply to this case.

  4. Resolution: I managed to resolve the issue a few days ago by changing #define DVC_INC 150 in DVCWeights.h to 500. I haven't tested other values between 150 and 500, but I believe increasing this value helps avoid issues related to memory reallocation, such as overwriting allocated memory or excessive fragmentation caused by repeated reallocation.
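As a rough illustration of why a larger increment changes behaviour, the sketch below counts how often a buffer that grows by a fixed number of entries must be resized. It assumes, without having confirmed it against the DVC source, that DVC_INC plays the role of `inc` here; `count_resizes` is a hypothetical helper, not an Underworld function.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many times a buffer that starts empty and grows by `inc`
 * entries whenever it is full must be resized to accept n appends.
 * Hypothetical model only; no memory is actually allocated. */
static size_t count_resizes(size_t n, size_t inc)
{
    size_t capacity = 0;
    size_t resizes = 0;

    assert(inc > 0);
    for (size_t count = 0; count < n; count++) {
        if (count == capacity) {
            capacity += inc;
            resizes++;
        }
    }
    return resizes;
}
```

Under this model, 1000 appends need 7 resizes with an increment of 150 and 2 with an increment of 500. Fewer resizes also means fewer passes through the growth path, so if the underlying cause is an out-of-bounds write, a larger increment could be hiding it rather than removing it.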

I hope this information is useful for understanding and potentially addressing this issue in future versions. Please let me know if I can assist further or provide additional details!

Best,

@julesghub
Member

Thanks for the detailed response.
Fantastic that you have found a resolution. DVC_INC affects the chunking of the Voronoi cell algorithm, which is used for calculating particle weights and subsequently for population control.
The DVC code is old and, in general, robust. I'm curious why we haven't observed this error with previous 3D models. Would you be able to share the 3D model with me? (Perhaps via a private repo/gist if there is sensitive information in the model.)

While I'm looking at that, feel free to make a PR with this change to UW and we can check how the modification affects other models.

cheers,
J

@tiannh7
Author

tiannh7 commented Jan 7, 2025

Thank you for your response and for explaining how DVC_INC affects the Voronoi cell algorithm. I really appreciate your insights and time spent on this issue.

Regarding sharing the 3D model, I’d like to first ensure that this issue is not specific to my particular setup or model. To do so, I am currently working on creating a simpler and more general example that reproduces the same error. This approach should help confirm whether the problem is more broadly applicable and make it easier to investigate without the complexities of my original model.

I just need a little more time to finalize this test case, as I want to make sure it reliably reproduces the issue. Once I have it ready, I’ll share it here so we can further analyze the problem together.

Thank you for your patience and understanding! If there’s anything specific you’d like me to focus on while preparing the example, feel free to let me know.

best,
tiannh

@julesghub
Member

That sounds great! A simplified example is most welcome. Nothing specific from my end yet.
