Tensorflow deadlock in ARM after exception in job #1077

Open
aloeliger opened this issue Mar 8, 2023 · 0 comments
Labels
Phase-1 (pertains to phase-1 development), Software Error (bug causing segfaults, memory leaks, run time errors, or compilation problems)

What

Taken from cms-sw#40996

TensorFlow jobs can hang and throw out-of-memory exceptions in some relval matrix jobs.

Further investigation shows that L1TMuonEndCapTrackProducer::produce() may be creating extraneous TensorFlow sessions, and that the module does not create its own TensorFlow session at module level.

cms-sw#40996 (comment)
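
To make the suspected pattern concrete, here is a minimal sketch of the difference between creating a session on every `produce()` call and owning a single session at module level. Everything in it is an assumption for illustration: `FakeSession`, `PerEventProducer`, `ModuleLevelProducer`, and `sessionsCreatedFor` are hypothetical names, the code does not use the TensorFlow or CMSSW APIs, and it is not taken from the actual L1TMuonEndCapTrackProducer implementation.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an expensive inference session (NOT the TensorFlow API).
// It only counts how many instances were created and how many are still alive.
struct FakeSession {
  static int created;
  static int alive;
  FakeSession() { ++created; ++alive; }
  ~FakeSession() { --alive; }
  FakeSession(const FakeSession&) = delete;
  FakeSession& operator=(const FakeSession&) = delete;
  float run(float x) const { return 2.0f * x; }  // placeholder for inference
};
int FakeSession::created = 0;
int FakeSession::alive = 0;

// Pattern suspected in the issue: a new session is built inside every produce() call.
class PerEventProducer {
public:
  float produce(float input) {
    FakeSession session;  // created and destroyed once per event
    return session.run(input);
  }
};

// Alternative: the module owns one session for its whole lifetime.
class ModuleLevelProducer {
public:
  ModuleLevelProducer() : session_(std::make_unique<FakeSession>()) {}
  float produce(float input) const {
    assert(session_ != nullptr);
    return session_->run(input);
  }

private:
  std::unique_ptr<FakeSession> session_;
};

// Runs nEvents produce() calls and returns how many sessions were created meanwhile.
template <typename Producer>
int sessionsCreatedFor(int nEvents) {
  const int before = FakeSession::created;
  Producer producer;
  for (int i = 0; i < nEvents; ++i) {
    producer.produce(static_cast<float>(i));
  }
  return FakeSession::created - before;
}
```

Calling `sessionsCreatedFor<PerEventProducer>(n)` creates one session per event, while `sessionsCreatedFor<ModuleLevelProducer>(n)` creates exactly one regardless of `n`. With a real session, each creation can allocate threads and memory, so the per-event pattern is a plausible source of the resource pressure described above, though that remains to be confirmed against the relval logs.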

Recipe to Recreate

TBD. Check the log for the relval job.

aloeliger added the Phase-1 and Software Error labels on Mar 8, 2023