Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix shared memory permission issue in a shared pod environment #813

Merged
merged 43 commits into from
Nov 1, 2024

Conversation

XiaohanZhangCMU
Copy link
Contributor

@XiaohanZhangCMU XiaohanZhangCMU commented Oct 25, 2024

Description of changes:

We run into read permission issues when the shared memory is created in a shared computer environment. We relax the retry condition to allow the creation of a new shared memory file under permission error. When creating a shared memory file, we make sure both LOCALS and FILELOCKS do not overlap with existing ones and they have the same prefix ints in the name.

What makes it more complicated is when a program exited non-normally, some of the shared meomory files are cleaned up but some are not. To address that, we need to make sure all files in SHM_TO_CLEAN do not have duplicates.

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure it pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

A complete list of testings done for this PR is put in this doc.

Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems fine to me, but are you able to reproduce this with an example where this used to fail, and then installing streaming off this branch now succeeds?

streaming/base/shared/prefix.py Outdated Show resolved Hide resolved
streaming/base/shared/prefix.py Outdated Show resolved Hide resolved
@snarayan21
Copy link
Collaborator

closing after offline discussion with @XiaohanZhangCMU

@snarayan21 snarayan21 closed this Oct 26, 2024
@snarayan21 snarayan21 reopened this Oct 26, 2024
@XiaohanZhangCMU XiaohanZhangCMU force-pushed the xiaohan/flexible_retry_shm_prefix branch from 4e0d001 to 98ac253 Compare October 28, 2024 18:52
Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine to me...i think. would like to see testing

streaming/base/dataset.py Show resolved Hide resolved
streaming/base/shared/prefix.py Show resolved Hide resolved
@XiaohanZhangCMU
Copy link
Contributor Author

@karan6181 @snarayan21 @bigning testings are done. PR is ready for review then merge. PTAL.

Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple questions, but seems fine to me overall. Deferring to @karan6181 and @bigning on approval

streaming/base/shared/prefix.py Show resolved Hide resolved
@snarayan21 snarayan21 requested a review from bigning November 1, 2024 13:34
streaming/base/shared/prefix.py Show resolved Hide resolved
streaming/base/shared/prefix.py Outdated Show resolved Hide resolved
streaming/base/shared/prefix.py Show resolved Hide resolved
Copy link
Contributor

@bigning bigning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the fix! Will let @karan6181 stamp!

streaming/base/shared/prefix.py Show resolved Hide resolved
streaming/base/shared/prefix.py Show resolved Hide resolved
Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank You!

@XiaohanZhangCMU XiaohanZhangCMU merged commit fd9b55a into main Nov 1, 2024
7 checks passed
@XiaohanZhangCMU XiaohanZhangCMU deleted the xiaohan/flexible_retry_shm_prefix branch November 1, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants