Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AMD Support #173

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

bethune-bryant
Copy link
Contributor

@bethune-bryant bethune-bryant commented Jul 29, 2024

Fixes #137

Design

To do this I duplicate the pynvml interface already used by gpustat in a wrapper around rocmi and dynamically import the correct library based on what hardware is present.

Current Status

The base functionality is currently working:
image

Remaining Tasks

  • Basic Functionality
  • Testing
  • Documentation

@bethune-bryant bethune-bryant changed the title WIP - AMD Support Add AMD Support Aug 8, 2024
@bethune-bryant bethune-bryant marked this pull request as ready for review August 8, 2024 14:37
@bethune-bryant
Copy link
Contributor Author

@wookayin
Before I start working on documentation and testing, would you mind taking a look at this PR?
Do you agree with the overall design, or is there something you would like changed?
Do you have any concerns?

@wookayin wookayin self-assigned this Aug 8, 2024
Copy link
Collaborator

@Stonesjtu Stonesjtu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some mocking tests for ROCM devices?

setup.py Show resolved Hide resolved
gpustat/util.py Show resolved Hide resolved
@bethune-bryant
Copy link
Contributor Author

Can you add some mocking tests for ROCM devices?

I'm not super familiar with mockito, but I've started looking into this.

Copy link
Collaborator

@Stonesjtu Stonesjtu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

for the testing part, we can mock a ROCML based NVML library call like NVMLGetFanSpeed to return constant values.

@@ -612,6 +618,8 @@ def _wrapped(*args, **kwargs):
gpu_stat = InvalidGPU(index, "((Unknown Error))", e)
except N.NVMLError_GpuIsLost as e:
gpu_stat = InvalidGPU(index, "((GPU is lost))", e)
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise the N.NVMLError_Unknown Error for consistency?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps: we can catch NVMLError instead of Base Exception, since you may ignore some python native errors

super().__init__(self.message)


class NVMLError_Unknown(Exception):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these NVMLError_xxx inherit NVMLError?

except (ImportError, SyntaxError, RuntimeError) as e:
_rocmi = sys.modules.get("rocmi", None)

raise ImportError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a dedicated NVMLError subclass?

gpustat/util.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AMD support
4 participants