Add AMD Support #173

bethune-bryant · 2024-07-29T14:17:50Z

Fixes #137

Design

To do this, I duplicate the pynvml interface that gpustat already uses in a wrapper around rocmi, and dynamically import the correct library based on which hardware is present.
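A minimal sketch of the dynamic-import idea described above. The detection heuristic (looking for `nvidia-smi` or `rocm-smi` on `PATH`) and the helper names `detect_vendor`, `backend_module_name`, and `load_backend` are assumptions made for illustration; the PR's actual detection logic is not shown in this thread.

```python
import importlib
import shutil
from types import ModuleType
from typing import Callable, Optional


def detect_vendor(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Guess the GPU vendor from the tools found on PATH (hypothetical heuristic)."""
    if which("nvidia-smi"):
        return "nvidia"
    if which("rocm-smi"):
        return "amd"
    # Fall back to NVIDIA when nothing is detected.
    return "nvidia"


def backend_module_name(vendor: str) -> str:
    """Map a vendor to the module exposing the pynvml-style interface."""
    # "gpustat.rocml" is the wrapper file named in this PR; "pynvml" is assumed
    # to be the NVIDIA-side module.
    return {"nvidia": "pynvml", "amd": "gpustat.rocml"}.get(vendor, "pynvml")


def load_backend(vendor: Optional[str] = None) -> ModuleType:
    """Import and return the backend module for the detected (or given) vendor."""
    return importlib.import_module(backend_module_name(vendor or detect_vendor()))
```

Because both modules expose the same function names, the calling code could keep a single alias (for example `N = load_backend()`) and stay vendor-agnostic.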

Current Status

The base functionality is currently working.

Remaining Tasks

Basic Functionality
Testing
Documentation

bethune-bryant · 2024-08-08T14:39:06Z

@wookayin
Before I start working on documentation and testing, would you mind taking a look at this PR?
Do you agree with the overall design, or is there something you would like changed?
Do you have any concerns?

Stonesjtu

Can you add some mocking tests for ROCM devices?

setup.py

gpustat/util.py

bethune-bryant · 2024-09-03T20:23:37Z

Can you add some mocking tests for ROCM devices?

I'm not super familiar with mockito, but I've started looking into this.

Stonesjtu

LGTM.

For the testing part, we can mock a ROCML-based NVML library call like NVMLGetFanSpeed to return constant values.

Stonesjtu · 2024-10-14T12:15:34Z

gpustat/core.py

@@ -612,6 +618,8 @@ def _wrapped(*args, **kwargs):
                gpu_stat = InvalidGPU(index, "((Unknown Error))", e)
            except N.NVMLError_GpuIsLost as e:
                gpu_stat = InvalidGPU(index, "((GPU is lost))", e)
+            except Exception as e:


Should we raise N.NVMLError_Unknown here for consistency?

P.S. We could catch NVMLError instead of the base Exception, since catching Exception may silently swallow native Python errors.
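A small sketch of the difference this comment points at. `NVMLError` here is a local stand-in class and `query_or_placeholder` is a hypothetical helper, not code from the PR: catching only the backend's base error lets ordinary Python bugs surface instead of being turned into a placeholder.

```python
class NVMLError(Exception):
    """Local stand-in for the backend's base error class."""


def query_or_placeholder(query):
    """Run a GPU query; convert backend errors (and only those) to a placeholder."""
    try:
        return query()
    except NVMLError as e:
        return f"((Unknown Error: {e}))"


def gpu_lost():
    raise NVMLError("GPU is lost")


def buggy():
    # A programming error that should not be hidden by the handler above.
    raise TypeError("unexpected argument")
```

With this handler, `query_or_placeholder(gpu_lost)` yields a placeholder string, while `query_or_placeholder(buggy)` still raises `TypeError`.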

Stonesjtu · 2024-10-14T12:17:02Z

gpustat/rocml.py

+        super().__init__(self.message)
+
+
+class NVMLError_Unknown(Exception):


Should these NVMLError_xxx classes inherit from NVMLError?
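A sketch of the hierarchy this question suggests, assuming the wrapper defines its own `NVMLError` base with a `message` attribute (as the `super().__init__(self.message)` line in the hunk hints). The class bodies are illustrative, not the PR's code.

```python
class NVMLError(Exception):
    """Base class for all errors raised by the wrapper (illustrative)."""

    def __init__(self, message="NVML error"):
        self.message = message
        super().__init__(self.message)


class NVMLError_Unknown(NVMLError):
    """Raised when the underlying library reports an unclassified failure."""


class NVMLError_GpuIsLost(NVMLError):
    """Raised when a device disappears while being queried."""
```

With a shared base, callers can write `except NVMLError` once and still handle specific subclasses earlier in the same `try` statement.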

Stonesjtu · 2024-10-14T12:18:01Z

gpustat/rocml.py

+except (ImportError, SyntaxError, RuntimeError) as e:
+    _rocmi = sys.modules.get("rocmi", None)
+
+    raise ImportError(


Should we make this a dedicated NVMLError subclass?
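One possible shape for such a class, offered only as a sketch: an error that inherits from both the wrapper's `NVMLError` and the built-in `ImportError`, so existing `except ImportError` handlers keep working. The name `NVMLError_LibraryNotFound` is borrowed from pynvml's naming style, and `import_backend` is a hypothetical helper; neither is confirmed as the PR's final design.

```python
import importlib


class NVMLError(Exception):
    """Local stand-in for the wrapper's base error class."""


class NVMLError_LibraryNotFound(NVMLError, ImportError):
    """The backend library could not be imported (illustrative)."""


def import_backend(name):
    """Import a backend module, converting import failures to an NVMLError."""
    try:
        return importlib.import_module(name)
    except (ImportError, SyntaxError, RuntimeError) as e:
        raise NVMLError_LibraryNotFound(f"could not import {name!r}: {e}") from e
```

The trade-off is that the failure now surfaces through the same `except NVMLError` path as runtime GPU errors, which may or may not be what callers expect at import time.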

gpustat/util.py

bethune-bryant added 17 commits July 29, 2024 14:14

Begin adding AMD support.

65ba474

Add pyrsmi depedency.

261faf7

Add simple hardware switch functionalty.

ca650ba

Move default exception to end

5b229f8

Typo

3c1a744

Default to nvidia.

8ba8134

Typo...

9f07c49

Hide output from rocml.

85d0dbf

add frequency.

2c9aadf

Switching to amdsmi

cc2d0f0

Fix index lookup.

c2ea30e

Remove frequency stuff for now.

3e0c2b1

Check for amdsmi.

173d144

Get driver version

800bd0d

Format new file.

bf1a00a

Typo.

6b731eb

Switch to rocmi.

1a09222

bethune-bryant changed the title from "WIP - AMD Support" to "Add AMD Support" on Aug 8, 2024

bethune-bryant marked this pull request as ready for review August 8, 2024 14:37

wookayin self-assigned this Aug 8, 2024

wookayin added the new feature label Aug 8, 2024

bethune-bryant added 3 commits August 8, 2024 14:56

Cleanup unneeded code.

dfce699

Add driver version.

f1abc19

Fix power divisor.

9a2e2af

Stonesjtu reviewed Aug 19, 2024

setup.py (resolved)

gpustat/util.py (resolved)

bethune-bryant requested a review from Stonesjtu October 8, 2024 20:02

Stonesjtu approved these changes Oct 14, 2024
