Warning when running PyTorch with ROCm #3096

Closed
succo104 opened this issue Jul 2, 2024 · 9 comments
Comments

@succo104

succo104 commented Jul 2, 2024

My system:
AMD Ryzen 5 7600X
32 GB RAM DDR5 6000 MHz
NVMe SSD 256 GB (for Ubuntu) + 2 TB (for Windows)
AMD Radeon RX 6800 RDNA 2

I'm using the latest AMD drivers on Ubuntu 22.04 LTS and getting this warning when running my PyTorch deep learning AI model:

MIOpen(HIP): Warning [FindSolutionImpl] Invalid config loaded from Perf Db: ConvBinWinogradRxSf3x2: 72. Performance may degrade.

What does this mean, and how can I fix it?

Screenshot from 2024-07-02 13-33-49

@AtiqurRahmanAni

I am facing the same warning
image

@ppanchad-amd

Hi @succo104, an internal ticket has been created to investigate your issue. Thanks!

@jamesxu2

Hi @succo104

> I'm using the latest AMD drivers on Ubuntu 22.04 LTS and getting this warning when running my PyTorch deep learning AI model:

Can you provide more information on which PyTorch model you're running? A specific model, a tutorial, or source code you can share would really help us reproduce this issue. Also, do you have ROCm installed, and if so, which version? Please provide as much detail as you can!

@AtiqurRahmanAni If you can provide the same information or just follow along on this ticket, that would be great. We usually recommend opening a new ticket rather than adding to an existing issue, because the two reports may have separate root causes. Also, we don't have any information about your hardware, OS, or ROCm version, which makes this hard to diagnose.

Some conjecture: at this point, I suspect the issue arises because neither of you is probably running an officially supported GPU, so MIOpen does not have precompiled kernels tuned for your specific hardware, which results in less-than-optimal model performance. We can confirm this once we have more information on how to reproduce the problem.

@AtiqurRahmanAni

Thank you so much for your response. I am running an RX 6600 on Ubuntu 24.04.1 LTS with ROCm 6.2.0. Basically, I get this warning whenever I use a BatchNorm layer in a CNN model.

@jamesxu2

Hi @AtiqurRahmanAni, can you provide more information on what exactly you're running (i.e., minimal source code that reproduces the issue and the commands to run it)?

I tried running this on an RX 6800 (same series as your RX 6600), and I do not see the same warning while running this simple MNIST example from PyTorch with a BatchNorm layer.

Also, can you paste the output of running rocminfo on your system? Please surround the output in triple backticks (```).

Thanks!

@AtiqurRahmanAni

Please run this code to reproduce the issue:

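import torch
import torch.nn as nn
import torch.optim as optim
import os
from torchvision.models import resnet18

# Report the gfx1032 GPU (RX 6600) to ROCm as gfx1030
os.environ['HSA_OVERRIDE_GFX_VERSION'] = '10.3.0'
os.environ['HIP_VISIBLE_DEVICES'] = '0'
# Download the pretrained weights into the current directory
os.environ['TORCH_HOME'] = os.getcwd()

# ROCm builds of PyTorch expose the GPU through the "cuda" device type
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Sanity check: a trivial computation on the selected device
x = torch.tensor(1).to(device)
y = torch.tensor(2).to(device)
print(x + y)

# Pretrained ResNet-18 with the final layer replaced by a 2-class head
model = resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# One random batch of ImageNet-sized inputs with random binary labels
batch_size = 8
inputs = torch.rand(batch_size, 3, 224, 224).to(device)
labels = torch.randint(0, 2, (batch_size,)).to(device)
out = model(inputs)

# A single training step
optimizer.zero_grad()
loss = criterion(out, labels)
loss.backward()
optimizer.step()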

The output of rocminfo:

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4300                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16264332(0xf82c8c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16264332(0xf82c8c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16264332(0xf82c8c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6600                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29695(0x73ff)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2750                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 118                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32

The screenshot of the problem while running the given code:
problem

Thank you. Let me know if you need more information.
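
For anyone triaging a similar report, the GPU's ISA name and marketing name can be pulled out of `rocminfo` text like the output above. The following is a minimal sketch, not an official tool: the function name `gpu_agents` is hypothetical, and it assumes the `Agent N` section headers and `Key: value` layout shown in this thread.

```python
import re


def gpu_agents(rocminfo_text):
    """Return (name, marketing_name) pairs for every GPU agent in rocminfo output."""
    agents = []
    # Each agent section begins with a line of the form "Agent <number>".
    sections = re.split(r"^Agent \d+[ \t]*$", rocminfo_text, flags=re.MULTILINE)
    for section in sections[1:]:  # sections[0] is the preamble before Agent 1
        fields = {}
        for line in section.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                # Keep the first occurrence so the agent-level "Name" wins
                # over the "Name" nested under "ISA Info".
                fields.setdefault(key.strip(), value.strip())
        if fields.get("Device Type") == "GPU":
            agents.append((fields.get("Name"), fields.get("Marketing Name")))
    return agents
```

The text can be captured with `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout`. For the output posted above, the GPU agent is reported as `gfx1032` with the marketing name `AMD Radeon RX 6600`.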

@jamesxu2

Hi @AtiqurRahmanAni,
I've done some more testing. Thank you for the reproducer code.

This warning shows up because of your GFX version override: os.environ['HSA_OVERRIDE_GFX_VERSION'] = '10.3.0'.
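
To make the effect of the override concrete, here is a small sketch of how an override value such as `10.3.0` relates to a gfx target name. The helper `gfx_target_from_override` is hypothetical, and it assumes the target name is simply the major, minor, and stepping fields concatenated, with minor and stepping written as single hex digits.

```python
def gfx_target_from_override(version):
    """Map an HSA_OVERRIDE_GFX_VERSION value such as '10.3.0' to a gfx target name."""
    major, minor, stepping = (int(part) for part in version.split("."))
    # Assumption: minor and stepping each appear as one hex digit in the name.
    return f"gfx{major}{minor:x}{stepping:x}"


def is_overridden(actual_target, override_version):
    """True when the override makes the runtime report a different target."""
    return gfx_target_from_override(override_version) != actual_target
```

With the values from this thread, the override `10.3.0` corresponds to `gfx1030`, while rocminfo reports the physical GPU as `gfx1032`, so `is_overridden("gfx1032", "10.3.0")` is true.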

On a supported configuration, MIOpen compiles the relevant kernels on the first run of a compute workload (e.g., convolution kernels optimized for the specific GPU) and then caches them so they can be reused on future runs of that workload. For more details on this caching process, see https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/cache.html. You would notice that the first execution of your ResNet model is slower than later runs, because the later runs use the cached kernels.
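
One way to observe the caching effect described above is to time the same step several times and compare the first duration with the rest. This is a minimal sketch using only the standard library; `time_runs` is a hypothetical helper, not part of PyTorch or MIOpen.

```python
import time


def time_runs(step, repeats=3):
    """Call step() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        durations.append(time.perf_counter() - start)
    return durations
```

With the reproducer above, `step` could be a function that runs the forward pass and then calls `torch.cuda.synchronize()`, so that the GPU work has finished before the timer stops. If kernels are compiled and cached on the first call, the first entry in the returned list would be expected to be noticeably larger than the others.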

However, there are no kernels that are optimized for GFX1032, and you've tricked the compiler into generating kernels for GFX1030 with the HSA override (which I know is necessary to execute this program, so you cannot avoid it). You can verify this by going into your ~/.cache/miopen folder and finding the cached kernels, which in my case are called something like "gfx1030_14.ukdb".
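
The cache check described above can also be scripted. The sketch below lists the architecture prefixes of the cached kernel databases. It assumes the files end in `.ukdb` and are named like `gfx1030_14.ukdb`, as in the example above, and it searches subdirectories as well because the exact cache layout may differ between MIOpen versions. The function name `cached_kernel_archs` is hypothetical.

```python
from pathlib import Path


def cached_kernel_archs(cache_dir=None):
    """Return the sorted architecture prefixes of *.ukdb files under the MIOpen cache."""
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "miopen"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    # "gfx1030_14.ukdb" -> "gfx1030"
    archs = {db.name.split("_", 1)[0] for db in cache_dir.rglob("*.ukdb")}
    return sorted(archs)
```

On a system using the `10.3.0` override, this would be expected to report `gfx1030` rather than the physical GPU's `gfx1032`.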

When MIOpen looks for cached kernels for your device, it does a check against the installed GPU by querying it directly for hardware information to see if a cached kernel is compatible with your actual GPU. This check fails and provides that warning. (Source code).

So, this is ultimately an artifact of running an unsupported GPU, and there is no workaround for it. The warning only says that performance may degrade (because there are no precompiled, pretuned kernels available for gfx1032), so your workload may run slower. It should not affect the correctness of the results, and you can ignore the warning.
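
If the warning is only noise in your logs, MIOpen's logging verbosity can be lowered. This sketch relies on the `MIOPEN_LOG_LEVEL` environment variable from MIOpen's logging documentation, where level 3 is understood to mean errors only. Treat the exact level semantics as an assumption to verify against your MIOpen version. The helper name `quiet_miopen` is hypothetical, and this hides the message without changing which kernels are used.

```python
import os


def quiet_miopen(level=3):
    """Set MIOPEN_LOG_LEVEL so that MIOpen warnings are not printed.

    Assumption: level 3 means "errors only" and level 4 includes warnings.
    Call this before the first GPU operation so MIOpen sees the setting.
    """
    os.environ["MIOPEN_LOG_LEVEL"] = str(level)
    return os.environ["MIOPEN_LOG_LEVEL"]
```

In the reproducer, calling it next to the other `os.environ` assignments would be the natural place.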

Please close this ticket if this answers your question!

@AtiqurRahmanAni

AtiqurRahmanAni commented Oct 16, 2024

@jamesxu2, thank you so much for the investigation. I got my answer. Unfortunately, I cannot close this issue because it was opened by @succo104.

@jamesxu2

Oh, thanks for letting me know, @AtiqurRahmanAni!

@succo104 I will close this ticket, but you are welcome to reopen it if you have any follow-up questions.
