[Bug]: HEVC decoding fails on DG1 when using upstream kernel instead of Intel DKMS #1415

Closed
eero-t opened this issue May 25, 2022 · 70 comments
Labels
P3 Low priority no customer usage, no business requirements, not from communities, just from internal

Comments

@eero-t

eero-t commented May 25, 2022

Which component impacted?

Decode

Is it regression? Good in old configuration?

No response

What happened?

Use-cases

  • ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i GTAV_1920x1080_60_yuv420p.h265 -c:v h264_vaapi -f null -
  • sample_multi_transcode -i::h265 GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null

Expected outcome

Both of the above should transcode at hundreds of FPS, as they do on a TGL iGPU with exactly the same setup, or when I change the input to an H.264 one.

Actual outcome

  • FFmpeg transcoding happens at 2 FPS
  • OneVPL transcoding fails with:
$ sample_multi_transcode -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
Multi Transcoding Sample Version 8.4.27.0

CONFIGURE LOADER: required implementation: hw 
CONFIGURE LOADER: required implementation mfxAccelerationMode: MFX_ACCEL_MODE_VIA_VAAPI 
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
Session 0:
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_ABORTED(-12), PutBS, Encode: SyncOperation failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2112

[ERROR], sts=MFX_ERR_ABORTED(-12), Transcode, PutBS failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2068

[ERROR], sts=MFX_ERR_ABORTED(-12), Run, CTranscodingPipeline::Run::Transcode() [0x55ba117e6c90] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4677


 session 0 [0x55ba117e6c90] failed with status MFX_ERR_ABORTED shutting down the application...

session [0x55ba117e6c90] m_bForceStop is set

Transcoding finished

Common transcoding time is 2.88576 sec
-------------------------------------------------------------------------------
*** session 0 [0x55ba117e6c90] FAILED (MFX_ERR_ABORTED) 2.88555 sec, 4 frames, 1.386 fps
-i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null 

-------------------------------------------------------------------------------

The test FAILED

[ERROR], sts=MFX_ERR_ABORTED(-12), main, transcode.ProcessResult failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:1561

I do not know whether this is a regression. There have been too many issues to say for sure whether it has ever worked on the 0x4905 device.
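The FPS figure in the session summary line above follows directly from the frame count and the elapsed time. A minimal Python sketch that pulls those fields out and recomputes the rate; the helper name `parse_session_summary` is hypothetical, and the regular expression assumes only the line layout seen in this log, not any documented output format:

```python
import re

# Assumed layout of the per-session summary line, taken from the log above.
SUMMARY_RE = re.compile(
    r"\*\*\* session (?P<id>\d+) \[0x[0-9a-f]+\] (?P<result>\w+) "
    r"\((?P<status>\w+)\) (?P<sec>[\d.]+) sec, (?P<frames>\d+) frames, "
    r"(?P<fps>[\d.]+) fps"
)

def parse_session_summary(line):
    """Return the summary fields as a dict, or None if the line does not match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    sec = float(m.group("sec"))
    frames = int(m.group("frames"))
    return {
        "result": m.group("result"),
        "status": m.group("status"),
        "seconds": sec,
        "frames": frames,
        "reported_fps": float(m.group("fps")),
        # Recompute the rate instead of trusting the printed value.
        "computed_fps": frames / sec if sec > 0 else 0.0,
    }

line = ("*** session 0 [0x55ba117e6c90] FAILED (MFX_ERR_ABORTED) "
        "2.88555 sec, 4 frames, 1.386 fps")
info = parse_session_summary(line)
print(info["status"], info["frames"], round(info["computed_fps"], 3))
```

Only 4 frames completed before the abort, so the reported rate mostly reflects the time spent waiting for the failure, not steady-state throughput.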

What's the usage scenario when you are seeing the problem?

Transcode for media delivery

What impacted?

No response

Debug Information

Setup

  • GPU: DG1 (0x4905)
  • Ubuntu 20.04.4 distro
  • drm-tip 5.18 kernel
  • media stack components built from latest release tags (as of today):
    • libva: 2.14.0
    • GMMlib: intel-gmmlib-22.1.3
    • Media: intel-media-22.4.2
    • MediaSDK: intel-mediasdk-22.4.2
    • oneVPL: v2022.1.3
    • VPL-GPU: intel-onevpl-22.4.2
    • FFmpeg: n5.0.1

VA-info

libva info: VA-API version 1.14.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.14 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 22.4.2 (d7a1feb)
vainfo: Supported profile and entrypoints
      VAProfileNone                   :	VAEntrypointVideoProc
      VAProfileNone                   :	VAEntrypointStats
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Simple            :	VAEntrypointEncSlice
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointEncSlice
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointEncSlice
      VAProfileH264Main               :	VAEntrypointFEI
      VAProfileH264Main               :	VAEntrypointEncSliceLP
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointEncSlice
      VAProfileH264High               :	VAEntrypointFEI
      VAProfileH264High               :	VAEntrypointEncSliceLP
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline:	VAEntrypointFEI
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSliceLP
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointEncSlice
      VAProfileHEVCMain               :	VAEntrypointFEI
      VAProfileHEVCMain               :	VAEntrypointEncSliceLP
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileHEVCMain10             :	VAEntrypointEncSlice
      VAProfileHEVCMain10             :	VAEntrypointEncSliceLP
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointEncSliceLP
      VAProfileVP9Profile1            :	VAEntrypointVLD
      VAProfileVP9Profile1            :	VAEntrypointEncSliceLP
      VAProfileVP9Profile2            :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointEncSliceLP
      VAProfileVP9Profile3            :	VAEntrypointVLD
      VAProfileVP9Profile3            :	VAEntrypointEncSliceLP
      VAProfileHEVCMain12             :	VAEntrypointVLD
      VAProfileHEVCMain12             :	VAEntrypointEncSlice
      VAProfileHEVCMain422_10         :	VAEntrypointVLD
      VAProfileHEVCMain422_10         :	VAEntrypointEncSlice
      VAProfileHEVCMain422_12         :	VAEntrypointVLD
      VAProfileHEVCMain422_12         :	VAEntrypointEncSlice
      VAProfileHEVCMain444            :	VAEntrypointVLD
      VAProfileHEVCMain444            :	VAEntrypointEncSliceLP
      VAProfileHEVCMain444_10         :	VAEntrypointVLD
      VAProfileHEVCMain444_10         :	VAEntrypointEncSliceLP
      VAProfileHEVCMain444_12         :	VAEntrypointVLD
      VAProfileHEVCSccMain            :	VAEntrypointVLD
      VAProfileHEVCSccMain            :	VAEntrypointEncSliceLP
      VAProfileHEVCSccMain10          :	VAEntrypointVLD
      VAProfileHEVCSccMain10          :	VAEntrypointEncSliceLP
      VAProfileHEVCSccMain444         :	VAEntrypointVLD
      VAProfileHEVCSccMain444         :	VAEntrypointEncSliceLP
      VAProfileAV1Profile0            :	VAEntrypointVLD
      VAProfileHEVCSccMain444_10      :	VAEntrypointVLD
      VAProfileHEVCSccMain444_10      :	VAEntrypointEncSliceLP
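
The profile list above can be checked mechanically. A minimal Python sketch, assuming only the `profile : entrypoint` layout shown in this vainfo output (the function name `parse_vainfo_profiles` is hypothetical), that confirms the driver advertises the HEVC Main decode entrypoint (`VAEntrypointVLD`) even though decoding fails at runtime:

```python
def parse_vainfo_profiles(text):
    """Map each VAProfile name to the set of entrypoints listed for it."""
    profiles = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip the "libva info:" and "vainfo:" header lines.
        if not line.startswith("VAProfile") or ":" not in line:
            continue
        name, entry = (part.strip() for part in line.split(":", 1))
        profiles.setdefault(name, set()).add(entry)
    return profiles

# A few lines copied from the vainfo output above.
sample = """
vainfo: Supported profile and entrypoints
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain10             : VAEntrypointVLD
"""
profiles = parse_vainfo_profiles(sample)
hevc_decode = "VAEntrypointVLD" in profiles.get("VAProfileHEVCMain", set())
print("HEVC Main decode advertised:", hevc_decode)
```

An advertised entrypoint only says the driver claims support; it does not exercise the decode path, which is where this issue shows up.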

Notes

There are no GPU hangs. Kernel driver output / settings:

# dmesg |grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=3 i915.force_probe=4905 ro
[    0.026206] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=3 i915.force_probe=4905 ro
[    2.582581] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.582586] fb0: switching to i915 from EFI VGA
[    2.582684] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.582707] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.582708] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.597310] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.600584] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.667756] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.667759] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.673370] i915 0000:03:00.0: [drm] HuC authenticated
[    2.673789] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.673790] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.674046] i915 0000:03:00.0: [drm] GuC RC: enabled

Do you want to contribute a patch to fix the issue?

No.

@eero-t
Author

eero-t commented May 27, 2022

I've also tested Sysman functionality and simple OpenCL programs. Those work fine, so in general drm-tip kernel seems to work fine.

Btw. the media-driver README still states the following:

Media-driver requires special i915 kernel mode driver (KMD) version to support the following new platforms since upstream version of i915 KMD does not fully support them (pending patches upstream):

DG1/SG1
Alchemist(DG2)/ATSM

By default, media-driver builds against upstream i915 KMD and will miss support for the platforms listed above. To enable new platforms which require special i915 KMD and specify ENABLE_PRODUCTION_KMD=ON (default: OFF) build configuration option.

Although AFAIK that has not been true for DG1 for over half a year, since this media-driver commit: db5a870

And DG2 / ATS-M support is already in the public kernel (for some of their variants, and requiring force-probing for now).

@XinfengZhang
Contributor

Since the 22'Q1 release, media_driver no longer supports DG1 with ENABLE_PRODUCTION_KMD=ON.
Now this option only supports DG2 with suitable kernel support (https://github.com/intel-gpu/intel-gpu-i915-backports/tree/ubuntu/main). We will update the document. If you still want DG1 against https://github.com/intel-gpu/kernel, please use the 21'Q4 release.

@eero-t
Author

eero-t commented May 30, 2022

The note about outdated DG1 info in README was just FYI.

I am using ENABLE_PRODUCTION_KMD=OFF with public kernel, and that fails for me on DG1 with HEVC.

(There may have been 3D + AVC transcode running at the same time in the backend while I was running this test-case, but that should not have broken HEVC as dmesg does not show any errors.)

@Xiaogangli-intel
Contributor

Hi @eero-t, I noticed i915.enable_guc=3 in your kernel parameters, could you try i915.enable_guc=2? Seems GuC submission doesn't work on DG1.

@Xiaogangli-intel
Contributor

Xiaogangli-intel commented May 31, 2022

Hi @eero-t, media still has some issues on the drm-tip KMD for DG1.
Could you please try the KMD at https://github.com/intel-gpu/intel-gpu-i915-backports? You need to build the media driver with ENABLE_PRODUCTION_KMD=ON and also set i915.enable_guc=2 in the kernel boot parameters.

@dvrogozh
Contributor

At the moment there are 2 possible ways to set up DG1:

  1. Use the vanilla kernel (or drm-tip). DG1 support is still not finalized there, so the user should use i915.force_probe=* (or a specific device ID) to enable it. You don't need any special media driver build options for that.
  2. Use a custom kernel (or rather kernel module), which @Xiaogangli-intel suggests above: https://github.com/intel-gpu/intel-gpu-i915-backports. Use ENABLE_PRODUCTION_KMD=ON to build media-driver.

It's up to the user to decide which kernel to use. However, in both cases, the user is NOT supposed to adjust the i915.enable_guc option in any way. This is a very risky option and the user should clearly understand why they are trying to change it.

@eero-t: I strongly suggest dropping i915.enable_guc from the cmdline and trying again. I vaguely recall you had some issue because of setting this option before. Hope this helps. If not, then you've found an LGTM issue for the vanilla kernel which the media team will need to look at.
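
To make the difference between the `i915.enable_guc` values discussed here concrete, the following Python sketch pulls `i915.*` parameters out of a kernel command line and decodes `enable_guc`. It assumes the upstream i915 meaning of the parameter as a bitmask (bit 0: GuC submission, bit 1: HuC load, negative or absent: driver default); the helper names are hypothetical, and the bit meanings should be checked against the kernel actually in use:

```python
def parse_i915_params(cmdline):
    """Collect i915.* module parameters from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, value = token[len("i915."):].split("=", 1)
            params[key] = value
    return params

def describe_enable_guc(value):
    """Decode i915.enable_guc under the assumed upstream bitmask meaning:
    bit 0 = GuC submission, bit 1 = HuC load, negative/absent = driver default."""
    if value is None or int(value) < 0:
        return "auto (driver default)"
    bits = int(value)
    return "GuC submission: %s, HuC load: %s" % (
        "on" if bits & 1 else "off",
        "on" if bits & 2 else "off",
    )

# Command lines shortened from the dmesg outputs in this thread.
for cmdline in (
    "root=/dev/nvme0n1p2 i915.enable_guc=3 i915.force_probe=4905 ro",
    "root=/dev/nvme0n1p2 i915.enable_guc=2 i915.force_probe=4905 ro",
    "root=/dev/nvme0n1p2 i915.force_probe=4905 ro",
):
    params = parse_i915_params(cmdline)
    print(params.get("force_probe"), "->",
          describe_enable_guc(params.get("enable_guc")))
```

Under that reading, dropping the option leaves the choice to the driver, which is what the suggestion above amounts to.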

@eero-t
Author

eero-t commented May 31, 2022

Good catch. I'll test upstream kernel without the GuC option tomorrow, and report back.

(I've intended to clean that out, but had forgotten to do it for all kernel configs on all machines.)

@eero-t
Author

eero-t commented Jun 1, 2022

@dvrogozh GuC scheduling is enabled by the public "drm-tip" kernel (from yesterday), even when it's not forced:

# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    0.026212] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    2.081413] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.081417] fb0: switching to i915 from EFI VGA
[    2.081679] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.081714] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.081715] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.094978] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.099214] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.165804] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.165807] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.171291] i915 0000:03:00.0: [drm] HuC authenticated
[    2.171568] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.171569] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.171828] i915 0000:03:00.0: [drm] GuC RC: enabled
[    2.207306] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0

And the same issue persists.

Note: I forgot to mention earlier, but in case it matters, all of these nodes have two 0x4905 DG1 GPUs. Limiting media-driver devfs visibility to just the first one (with Docker) did not change anything though.
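
With two cards in a node it helps to see which DRM render nodes are visible to the process. DRM render node numbering starts at 128, which matches the `DRMRenderNodeNum: 128` lines in the oneVPL logs. A small Python sketch (the function name `list_render_nodes` is hypothetical, and the assumption that a second card appears as `renderD129` is not confirmed anywhere in this thread):

```python
import os
import re

def list_render_nodes(dri_dir="/dev/dri"):
    """Return render node names (renderD128, renderD129, ...) sorted by number."""
    if not os.path.isdir(dri_dir):
        return []
    nodes = [n for n in os.listdir(dri_dir) if re.fullmatch(r"renderD\d+", n)]
    return sorted(nodes, key=lambda n: int(n[len("renderD"):]))

nodes = list_render_nodes()
print("visible render nodes:", nodes)
```

Run inside the container, this shows whether the Docker device restriction actually left a single node visible.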

@eero-t
Author

eero-t commented Jun 1, 2022

@Xiaogangli-intel even with GuC scheduling explicitly disabled for drm-tip:

# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 i915.force_probe=4905 ro
[    0.026191] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 i915.force_probe=4905 ro
[    2.084816] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.084820] fb0: switching to i915 from EFI VGA
[    2.084891] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.084913] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.084914] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.096936] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.100697] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.170110] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.170113] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.185450] i915 0000:03:00.0: [drm] HuC authenticated
[    2.185451] i915 0000:03:00.0: [drm] GuC submission disabled
[    2.185452] i915 0000:03:00.0: [drm] GuC SLPC disabled
[    2.220340] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0

Media driver fails:

sample_multi_transcode -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
Multi Transcoding Sample Version 8.4.27.0

CONFIGURE LOADER: required implementation: hw 
CONFIGURE LOADER: required implementation mfxAccelerationMode: MFX_ACCEL_MODE_VIA_VAAPI 
libva info: VA-API version 1.14.0
...
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/local/lib/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_14
libva info: va_openDriver() returns 0
Session 0:
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_ABORTED(-12), PutBS, Encode: SyncOperation failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2112

[ERROR], sts=MFX_ERR_ABORTED(-12), Transcode, PutBS failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2068

[ERROR], sts=MFX_ERR_ABORTED(-12), Run, CTranscodingPipeline::Run::Transcode() [0x55b1e65abc90] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4677

 session 0 [0x55b1e65abc90] failed with status MFX_ERR_ABORTED shutting down the application...

session [0x55b1e65abc90] m_bForceStop is set

Transcoding finished

@eero-t
Author

eero-t commented Jun 1, 2022

When GuC scheduling is explicitly disabled, there's also a GPU hang:

[   65.151632] i915 0000:03:00.0: [drm] Resetting vcs1 for preemption time out
[   65.151688] i915 0000:03:00.0: [drm] sample_multi_tr[5533] context reset due to GPU hang
[   65.160360] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd, in sample_multi_tr [5533]

See: gpu-hang.txt
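
The hang lines above can be reduced to a few fields for comparison across kernels. A minimal Python sketch, assuming only the message layout shown here; the function name `summarize_hang` is hypothetical. The interpretation of the three `ecode` fields as graphics version, engine mask and error hash is an assumption and is labeled as such in the code:

```python
import re

# Layout assumed from the dmesg lines above. Field meanings are an assumption:
# ecode <graphics version>:<engine mask, hex>:<error hash, hex>.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-f]+):(?P<code>[0-9a-f]{8}), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\]"
)
RESET_RE = re.compile(r"Resetting (?P<engine>\w+) for (?P<reason>.+)$")

def summarize_hang(dmesg_lines):
    """Collect the reset engine/reason and the hang ecode fields, if present."""
    summary = {}
    for raw in dmesg_lines:
        line = raw.rstrip()
        m = RESET_RE.search(line)
        if m:
            summary["engine"] = m.group("engine")
            summary["reason"] = m.group("reason")
        m = HANG_RE.search(line)
        if m:
            summary["graphics_ver"] = int(m.group("gen"))
            summary["engine_mask"] = int(m.group("engines"), 16)
            summary["ecode_hash"] = m.group("code")
            summary["process"] = m.group("comm")
            summary["pid"] = int(m.group("pid"))
    return summary

lines = [
    "[   65.151632] i915 0000:03:00.0: [drm] Resetting vcs1 for preemption time out",
    "[   65.151688] i915 0000:03:00.0: [drm] sample_multi_tr[5533] context reset due to GPU hang",
    "[   65.160360] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd, in sample_multi_tr [5533]",
]
hang = summarize_hang(lines)
print(hang)
```

The reset is on `vcs1`, a video engine, which is consistent with the failure being on the media path.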

"Media still has some issues on the drm-tip KMD for DG1."

Could you give a pointer to more info?

"Could you please try the KMD at https://github.com/intel-gpu/intel-gpu-i915-backports? You need to build the media driver with ENABLE_PRODUCTION_KMD=ON and also set i915.enable_guc=2 in the kernel boot parameters."

Sorry, but I'm not interested in the public media-driver on a backport kernel, only in what's going upstream.

@eero-t
Author

eero-t commented Jun 2, 2022

Btw. 1-2 months ago when I was testing internal KMD + UMD versions, I was seeing some instances failing with OneVPL HEVC transcode, when trying to do many parallel transcodes on DG2. I did not debug it further (container instances were changing too fast), but I'm now wondering whether it's related to HEVC issues here with public KMD+UMD versions on DG1. Are there known HEVC issues for DG2 too?

@Xiaogangli-intel
Contributor

Hi @eero-t, I noticed the hang issue with HEVC decode. If you really mind using the backport kernel, maybe we have to sync with KMD to check the progress of upstreaming the DG1 patches.

@eero-t
Author

eero-t commented Jun 10, 2022

DG1 has been enabled in upstream kernel (not just drm-tip) for a long time: https://github.com/torvalds/linux/blob/master/include/drm/i915_pciids.h#L630

But kernel docs RFC section still mentions several items: https://www.kernel.org/doc/html/latest/gpu/rfc/index.html

I've asked whether they've already landed upstream (in Linus' tree, i.e. whether the docs should have been moved out of the RFC section), not just in the public drm-tip that I was testing (and with which I was seeing the issues).

@eero-t
Author

eero-t commented Jun 14, 2022

According to the kernel side, the status specified in the RFC docs applies to both public upstream and drm-tip. I.e. there are still significant gaps in kernel i915 dGPU support, although GuC scheduling has already been enabled by default.

PS. I just tested latest media driver stack releases, and e.g. FFmpeg still gives 2 FPS with drm-tip (instead of the expected hundreds of FPS). I haven't updated the kernel side though (will probably do that late summer, when 5.19 nears release).

@eero-t
Author

eero-t commented Jul 22, 2022

I tested yesterday's drm-tip 5.19-rc7 (and a few days earlier 5.19-rc6) on DG1, and things have gone downhill. Instead of 2 FPS HEVC decode, there are lots of failures now with FFmpeg / VA-API:

Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
[hevc @ 0x560f76ef5700] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76ef5700] hardware accelerator failed to decode picture
[hevc @ 0x560f76fa7080] Could not find ref with POC 0
[hevc @ 0x560f76fa7080] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fa7080] hardware accelerator failed to decode picture
[hevc @ 0x560f76fb8840] Could not find ref with POC 1
[hevc @ 0x560f76fb8840] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fb8840] hardware accelerator failed to decode picture
[hevc @ 0x560f76fca040] Could not find ref with POC 6
[hevc @ 0x560f76fca040] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fca040] hardware accelerator failed to decode picture
[hevc @ 0x560f76fdb840] Could not find ref with POC 4
[hevc @ 0x560f76fdb840] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fdb840] hardware accelerator failed to decode picture

(No errors in dmesg though.)
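
The FFmpeg output above repeats the same pattern per frame: a VA-API end-picture failure with status 23 (which, as far as I know, is `VA_STATUS_ERROR_DECODING_ERROR` in libva's `va.h`), followed by missing-reference messages for later frames. A minimal Python sketch that tallies both kinds of message; the function name `summarize_ffmpeg_log` is hypothetical and the patterns assume only the wording seen in this log:

```python
import re
from collections import Counter

VA_ERR_RE = re.compile(r"Failed to end picture decode issue: (\d+) \(([^)]+)\)")
POC_RE = re.compile(r"Could not find ref with POC (\d+)")

def summarize_ffmpeg_log(text):
    """Count VA-API decode errors by (status, message) and list missing POCs."""
    errors = Counter()
    missing_pocs = []
    for line in text.splitlines():
        m = VA_ERR_RE.search(line)
        if m:
            errors[(int(m.group(1)), m.group(2))] += 1
        m = POC_RE.search(line)
        if m:
            missing_pocs.append(int(m.group(1)))
    return errors, missing_pocs

# First few lines copied from the FFmpeg output above.
log = """\
[hevc @ 0x560f76ef5700] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76ef5700] hardware accelerator failed to decode picture
[hevc @ 0x560f76fa7080] Could not find ref with POC 0
[hevc @ 0x560f76fa7080] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x560f76fa7080] hardware accelerator failed to decode picture
[hevc @ 0x560f76fb8840] Could not find ref with POC 1
"""
errors, missing_pocs = summarize_ffmpeg_log(log)
print(dict(errors), missing_pocs)
```

The very first frame fails before any reference is reported missing, so the "Could not find ref" lines look like a consequence of earlier frames not being decoded, not a separate problem. That reading is an inference from the message order, not something confirmed in this thread.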

When using FFmpeg with QSV instead of VA-API, it fails immediately:

Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), yuv420p(tv), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_qsv))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf59.16.100
  Stream #0:0: Video: h264, nv12(tv, progressive), 1920x1080, q=2-31, 1000 kb/s, 60 fps, 60 tbn
    Metadata:
      encoder         : Lavc59.18.100 h264_qsv
    Side data:
      cpb: bitrate max/min/avg: 0/0/1000000 buffer size: 0 vbv_delay: N/A
[h264_qsv @ 0x561d6d5488c0] Unknown FrameType, set pict_type to AV_PICTURE_TYPE_NONE.
[h264_qsv @ 0x561d6d5488c0] Error during encoding: unknown error (-21)
Video encoding failed
Conversion failed!

However, exactly the same drm-tip kernel, user-space [1] and test-case still work fine on TGL (with perf in hundreds of FPS).

[1] User-space components:

  • libva 2.15.0
  • intel-gmmlib-22.1.6
  • intel-media-22.5.0
  • intel-mediasdk-22.5.0
  • FFmpeg n5.0.1

TGL dmesg content:

$ dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 ro
[    0.037729] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.enable_guc=2 ro
[    2.621257] i915 0000:00:02.0: [drm] VT-d active for gfx access
[    2.621363] i915 0000:00:02.0: vgaarb: deactivate vga console
[    2.621412] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    2.623817] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    2.625315] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/tgl_dmc_ver2_12.bin (v2.12)
[    3.298064] i915 0000:00:02.0: [drm] failed to retrieve link info, disabling eDP
[    3.414350] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 70.1
[    3.414352] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[    3.427536] i915 0000:00:02.0: [drm] HuC authenticated
[    3.427538] i915 0000:00:02.0: [drm] GuC submission disabled
[    3.427538] i915 0000:00:02.0: [drm] GuC SLPC disabled
[    3.504681] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    3.511744] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    3.683072] fbcon: i915drmfb (fb0) is primary device
[    3.780445] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device

DG1 dmesg content:

$ dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    0.025971] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes i915.force_probe=4905 ro
[    2.174692] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    2.174792] i915 0000:03:00.0: vgaarb: deactivate vga console
[    2.174819] i915 0000:03:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.174820] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.189355] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    2.191620] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.258663] i915 0000:03:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.258666] i915 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.263872] i915 0000:03:00.0: [drm] HuC authenticated
[    2.264158] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.264160] i915 0000:03:00.0: [drm] GuC SLPC enabled
[    2.264414] i915 0000:03:00.0: [drm] GuC RC: enabled
[    2.301517] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[    2.302089] i915 0000:0a:00.0: [drm] VT-d active for gfx access
[    2.302121] i915 0000:0a:00.0: [drm] Local memory IO size: 0x00000000fb800000
[    2.302122] i915 0000:0a:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.321256] i915 0000:0a:00.0: [drm] Finished loading DMC firmware i915/dg1_dmc_ver2_02.bin (v2.2)
[    2.327096] fbcon: i915drmfb (fb0) is primary device
[    2.373506] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[    2.391770] i915 0000:0a:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1
[    2.391772] i915 0000:0a:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
[    2.398253] i915 0000:0a:00.0: [drm] HuC authenticated
[    2.398673] i915 0000:0a:00.0: [drm] GuC submission enabled
[    2.398674] i915 0000:0a:00.0: [drm] GuC SLPC enabled
[    2.398985] i915 0000:0a:00.0: [drm] GuC RC: enabled
[    2.411403] [drm] Initialized i915 1.6.0 20201103 for 0000:0a:00.0 on minor 1
[    2.414748] i915 0000:0a:00.0: [drm] Cannot find any crtc or sizes
[    2.415142] i915 0000:0a:00.0: [drm] Cannot find any crtc or sizes

I.e. the main differences are that DG1 has 2x devices with GuC scheduling enabled (by default), and that THP is enabled only on TGL for some reason, although both have VT-d active.
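
The comparison between the two dmesg outputs can be done mechanically. A minimal Python sketch (the function name `dmesg_features` and the chosen feature keys are hypothetical) that extracts, per PCI device, the GuC submission state, whether Transparent Hugepages are reported, and whether local memory is reported, assuming the message wording seen in the logs above:

```python
import re

# Matches "[ts] i915 <pci-address>: [drm] <message>" lines only.
DRM_LINE_RE = re.compile(r"i915 (\S+): \[drm\] (.*)$")

def dmesg_features(text):
    """Extract per-PCI-device i915 facts from `dmesg | grep i915` output."""
    devices = {}
    for line in text.splitlines():
        m = DRM_LINE_RE.search(line)
        if not m:
            continue
        dev, msg = m.group(1), m.group(2).strip()
        info = devices.setdefault(
            dev, {"guc_submission": None, "thp": False, "lmem": False}
        )
        if msg.startswith("GuC submission "):
            info["guc_submission"] = msg.split()[-1]  # "enabled" or "disabled"
        elif msg == "Using Transparent Hugepages":
            info["thp"] = True
        elif msg.startswith("Local memory available"):
            info["lmem"] = True
    return devices

# Lines copied from the TGL and DG1 dmesg outputs above.
tgl = """\
[    2.621412] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    3.427538] i915 0000:00:02.0: [drm] GuC submission disabled
"""
dg1 = """\
[    2.174820] i915 0000:03:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.264158] i915 0000:03:00.0: [drm] GuC submission enabled
[    2.302122] i915 0000:0a:00.0: [drm] Local memory available: 0x00000000fb800000
[    2.398673] i915 0000:0a:00.0: [drm] GuC submission enabled
"""
tgl_info = dmesg_features(tgl)
dg1_info = dmesg_features(dg1)
print(tgl_info)
print(dg1_info)
```

Note that the TGL run also has `i915.enable_guc=2` on its command line while the DG1 run does not, so the GuC difference between the two machines comes from configuration as well as from driver defaults.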

eero-t changed the title from "[Bug]: HEVC transcoding fails on DG1 (0x4905)" to "[Bug]: HEVC transcoding runs at 2 FPS, or fails, on DG1 (0x4905)" on Aug 4, 2022
@eero-t
Author

eero-t commented Sep 13, 2022

Tested media stack components built from latest release tags (Ubuntu 22.04 based container):

  • libVA: 2.15.0
  • GMMlib: intel-gmmlib-22.1.8
  • media-driver: intel-media-22.5.3
  • MediaSDK: intel-mediasdk-22.5.3
  • oneVPL: v2022.2.2
  • oneVPL GPU: intel-onevpl-22.5.3
  • FFmpeg: n5.1.1

Both the FFmpeg VA-API and OneVPL decoding failures are still there, with the slightly older drm-tip v6.0-rc3 kernel as well as with v6.0-rc5 from yesterday.

OneVPL / MFX error message has changed to match what FFmpeg / VA-API was reporting:

Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions

Transcoding started

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1933

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Run, CTranscodingPipeline::Run::Transcode() [0x556090093fa0] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4868

 session 0 [0x556090093fa0] failed with status MFX_ERR_DEVICE_FAILED shutting down the application...
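
To compare how the failure has moved between runs, the `[ERROR]` lines can be split into status, reporting function, message and source location. A minimal Python sketch, assuming the line layout printed by the sample tool in this thread (the names `parse_mfx_error` and `first_error` are hypothetical):

```python
import re

MFX_ERR_RE = re.compile(
    r"\[ERROR\], sts=(?P<name>MFX_\w+)\((?P<code>-?\d+)\), (?P<func>\w+), "
    r"(?P<msg>.*) at (?P<file>\S+):(?P<line>\d+)$"
)

def parse_mfx_error(line):
    """Split one [ERROR] line into its fields, or return None if it does not match."""
    m = MFX_ERR_RE.search(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["code"] = int(fields["code"])
    fields["line"] = int(fields["line"])
    return fields

def first_error(log_text):
    """Return the first parsed [ERROR] line, i.e. where the failure was first reported."""
    for line in log_text.splitlines():
        parsed = parse_mfx_error(line)
        if parsed is not None:
            return parsed
    return None

# First error line from the May 25 log and from the Sep 13 log in this thread.
may = ("[ERROR], sts=MFX_ERR_ABORTED(-12), PutBS, Encode: SyncOperation failed at "
       "/home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:2112")
sep = ("[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at "
       "/home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1933")
for log in (may, sep):
    err = first_error(log)
    print(err["name"], err["code"], err["func"], "|", err["msg"])
```

On these two lines the first reported error moves from the encode sync step (`MFX_ERR_ABORTED`) to the decode call (`MFX_ERR_DEVICE_FAILED`), which matches the observation that the message now lines up with what FFmpeg / VA-API reports.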

eero-t changed the title from "[Bug]: HEVC transcoding runs at 2 FPS, or fails, on DG1 (0x4905)" to "[Bug]: HEVC decoding fails on DG1 (0x4905)" on Sep 13, 2022
@eero-t
Author

eero-t commented Sep 13, 2022

drm-tip kernel dmesg shows this on startup, but I guess this is related just to error reporting, not media:

[    2.477122] i915 0000:0a:00.0: [drm] *ERROR* Zero GuC log crash dump size!
[    2.477124] i915 0000:0a:00.0: [drm] *ERROR* Zero GuC log debug size!
[    2.478087] i915 0000:0a:00.0: [drm] GuC error state capture buffer maybe too small: 2097152 < 2360316 (min = 786772)
[    2.482243] i915 0000:0a:00.0: [drm] GuC firmware i915/dg1_guc_70.1.1.bin version 70.1.1

@eero-t
Author

eero-t commented Sep 21, 2022

Things still fail with latest "drm-tip" (6.0-rc6) from today, and latest media-driver release:

  • libVA: 2.15.0
  • GMMlib: intel-gmmlib-22.2.0
  • media-driver: intel-media-22.5.3

EDIT: VAAPI init failure was due to kernel FW loading issue: https://gitlab.freedesktop.org/drm/intel/-/issues/6895

The error is now:

[ERROR], sts=MFX_ERR_NULL_PTR(-2), Init, m_fSource pointer is NULL at /home/nobody/source/oneVPL/tools/legacy/sample_common/src/sample_utils.cpp:682

[ERROR], sts=MFX_ERR_NULL_PTR(-2), Init, reader->Init failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:528

[ERROR], sts=MFX_ERR_NULL_PTR(-2), main, transcode.Init failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/sample_multi_transcode.cpp:1565

@nyanmisaka
Contributor

Any update on this? I've got a DG1 80EU and it fails decoding any video with VAAPI/QSV through ffmpeg cli. But everything works just fine on Windows.

@eero-t
Author

eero-t commented Oct 7, 2022

Things still fail with latest "drm-tip" (6.0-rc7) from yesterday, with a matching FW (GuC: 70.5.1, HuC: 7.9.3), and latest media stack releases:

  • libVA: 2.16.0
  • GMMlib: intel-gmmlib-22.2.0
  • media-driver: intel-media-22.5.4
  • MediaSDK: intel-mediasdk-22.5.4
  • oneVPL: v2022.2.4
  • oneVPL-gpu: intel-onevpl-22.5.4
  • FFmpeg: n5.1.2

Output from OneVPL tool:

$ sample_multi_transcode -i::h265 /media/GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
...
Loaded Library configuration: 
    Version: 2.7 
    ImplName: mfx-gen 
    Adapter number : 0 
    Adapter type: integrated
    DRMRenderNodeNum: 128 
Used implementation number: 0 
Loaded modules:
   0: /usr/local/lib/libmfxhw64.so.1.35 
   1: /usr/local/lib/libmfx-gen.so.1.2.7 

Pipeline surfaces number (DecPool): 10
Input  video: HEVC
Output video: AVC 

Session 0 was NOT joined with other sessions
Transcoding started

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Transcode, Decode<One|Last>Frame failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:1944

[ERROR], sts=MFX_ERR_DEVICE_FAILED(-17), Run, CTranscodingPipeline::Run::Transcode() [0x5555799dfdf0] failed at /home/nobody/source/oneVPL/tools/legacy/sample_multi_transcode/src/pipeline_transcode.cpp:4904

 session 0 [0x5555799dfdf0] failed with status MFX_ERR_DEVICE_FAILED shutting down the application...

Output from FFmpeg / VA-API:

$ ffmpeg -y -an -loglevel verbose -hwaccel vaapi -hwaccel_output_format vaapi -i /media/GTAV_1920x1080_60_yuv420p.h265 -c:v h264_vaapi -f null
...
Input #0, hevc, from '/media/GTAV_1920x1080_60_yuv420p.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main), 1 reference frame, yuv420p(tv, left), 1920x1080, 60 fps, 60 tbr, 1200k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_vaapi))
Press [q] to stop, [?] for help
[hevc @ 0x56401e78b500] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e78b500] hardware accelerator failed to decode picture
[hevc @ 0x56401e82a180] Could not find ref with POC 0
[hevc @ 0x56401e82a180] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e82a180] hardware accelerator failed to decode picture
[hevc @ 0x56401e7f15c0] Could not find ref with POC 1
[hevc @ 0x56401e7f15c0] Failed to end picture decode issue: 23 (internal decoding error).
[hevc @ 0x56401e7f15c0] hardware accelerator failed to decode picture
[hevc @ 0x56401e802e40] Could not find ref with POC 6
...

OneVPL does not show anything in dmesg, but FFmpeg does show GPU hangs:

[10735.331980] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
[10735.331984] i915 0000:0a:00.0: [drm] ffmpeg[19118] context reset due to GPU hang
[10741.805987] i915 0000:0a:00.0: [drm] GPU HANG: ecode 12:4:00000000, in ffmpeg [19118]
[10741.805992] i915 0000:0a:00.0: [drm] ffmpeg[19118] context reset due to GPU hang
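
To check for such hangs after a test run, the kernel log can be summarized per error code. This is only a minimal sketch: the function name `summarize_gpu_hangs` is made up for illustration, and it assumes the `GPU HANG: ecode ...` line format shown in the dmesg excerpt above.

```shell
# Hypothetical helper: read dmesg-style text on stdin and count
# "GPU HANG" lines per ecode value (e.g. 12:4:00000000).
summarize_gpu_hangs() {
    # sed prints only the ecode of lines that contain "GPU HANG: ecode ...";
    # sort | uniq -c then counts how often each ecode occurred.
    sed -n 's/.*GPU HANG: ecode \([0-9a-fA-F:]*\).*/\1/p' | sort | uniq -c
}

# Usage (reading the kernel log usually needs root):
#   sudo dmesg | summarize_gpu_hangs
```

For the four lines above this would report a count of 2 for ecode 12:4:00000000, since the "context reset" lines are not counted.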

E.g. running a simple OpenCL program with the latest public compute stack releases does not show any problems.

@XinfengZhang
Contributor

could you help to have a try with #1500

@eero-t
Author

eero-t commented Oct 7, 2022

could you help to have a try with #1500

Sure, but I'd like to see it first pass at least one of the CI tests... Currently they all fail for it?

@eero-t
Author

eero-t commented Oct 7, 2022

could you help to have a try with #1500

Sure, but I'd like to see it first pass at least one of the CI tests... Currently they all fail for it?

Tried it anyway. "Disable object capture for recoverable context" commit did not help, things fail like before.

@gizahNL

gizahNL commented Oct 12, 2022

Coming here after my issue report on onevpl-intel-gpu:

The latest media-driver is completely unusable for me (I'm only interested in encoding). AVC and HEVC encoding both fail when using the sample_encode program.

Up till commit 60001c6 (bisected) HEVC encoding works.

@nyanmisaka
Contributor

I tried DG1 on Windows 10 host last year, where it worked fine.

@Sherry-Lin
Contributor

Tested latest Git build of Mesa 3D driver on DG1 & DG2 (Arc). It aborted Weston & Xwayland with backport DKMS and worked only with upstream kernel (e.g. v6.3 drm-tip). I did not see any Mesa config option for enabling support for backport DKMS (like the media driver option for "production KMD").

=> Media driver working just with backport DKMS ties one also to whatever Mesa version is in the same repository?

PS. On quick testing public compute-runtime built from latest git tag worked with both KMD versions on DG1.

@eero-t are you using the Mesa from https://github.com/intel-gpu/Mesa/tags, or upstream Mesa?

@eero-t
Author

eero-t commented Jul 5, 2023

@Sherry-Lin I was using upstream Mesa i.e. version that distros will (eventually) include.

@Jexu Jexu added the P3 Low priority no customer usage, no business requirements, not from communities, just from internal label Jul 13, 2023
@Jianshui

Are you using Ubuntu 22.04?
As I understand it, the DKMS OOT kernel should work with the default Mesa.
You can also install Intel's prebuilt Mesa packages by following the release installation guide, which is verified on Ubuntu 22.04.
https://dgpu-docs.intel.com/driver/client/overview.html

2.1.2. Client Intel package repository configuration
For all client scenarios you must configure your system to install client (arc) packages. To add the Ubuntu 22.04 client package repository:

wget -qO - https://repositories.intel.com/graphics/intel-graphics.key |
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" |
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update

2.1.3. Install Compute, Media, and Display runtimes
This group of usermode packages should be installed for both out-of-tree and upstream driver install scenarios.

Note: Intel’s version of Mesa includes support for the out-of-tree driver. Standard Mesa can be used for the easiest case where the default Ubuntu 23.04 driver is installed.

sudo apt-get install -y \
  libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
  libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
  mesa-vdpau-drivers mesa-vulkan-drivers
After installing the new version of Mesa, you will need to restart your desktop either through a system reboot or by restarting your window manager.

@nyanmisaka
Contributor

@Jianshui Any ETA about fixing the DG1 media-driver support in non-OOT/upstream kernel?

The OOT kernel maintainer even says they do not support DG1.
intel-gpu/intel-gpu-i915-backports#99 (comment)

It is difficult to understand why DG1, a product of the same period as TGL, is still not fully supported upstream.

@Jianshui

Sorry, it's out of my scope. I hope the OOT kernel can fulfill your requirements.

@colorblank

I really hope you can fix the DG1 media-driver support in the non-OOT/upstream kernel. It is difficult to build and install the OOT kernel on niche Linux distributions like Unraid OS, so introducing DG1 support in the upstream kernel would really mean a lot.

@Jexu Jexu assigned XinfengZhang and Sherry-Lin and unassigned Jexu Apr 1, 2024
@Jexu Jexu removed the Decode video decode related label Apr 1, 2024
@Jexu
Contributor

Jexu commented Apr 1, 2024

Media driver still needs PROD kernel to work on DG1, and loses almost all features in upstream kernel (decode, encode, vp all not work).
@Sherry-Lin do we or kmd have plan to fully support dg1 in upstream kernel?

@nyanmisaka
Contributor

Media driver still needs PROD kernel to work on DG1, and loses almost all features in upstream kernel (decode, encode, vp all not work).
@Sherry-Lin do we or kmd have plan to fully support dg1 in upstream kernel?

Unfortunately the maintainer of PROD kernel says DG1 is not supported by them.

Also, no one seems to want to submit fixes for DG1 media features from PROD i915 to upstream i915. At this point, DG1 has nowhere to go.

@Sherry-Lin
Contributor

Yes, that's why PROD i915 is the suggested kernel version for DG1 from KMD team. I'm going to close this issue. Please feel free to re-open if any concerns.

@gizahNL

gizahNL commented Apr 1, 2024

So Intel is just giving up on driver support for the DG1 GPU?

@nyanmisaka
Contributor

Yes, that's why PROD i915 is the suggested kernel version for DG1 from KMD team. I'm going to close this issue. Please feel free to re-open if any concerns.

@Sherry-Lin Didn't you see that DG1 is NOT supported or tested with the backported i915 PROD/OOT driver? The Intel developers on the kernel side recommend that users use the upstream kernel.

@nyanmisaka
Contributor

Judging from your comments, Intel does not plan to provide any support for DG1 in either the PROD/OOT or the upstream kernel.

@Jexu
Contributor

Jexu commented Apr 2, 2024

FYI, I tried the latest drm-tip and media driver today, and everything seems to work fine after applying the draft fix in media driver #1787. The upstream kernel doesn't support GEM buffer object capture, so the fix simply disables it in the media driver.

This is drm-tip version I tried:
21a087f3d1d62518feb57cea8caf101cd4a81b5d (HEAD, cgit/drm-tip) drm-tip: 2024y-04m-01d-19h-35m-47s UTC integration manifest

So, please feel free to give a try to this change.

Note: ENABLE_PRODUCTION_KMD=OFF is required (default off)

@Jexu Jexu self-assigned this Apr 2, 2024
@nyanmisaka
Contributor

nyanmisaka commented Apr 2, 2024

Hi @Jexu, thank you very much for the patch. With EXEC_OBJECT_CAPTURE disabled and ENABLE_PRODUCTION_KMD=OFF, the media driver is now partially working on DG1 (8086:4905). There are still some noticeable issues that can be fixed or improved.

decode:
H264 & MPEG2 & VC1 & MJPEG & AV1 decoding works
HEVC & VP9 decoding hangs infinitely (i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd)

vpp:
Commonly used VPP filters work (scale/csc/crop/deint/overlay/tonemap)

copy:
A little bit slow from my experience. (IIRC this was faster on older PROD kernel 5.15? I tested last year)

// sw=>vaapi, 1080p nv12, 575fps
ffmpeg -hide_banner -init_hw_device vaapi -f lavfi -i nullsrc=s=1920x1080,format=nv12 -vf hwupload -f null -
frame= 1506 fps=574 q=-0.0 Lsize=N/A time=00:01:00.20 bitrate=N/A speed=  23x

// sw=>qsv, 1080p nv12, 332fps
ffmpeg -hide_banner -init_hw_device qsv -f lavfi -i nullsrc=s=1920x1080,format=nv12 -vf hwupload=extra_hw_frames=16 -f null -
frame=  665 fps=332 q=-0.0 size=N/A time=00:00:26.60 bitrate=N/A speed=13.3x

encode:
Both H264 and HEVC encoding works from my initial testing.

kernel:

// 6.9.0-rc2-1-drm-tip-git-g5100fcc57dc5
5100fcc57dc5d45b246a0aeb068f4f8062d29b09 drm-tip: 2024y-04m-02d-12h-18m-22s UTC integration manifest

dmesg i915:

[    7.079481] i915 0000:03:00.0: Force probing unsupported Device ID 4905, tainting kernel
[    7.079493] i915 0000:03:00.0: enabling device (0000 -> 0002)
[    7.079958] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    7.102965] i915 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    7.103016] i915 0000:03:00.0: [drm] Disabling DMC firmware and runtime PM
[    7.752794] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg1_guc_70.bin version 70.20.0
[    7.752798] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg1_huc.bin version 7.9.3
[    7.758383] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[    7.758767] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[    7.758769] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[    7.758971] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[    7.763975] [drm] Initialized i915 1.6.0 20230929 for 0000:03:00.0 on minor 0
[    7.826779] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[   31.088904] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd
[  389.057962] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd
[  436.203919] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd
[ 1420.091142] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:4:28fffffd

lspci:

03:00.0 VGA compatible controller: Intel Corporation DG1 [Iris Xe MAX Graphics] (rev 01) (prog-if 00 [VGA controller])
	Flags: bus master, fast devsel, latency 0, IRQ 121, IOMMU group 0
	Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
	Memory at 7e00000000 (64-bit, prefetchable) [size=4G]
	Expansion ROM at fa000000 [disabled] [size=2M]
	Capabilities: [40] Vendor Specific Information: Len=0c <?>
	Capabilities: [70] Express Endpoint, IntMsgNum 0
	Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
	Capabilities: [d0] Power Management version 3
	Capabilities: [100] Latency Tolerance Reporting
	Kernel driver in use: i915
	Kernel modules: i915

Please let me know if you have more patches that can fix the above issues on DG1.

@Jexu
Contributor

Jexu commented Apr 8, 2024

About the VP9 and HEVC decoding hang: I compared the batch buffer dumps from the PROD and upstream kernels and unfortunately didn't find any difference.
The media driver submits the same batch buffer, but the result differs between the two kernels. Some unknown issue may still exist in the upstream kernel, as the GPU hang seems to occur in the ring buffer (programmed by the KMD). The media driver was never fully verified on the upstream kernel, whereas all features should be verified on the PROD kernel.

Luckily, draft PR #1787 brings back most of the media features. If you want, I could prepare a formal fix and get it merged into the media driver. By the way, setting the environment variable with 'export INTEL_I915_CTX_CONTROL=1' to disable recoverable contexts could also solve it.
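
For anyone who prefers not to export the variable for the whole shell session, it can be set for a single command instead. A minimal sketch, assuming the variable behaves as described in the comment above; the wrapper name `with_ctx_control` and the input file name are placeholders, not part of the media driver.

```shell
# Hypothetical wrapper: run one command with INTEL_I915_CTX_CONTROL=1 set
# only in that command's environment (not in the calling shell).
with_ctx_control() {
    INTEL_I915_CTX_CONTROL=1 "$@"
}

# Usage (input.h265 is a placeholder file name):
#   with_ctx_control ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi \
#       -i input.h265 -c:v h264_vaapi -f null -
```

Scoping the variable to one command makes it easier to compare runs with and without the workaround.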

@Jexu
Contributor

Jexu commented Apr 8, 2024

Another experiment:
I removed the decode instructions from the batch buffer and submitted an empty batch to the KMD; the GPU hang still occurs. So the upstream kernel does have some issue.

@nyanmisaka
Contributor

nyanmisaka commented Apr 9, 2024

This is very tricky, considering that the development of DG1 in upstream i915 has been stalled for a long time. All changes occur in PROD i915. All the commits there are squashed and difficult to read.

@Jexu Any chance the DG1 is fully supported in the newly upstreamed Xe KMD?

@Jexu
Contributor

Jexu commented Apr 9, 2024

I have never tried it on DG1, but the Xe KMD has experimental support for DG1, and everything needed for Xe KMD support is already in the latest media driver.

@nyanmisaka
Contributor

@Jexu With the xe KMD mainlined, I revisited DG1 with it (Linux 6.9.5 + xe.force_probe=4905). Media now works fine on DG1, including end-to-end transcoding. No more segfaults or extreme slowness with certain codecs. I'd call it very usable, though it still needs time to prove its stability.

@colorblank

I tried replacing the Unraid kernel with Linux 6.9.5 and added xe.force_probe=4908 to the startup script. It can only drive the DG1 but cannot decode properly. Are there any additional steps required?

@nyanmisaka
Contributor

Possibilities:

  • Missing firmware
  • Outdated media-driver
  • Outdated vpl-rt
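
The first possibility can be checked roughly as follows. This is a sketch under assumptions: it looks for the two firmware files named in the i915 dmesg output earlier in this thread (`i915/dg1_guc_70.bin` and `i915/dg1_huc.bin`), and whether the xe driver loads exactly these files is not confirmed here. The function name `check_dg1_firmware` and the default `/lib/firmware` location are also assumptions.

```shell
# Hypothetical helper: report whether the DG1 firmware files seen in the
# i915 dmesg output exist under a firmware directory (default /lib/firmware).
check_dg1_firmware() {
    fw_dir="${1:-/lib/firmware}"
    for f in i915/dg1_guc_70.bin i915/dg1_huc.bin; do
        # The trailing * also matches compressed variants such as .xz or .zst.
        if ls "$fw_dir/$f"* >/dev/null 2>&1; then
            echo "found: $f"
        else
            echo "missing: $f"
        fi
    done
}

# Usage:
#   check_dg1_firmware
#   check_dg1_firmware /path/to/other/firmware/dir
```

The kernel log from driver load (the "GuC firmware ... version" lines) remains the more direct evidence of what was actually loaded.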
