@peppergrayxyz
Last active April 21, 2025 18:59

QEMU with VirtIO GPU Vulkan Support

With its latest release, QEMU added the Venus patches, so virtio-gpu now supports Venus encapsulation for Vulkan. This is one more piece of the puzzle towards full Vulkan support.

A blog post on Collabora described in 2021 how to enable 3D acceleration of Vulkan applications in QEMU through Venus, the experimental Vulkan driver for VirtIO-GPU, using a local development environment. Following up on that now-outdated write-up, this is how it's done today.

Definitions

Let's start with a brief description of the projects mentioned in that post and extend the list:

  • QEMU is a machine emulator.
  • VirGL is an OpenGL driver for VirtIO-GPU, available in Mesa.
  • Venus is an experimental Vulkan driver for VirtIO-GPU, also available in Mesa.
  • Virglrenderer is a library that enables hardware acceleration for VM guests, effectively translating commands from the two drivers just mentioned to either OpenGL or Vulkan.
  • libvirt is an API for managing platform virtualization.
  • virt-manager is a desktop user interface for managing virtual machines through libvirt.

Merged Patches:

Work in progress:

Prerequisites

Make sure you have the proper version installed on the host:

  • linux kernel >= 6.13 built with CONFIG_UDMABUF
  • working Vulkan and kvm setup
  • qemu >= 9.2.0
  • virglrenderer with enabled venus support
  • mesa >= 24.2.0

You can verify this like so:

$ uname -r
6.13.0
$ ls /dev/udmabuf
/dev/udmabuf
$ ls /dev/kvm
/dev/kvm
$ qemu-system-x86_64 --version
QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
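
If /dev/udmabuf is missing, you can check whether the running kernel was built with the required option (this assumes your distro ships kernel configs under /boot):

$ grep CONFIG_UDMABUF /boot/config-$(uname -r)
CONFIG_UDMABUF=y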

Check your distro's package sources to see how they build virglrenderer.
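
There is no single authoritative query for this, but one rough heuristic (a sketch, assuming the library is installed at the usual multiarch path) is to look for Venus-related strings in the shared object; if nothing shows up, the package was most likely built without -Dvenus=true:

$ strings /usr/lib/x86_64-linux-gnu/libvirglrenderer.so.1 | grep -ci venus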

For Vulkan to work, you need the proper drivers for your graphics card installed.

To verify your setup, install vulkan-tools, make sure mesa >= 24.2.0, and test vkcube:

$ vulkaninfo --summary | grep driverInfo
	driverInfo         = Mesa 24.2.3-1ubuntu1
	driverInfo         = Mesa 24.2.3-1ubuntu1 (LLVM 19.1.0)
...
$ vkcube
Selected GPU x: ..., type: ...

Building qemu

If your distro doesn't (yet) ship an updated version of qemu, you can build it yourself from source:

wget https://download.qemu.org/qemu-9.2.0.tar.xz
tar xvJf qemu-9.2.0.tar.xz
cd qemu-9.2.0
mkdir build && cd build
../configure --target-list=x86_64-softmmu  \
  --enable-kvm                 \
  --enable-opengl              \
  --enable-virglrenderer       \
  --enable-gtk                 \
  --enable-sdl
make -j4

The configuration step will throw errors if packages are missing. Check the QEMU wiki for further info on what to install: https://wiki.qemu.org/Hosts/Linux
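
Once the build finishes, the binary can be run straight from the build directory without installing:

$ ./qemu-system-x86_64 --version
QEMU emulator version 9.2.0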

Create and run an image for QEMU

Create an image & fetch the distro of your choice:

Host

ISO=ubuntu-24.10-desktop-amd64.iso  
wget https://releases.ubuntu.com/oracular/ubuntu-24.10-desktop-amd64.iso  

IMG=ubuntu-24-10.qcow2
qemu-img create -f qcow2 $IMG 16G

Run a live version or install the distro:

qemu-system-x86_64                                               \
    -enable-kvm                                                  \
    -M q35                                                       \
    -smp 4                                                       \
    -m 4G                                                        \
    -cpu host                                                    \
    -net nic,model=virtio                                        \
    -net user,hostfwd=tcp::2222-:22                              \
    -device virtio-vga-gl,hostmem=4G,blob=true,venus=true        \
    -vga none                                                    \
    -display gtk,gl=on,show-cursor=on                            \
    -usb -device usb-tablet                                      \
    -object memory-backend-memfd,id=mem1,size=4G                 \
    -machine memory-backend=mem1                                 \
    -hda $IMG                                                    \
    -cdrom $ISO                                                  

Adjust the parameters accordingly:

  • smp: number of CPU cores
  • m: RAM (the memory-backend size must match this value)
  • hostmem: VRAM
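
Since -net user,hostfwd=tcp::2222-:22 forwards host port 2222 to the guest's SSH port, you can log in from the host once an SSH server is running in the guest (user is a placeholder for your guest username):

$ ssh -p 2222 user@localhost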

Guest

Install mesa-utils and vulkan-tools to test the setup:

$ glxinfo -B
$ vkcube
Selected GPU x: ..., type: ...

If the device is llvmpipe, something is wrong. The device should be virgl (...).
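
For reference, a working setup reports something along these lines (a sketch; exact device names and versions will differ):

$ glxinfo -B | grep "OpenGL renderer"
OpenGL renderer string: virgl (...)
$ vulkaninfo --summary | grep deviceName
	deviceName         = Virtio-GPU Venus (...)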

Troubleshooting

  • (host) add -d guest_errors to the QEMU command line to show error messages from the guest (see the sketch after this list)
  • (guest) try installing the Vulkan VirtIO drivers and Mesa
  • check the original blog post
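
For example, the run command from above, abbreviated here, with guest error logging enabled:

qemu-system-x86_64 -enable-kvm ... -d guest_errors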

Ubuntu 24.10

This is how you do it on Ubuntu:

kernel

Install mainline: https://github.com/bkw777/mainline

sudo add-apt-repository ppa:cappelikan/ppa
sudo apt update
sudo apt install mainline

Find the latest kernel (>= 6.13); at the time of writing 6.13 is a release candidate, so include those:

$ mainline check --include-rc

Install kernel:

$ sudo mainline install 6.13-rc1

Verify installed kernels:

$ mainline list-installed
mainline 1.4.10
Installed Kernels:
linux-image-6.11.0-13-generic
linux-image-generic-hwe-24.04
linux-image-unsigned-6.13.0-061300rc1-generic
mainline: done

Reboot into the new kernel.

Verify the running kernel:

$ uname -r
6.13.0-061300rc1-generic

virglrenderer

The Ubuntu package is not compiled with the proper flags.

If it is installed, remove it: $ sudo apt-get remove libvirglrenderer-dev

download, build & install from source with venus enabled

wget https://gitlab.freedesktop.org/virgl/virglrenderer/-/archive/1.1.0/virglrenderer-1.1.0.tar.gz
tar xzf virglrenderer-1.1.0.tar.gz
cd virglrenderer-1.1.0
sudo apt-get install python3-full ninja-build libvulkan-dev libva-dev
python3 -m venv venv
venv/bin/pip install meson
venv/bin/meson setup build -Dvideo=true -Dvenus=true
ninja -C build
sudo ninja -C build install
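
Meson installs to /usr/local by default, so when building QEMU afterwards you may need to point pkg-config at the freshly installed library and refresh the linker cache (the exact libdir depends on your meson defaults):

export PKG_CONFIG_PATH=/usr/local/lib/x86_64-linux-gnu/pkgconfig:$PKG_CONFIG_PATH
pkg-config --modversion virglrenderer
sudo ldconfig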

qemu

Install qemu >= 9.2.0; at the time of writing, Ubuntu has not yet packaged it.

Install build dependencies: https://wiki.qemu.org/Hosts/Linux

sudo apt-get install build-essential python3-pip libslirp-dev slirp
sudo apt-get install git libglib2.0-dev libfdt-dev libpixman-1-dev zlib1g-dev ninja-build
sudo apt-get install git-email
sudo apt-get install libaio-dev libbluetooth-dev libcapstone-dev libbrlapi-dev libbz2-dev
sudo apt-get install libcap-ng-dev libcurl4-gnutls-dev libgtk-3-dev
sudo apt-get install libibverbs-dev libjpeg8-dev libncurses5-dev libnuma-dev
sudo apt-get install librbd-dev librdmacm-dev
sudo apt-get install libsasl2-dev libsdl2-dev libseccomp-dev libsnappy-dev libssh-dev
sudo apt-get install libvde-dev libvdeplug-dev libvte-2.91-dev libxen-dev liblzo2-dev
sudo apt-get install valgrind xfslibs-dev 
sudo apt-get install libnfs-dev libiscsi-dev

Build and run as described above.

virt-manager

-- work in progress --

Currently this is work in progress, so there is no option to add Vulkan support in virt-manager: there are no fields to configure it. Editing the domain XML directly doesn't work either, because libvirt doesn't know about these options, so XML validation fails. There is, however, an option for QEMU command-line passthrough, which bypasses the validation.

If you set up a default machine with 4G of memory, you can do this:

  <qemu:commandline>
    <qemu:arg value="-device"/>
    <qemu:arg value="virtio-vga-gl,hostmem=4G,blob=true,venus=true"/>
    <qemu:arg value="-object"/>
    <qemu:arg value="memory-backend-memfd,id=mem1,size=4G"/>
    <qemu:arg value="-machine"/>
    <qemu:arg value="memory-backend=mem1"/>
    <qemu:arg value="-vga"/>
    <qemu:arg value="none"/>
  </qemu:commandline>
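
Note that libvirt only accepts <qemu:commandline> if the QEMU XML namespace is declared on the domain root element:

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
  </domain>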

This configuration gives the following error:

qemu-system-x86_64: virgl could not be initialized: -1

Changing the size from 4G to 4194304k (the same value as the machine memory) leads to this error:

qemu-system-x86_64: Spice: ../spice-0.15.2/server/red-qxl.cpp:435:spice_qxl_gl_scanout: condition `qxl_state->gl_draw_cookie == GL_DRAW_COOKIE_INVALID' failed

This remains to be further investigated.

@zhangyiwei

Hi,

For anyone that encounters the stuck in fence wait error, please refer to the latest Virtio-GPU Venus driver page for required KVM patches corresponding to your setup: https://docs.mesa3d.org/drivers/venus.html

As folks might have found out with VN_PERF=no_fence_feedback, the cube can spin, but very likely you'll still see visual artifacts of corrupted/half-uploaded textures or broken vertices with some real game workloads.

The underlying brokenness has been fixed by this KVM series: https://lore.kernel.org/all/[email protected]/ Unfortunately, the last enablement patch has been temporarily reverted due to a Bochs DRM driver's faulty behavior that relies on KVM ignoring the guest PAT while overwriting it to WB. The driver has been fixed, but since it's on the guest kernel side, the host KVM patch still cannot reland for fear of regressing existing VMs. How and when the last patch will reland is TBD, so manually patching your host kernel is the only resolution at this point. Feel free to check the updated Venus driver page for the mailing list conversations regarding this matter.

Thanks!

@z1g

z1g commented Mar 31, 2025

Thanks for taking the time to write all this up. I just switched from Nobara to Arch since Fedora dropped X11 support. I had worked on this about a year or so ago with some progress but no results. I decided to give it a go on Arch with just the packages from AUR and it just worked. Pretty impressive.

@thesword53

For those who encounter stuck in fence wait, adding the environment variable

VN_PERF=no_fence_feedback

to /etc/environment might help.

It is not working with NVIDIA drivers. I also have these logs on the host:

[16888.122513] NVRM: GPU at PCI:0000:2d:00: GPU-85142072-d857-6520-e4b7-8d06bf8e4a0d
[16888.122519] NVRM: Xid (PCI:0000:2d:00): 69, pid=37310, name=vkr-ring-4, Class Error: ChId 00bc, Class 0000c597, Offset 00000274, Data 00000024, ErrorCode 0000009c

It happens with vkcube, vkmark, vkgears, but non-native games using VKD3D or DXVK are working fine.

@myrslint

myrslint commented Apr 10, 2025

Hi,

For anyone that encounters the stuck in fence wait error, please refer to the latest Virtio-GPU Venus driver page for required KVM patches corresponding to your setup: https://docs.mesa3d.org/drivers/venus.html

As folks might have found out with VN_PERF=no_fence_feedback, the cube can spin, but very likely you'll still see visual artifacts of corrupted/half-uploaded textures or broken vertices with some real game workloads.

The underlying brokenness has been fixed by this KVM series: https://lore.kernel.org/all/[email protected]/ Unfortunately, the last enablement patch has been temporarily reverted due to a Bochs DRM driver's faulty behavior that relies on KVM ignoring the guest PAT while overwriting it to WB. The driver has been fixed, but since it's on the guest kernel side, the host KVM patch still cannot reland for fear of regressing existing VMs. How and when the last patch will reland is TBD, so manually patching your host kernel is the only resolution at this point. Feel free to check the updated Venus driver page for the mailing list conversations regarding this matter.

Thanks!

What I write below may be incorrect or inaccurate since I don't have the knowledge to properly understand what's going on between these various code bases and pieces of software; so please bear with me. My current stake in the matter is limited to hoping to get working Vulkan encapsulation on my current hardware consisting of an NVIDIA GPU and an Intel CPU.

Mesa documentation seems to be out-of-date with the current state of KVM. The patch series linked no longer applies correctly against kernel source code. I brought this up on the mailing list linked in the quoted reply. It was explained to me that a quirk flag, on by default, is now available in KVM which QEMU has to disable to enforce the feature that is needed for getting working Venus on AMD and NVIDIA GPUs paired with Intel CPUs.

Thus, it seems we now have to ask QEMU developers to add the disabling of said quirk as a default, since the use of Bochs DRM driver or guests that need the quirk enabled is highly uncommon.

Any correction to my understanding of the issue, detailed above, is most welcome.

@zhangyiwei

Mesa documentation seems to be out-of-date with the current state of KVM. The patch series linked no longer applies correctly against kernel source code.

Which kernel branch are you working with?

That last patch is a 2-liner. I linked the original mailing list series only to help folks get all the context there ; )

@myrslint

myrslint commented Apr 10, 2025

Mesa documentation seems to be out-of-date with the current state of KVM. The patch series linked no longer applies correctly against kernel source code.

Which kernel branch are you working with?

That last patch is a 2-liner. I linked the original mailing list series only to help folks get all the context there ; )

I downloaded the series using b4 and attempted to apply it against sources extracted from the 6.14.1 tarball, the latest stable as of 2025-04-09. (6.14.2 was released somewhat earlier today.)

I likely have misunderstood your post. Did I need to apply only PATCH 5/5?

@zhangyiwei

I likely have misunderstood your post. Did I need to apply only PATCH 5/5?

Correct. Just the last patch, which is the one that got reverted. You could also git revert that revert patch in-tree. I've also double-checked there are no conflicts with the latest kernel.

Meanwhile, thanks for following up on that mailing list. I was planning to update the driver page again once that new userspace opt-in hit a release.

@myrslint

I likely have misunderstood your post. Did I need to apply only PATCH 5/5?

Correct. Just the last patch, which is the one that got reverted. You could also git revert that revert patch in-tree. I've also double-checked there are no conflicts with the latest kernel.

Meanwhile, thanks for following up on that mailing list. I was planning to update the driver page again once that new userspace opt-in hit a release.

That's great help. Thank you! I'll be trying to compile 6.14.2 with only the last patch soon.

@myrslint

myrslint commented Apr 11, 2025

My status report
My hardware combination is Intel i7-3770k and NVIDIA GTX 1060 6GB. I use NVIDIA's proprietary driver version 570.133.07.

I compiled kernel 6.14.2 using Arch Build System (ABS) with the patch listed above applied. After that I installed the generated linux and linux-headers packages.

QEMU 9.2.3 was used to boot an Arch Linux VM nearly identical to the host, including installation of the patched 6.14.2 kernel.

Upon being run, vulkaninfo correctly reported the virtualized GPU. It does so with or without the patch.

Running vkcube (with either --wsi wayland or --wsi xcb, defaults to XCB WSI) leads to a black window being displayed with no spinning cube and the same error messages on the guest and on the host as before.

On the guest: MESA-VIRTIO: debug: stuck in fence wait with iter at 1024 which repeats with doubled iter each time.
If gamescope is used as compositor on the guest, instead of sway, an almost identical error message is produced with one key difference: fence wait is changed to semaphore wait.

On the host, in dmesg:

NVRM: Xid (PCI:0000:01:00): 69, pid=2608, name=vkr-ring-9, Class Error: ChId 0035, Class 0000c197, Offset 00000d78, Data 00000024, ErrorCode 0000009c
NVRM: Xid (PCI:0000:01:00): 69, pid=2636, name=vkr-ring-9, Class Error: ChId 0035, Class 0000c197, Offset 00000d78, Data 00000024, ErrorCode 0000009c

It appears applying the patch has no effect on this problem.

@zhangyiwei

It appears applying the patch has no effect on this problem.

Could you help check if any of below makes a difference?

  1. VN_DEBUG=all VN_PERF=all vkcube
  2. MESA_VK_WSI_DEBUG=buffer VN_DEBUG=all VN_PERF=all vkcube
  3. MESA_VK_WSI_DEBUG=sw,buffer VN_DEBUG=all VN_PERF=all vkcube
  4. MESA_VK_WSI_DEBUG=sw,buffer VN_DEBUG=all VN_PERF=no_fence_feedback,no_semaphore_feedback vkcube

Btw, what's your guest Mesa driver version? Could you help give tip-of-tree Mesa a try as well if it's not already the latest?

@myrslint

myrslint commented Apr 11, 2025

It appears applying the patch has no effect on this problem.

Could you help check if any of below makes a difference?

1. `VN_DEBUG=all VN_PERF=all vkcube`

2. `MESA_VK_WSI_DEBUG=buffer VN_DEBUG=all VN_PERF=all vkcube`

3. `MESA_VK_WSI_DEBUG=sw,buffer VN_DEBUG=all VN_PERF=all vkcube`

4. `MESA_VK_WSI_DEBUG=sw,buffer VN_DEBUG=all VN_PERF=no_fence_feedback,no_semaphore_feedback vkcube`

Btw, what's your guest Mesa driver version? Could you help give tip-of-tree Mesa a try as well if it's not already the latest?

I have attached the logs from vkcube stderr, in this gist, for each case with the corresponding number. Visually, the first two still show a black window and no cube as before; the second two show a spinning cube but it spins rapidly and erratically.

Mesa version is 25.0.3 as shipped by Arch Linux, on both host and guest. Kernel is unpatched 6.14.2 also as shipped by Arch Linux, on both host and guest.

Compiling and installing from Mesa's latest commit will take me a while. I'll report back when it is done.

@myrslint

myrslint commented Apr 11, 2025

Built and installed, on both host and guest, Mesa from latest commit (f1f87d302fa) using this AUR package. The results were visually the same. I have attached the logs in this gist, similarly numbered by cases.

@zhangyiwei

Thanks for all the info! It looks to me like the issue with your setup is more about the scanout image handling. Could you additionally give the below a try? (And does the NVRM error log show up in the host dmesg?)

  1. MESA_VK_WSI_DEBUG=sw,buffer vkcube
  2. MESA_VK_WSI_DEBUG=sw,linear vkcube

@myrslint

Thanks for all the info! It looks to me like the issue with your setup is more about the scanout image handling. Could you additionally give the below a try? (And does the NVRM error log show up in the host dmesg?)

5. `MESA_VK_WSI_DEBUG=sw,buffer vkcube`

6. `MESA_VK_WSI_DEBUG=sw,linear vkcube`

I've attached the logs in this gist, numbered as before. This time I managed to remember to include the output from vulkaninfo as well. Without VN_DEBUG=all the logs are very terse.

No messages were logged in the host dmesg. However, no messages were logged when running without the environment variables either. I have noticed the messages from NVIDIA driver are sporadic and do not correspond to every run of vkcube. Some runs, usually the first two or three after a host reboot, log a message on the host but the rest don't, regardless of environment variables. Rebooting the guest does not seem to have any effect on inducing fresh driver messages on the host.

@zhangyiwei

Based on the logs for (5), compared to (3) and (4), your setup does need the patched host kernel to mitigate the EPT PAT issue. Previously, the issue was partially hidden behind Venus perf options.

After patching host kvm, at least case (5) would be running fine. Meanwhile, case (6) is the one I'm also curious about with the patched kernel.

No messages were logged in the host dmesg. However, no messages were logged when running without the environment variables either. I have noticed the messages from NVIDIA driver are sporadic and do not correspond to every run of vkcube. Some runs, usually the first two or three after a host reboot, log a message on the host but the rest don't, regardless of environment variables. Rebooting the guest does not seem to have any effect on inducing fresh driver messages on the host.

This is really helpful info! If the same is also observed with the patched kvm, then the issue resides somewhere in the VK->GL external memory import. Another follow-up experiment would be to redirect the X server to use Zink-on-Venus as the gbm backing, so that we can tell whether the brokenness with hw scanout images is due to VirGL or host GL driver issues ; )

@myrslint

myrslint commented Apr 11, 2025

Based on the logs for (5), compared to (3) and (4), your setup does need the patched host kernel to mitigate the EPT PAT issue. Previously, the issue was partially hidden behind Venus perf options.

After patching host kvm, at least case (5) would be running fine. Meanwhile, case (6) is the one I'm also curious about with the patched kernel.

No messages were logged in the host dmesg. However, no messages were logged when running without the environment variables either. I have noticed the messages from NVIDIA driver are sporadic and do not correspond to every run of vkcube. Some runs, usually the first two or three after a host reboot, log a message on the host but the rest don't, regardless of environment variables. Rebooting the guest does not seem to have any effect on inducing fresh driver messages on the host.

This is really helpful info! If the same is also observed with the patched kvm, then the issue resides somewhere in the VK->GL external memory import. Another follow-up experiment would be to redirect the X server to use Zink-on-Venus as the gbm backing, so that we can tell whether the brokenness with hw scanout images is due to VirGL or host GL driver issues ; )

Firstly, I want to thank you for following up on this issue and investing so much time and effort into resolving it. I very much appreciate your help. Below, are my findings based on your pointers.

I built the patched kernel again, installed it on the host, and rebooted the host. This is the 6.14.2 kernel built from the kernel.org source tarball using the default Arch kernel config, with many unnecessary device drivers disabled to make the build take less time. The only addition was PATCH-5-5-KVM-VMX-Always-honor-guest-PAT-on-CPUs-that-support-self-snoop. For my CPU (i7-3770k), /proc/cpuinfo includes the following line:

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d

The flag ss here indicates the CPU supports self-snoop.

Installing the patched kernel on the host made the triggering of host dmesg error message consistent. Any run of vkcube on the guest which resulted in the messages MESA-VIRTIO: debug: stuck in fence wait with iter at N on the guest, also resulted in a corresponding error in the host dmesg reading NVRM: Xid (PCI:0000:01:00): 69, pid=1737, name=vkr-ring-9, Class Error: ChId 002d, Class 0000c197, Offset 00000d78, Data 00000024, ErrorCode 0000009c. I confirmed this through numerous runs.

I ran all the commands you provided again and collected the logs in this gist. These can be summarized as:

  1. As per your prediction, commands 3, 4, and 5 do result in a properly spinning cube being rendered. These run rather sluggishly and heavily engage the CPU while leaving the GPU mostly idle (based on nvtop observation). My understanding is that some aspect of the work is being done in software on the CPU, rather than running on the GPU.
  2. Commands 1, 2, and 6 result in the black window with no spinning cube. These produce the usual error message on the guest and also reliably trigger the NVIDIA driver error message in the host dmesg.

All these tests were performed from a terminal (foot) running on the compositor sway. Running gamescope -- vkcube from the console (i.e., using gamescope's DRM backend, which prior to patching the host kernel resulted in the semaphore wait error message instead) resulted in QEMU's GTK frontend showing a blank screen and briefly displaying the error message Display output is not active. From this point on the VM seems to become non-responsive. Pressing [Escape], to exit vkcube if possible, and then blindly attempting a soft reboot does not help. A hard reset of the VM using QEMU facilities is required.

For the follow-up tests, my understanding is that Zink is the name for running OpenGL programs on top of a Vulkan graphics stack with a translation layer from OpenGL to Vulkan in-between. Some online reading and this small wrapper seemed to indicate I should have prefixed the commands with __GLX_VENDOR_LIBRARY_NAME=mesa MESA_LOADER_DRIVER_OVERRIDE=zink GALLIUM_DRIVER=zink LIBGL_KOPPER_DRI2=1 to set those environment variables.

To test Zink-on-Venus based on above understanding, I tried running sway (from the console, with and without WLR_RENDERER=vulkan, the default renderer being gles2), eglgears_x11, gamescope (from the console), glxgears, and vkgears on the guest with these variables prefixed. The logs, where applicable, are collected in this gist. My description of what happened is:

  1. sway ran but presented only a blank screen with the brief Display output is not active message. This could be recovered from by blindly exiting sway. Upon exiting the console would be displayed again normally.
  2. gamescope ran with the same error as sway and this could not be recovered from in any way other than resetting the VM.
  3. eglgears_x11 run on top of sway (started normally from console) displayed a black window and on the terminal printed the errors seen in the collected logs. This (MESA: error: CreateSwapchainKHR failed with VK_ERROR_OUT_OF_HOST_MEMORY) was, to me, a new type of error message but might be what's underlying the previous errors as well.
  4. glxgears displayed a black window and then quickly exited with the errors seen in the collected logs. These consisted of the same Vulkan error as with eglgears_x11 as well as a more specific GLX error.
  5. vkgears expectedly demonstrated the same problem as other Vulkan demo programs. Zink (OpenGL atop Vulkan) seemingly had no hand in that.
  6. Notably, runs of OpenGL/EGL programs did not trigger NVIDIA driver error messages on the host. Any run of vkgears, however, did do so.

Next, I tried a test not listed above. On the host I installed the Arch Linux vulkan-swrast package and forced QEMU to use the llvmpipe software rasterizer by specifying VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json in the command invoking qemu-system-x86_64. Naturally, this resulted in vulkaninfo on the guest reporting llvmpipe (LLVM 19.1.7, 256 bits) as its GPU. With that change I could run vkcube without any environment variables. This displayed the spinning cube but, in addition to the expected high CPU usage, the cube's spinning was juddery and erratic, and some short-lived, noisy, black-colored artefacts would repeatedly appear and disappear across a relatively fixed rectangle within the vkcube window. In the terminal that QEMU was run from (and was logging to), this error message could be read: virtio_gpu_virgl_process_cmd: ctrl 0x209, error 0x1200.

As an aside, I think I should mention that guest-to-host OpenGL encapsulation (without Zink) does not work entirely well on my hardware and software combination either. Demo programs such as eglgears_x11 and glxgears run successfully--achieving smooth motion and consistent frame rates--and even glmark2 runs its default benchmark scenes without issues. However, more complex OpenGL benchmarks such as Unigine Heaven and Unigine Valley, while producing commendable frame rates, also produce highly noticeable visual glitches such as momentarily disappearing and reappearing parts of some scenes, occasional vertex explosions, and occasional duplicated and/or misplaced scene objects. These glitches do not appear consistently in every cycle of the same run of the same benchmark on the guest VM, and they don't appear at all if the same benchmark is run on the host. Nonetheless, the glitches have steadily been reduced and the output quality improved over the past few months that I have tried these. It may be of note that although OpenGL rendering is clearly being offloaded to and done by the GPU there still is quite high CPU usage, entirely consuming 3-4 threads of a 4C/8T CPU. In some cases, such as the Refresh2025 benchmark (the OpenGL version since the Vulkan one only stalls as with vkcube) increasing rendered object count above some threshold seems to result in GPU utilization decreasing in proportion to the CPU's inability to keep up i.e., benchmark performance becomes CPU-bound.

@zhangyiwei

Firstly, I want to thank you for following up on this issue and investing so much time and effort into resolving it. I very much appreciate your help. Below, are my findings based on your pointers.

I'm the one to say thanks 🙏 I haven't used any nv gpu for years so mostly rely on community folks with nv setups for these sorts of investigations. Thanks again for bearing with me >_<

Installing the patched kernel on the host made the triggering of host dmesg error message consistent. Any run of vkcube on the guest which resulted in the messages...on the guest, also resulted in a corresponding error in the host dmesg...I confirmed this through numerous runs.

No random behaviors then. I was suspecting there existed two issues tangled here: the Intel EPT issue and the WSI issue. Before the guest PAT was honored, sometimes one issue could shield the other due to timing +/- VN_PERF options.

I ran all the commands you provided again and collected the logs in this gist...

As per your prediction, commands 3, 4, and 5 do result in a properly spinning cube being rendered.

Cool! Based on the observations, now I have a rough idea of what has gone wrong. Let me do some more homework before knocking myself out, or proposing any workarounds for nv setup.

...These run rather sluggishly and heavily engage the CPU while leaving the GPU mostly idle (based on nvtop observation). My understanding is that some aspect of the work is being done in software on the CPU, rather than running on the GPU.

That's expected. vkcube uses prerecorded cmds, so normally it's just throttled acquire and present calls at runtime from the app side, while the X server is doing composition with the GL driver. The wsi debug option used has engaged a CPU buffer to share with the X server instead of direct/zero-copy device memory sharing, ending up with heavier CPU usage.

@zhangyiwei

Could you help apply this hack to your guest mesa, and see if vkcube without any env vars works? It forces venus to take the prime blit path, and might not work...

I suspect the wsi side issue is the proprietary nv vulkan driver not waiting for implicit fence attached to the external memory. It might have such support for native wsi extension, but venus layers wsi atop external memory. The implicit fence not being properly waited is likely the one from host gl sampling from the venus wsi image (guest x server doing composition). That could explain why MESA_VK_WSI_DEBUG=sw,buffer makes things work. Currently I don't have any good way to workaround this because that implicit fence is entirely unknown to guest venus. If forcing prime blit can't hide the issue, I'll draft a workaround in host venus (vkr) to explicitly wait for the implicit fence before submitting to the nv driver.

@thesword53

I have the same issue as @myrslint with vkcube, vkgears and vkmark (MESA-VIRTIO: debug: stuck in fence wait with iter at 1024), but DXVK and VKD3D games are working fine. I can also run GTK4 applications with the Vulkan renderer.
GPU: RTX 2080 SUPER
CPU: AMD Ryzen 7 3700X

The following logs on the host indicate a Graphics Engine class error, according to https://docs.nvidia.com/deploy/xid-errors/index.html.

NVRM: Xid (PCI:0000:01:00): 69, pid=2608, name=vkr-ring-9, Class Error: ChId 0035, Class 0000c197, Offset 00000d78, Data 00000024, ErrorCode 0000009c
NVRM: Xid (PCI:0000:01:00): 69, pid=2636, name=vkr-ring-9, Class Error: ChId 0035, Class 0000c197, Offset 00000d78, Data 00000024, ErrorCode 0000009c

@zhangyiwei

Your AMD CPU + NV dGPU setup doesn't have the PAT issue, so it is only affected by the WSI side issue.

...but DXVK and VKD3D games are working fine. I can also run GTK4 application with the Vulkan renderer.

They happen to hide the synchronization issue potentially because they have reasonable frame pacing and they all have certain amount of cpu workloads before making the submission that involves the wsi image, which gives enough time for the implicit fence attached by the compositor to signal.

Two experiments against vkcube/vkmark/etc can be done to confirm the theory:

  1. Override to increase the x11 swapchain length to something much bigger with vk_x11_override_min_image_count env var.
  2. Add some sleep at the end of vn_AcquireNextImage2KHR.
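
For (1), the invocation would look something like this (the image count value here is an arbitrary example):

vk_x11_override_min_image_count=8 vkcube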

@myrslint

myrslint commented Apr 12, 2025

Could you help apply this hack to your guest mesa, and see if vkcube without any env vars works? It forces venus to take the prime blit path, and might not work...

I suspect the wsi side issue is the proprietary nv vulkan driver not waiting for implicit fence attached to the external memory. It might have such support for native wsi extension, but venus layers wsi atop external memory. The implicit fence not being properly waited is likely the one from host gl sampling from the venus wsi image (guest x server doing composition). That could explain why MESA_VK_WSI_DEBUG=sw,buffer makes things work. Currently I don't have any good way to workaround this because that implicit fence is entirely unknown to guest venus. If forcing prime blit can't hide the issue, I'll draft a workaround in host venus (vkr) to explicitly wait for the implicit fence before submitting to the nv driver.

I recompiled Mesa from the latest commit (676e26aed58) on the main branch using the AUR package previously mentioned. Thankfully, the PKGBUILD also has a section that applies patch files, so I added this patch from your fork to the sources array as vn-force-prime-blit.patch; it applied cleanly against the checked-out tree and compiled fine.

I installed the resulting Mesa package in the guest. The current software configuration consists of latest stock Arch Linux on the host and guest, with a patched kernel on the host and patched Mesa on the guest. With this configuration, vkcube runs on the guest without any environment variables. There is a short period of a black window being displayed followed by the spinning cube being displayed. There are no artefacts but the spinning is somewhat erratic. vkgears and vkmark, however, exhibit the same symptoms as before and print the same error message they did previously. I have uploaded a screen recording of the VM window to give a sense of the erratic motion mentioned. It also contains the terminal window showing the programs being run and the error messages.

@zhangyiwei

@myrslint @thesword53 hi, would you like to give https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34516 a try on your nvidia setup? with that, venus should have properly handled the implicit compositor release fence.

@myrslint

@myrslint @thesword53 hi, would you like to give https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34516 a try on your nvidia setup? with that, venus should have properly handled the implicit compositor release fence.

If I have understood your instructions correctly, I recompiled Mesa from the latest commit (b1af5780d13) with only MR #34516 applied as a patch. I installed the resulting package in the guest and ran vkcube. With this, vkcube once again showed only a black window, and the same error messages as before were logged on both guest and host.

The log from running VN_DEBUG=all VN_PERF=all vkcube is found in this gist.

@zhangyiwei

Thanks! What about trying it together with the prior hack (force buffer blit), to see if the mentioned erratic motion is improved?

There might exist multiple issues on the wsi path. Previously, with the PAT fix, (5) working but (6) not suggested the host NVIDIA driver has issues dealing with linear image import.

@myrslint

Thanks! What about trying it together with the prior hack (force buffer blit), to see if the mentioned erratic motion is improved?

There might exist multiple issues on the wsi path. Previously, with the PAT fix, (5) working but (6) not suggested the host NVIDIA driver has issues dealing with linear image import.

I did guess I had misunderstood your instructions.

This time I recompiled Mesa from latest commit (09896ee79e3 as of when I pulled from FDO) with both vn-force-prime-blit and vn-fix-acquire-fence applied as patches to that commit. Then, I installed the resulting package in the guest. The host kernel is the same patched 6.14.2 Arch kernel as before. The guest kernel is a stock 6.14.2 Arch kernel as before.

In this gist, logs from running commands 1-6 are collected and numbered. Visually, 1-5 result in a spinning cube being drawn. The ones with sw in MESA_VK_WSI_DEBUG generally appear more correct but much slower. The ones without are quicker but demonstrate what appears to be abrupt changes of the spinning speed or skipped frames every so often. They are nonetheless somewhat improved over the previous case of applying only vn-force-prime-blit.

6 still does not result in a spinning cube, only a black window, and triggers the host dmesg error message from the NVIDIA driver.

I have also made a screen recording of a VN_DEBUG=all VN_PERF=all vkcube run to hopefully give a sense of the cube's motion.

@zhangyiwei

One more question before I go back to do more homework: do you see improvements with just vkcube (w/o any additional env vars), as compared to the video attached on your previous #gistcomment-5537654?

The video on your latest #gistcomment-5541821 looks fine to me already. The occasional janks likely came from the compositor stack backpressure, but the out-of-order issue is gone per my visual check. Just need to confirm this ; )

@myrslint

One more question before I go back to do more homework: do you see improvements with just vkcube (w/o any additional env vars), as compared to the video attached on your previous #gistcomment-5537654?

The video on your latest #gistcomment-5541821 looks fine to me already. The occasional janks likely came from the compositor stack backpressure, but the out-of-order issue is gone per my visual check. Just need to confirm this ; )

Yes, it has improved significantly when run without any environment variables. It has gone from no cube and a black window at the very beginning to a proper rendering of the cube and no glaringly erratic motion. The issues that stand out are the following:

  1. vkcube --wsi wayland still stalls with the fence wait error, even though the same command runs fine on the host.
  2. With XCB WSI on the guest, vkcube runs and displays fine but there is a brief display of a black background before the rendering starts. This does not happen on the host with either XCB or Wayland WSI.
  3. The cube's motion, while significantly better than with just the prime blit patch, still is not smooth. As you have pointed out, it does not seem to go back and forth anymore but occasionally speeds up and slows down.

@zhangyiwei

For those who encounter stuck in fence wait, adding the environment variable

VN_PERF=no_fence_feedback

to /etc/environment might help.

Just in case some folks are affected by this, I happen to realize that the stock Mesa driver from stable Debian bookworm is Mesa 22.3.6, which contains an Intel ANV bug hit by the Venus sync feedback optimization path. So if you see the issue on an Intel iGPU setup, you can compile a separate ANV driver from the latest Mesa release and the issue will be gone, with optimized Venus performance.

@zhangyiwei

  1. With XCB WSI on the guest, vkcube runs and displays fine but there is a brief display of a black background before the rendering starts. This does not happen on the host with either XCB or Wayland WSI.

Hi @myrslint, will the black period go away with MESA_SHADER_CACHE_DISABLE=true? If so, that's due to the slow filesystem ops for the shader disk cache. The initial loading occurs during VkDevice creation time.

@myrslint

myrslint commented Apr 16, 2025

  1. With XCB WSI on the guest, vkcube runs and displays fine but there is a brief display of a black background before the rendering starts. This does not happen on the host with either XCB or Wayland WSI.

Hi @myrslint, will the black period go away with MESA_SHADER_CACHE_DISABLE=true? If so, that's due to the slow filesystem ops for the shader disk cache. The initial loading occurs during VkDevice creation time.

Hello there again 🙂

Adding the environment variable MESA_SHADER_CACHE_DISABLE=true to vkcube runs does not make that period of black window display go away. The VM's storage uses virtio, a relatively fast virtual storage driver, and is backed by a qcow2 file on a relatively fast SSD, so I doubt disk operations would be a bottleneck anywhere.

However, as seen in the debug logs, during the time a black window is displayed before the cube is shown, messages indicate that a swapchain is created three times (in parallel?) and destroyed for some reason before a fourth, successful one is finally created.
