1132 Commits

Author SHA1 Message Date
Shameer Kolothum
06b38473cd hw/vfio/pci: Synthesize PASID capability for vfio-pci devices
Add support for synthesizing a PCIe PASID extended capability for
vfio-pci devices when PASID is enabled via a vIOMMU and supported by
the host IOMMU backend.

PASID capability parameters are retrieved via IOMMUFD APIs and the
capability is inserted into the PCIe extended capability list using
the insertion helper. A new x-vpasid-cap-offset property allows
explicit control over the placement; by default the capability is
placed at the end of the PCIe extended configuration space.

If the kernel does not expose PASID information or insertion fails,
the device continues without PASID support.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Message-id: 20260126104342.253965-37-skolothumtho@nvidia.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2026-01-29 13:32:05 +00:00
Shameer Kolothum
550beca3d7 backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info()
Retrieve PASID width from iommufd_backend_get_device_info() and store it
in HostIOMMUDeviceCaps for later use.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Message-id: 20260126104342.253965-33-skolothumtho@nvidia.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2026-01-29 13:32:05 +00:00
Nicolin Chen
8cfaf22668 hw/vfio/region: Create dmabuf for PCI BAR per region
Linux now provides a VFIO dmabuf exporter to expose PCI BAR memory for P2P
use cases. Create a dmabuf for each mapped BAR region after the mmap is set
up, and store the returned fd in the region’s RAMBlock. This allows QEMU to
pass the fd to dma_map_file(), enabling iommufd to import the dmabuf and map
the BAR correctly in the host IOMMU page table.

If the kernel lacks support or dmabuf setup fails, QEMU skips the setup
and continues with normal mmap handling.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260121114111.34045-4-skolothumtho@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-26 08:30:04 +01:00
Shameer Kolothum
de36da106d hw/vfio: Add helper to retrieve device feature
Add vfio_device_get_feature() as a common helper to retrieve
VFIO device features.

No functional change intended.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260121114111.34045-3-skolothumtho@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-26 08:30:04 +01:00
Jim Shu
0e387bd1df hw/vfio: cpr-iommufd: Fix wrong usage of migrate_add_blocker_modes
The API returns 0 on success and a negative error code on failure, so
check whether the return value equals 0.
Also, the MIG_MODE bits should be CPR_TRANSFER and CPR_EXEC instead of
the same bit twice.

The API usage is aligned with 'hw/vfio/cpr-legacy.c' after these 2
changes.

Fixes: 3ca0a0ab05 ("migration: Use bitset of MigMode instead of variable arguments")
Signed-off-by: Jim Shu <jim.shu@sifive.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260121063418.2001326-1-jim.shu@sifive.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-26 08:30:04 +01:00
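
The two points in the fix above (a mode bitset built from two distinct bits, and a success test against 0) can be sketched as follows. This is only an illustration: every `demo_`/`DEMO_` name is hypothetical, and the stand-in function does not reproduce the real migrate_add_blocker_modes() signature.

```c
#include <errno.h>

/* Stand-in mode enum; the real MigMode values are not reproduced here. */
enum demo_mode {
    DEMO_MODE_NORMAL,
    DEMO_MODE_CPR_TRANSFER,
    DEMO_MODE_CPR_EXEC,
};

#define DEMO_BIT(n) (1u << (n))

/* Stand-in API: returns 0 on success, a negative errno value on failure. */
static int demo_add_blocker_modes(unsigned int modes)
{
    return modes ? 0 : -EINVAL;
}

/* Returns 1 if the blocker was installed for both CPR modes, else 0. */
static int demo_block_cpr(void)
{
    /* Two different bits, one per mode, rather than the same bit twice. */
    unsigned int modes = DEMO_BIT(DEMO_MODE_CPR_TRANSFER) |
                         DEMO_BIT(DEMO_MODE_CPR_EXEC);

    /* Success is "== 0"; treating a non-zero return as success inverts it. */
    return demo_add_blocker_modes(modes) == 0;
}
```

With this convention a caller that writes `if (demo_add_blocker_modes(modes))` is taking the error path, which is the kind of mix-up the commit describes.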
Zhenzhong Duan
e3c659fee0 vfio/migration: Fix page size calculation
Coverity detected an issue of left shifting int by more than 31 bits leading
to undefined behavior.

In practice bcontainer->dirty_pgsizes always has some common page sizes
when dirty tracking is supported.

Resolves: Coverity CID 1644186
Resolves: Coverity CID 1644187
Resolves: Coverity CID 1644188
Fixes: 46c7633114 ("vfio/migration: Add migration blocker if VM memory is too large to cause unmap_bitmap failure").
Suggested-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260116060315.65723-1-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-26 08:30:04 +01:00
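
The class of problem Coverity reported above can be illustrated with a minimal sketch. The helper name `page_size_from_shift` is hypothetical and is not taken from the QEMU source; it only shows why the constant has to be 64 bits wide before it is shifted by a large amount.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU code: turn a page shift into a page size.
 * Writing "1 << shift" shifts a plain int, which is undefined once the
 * shift reaches the width of int; widening the constant first avoids that.
 */
static uint64_t page_size_from_shift(unsigned int shift)
{
    /* The caller must keep shift below 64, the width of uint64_t. */
    return UINT64_C(1) << shift;
}
```

For example, `page_size_from_shift(12)` yields 4096, and a shift of 40 is well defined here, whereas it would not be with an `int` operand.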
Zhenzhong Duan
68d3a2a24d Workaround for ERRATA_772415_SPR17
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the second stage page table could still be written.

Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
Link https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
Backup https://cdrdv2.intel.com/v1/dl/getContent/772415

Also copied the SPR17 details from above link:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.

Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.

Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."

Introduce a helper vfio_device_get_host_iommu_quirk_bypass_ro to check if
readonly mappings should be bypassed.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-5-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
5c9da3d65d vfio/listener: Bypass readonly region for dirty tracking
When doing dirty tracking or calculating dirty tracking range, readonly
regions can be bypassed, because corresponding DMA mappings are readonly
and never become dirty.

This optimizes dirty tracking a bit for passthrough devices.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
0e3c1e2b2b vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking
Commit e46883204c ("vfio/migration: Block migration with vIOMMU")
introduces a migration blocker when vIOMMU is enabled, because we need
to calculate the IOVA ranges for device dirty tracking. But this is
unnecessary for iommu dirty tracking.

Limit the vfio_viommu_preset() check to those devices which use device
dirty tracking. This allows live migration with VFIO devices which use
iommu dirty tracking.

Suggested-by: Jason Zeng <jason.zeng@intel.com>
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-10-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
46c7633114 vfio/migration: Add migration blocker if VM memory is too large to cause unmap_bitmap failure
With the default config, the kernel VFIO IOMMU type1 driver limits the dirty
bitmap to 256MB for the unmap_bitmap ioctl, so a guest memory region can be no
larger than 8TB for the ioctl to succeed.

Be conservative and limit total guest memory to the maximum supported by the
unmap_bitmap ioctl, adding a migration blocker otherwise. The IOMMUFD backend
has no such limit, so it can be used when such a large VM needs to be
migrated.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-9-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
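
The 8TB figure above follows from simple arithmetic if one assumes a 4 KiB page size and one bitmap bit per page (the page size is an assumption here; the commit message does not state it). A small sketch with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Bytes of guest memory that a dirty bitmap can describe when each bit
 * tracks one page. Hypothetical helper for illustration only.
 */
static uint64_t max_tracked_bytes(uint64_t bitmap_bytes, uint64_t page_size)
{
    return bitmap_bytes * 8 * page_size;
}
```

With a 256 MiB bitmap this gives 2^28 bytes * 8 bits * 2^12 bytes per page = 2^43 bytes, which is 8 TiB.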
Zhenzhong Duan
6e360c0617 vfio/listener: Add missing dirty tracking in region_del
If a VFIO device in the guest switches from a passthrough (PT) domain to a
block domain, the whole memory address space is unmapped. Because a NULL iotlb
entry was passed to unmap_bitmap, the bitmap query did not happen and dirty
pages were lost.

Constructing an iotlb entry with iova = gpa for unmap_bitmap lets it set the
dirty bits correctly.

For an IOMMU address space, we still send a NULL iotlb because VFIO doesn't
know the actual mappings in the guest. It is the vIOMMU's responsibility to
send the actual unmapping notifications, e.g.,
vtd_address_space_unmap_in_dirty_tracking().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-8-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
e98a1c7049 vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last dirty
bitmap query right before unmap, so that no PTE flushes are needed. This
accelerates the query without issue because unmap will tear down the mapping
anyway.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-6-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Joao Martins
374e28d876 vfio: Add a backend_flag parameter to vfio_container_query_dirty_bitmap()
This new parameter will be used in the following patch; currently 0 is passed.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-5-zhenzhong.duan@intel.com
[ clg: Fixed subject typo ]
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
e79bc265ef vfio/container-legacy: rename vfio_dma_unmap_bitmap() to vfio_legacy_dma_unmap_get_dirty_bitmap()
This follows the naming style in container-legacy.c, where low-level functions
have a vfio_legacy_ prefix.

No functional changes.

Suggested-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
f051dbeb91 vfio/iommufd: Query dirty bitmap before DMA unmap
When an existing mapping is unmapped, there could already be dirty bits
which need to be recorded before the unmap.

If the dirty bitmap query fails, we still need to unmap, or else a stale
mapping remains, which is risky for the guest.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-3-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
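
The ordering described above (query first, then unmap even if the query failed) can be sketched as below. All names are hypothetical stand-ins rather than the QEMU functions, and the choice of which error to return is a decision made for this sketch, not something the commit message specifies.

```c
/* Hypothetical callbacks standing in for the backend operations. */
struct demo_ops {
    int (*query_dirty)(void *opaque);
    int (*unmap)(void *opaque);
};

/* Query the dirty bitmap, then unmap regardless of the query result. */
static int demo_unmap_with_dirty(const struct demo_ops *ops, void *opaque)
{
    int query_ret = ops->query_dirty(opaque);
    int unmap_ret = ops->unmap(opaque);   /* always attempted */

    /* Report an unmap error first; otherwise surface the query error. */
    return unmap_ret ? unmap_ret : query_ret;
}

/* Example callbacks: a failing query and an unmap that counts its calls. */
static int demo_query_fails(void *opaque)
{
    (void)opaque;
    return -1;
}

static int demo_unmap_counts(void *opaque)
{
    ++*(int *)opaque;
    return 0;
}
```

Even when `demo_query_fails` returns an error, `demo_unmap_counts` still runs once, so no stale mapping is left behind.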
Zhenzhong Duan
31ec4aadd0 vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
Currently we support device and iommu dirty tracking; device dirty tracking
is preferred.

Add the framework code in iommufd_cdev_unmap() to choose either device or
iommu dirty tracking, just like vfio_legacy_dma_unmap_one().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-2-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:59 +01:00
Zhenzhong Duan
c3459c6bfa vfio/iommufd: Force creating nesting parent HWPT
Call pci_device_get_viommu_flags() to check whether the vIOMMU supports
VIOMMU_FLAG_WANT_NESTING_PARENT.

If yes, create a nesting parent HWPT and add it to the container's hwpt_list,
letting this parent HWPT cover the entire second stage mappings (GPA=>HPA).

This allows a VFIO passthrough device to directly attach to this default HWPT
and then to use the system address space and its listener.

Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
this implementation.

It is safe to do so because a vIOMMU can fail the set_iommu_device() call
if something else related to the VFIO device or vIOMMU isn't compatible.

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-9-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:58 +01:00
Philippe Mathieu-Daudé
78e630fcc4 hw/vfio/migration: Check base architecture at runtime
Inline vfio_arch_wants_loading_config_after_iter() and
replace the compile time check of the TARGET_ARM definition
by a runtime call to target_base_arm().

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20251021161707.8324-1-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2026-01-13 08:29:58 +01:00
Markus Armbruster
b351b49275 error: Use error_setg_errno() to improve error messages
A few error messages show numeric errno codes.  Use error_setg_errno()
to show human-readable text instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20251121121438.1249498-13-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
[Trivial fixup to riscv_kvm_cpu_finalize_features()]
2026-01-08 07:49:23 +01:00
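
The difference between a numeric errno code and human-readable text can be shown with plain libc. The helper below is a hypothetical stand-in and not QEMU's Error API; it only mimics the general idea of appending the strerror() text to a message.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Format "<msg>: <errno text>" into buf; a stand-in for illustration. */
static void demo_format_errno(char *buf, size_t len, int err, const char *msg)
{
    snprintf(buf, len, "%s: %s", msg, strerror(err));
}
```

Called with `ENOENT` and the message "open failed", this produces the message followed by the platform's description of the error instead of a bare number.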
Paolo Bonzini
7f548b8f23 include: reorganize memory API headers
Move RAMBlock functions out of ram_addr.h and cpu-common.h;
move memory API headers out of include/exec and into include/system.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:09 +01:00
Paolo Bonzini
048a23851c include: move hw/hw.h to hw/core/, rename
Call it include/hw/core/hw-error.h since that is the only
thing it contains.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:09 +01:00
Paolo Bonzini
e1e9a72500 include: move hw/qdev-properties-system.h to hw/core/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:08 +01:00
Paolo Bonzini
78d45220b4 include: move hw/qdev-properties.h to hw/core/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:07 +01:00
Paolo Bonzini
d1000ecae2 include: move hw/qdev-core.h to hw/core/, rename
Call it hw/core/qdev.h to avoid the duplication in the name.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:07 +01:00
Paolo Bonzini
1942b61b74 include: move hw/boards.h to hw/core/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-12-27 10:11:06 +01:00
Yanghang Liu
5f9ac96373 Fix the typo of vfio-pci device's enable-migration option
Signed-off-by: Yanghang Liu <yanghliu@redhat.com>
Reported-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2025-11-21 15:53:06 +03:00
Peter Maydell
b1f4f4695c vfio: Clean up includes
This commit was created with scripts/clean-includes:
 ./scripts/clean-includes --git vfio hw/vfio hw/vfio-user

All .c should include qemu/osdep.h first.  The script performs three
related cleanups:

* Ensure .c files include qemu/osdep.h first.
* Including it in a .h is redundant, since the .c  already includes
  it.  Drop such inclusions.
* Likewise, including headers qemu/osdep.h includes is redundant.
  Drop these, too.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20251104160943.751997-9-peter.maydell@linaro.org
2025-11-14 13:18:04 +00:00
Markus Armbruster
3ca0a0ab05 migration: Use bitset of MigMode instead of variable arguments
migrate_add_blocker_modes() and migration_add_notifier_modes() use
variable arguments for a set of migration modes.  The variable
arguments get collected into a bitset for processing.  Take a bitset
argument instead; it's simpler.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251027064503.1074255-3-armbru@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
2025-11-03 16:04:10 -05:00
John Levon
ecbe424a63 vfio: only check region info cache for initial regions
It is semantically valid for a VFIO device to increase the number of
regions after initialization. In this case, we'd attempt to check for
cached region info past the size of the ->reginfo array. Check for the
region index and skip the cache in these cases.

This also works around some VGPU use cases which appear to be a bug,
where VFIO_DEVICE_QUERY_GFX_PLANE returns a region index beyond the
reported ->num_regions.

Fixes: 95cdb024 ("vfio: add region info cache")
Signed-off-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://lore.kernel.org/qemu-devel/20251014151227.2298892-3-john.levon@nutanix.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
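
The bounds check described above can be sketched with a simplified, hypothetical device structure. The field name num_initial_regions comes from this commit log; everything else here is illustrative and is not the QEMU implementation.

```c
#include <stddef.h>

/* Hypothetical, simplified device: the cache has num_initial_regions slots. */
struct demo_device {
    unsigned int num_initial_regions;
    const char **reginfo;   /* cached info per initial region, may hold NULL */
};

/* Return the cached entry, or NULL when the index is outside the cache. */
static const char *demo_cached_region(const struct demo_device *dev,
                                      unsigned int index)
{
    if (index >= dev->num_initial_regions) {
        return NULL;    /* region appeared after init: skip the cache */
    }
    return dev->reginfo[index];
}
```

A caller that gets NULL back would then query the region info directly instead of reading past the end of the array.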
John Levon
aaca725884 vfio: rename field to "num_initial_regions"
We set VFIODevice::num_regions at initialization time, and do not
otherwise refresh it. As it is valid in theory for a VFIO device to
later increase the number of supported regions, rename the field to
"num_initial_regions" to better reflect its semantics.

Signed-off-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://lore.kernel.org/qemu-devel/20251014151227.2298892-2-john.levon@nutanix.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
271fec6f18 vfio/listener: Add an assertion for unmap_all
Currently the maximum iommu address space is 64 bits wide, so when a maximal
iommu memory section is deleted, it covers the range [0, 2^64). Add an
assertion for that.

Suggested-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20251009040134.334251-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
b30823e561 vfio/iommufd: Support unmap all in one ioctl()
The IOMMUFD kernel uAPI supports unmapping the whole address space in one call
with [iova, size] set to [0, UINT64_MAX]; this simplifies iommufd_cdev_unmap()
a bit. See iommufd_ioas_unmap() in the kernel for details.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20251009040134.334251-3-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
962bcf0911 vfio/container: Support unmap all in one ioctl()
VFIO type1 kernel uAPI supports unmapping whole address space in one call
since commit c19650995374 ("vfio/type1: implement unmap all"). Use the
unmap_all variant whenever it's supported in kernel.

Opportunistically pass VFIOLegacyContainer pointer in low level function
vfio_legacy_dma_unmap_one().

Co-developed-by: John Levon <levon@movementarian.org>
Signed-off-by: John Levon <levon@movementarian.org>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20251009040134.334251-2-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
8bf49fff0d vfio/iommufd: Restore vbasedev's reference to hwpt after CPR transfer
After CPR transfer, if there is more than one VFIO device, a device is
not added to hwpt->device_list and its reference to hwpt isn't restored
on the destination. We still need to call iommufd_cdev_attach_container() to
restore it after a matching container is found, or else a SIGSEGV is triggered.

Fixes: 4296ee0745 ("vfio/iommufd: reconstruct device")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20250928085432.40107-5-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
d59db04aed vfio/iommufd: Set cpr.ioas_id on source side for CPR transfer
On the source side, if more than one VFIO device attaches to the same
container, only the first device sets cpr.ioas_id; the others are
bypassed. We should set it for each device, or else only the first
device works.

Fixes: 4296ee0745 ("vfio/iommufd: reconstruct device")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20250928085432.40107-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
9423094896 vfio/cpr-legacy: drop an erroneous assert
vfio_legacy_cpr_dma_map() is used not only in post_load on the destination
but also in the error recovery path on the source side. Asserting that it
runs on the destination is wrong.

Fixes: 7e9f214113 ("vfio/container: restore DMA vaddr")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20250928085432.40107-3-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Zhenzhong Duan
5a78db7f80 vfio/container: Remap only populated parts in a section
If there are multiple containers and unmap-all fails for some of them, we
need to remap vaddr for the other containers for which unmap-all succeeded.
When ram discard is enabled, we should only remap populated parts in a
section instead of the whole section.

Fixes: eba1f657cb ("vfio/container: recover from unmap-all-vaddr failure")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250928085432.40107-2-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-22 08:12:52 +02:00
Philippe Mathieu-Daudé
4db362f68c system/physmem: Extract API out of 'system/ram_addr.h' header
Very few files use the Physical Memory API. Declare its
methods in their own header: "system/physmem.h".

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-19-philmd@linaro.org>
2025-10-07 05:03:56 +02:00
Philippe Mathieu-Daudé
aa60bdb700 system/physmem: Drop 'cpu_' prefix in Physical Memory API
The functions related to the Physical Memory API declared
in "system/ram_addr.h" do not operate on vCPU. Remove the
'cpu_' prefix.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-18-philmd@linaro.org>
2025-10-07 05:03:56 +02:00
Philippe Mathieu-Daudé
97480ca692 hw: Remove unnecessary 'system/ram_addr.h' header
None of these files requires a definition exposed by "system/ram_addr.h";
remove its inclusion.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251001175448.18933-7-philmd@linaro.org>
2025-10-07 05:03:56 +02:00
Philippe Mathieu-Daudé
edd1f91d38 hw/vfio/listener: Include missing 'exec/target_page.h' header
The "exec/target_page.h" header is indirectly pulled from
"system/ram_addr.h". Include it explicitly, in order to
avoid unrelated issues when refactoring "system/ram_addr.h":

  hw/vfio/listener.c: In function ‘vfio_ram_discard_register_listener’:
  hw/vfio/listener.c:258:28: error: implicit declaration of function ‘qemu_target_page_size’; did you mean ‘qemu_ram_pagesize’?
    258 |     int target_page_size = qemu_target_page_size();
        |                            ^~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-5-philmd@linaro.org>
2025-10-07 05:03:56 +02:00
Richard Henderson
bd6aa0d1e5 Merge tag 'staging-pull-request' of https://gitlab.com/peterx/qemu into staging
Migration/Memory Pull for 10.2

- PeterX's fix on tls warning for preempt channel when migration completes
- Arun's series to enhance error reporting for vTPM and migration framework
- PeterX's patch to cleanup multifd send TLS BYE messages
- Juraj's fix on postcopy start state transition when switchover failed
- Yanfei's fix to migrate APIC before VFIO-PCI to avoid irq fallbacks
- Dan's cleanup to simplify error reporting in qemu_fill_buffer()
- PeterM's fix on address space leak when cpu hot plug / unplug
- Steve's cpr-exec wholeset

# -----BEGIN PGP SIGNATURE-----
#
# iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCaN/uIhIccGV0ZXJ4QHJl
# ZGhhdC5jb20ACgkQO1/MzfOr1wZ+mAEA1l2RS9sZS1W3vXQMCNb+Nu8Uo2p+e5Qj
# Uu6J0WVV+XsBANtzGZk2UM/frqlABywW3/ozJ4qBvIPKo758Mr6/lqUH
# =asUv
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 03 Oct 2025 08:39:14 AM PDT
# gpg:                using EDDSA key B9184DC20CC457DACF7DD1A93B5FCCCDF3ABD706
# gpg:                issuer "peterx@redhat.com"
# gpg: Good signature from "Peter Xu <xzpeter@gmail.com>" [unknown]
# gpg:                 aka "Peter Xu <peterx@redhat.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B918 4DC2 0CC4 57DA CF7D  D1A9 3B5F CCCD F3AB D706

* tag 'staging-pull-request' of https://gitlab.com/peterx/qemu: (45 commits)
  migration-test: test cpr-exec
  vfio: cpr-exec mode
  migration: cpr-exec docs
  migration: cpr-exec mode
  migration: cpr-exec save and load
  migration: cpr-exec-command parameter
  oslib: qemu_clear_cloexec
  migration: add cpr_walk_fd
  migration: multi-mode notifier
  migration: simplify error reporting after channel read
  physmem: Destroy all CPU AddressSpaces on unrealize
  memory: New AS helper to serialize destroy+free
  include/system/memory.h: Clarify address_space_destroy() behaviour
  migration: ensure APIC is loaded prior to VFIO PCI devices
  migration: Fix state transition in postcopy_start() error handling
  migration/multifd/tls: Cleanup BYE message processing on sender side
  migration: HMP: Adjust the order of output fields
  migration: Make migration_has_failed() work even for CANCELLING
  io/crypto: Move tls premature termination handling into QIO layer
  backends/tpm: Propagate vTPM error on migration failure
  ...

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2025-10-04 09:10:58 -07:00
Steve Sistare
ee1ca09fc1 vfio: cpr-exec mode
All blockers and notifiers for cpr-transfer mode also apply to cpr-exec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/30750362-d4a1-4392-8dd6-016624d01be1@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
2025-10-03 09:48:02 -04:00
Arun Menon
6f9fc6f501 migration: Remove error variant of vmstate_save_state() function
This commit removes the redundant vmstate_save_state_with_err()
function.

Previously, commit 969298f9d7 introduced vmstate_save_state_with_err()
to handle error propagation, while vmstate_save_state() existed for
non-error scenarios.
This is because there were code paths where vmstate_save_state_v()
(called internally by vmstate_save_state) did not explicitly set
errors on failure.

This change unifies error handling by
 - updating vmstate_save_state() to accept an Error **errp argument.
 - vmstate_save_state_v() ensures errors are set directly within the errp
   object, eliminating the need for two separate functions.

All calls to vmstate_save_state_with_err() are replaced with
vmstate_save_state(). This simplifies the API and improves code
maintainability.

vmstate_save_state() that only calls vmstate_save_state_v(),
by inference, also has errors set in errp in case of failure.
The errors are reported using error_report_err().
If we want the function to exit on error, then &error_fatal is
passed.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-24-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
2025-10-03 09:48:02 -04:00
Arun Menon
c632ffbd74 migration: push Error **errp into vmstate_load_state()
This is an incremental step in converting vmstate loading
code to report errors via Error objects instead of printing
them directly to the console/monitor.
vmstate_load_state() must now report an error in errp
in case of failure.

The errors are temporarily reported using error_report_err().
This is removed in subsequent patches in this series, once
the error can actually be propagated to the calling function
through errp. If the function should exit on error instead,
&error_fatal is passed.

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-2-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
2025-10-03 09:48:01 -04:00
Philippe Mathieu-Daudé
f0b52aa08a hw/vfio: Use uint64_t for IOVA mapping size in vfio_container_dma_*map
The 'ram_addr_t' type is described as:

  a QEMU internal address space that maps guest RAM physical
  addresses into an intermediate address space that can map
  to host virtual address spaces.

This does not represent an IOVA mapping size well. Simply use
the uint64_t type.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-5-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-02 10:41:23 +02:00
Philippe Mathieu-Daudé
0ca70d3bf7 hw/vfio: Avoid ram_addr_t in vfio_container_query_dirty_bitmap()
The 'ram_addr_t' type is described as:

  a QEMU internal address space that maps guest RAM physical
  addresses into an intermediate address space that can map
  to host virtual address spaces.

vfio_container_query_dirty_bitmap() doesn't expect such a QEMU
intermediate address, but a guest physical address. Use the
appropriate 'hwaddr' type and rename the argument to
@translated_addr for clarity.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-4-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-02 10:41:23 +02:00
Philippe Mathieu-Daudé
5764a71527 hw/vfio: Reorder vfio_container_query_dirty_bitmap() trace format
Update the trace-events comments after the changes from
commit dcce51b193 ("hw/vfio/container-base.c: rename file
to container.c") and commit a3bcae62b6 ("hw/vfio/container.c:
rename file to container-legacy.c").

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-3-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-02 10:41:23 +02:00
Cédric Le Goater
1d9a832b58 vfio: Remove workaround for kernel DMA unmap overflow bug
A kernel bug was introduced in Linux v4.15 via commit 71a7d3d78e3c
("vfio/type1: Check for address space wrap-around on unmap"), which
added a test for address space wrap-around in the vfio DMA unmap path.
Unfortunately, due to an integer overflow, the kernel would
incorrectly detect an unmap of the last page in the 64-bit address
space as a wrap-around, causing the unmap to fail with -EINVAL.

A QEMU workaround was introduced in commit 567d7d3e6b ("vfio/common:
Work around kernel overflow bug in DMA unmap") to retry the unmap,
excluding the final page of the range.

The kernel bug was then fixed in Linux v5.0 via commit 58fec830fc19
("vfio/type1: Fix dma_unmap wrap-around check"). Since the oldest
supported LTS kernel is now v5.4, kernels affected by this bug are
considered deprecated, and the workaround is no longer necessary.

This change reverts 567d7d3e6b, removing the workaround.
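
The overflow described above can be illustrated with a small sketch.
This is not the kernel's code: wraps_buggy() and wraps_fixed() are
hypothetical names, and both checks are simplified assumptions about
the general shape of an end-exclusive versus end-inclusive comparison.
The sketch shows how a range ending exactly at the top of the 64-bit
address space is misread as a wrap-around when the exclusive end is
computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Overflow-prone check (illustrative): iova + size is the exclusive
 * end of the range. When the range ends exactly at 2^64 the sum wraps
 * to 0, so a valid unmap of the last page looks like a wrap-around.
 */
static bool wraps_buggy(uint64_t iova, uint64_t size)
{
    return iova + size < iova;
}

/*
 * Check against the last byte of the range (inclusive end) instead.
 * The last valid byte is at most UINT64_MAX, so a range that ends at
 * the top of the address space no longer triggers a false positive.
 */
static bool wraps_fixed(uint64_t iova, uint64_t size)
{
    assert(size > 0);           /* an empty range has no last byte */
    return iova + size - 1 < iova;
}
```

With iova = 0xfffffffffffff000 and size = 0x1000 (the last 4 KiB page),
the first check reports a wrap-around while the second does not. Both
still flag a range that genuinely wraps, such as the same iova with
size = 0x2000.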

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20250926085423.375547-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-10-02 10:41:23 +02:00
Mark Cave-Ayland
5bdf0db823 vfio/pci.c: rename vfio_pci_nohotplug_dev_info to vfio_pci_nohotplug_info
This changes the prefix to match the name of the QOM type.

Signed-off-by: Mark Cave-Ayland <mark.caveayland@nutanix.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250925113159.1760317-23-mark.caveayland@nutanix.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
2025-09-25 17:55:20 +02:00