A use-after-free bug was reported when booting a Linux kernel during the
pci setup phase. It's quite hard to reproduce (needs smp, and favored by
having several pci devices with BAR and specific Linux config, which
is Debian default one in this case).
After investigation (see the associated bug ticket), it appears that,
under specific conditions, we might access a cached AddressSpaceDispatch
that was reclaimed by RCU thread meanwhile.
In the Linux boot scenario, during the pci phase, memory region are
destroyed/recreated, resulting in exposition of the bug.
The core of the issue is that we cache the dispatch associated to
current cpu in cpu->cpu_ases[asidx].memory_dispatch. It is updated with
tcg_commit, which runs asynchronously on a given cpu.
At some point, we leave the rcu critial section, and the RCU thread
starts reclaiming it, but tcg_commit is not yet invoked, resulting in
the use-after-free.
It's not the first problem around this area, and commit 0d58c66068 [1]
("softmmu: Use async_run_on_cpu in tcg_commit") already tried to
address it. It did a good job, but it seems that we found a specific
situation where it's not enough.
This patch takes a simple approach: remove the cached value creating the
issue, and make sure we always get the current mapping for address
space, using address_space_to_dispatch(cpu->cpu_ases[asidx].as).
It's equivalent to qatomic_rcu_read(&as->current_map)->dispatch;
This is not really costly, we just need two dereferences,
including one atomic (rcu) read, which is negligible considering we are
already on mmu slow path anyway.
Note that tcg_commit is still needed, as it's taking care of flushing
TLB, removing previously mapped entries.
Another solution would be to cache directly values under the dispatch
(dispatch themselves are not ref counted), keep an active reference on
associated memory section, and release it when appropriate (tricky).
Given the time already spent debugging this area now and previously, I
strongly prefer eliminating the root of the issue, instead of adding
more complexity for a hypothetical performance gain. RCU is precisely
used to ensure good performance when reading data, so caching is not as
beneficial as it might seem IMHO.
[1] 0d58c66068
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3040
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Message-ID: <20250724161142.2803091-1-pierrick.bouvier@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
UI-related for 10.1
- [PATCH v3 0/2] ui/vnc: Do not copy z_stream
- [PATCH v6 0/7] ui/spice: Enable gl=on option for non-local or remote clients
- [PATCH v6 0/1] Allow injection of virtio-gpu EDID name
- [PATCH 0/2] ui/gtk: Add keep-aspect-ratio and scale option
# -----BEGIN PGP SIGNATURE-----
#
# iQJQBAABCgA6FiEEh6m9kz+HxgbSdvYt2ujhCXWWnOUFAmh19eYcHG1hcmNhbmRy
# ZS5sdXJlYXVAcmVkaGF0LmNvbQAKCRDa6OEJdZac5cLsEAC1NV4DFQmb0TjuK/Bb
# 81dDED9DGHsYybVy5x3xSqVkJtAoHTC4FmCm8x9T8wwg+utDvCGFfRM1GeMFR/yI
# IzM+2xs9PcG/+7j/HhVLWr9QhoWV/yoKHcjJScfkTrTtZxAQRA3suUdQT1RjvwUY
# NEuKaOx42dEpV7E+OHp8172eG8CWBzFMjH+cx2b6yKoxF1kVsB7kgVb+kCMYBEQi
# 1YHf34G+HGTev+IzzpxnO+P7p2lJ1ud93kCp1Yz8ua5zOUEPiaHkbClFj4M9mdsn
# xvaxby+zJqe33rh8pVr3qD/4R2j35OW7F5uiAQ8C96KF5Eviia8Cno1s4QInpcw/
# sqtorkaP+OLO6sCnvBQqo99iMH2KloCV7b5sUzfxlUkS+3txD1AKRbodz+vhBqMN
# dbESdd1veUFEvi00DGbxfJbbkzVIhxAwad8CNnSjCdsvJdfYLA7TuSEuBtf1lQPF
# lqpVZFB6C3LQMbmTwT9YrOzMtMXQcT+GFpJLOBk0Cxv4rCSil+TeDpEUNXHurYjI
# qWZT+vyGDqyhoZHyQMPsBwAywKgtMC3IwnkKgJdTHroJ57Am86BvZqELRzh8Tffl
# nkdu1uHdNQXT/u8ybU3mStaQ7xMJALL4tlMuIZ5TIkvMeQm4CiViGb/i5LSn/GMk
# lx2JmBwXXf/imsXeBUfxktJFrw==
# =QQ/7
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 15 Jul 2025 02:32:06 EDT
# gpg: using RSA key 87A9BD933F87C606D276F62DDAE8E10975969CE5
# gpg: issuer "marcandre.lureau@redhat.com"
# gpg: Good signature from "Marc-André Lureau <marcandre.lureau@redhat.com>" [full]
# gpg: aka "Marc-André Lureau <marcandre.lureau@gmail.com>" [full]
# Primary key fingerprint: 87A9 BD93 3F87 C606 D276 F62D DAE8 E109 7596 9CE5
* tag 'ui-pull-request' of https://gitlab.com/marcandre.lureau/qemu:
tpm: "qemu -tpmdev help" should return success
ui/gtk: Add scale option
ui/gtk: Add keep-aspect-ratio option
hw/display: Allow injection of virtio-gpu EDID name
ui/spice: Blit the scanout texture if its memory layout is not linear
ui/spice: Create a new texture with linear layout when gl=on is specified
ui/console-gl: Add a helper to create a texture with linear memory layout
ui/spice: Add an option to submit gl_draw requests at fixed rate
ui/spice: Add an option for users to provide a preferred video codec
ui/spice: Enable gl=on option for non-local or remote clients
ui/egl-helpers: Error check the fds in egl_dmabuf_export_texture()
ui/vnc: Introduce the VncWorker type
ui/vnc: Do not copy z_stream
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Generally APIs to the rest of QEMU should be documented in the headers.
Comments on individual functions or internal details are fine to live
in the C files. Make qemu_add_vm_change_state_handler_prio[_full]()
docstrings consistent by moving them from source to header.
Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250715171920.89670-1-philmd@linaro.org>
Unfortunately "system/accel-ops.h" handlers are not only
system-specific. For example, the cpu_reset_hold() hook
is part of the vCPU creation, after it is realized.
Mechanical rename to drop 'system' using:
$ sed -i -e s_system/accel-ops.h_accel/accel-cpu-ops.h_g \
$(git grep -l system/accel-ops.h)
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250703173248.44995-38-philmd@linaro.org>
This can be useful for devices that might take too long to shut down
gracefully, but may have a way to shutdown quickly otherwise if needed
or explicitly requested by a force shutdown.
For now we only consider SIGTERM or the QMP quit() command a force
shutdown, since those bypass the guest entirely and are equivalent to
pulling the power plug.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Message-Id: <20250609212547.2859224-2-d-tatianin@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Coverity reported a unnecessary NULL check:
qemu/system/qdev-monitor.c: 720 in qdev_device_add_from_qdict()
683 /* create device */
684 dev = qdev_new(driver);
...
719 err_del_dev:
>>> CID 1590192: Null pointer dereferences (REVERSE_INULL)
>>> Null-checking "dev" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
720 if (dev) {
721 object_unparent(OBJECT(dev));
722 object_unref(OBJECT(dev));
723 }
724 return NULL;
725 }
Indeed, unlike qdev_try_new() which can return NULL,
qdev_new() always returns a heap pointer (or aborts).
Remove the unnecessary assignment and check.
Fixes: f3a8505656 ("qdev/qbus: add hidden device support")
Resolves: Coverity CID 1590192 (Null pointer dereferences)
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
A new field, attributes, was introduced in RAMBlock to link to a
RamBlockAttributes object, which centralizes all guest_memfd related
information (such as fd and status bitmap) within a RAMBlock.
Create and initialize the RamBlockAttributes object upon ram_block_add().
Meanwhile, register the object in the target RAMBlock's MemoryRegion.
After that, guest_memfd-backed RAMBlock is associated with the
RamDiscardManager interface, and the users can execute RamDiscardManager
specific handling. For example, VFIO will register the
RamDiscardListener and get notifications when the state_change() helper
invokes.
As coordinate discarding of RAM with guest_memfd is now supported, only
block uncoordinated discard.
Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20250612082747.51539-6-chenyi.qiang@intel.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted that subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on discard operations for page
conversion between private and shared memory, potentially leading to
the stale IOMMU mapping issue when assigning hardware devices to
confidential VMs via shared memory. To address this and allow shared
device assignement, it is crucial to ensure the VFIO system refreshes
its IOMMU mappings.
RamDiscardManager is an existing interface (used by virtio-mem) to
adjust VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and adding it
back in the other. Therefore, similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.
Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. Implementing it in HostMemoryBackend is
not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
have a memory backend while others do not. Notably, virtual BIOS
RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
backend.
To manage RAMBlocks with guest_memfd, define a new object named
RamBlockAttributes to implement the RamDiscardManager interface. This
object can store the guest_memfd information such as the bitmap for
shared memory and the registered listeners for event notifications. A
new state_change() helper function is provided to notify listeners, such
as VFIO, allowing VFIO to do dynamically DMA map and unmap for the shared
memory according to conversion events. Note that in the current context
of RamDiscardManager for guest_memfd, the shared state is analogous to
being populated, while the private state can be considered discarded for
simplicity. In the future, it would be more complicated if considering
more states like private/shared/discarded at the same time.
In current implementation, memory state tracking is performed at the
host page size granularity, as the minimum conversion size can be one
page per request. Additionally, VFIO expected the DMA mapping for a
specific IOVA to be mapped and unmapped with the same granularity.
Confidential VMs may perform partial conversions, such as conversions on
small regions within a larger one. To prevent such invalid cases and
until support for DMA mapping cut operations is available, all
operations are performed with 4K granularity.
In addition, memory conversion failures cause QEMU to quit rather than
resuming the guest or retrying the operation at present. It would be
future work to add more error handling or rollback mechanisms once
conversion failures are allowed. For example, in-place conversion of
guest_memfd could retry the unmap operation during the conversion from
shared to private. For now, keep the complex error handling out of the
picture as it is not required.
Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20250612082747.51539-5-chenyi.qiang@intel.com
[peterx: squash fixup from Chenyi to fix builds]
Signed-off-by: Peter Xu <peterx@redhat.com>
Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.
This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set for
one MemoryRegion at present.
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Tested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20250612082747.51539-3-chenyi.qiang@intel.com
Signed-off-by: Peter Xu <peterx@redhat.com>
* target/i386/kvm: Intel TDX support
* target/i386/emulate: more lflags cleanups
* meson: remove need for explicit listing of dependencies in hw_common_arch and
target_common_arch
* rust: small fixes
* hpet: Reorganize register decoding to be more similar to Rust code
* target/i386: fixes for AMD models
* target/i386: new EPYC-Turin CPU model
# -----BEGIN PGP SIGNATURE-----
#
# iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg4BxwUHHBib256aW5p
# QHJlZGhhdC5jb20ACgkQv/vSX3jHroP67gf+PEP4EDQP0AJUfxXYVsczGf5snGjz
# ro8jYmKG+huBZcrS6uPK5zHYxtOI9bHr4ipTHJyHd61lyzN6Ys9amPbs/CRE2Q4x
# Ky4AojPhCuaL2wHcYNcu41L+hweVQ3myj97vP3hWvkatulXYeMqW3/4JZgr4WZ69
# A9LGLtLabobTz5yLc8x6oHLn/BZ2y7gjd2LzTz8bqxx7C/kamjoDrF2ZHbX9DLQW
# BKWQ3edSO6rorSNHWGZsy9BE20AEkW2LgJdlV9eXglFEuEs6cdPKwGEZepade4bQ
# Rdt2gHTlQdUDTFmAbz8pttPxFGMC9Zpmb3nnicKJpKQAmkT/x4k9ncjyAQ==
# =XmkU
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 29 May 2025 03:05:00 EDT
# gpg: using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg: issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* tag 'for-upstream' of https://gitlab.com/bonzini/qemu: (77 commits)
target/i386/tcg/helper-tcg: fix file references in comments
target/i386: Add support for EPYC-Turin model
target/i386: Update EPYC-Genoa for Cache property, perfmon-v2, RAS and SVM feature bits
target/i386: Add couple of feature bits in CPUID_Fn80000021_EAX
target/i386: Update EPYC-Milan CPU model for Cache property, RAS, SVM feature bits
target/i386: Update EPYC-Rome CPU model for Cache property, RAS, SVM feature bits
target/i386: Update EPYC CPU model for Cache property, RAS, SVM feature bits
rust: make declaration of dependent crates more consistent
docs: Add TDX documentation
i386/tdx: Validate phys_bits against host value
i386/tdx: Make invtsc default on
i386/tdx: Don't treat SYSCALL as unavailable
i386/tdx: Fetch and validate CPUID of TD guest
target/i386: Print CPUID subleaf info for unsupported feature
i386: Remove unused parameter "uint32_t bit" in feature_word_description()
i386/cgs: Introduce x86_confidential_guest_check_features()
i386/tdx: Define supported KVM features for TDX
i386/tdx: Add XFD to supported bit of TDX
i386/tdx: Add supported CPUID bits relates to XFAM
i386/tdx: Add supported CPUID bits related to TD Attributes
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The MachineClass::legacy_fw_cfg_order boolean was only used
by the pc-q35-2.5 and pc-i440fx-2.5 machines, which got
removed. Remove it along with:
- FW_CFG_ORDER_OVERRIDE_* definitions
- fw_cfg_set_order_override()
- fw_cfg_reset_order_override()
- fw_cfg_order[]
- rom_set_order_override()
- rom_reset_order_override()
Simplify CLI and pc_vga_init() / pc_nic_init().
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250512083948.39294-12-philmd@linaro.org>
[thuth: Fix error from check_patch.pl wrt to an empty "for" loop]
Signed-off-by: Thomas Huth <thuth@redhat.com>
This patch adds the new VM state change cb type `VMChangeStateHandlerWithRet`,
which has return value for `VMChangeStateEntry`.
Thus, we can register a new VM state change cb with return value for device.
Note that `VMChangeStateHandler` and `VMChangeStateHandlerWithRet` are mutually
exclusive and cannot be provided at the same time.
This patch is the pre patch for 'vhost-user: return failure if backend crashes
when live migration', which makes the live migration aware of the loss of
connection with the vhost-user backend and aborts the live migration.
Virtio device will use VMChangeStateHandlerWithRet.
Signed-off-by: Haoqian He <haoqian.he@smartx.com>
Message-Id: <20250416024729.3289157-2-haoqian.he@smartx.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Binaries can register a QOM type to filter their machines
by filling their TargetInfo::machine_typename field.
This can be used by example by main() -> machine_help_func()
to filter the machines list.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Restrict iotlb_to_section(), address_space_translate_for_iotlb()
and memory_region_section_get_iotlb() to TCG. Declare them in
the new "accel/tcg/iommu.h" header. Declare iotlb_to_section()
using the MemoryRegionSection typedef.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250424202412.91612-12-philmd@linaro.org>
Since the previous commit ("exec/memory.h: make devend_memop
"target defines" agnostic") there is a single use of the
DEVICE_HOST_ENDIAN definition in ram_device_mem_ops: inline
it and remove its definition altogether.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250423111625.10424-1-philmd@linaro.org>
On Emscripten, function pointer casts can result in runtime failures due to
strict function signature checks. This affects the use of g_list_sort and
g_slist_sort, which internally perform function pointer casts that are not
supported by Emscripten. To avoid these issues, g_list_sort_with_data and
g_slist_sort_with_data should be used instead, as they do not rely on
function pointer casting.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <e9a50b76c54cc64fc9985186f0aef3fcc2024da6.1745295397.git.ktokunaga.mail@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
In commit 98ed8ecfc9 ("exec: introduce target_words_bigendian()
helper") target_words_bigendian() was matching the definition it
was depending on (TARGET_WORDS_BIGENDIAN). Later in commit
ee3eb3a7ce ("Replace TARGET_WORDS_BIGENDIAN") the definition was
renamed as TARGET_BIG_ENDIAN but we didn't update the helper.
Do it now mechanically using:
$ sed -i -e s/target_words_bigendian/target_big_endian/g \
$(git grep -wl target_words_bigendian)
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250417210025.68322-1-philmd@linaro.org>
Split icount stuff from system/cpu-timers.h.
There are 17 files which only require icount.h, 7 that only
require cpu-timers.h, and 7 that require both.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>