Age | Commit message | Author | Files | Lines
2023-04-06 | arm: Do not add padding alignment for hugetlbfs backed memory (HEAD, master) | Suzuki K Poulose | 1 | -1/+3
The arm code tries to align the memory allocation size to 2M to potentially make use of transparent hugepages. But this is problematic if we try to allocate from hugetlbfs, where the allocation size could be more than 2M. Given we support up to 1G, let us leave it to the user to align the requested memory when hugetlbfs is used.

Without the patch:

  $ echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  $ mount -t hugetlbfs -o pagesize=1G none /root/hugemem/
  $ lkvm run -m 1024 --hugetlbfs /root/hugemem/ ...
  # lkvm run -k ... -m 1024 -c 6
  Fatal: Can't ftruncate for mem mapping size 1075838976

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230405110905.669217-1-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
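The size in the error message is consistent with 1024 MB plus a 2 MB pad (1073741824 + 2097152 = 1075838976), which is not a multiple of the 1 GB hugepage size. A minimal sketch of the sizing rule follows; the helper names are hypothetical and this is not the actual kvmtool code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)

/*
 * Hypothetical helper: size of the host mapping that backs guest RAM.
 * The 2M pad leaves room to align an anonymous mapping for transparent
 * hugepages; it is skipped when the memory comes from hugetlbfs.
 */
static uint64_t ram_alloc_size(uint64_t ram_size, bool use_hugetlbfs)
{
	return use_hugetlbfs ? ram_size : ram_size + SZ_2M;
}

/* True if size is a whole number of hugepages of hugepage_size bytes. */
static bool fits_hugepages(uint64_t size, uint64_t hugepage_size)
{
	assert(hugepage_size != 0);
	return size % hugepage_size == 0;
}
```

With the pad, a 1024 MB request no longer fits in whole 1 GB hugepages, which matches the ftruncate failure shown above.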
2023-03-24 | Add virtio-transport option and deprecate force-pci and virtio-legacy. | Rajnesh Kanwal | 16 | -30/+61
This is a follow-up patch for [0], which proposed the --force-pci option for riscv. As per the discussion, it was concluded to add a virtio-transport option taking one of four values (pci, pci-legacy, mmio, mmio-legacy). With this change force-pci and virtio-legacy are both deprecated and arm's default transport changes from MMIO to PCI as agreed in [0]. This is also true for riscv. Nothing changes for other architectures. [0]: https://lore.kernel.org/all/20230118172007.408667-1-rkanwal@rivosinc.com/ Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com> Link: https://lore.kernel.org/r/20230320143344.404307-1-rkanwal@rivosinc.com Signed-off-by: Will Deacon <will@kernel.org>
2023-03-24 | riscv: Move serial and rtc from IO port space to MMIO area. | Rajnesh Kanwal | 3 | -1/+14
The default serial and rtc IO region overlaps with the PCI IO BAR region, causing BAR 0 activation to fail. Move these devices to the MMIO region, similar to ARM. Given that serial has moved from 0x3f8 to 0x10000000, we now need to pass earlycon=uart8250,mmio,0x10000000 on the cmdline rather than earlycon=uart8250,mmio,0x3f8. To avoid having to change the address every time the tool is updated, we can also just pass "earlycon" on the cmdline; the guest then finds the type and base address by following the Device Tree's stdout-path property. Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20230203122934.18714-1-rkanwal@rivosinc.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | riscv: Add --disable-<xyz> options to allow user disable extensions | Anup Patel | 2 | -1/+25
By default, KVM RISC-V keeps all extensions available to the VCPU enabled and KVMTOOL does not disable any extension. We add --disable-<xyz> command-line options in KVMTOOL RISC-V to allow users to explicitly disable certain extensions if they don't want them. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-7-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | riscv: Add Zicbom extension support | Andrew Jones | 1 | -0/+11
When the Zicbom extension is available expose it to the guest. Also provide the guest the size of the cache block through DT. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-6-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | riscv: Move reg encoding helpers to kvm-cpu-arch.h | Andrew Jones | 3 | -18/+19
We'll need one of these helpers in the next patch in another file. Let's proactively move them all now, since others may some day also be useful. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-5-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | riscv: Add zihintpause extension support | Mayuresh Chitale | 1 | -0/+1
The zihintpause extension allows software to use the PAUSE instruction to reduce energy consumption while executing spin-wait code sequences. Add the zihintpause extension to the device tree if it is supported by the host. Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-4-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | riscv: Add Svinval extension support | Anup Patel | 1 | -0/+1
The Svinval extension allows the guest OS to perform range based TLB maintenance efficiently. Add the Svinval extension to the device tree if it is supported by the host. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-3-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | Update UAPI headers based on Linux-6.1-rc1 | Anup Patel | 6 | -14/+46
We update all UAPI headers based on Linux-6.1-rc1 so that we can use latest features. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20221018140854.69846-2-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | hw/i8042: Fix value uninitialized in kbd_io() | hbuxiaofei | 1 | -1/+1
GCC Version: gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)

  hw/i8042.c: In function ‘kbd_io’:
  hw/i8042.c:153:19: error: ‘value’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
    state.write_cmd = val;
    ~~~~~~~~~~~~~~~~^~~~~
  hw/i8042.c:298:5: note: ‘value’ was declared here
    u8 value;
    ^~~~~
  cc1: all warnings being treated as errors
  make: *** [Makefile:508: hw/i8042.o] Error 1

Signed-off-by: hbuxiaofei <hbuxiaofei@gmail.com>
Link: https://lore.kernel.org/r/20221102080501.69274-1-hbuxiaofei@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2022-11-08 | pci: Disable writes to Status register | Jean-Philippe Brucker | 1 | -14/+40
Although the PCI Status register only contains read-only and write-1-to-clear bits, we currently keep anything written there, which can confuse a guest.

The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI: Clear PCI_STATUS when setting up device"), which unconditionally writes 0xffff to the Status register in order to clear pending errors. Then the EDAC driver sees the parity status bits set and attempts to clear them by writing 0xc100, which in turn clears the Capabilities List bit. Later on, when the virtio-pci driver starts probing, it assumes due to missing capabilities that the device is using the legacy transport, and fails to set up the device because of mismatched protocol.

Filter writes to the config space, keeping only those to writable fields. Tighten the access size check while we're at it, to prevent overflow. This is only a small step in the right direction, not a foolproof solution, because a guest could still write both Command and Status registers using a single 32-bit write. More work is needed for:

* Supporting arbitrary sized writes.
* Sanitizing accesses to capabilities, which are device-specific.

Also remove the old hack that filtered accesses. It was most likely guarding against ROM BAR writes, which is now handled by the pci_config_writable bitmap.

Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20221020173452.203043-1-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
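The failure sequence described above can be modelled on a single 16-bit register. This is only a sketch of the register semantics under stated assumptions: the function names are hypothetical, and the actual patch filters writes through a per-byte writable bitmap rather than modelling write-1-to-clear.

```c
#include <stdint.h>

#define PCI_STATUS_CAP_LIST	0x0010	/* read-only: capabilities list present */
/* Write-1-to-clear error bits of the Status register: bit 8 and bits 11-15. */
#define PCI_STATUS_W1C_MASK	0xf900

/* Old behaviour described above: whatever the guest writes is kept. */
static uint16_t status_write_unfiltered(uint16_t status, uint16_t val)
{
	(void)status;
	return val;
}

/* Filtered: read-only bits never change; W1C bits clear where val has a 1. */
static uint16_t status_write_filtered(uint16_t status, uint16_t val)
{
	return status & ~(val & PCI_STATUS_W1C_MASK);
}
```

Replaying the two guest writes (0xffff, then 0xc100) through the unfiltered version loses the Capabilities List bit, while the filtered version keeps it.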
2022-10-04 | virtio-net: Fix vq->use_event_idx flag check | Tu Dinh Ngoc | 1 | -1/+1
VIRTIO_RING_F_EVENT_IDX is a bit position value, but virtio_init_device_vq populates vq->use_event_idx by ANDing this value directly to vdev->features. Fix the check for this flag in virtio_init_device_vq. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Tu Dinh Ngoc <dinhngoc.tu@irit.fr> Link: https://lore.kernel.org/r/20220929121858.156-1-dinhngoc.tu@irit.fr Signed-off-by: Will Deacon <will@kernel.org>
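The difference between a bit position and a bit mask can be shown in a few lines. VIRTIO_RING_F_EVENT_IDX is 29 in the virtio UAPI header; the function names below are hypothetical and only illustrate the two checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_RING_F_EVENT_IDX	29	/* a bit position, not a mask */

/* Convert a feature bit position into a 64-bit mask. */
static uint64_t feature_mask(unsigned int bit)
{
	assert(bit < 64);
	return 1ULL << bit;
}

/* Buggy check: ANDs with 29 (0b11101), so it tests bits 0, 2, 3 and 4. */
static bool event_idx_buggy(uint64_t features)
{
	return features & VIRTIO_RING_F_EVENT_IDX;
}

/* Fixed check: tests bit 29 itself. */
static bool event_idx_fixed(uint64_t features)
{
	return features & feature_mask(VIRTIO_RING_F_EVENT_IDX);
}
```

The buggy form reports the flag as set for unrelated low feature bits and as clear when only bit 29 is negotiated.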
2022-09-22 | riscv: Fix serial0 alias path | Anup Patel | 1 | -4/+8
We have all MMIO devices under "/smb" DT node so the serial0 alias path should have "/smb" prefix. Fixes: 7c9aac003925 ("riscv: Generate FDT at runtime for Guest/VM") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20220815101325.477694-6-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-09-22 | riscv: Add Sstc extension support | Atish Patra | 1 | -0/+1
The Sstc extension allows the guest OS to program the timer directly without relying on the SBI call. The kernel detects the presence of the Sstc extension from the riscv,isa DT property. Add the Sstc extension to the device tree if it is supported by the host. Signed-off-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20220815101325.477694-5-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-09-22 | riscv: Add Svpbmt extension support | Anup Patel | 1 | -0/+1
The Svpbmt extension allows PTE based memory attributes in page tables. This extension also allows the Guest/VM to use PTE based memory attributes in VS-stage page tables, so let us add it to the Guest/VM ISA string when KVM RISC-V supports it. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20220815101325.477694-4-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-09-22 | riscv: Append ISA extensions to the device tree | Atish Patra | 3 | -11/+41
The riscv,isa DT property only contains single letter base extensions until now. However, there are also multi-letter extensions which were ratified recently. Add a mechanism to append those extension details to the device tree so that guest can leverage those. Signed-off-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20220815101325.477694-3-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
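One way to picture the append mechanism is a bounded string helper that adds each multi-letter extension after the single-letter base string. The sketch assumes the usual riscv,isa convention of separating multi-letter extensions with underscores; the helper name and buffer handling are hypothetical, not kvmtool's.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append "_<ext>" to the ISA string held in isa (buffer of size bytes).
 * Returns 0 on success, or -1 and leaves isa unchanged if it does not fit.
 */
static int isa_append_ext(char *isa, size_t size, const char *ext)
{
	size_t len;
	int n;

	assert(isa != NULL && ext != NULL);
	len = strlen(isa);
	if (len >= size)
		return -1;

	n = snprintf(isa + len, size - len, "_%s", ext);
	if (n < 0 || (size_t)n >= size - len) {
		isa[len] = '\0';	/* undo the partial write */
		return -1;
	}
	return 0;
}
```

Starting from a base string such as "rv64imafdc", appending "sstc" and "svpbmt" would yield "rv64imafdc_sstc_svpbmt".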
2022-09-22 | Update UAPI headers based on Linux-6.0-rc1 | Anup Patel | 9 | -30/+301
We update all UAPI headers based on Linux-6.0-rc1 so that we can use latest features. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20220815101325.477694-2-apatel@ventanamicro.com Signed-off-by: Will Deacon <will@kernel.org>
2022-09-22 | net: Use vfork() instead of fork() for script execution | Suzuki K Poulose | 1 | -1/+1
When a script is specified for a guest nic setup, we fork() and execl() the script when it is time to execute it. However this is not optimal, given we are running a VM. The fork() will trigger marking the entire page-table of the current process as CoW, which will trigger unmapping the entire stage2 page tables from the guest. Anyway, the child process will exec the script as soon as we fork(), making all these mm operations moot. Also, this operation could be problematic for confidential compute VMs, where it may be expensive (and sometimes destructive) to make changes to the stage2 page tables. So, instead we could use vfork() and avoid the CoW and unmap of the stage2. Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220809124816.2880990-1-suzuki.poulose@arm.com Signed-off-by: Will Deacon <will@kernel.org>
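The vfork()-then-exec pattern looks roughly like the sketch below. The function name and its single-argument form are assumptions for illustration, not the kvmtool code; the key constraint is that the child does nothing except exec or _exit, since it borrows the parent's address space until then.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "script arg" and return its exit status, or -1 on failure. */
static int run_script(const char *script, const char *arg)
{
	int status;
	pid_t pid;

	assert(script != NULL);
	pid = vfork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: only exec or _exit are safe after vfork(). */
		execl(script, script, arg, (char *)NULL);
		_exit(127);	/* exec failed */
	}

	/* Parent resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because no copy-on-write marking of the parent's page tables takes place, the stage2 mappings of the running guest are left alone.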
2022-08-04 | Makefile: Introduce LIBFDT_DIR to specify libfdt location | Alexandru Elisei | 2 | -8/+33
The arm, arm64, powerpc and riscv architectures require that libfdt is installed on the system, however the library might not be available for every architecture on the user's distro of choice. Or the static version of the library, needed for the lkvm-static target, might be missing.

Fortunately, kvmtool has anticipated this situation and it includes instructions to compile and install libfdt in the INSTALL file. Unfortunately, those instructions do not always work (for example, because the user is missing the needed permissions), leaving the user unable to compile kvmtool.

As an alternative to installing libfdt system-wide, provide the LIBFDT_DIR variable when compiling kvmtool. For example, when compiling with the command:

  $ make ARCH=<arch> CROSS_COMPILE=<cross_compile> LIBFDT_DIR=<dir>

kvmtool will link the executable against the static version of the library located in LIBFDT_DIR/libfdt.a. LIBFDT_DIR takes precedence over the system library, as there are valid reasons to prefer a self-compiled library over the one that the distro provides (like the system library being older).

Note that this will slightly increase the size of the executable. For the arm64 architecture, the increase has been measured to be about 100KB, or about 5% of the total executable size.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220722141448.168252-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2022-08-04 | virtio/rng: Zero-initialize the device | Jean-Philippe Brucker | 1 | -1/+1
Use calloc() to avoid uninitialized fields in the rng device. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/r/20220722141731.64039-5-jean-philippe@linaro.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-04 | virtio/pci: Deassert IRQ line on ISR read | Jean-Philippe Brucker | 1 | -4/+1
Since commit 2108c86d0623 ("virtio/pci: Signal INTx interrupts as level instead of edge"), virtio uses level-triggered IRQs. Bring the modern device up to date, by deasserting the IRQ line when the guest reads the interrupt status register. Fixes: 3bf79498e6d5 ("virtio: Add support for modern virtio-pci") Reported-by: Sami Mujawar <sami.mujawar@arm.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/r/20220722141731.64039-4-jean-philippe@linaro.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-04 | Makefile: Fix ARCH override | Jean-Philippe Brucker | 1 | -2/+2
Variables set on the command-line are not overridden by normal assignments. So when passing ARCH=x86_64 on the command-line, build fails: Makefile:227: *** This architecture (x86_64) is not supported in kvmtool. Use the 'override' directive to force the ARCH reassignment. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220722141731.64039-3-jean-philippe@linaro.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-04 | Makefile: Add missing build dependencies | Jean-Philippe Brucker | 1 | -1/+2
When running kvmtool after updating without doing a make clean, one might run into strange issues such as:

  Warning: Failed init: symbol_init
  Fatal: Initialisation failed

or worse. This happens because symbol.o is not automatically rebuilt after a change of headers, because .symbol.o.d is not in the $(DEPS) variable. So if the layout of struct kvm_config changes, for example, a symbol.o that was built for an older version will try to read kvm->vmlinux from the wrong location in struct kvm, and lkvm will die.

Add all .d files to $(DEPS). Also include $(STATIC_DEPS) which was previously set but not used.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220722141731.64039-2-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm64: pvtime: Use correct region size | Alexandru Elisei | 1 | -5/+5
pvtime uses ARM_PVTIME_BASE instead of ARM_PVTIME_SIZE for the size of the memory region given to the guest, which causes the following error when creating a flash device (via the -F/--flash command line argument):

  Error: RAM (read-only) region [2000000-27fffff] would overlap RAM region [1020000-203ffff]

The read-only region represents the guest memory where the flash image is copied by kvmtool. The region starting at 0x102_0000 (ARM_PVTIME_BASE) is the pvtime region, which should be 64K in size. kvmtool erroneously creates the region to be ARM_PVTIME_BASE in size instead, and the last address becomes:

  ARM_PVTIME_BASE + ARM_PVTIME_BASE - 1 = 0x102_0000 + 0x102_0000 - 1 = 0x203_ffff

which corresponds to the end of the region from the error message. Do the right thing and make the pvtime memory region ARM_PVTIME_SIZE = 64K bytes, as it was intended.

Fixes: 7d4671e5d372 ("aarch64: Add stolen time support")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220629103905.24480-1-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
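The arithmetic from the message can be checked directly. The base, size and flash range below are the values quoted in the commit message; the helper functions are hypothetical and only reproduce the overlap test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARM_PVTIME_BASE	0x1020000ULL
#define ARM_PVTIME_SIZE	0x10000ULL	/* 64K */
#define FLASH_FIRST	0x2000000ULL	/* read-only region from the error */
#define FLASH_LAST	0x27fffffULL

/* Last address of a region that starts at base and spans size bytes. */
static uint64_t region_last(uint64_t base, uint64_t size)
{
	assert(size > 0);
	return base + size - 1;
}

/* True if the inclusive ranges [s1, e1] and [s2, e2] share any address. */
static bool regions_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	return s1 <= e2 && s2 <= e1;
}
```

Using ARM_PVTIME_BASE as the size gives a last address of 0x203ffff, which collides with the flash region; using ARM_PVTIME_SIZE gives 0x102ffff, which does not.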
2022-07-01 | virtio/pci: Remove VIRTIO_PCI_F_SIGNAL_MSI | Jean-Philippe Brucker | 2 | -7/+5
VIRTIO_PCI_F_SIGNAL_MSI is not a virtio feature but an internal flag. Change it to bool to avoid confusion. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-13-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio/pci: Initialize all vectors to VIRTIO_MSI_NO_VECTOR | Jean-Philippe Brucker | 2 | -2/+4
According to the virtio spec, all vectors must be initialized to VIRTIO_MSI_NO_VECTOR (0xffff). In 4.1.5.1.2.1 "Device Requirements: MSI-X Vector Configuration": The device MUST return vector mapped to a given event, (NO_VECTOR if unmapped) on read of config_msix_vector/queue_msix_vector. Currently we return 0, which is a valid MSI vector. Return NO_VECTOR instead. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-12-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio: Add support for modern virtio-mmio | Jean-Philippe Brucker | 8 | -11/+195
Add modern MMIO transport to virtio, make it the default. Legacy transport can be enabled with --virtio-legacy. The main change for MMIO is the queue addresses. They are now 64-bit addresses instead of 32-bit PFNs. Apart from that all changes for supporting modern devices are already implemented. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-11-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio: Move MMIO transport to mmio-legacy | Jean-Philippe Brucker | 4 | -155/+165
To make space for the modern register layout, move the current code to mmio-legacy. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-10-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio: Add support for modern virtio-pci | Jean-Philippe Brucker | 15 | -19/+445
Add support for modern virtio-pci implementation (based on the 1.0 virtio spec). We add a new transport, alongside MMIO and PCI-legacy. This is now the default when selecting PCI, but users can still select the legacy transport for all virtio devices by passing "--virtio-legacy" on the command-line. The main change in modern PCI is the way we address virtqueues, using 64-bit values instead of PFNs. To keep the queue configuration atomic the device also gets a "queue enable" register. Configuration is also made extensible by more feature bits and PCI capabilities. Scalability is improved as well, as devices can have notification registers for each virtqueue on separate pages. However this implementation keeps a single notification register. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-9-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
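The change in virtqueue addressing can be sketched as two small conversions. The sketch assumes the 4096-byte page granularity that legacy virtio-pci uses for its queue PFN register; the function names are hypothetical.

```c
#include <stdint.h>

#define LEGACY_QUEUE_ADDR_SHIFT	12	/* legacy PFN counts 4096-byte pages */

/* Legacy: the guest writes a 32-bit page frame number. */
static uint64_t legacy_queue_addr(uint32_t pfn)
{
	return (uint64_t)pfn << LEGACY_QUEUE_ADDR_SHIFT;
}

/* Modern: the guest writes a full 64-bit address as two 32-bit halves. */
static uint64_t modern_queue_addr(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

Since the modern address arrives in two separate 32-bit writes, a queue could be observed half-configured; this is why the message mentions a "queue enable" register to keep the configuration atomic.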
2022-07-01 | virtio: Move PCI transport to pci-legacy | Jean-Philippe Brucker | 4 | -236/+254
To make space for the more recent virtio version, move the legacy bits of virtio-pci to a different file. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-8-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio: Prepare for more feature bits | Jean-Philippe Brucker | 10 | -14/+14
Modern virtio uses more than 32 bits of features. Bump the feature bitfield size to 64 bits. virtio_set_guest_features() changes in behavior because it will now be called multiple times, each time the guest writes to a 32-bit slice of the features. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-7-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
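A rough illustration of why the setter is now called several times: each guest write carries only one 32-bit slice of the 64-bit feature set, chosen by a selector. The function below is a hypothetical sketch of the merge step, not kvmtool's virtio_set_guest_features().

```c
#include <stdint.h>

/*
 * Merge a 32-bit write into the selected half of a 64-bit feature set.
 * sel 0 covers bits 0-31 and sel 1 covers bits 32-63; other selectors
 * leave the features unchanged.
 */
static uint64_t features_write_slice(uint64_t features, uint32_t sel, uint32_t val)
{
	unsigned int shift;

	if (sel > 1)
		return features;

	shift = sel * 32;
	features &= ~(0xffffffffULL << shift);	/* drop the old slice */
	return features | ((uint64_t)val << shift);
}
```

A guest negotiating bit 29 and bit 32 would issue two writes, one per slice, and the device has to combine them before acting on the result.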
2022-07-01 | virtio/net: Set vhost backend after queue address | Jean-Philippe Brucker | 1 | -5/+6
We currently call VHOST_SET_BACKEND from notify_vq_gsi(), which can't work with modern virtio because vhost checks that the virtqueue is accessible when handling VHOST_SET_BACKEND, and the modern driver initializes the MSIs before setting up the virtqueue. Move VHOST_SET_BACKEND to init_vq(). Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-6-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio/pci: Use the correct eventfd for vhost notification | Jean-Philippe Brucker | 1 | -4/+5
Legacy virtio drivers write to the I/O port BAR, and the modern virtio device uses the MMIO BAR. Since vhost can only listen on one ioeventfd, select the one that the guest will use. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-5-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio/pci: Make doorbell offset dynamic | Jean-Philippe Brucker | 2 | -5/+10
The doorbell offset depends on the transport - virtio-legacy uses a fixed offset, but modern virtio can have per-vq offsets. Add an offset field to the virtio_pci structure. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-4-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio: Extract init_vq() for PCI and MMIO | Jean-Philippe Brucker | 2 | -8/+30
Modern virtio will need to reuse this code when initializing a virtqueue. It's not much, but still nicer to have next to exit_vq(). Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-3-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | virtio/pci: Delete MSI routes | Jean-Philippe Brucker | 1 | -0/+14
On exit_vq() and device reset, remove the MSI routes that were set up at runtime. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220701142434.75170-2-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm64: Allow the user to specify the RAM base address | Alexandru Elisei | 7 | -10/+64
Allow the user to specify the RAM base address by using the -m/--mem size@addr command line argument. The base address must be above 2GB, so as not to overlap with the MMIO I/O region. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-13-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | Introduce kvm__arch_default_ram_address() | Alexandru Elisei | 7 | -0/+31
Add a new function, kvm__arch_default_ram_address(), which returns the default address for guest RAM for each architecture. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-12-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm/arm64: Consolidate RAM initialization in kvm__init_ram() | Julien Grall | 1 | -26/+26
RAM initialization is unnecessarily split between kvm__init_ram() and kvm__arch_init(). Move all code related to RAM initialization to kvm__init_ram(), making the code easier to follow and to modify. One thing to note is that the initialization order is slightly altered: kvm__arch_enable_mte() and gic__create() are now called before mmap'ing the guest RAM. That is perfectly fine, as they don't use the host's mapping of the guest memory. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-11-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | kvm__arch_init: Remove hugetlbfs_path and ram_size as parameters | Julien Grall | 7 | -14/+20
The kvm struct already contains a pointer to the configuration, which contains both hugetlbfs_path and ram_size, so it is not necessary to pass them as arguments to kvm__arch_init(). Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-10-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | builtin_run: Allow standard size specifiers for memory | Suzuki K Poulose | 1 | -5/+54
Allow the user to use the standard B (bytes), K (kilobytes), M (megabytes), G (gigabytes), T (terabytes) and P (petabytes) suffixes for memory size. When none are specified, the default is megabytes. Also raise an error if the guest specifies 0 as the memory size, instead of treating it as uninitialized, as kvmtool has done so far. Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-9-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
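The suffix rules described above can be sketched as a small parser. The function name, the choice to accept only upper-case suffixes, and the use of 0 as the error value are assumptions made for this illustration; they are not taken from the kvmtool source.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "<number>[B|K|M|G|T|P]" into a size in bytes. No suffix means
 * megabytes. Returns 0 on any error, including a requested size of 0.
 */
static uint64_t parse_mem_size(const char *s)
{
	unsigned long long n;
	unsigned int shift;
	char *end;

	assert(s != NULL);
	if (!isdigit((unsigned char)s[0]))
		return 0;	/* rejects signs and leading whitespace */

	errno = 0;
	n = strtoull(s, &end, 10);
	if (errno == ERANGE || n == 0)
		return 0;

	switch (*end) {
	case 'B': shift = 0; end++; break;
	case 'K': shift = 10; end++; break;
	case 'M': shift = 20; end++; break;
	case 'G': shift = 30; end++; break;
	case 'T': shift = 40; end++; break;
	case 'P': shift = 50; end++; break;
	case '\0': shift = 20; break;	/* default unit: megabytes */
	default: return 0;
	}
	if (*end != '\0')
		return 0;	/* trailing garbage after the suffix */
	if (n > (UINT64_MAX >> shift))
		return 0;	/* would overflow 64 bits */
	return (uint64_t)n << shift;
}
```

Under these rules "1024" and "1G" both describe the same amount of guest memory.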
2022-07-01 | arm/arm64: Kill the ARM_HIMAP_MAX_MEMORY() macro | Alexandru Elisei | 1 | -1/+0
The ARM_HIMAP_MAX_MEMORY() macro is a remnant of a time when KVM only supported 40 bits of IPA. There are no users left for this macro, remove it. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-8-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm/arm64: Kill the ARM_MAX_MEMORY() macro | Alexandru Elisei | 2 | -18/+0
For 32-bit guests, the maximum memory size is represented by the define ARM_LOMAP_MAX_MEMORY, which ARM_MAX_MEMORY() returns. For 64-bit guests, the RAM size is checked against the maximum allowed by KVM in kvm__get_vm_type(). There are no users left for the ARM_MAX_MEMORY() macro, remove it. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-7-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm/arm64: Fail if RAM size is too large for 32-bit guests | Alexandru Elisei | 3 | -1/+10
For 64-bit guests, kvmtool exits with an error in kvm__get_vm_type() if the memory size is larger than what KVM supports. For 32-bit guests, the RAM size is silently rounded down to ARM_LOMAP_MAX_MEMORY in kvm__arch_init(). Be consistent and exit with an error when the user has configured the wrong RAM size for 32-bit guests. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-6-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | builtin-run: Add arch hook to validate VM configuration | Alexandru Elisei | 9 | -0/+29
Architectures are free to set their own command line options. Add an architecture specific hook to validate these options. For now, the hook does nothing, but it will be used in later patches. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-5-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | builtin-run: Rework RAM size validation | Alexandru Elisei | 1 | -7/+13
host_ram_size() uses sysconf() to calculate the available ram, and sysconf() can fail. When that happens, host_ram_size() returns 0. kvmtool warns the user when the configured VM ram size exceeds the size of the host's memory, but doesn't take into account that host_ram_size() can return 0. If the function returns zero, skip the warning. Since this can only happen when the user sets the memory size (via the -m/--mem command line argument), skip the check entirely if the user hasn't set it. Move the check to kvm_run_validate_cfg(), as it checks for valid user configuration. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
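A sketch of the check described above, assuming host_ram_size() is built on the _SC_PHYS_PAGES and _SC_PAGE_SIZE queries; the second helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Host RAM in bytes, or 0 if either sysconf() query fails. */
static uint64_t host_ram_size(void)
{
	long nr_pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGE_SIZE);

	if (nr_pages < 0 || page_size < 0)
		return 0;
	return (uint64_t)nr_pages * (uint64_t)page_size;
}

/*
 * True when a warning is warranted: the host size is known (non-zero)
 * and the configured guest RAM is larger than it.
 */
static bool ram_exceeds_host(uint64_t ram_size, uint64_t host_size)
{
	return host_size != 0 && ram_size > host_size;
}
```

Treating a host size of 0 as "unknown" keeps a failed sysconf() from producing a spurious warning for every VM.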
2022-07-01 | builtin-run: Always use RAM size in bytes | Alexandru Elisei | 3 | -13/+15
The user can specify the virtual machine memory size in MB, which is saved in cfg->ram_size. kvmtool validates it against the host memory size, converted from bytes to MB. ram_size is then converted to bytes, and this is how it is used throughout the rest of kvmtool. To avoid any confusion about the unit of measurement, especially once the user is allowed to specify the unit of measurement, always use ram_size in bytes. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | Use MB for megabytes consistently | Alexandru Elisei | 2 | -3/+3
The help text for the -m/--mem argument states that the guest memory size is in MiB (mebibyte). MiB is the same thing as MB (megabyte), and indeed this is how MB is used throughout kvmtool. Replace MiB with MB, so people don't get the wrong idea and start believing that for kvmtool a MB is 10^6 bytes instead of 2^20. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20220616134828.129006-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01 | arm: gic: fdt: fix PPI CPU mask calculation | Andre Przywara | 4 | -5/+16
The GICv2 DT binding describes the third cell in each interrupt descriptor as holding the trigger type, but also the CPU mask that this IRQ applies to, in bits [15:8]. However this is not the case for GICv3, where we don't use a CPU mask in the third cell: a simple mask wouldn't fit for the many more supported cores anyway.

At the moment we fill this CPU mask field regardless of the GIC type, for the PMU and arch timer DT nodes. This is not only the wrong thing to do in case of a GICv3, but also triggers UBSAN splats when using more than 30 cores, as we do shifting beyond what a u32 can hold:

  $ lkvm run -k Image -c 31 --pmu
  arm/timer.c:13:22: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
  arm/timer.c:13:38: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
  arm/timer.c:13:43: runtime error: left shift of 2147483647 by 8 places cannot be represented in type 'int'
  arm/aarch64/pmu.c:202:22: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
  arm/aarch64/pmu.c:202:38: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
  arm/aarch64/pmu.c:202:43: runtime error: left shift of 2147483647 by 8 places cannot be represented in type 'int'

Fix that by adding a function that creates the mask by looking at the GIC type first, and returning zero when a GICv3 is used. Also we explicitly check for the CPU limit again, even though this would be done before already, when we try to create a GICv2 VM with more than 8 cores.

Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220616145526.3337196-1-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
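The shape of the fix can be sketched as a single helper that looks at the GIC type before building the mask. The enum, the function name and the choice to clamp the CPU count (rather than report an error) are assumptions for this illustration; only the bit position [15:8] and the 8-core GICv2 limit come from the message.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for the irqchip type. */
enum gic_type { GIC_V2, GIC_V3 };

#define GIC_FDT_IRQ_PPI_CPU_SHIFT	8	/* CPU mask lives in bits [15:8] */
#define GICV2_MAX_CPUS			8

/* CPU mask for the third cell of a PPI interrupt descriptor. */
static uint32_t ppi_cpu_mask(enum gic_type type, unsigned int nr_cpus)
{
	/* GICv3 does not use a CPU mask in the third cell. */
	if (type != GIC_V2)
		return 0;

	/* Keep the shift small enough for the 8-bit mask field. */
	if (nr_cpus > GICV2_MAX_CPUS)
		nr_cpus = GICV2_MAX_CPUS;

	/* Unsigned arithmetic: no signed-overflow undefined behaviour. */
	return ((1U << nr_cpus) - 1) << GIC_FDT_IRQ_PPI_CPU_SHIFT;
}
```

With 31 VCPUs on a GICv3, the helper returns 0 and no shift is performed at all, which avoids the UBSAN reports quoted above.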
2022-06-09virtio/pci: Factor MSI route creationJean-Philippe Brucker1-33/+27
The code for creating an MSI route is already duplicated between config and virtqueue MSI. Modern virtio will need it as well, so move it to a separate function. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-17-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/blk: Implement VIRTIO_F_ANY_LAYOUT featureJean-Philippe Brucker3-28/+60
The current virtio-block implementation assumes that buffers have a specific layout (5.2.6.4 "Legacy Interface: Framing Requirements"). Modern virtio removes this layout constraint, so we have to be careful when reading buffers. Note that since the Linux driver uses the same layout as the legacy transport, arbitrary layouts were not actually tested. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-16-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/console: Add VIRTIO_F_ANY_LAYOUT featureJean-Philippe Brucker1-1/+1
Our virtio-console implementation already supports ANY_LAYOUT, because buffers are accessed with scatter-gather operations. Advertise the VIRTIO_F_ANY_LAYOUT feature. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-15-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/net: Implement VIRTIO_F_ANY_LAYOUT featureJean-Philippe Brucker2-35/+57
Modern virtio demands that devices do not make assumptions about the buffer layouts. Currently the user network backend assumes that TX packets are neatly split between virtio-net header and ethernet frame. Modern virtio-net usually puts everything into one descriptor, but could also split the buffer arbitrarily. Handle arbitrary buffer layouts and advertise the VIRTIO_F_ANY_LAYOUT feature. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-14-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/net: Prepare for modern virtioJean-Philippe Brucker3-7/+21
The virtio_net header contains a 'num_buffers' field, used when the VIRTIO_NET_F_MRG_RXBUF feature is negotiated. The legacy driver does not present this field when the feature is not negotiated. In that case the header is 2 bytes smaller. When using the modern virtio transport, the header always contains the field and in addition the device MUST set it to 1 when the VIRTIO_NET_F_MRG_RXBUF is not negotiated. Prepare for modern virtio support by enabling this case once the 'legacy' flag is switched off. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-13-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
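The header-size rule from this commit can be sketched as follows. The struct and function names are hypothetical and the layout is written out by hand for illustration; real code would use the definitions from the virtio_net UAPI header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the virtio-net header with the trailing field. */
struct vnet_hdr_sketch {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint16_t num_buffers; /* absent for legacy without MRG_RXBUF */
};

/*
 * Return the header length the driver expects.
 * legacy:    true when the legacy virtio transport is in use
 * mrg_rxbuf: true when VIRTIO_NET_F_MRG_RXBUF was negotiated
 */
static size_t vnet_hdr_len(bool legacy, bool mrg_rxbuf)
{
	if (legacy && !mrg_rxbuf)
		return sizeof(struct vnet_hdr_sketch) - sizeof(uint16_t);

	/* Modern virtio always carries num_buffers. */
	return sizeof(struct vnet_hdr_sketch);
}
```

In the modern case without VIRTIO_NET_F_MRG_RXBUF the device would additionally have to write 1 into `num_buffers`, which this sketch does not show.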
2022-06-09virtio/net: Offload vnet header endianness conversion to tapJean-Philippe Brucker1-20/+19
The conversion of vnet header fields will be more difficult when supporting the virtio ANY_LAYOUT feature. Since the uip backend doesn't use the vnet header, and since tap can handle that conversion itself, offload it to tap. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-12-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09Add memcpy_fromiovec_safeJean-Philippe Brucker2-0/+33
Existing IOV functions don't take the iovec size as a parameter. This is unfortunate because when parsing buffers split into header and body, callers may want to know where the body starts in the iovec, after copying the header. Add a function that does the same as memcpy_fromiovec, but also allows callers to iterate over the iovec. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-11-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
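The idea of copying a header while advancing through the iovec can be sketched as below. The function `iov_pull()` and its signature are hypothetical and written for this example; they are not the memcpy_fromiovec_safe() added by the commit.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy len bytes from the iovec array into buf and advance *iov and
 * *iovcount past the consumed data, so the caller can continue with
 * the body. Returns the number of bytes that could not be copied
 * (0 on success).
 */
static size_t iov_pull(void *buf, struct iovec **iov, size_t *iovcount,
		       size_t len)
{
	char *dst = buf;

	while (len && *iovcount) {
		struct iovec *cur = *iov;
		size_t n = cur->iov_len < len ? cur->iov_len : len;

		memcpy(dst, cur->iov_base, n);
		dst += n;
		len -= n;

		/* Shrink the segment so only the remainder stays visible. */
		cur->iov_base = (char *)cur->iov_base + n;
		cur->iov_len -= n;

		/* Move on once a segment has been fully consumed. */
		if (cur->iov_len == 0) {
			(*iov)++;
			(*iovcount)--;
		}
	}

	return len;
}
```

After a successful call, `*iov` points at the first byte following the header, regardless of how the driver split the buffer.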
2022-06-09virtio: Remove set_guest_features() device opJean-Philippe Brucker11-69/+2
Now that devices have a status callback, they don't use set_guest_features() anymore. The negotiated feature set is available in struct virtio_device. Remove the callback from all devices. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-10-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/console: Remove unused callbackJean-Philippe Brucker1-5/+0
Remove the unused set_status() callback. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-9-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio: Fix device-specific config endiannessJean-Philippe Brucker10-63/+84
Some legacy virtio drivers expect to read the device-specific config in guest endianness (2.5.3 "Legacy Interface: A Note on Device Configuration Space endian-ness"). Kvmtool doesn't know the guest endianness until it can probe a VCPU. So the config fields start in host endianness, and are swapped once the guest is running. Currently this is done in set_guest_features(), but that is too late because the driver is allowed to read config fields before setting feature bits (2.5.2 "Device Requirements: Device Configuration Space"). In addition some devices don't swap the fields, and those that do swap the fields do it every time the guest writes the feature register, which can't work if a device gets reset more than once. Initialize the config on device reset. Do it on every reset because in theory multiple guests could run with different endianness during the lifetime of the device. Notes: * the balloon device uses little-endian (5.5.4.0.0.1 "Legacy Interface: Device configuration layout"). * the vsock device was introduced after virtio 0.9.5, hence doesn't describe a legacy interface, but the Linux driver allows to use the legacy transport, and always reads the 64-bit guest_cid field as little-endian. * the specification does not describe the 9p device, but the Linux driver uses guest-endian helpers. * the specification does not explicitly forbid a driver from reading the configuration at any time, but a driver must follow the sequence from 3.1.1 "Driver Requirements: Device Initialization", where the driver is allowed to read the config after setting the DRIVER status bit. It should therefore be safe to keep dealing with guest endianness only on device reset, and not on the first config access. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-8-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio: Add config access helpersJean-Philippe Brucker4-65/+47
At the moment device-specific config access is tailored for a Linux guest, that performs any access in 8 bits. But config access can have any size, and modern virtio drivers must use the size of the accessed field. Add helpers that generalize config accesses. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-7-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio: Support modern virtqueue addressesJean-Philippe Brucker12-53/+86
Modern virtio devices can use separate buffers for descriptors, available and used rings. They can also use 64-bit addresses instead of 44-bit. Rework the virtqueue initialization function to support modern virtio. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-6-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
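The difference between the 44-bit and 64-bit addresses can be illustrated with a small sketch. It assumes a 4 KiB queue page, as in the legacy PCI transport, where the driver writes a 32-bit page frame number; the function and macro names are made up for this example.

```c
#include <stdint.h>

#define LEGACY_QUEUE_ADDR_SHIFT 12 /* assumed 4 KiB page size */

/* Legacy: one 32-bit page frame number locates the whole virtqueue. */
static uint64_t legacy_queue_addr(uint32_t pfn)
{
	return (uint64_t)pfn << LEGACY_QUEUE_ADDR_SHIFT;
}

/* Modern: each ring area gets a full address written as two halves. */
static uint64_t modern_queue_addr(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A 32-bit frame number shifted by 12 bits yields at most a 44-bit address, which is where the legacy limit comes from.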
2022-06-09virtio: Factor virtqueue initializationJean-Philippe Brucker10-59/+34
All virtio devices perform the same set of operations when initializing their virtqueues. Move it to virtio core. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-5-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio/vsock: Remove redundant state trackingJean-Philippe Brucker1-5/+5
The core already tells us whether a device is being started or stopped. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-4-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio: Remove redundant testJean-Philippe Brucker1-2/+1
Don't test for VIRTIO__STATUS_STOP right after setting it. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-3-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-09virtio: Add NEEDS_RESET to the status maskJean-Philippe Brucker1-0/+1
Not all toolchains used to know about VIRTIO_CONFIG_S_NEEDS_RESET, so we left it out of the status mask. Now that we include our own version of virtio_config.h and we'll need it for virtio 1.0, add it back. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lore.kernel.org/r/20220607170239.120084-2-jean-philippe.brucker@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26riscv: Add missing asm/kernel.h headerDao Lu1-0/+8
Fixes the following compilation issue: include/linux/kernel.h:5:10: fatal error: asm/kernel.h: No such file or directory 5 | #include "asm/kernel.h" Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Dao Lu <daolu@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Fixes: 0febaae00bb6 ("Add cpumask functions") Link: https://lore.kernel.org/r/20220524180030.1848992-1-daolu@rivosinc.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26mips: Do not emulate a serial deviceAlexandru Elisei2-2/+10
Commit 45b4968e0de1 ("hw/serial: ARM/arm64: Use MMIO at higher addresses") changed how the address for the UART is computed by using KVM_IOPORT_AREA. The symbol is not defined for MIPS, which results in the following compilation error: hw/serial.c:21:27: error: ‘KVM_IOPORT_AREA’ undeclared here (not in a function); did you mean ‘KVM_MIPS_IOPORT_AREA’? 21 | #define serial_iobase_0 (KVM_IOPORT_AREA + 0x3f8) | ^~~~~~~~~~~~~~~ hw/serial.c:29:27: note: in expansion of macro ‘serial_iobase_0’ 29 | #define serial_iobase(nr) serial_iobase_##nr | ^~~~~~~~~~~~~~ hw/serial.c:92:15: note: in expansion of macro ‘serial_iobase’ 92 | .iobase = serial_iobase(0), | ^~~~~~~~~~~~~ Before the commit, the serial was placed at addresses 0x3f8, 0x2f8, 0x3e8 and 0x2e8. However, MIPS puts the RAM at those addresses, up to KVM_MMIO_START, which is 0x10000000. Meaning that serial device emulation never worked, as those addresses were part of a valid memslot representing memory. This has been the case since commit 7281a8db199b ("kvm tools, mips: Add MIPS support") from 2014. A quick examination of the MIPS code reveals that the architecture relies on hypercalls from the guest and the virtio console for input and output. Since nobody complained about the missing serial device, assume that it is indeed not needed and do not compile it for MIPS. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220525165704.186754-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26arm64: Honor --vcpu-affinity for aarch32 guestsAlexandru Elisei1-10/+12
Commit 4639b72f61a3 ("arm64: Add --vcpu-affinity command line argument") introduced the --vcpu-affinity command line argument to pin the VCPUs to a given list of physical CPUs. Unfortunately, the affinity is set only for an arm64 guest, leading to the following error when running a 32-bit guest on a system with two or more PMUs: KVM exit reason: 9 ("KVM_EXIT_FAIL_ENTRY") Registers: PC: 0x8000c608 PSTATE: 0x200000d3 SP_EL1: 0x0 LR: 0x0 *pc: 0x8000c608: 25 3f a0 e1 83 61 a0 e1 0x8000c610: 83 31 98 e7 04 10 82 e1 0x8000c618: 07 2c 81 e3 28 10 1b e5 0x8000c620: 03 20 82 e3 03 00 a0 e1 *lr: Warning: unable to translate guest address 0x0 to host 0x00000000: <unknown> 0x00000008: <unknown> 0x00000010: <unknown> 0x00000018: <unknown> # KVM compatibility warning. virtio-net device was not detected. While you have requested a virtio-net device, the guest kernel did not initialize it. Please make sure that the guest kernel was compiled with CONFIG_VIRTIO_NET=y enabled in .config. # KVM session ended normally. Make the error go away by setting the affinity of the VCPUs for both 32-bit and 64-bit guests. Fixes: 4639b72f61a3 ("arm64: Add --vcpu-affinity command line argument") Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220525165704.186754-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26include: add new virtio uapi header filesAndre Przywara8-0/+1005
Commit a08bb43a0c37 ("kvmtool: Copy Linux' up-to-date virtio headers") copied in some of the virtio UAPI headers from the kernel tree, but didn't include all of them, as we were relying on some of them being provided by the distribution. Now commit bc77bf49df6e ("stat: Add descriptions for new virtio_balloon stat types") used some newer virtio balloon symbols, that some older distros (e.g. Ubuntu 18.04) do not carry, which breaks compilation there: ======================= CC builtin-stat.o builtin-stat.c: In function 'do_memstat': builtin-stat.c:86:8: error: 'VIRTIO_BALLOON_S_HTLB_PGALLOC' undeclared (first use in this function); did you mean 'VIRTIO_BALLOON_S_AVAIL'? case VIRTIO_BALLOON_S_HTLB_PGALLOC: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ VIRTIO_BALLOON_S_AVAIL builtin-stat.c:86:8: note: each undeclared identifier is reported only once for each function it appears in ======================= To fix this include the remaining virtio headers (those that we actually need for kvmtool at the moment), from Linux v5.18.0. Fixes: bc77bf49df6e ("stat: Add descriptions for new virtio_balloon stat types") Signed-off-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20220524150611.523910-5-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26include: update virtio UAPI headersAndre Przywara4-92/+305
Commit a08bb43a0c37 ("kvmtool: Copy Linux' up-to-date virtio headers") copied the kernel's virtio UAPI headers into the kvmtool tree, because at the time some distros didn't include (all of) them in their kernel headers package. Let's update those copies, so that we can use newer features, if needed. This syncs in the already existing copies of the headers from Linux v5.18.0. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20220524150611.523910-4-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26util: include virtio UAPI headers in syncAndre Przywara1-0/+10
We already have an update_headers.sh sync script, where we occasionally update the KVM interface UAPI kernel headers into our tree. So far this covered only the generic kvm.h, plus each architecture's version of that file. Commit bc77bf49df6e ("stat: Add descriptions for new virtio_balloon stat types") used newer virtio symbols, which some older distros do not include in their kernel headers package. To help fixing this and to avoid similar problems in the future, add the virtio headers to our sync script, so that we can get the same, up-to-date versions of the headers easily. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20220524150611.523910-3-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-26update virtio_mmio.hAndre Przywara2-11/+52
At the time we pulled in virtio_mmio.h from the kernel tree (commit a08bb43a0c37c "kvmtool: Copy Linux' up-to-date virtio headers"), this was not an official UAPI header file, so wasn't stable and was not shipped with distributions. This has changed with Linux commit 51be7a9a261c ("virtio_mmio: expose header to userspace"), so we can now use that file officially. However, before that, the names of some symbols changed, so we have to adjust their usage in our source. This pulls in virtio_mmio.h from Linux v5.18.0. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20220524150611.523910-2-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20kvmtool: Have stack be not executable on x86Martin Radev2-0/+10
This patch fixes an issue of having the stack be executable for x86 builds by ensuring that the two objects bios-rom.o and entry.o have the section .note.GNU-stack. Suggested-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-7-martin.b.radev@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20virtio: Check for overflows in QUEUE_NOTIFY and QUEUE_SELMartin Radev11-12/+39
This patch checks for overflows in QUEUE_NOTIFY and QUEUE_SEL in the PCI and MMIO operation handling paths. Further, the return value type of get_vq_count is changed from int to uint since a negative value carries no meaning. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-6-martin.b.radev@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20virtio: Sanitize config accessesMartin Radev12-9/+119
The handling of VIRTIO_PCI_O_CONFIG is prone to buffer access overflows. This patch sanitizes this operation by using the newly added virtio op get_config_size. Any access which goes beyond the config structure's size is prevented and a failure is returned. Additionally, PCI accesses which span more than a single byte are prevented and a warning is printed because the implementation does not currently support the behavior correctly. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-5-martin.b.radev@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20virtio/9p: Fix virtio_9p_config allocation sizeMartin Radev1-1/+1
Per the Linux user API, the struct virtio_9p_config "tag" field contains the non-NUL-terminated tag name and this is how the tag name is copied by kvmtool in virtio_9p__register(). However, the memory allocation for the struct is off by one, as it allocates memory for the tag name and the NUL byte. Fix it by reducing the allocation by exactly one byte. This also matches how the struct is allocated by QEMU tagged v7.0.0 in virtio_9p_get_config(). Suggested-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/YnzhdgUwrLlqmzch@monolith.localdoman Signed-off-by: Will Deacon <will@kernel.org>
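The allocation size described here can be sketched as follows. The struct `p9_config_sketch` mirrors the shape of the UAPI struct (a 16-bit length followed by the tag bytes) but is declared locally, and `p9_config_alloc()` is a hypothetical helper, not the code in virtio_9p__register().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the config layout: length, then the raw tag. */
struct p9_config_sketch {
	uint16_t tag_len;
	uint8_t  tag[]; /* tag bytes only, no terminating NUL */
} __attribute__((packed));

/* Allocate a config sized for exactly strlen(tag_name) tag bytes. */
static struct p9_config_sketch *p9_config_alloc(const char *tag_name)
{
	size_t len = strlen(tag_name);
	struct p9_config_sketch *cfg;

	if (len > UINT16_MAX)
		return NULL;

	/* No "+ 1": the tag is not NUL-terminated in the config. */
	cfg = calloc(1, sizeof(*cfg) + len);
	if (!cfg)
		return NULL;

	cfg->tag_len = (uint16_t)len;
	memcpy(cfg->tag, tag_name, len);
	return cfg;
}
```

For a tag such as "hostfs" this allocates 2 + 6 bytes, so the reported config size ends right after the last tag byte.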
2022-05-20virtio: Use u32 instead of int in pci_data_in/outMartin Radev1-4/+4
The PCI access size type is changed from a signed type to an unsigned type since the size is never expected to be negative, and the type also matches the type in the signature of virtio_pci__io_mmio_callback. This change simplifies size checking in the next patch. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-4-martin.b.radev@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20mmio: Sanitize addr and lenMartin Radev1-0/+4
This patch verifies that adding the addr and length arguments from an MMIO op do not overflow. This is necessary because the arguments are controlled by the VM. The length may be set to an arbitrary value by using the rep prefix. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-3-martin.b.radev@gmail.com [will: Drop redundant o/f check in virtio_mmio_device_specific() per Alex] Signed-off-by: Will Deacon <will@kernel.org>
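A check of this kind can be written without ever performing the overflowing addition. The sketch below uses a hypothetical helper name, `mmio_range_valid()`, and is not the code added by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the range [addr, addr + len) does not wrap around
 * the end of the 64-bit address space. Both values come from the
 * guest, so neither can be trusted.
 */
static bool mmio_range_valid(uint64_t addr, uint32_t len)
{
	/* Rearranged so that the comparison itself cannot overflow. */
	return addr <= UINT64_MAX - len;
}
```

A caller would reject the access before looking up the MMIO region when the helper returns false.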
2022-05-20kvmtool: Add WARN_ONCE macroMartin Radev1-0/+10
Add a macro that prints a warning only once. This is useful where a warning helps with debugging but repeating it would pollute the log. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Link: https://lore.kernel.org/r/20220509203940.754644-2-martin.b.radev@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
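One common way to build such a macro is a static flag per expansion site. The sketch below is an assumption about the general shape only: `WARN_ONCE_SKETCH`, `report_warning()`, the counter and `handle_access()` are all invented for this example, and kvmtool's macro uses its own warning function.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings so the behaviour can be observed. */
static unsigned int warnings_printed;

static void report_warning(const char *msg)
{
	fprintf(stderr, "Warning: %s\n", msg);
	warnings_printed++;
}

/* Each use of the macro gets its own static flag. */
#define WARN_ONCE_SKETCH(cond, msg)                  \
	do {                                         \
		static bool warned_;                 \
		if ((cond) && !warned_) {            \
			warned_ = true;              \
			report_warning(msg);         \
		}                                    \
	} while (0)

/* Hypothetical caller: complain once about multi-byte accesses. */
static void handle_access(unsigned int size)
{
	WARN_ONCE_SKETCH(size > 1, "multi-byte access not supported");
}
```

Calling `handle_access(4)` repeatedly prints the message on the first call only.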
2022-05-20stat: Add descriptions for new virtio_balloon stat typesKeir Fraser1-1/+16
Unknown types would print the value with no descriptive text at all. Add descriptions for all known stat types, and a default description when the type is unknown. Signed-off-by: Keir Fraser <keirf@google.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20220520143706.550169-3-keirf@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20virtio/balloon: Fix a crash when collecting statsKeir Fraser1-1/+6
The collect_stats hook dereferences the stats virtio queue without checking that it has been initialised. Signed-off-by: Keir Fraser <keirf@google.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20220520143706.550169-2-keirf@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-20aarch64: Give up with MTE for AArch32 guestVladimir Murzin1-0/+5
KVM doesn't support the combination of MTE and an AArch32 guest, so do not even try. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220520123844.127733-1-vladimir.murzin@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm64: Add --vcpu-affinity command line argumentAlexandru Elisei7-22/+118
Add a new command line argument, --vcpu-affinity, to set the CPU affinity for the VCPUs. The affinity is expressed as a cpulist and will apply to all VCPU threads. This gives the user a second option for choosing the PMU on a heterogeneous system. The PMU setup code, when --vcpu-affinity is specified, will search for the PMU associated with the CPUs specified with this command line argument instead of the PMU associated with the CPU on which the main thread is executing. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-12-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
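To illustrate what a cpulist such as "0-3,6" means, here is a simplified parser. It is a sketch under stated assumptions: it handles at most 64 CPUs in a single `uint64_t`, and `cpulist_to_mask()` is a hypothetical name. kvmtool itself relies on the cpumask functions it ported from Linux.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a cpulist ("0-3,6") into a 64-bit mask.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
static int cpulist_to_mask(const char *s, uint64_t *mask)
{
	*mask = 0;

	while (*s) {
		char *end;
		unsigned long lo, hi, cpu;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;

		/* Optional "-N" turns a single CPU into a range. */
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}

		if (lo > hi || hi >= 64)
			return -1;

		for (cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;

		/* A comma must be followed by another entry. */
		if (*end == ',') {
			end++;
			if (!*end)
				return -1;
		} else if (*end) {
			return -1;
		}
		s = end;
	}

	return 0;
}
```

The resulting mask could then be translated into a `cpu_set_t` and applied to each VCPU thread, which this sketch leaves out.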
2022-05-06arm64: Add support for KVM_ARM_VCPU_PMU_V3_SET_PMUAlexandru Elisei2-3/+148
The KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_V3_SET_PMU) VCPU ioctl is used to assign a physical PMU to the events that KVM creates when emulating the PMU for that VCPU. This is useful on heterogeneous systems, when there is more than one hardware PMU present. All VCPUs must have the same PMU assigned. The assumption that is made in the implementation is that the user will pin the kvmtool process on a set of CPUs that share the same PMU. This allows kvmtool to set the same PMU for all VCPUs from the main thread, instead of in the individual VCPU threads. If a VCPU thread migrates to a CPU which has a different PMU than the CPU on which the main thread was executing when the PMU was set, the KVM_RUN ioctl will fail with kvm_run.exit_reason set to KVM_EXIT_FAIL_ENTRY, and kvm_run.fail_entry will be populated with the physical CPU ID on which the VCPU tried to execute. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-11-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06update_headers.sh: Sync ABI headers with Linux v5.18-rc2Alexandru Elisei2-2/+24
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-10-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06Add cpumask functionsAlexandru Elisei14-0/+517
Add a handful of cpumask functions, some of which will be used when dealing with different PMUs on heterogeneous systems. The maximum number of CPUs in a system, NR_CPUS, which dictates the size of the cpumask, has been taken from the Kconfig file for each architecture, from Linux version 5.16. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-9-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm64: Rework set_pmu_attr()Alexandru Elisei1-32/+16
By the time kvmtool generates the DTB node for the PMU, the KVM_ARM_VCPU_PMU_V3 VCPU feature is already set by kvm_cpu__arch_init(). KVM refuses to run a VCPU if the PMU hasn't been initialized. A PMU cannot be initialized if the interrupt ID hasn't been set by userspace. As a consequence, kvmtool will get an error if the interrupt ID has not been set or if the PMU has not been initialized: KVM_RUN failed: Invalid argument To make debugging easier, exit with an error message as soon as one of the PMU ioctls fails, instead of waiting until the VCPU is first run. To avoid the repetition of assigning a new kvm_device_attr struct in the main body of pmu__generate_fdt_nodes(), which hinders readability of the function, move the struct to set_pmu_attr(). Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-8-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm: Make the PMUv3 emulation code arm64 specificAlexandru Elisei4-12/+10
KVM for aarch32 does not exist anymore, PMUv3 is a hardware feature present only on aarch64 CPUs, the command line option to enable the feature for a VCPU is aarch64 specific, the PMU code is called only from an aarch64 function and it compiles to an empty stub when ARCH=arm. There is no reason to have the PMUv3 emulation code in the common code area for arm and arm64, so move it to the arm64 directory, where it can be expanded in the future without fear of breaking aarch32 support. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-7-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm: Get rid of the ARM_VCPU_FEATURE_FLAGS() macroAlexandru Elisei3-9/+5
The ARM_VCPU_FEATURE_FLAGS() macro sets a feature bit in a rather convoluted way: if cpu_id is 0, then bit KVM_ARM_VCPU_POWER_OFF is 0, otherwise is set to 1. There's really no need for this indirection, especially considering that the macro has been changed to return the same value for both the arm and arm64 architectures. Replace it with a simple conditional statement in kvm_cpu__arch_init(), which makes it clearer to understand. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-6-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm: Move arch specific VCPU features to the arch specific functionAlexandru Elisei3-11/+13
KVM_CAP_ARM_EL1_32BIT and KVM_CAP_ARM_PMU_V3 are arm64 specific features. They are set based on arm64 specific command line options and they target arm64 hardware features. It makes little sense for kvmtool to set the features in the code that is shared between arm and arm64. Move the logic to set the feature bits to the arch specific function kvm_cpu__select_features(), which is already used by arm64 to set other arm64 specific features. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-5-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06arm/arm64: pmu.h: Add missing header guardsAlexandru Elisei1-0/+4
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06linux/bitops.h: Include wordsize.h to provide the __WORDSIZE defineAlexandru Elisei1-0/+2
Trying to build a source file which included bitops.h, but didn't also bring in the definition for __WORDSIZE (by including limits.h, for example) would result in the following error: include/linux/bitops.h:8:23: error: ‘__WORDSIZE’ undeclared (first use in this function) 8 | #define BITS_PER_LONG __WORDSIZE | ^~~~~~~~~~ The symbol is defined in the bits/wordsize.h header file, include it. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-05-06linux/err.h: Add missing stdbool.h includeAlexandru Elisei1-0/+2
Add missing header stdbool.h to avoid errors like this one, which can happen if the including file doesn't include stdbool.h: include/linux/err.h:33:15: error: type defaults to ‘int’ in declaration of ‘bool’ [-Werror=implicit-int] 33 | static inline bool __must_check IS_ERR(__force const void *ptr) | ^~~~ include/linux/err.h:33:15: error: variable ‘bool’ declared ‘inline’ [-Werror] include/linux/err.h:33:1: error: ‘warn_unused_result’ attribute only applies to function types [-Werror=attributes] 33 | static inline bool __must_check IS_ERR(__force const void *ptr) | ^~~~~~ include/linux/err.h:33:33: error: expected ‘,’ or ‘;’ before ‘IS_ERR’ 33 | static inline bool __must_check IS_ERR(__force const void *ptr) | ^~~~~~ include/linux/err.h:38:15: error: type defaults to ‘int’ in declaration of ‘bool’ [-Werror=implicit-int] 38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr) | ^~~~ include/linux/err.h:38:15: error: variable ‘bool’ declared ‘inline’ [-Werror] include/linux/err.h:38:1: error: ‘warn_unused_result’ attribute only applies to function types [-Werror=attributes] 38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr) | ^~~~~~ include/linux/err.h:38:15: error: redundant redeclaration of ‘bool’ [-Werror=redundant-decls] 38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr) | ^~~~ include/linux/err.h:33:15: note: previous declaration of ‘bool’ was here 33 | static inline bool __must_check IS_ERR(__force const void *ptr) | ^~~~ include/linux/err.h:38:33: error: expected ‘,’ or ‘;’ before ‘IS_ERR_OR_NULL’ 38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr) | ^~~~~~~~~~~~~~ include/linux/err.h: In function ‘PTR_ERR_OR_ZERO’: include/linux/err.h:58:6: error: implicit declaration of function ‘IS_ERR’ [-Werror=implicit-function-declaration] 58 | if (IS_ERR(ptr)) | ^~~~~~ include/linux/err.h:58:6: error: nested extern declaration of ‘IS_ERR’ [-Werror=nested-externs] 
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220412133231.35355-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-04aarch64: Add support for MTEAlexandru Elisei6-0/+31
MTE has been supported in Linux since commit 673638f434ee ("KVM: arm64: Expose KVM_ARM_CAP_MTE"), add support for it in kvmtool. MTE is enabled by default. Enabling the MTE capability incurs a cost, both in time (for each translation fault the tags need to be cleared), and in space (the tags need to be saved when a physical page is swapped out). This overhead is expected to be negligible for most users, but for those cases where it matters (like performance benchmarks), a --disable-mte option has been added. Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220328103328.18768-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-04update_headers.sh: Sync ABI headers with Linux v5.17Alexandru Elisei3-1/+41
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220328103328.18768-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-04Make --no-pvtime command argument arm specificSebastian Ene5-7/+6
The stolen time option is available only for aarch64 and is enabled by default. Move the option that disables stolen time functionality into the arch-specific path. Signed-off-by: Sebastian Ene <sebastianene@google.com> Link: https://lore.kernel.org/r/20220324154304.2572891-1-sebastianene@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-03-21Revert "kvm tools: Filter out CPU vendor string"Oliver Upton1-8/+0
This reverts commit bc0b99a2a74047707db73ba057743febf458fd90. Thanks to some digging from Andre [1], we know that kvmtool commit bc0b99a2a740 ("kvm tools: Filter out CPU vendor string") was intended to work around a guest kernel bug resulting from kernel commit 5bbc097d8904 ("x86, amd: Disable GartTlbWlkErr when BIOS forgets it"). Critically, KVM does not implement the MC4 mask MSR and instead injects a #GP into the guest. On guest kernels without commit d47cc0db8fd6 ("x86, amd: Use _safe() msr access for GartTlbWlk disable code") this is unexpected and causes a kernel oops. Since the kernel has taken the position to fix the bug in the guest and not KVM, there is no need for CPU vendor string filtering in kvmtool. Vendor string filtering is highly problematic for feature discovery, both in the kernel and userspace. As Andre noted, glibc depends on the vendor string to discover CPU features at runtime [2]. This has been generally innocuous, but as distributions begin to raise the minimum ISA guest userspace will quickly crash and burn on kvmtool. Hiding the vendor string also makes it impossible to test vendor-specific CPU features in kvmtool guest kernels. Given the fact that there are known dependencies in kernel and userspace on the CPU vendor string, allow the guest to see the native CPU vendor string. This has the potential to break certain guest kernels of 2011 vintage when running on an AMD Fam10h processor. Onus is on the guest to update its kernel at this point. Link: https://lore.kernel.org/kvm/20220311121042.010bbb30@donnerap.cambridge.arm.com/ Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/cpu-features.c;h=514226b37889;hb=HEAD#l398 Reported-by: Dongli Si <sidongli1997@gmail.com> Suggested-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Oliver Upton <oupton@google.com> Link: https://lore.kernel.org/r/20220318204938.496840-1-oupton@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-03-21Add --no-pvtime command line argumentSebastian Ene1-0/+2
The command line argument disables the stolen time functionality when it is specified. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Sebastian Ene <sebastianene@google.com> Link: https://lore.kernel.org/r/20220313161949.3565171-4-sebastianene@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-03-21aarch64: Add stolen time supportSebastian Ene8-2/+114
This patch adds support for stolen time by sharing a memory region with the guest which will be used by the hypervisor to store the stolen time information. Reserve a 64kb MMIO memory region after the RTC peripheral to be used by pvtime. The exact format of the structure stored by the hypervisor is described in the ARM DEN0057A document. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Sebastian Ene <sebastianene@google.com> Link: https://lore.kernel.org/r/20220313161949.3565171-3-sebastianene@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-03-21aarch64: Populate the vCPU struct before target->init()Sebastian Ene1-7/+7
Move the vCPU structure initialisation before the target->init() call to keep a reference to the kvm structure during init(). This is required by the pvtime peripheral to reserve a memory region while the vCPU is being initialised. Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sebastian Ene <sebastianene@google.com> Link: https://lore.kernel.org/r/20220313161949.3565171-2-sebastianene@google.com Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16arm: pci: Generate "msi-parent" property only with a MSI controllerAlexandru Elisei3-4/+9
The "msi-parent" PCI root complex property describes the MSI parent of the root complex. When the VM is created with a GICv2 or GICv3 irqchip (--irqchip=gicv3 or --irqchip=gicv2), there is no MSI controller present on the system and the corresponding phandle is not generated, leaving the "msi-parent" property to point to a non-existing phandle. Skip creating the "msi-parent" property when no MSI controller exists. Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220214165830.69207-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16arm: Use pr_debug() to print memory layout when loading a firmware imageAlexandru Elisei1-3/+5
When loading a kernel image, kvmtool is nice enough to print a message informing the user where the file was loaded in guest memory, which is very useful for debugging. Do the same for the firmware image. Commit e1c7c62afc7b ("arm: turn pr_info() into pr_debug() messages") changed various pr_info() into pr_debug() messages to stop kvmtool from cluttering stdout. Do the same when printing where the FDT has been copied when loading a firmware image. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220214165830.69207-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16Remove initrd magic checkAlexandru Elisei1-22/+0
Linux, besides CPIO, supports 7 different compressed formats for the initrd (gzip, bzip2, LZMA, XZ, LZO, LZ4, ZSTD), but kvmtool only recognizes one of them. Remove the initrd magic check because: 1. It doesn't bring much to the end user, as the Linux kernel still complains if the initrd is in an unknown format. 2. --kernel can be used to load something that is not a Linux kernel (like a kvm-unit-tests test), in which case a format which is not supported by a Linux kernel can still be perfectly valid. For example, kvm-unit-tests load the test environment as an initrd in plain ASCII format. 3. It cuts down on the maintenance effort when new formats are added to the Linux kernel. Not a big deal, since that doesn't happen very often, but it's still an effort with very little gain (see point #1 above). Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220214165830.69207-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16virtio/pci: Signal INTx interrupts as level instead of edgeMarc Zyngier2-2/+2
It appears that the way INTx is emulated is "slightly" out of spec in kvmtool. We happily inject an edge interrupt, even if the spec mandates a level. This doesn't change much for either the guest or userspace (only KVM will have a bit more work tracking the EOI), but at least this is correct. Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Sami Mujawar <sami.mujawar@arm.com> Cc: Will Deacon <will@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220131160242.2665191-1-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16x86: Set the correct APIC IDMuchun Song1-2/+4
When kvmtool boots a kernel, the dmesg will print the following message: [Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 APIC: 30 Fix this by setting up correct initial_apicid to cpu_id. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220216113735.52240-2-songmuchun@bytedance.com Signed-off-by: Will Deacon <will@kernel.org>
2022-02-16x86: Fix initialization of irq mptableMuchun Song1-1/+1
When dev_hdr->dev_num is greater than one, the initialization of last_addr is wrong. Fix it. Fixes: f83cd16 ("kvm tools: irq: replace the x86 irq rbtree with the PCI device tree") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20220216113735.52240-1-songmuchun@bytedance.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Generate PCI host DT nodeAnup Patel4-0/+115
This patch extends FDT generation to generate PCI host DT node. Of course, PCI host for Guest/VM is not useful at the moment because it's mostly for PCI pass-through and we don't have IOMMU and interrupt routing available for KVM RISC-V. In future, we might be able to use PCI host for VirtIO PCI transport or other software emulated PCI devices. Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-9-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Handle SBI calls forwarded to user spaceAnup Patel2-1/+96
The kernel KVM RISC-V module will forward certain SBI calls to user space. These forwarded SBI calls will usually be the SBI calls which cannot be emulated in kernel space, such as the PUTCHAR and GETCHAR calls. This patch extends kvm_cpu__handle_exit() to handle SBI calls forwarded to user space. Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-8-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Generate FDT at runtime for Guest/VMAnup Patel6-0/+255
We generate FDT at runtime for RISC-V Guest/VM so that KVMTOOL users don't have to pass FDT separately via command-line parameters. Also, we provide "--dump-dtb <filename>" command-line option to dump generated FDT into a file for debugging purpose. Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-7-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Add PLIC device emulationAnup Patel4-2/+526
The PLIC (platform level interrupt controller) manages peripheral interrupts in RISC-V world. The per-CPU interrupts are managed using CPU CSRs hence virtualized in-kernel by KVM RISC-V. This patch adds PLIC device emulation for KVMTOOL RISC-V. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> [For PLIC context CLAIM register emulation] Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-6-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Implement Guest/VM VCPU arch functionsAnup Patel2-7/+390
This patch implements kvm_cpu__<xyz> Guest/VM VCPU arch functions. These functions mostly deal with: 1. VCPU allocation and initialization 2. VCPU reset 3. VCPU show/dump code 4. VCPU show/dump registers We also save RISC-V ISA, XLEN, and TIMEBASE frequency for each VCPU so that it can be later used for generating Guest/VM FDT. Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-5-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Implement Guest/VM arch functionsAnup Patel2-6/+134
This patch implements all kvm__arch_<xyz> Guest/VM arch functions. These functions mostly deal with: 1. Guest/VM RAM initialization 2. Updating terminals on character read 3. Loading kernel and initrd images Firmware loading is not implemented currently because initially we will be booting kernel directly without any bootloader. In future, we will certainly support firmware loading. Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-4-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14riscv: Initial skeletal supportAnup Patel13-5/+440
This patch adds initial skeletal KVMTOOL RISC-V support which just compiles for RV32 and RV64 host. Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-3-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14update_headers: Sync-up ABI headers with Linux-5.16-rc1Anup Patel4-14/+557
We sync-up all ABI headers with Linux-5.16-rc1 so that RISC-V specific changes in include/linux/kvm.h are available. Signed-off-by: Anup Patel <anup.patel@wdc.com> Link: https://lore.kernel.org/r/20211119124515.89439-2-anup.patel@wdc.com Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14Makefile: Calculate the correct kvmtool versionhaibiao.xiao1-2/+2
Command 'lvm version' works incorrectly. It is expected to print: # ./lvm version # kvm tool [KVMTOOLS_VERSION] but the KVMTOOLS_VERSION is missing: # ./lvm version # kvm tool The KVMTOOLS_VERSION is defined in the KVMTOOLS-VERSION-FILE file, which is included at the end of the Makefile. CFLAGS is a 'simply expanded variable', which means it is only scanned once, so the definition of KVMTOOLS_VERSION at the end of the Makefile is not seen when CFLAGS is expanded and '-DKVMTOOLS_VERSION=' remains empty. I fixed the bug by moving the '-include $(OUTPUT)KVMTOOLS-VERSION-FILE' before the CFLAGS. Signed-off-by: haibiao.xiao <xiaohaibiao331@outlook.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211210030708.288066-1-haibiao.xiao@zstack.io Signed-off-by: Will Deacon <will@kernel.org>
2021-12-14arm/pci: update interrupt-map only for legacy interruptsSathyam Panda1-0/+10
The interrupt pin cell in the "interrupt-map" property is defined only for legacy interrupts, with a valid range of [1-4] corresponding to INTA#..INTD#. PCI endpoint devices that support an advanced interrupt mechanism like MSI or MSI-X should not have an entry with value 0 in "interrupt-map". This patch takes care of this problem by avoiding redundant entries. Signed-off-by: Sathyam Panda <sathyam.panda@arm.com> Reviewed-by: Vivek Kumar Gautam <vivek.gautam@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211111120231.5468-1-sathyam.panda@arm.com Signed-off-by: Will Deacon <will@kernel.org>
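The rule described in this commit can be sketched in C as follows. This is only an illustration, not the kvmtool code: struct demo_pci_dev and both helper names are hypothetical, and the only fact taken from the message is that interrupt pin values 1 to 4 (INTA#..INTD#) are valid while 0 means the device has no legacy interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal device record: only the PCI Interrupt Pin value. */
struct demo_pci_dev {
	uint8_t irq_pin;	/* 0 = no INTx, 1..4 = INTA#..INTD# */
};

/* A device needs an "interrupt-map" entry only if it uses a legacy pin. */
static int demo_uses_legacy_irq(const struct demo_pci_dev *dev)
{
	return dev->irq_pin >= 1 && dev->irq_pin <= 4;
}

/* Count how many interrupt-map entries would be emitted for devs[0..n). */
static size_t demo_count_map_entries(const struct demo_pci_dev *devs, size_t n)
{
	size_t i, count = 0;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (demo_uses_legacy_irq(&devs[i]))
			count++;	/* MSI/MSI-X only devices are skipped */
	return count;
}
```

A generator written this way would skip MSI-only devices instead of emitting an entry whose pin cell is 0.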
2021-10-13vfio/pci: Align MSIX Table and PBA size to guest maximum page sizeAlexandru Elisei6-2/+21
When allocating MMIO space for the MSI-X table, kvmtool rounds the allocation to the host's page size to make it as easy as possible for the guest to map the table to a page, if it wants to (and doesn't do BAR reassignment, like the x86 architecture for example). However, the host's page size can differ from the guest's on architectures which support multiple page sizes. For example, arm64 supports three different page size, and it is possible for the host to be using 4k pages, while the guest is using 64k pages. To make sure the allocation is always aligned to a guest's page size, round it up to the maximum architectural page size. Do the same for the pending bit array if it lives in its own BAR. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-8-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-13vfio/pci: Print an error when offset is outside of the MSIX table or PBAAlexandru Elisei1-0/+9
Now that we keep track of the real size of MSIX table and PBA, print an error when the guest tries to write to an offset which is not inside the correct regions. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-7-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-13vfio/pci: Rework MSIX table and PBA physical size allocationAlexandru Elisei2-28/+42
When creating the MSIX table and PBA, kvmtool rounds up the table and pending bit array sizes to the host's page size. Unfortunately, when doing that, it doesn't take into account that the new size can exceed the device BAR size, leading to hard to diagnose errors for certain configurations. One theoretical example: PBA and table in the same 4k BAR, host's page size is 4k. In this case, table->size = 4k, pba->size = 4k, map_size = 4k, which means that pba->guest_phys_addr = table->guest_phys_addr + 4k, which is outside of the 4k MMIO range allocated for both structures. Another example, this time a real-world error that I encountered: happens with a 64k host booting a 4k guest, an RTL8168 PCIE NIC assigned to the guest. In this case, kvmtool sets table->size = 64k (because it's rounded to the host's page size) and pba->size = 64k. Truncated output of lspci -vv on the host: 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06) Subsystem: TP-LINK Technologies Co., Ltd. TG-3468 Gigabit PCI Express Network Adapter Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 255 Region 0: I/O ports at 1000 [size=256] Region 2: Memory at 40000000 (64-bit, non-prefetchable) [size=4K] Region 4: Memory at 100000000 (64-bit, prefetchable) [size=16K] [..] Capabilities: [b0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 [..] When booting the guest: [..] 
[ 0.207444] pci-host-generic 40000000.pci: host bridge /pci ranges: [ 0.208564] pci-host-generic 40000000.pci: IO 0x0000000000..0x000000ffff -> 0x0000000000 [ 0.209857] pci-host-generic 40000000.pci: MEM 0x0050000000..0x007fffffff -> 0x0050000000 [ 0.211184] pci-host-generic 40000000.pci: ECAM at [mem 0x40000000-0x4fffffff] for [bus 00] [ 0.212625] pci-host-generic 40000000.pci: PCI host bridge to bus 0000:00 [ 0.213647] pci_bus 0000:00: root bus resource [bus 00] [ 0.214429] pci_bus 0000:00: root bus resource [io 0x0000-0xffff] [ 0.215355] pci_bus 0000:00: root bus resource [mem 0x50000000-0x7fffffff] [ 0.216676] pci 0000:00:00.0: [10ec:8168] type 00 class 0x020000 [ 0.223771] pci 0000:00:00.0: reg 0x10: [io 0x6200-0x62ff] [ 0.239765] pci 0000:00:00.0: reg 0x18: [mem 0x50010000-0x50010fff] [ 0.244595] pci 0000:00:00.0: reg 0x20: [mem 0x50000000-0x50003fff] [ 0.246331] pci 0000:00:01.0: [1af4:1000] type 00 class 0x020000 [ 0.247278] pci 0000:00:01.0: reg 0x10: [io 0x6300-0x63ff] [ 0.248212] pci 0000:00:01.0: reg 0x14: [mem 0x50020000-0x500200ff] [ 0.249172] pci 0000:00:01.0: reg 0x18: [mem 0x50020400-0x500207ff] [ 0.250450] pci 0000:00:02.0: [1af4:1001] type 00 class 0x018000 [ 0.251392] pci 0000:00:02.0: reg 0x10: [io 0x6400-0x64ff] [ 0.252351] pci 0000:00:02.0: reg 0x14: [mem 0x50020800-0x500208ff] [ 0.253312] pci 0000:00:02.0: reg 0x18: [mem 0x50020c00-0x50020fff] [ 0.254760] pci 0000:00:00.0: BAR 4: assigned [mem 0x50000000-0x50003fff] (1) [ 0.255805] pci 0000:00:00.0: BAR 2: assigned [mem 0x50004000-0x50004fff] (2) Warning: [10ec:8168] Error activating emulation for BAR 2 Warning: [10ec:8168] Error activating emulation for BAR 2 [ 0.260432] pci 0000:00:01.0: BAR 2: assigned [mem 0x50005000-0x500053ff] Warning: [1af4:1000] Error activating emulation for BAR 2 Warning: [1af4:1000] Error activating emulation for BAR 2 [ 0.261469] pci 0000:00:02.0: BAR 2: assigned [mem 0x50005400-0x500057ff] Warning: [1af4:1001] Error activating emulation for BAR 2 Warning: 
[1af4:1001] Error activating emulation for BAR 2 [ 0.262499] pci 0000:00:00.0: BAR 0: assigned [io 0x1000-0x10ff] [ 0.263415] pci 0000:00:01.0: BAR 0: assigned [io 0x1100-0x11ff] [ 0.264462] pci 0000:00:01.0: BAR 1: assigned [mem 0x50005800-0x500058ff] Warning: [1af4:1000] Error activating emulation for BAR 1 Warning: [1af4:1000] Error activating emulation for BAR 1 [ 0.265481] pci 0000:00:02.0: BAR 0: assigned [io 0x1200-0x12ff] [ 0.266397] pci 0000:00:02.0: BAR 1: assigned [mem 0x50005900-0x500059ff] Warning: [1af4:1001] Error activating emulation for BAR 1 Warning: [1af4:1001] Error activating emulation for BAR 1 [ 0.267892] EINJ: ACPI disabled. [ 0.269922] virtio-pci 0000:00:01.0: virtio_pci: leaving for legacy driver [ 0.271118] virtio-pci 0000:00:02.0: virtio_pci: leaving for legacy driver [ 0.274122] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled [ 0.275930] printk: console [ttyS0] disabled [ 0.276669] 1000000.U6_16550A: ttyS0 at MMIO 0x1000000 (irq = 13, base_baud = 115200) is a 16550A [ 0.278058] printk: console [ttyS0] enabled [ 0.278058] printk: console [ttyS0] enabled [ 0.279304] printk: bootconsole [ns16550a0] disabled [ 0.279304] printk: bootconsole [ns16550a0] disabled [ 0.281252] 1001000.U6_16550A: ttyS1 at MMIO 0x1001000 (irq = 14, base_baud = 115200) is a 16550A [ 0.282842] 1002000.U6_16550A: ttyS2 at MMIO 0x1002000 (irq = 15, base_baud = 115200) is a 16550A [ 0.284611] 1003000.U6_16550A: ttyS3 at MMIO 0x1003000 (irq = 16, base_baud = 115200) is a 16550A [ 0.286094] SuperH (H)SCI(F) driver initialized [ 0.286868] msm_serial: driver initialized [ 0.287890] [drm] radeon kernel modesetting enabled. [ 0.288826] cacheinfo: Unable to detect cache hierarchy for CPU 0 [ 0.293321] loop: module loaded KVM_SET_GSI_ROUTING: Invalid argument At (1), the guest writes 0x50000000 into BAR 4 of the NIC (which holds the MSIX table and PBA), expecting that will cover only 16k of address space (the BAR size), up to 0x50003fff, inclusive. 
On the host side, in vfio_pci_bar_activate(), kvmtool will actually register for MMIO emulation the region 0x50000000-0x5000ffff (64k in total) for the MSIX table and 0x50010000-0x5001ffff (another 64k) for the PBA (kvmtool set table->size and pba->size to 64k when it aligned them to the host's page size). Then at step (2), the guest writes the next available address (from its point of view) into BAR 2 of the NIC, which is 0x50004000. On the host side, the PCI emulation layer will search all the regions that overlap with the BAR address range (0x50004000-0x50004fff) and will find none because, just like the guest, it uses the BAR size to check for overlaps. When vfio_pci_bar_activate() is reached, kvmtool will try to register memory for this region, but it is already registered for the MSIX table emulation and fails. The same scenario repeats for every following memory BAR, because the MSIX table and PBA use memory from 0x50000000 to 0x5001ffff. The error at the end, which finally terminates the VM, is caused by the guest trying to write to a totally different BAR, which vfio-pci interprets as a write to the MSI-X table because it falls in the 64k region that was registered for emulation. The IRQ ID is not a valid SPI number and gicv2m_update_routing() returns an error (and sets errno to EINVAL). Fix this by aligning the table and PBA size to 8 bytes to allow for qword accesses, like PCI 3.0 mandates. For the sake of simplicity, the PBA offset in a BAR, in case of a shared BAR, is kept the same as the offset of the physical device. One hopes that the device respects the recommendations set forth in PCI LOCAL BUS SPECIFICATION, REV. 3.0, section "MSI-X Capability and Table Structures". Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-6-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
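The arithmetic behind this failure and its fix can be sketched in C. The helpers below are hypothetical and are not the kvmtool implementation; they assume the usual MSI-X layout of 16 bytes per table entry and one pending bit per vector, and they round sizes up to 8 bytes as the message describes.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MSIX_ENTRY_SIZE	16	/* bytes per MSI-X table entry (assumed) */

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t demo_align_up(uint64_t x, uint64_t a)
{
	assert(a && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Table size for nr_entries vectors, aligned for qword accesses. */
static uint64_t demo_msix_table_size(unsigned int nr_entries)
{
	return demo_align_up((uint64_t)nr_entries * DEMO_MSIX_ENTRY_SIZE, 8);
}

/* PBA size: one pending bit per vector, rounded up to whole qwords. */
static uint64_t demo_msix_pba_size(unsigned int nr_entries)
{
	return demo_align_up(((uint64_t)nr_entries + 7) / 8, 8);
}

/* Do the ranges [a, a + a_len) and [b, b + b_len) overlap? */
static int demo_ranges_overlap(uint64_t a, uint64_t a_len,
			       uint64_t b, uint64_t b_len)
{
	return a < b + b_len && b < a + a_len;
}
```

With the values from the log (Count=4, BAR 4 at 0x50000000, BAR 2 at 0x50004000 with size 4K), a table region rounded up to 64k overlaps BAR 2, while the 64-byte table computed by demo_msix_table_size(4) does not, and the 8-byte PBA at offset 0x800 stays inside the 16K BAR.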
2021-10-13vfio/pci: Rename PBA offset in device descriptor to fd_offsetAlexandru Elisei2-4/+4
The MSI-X capability defines a PBA offset, which is the offset of the PBA array in the BAR that holds the array. kvmtool uses the field "pba_offset" in struct msix_cap (which represents the MSIX capability) to refer to the [PBA offset:BAR] field of the capability; and the field "offset" in the struct vfio_pci_msix_pba to refer to offset of the PBA array in the device descriptor created by the VFIO driver. As we're getting ready to add yet another field that represents an offset to struct vfio_pci_msix_pba, try to avoid ambiguities by renaming the struct's "offset" field to "fd_offset". No functional change intended. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-5-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-13pci: Fix pci_dev_* print macrosAlexandru Elisei1-5/+5
Evaluate the "pci_hdr" argument before attempting to dereference a field. This fixes cryptic errors like this one, which came about during a debugging session: vfio/pci.c: In function 'vfio_pci_bar_activate': include/kvm/pci.h:18:40: error: invalid type argument of '->' (have 'struct pci_device_header') pr_warning("[%04x:%04x] " fmt, pci_hdr->vendor_id, pci_hdr->device_id, ##__VA_ARGS__) ^~ vfio/pci.c:482:3: note: in expansion of macro 'pci_dev_warn' pci_dev_warn(&vdev->pci.hdr, "%s: BAR4\n", __func__); This is caused by the operator precedence rules in C, where pointer dereference via "->" has a higher precedence than taking the address with the ampersand symbol. When the macro is substituted, it becomes &vdev->pci.hdr->vendor_id and it dereferences vdev->pci.hdr, which is not a pointer, instead of dereferencing &vdev->pci.hdr, which is a pointer, and quite likely what the author intended. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
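A minimal sketch of the precedence problem and the usual remedy, which is to parenthesize the macro argument. The DEMO_* macros and both structs are hypothetical stand-ins written for this illustration; they are not the macros from include/kvm/pci.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for struct pci_device_header. */
struct pci_device_header {
	uint16_t vendor_id;
	uint16_t device_id;
};

/* Hypothetical container, mirroring the vdev->pci.hdr access in the error. */
struct demo_device {
	struct pci_device_header hdr;
};

/*
 * Broken form, shown for reference only (it does not compile when the
 * argument is &dev->hdr, because "->" binds tighter than "&"):
 *   #define DEMO_VENDOR(pci_hdr)	pci_hdr->vendor_id
 * Fixed form: the argument is parenthesized, so it is evaluated first.
 */
#define DEMO_VENDOR(pci_hdr)	((pci_hdr)->vendor_id)
#define DEMO_DEVICE(pci_hdr)	((pci_hdr)->device_id)

/* Format the "[vvvv:dddd]" prefix into buf and return snprintf()'s result. */
static int demo_format_id(char *buf, size_t len, const struct demo_device *dev)
{
	assert(buf != NULL && dev != NULL);
	return snprintf(buf, len, "[%04x:%04x]",
			(unsigned int)DEMO_VENDOR(&dev->hdr),
			(unsigned int)DEMO_DEVICE(&dev->hdr));
}
```

With the parentheses in place, callers can pass either a pointer variable or an address-of expression such as &dev->hdr.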
2021-10-13vfio/pci.c: Remove double include for assert.hAlexandru Elisei1-2/+0
assert.h is included twice, keep only one instance. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-13arm/gicv2m: Set errno when gicv2_update_routing() failsAlexandru Elisei1-4/+6
In case of an error when updating the routing table entries, irq__update_msix_route() uses perror to print an error message. gicv2m_update_routing() doesn't set errno, and instead returns the value that errno should have had, which can lead to failure messages like this: KVM_SET_GSI_ROUTING: Success Set errno in gicv2m_update_routing() to avoid such messages in the future. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20211012132510.42134-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
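The difference between returning an error code and setting errno can be illustrated with a short C sketch. All names below are hypothetical, the validity range is made up for the example, and the sign of the returned value is an assumption, since the message does not say whether the old code returned EINVAL or -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical validity check; the range is illustrative only. */
static int demo_irq_is_valid(int irq)
{
	return irq >= 32 && irq < 1020;
}

/* Old style: the error is only returned, errno is left untouched. */
static int demo_update_routing_old(int irq)
{
	if (!demo_irq_is_valid(irq))
		return -EINVAL;
	return 0;
}

/* New style: errno is set too, so a later perror() names the real cause. */
static int demo_update_routing_new(int irq)
{
	if (!demo_irq_is_valid(irq)) {
		errno = EINVAL;
		return -errno;
	}
	return 0;
}

/* Caller-side reporting in the style of irq__update_msix_route(). */
static void demo_report(int ret)
{
	assert(ret <= 0);	/* helpers return 0 or a negative error code */
	if (ret < 0)
		perror("KVM_SET_GSI_ROUTING");	/* message depends on errno */
}
```

If errno is still 0 when demo_report() runs, as it can be after the old-style helper, perror() describes "no error", which is how a message like "KVM_SET_GSI_ROUTING: Success" comes about.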
2021-10-12arm64: Be more permissive when parsing the kernel headerAlexandru Elisei1-8/+8
kvmtool complains loudly when it parses the kernel header and doesn't find what it expects, but unless it outright fails to read the kernel image, it will copy the image in the guest memory at the default offset of 0x80000. There's no technical reason to stop the user from loading payloads other than a Linux kernel with the --kernel option. These payloads can behave just like a kernel and can use an initrd (which is not possible with --firmware), but don't have the kernel header (like kvm-unit-tests), and the warnings kvmtool emits can be confusing for this type of payload. Change the warnings to debug statements, which can be enabled via the --debug kvmtool command line option, to make them disappear for these cases where they aren't really relevant. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-11-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12arm64: Use the default offset when the kernel image magic is not foundAlexandru Elisei1-2/+4
Commit fd0a05bd27dd ("arm64: Obtain text offset from kernel image") added support for getting the kernel offset from the kernel header. The code checks for the kernel header magic number, and if not found, prints a warning and continues searching for the kernel offset in the image. The -k/--kernel option can be used to load things which are not a Linux kernel, but behave like one, like a kvm-unit-tests test. The tests don't have a valid kernel header, and because kvmtool insists on searching for the offset, creating a virtual machine can fail with this message: $ ./vm run -c2 -m256 -k ../kvm-unit-tests/arm/cache.flat # lkvm run -k ../kvm-unit-tests/arm/cache.flat -m 256 -c 2 --name guest-7529 Warning: Kernel image magic not matching Warning: unable to translate host address 0x910100a502a00085 to guest Fatal: kernel image too big to contain in guest memory. The host address is a random number read from the test binary from the location where text_offset is found in the kernel header. Before the commit, the test was executing just fine: $ ./vm run -c2 -m256 -k ../kvm-unit-tests/arm/cache.flat # lkvm run -k ../kvm-unit-tests/arm/cache.flat -m 256 -c 2 --name guest-8105 INFO: IDC-DIC: dcache clean to PoU required INFO: IDC-DIC: icache invalidation to PoU required PASS: IDC-DIC: code generation SUMMARY: 1 tests Change kvm__arch_get_kern_offset() so it returns the default text_offset value if the kernel image magic number is not found, making it possible again to use something other than a Linux kernel with --kernel. Reported-by: Vivek Kumar Gautam <vivek.gautam@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-10-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
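The fallback described here can be sketched in C as below. This is not the body of kvm__arch_get_kern_offset(); the function and macro names are hypothetical, and the header layout (little-endian text_offset at byte 8, magic "ARM\x64" at byte 56 of a 64-byte header) is an assumption based on the Linux arm64 boot protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_DEFAULT_TEXT_OFFSET	0x80000ULL
#define DEMO_HDR_TEXT_OFFSET_POS	8	/* little-endian u64 (assumed) */
#define DEMO_HDR_MAGIC_POS		56	/* "ARM\x64" (assumed) */
#define DEMO_HDR_SIZE			64

/* Return the load offset for an image, or the default if it has no header. */
static uint64_t demo_get_kern_offset(const uint8_t *image, size_t len)
{
	static const uint8_t magic[4] = { 'A', 'R', 'M', 0x64 };
	uint64_t off = 0;
	int i;

	assert(image != NULL);

	/* Not a Linux image (or too short): do not trust the header fields. */
	if (len < DEMO_HDR_SIZE ||
	    memcmp(image + DEMO_HDR_MAGIC_POS, magic, sizeof(magic)) != 0)
		return DEMO_DEFAULT_TEXT_OFFSET;

	/* Decode text_offset byte by byte so host endianness does not matter. */
	for (i = 7; i >= 0; i--)
		off = (off << 8) | image[DEMO_HDR_TEXT_OFFSET_POS + i];
	return off;
}
```

A payload without the magic, such as a kvm-unit-tests binary, gets the default offset, so whatever bytes sit where text_offset would be are never interpreted as an address.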
2021-10-12Add --nodefaults command line argumentAlexandru Elisei4-4/+16
kvmtool attempts to make it as easy as possible for the user to run a VM by doing a few different things: it tries to create a rootfs filesystem in a directory if no disk or initrd is set by the user, and it adds various parameters to the kernel command line based on the VM configuration options. While this is generally very useful, today there isn't any way for the user to prohibit this behaviour, even though there are situations where this might not be desirable, like, for example: loading something which is not a kernel (kvm-unit-tests comes to mind, which expects test parameters on the kernel command line); the kernel has a built-in initramfs and there is no need to generate the root filesystem, or it is not possible; and what is probably the most important use case, when the user is actively trying to break things for testing purposes. Add a --nodefaults command line argument which disables everything that cannot be disabled via another command line switch. The purpose of this knob is not to disable the default options for arguments that can be set via the kvmtool command line, but rather to inhibit behaviour that cannot be disabled otherwise. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-8-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12builtin-run: Move kernel command line generation to a separate functionAlexandru Elisei1-46/+54
The real kernel command line is gradually generated in kvm_cmd_run_init() and it is interspersed with the initialization code. This means that both the code that generates the command line and the rest of the code is unnecessarily difficult to follow and to modify. Move the code that generates the command line to one function, to make it easier to understand, and to declutter kvm_cmd_run_init(). No functional change intended. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-7-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12Use kvm->nr_disks instead of kvm->cfg.image_countAlexandru Elisei3-12/+9
A user can specify multiple disk images using the --disk/-d argument. The callback for the argument ends up calling disk/core.c::disk_img_name_parser(), which increments kvm->cfg.image_count for each disk image. Immediately after parsing the arguments in kvm_cmd_run_init(), kvm->nr_disks is set to kvm->cfg.image_count, effectively making kvm->nr_disks an alias for kvm->cfg.image_count, as image_count is never changed afterward. Later on, the core disk code uses kvm->cfg.image_count when opening all the disk images, but kvm->nr_disks when closing them, which is inconsistent, but technically correct since they represent the same thing and have the same value. Let's remove all this confusing usage and use only kvm->nr_disks to represent the number of disk images specified by the user. While this technically means that kvmtool now supports up to INT_MAX disk images, in practice this is limited by MAX_DISK_IMAGES, which is equal to four. This means there are no functional changes. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-6-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12builtin-run: Abstract argument validation into a separate functionAlexandru Elisei1-13/+17
kvm_cmd_run_init() is a complex function which parses the command line arguments, configures various aspects of a VM (the size of the RAM, the number of CPUs, the network, the active console, the kernel command line, creates a custom rootfs, etc), and after the recent patches, also does a few checks against mutually exclusive kvmtool arguments. Make the function just that little bit easier to read by moving the argument validation into a separate function. No functional change intended. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-5-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12builtin-run: Do not attempt to find vmlinux if --firmwareAlexandru Elisei1-2/+4
kvm->vmlinux is used by symbol.c on x86 to translate a PC address to a kernel symbol when kvmtool exits unexpectedly. When the --firmware argument is used, a kernel image is not used for the VM, and the vmlinux file has no relevance in this case. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12builtin-run: Warn when ignoring initrd because --firmware was specifiedAlexandru Elisei1-0/+3
The firmware image is copied into the guest memory with the arch specific function kvm__load_firmware() in kvm__init(). That function ignores the initrd file, if the user specified one. Let the user know that the file is ignored by KVM and the --initrd argument does nothing with --firmware. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-10-12builtin-run: Treat specifying both --kernel and --firmware as an errorAlexandru Elisei1-0/+3
If the user specifies both the --kernel and the --firmware arguments, --firmware takes precedence and --kernel is silently ignored. Since kvmtool has no way of knowing what the user really intended, and guessing that --firmware is the right argument might prove to be quite unexpected for the user, be vocal about the incompatibility and refuse to create the VM. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210923144505.60776-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
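The check described above can be sketched as follows. This is only an illustration: `struct run_config`, its field names and `validate_image_args()` are hypothetical and are not taken from the kvmtool sources.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical subset of the run configuration; field names are assumptions. */
struct run_config {
	const char *kernel_filename;
	const char *firmware_filename;
};

/* Returns 0 when the configuration is acceptable, -1 otherwise. */
static int validate_image_args(const struct run_config *cfg)
{
	/* Refuse to guess which image the user really meant. */
	if (cfg->kernel_filename && cfg->firmware_filename) {
		fprintf(stderr, "--kernel and --firmware cannot both be specified\n");
		return -1;
	}
	return 0;
}
```

A caller would run this right after argument parsing and abort VM creation on a non-zero return.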
2021-08-31virtio/pci: Size the MSI-X bar according to the number of MSI-XMarc Zyngier1-12/+30
Since 45d3b59e8c45 ("kvm tools: Increase amount of possible interrupts per PCI device"), the number of MSI-X has gone from 4 to 33. However, the corresponding storage hasn't been upgraded, and writing to the MSI-X table is a pretty risky business. Now that the Linux kernel writes to *all* MSI-X entries before doing anything else with the device, kvmtool dies a horrible death. Fix it by properly defining the size of the MSI-X bar, and make Linux great again. This includes some fixes to the PBA region decoding, as well as minor cleanups to make this code a bit more maintainable. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20210827115405.1981529-1-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
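To make the sizing concrete, here is a rough sketch of how an MSI-X BAR size could be derived from the number of vectors. The 16-byte table entry and the one-bit-per-vector Pending Bit Array come from the PCI specification; the layout (table at offset 0, PBA right after it) and all function names are assumptions for illustration and may not match what kvmtool does.

```c
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u  /* bytes per MSI-X table entry (PCI spec) */

/* Bytes needed for a table with nr_entries vectors. */
static uint32_t msix_table_size(uint32_t nr_entries)
{
	return nr_entries * MSIX_ENTRY_SIZE;
}

/* The Pending Bit Array holds one bit per vector, packed into 64-bit words. */
static uint32_t msix_pba_size(uint32_t nr_entries)
{
	return ((nr_entries + 63) / 64) * 8;
}

/* Round up to the next power of two, since BAR sizes are powers of two. */
static uint32_t roundup_pow2(uint32_t v)
{
	uint32_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Assumed layout: table at offset 0, PBA placed directly after it. */
static uint32_t msix_bar_size(uint32_t nr_entries)
{
	return roundup_pow2(msix_table_size(nr_entries) + msix_pba_size(nr_entries));
}
```

With 33 vectors the table alone needs 528 bytes, so storage sized for 4 entries (64 bytes) is overrun as soon as the guest writes every entry.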
2021-08-31kvmtool: arm64: Configure VM with the minimal required IPA spaceMarc Zyngier1-1/+19
There is some value in keeping the IPA space small, as it reduces the size of the stage-2 page tables. Let's compute the required space at VM creation time, and inform the kernel of our requirements. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20210822152526.1291918-4-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
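As an illustration of "computing the required space", the sketch below derives the number of IPA bits from the highest guest physical address in use. The parameters `ram_base` and `ram_size`, the helper names, and the 32-bit lower bound are assumptions made for this example, not a copy of the kvmtool code.

```c
#include <stdint.h>

#define MIN_IPA_BITS 32u  /* assumed lower bound for the IPA size */

/* Number of bits needed to represent addr (0 for addr == 0). */
static unsigned int bits_needed(uint64_t addr)
{
	unsigned int bits = 0;

	while (addr) {
		bits++;
		addr >>= 1;
	}
	return bits;
}

/* ram_base and ram_size are hypothetical parameters describing guest RAM. */
static unsigned int required_ipa_bits(uint64_t ram_base, uint64_t ram_size)
{
	uint64_t highest = ram_base + ram_size - 1;
	unsigned int bits = bits_needed(highest);

	return bits < MIN_IPA_BITS ? MIN_IPA_BITS : bits;
}
```

For example, 1 GiB of RAM starting at 0x80000000 ends at 0xBFFFFFFF and fits in 32 bits, while 4 GiB at the same base ends at 0x17FFFFFFF and needs 33.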
2021-08-31kvmtool: arm64: Use the maximum supported IPA size when creating the VMMarc Zyngier2-3/+31
Instead of just asking for the default VM size, request the maximum IPA size from the kernel, and use this at VM creation time. The IPA space is parametrized accordingly. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20210822152526.1291918-3-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2021-08-31kvmtool: Abstract KVM_VM_TYPE into a weak functionMarc Zyngier2-1/+7
Most architectures pass a fixed value for their VM type. However, arm64 uses it as a parameter describing the size of the guest's physical address space. In order to support this, introduce a kvm__get_vm_type() helper that only returns KVM_VM_TYPE for now. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Oliver Upton <oupton@google.com> Link: https://lore.kernel.org/r/20210822152526.1291918-2-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2021-07-16arm/arm64: vfio: Add PCI Express Capability StructureAlexandru Elisei2-0/+42
It turns out that some Linux drivers (like Realtek R8169) fall back to a device-specific configuration method if the device is not PCI Express capable: [ 1.433825] r8169 0000:00:00.0 enp0s0: No native access to PCI extended config space, falling back to CSI Add the PCI Express Capability Structure and populate it for assigned devices, as this is how the Linux PCI driver determines if a device is PCI Express capable. Because we don't emulate a PCI Express link, a root complex or any slot related properties, the PCI Express capability is kept as small as possible by ignoring those fields. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210713170631.155595-5-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-07-16arm/arm64: Add PCI Express 1.1 supportAlexandru Elisei5-20/+68
PCI Express comes with an extended addressing scheme, which directly translated into a bigger device configuration space (256->4096 bytes) and bigger PCI configuration space (16->256 MB), as well as mandatory capabilities (power management [1] and PCI Express capability [2]). However, our virtio PCI implementation implements version 0.9 of the protocol and it still uses transitional PCI device IDs, so we have opted to omit the mandatory PCI Express capabilities. For VFIO, the power management and PCI Express capability are left for a subsequent patch. [1] PCI Express Base Specification Revision 1.1, section 7.6 [2] PCI Express Base Specification Revision 1.1, section 7.8 Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20210713170631.155595-4-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
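The 4096-byte and 256 MB figures follow from the PCI Express enhanced configuration access mechanism (ECAM), where each function gets a 4 KiB window. A small sketch of the address arithmetic, with function names chosen for this example only:

```c
#include <stdint.h>

#define PCIE_CFG_SPACE_SIZE 4096u  /* per-function configuration space */

/* ECAM offset: bus[27:20], device[19:15], function[14:12], register[11:0]. */
static uint32_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
	return ((uint32_t)bus << 20) | ((uint32_t)(dev & 0x1f) << 15) |
	       ((uint32_t)(fn & 0x7) << 12) | (reg & 0xfffu);
}

/* Total window: 256 buses x 32 devices x 8 functions x 4096 bytes. */
static uint32_t ecam_window_size(void)
{
	return 256u * 32u * 8u * PCIE_CFG_SPACE_SIZE;
}
```

The same product with the legacy 256-byte space gives 16 MB, which matches the "16->256 MB" change quoted above.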
2021-07-16arm/fdt.c: Don't generate the node if generator function is NULLAlexandru Elisei1-1/+6
Print a more helpful debugging message when a MMIO device hasn't set a function to generate an FDT node instead of causing a segmentation fault by dereferencing a NULL pointer. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210713170631.155595-3-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
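The guard amounts to checking the callback before calling it. In the sketch below, `struct mmio_dev`, its fields and `generate_node()` are hypothetical stand-ins rather than kvmtool's real device header.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical device header; only the generator callback matters here. */
struct mmio_dev {
	const char *name;
	void (*generate_fdt_node)(void *fdt, struct mmio_dev *dev);
};

/* Returns 1 if a node was generated, 0 if the device was skipped. */
static int generate_node(void *fdt, struct mmio_dev *dev)
{
	if (!dev->generate_fdt_node) {
		/* Report the problem instead of dereferencing a NULL pointer. */
		fprintf(stderr, "debug: no FDT node generator for %s\n", dev->name);
		return 0;
	}
	dev->generate_fdt_node(fdt, dev);
	return 1;
}
```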
2021-07-16Move fdt_irq_fn typedef to fdt.hAlexandru Elisei3-1/+3
The device tree code passes the function generate_irq_prop() to MMIO devices to create the "interrupts" property. The typedef fdt_irq_fn is the type used to pass the function to the device. It makes more sense for the typedef to be in fdt.h with the rest of the device tree functions, so move it there. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210713170631.155595-2-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-04-22arm: Fail early if KVM_CAP_ARM_PMU_V3 is not supportedAlexandru Elisei2-5/+4
pmu__generate_fdt_nodes() checks if the host has support for PMU in a guest and prints a warning if that's not the case. However, this check is too late because the function is called after the VCPU has been created, and VCPU creation fails if KVM_CAP_ARM_PMU_V3 is not available with a rather unhelpful error: $ ./vm run -c1 -m64 -f selftest.flat --pmu # lkvm run --firmware selftest.flat -m 64 -c 1 --name guest-1039 Info: Placing fdt at 0x80200000 - 0x80210000 Fatal: Unable to initialise vcpu Move the check for KVM_CAP_ARM_PMU_V3 to kvm_cpu__arch_init() before the VCPU is created so the user can get a more useful error message. This also matches the behaviour of KVM_CAP_ARM_EL1_32BIT. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210415131725.105675-1-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18virtio: add support for vsockTianjia Zhang9-0/+503
The "run" command accepts a new option (--vsock <cid>) which specifies the guest CID. For instance: $ lkvm run --kernel ./bzImage --disk test --vsock 3 One can easily test this with https://github.com/stefanha/nc-vsock. In the guest: # modprobe vsock # nc-vsock -l 1234 In the host: # modprobe vhost_vsock # nc-vsock 3 1234 This patch comes from the early submission of G. Campana. On this basis, I fixed the compilation errors and runtime crashes. Thanks for the work done by G. Campana. https://patchwork.kernel.org/patch/9542313/ Signed-off-by: G. Campana <gcampana+kvm@quarkslab.com> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Link: https://lore.kernel.org/r/20200915094402.107988-1-tianjia.zhang@linux.alibaba.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/rtc: ARM/arm64: Use MMIO at higher addressesAndre Przywara2-10/+21
Using the RTC device at its legacy I/O address as set by IBM in 1981 was a kludge we used for simplicity on ARM platforms as well. However this imposes problems due to their missing alignment and overlap with the PCI I/O address space. Now that we can switch a device easily between using ioports and MMIO, let's move the RTC out of the first 4K of memory on ARM platforms. That should be transparent for well behaved guests, since the change is naturally reflected in the device tree. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-23-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/serial: ARM/arm64: Use MMIO at higher addressesAndre Przywara2-19/+42
Using the UART devices at their legacy I/O addresses as set by IBM in 1981 was a kludge we used for simplicity on ARM platforms as well. However this imposes problems due to their missing alignment and overlap with the PCI I/O address space. Now that we can switch a device easily between using ioports and MMIO, let's move the UARTs out of the first 4K of memory on ARM platforms. That should be transparent for well behaved guests, since the change is naturally reflected in the device tree. Even "earlycon" keeps working, as the stdout-path property is adjusted automatically. People providing direct earlycon parameters via the command line need to adjust it to: "earlycon=uart,mmio,0x1000000". Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-22-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18arm: Reorganise and document memory mapAndre Przywara1-12/+29
The hardcoded memory map we expose to a guest is currently described using a series of partially interconnected preprocessor constants, which is hard to read and follow. In preparation for moving the UART and RTC to some different MMIO region, document the current map with some ASCII art, and clean up the definition of the sections. This changes the only internally used value of ARM_MMIO_AREA, to better align with its actual meaning and future extensions. No functional change. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-21-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18Remove ioport specific routinesAndre Przywara5-226/+1
Now that all users of the dedicated ioport trap handler interface are gone, we can retire the code associated with it. This removes ioport.c and ioport.h, along with removing prototypes from other header files. This also transfers the responsibility for port I/O trap handling entirely into the new routine in mmio.c. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-20-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18pci: Switch trap handling to use MMIO handlerAndre Przywara1-58/+24
With the planned retirement of the special ioport emulation code, we need to provide an emulation function compatible with the MMIO prototype. Merge the existing _in and _out handlers to adhere to that MMIO interface, and register these using the new registration function. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-19-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18virtio: Switch trap handling to use MMIO handlerAndre Przywara1-32/+14
With the planned retirement of the special ioport emulation code, we need to provide an emulation function compatible with the MMIO prototype. Adjust the existing MMIO callback routine to automatically determine the region this trap came through, and call the existing I/O handlers. Register the ioport region using the new registration function. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-18-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18vfio: Switch to new ioport trap handlersAndre Przywara1-27/+10
Now that the vfio device has a trap handler adhering to the MMIO fault handler prototype, let's switch over to the joint registration routine. This allows us to get rid of the ioport shim routines. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-17-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18vfio: Refactor ioport trap handlerAndre Przywara1-15/+36
With the planned retirement of the special ioport emulation code, we need to provide an emulation function compatible with the MMIO prototype. Adjust the I/O port trap handler to use that new function, and provide shims to implement the old ioport interface, for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-16-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/serial: Switch to new trap handlersAndre Przywara1-28/+3
Now that the serial device has a trap handler adhering to the MMIO fault handler prototype, let's switch over to the joint registration routine. This allows us to get rid of the ioport shim routines. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-15-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/serial: Refactor trap handlerAndre Przywara1-13/+37
With the planned retirement of the special ioport emulation code, we need to provide an emulation function compatible with the MMIO prototype. Adjust the trap handler to use that new function, and provide shims to implement the old ioport interface, for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-14-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/vesa: Switch trap handling to use MMIO handlerAndre Przywara1-14/+5
To be able to use the VESA device with the new generic I/O trap handler, we need to use the different MMIO handler callback routine. Replace the existing dummy in and out handlers with a joint dummy MMIO handler, and register this using the new registration function. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-13-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/rtc: Switch to new trap handlerAndre Przywara1-19/+2
Now that the RTC device has a trap handler adhering to the MMIO fault handler prototype, let's switch over to the joint registration routine. This allows us to get rid of the ioport shim routines. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-12-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/rtc: Refactor trap handlersAndre Przywara1-35/+35
With the planned retirement of the special ioport emulation code, we need to provide emulation functions compatible with the MMIO prototype. Merge the two different trap handlers into one function, checking for read/write and data/index register inside. Adjust the trap handlers to use that new function, and provide shims to implement the old ioport interface, for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-11-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18x86/ioport: Switch to new trap handlersAndre Przywara1-64/+37
Now that the x86 I/O ports have trap handlers adhering to the MMIO fault handler prototype, let's switch over to the joint registration routine. This allows us to get rid of the ioport shim routines. Since the debug output was done in ioport.c, we would lose this functionality when moving over to the MMIO handlers. So bring this back here explicitly, by introducing debug_io(). Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-10-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18x86/ioport: Refactor trap handlersAndre Przywara1-4/+26
With the planned retirement of the special ioport emulation code, we need to provide emulation functions compatible with the MMIO prototype. Adjust the trap handlers to use that new function, and provide shims to implement the old ioport interface, for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-9-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/i8042: Switch to new trap handlersAndre Przywara2-27/+4
Now that the PC keyboard has a trap handler adhering to the MMIO fault handler prototype, let's switch over to the joint registration routine. This allows us to get rid of the ioport shim routines. Make the kbd_init() function static on the way. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-8-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/i8042: Refactor trap handlerAndre Przywara1-34/+34
With the planned retirement of the special ioport emulation code, we need to provide an emulation function compatible with the MMIO prototype. Adjust the trap handler to use that new function, and provide shims to implement the old ioport interface, for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-7-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/i8042: Clean up data typesAndre Przywara1-13/+13
The i8042 is clearly an 8-bit era device, so there is little room for 32-bit registers. Clean up the data types used. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-6-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18mmio: Extend handling to include ioport emulationAndre Przywara3-16/+102
In their core functionality MMIO and I/O port traps are not really different, yet we still have two totally separate code paths for handling them. Devices need to decide on one conduit or need to provide different handler functions for each of them. Extend the existing MMIO emulation to also cover ioport handlers. This just adds another RB tree root for holding the I/O port handlers, but otherwise uses the same tree population and lookup code. "ioport" or "mmio" just become a flag in the registration function. Provide wrappers to not break existing users, and allow an easy transition for the existing ioport handlers. This also means that ioport handlers now can use the same emulation callback prototype as MMIO handlers, which means we have to migrate them over. To allow a smooth transition, we hook up the new I/O emulate function to the end of the existing ioport emulation code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-5-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
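The idea of one registration and lookup path with a flag selecting the address space can be sketched as below. This is a simplified illustration: it uses small fixed arrays where the commit message mentions RB trees, and every name in it (`io_register`, `io_emulate`, `enum io_bus`, `dummy_handler`) is made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One callback prototype shared by MMIO and I/O port handlers. */
typedef void (*io_fn)(uint64_t addr, uint8_t *data, uint32_t len,
		      bool is_write, void *ptr);

struct io_range {
	uint64_t start, len;
	io_fn fn;
	void *ptr;
};

#define MAX_RANGES 16
enum io_bus { BUS_MMIO = 0, BUS_IOPORT = 1 };

/* One table per address space; arrays keep the sketch short. */
static struct io_range ranges[2][MAX_RANGES];
static size_t nr_ranges[2];

static int io_register(enum io_bus bus, uint64_t start, uint64_t len,
		       io_fn fn, void *ptr)
{
	if (nr_ranges[bus] == MAX_RANGES)
		return -1;
	ranges[bus][nr_ranges[bus]++] = (struct io_range){ start, len, fn, ptr };
	return 0;
}

/* Shared lookup and dispatch; returns false if nothing claims the address. */
static bool io_emulate(enum io_bus bus, uint64_t addr, uint8_t *data,
		       uint32_t len, bool is_write)
{
	for (size_t i = 0; i < nr_ranges[bus]; i++) {
		struct io_range *r = &ranges[bus][i];

		if (addr >= r->start && addr - r->start < r->len) {
			r->fn(addr, data, len, is_write, r->ptr);
			return true;
		}
	}
	return false;
}

/* Example handler: reads return 0xff, writes are ignored. */
static void dummy_handler(uint64_t addr, uint8_t *data, uint32_t len,
			  bool is_write, void *ptr)
{
	(void)addr;
	(void)ptr;
	if (!is_write)
		for (uint32_t i = 0; i < len; i++)
			data[i] = 0xff;
}
```

A device that can live on either bus registers the same handler with a different `enum io_bus` value, which is what makes moving it between ioports and MMIO cheap.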
2021-03-18ioport: Retire .generate_fdt_node functionalityAndre Przywara2-38/+0
The ioport routines support a special way of registering FDT node generator functions. There is no reason to have this separate from the already existing way via the device header. Now that the only user of this special ioport variety has been transferred, we can retire this code, to simplify ioport handling. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-4-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18hw/serial: Use device abstraction for FDT generator functionAndre Przywara2-10/+41
At the moment we use the .generate_fdt_node member of the ioport ops structure to store the function pointer for the FDT node generator function. ioport__register() will then put a wrapper and this pointer into the device header. The serial device is the only device making use of this special ioport feature, so let's move this over to using the device header directly. This will allow us to get rid of this .generate_fdt_node member in the ops and simplify the code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-3-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-03-18ioport: Remove ioport__setup_arch()Andre Przywara6-24/+2
Since x86 had a special need for registering tons of special I/O ports, we had an ioport__setup_arch() callback, to allow each architecture to do the same. As it turns out no one uses it besides x86, so we remove that unnecessary abstraction. The generic function was registered via a device_base_init() call, so we just do the same for the x86 specific function only, and can remove the unneeded ioport__setup_arch(). Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20210315153350.19988-2-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-08-21update_headers.sh: Remove arm architectureAlexandru Elisei1-2/+1
KVM host support for the arm architecture was removed in commit 541ad0150ca4 ("arm: Remove 32bit KVM host support"). When trying to sync KVM headers we get this error message: $ util/update_headers.sh /path/to/linux cp: cannot stat '/path/to/linux/arch/arm/include/uapi/asm/kvm.h': No such file or directory Do not attempt to copy KVM headers for that architecture. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20200810153828.216821-1-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-08-21virtio: Fix ordering of virtio_queue__should_signal()Alexandru Elisei1-7/+8
The guest programs used_event in the avail ring to let the host know when it wants a notification from the device. The host notifies the guest when the used ring index passes used_event. It is possible for the guest to submit a buffer, and then go into uninterruptible sleep waiting for this notification.

The virtio-blk guest driver, in the notification callback virtblk_done(), increments the last known used ring index, then sets used_event to this value, which means it will get a notification after the next buffer is consumed by the host. virtblk_done() exits after the value of the used ring idx has been propagated from the host thread.

On the host side, the virtio-blk device increments the used ring index, then compares it to used_event to decide if a notification should be sent.

This is a common communication pattern between two threads, called store buffer. Memory barriers are needed in order for the pattern to work correctly, otherwise it is possible for the host to miss sending a required notification.

Initial state: vring.used.idx = 2, vring.used_event = 1 (idx passes used_event, which means kvmtool notifies the guest).

    GUEST (in virtblk_done())                 | KVMTOOL (in virtio_blk_complete())
                                              |
    (increment vq->last_used_idx = 2)         |
    // virtqueue_enable_cb_prepare_split():   | // virt_queue__used_idx_advance():
    write vring.used_event = 2                | write vring.used.idx = 3
    // virtqueue_poll():                      |
    mb()                                      | wmb()
    // virtqueue_poll_split():                | // virt_queue__should_signal():
    read vring.used.idx = 2                   | read vring.used_event = 1
    // virtblk_done() exits.                  | // No notification.

The write memory barrier on the host side is not enough to prevent reordering of the read in the kvmtool thread, which can lead to the guest thread waiting forever for IO to complete. Replace it with a full memory barrier to get the correct store buffer pattern described in the Linux litmus test SB+fencembonceonces.litmus, which forbids both threads reading the initial values.

Also move the barrier in virtio_queue__should_signal(), because the barrier is needed for notifications to work correctly, and it makes more sense to have it in the function that determines if the host should notify the guest. Reported-by: Anvay Virkar <anvay.virkar@arm.com> Suggested-by: Anvay Virkar <anvay.virkar@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/r/20200804145317.51633-1-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
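The host half of the store buffer pattern can be sketched with C11 atomics as follows. The variables `used_idx` and `used_event` are hypothetical stand-ins for the shared ring fields, `publish_and_check()` is an invented name, and the comparison in `need_event()` is the usual virtio event-index test; none of this is the kvmtool code itself.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the shared ring fields. */
static _Atomic uint16_t used_idx;
static _Atomic uint16_t used_event;

/* Virtio event-index test: did the used index move past 'event'? */
static bool need_event(uint16_t event, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event - 1) < (uint16_t)(new_idx - old_idx);
}

/* Host side: publish the new used index, then decide whether to notify. */
static bool publish_and_check(uint16_t old_idx, uint16_t new_idx)
{
	atomic_store_explicit(&used_idx, new_idx, memory_order_relaxed);
	/*
	 * Full barrier: orders the store above against the load below.
	 * A write-only barrier would not constrain the load.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	uint16_t event = atomic_load_explicit(&used_event, memory_order_relaxed);

	return need_event(event, new_idx, old_idx);
}
```

With the guest issuing its own full barrier between writing `used_event` and reading the used index, at least one side is guaranteed to observe the other's store, so either the host notifies or the guest sees the new index.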
2020-07-16arm64: Use default kernel offset when the image file can't be seekedMarc Zyngier1-3/+8
While introducing new code to extract the kernel offset from the image, commit fd0a05b ("arm64: Obtain text offset from kernel image") introduced a regression where something such as: ./lkvm run -c 8 -p earlycon <(zcat /boot/vmlinuz-5.8.0-rc5-00172-ga161216e31ba) now fails to load the kernel, as the file descriptor cannot be seeked. Let's assume the good old 0x80000 offset when the seek syscall fails, with a warning for good measure. Fixes: fd0a05b ("arm64: Obtain text offset from kernel image") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200716120801.2996-1-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
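A minimal sketch of the fallback, assuming a hypothetical helper `text_offset_or_default()` and a caller-supplied `read_offset` callback that parses the header; only `lseek()` and its failure on pipes are real behaviour relied on here.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define DEFAULT_TEXT_OFFSET 0x80000ULL

/*
 * Probe whether fd is seekable. If it is not (for example a pipe created
 * by shell process substitution), warn and fall back to the default.
 */
static uint64_t text_offset_or_default(int fd, uint64_t (*read_offset)(int fd))
{
	off_t cur = lseek(fd, 0, SEEK_CUR);

	if (cur == (off_t)-1) {
		fprintf(stderr, "warning: cannot seek image, assuming offset 0x%llx\n",
			(unsigned long long)DEFAULT_TEXT_OFFSET);
		return DEFAULT_TEXT_OFFSET;
	}

	uint64_t off = read_offset(fd);

	lseek(fd, cur, SEEK_SET);  /* restore the position for the loader */
	return off;
}
```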
2020-07-03kvmtool: arm64: Report missing support for 32bit guestsSuzuki K Poulose1-0/+4
When the host doesn't support 32bit guests, the kvmtool fails without a proper message on what is wrong. i.e, $ lkvm run -c 1 Image --aarch32 # lkvm run -k Image -m 256 -c 1 --name guest-105618 Fatal: Unable to initialise vcpu Given that there is no other easy way to check if the host supports 32bit guests, it is always good to report this by checking the capability, rather than leaving the users to hunt this down by looking at the code! After this patch: $ lkvm run -c 1 Image --aarch32 # lkvm run -k Image -m 256 -c 1 --name guest-105695 Fatal: 32bit guests are not supported Reported-by: Sami Mujawar <sami.mujawar@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20200701142002.51654-1-suzuki.poulose@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-06-08arm64: Obtain text offset from kernel imageMarc Zyngier6-5/+107
Recent changes made to Linux 5.8 have outlined that kvmtool hardcodes the text offset instead of reading it from the arm64 image itself. To address this, import the image header structure into kvmtool and do the right thing. 32bit guests are still loaded to their usual locations. While we're at it, check the image magic and default the text offset to 0x80000 when image_size is 0, as described in the kernel's booting.rst document. Reported-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20200608152801.1415902-1-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
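The header handling can be sketched as below. The field positions (text_offset at byte 8, image_size at byte 16, magic at byte 56) and the magic value follow the arm64 image header documented in the kernel's booting.rst; the function names and the byte-wise parsing are choices made for this example rather than the structure kvmtool imports.

```c
#include <stdint.h>

#define ARM64_IMAGE_MAGIC   0x644d5241u  /* "ARM\x64", little-endian */
#define DEFAULT_TEXT_OFFSET 0x80000ULL

/* Read a little-endian integer of n bytes from p. */
static uint64_t le_read(const uint8_t *p, unsigned int n)
{
	uint64_t v = 0;

	for (unsigned int i = 0; i < n; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/*
 * hdr points at the first 64 bytes of the image.
 * Returns 0 on success, -1 if the magic does not match.
 */
static int arm64_text_offset(const uint8_t hdr[64], uint64_t *offset)
{
	if (le_read(hdr + 56, 4) != ARM64_IMAGE_MAGIC)
		return -1;

	uint64_t image_size = le_read(hdr + 16, 8);

	/* image_size == 0 marks an old image: use the historical default. */
	*offset = image_size ? le_read(hdr + 8, 8) : DEFAULT_TEXT_OFFSET;
	return 0;
}
```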
2020-05-19vfio: Trap MMIO access to BAR addresses which aren't page alignedAlexandru Elisei1-0/+9
KVM_SET_USER_MEMORY_REGION will fail if the guest physical address is not aligned to the page size. However, it is legal for a guest to program an address which isn't aligned to the page size. Trap and emulate MMIO accesses to the region when that happens. Without this patch, when assigning a Seagate Barracuda hard drive to a VM I was seeing these errors: [ 0.286029] pci 0000:00:00.0: BAR 0: assigned [mem 0x41004600-0x4100467f] Error: 0000:01:00.0: failed to register region with KVM Error: [1095:3132] Error activating emulation for BAR 0 [..] [ 10.561794] irq 13: nobody cared (try booting with the "irqpoll" option) [ 10.563122] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-seattle-00009-g909b20467ed1 #133 [ 10.563124] Hardware name: linux,dummy-virt (DT) [ 10.563126] Call trace: [ 10.563134] dump_backtrace+0x0/0x140 [ 10.563137] show_stack+0x14/0x20 [ 10.563141] dump_stack+0xbc/0x100 [ 10.563146] __report_bad_irq+0x48/0xd4 [ 10.563148] note_interrupt+0x288/0x378 [ 10.563151] handle_irq_event_percpu+0x80/0x88 [ 10.563153] handle_irq_event+0x44/0xc8 [ 10.563155] handle_fasteoi_irq+0xb4/0x160 [ 10.563157] generic_handle_irq+0x24/0x38 [ 10.563159] __handle_domain_irq+0x60/0xb8 [ 10.563162] gic_handle_irq+0x50/0xa0 [ 10.563164] el1_irq+0xb8/0x180 [ 10.563166] arch_cpu_idle+0x10/0x18 [ 10.563170] do_idle+0x204/0x290 [ 10.563172] cpu_startup_entry+0x20/0x40 [ 10.563175] rest_init+0xd4/0xe0 [ 10.563180] arch_call_rest_init+0xc/0x14 [ 10.563182] start_kernel+0x420/0x44c [ 10.563183] handlers: [ 10.563650] [<000000001e474803>] sil24_interrupt [ 10.564559] Disabling IRQ #13 [..] [ 11.832916] ata1: spurious interrupt (slot_stat 0x0 active_tag -84148995 sactive 0x0) [ 12.045444] ata_ratelimit: 1 callbacks suppressed With this patch, I don't see the errors and the device works as expected. 
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-13-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
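The decision described in this commit comes down to an alignment test on the address the guest programmed into the BAR. A minimal sketch, assuming a power-of-two page size and using a hypothetical helper name that does not come from the kvmtool source:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when the BAR address cannot be backed by a
 * KVM memslot and accesses must be trapped and emulated instead.
 * page_size is assumed to be a power of two.
 */
static bool bar_needs_mmio_trap(uint64_t guest_phys_addr, uint64_t page_size)
{
	return (guest_phys_addr & (page_size - 1)) != 0;
}
```

With 4K pages, the BAR 0 address 0x41004600 from the log above is not aligned, so it would take the trap-and-emulate path.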
2020-05-19arm/fdt: Remove 'linux,pci-probe-only' propertyJulien Thierry1-1/+0
PCI now supports configurable BARs. Get rid of the no longer needed, Linux-only, fdt property. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-12-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19pci: Implement reassignable BARsAlexandru Elisei3-49/+227
BARs are used by the guest to configure the access to the PCI device by writing the address to which the device will respond. The basic idea for adding support for reassignable BARs is straightforward: deactivate emulation for the memory region described by the old BAR value, and activate emulation for the new region. BAR reassignment can be done while device access is enabled and memory regions for different devices can overlap as long as no access is made to the overlapping memory regions. This means that it is legal for the BARs of two distinct devices to point to an overlapping memory region, and indeed, this is how Linux does resource assignment at boot. To account for this situation, the simple algorithm described above is enhanced to scan for all devices and: - Deactivate emulation for any BARs that might overlap with the new BAR value. - Enable emulation for any BARs that were overlapping with the old value after the BAR has been updated. Activating/deactivating emulation of a memory region has side effects. In order to prevent the execution of the same callback twice we now keep track of the state of the region emulation. For example, this can happen if we program a BAR with an address that overlaps a second BAR, thus deactivating emulation for the second BAR, and then we disable all region accesses to the second BAR by writing to the command register. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-11-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
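The enhanced algorithm can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code from the patch: the structure and function names are hypothetical, the callbacks are reduced to flipping a flag, and the command register state is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bar_region {
	uint64_t start;
	uint64_t size;
	bool active;	/* tracked so the same callback never runs twice */
};

static bool regions_overlap(uint64_t a, uint64_t a_size,
			    uint64_t b, uint64_t b_size)
{
	return a < b + b_size && b < a + a_size;
}

static void set_active(struct bar_region *r, bool active)
{
	if (r->active == active)
		return;		/* already in that state: skip the callback */
	r->active = active;	/* real code would (un)register emulation here */
}

static void bar_reassign(struct bar_region *bars, size_t n, size_t idx,
			 uint64_t new_start)
{
	struct bar_region *bar = &bars[idx];
	uint64_t old_start = bar->start;

	/* Stop emulating the old region and anything the new value hits. */
	set_active(bar, false);
	for (size_t i = 0; i < n; i++)
		if (i != idx && regions_overlap(new_start, bar->size,
						bars[i].start, bars[i].size))
			set_active(&bars[i], false);

	bar->start = new_start;
	set_active(bar, true);

	/* Re-enable BARs that overlapped only the old value. */
	for (size_t i = 0; i < n; i++) {
		if (i == idx)
			continue;
		if (regions_overlap(old_start, bar->size,
				    bars[i].start, bars[i].size) &&
		    !regions_overlap(new_start, bar->size,
				     bars[i].start, bars[i].size))
			set_active(&bars[i], true);
	}
}
```

Keeping the `active` flag in the region is what makes the second deactivation in the example from the commit message (overlap first, then a command register write) a no-op.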
2020-05-19pci: Toggle BAR I/O and memory space emulationAlexandru Elisei1-0/+42
During configuration of the BAR addresses, a Linux guest disables and enables access to I/O and memory space. When access is disabled, we don't stop emulating the memory regions described by the BARs. Now that we have callbacks for activating and deactivating emulation for a BAR region, let's use that to stop emulation when access is disabled, and re-activate it when access is re-enabled. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-10-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19pci: Implement callbacks for toggling BAR emulationAlexandru Elisei5-68/+258
Implement callbacks for activating and deactivating emulation for a BAR region. This is in preparation for allowing a guest operating system to enable and disable access to I/O or memory space, or to reassign the BARs. The emulated vesa device framebuffer isn't designed to allow stopping and restarting at arbitrary points in the guest execution. Furthermore, on x86, the kernel will not change the BAR addresses, which on bare metal are programmed by the firmware, so take the easy way out and refuse to activate/deactivate emulation for the BAR regions. We also take this opportunity to make the vesa emulation code more consistent by moving all static variable definitions to one place, at the top of the file. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-9-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19Don't allow more than one framebuffersAlexandru Elisei2-2/+5
A vesa device is used by the SDL, GTK or VNC framebuffers. Don't allow the user to specify more than one of these options because kvmtool will create identical vesa devices and bad things will happen: $ ./lkvm run -c2 -m2048 -k bzImage --sdl --gtk # lkvm run -k bzImage -m 2048 -c 2 --name guest-10159 Error: device region [d0000000-d012bfff] would overlap device region [d0000000-d012bfff] *** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 *** *** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 *** *** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 *** ======= Backtrace: ========= ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fae0ed447e5] ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fae0ed4d37a] (+0x777e5)[0x7fae0ed447e5] /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fae0ed447e5] /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fae0ed4d37a] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fae0ed5153c] *** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 *** /lib/x86_64-linux-gnu/libglib-2.0.so.0(g_string_free+0x3b)[0x7fae0f814dab] /lib/x86_64-linux-gnu/libglib-2.0.so.0(g_string_free+0x3b)[0x7fae0f814dab] /usr/lib/x86_64-linux-gnu/libgtk-3.so.0(+0x21121c)[0x7fae1023321c] /usr/lib/x86_64-linux-gnu/libgtk-3.so.0(+0x21121c)[0x7fae1023321c] ======= Backtrace: ========= Aborted (core dumped) The vesa device is explicitly created during the initialization phase of the above framebuffers. Also remove the superfluous check for their existence. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-8-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19vfio/pci: Don't write configuration value twiceAlexandru Elisei1-2/+7
After writing to the device fd as part of the PCI configuration space emulation, we read back from the device to make sure that the write finished. The value is read back into the PCI configuration space and afterwards, the same value is copied by the PCI emulation code. Let's read from the device fd into a temporary variable, to prevent this double write. The double write is harmless in itself. But when we implement reassignable BARs, we need to keep track of the old BAR value, and the VFIO code is overwriting it. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-7-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19pci: Limit configuration transaction size to 32 bitsAlexandru Elisei1-0/+9
From PCI Local Bus Specification Revision 3.0, section 3.8 "64-Bit Bus Extension": "The bandwidth requirements for I/O and configuration transactions cannot justify the added complexity, and, therefore, only memory transactions support 64-bit data transfers". Further down, the spec also describes the possible responses of a target which has been requested to do a 64-bit transaction. Limit the transaction to the lower 32 bits, to match the second accepted behaviour. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-6-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
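One way to picture the limit is a configuration write helper that clamps the access size to four bytes before touching the emulated configuration space. The helper below is hypothetical and assumes the data buffer is little-endian, so that the first four bytes are the lower 32 bits of the value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PCI_CFG_MAX_ACCESS 4	/* configuration transactions are at most 32 bits */

/*
 * Hypothetical helper: copy at most 4 bytes of data into the emulated
 * configuration space and return the number of bytes actually written.
 */
static size_t pci_cfg_write(uint8_t *cfg_space, size_t offset,
			    const void *data, size_t size)
{
	if (size > PCI_CFG_MAX_ACCESS)
		size = PCI_CFG_MAX_ACCESS;
	memcpy(cfg_space + offset, data, size);
	return size;
}
```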
2020-05-19vfio: Reserve ioports when configuring the BARAlexandru Elisei2-7/+6
Let's be consistent and reserve ioports when we are configuring the BAR, not when we map it, just like we do with mmio regions. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-5-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19virtio/pci: Get emulated region address from BARsAlexandru Elisei2-33/+52
The struct virtio_pci fields port_addr, mmio_addr and msix_io_block represent the same addresses that are written in the corresponding BARs. Remove this duplication of information and always use the address from the BAR. This will make our life a lot easier when we add support for reassignable BARs, because we won't have to update the fields on each BAR change. No functional changes. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-4-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19pci: Add helpers for BAR values and memory/IO space accessAlexandru Elisei3-3/+56
We're going to be checking the BAR type, the address written to it and if access to memory or I/O space is enabled quite often when we add support for reassignable BARs; make our life easier by adding helpers for it. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-3-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19ioport: mmio: Use a mutex and reference counting for lockingAlexandru Elisei4-55/+143
kvmtool uses brlock for protecting accesses to the ioport and mmio red-black trees. brlock allows concurrent reads, but only one writer, which is assumed not to be a VCPU thread (for more information see commit 0b907ed2eaec ("kvm tools: Add a brlock")). This is done by issuing a compiler barrier on read and pausing the entire virtual machine on writes. When KVM_BRLOCK_DEBUG is defined, brlock instead uses a pthread read/write lock. When we implement reassignable BARs, the mmio or ioport mapping will be done as a result of a VCPU mmio access. When brlock is a pthread read/write lock, it means that we will try to acquire a write lock with the read lock already held by the same VCPU and we will deadlock. When it's not, a VCPU will have to call kvm__pause, which means the virtual machine will stay paused forever. Let's avoid all this by using a mutex and reference counting the red-black tree entries. This way we can guarantee that we won't unregister a node that another thread is currently using for emulation. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/1589470709-4104-2-git-send-email-alexandru.elisei@arm.com Signed-off-by: Will Deacon <will@kernel.org>
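The mutex plus reference count scheme can be sketched as below. This is a simplified illustration, not the code from the patch: it tracks a single entry instead of a red-black tree, and every name in it is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct mmio_entry {
	uint64_t start, len;
	unsigned int refcount;	/* emulation handlers currently using the entry */
	bool remove;		/* unregister was requested while in use */
	bool registered;
};

static pthread_mutex_t mmio_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a reference before emulating an access; fails once removal started. */
static bool mmio_get(struct mmio_entry *e)
{
	bool ok;

	pthread_mutex_lock(&mmio_lock);
	ok = e->registered && !e->remove;
	if (ok)
		e->refcount++;
	pthread_mutex_unlock(&mmio_lock);
	return ok;
}

/* Drop the reference; the last user completes a deferred removal. */
static void mmio_put(struct mmio_entry *e)
{
	pthread_mutex_lock(&mmio_lock);
	e->refcount--;
	if (e->remove && e->refcount == 0)
		e->registered = false;
	pthread_mutex_unlock(&mmio_lock);
}

/* Unregister now if idle, otherwise mark the entry for removal. */
static void mmio_unregister(struct mmio_entry *e)
{
	pthread_mutex_lock(&mmio_lock);
	if (e->refcount == 0)
		e->registered = false;
	else
		e->remove = true;
	pthread_mutex_unlock(&mmio_lock);
}
```

The emulation callback itself runs between `mmio_get` and `mmio_put` with the mutex released, which is what lets a VCPU thread register or unregister regions from inside an MMIO handler without deadlocking.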
2020-05-19net: uip: Fix GCC 10 warning about checksum calculationAndre Przywara1-14/+12
GCC 10.1 generates a warning in net/uip/csum.c about exceeding a buffer limit in a memcpy operation: ------------------ In function 'memcpy', inlined from 'uip_csum_udp' at net/uip/csum.c:58:3: /usr/include/aarch64-linux-gnu/bits/string_fortified.h:34:10: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=] 34 | return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from net/uip/csum.c:1: net/uip/csum.c: In function 'uip_csum_udp': include/kvm/uip.h:132:6: note: at offset 0 to object 'sport' with size 2 declared here 132 | u16 sport; ------------------ This warning originates from the code taking the address of the "sport" member, then using that with some pointer arithmetic in a memcpy call. GCC now sees that the object is only a u16, so copying 12 bytes into it cannot be any good. It's somewhat debatable whether this is a legitimate warning, as there is enough storage at that place, and we knowingly use the struct and its variable-sized member at the end. However we can also rewrite the code, to not abuse the "&" operation of some *member*, but take the address of the struct itself. This makes the code less dodgy, and indeed appeases GCC 10. Reported-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://lore.kernel.org/r/20200518125649.216416-1-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-19rtc: Generate fdt node for the real-time clockAndre Przywara1-6/+38
On arm and arm64 we expose the Motorola RTC emulation to the guest, but never advertised this in the device tree. EDK-2 seems to rely on this device, but on its hardcoded address. To make this more future-proof, add a DT node with the address in it. EDK-2 can then read the proper address from there, and we can change this address later (with the flexible memory layout). Please note that an arm64 Linux kernel is not ready to use this device, there are some include files missing under arch/arm64 to compile the driver. I hacked this up in the kernel, just to verify this DT snippet is correct, but don't see much value in enabling this properly in Linux. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20200514094553.135663-1-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24pci: Move legacy IRQ assignment into devicesAndre Przywara5-16/+8
So far the (legacy) IRQ line for a PCI device is allocated in devices.c, which should actually not take care of that. Since we allocate all other device specific resources in the actual device emulation code, the IRQ should not be something special. Remove the PCI specific code from devices.c, and move the IRQ line allocation to the PCI code. This drops the IRQ line from the VESA device, since it does not use one. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24cfi-flash: Add support for mapping flash into guestAndre Przywara1-0/+47
At the moment we trap *every* access to the flash memory, even when we are in array read mode (which just directly copies from the storage array to the guest). To improve performance, allow cacheable mappings, and avoid fatal traps on unsupported instructions (on ARM), export a read-only memslot to the guest when the flash is in read-array mode. A guest does not need to trap on read accesses then. A write command (which always traps) will revoke this mapping if the read mode changes. This reduces the number of read traps from more than 800,000 to a few hundred when booting into the UEFI shell. Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24memslot: Add support for READONLY mappingsAndre Przywara2-4/+13
A KVM memslot has a flags field, which allows marking a region as read-only. Add another memory type bit to allow kvmtool-internal users to map a write-protected region. Write access would trap and can be handled by the MMIO emulation, which should register on the same guest address region. Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24vfio: Destroy memslot when unmapping the associated VAsAlexandru Elisei3-12/+99
When we want to map a device region into the guest address space, first we perform an mmap on the device fd. The resulting VMA is a mapping between host userspace addresses and physical addresses associated with the device. Next, we create a memslot, which populates the stage 2 table with the mappings between guest physical addresses and the device physical addresses. However, when we want to unmap the device from the guest address space, we only call munmap, which destroys the VMA and the stage 2 mappings, but doesn't destroy the memslot and kvmtool's internal mem_bank structure associated with the memslot. This has been perfectly fine so far, because we only unmap a device region when we exit kvmtool. This will change when we add support for reassignable BARs, and we will have to unmap vfio regions as the guest kernel writes new addresses in the BARs. This can lead to two possible problems: - We refuse to create a valid BAR mapping because of a stale mem_bank structure which belonged to a previously unmapped region. - It is possible that the mmap in vfio_map_region returns the same address that was used to create a memslot, but was unmapped by vfio_unmap_region. Guest accesses to the device memory will fault because the stage 2 mappings are missing, and this can lead to performance degradation. Let's do the right thing and destroy the memslot and the mem_bank struct associated with it when we unmap a vfio region. Set host_addr to NULL after the munmap call so we won't try to unmap an address which is currently used by the process for something else if vfio_unmap_region gets called twice. Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24Add emulation for CFI compatible flash memoryRaphael Gault6-2/+630
The EDK II UEFI firmware implementation requires some storage for the EFI variables, which is typically some flash storage. Since this is already supported on the EDK II side, we add a CFI flash emulation to kvmtool. This is backed by a file, specified via the --flash or -F command line option. Any flash writes done by the guest will immediately be reflected into this file (kvmtool mmap's the file). The flash will be limited to the nearest power-of-2 size, so only the first 2 MB of a 3 MB file will be used. This implements a CFI flash using the "Intel/Sharp extended command set", as specified in: - JEDEC JESD68.01 - JEDEC JEP137B - Intel Application Note 646 Some gaps in those specs have been filled by looking at real devices and other implementations (QEMU, Linux kernel driver). At the moment this relies on DT to advertise the base address of the flash memory (mapped into the MMIO address space) and is only enabled for ARM/ARM64. The emulation itself is architecture agnostic, though. This is one missing piece toward a working UEFI boot with kvmtool on ARM guests, the other is to provide writable PCI BARs, which is WIP. Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Raphael Gault <raphael.gault@arm.com> [Andre: rewriting and fixing] Signed-off-by: Andre Przywra <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
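The size limit mentioned above amounts to rounding the backing file size down to a power of two. A minimal sketch of that arithmetic, using a hypothetical helper name rather than anything taken from the kvmtool source:

```c
#include <stdint.h>

/* Hypothetical helper: largest power of two that is <= size (0 for size 0). */
static uint64_t flash_usable_size(uint64_t size)
{
	uint64_t p = 1;

	if (size == 0)
		return 0;
	while (p <= size / 2)
		p <<= 1;
	return p;
}
```

For a 3 MB file this yields 2 MB, matching the example in the commit message; a file that is already a power of two in size is used in full.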
2020-04-24virtio-mmio: Assign IRQ line directly before registering deviceAndre Przywara3-13/+2
At the moment the IRQ line for a virtio-mmio device is assigned in the generic device__register() routine in devices.c, by calling back into virtio-mmio.c. This does not only sound slightly convoluted, but also breaks when we try to register an MMIO device that is not a virtio-mmio device. In this case container_of will return a bogus pointer (as it assumes a struct virtio_mmio), and the IRQ allocation routine will corrupt some data in the device_header (for instance the first byte of the "data" pointer). Simply assign the IRQ directly in virtio_mmio_init(), before calling device__register(). This avoids the problem and looks actually much more straightforward. Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-24vfio: fix multi-MSI vector handlingLorenzo Pieralisi1-0/+8
A PCI device with a MSI capability enabling Multiple MSI messages (through the Multiple Message Enable field in the Message Control register[6:4]) is expected to drive the Message Data lower bits (number determined by the number of selected vectors) to generate the corresponding MSI messages writes on the PCI bus. Therefore, KVM expects the MSI data lower bits (a number of bits that depend on bits [6:4] of the Message Control register - which in turn control the number of vectors allocated) to be set-up by kvmtool while programming the MSI IRQ routing entries to make sure the MSI entries can actually be demultiplexed by KVM and IRQ routes set-up accordingly so that when an actual HW fires KVM can route it to the correct entry in the interrupt controller (and set-up a correct passthrough route for directly injected interrupt). Current kvmtool code does not set-up Message data entries correctly for multi-MSI vectors - the data field is left as programmed in the MSI capability by the guest for all vector entries, triggering IRQs misrouting. Fix it. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Signed-off-by: Will Deacon <will@kernel.org>
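The per-vector Message Data value follows from the rule above: with 2^MME vectors enabled, the low MME bits of the data identify the vector. The sketch below illustrates that computation; the function names are hypothetical and it is not the code from the patch.

```c
#include <stdint.h>

/* Multiple Message Enable lives in bits [6:4] of the Message Control register. */
static uint8_t msi_multiple_message_enable(uint16_t msg_ctrl)
{
	return (msg_ctrl >> 4) & 0x7;
}

/*
 * Hypothetical helper: Message Data for a given vector. The guest-programmed
 * base value keeps its upper bits; the low log2(nr_vectors) bits are replaced
 * by the vector index.
 */
static uint16_t msi_vector_data(uint16_t base_data, uint8_t mme,
				unsigned int vector)
{
	uint16_t mask = (uint16_t)((1u << mme) - 1);

	return (uint16_t)((base_data & ~mask) | (vector & mask));
}
```

With four vectors enabled (MME = 2) and a base value of 0x4041, the routing entries would carry 0x4040 through 0x4043 instead of four copies of 0x4041.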
2020-04-15ioport: Fail when registering overlapping portsAlexandru Elisei1-8/+1
If we try to register a range of ports which overlaps with another, already registered, I/O ports region then device emulation for that region will not work anymore. There's nothing sane that the ioport emulation layer can do in this case so refuse to allocate the port. This matches the behavior of kvm__register_mmio. There's no need to protect allocating a new ioport struct with a lock, so move the lock to protect the actual ioport insertion in the tree. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15hw/vesa: Set the size for BAR 0Alexandru Elisei1-0/+1
Implemented BARs have a non-zero address and a size. Let's set the size for BAR 0. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15hw/vesa: Don't ignore fatal errorsAlexandru Elisei1-8/+20
Failing an mmap call or creating a memslot means that device emulation will not work, so don't ignore it. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15Don't ignore errors registering a device, ioport or mmio emulationAlexandru Elisei11-43/+101
An error returned by device__register, kvm__register_mmio and ioport__register means that the device will not be emulated properly. Annotate the functions with __must_check, so we get a compiler warning when this error is ignored. And fix several instances where the caller returns 0 even if the function failed. Also make sure the ioport emulation code uses ioport_remove consistently. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
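The annotation relies on the GCC/Clang `warn_unused_result` attribute. The sketch below shows the pattern with a locally defined macro and hypothetical function names; it is meant to illustrate the idea, not to reproduce kvmtool's definitions.

```c
#include <errno.h>

/* Local stand-in for the kernel-style __must_check annotation. */
#define must_check __attribute__((warn_unused_result))

/* Hypothetical registration function that can fail. */
static must_check int register_region(unsigned long start, unsigned long len)
{
	(void)start;
	if (len == 0)
		return -EINVAL;
	return 0;
}

/* The caller propagates the error instead of returning 0 unconditionally. */
static int device_init(unsigned long start, unsigned long len)
{
	int ret = register_region(start, len);

	if (ret < 0)
		return ret;
	return 0;
}
```

Calling `register_region()` as a bare statement would now produce a compiler warning, which is how ignored failures get caught at build time.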
2020-04-15virtio: Don't ignore initialization failuresAlexandru Elisei11-46/+78
Don't ignore an error in the bus specific initialization function in virtio_init; don't ignore the result of virtio_init; and don't return 0 in virtio_blk__init and virtio_scsi__init when we encounter an error. Hopefully this will save some developer's time debugging faulty virtio devices in a guest. To take advantage of the cleanup function virtio_blk__exit, move appending the new device to the list before the call to virtio_init. Change virtio_net__exit to free all allocated net_dev devices on exit, matching what virtio_blk__exit does. To safeguard against this in the future, virtio_init has been annotated with the compiler attribute warn_unused_result. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15vfio/pci: Don't access unallocated regionsAlexandru Elisei1-3/+7
Don't try to configure a BAR if there is no region associated with it. Also move the variable declarations from inside the loop to the start of the function for consistency. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15vfio/pci: Ignore expansion ROM BAR writesAlexandru Elisei1-0/+3
To get the size of the expansion ROM, software writes 0xfffff800 to the expansion ROM BAR in the PCI configuration space. PCI emulation executes the optional configuration space write callback that a device can implement before emulating this write. kvmtool's implementation of VFIO doesn't have support for emulating expansion ROMs. However, the callback writes the guest value to the hardware BAR, and then it reads it back to the emulated BAR to make sure the write has completed successfully. After this, we return to regular PCI emulation and because the BAR is no longer 0, we write back to the BAR the value that the guest used to get the size. As a result, the guest will think that the ROM size is 0x800 after the subsequent read and we end up unintentionally exposing to the guest a BAR which we don't emulate. Let's fix this by ignoring writes to the expansion ROM BAR. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
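The guest-side size calculation explains where the 0x800 comes from. A minimal sketch of that standard sizing arithmetic, with a hypothetical helper name and the address mask written out locally:

```c
#include <stdint.h>

#define ROM_ADDRESS_MASK 0xfffff800u	/* address bits [31:11] of the ROM BAR */

/*
 * Hypothetical helper: size the guest derives from the value it reads back
 * after writing 0xfffff800 to the expansion ROM BAR. Zero address bits mean
 * that no ROM is implemented.
 */
static uint32_t rom_size_from_readback(uint32_t readback)
{
	uint32_t addr_bits = readback & ROM_ADDRESS_MASK;

	return addr_bits ? ~addr_bits + 1 : 0;
}
```

If the emulated BAR simply echoes 0xfffff800 back, the guest computes a size of 0x800; if the write is ignored and the BAR stays 0, the guest concludes there is no ROM.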
2020-04-15vfio/pci: Don't assume that only even numbered BARs are 64bitAlexandru Elisei1-1/+3
Not all devices have the bottom 32 bits of a 64 bit BAR in an even numbered BAR. For example, on an NVIDIA Quadro P400, BARs 1 and 3 are 64bit. Remove this assumption. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
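Whether a BAR is the lower half of a 64-bit pair is encoded in the BAR itself, not in its index: bit 0 distinguishes I/O from memory, and bits [2:1] equal to 0b10 mark a 64-bit memory BAR. The following sketch walks the six BARs on that basis; the names are hypothetical and it is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_BARS 6

/* A memory BAR (bit 0 clear) with type field [2:1] == 0b10 is 64-bit. */
static bool bar_is_64bit(uint32_t bar)
{
	return !(bar & 0x1) && ((bar >> 1) & 0x3) == 0x2;
}

/*
 * Hypothetical helper: return a bitmask of the BAR indices that hold the
 * lower 32 bits of a 64-bit BAR. The following BAR holds the upper 32 bits
 * and is skipped.
 */
static unsigned int find_64bit_bars(const uint32_t bars[NR_BARS])
{
	unsigned int mask = 0;

	for (unsigned int i = 0; i < NR_BARS; i++) {
		if (bar_is_64bit(bars[i]) && i + 1 < NR_BARS) {
			mask |= 1u << i;
			i++;	/* skip the upper half */
		}
	}
	return mask;
}
```

For a layout like the one in the commit message (an I/O BAR 0 followed by 64-bit BARs at indices 1 and 3), the result is 0xa.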
2020-04-15vfio/pci: Allocate correct size for MSIX table and PBA BARsAlexandru Elisei1-16/+52
kvmtool assumes that the BAR that holds the address for the MSIX table and PBA structure has a size which is equal to their total size and it allocates memory from MMIO space accordingly. However, when initializing the BARs, the BAR size is set to the region size reported by VFIO. When the physical BAR size is greater than the mmio space that kvmtool allocates, we can have a situation where the BAR overlaps with another BAR, in which case kvmtool will fail to map the memory. This was found when trying to do PCI passthrough with a PCIe Realtek r8168 NIC, when the guest was also using virtio-block and virtio-net devices: [..] [ 0.197926] PCI: OF: PROBE_ONLY enabled [ 0.198454] pci-host-generic 40000000.pci: host bridge /pci ranges: [ 0.199291] pci-host-generic 40000000.pci: IO 0x00007000..0x0000ffff -> 0x00007000 [ 0.200331] pci-host-generic 40000000.pci: MEM 0x41000000..0x7fffffff -> 0x41000000 [ 0.201480] pci-host-generic 40000000.pci: ECAM at [mem 0x40000000-0x40ffffff] for [bus 00] [ 0.202635] pci-host-generic 40000000.pci: PCI host bridge to bus 0000:00 [ 0.203535] pci_bus 0000:00: root bus resource [bus 00] [ 0.204227] pci_bus 0000:00: root bus resource [io 0x0000-0x8fff] (bus address [0x7000-0xffff]) [ 0.205483] pci_bus 0000:00: root bus resource [mem 0x41000000-0x7fffffff] [ 0.206456] pci 0000:00:00.0: [10ec:8168] type 00 class 0x020000 [ 0.207399] pci 0000:00:00.0: reg 0x10: [io 0x0000-0x00ff] [ 0.208252] pci 0000:00:00.0: reg 0x18: [mem 0x41002000-0x41002fff] [ 0.209233] pci 0000:00:00.0: reg 0x20: [mem 0x41000000-0x41003fff] [ 0.210481] pci 0000:00:01.0: [1af4:1000] type 00 class 0x020000 [ 0.211349] pci 0000:00:01.0: reg 0x10: [io 0x0100-0x01ff] [ 0.212118] pci 0000:00:01.0: reg 0x14: [mem 0x41003000-0x410030ff] [ 0.212982] pci 0000:00:01.0: reg 0x18: [mem 0x41003200-0x410033ff] [ 0.214247] pci 0000:00:02.0: [1af4:1001] type 00 class 0x018000 [ 0.215096] pci 0000:00:02.0: reg 0x10: [io 0x0200-0x02ff] [ 0.215863] pci 0000:00:02.0: reg 0x14: [mem 
0x41003400-0x410034ff] [ 0.216723] pci 0000:00:02.0: reg 0x18: [mem 0x41003600-0x410037ff] [ 0.218105] pci 0000:00:00.0: can't claim BAR 4 [mem 0x41000000-0x41003fff]: address conflict with 0000:00:00.0 [mem 0x41002000-0x41002fff] [..] Guest output of lspci -vv: 00:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06) Subsystem: TP-LINK Technologies Co., Ltd. TG-3468 Gigabit PCI Express Network Adapter Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 16 Region 0: I/O ports at 0000 [size=256] Region 2: Memory at 41002000 (64-bit, non-prefetchable) [size=4K] Region 4: Memory at 41000000 (64-bit, prefetchable) [size=16K] Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [b0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00001000 Let's fix this by allocating an amount of MMIO memory equal to the size of the BAR that contains the MSIX table and/or PBA. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
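To see how far apart the two sizes can be, the footprint of the MSI-X structures can be computed from the vector count: each table entry is 16 bytes, and the PBA holds one pending bit per vector, rounded up to whole 64-bit words. The helpers below are a hypothetical sketch of that arithmetic.

```c
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16	/* bytes per MSI-X table entry */

/* Bytes occupied by the MSI-X table for nr_vectors entries. */
static uint64_t msix_table_bytes(unsigned int nr_vectors)
{
	return (uint64_t)nr_vectors * MSIX_ENTRY_SIZE;
}

/* Bytes occupied by the PBA: one bit per vector, in 8-byte units. */
static uint64_t msix_pba_bytes(unsigned int nr_vectors)
{
	return (uint64_t)((nr_vectors + 63) / 64) * 8;
}
```

For the device in the lspci output above (Count=4), the table needs 64 bytes and the PBA 8 bytes, while BAR 4, which contains both, is 16K; allocating MMIO space by BAR size rather than by structure size avoids the overlap.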
2020-04-15virtio/pci: Make memory and IO BARs independentJulien Thierry1-23/+40
Currently, callbacks for memory BAR 1 call the IO port emulation. This means that the memory BAR needs I/O Space to be enabled whenever Memory Space is enabled. Refactor the code so the two types of BARs are independent. Also, unify ioport/mmio callback arguments so that they all receive a virtio_device. Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Julien Thierry <julien.thierry@arm.com> [Cosmetic changes wrt where local variables are initialized] Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Will Deacon <will@kernel.org>