Age | Commit message (Collapse) | Author | Files | Lines |
|
The arm code tries to align the memory allocation size to 2M to potentially
make use of the transparent hugepages. But this would be problematic if we
try to allocate from the hugetlbfs, where the allocation size could be more than
2M. Given we support up to 1G, let us leave it to the user to align the
requested memory when hugetlbfs is used.
Without the patch:
$ echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ mount -t hugetlbfs -o pagesize=1G none /root/hugemem/
$ lkvm run -m 1024 --hugetlbfs /root/hugemem/ ...
# lkvm run -k ... -m 1024 -c 6
Fatal: Can't ftruncate for mem mapping size 1075838976
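As a rough illustration of the sizing logic described above (a sketch only, not the actual kvmtool code; guest_alloc_size() and its parameters are hypothetical names), the mapping is padded by 2M only when hugetlbfs is not in use. For -m 1024 the padded value equals the size reported in the ftruncate error above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)

/*
 * Hypothetical helper: size of the host mapping that backs guest RAM.
 * Anonymous memory gets 2M of slack so the start can be aligned for THP;
 * hugetlbfs mappings are left exactly as requested by the user.
 */
static uint64_t guest_alloc_size(uint64_t ram_size, bool hugetlbfs)
{
	/* Guard against wrap-around when adding the slack. */
	assert(ram_size <= UINT64_MAX - SZ_2M);

	return hugetlbfs ? ram_size : ram_size + SZ_2M;
}
```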
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230405110905.669217-1-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This is a follow-up patch for [0] which proposed the --force-pci option
for riscv. As per the discussion, it was concluded to add a virtio-transport
option accepting four values (pci, pci-legacy, mmio, mmio-legacy).
With this change force-pci and virtio-legacy are both deprecated and
arm's default transport changes from MMIO to PCI as agreed in [0].
This is also true for riscv.
Nothing changes for other architectures.
[0]: https://lore.kernel.org/all/20230118172007.408667-1-rkanwal@rivosinc.com/
Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com>
Link: https://lore.kernel.org/r/20230320143344.404307-1-rkanwal@rivosinc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The default serial and RTC IO regions overlap with the PCI IO BAR
region, causing BAR 0 activation to fail. Move these devices
to the MMIO region, similar to ARM.
Given serial has been moved from 0x3f8 to 0x10000000, this
requires us to now pass earlycon=uart8250,mmio,0x10000000
from cmdline rather than earlycon=uart8250,mmio,0x3f8.
To avoid the need to change the address every time the tool
is updated, we can also just pass "earlycon" from cmdline
and guest then finds the type and base address by following
the Device Tree's stdout-path property.
Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20230203122934.18714-1-rkanwal@rivosinc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
By default, KVM RISC-V keeps all extensions available to the VCPU
enabled and KVMTOOL does not disable any extension.
We add --disable-<xyz> command-line options in KVMTOOL RISC-V to
allow users to explicitly disable certain extensions if they don't
want them.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-7-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When the Zicbom extension is available expose it to the guest.
Also provide the guest the size of the cache block through DT.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-6-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We'll need one of these helpers in the next patch in another file.
Let's proactively move them all now, since others may some day also
be useful.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-5-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The zihintpause extension allows software to use the PAUSE instruction to
reduce energy consumption while executing spin-wait code sequences. Add the
zihintpause extension to the device tree if it is supported by the host.
Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-4-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Svinval extension allows the guest OS to perform range based TLB
maintenance efficiently. Add the Svinval extension to the device
tree if it is supported by the host.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-3-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We update all UAPI headers based on Linux-6.1-rc1 so that we can
use latest features.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20221018140854.69846-2-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
GCC Version:
gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)
hw/i8042.c: In function ‘kbd_io’:
hw/i8042.c:153:19: error: ‘value’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
state.write_cmd = val;
~~~~~~~~~~~~~~~~^~~~~
hw/i8042.c:298:5: note: ‘value’ was declared here
u8 value;
^~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:508: hw/i8042.o] Error 1
Signed-off-by: hbuxiaofei <hbuxiaofei@gmail.com>
Link: https://lore.kernel.org/r/20221102080501.69274-1-hbuxiaofei@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Although the PCI Status register only contains read-only and
write-1-to-clear bits, we currently keep anything written there, which
can confuse a guest.
The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
Clear PCI_STATUS when setting up device"), which unconditionally writes
0xffff to the Status register in order to clear pending errors. Then the
EDAC driver sees the parity status bits set and attempts to clear them
by writing 0xc100, which in turn clears the Capabilities List bit.
Later on, when the virtio-pci driver starts probing, it assumes due to
missing capabilities that the device is using the legacy transport, and
fails to setup the device because of mismatched protocol.
Filter writes to the config space, keeping only those to writable
fields. Tighten the access size check while we're at it, to prevent
overflow. This is only a small step in the right direction, not a
foolproof solution, because a guest could still write both Command and
Status registers using a single 32-bit write. More work is needed for:
* Supporting arbitrary sized writes.
* Sanitizing accesses to capabilities, which are device-specific.
Also remove the old hack that filtered accesses. It was most likely
guarding against ROM BAR writes, which is now handled by the
pci_config_writable bitmap.
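The filtering idea can be sketched as follows. This is a simplified illustration, not the kvmtool implementation: it uses a per-register mask where the patch uses the pci_config_writable bitmap, and cfg_write16() and replay_status_writes() are hypothetical names. With an all-zero mask for the Status register, neither the 0xffff write nor the later 0xc100 write can clear the Capabilities List bit (bit 4 of the Status register):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS_CAP_LIST 0x10 /* Status bit 4: capabilities list present */

/*
 * Hypothetical helper: apply a 16-bit guest write to a config register,
 * keeping only the bits that 'writable' marks as guest-writable.
 */
static uint16_t cfg_write16(uint16_t old, uint16_t val, uint16_t writable)
{
	return (uint16_t)((old & ~writable) | (val & writable));
}

/* Replay the two guest writes from the report with a given mask. */
static uint16_t replay_status_writes(uint16_t writable)
{
	uint16_t status = PCI_STATUS_CAP_LIST;

	status = cfg_write16(status, 0xffff, writable); /* Linux clears errors */
	status = cfg_write16(status, 0xc100, writable); /* EDAC clears parity bits */

	return status;
}
```

Passing 0xffff as the mask reproduces the old "keep anything written" behaviour, which is how the Capabilities List bit was lost.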
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20221020173452.203043-1-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
VIRTIO_RING_F_EVENT_IDX is a bit position value, but
virtio_init_device_vq populates vq->use_event_idx by ANDing this value
directly to vdev->features.
Fix the check for this flag in virtio_init_device_vq.
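A minimal sketch of the difference, assuming only that VIRTIO_RING_F_EVENT_IDX is bit 29 as defined by the virtio spec (the helper names are illustrative and do not appear in kvmtool):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_RING_F_EVENT_IDX 29 /* a bit position, not a mask */

/* Test a single feature bit by converting its position to a mask. */
static bool has_feature(uint64_t features, unsigned int bit)
{
	assert(bit < 64);
	return (features & (1ULL << bit)) != 0;
}

/* Old behaviour: ANDs the bit *number* (0b11101) with the features. */
static bool use_event_idx_buggy(uint64_t features)
{
	return (features & VIRTIO_RING_F_EVENT_IDX) != 0;
}

/* Fixed check. */
static bool use_event_idx(uint64_t features)
{
	return has_feature(features, VIRTIO_RING_F_EVENT_IDX);
}
```

With the buggy form, a guest that negotiates only EVENT_IDX is treated as not using it, while any guest that sets feature bit 0, 2, 3 or 4 is treated as if it did.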
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Tu Dinh Ngoc <dinhngoc.tu@irit.fr>
Link: https://lore.kernel.org/r/20220929121858.156-1-dinhngoc.tu@irit.fr
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We have all MMIO devices under "/smb" DT node so the serial0 alias
path should have "/smb" prefix.
Fixes: 7c9aac003925 ("riscv: Generate FDT at runtime for Guest/VM")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220815101325.477694-6-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Sstc extension allows the guest OS to program the timer directly without
relying on the SBI call. The kernel detects the presence of Sstc extension
from the riscv,isa DT property. Add the Sstc extension to the device tree
if it is supported by the host.
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220815101325.477694-5-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The Svpbmt extension allows PTE based memory attributes in page tables.
This extension also allows Guest/VM to use PTE based memory attributes
in VS-stage page tables so let us add it to the Guest/VM ISA string when KVM
RISC-V supports it.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220815101325.477694-4-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The riscv,isa DT property only contains single letter base extensions
until now. However, there are also multi-letter extensions which were
ratified recently. Add a mechanism to append those extension details
to the device tree so that guest can leverage those.
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220815101325.477694-3-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We update all UAPI headers based on Linux-6.0-rc1 so that we can
use latest features.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220815101325.477694-2-apatel@ventanamicro.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When a script is specified for a guest nic setup, we fork() and execl()
the script when it is time to run it. However this is not
optimal, given we are running a VM. The fork() will trigger marking the
entire page-table of the current process as CoW, which will trigger
unmapping the entire stage2 page tables from the guest. Anyway, the
child process will exec the script as soon as we fork(), making all
these mm operations moot. Also, this operation could be problematic
for confidential compute VMs, where it may be expensive (and sometimes
destructive) to make changes to the stage2 page tables.
So, instead we could use vfork() and avoid the CoW and unmap of the stage2.
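For illustration, a vfork()-based launcher could look roughly like the sketch below. It is not the kvmtool code: run_script() and its single-argument interface are assumptions made for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run 'script' with one argument and return its exit
 * status, or -1 on failure. vfork() suspends the parent and lends its
 * address space to the child until exec, so nothing is marked CoW.
 */
static int run_script(const char *script, const char *arg)
{
	int status;
	pid_t pid;

	assert(script && arg);

	pid = vfork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: only exec or _exit() are safe after vfork(). */
		execl(script, script, arg, (char *)NULL);
		_exit(127);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status);
}
```

The child must not return from the function or touch the parent's data between vfork() and execl(), which is why the failure path uses _exit() instead of exit().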
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220809124816.2880990-1-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The arm, arm64, powerpc and riscv architectures require that libfdt is
installed on the system, however the library might not be available for
every architecture on the user's distro of choice. Or the static version of
the library, needed for the lkvm-static target, might be missing.
Fortunately, kvmtool has anticipated this situation and it includes
instructions to compile and install libfdt in the INSTALL file.
Unfortunately, those instructions do not always work (for example, because
the user is missing the needed permissions), leaving the user unable to
compile kvmtool.
As an alternative to installing libfdt system-wide, provide the
LIBFDT_DIR variable when compiling kvmtool. For example, when compiling
with the command:
$ make ARCH=<arch> CROSS_COMPILE=<cross_compile> LIBFDT_DIR=<dir>
kvmtool will link the executable against the static version of the library
located in LIBFDT_DIR/libfdt.a.
LIBFDT_DIR takes precedence over the system library, as there are valid
reasons to prefer a self-compiled library over the one that the distro
provides (like the system library being older).
Note that this will slightly increase the size of the executable. For the
arm64 architecture, the increase has been measured to be about 100KB, or
about 5% of the total executable size.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220722141448.168252-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Use calloc() to avoid uninitialized fields in the rng device.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20220722141731.64039-5-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since commit 2108c86d0623 ("virtio/pci: Signal INTx interrupts as level
instead of edge"), virtio uses level-triggered IRQs. Bring the modern
device up to date, by deasserting the IRQ line when the guest reads the
interrupt status register.
Fixes: 3bf79498e6d5 ("virtio: Add support for modern virtio-pci")
Reported-by: Sami Mujawar <sami.mujawar@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20220722141731.64039-4-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Variables set on the command-line are not overridden by normal
assignments. So when passing ARCH=x86_64 on the command-line, build
fails:
Makefile:227: *** This architecture (x86_64) is not supported in kvmtool.
Use the 'override' directive to force the ARCH reassignment.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220722141731.64039-3-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When running kvmtool after updating without doing a make clean, one
might run into strange issues such as:
Warning: Failed init: symbol_init
Fatal: Initialisation failed
or worse. This happens because symbol.o is not automatically rebuilt
after a change of headers, because .symbol.o.d is not in the $(DEPS)
variable. So if the layout of struct kvm_config changes, for example,
symbol.o that was built for an older version will try to read
kvm->vmlinux from the wrong location in struct kvm, and lkvm will die.
Add all .d files to $(DEPS). Also include $(STATIC_DEPS) which was
previously set but not used.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220722141731.64039-2-jean-philippe@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
pvtime uses ARM_PVTIME_BASE instead of ARM_PVTIME_SIZE for the size of the
memory region given to the guest, which causes the following error when
creating a flash device (via the -F/--flash command line argument):
Error: RAM (read-only) region [2000000-27fffff] would overlap RAM region [1020000-203ffff]
The read-only region represents the guest memory where the flash image is
copied by kvmtool. The region starting at 0x102_0000 (ARM_PVTIME_BASE) is
the pvtime region, which should be 64K in size. kvmtool erroneously creates
the region to be ARM_PVTIME_BASE in size instead, and the last address
becomes:
ARM_PVTIME_BASE + ARM_PVTIME_BASE - 1 = 0x102_0000 + 0x102_0000 - 1 = 0x203_ffff
which corresponds to the end of the region from the error message.
Do the right thing and make the pvtime memory region ARM_PVTIME_SIZE = 64K
bytes, as it was intended.
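The arithmetic above can be checked with a small sketch. The two constants are taken from this message; region_end() and regions_overlap() are hypothetical helpers, not kvmtool functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARM_PVTIME_BASE 0x1020000ULL
#define ARM_PVTIME_SIZE 0x10000ULL /* 64K */

/* Last address covered by a region of 'size' bytes starting at 'base'. */
static uint64_t region_end(uint64_t base, uint64_t size)
{
	assert(size > 0);
	return base + size - 1;
}

/* True if the inclusive ranges [a_start, a_end] and [b_start, b_end] overlap. */
static bool regions_overlap(uint64_t a_start, uint64_t a_end,
			    uint64_t b_start, uint64_t b_end)
{
	return a_start <= b_end && b_start <= a_end;
}
```

Using ARM_PVTIME_BASE as the size gives an end address of 0x203ffff, which runs into the flash region at 0x2000000; using ARM_PVTIME_SIZE gives 0x102ffff, which does not.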
Fixes: 7d4671e5d372 ("aarch64: Add stolen time support")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220629103905.24480-1-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
VIRTIO_PCI_F_SIGNAL_MSI is not a virtio feature but an internal flag.
Change it to bool to avoid confusion.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-13-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
According to the virtio spec, all vectors must be initialized to
VIRTIO_MSI_NO_VECTOR (0xffff). In 4.1.5.1.2.1 "Device Requirements:
MSI-X Vector Configuration":
The device MUST return vector mapped to a given event, (NO_VECTOR if
unmapped) on read of config_msix_vector/queue_msix_vector.
Currently we return 0, which is a valid MSI vector. Return NO_VECTOR
instead.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-12-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add modern MMIO transport to virtio, make it the default. Legacy transport
can be enabled with --virtio-legacy. The main change for MMIO is the queue
addresses. They are now 64-bit addresses instead of 32-bit PFNs. Apart
from that all changes for supporting modern devices are already
implemented.
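To make the queue address change concrete, here is a sketch of the two schemes. The function names are hypothetical and the register plumbing is omitted; only the address computation is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Legacy: the driver writes a 32-bit page frame number. */
static uint64_t legacy_queue_addr(uint32_t pfn, uint32_t page_size)
{
	assert(page_size != 0);
	return (uint64_t)pfn * page_size;
}

/* Modern: each ring address is written as two 32-bit halves. */
static uint64_t modern_queue_addr(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A legacy PFN forces each ring to start on a page boundary, whereas the 64-bit form lets the driver place the descriptor, available and used rings independently.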
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-11-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
To make space for the modern register layout, move the current code to
mmio-legacy.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-10-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add support for modern virtio-pci implementation (based on the 1.0 virtio
spec). We add a new transport, alongside MMIO and PCI-legacy. This is now
the default when selecting PCI, but users can still select the legacy
transport for all virtio devices by passing "--virtio-legacy" on the
command-line.
The main change in modern PCI is the way we address virtqueues, using
64-bit values instead of PFNs. To keep the queue configuration atomic the
device also gets a "queue enable" register. Configuration is also made
extensible by more feature bits and PCI capabilities. Scalability is
improved as well, as devices can have notification registers for each
virtqueue on separate pages. However this implementation keeps a single
notification register.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-9-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
To make space for the more recent virtio version, move the legacy bits of
virtio-pci to a different file.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-8-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Modern virtio uses more than 32 bits of features. Bump the feature
bitfield size to 64 bits.
virtio_set_guest_features() changes in behavior because it will now be
called multiple times, each time the guest writes to a 32-bit slice of
the features.
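A sketch of how 32-bit slices might be merged into the 64-bit feature set is shown below. set_features_slice() and its 'sel' parameter are illustrative names, modelled on the feature-select registers of the modern transports rather than copied from kvmtool.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: merge one 32-bit slice written by the guest into the
 * 64-bit feature set. 'sel' picks the slice: 0 for bits 0-31, 1 for 32-63.
 */
static uint64_t set_features_slice(uint64_t features, uint32_t sel, uint32_t val)
{
	unsigned int shift;

	assert(sel < 2);
	shift = 32 * sel;

	/* Clear the selected half, then install the new value. */
	features &= ~(0xffffffffULL << shift);
	return features | ((uint64_t)val << shift);
}
```

Because each call replaces only one half, the complete feature set is known only after the guest has written both slices.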
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-7-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We currently call VHOST_SET_BACKEND from notify_vq_gsi(), which can't
work with modern virtio because vhost checks that the virtqueue is
accessible when handling VHOST_SET_BACKEND, and the modern driver
initializes the MSIs before setting up the virtqueue. Move
VHOST_SET_BACKEND to init_vq().
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-6-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Legacy virtio drivers write to the I/O port BAR, and the modern virtio
device uses the MMIO BAR. Since vhost can only listen on one ioeventfd,
select the one that the guest will use.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-5-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The doorbell offset depends on the transport - virtio-legacy uses a
fixed offset, but modern virtio can have per-vq offsets. Add an offset
field to the virtio_pci structure.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-4-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Modern virtio will need to reuse this code when initializing a
virtqueue. It's not much, but still nicer to have next to exit_vq().
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-3-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
On exit_vq() and device reset, remove the MSI routes that were set up at
runtime.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220701142434.75170-2-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Allow the user to specify the RAM base address by using -m/--mem size@addr
command line argument. The base address must be above 2GB, so as not to
overlap with the MMIO I/O region.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-13-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add a new function, kvm__arch_default_ram_address(), which returns the
default address for guest RAM for each architecture.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-12-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
RAM initialization is unnecessarily split between kvm__init_ram() and
kvm__arch_init(). Move all code related to RAM initialization to
kvm__init_ram(), making the code easier to follow and to modify.
One thing to note is that the initialization order is slightly altered:
kvm__arch_enable_mte() and gic__create() are now called before mmap'ing the
guest RAM. That is perfectly fine, as they don't use the host's mapping of
the guest memory.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-11-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The kvm struct already contains a pointer to the configuration, which
contains both hugetlbfs_path and ram_size, so it is not necessary to pass
them as arguments to kvm__arch_init().
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-10-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Allow the user to use the standard B (bytes), K (kilobytes), M (megabytes),
G (gigabytes), T (terabytes) and P (petabytes) suffixes for memory size.
When none are specified, the default is megabytes.
Also raise an error if the user specifies 0 as the memory size, instead
of treating it as uninitialized, as kvmtool has done so far.
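The suffix handling could be sketched as follows. parse_mem_size() is a hypothetical stand-in for the real parser: it accepts only the upper-case suffixes listed above, and reporting errors by returning 0 is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser: convert "<number>[BKMGTP]" to bytes. Returns 0 on
 * error, which includes an explicit size of 0. No suffix means megabytes.
 */
static uint64_t parse_mem_size(const char *arg)
{
	unsigned long long val;
	unsigned int shift;
	char *end;

	assert(arg);

	val = strtoull(arg, &end, 10);
	if (end == arg || val == 0)
		return 0;

	switch (*end) {
	case 'B':  shift = 0;  break;
	case 'K':  shift = 10; break;
	case '\0': /* no suffix: megabytes */
	case 'M':  shift = 20; break;
	case 'G':  shift = 30; break;
	case 'T':  shift = 40; break;
	case 'P':  shift = 50; break;
	default:
		return 0;
	}

	/* Reject trailing characters after the suffix, and overflow. */
	if (*end != '\0' && end[1] != '\0')
		return 0;
	if (val > (UINT64_MAX >> shift))
		return 0;

	return (uint64_t)val << shift;
}
```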
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-9-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The ARM_HIMAP_MAX_MEMORY() is a remnant of a time when KVM only supported
40 bits of IPA. There are no users left for this macro, remove it.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-8-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For 32-bit guests, the maximum memory size is represented by the define
ARM_LOMAP_MAX_MEMORY, which ARM_MAX_MEMORY() returns.
For 64-bit guests, the RAM size is checked against the maximum allowed
by KVM in kvm__get_vm_type().
There are no users left for the ARM_MAX_MEMORY() macro, remove it.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-7-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For 64-bit guests, kvmtool exits with an error in kvm__get_vm_type() if
the memory size is larger than what KVM supports. For 32-bit guests, the
RAM size is silently rounded down to ARM_LOMAP_MAX_MEMORY in
kvm__arch_init().
Be consistent and exit with an error when the user has configured the
wrong RAM size for 32-bit guests.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Architectures are free to set their own command line options. Add an
architecture specific hook to validate these options.
For now, the hook does nothing, but it will be used in later patches.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
host_ram_size() uses sysconf() to calculate the available ram, and
sysconf() can fail. When that happens, host_ram_size() returns 0. kvmtool
warns the user when the configured VM ram size exceeds the size of the
host's memory, but doesn't take into account that host_ram_size() can
return 0. If the function returns zero, skip the warning.
Since this can only happen when the user sets the memory size (via the
-m/--mem command line argument), skip the check entirely if the user hasn't
set it. Move the check to kvm_run_validate_cfg(), as it checks for valid
user configuration.
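The check described above can be sketched like this. The sysconf() names are the glibc ones; ram_exceeds_host() is a hypothetical helper, and the real host_ram_size() in kvmtool may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Host RAM in bytes, or 0 if sysconf() cannot report it. */
static uint64_t host_ram_size(void)
{
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGE_SIZE);

	if (pages < 0 || page_size < 0)
		return 0;

	return (uint64_t)pages * (uint64_t)page_size;
}

/* Warn only when the host size is known and the request exceeds it. */
static bool ram_exceeds_host(uint64_t ram_size, uint64_t host_size)
{
	return host_size != 0 && ram_size > host_size;
}
```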
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The user can specify the virtual machine memory size in MB, which is saved
in cfg->ram_size. kvmtool validates it against the host memory size,
converted from bytes to MB. ram_size is then converted to bytes, and this
is how it is used throughout the rest of kvmtool.
To avoid any confusion about the unit of measurement, especially once the
user is allowed to specify the unit of measurement, always use ram_size in
bytes.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The help text for the -m/--mem argument states that the guest memory size
is in MiB (mebibyte). MiB is the same thing as MB (megabyte), and indeed
this is how MB is used throughout kvmtool.
Replace MiB with MB, so people don't get the wrong idea and start
believing that for kvmtool a MB is 10^6 bytes instead of 2^20.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-and-Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20220616134828.129006-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The GICv2 DT binding describes the third cell in each interrupt
descriptor as holding the trigger type, but also the CPU mask that this
IRQ applies to, in bits [15:8]. However this is not the case for GICv3,
where we don't use a CPU mask in the third cell: a simple mask wouldn't
fit for the many more supported cores anyway.
At the moment we fill this CPU mask field regardless of the GIC type,
for the PMU and arch timer DT nodes. This is not only the wrong thing to
do in case of a GICv3, but also triggers UBSAN splats when using more
than 30 cores, as we do shifting beyond what a u32 can hold:
$ lkvm run -k Image -c 31 --pmu
arm/timer.c:13:22: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
arm/timer.c:13:38: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
arm/timer.c:13:43: runtime error: left shift of 2147483647 by 8 places cannot be represented in type 'int'
arm/aarch64/pmu.c:202:22: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
arm/aarch64/pmu.c:202:38: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
arm/aarch64/pmu.c:202:43: runtime error: left shift of 2147483647 by 8 places cannot be represented in type 'int'
Fix that by adding a function that creates the mask by looking at the
GIC type first, and returning zero when a GICv3 is used. Also we
explicitly check for the CPU limit again, even though this would be
done before already, when we try to create a GICv2 VM with more than 8
cores.
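A sketch of such a helper is shown below. The enum, the macro names and the choice to clamp at 8 CPUs are assumptions made for illustration; the patch itself may handle the limit differently.

```c
#include <assert.h>
#include <stdint.h>

enum gic_type { GIC_V2, GIC_V3 }; /* hypothetical, for illustration */

#define GICV2_MAX_CPUS  8
#define PPI_CPU_SHIFT   8 /* CPU mask lives in bits [15:8] of the third cell */

/* CPU mask for the third interrupt cell: zero on GICv3, bits [15:8] on GICv2. */
static uint32_t gic_fdt_cpu_mask(enum gic_type type, unsigned int nr_cpus)
{
	assert(nr_cpus >= 1);

	if (type == GIC_V3)
		return 0;

	/* Assumption: clamp, so the shift below never exceeds 8 bits. */
	if (nr_cpus > GICV2_MAX_CPUS)
		nr_cpus = GICV2_MAX_CPUS;

	return ((1U << nr_cpus) - 1) << PPI_CPU_SHIFT;
}
```

Using an unsigned constant and a bounded shift count avoids the signed overflow that UBSAN reported for 31 cores.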
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220616145526.3337196-1-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The code for creating an MSI route is already duplicated between config
and virtqueue MSI. Modern virtio will need it as well, so move it to a
separate function.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-17-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The current virtio-block implementation assumes that buffers have a
specific layout (5.2.6.4 "Legacy Interface: Framing Requirements").
Modern virtio removes this layout constraint, so we have to be careful
when reading buffers. Note that since the Linux driver uses the same
layout as the legacy transport, arbitrary layouts were not actually
tested.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-16-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Our virtio-console implementation already supports ANY_LAYOUT, because
buffers are accessed with scatter-gather operations. Advertise the
VIRTIO_F_ANY_LAYOUT feature.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-15-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Modern virtio demands that devices do not make assumptions about the
buffer layouts. Currently the user network backend assumes that TX
packets are neatly split between virtio-net header and ethernet frame.
Modern virtio-net usually puts everything into one descriptor, but could
also split the buffer arbitrarily. Handle arbitrary buffer layouts and
advertise the VIRTIO_F_ANY_LAYOUT feature.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-14-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The virtio_net header contains a 'num_buffers' field, used when the
VIRTIO_NET_F_MRG_RXBUF feature is negotiated. The legacy driver does not
present this field when the feature is not negotiated. In that case the
header is 2 bytes smaller.
When using the modern virtio transport, the header always contains the
field and in addition the device MUST set it to 1 when the
VIRTIO_NET_F_MRG_RXBUF is not negotiated. Prepare for modern virtio
support by enabling this case once the 'legacy' flag is switched off.
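The two header sizes can be illustrated with a small sketch. The struct mirrors the field layout of the virtio-net header in the virtio spec, but struct vnet_hdr and vnet_hdr_len() are illustrative names rather than the kvmtool definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vnet_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint16_t num_buffers; /* absent for legacy without MRG_RXBUF */
};

/* Header length the device should expect for a given configuration. */
static size_t vnet_hdr_len(bool legacy, bool mrg_rxbuf)
{
	if (legacy && !mrg_rxbuf)
		return offsetof(struct vnet_hdr, num_buffers);

	return sizeof(struct vnet_hdr);
}
```

Only the legacy transport without VIRTIO_NET_F_MRG_RXBUF uses the shorter header; every modern configuration carries num_buffers.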
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-13-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
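The header-size rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct and function names (`vnet_hdr_sketch`, `vnet_hdr_size`, `vnet_hdr_finalize`) are hypothetical and are not kvmtool's own, and endianness conversion of the fields is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the virtio-net header (12 bytes, no padding). */
struct vnet_hdr_sketch {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint16_t num_buffers;	/* absent for legacy without MRG_RXBUF */
};

/* Size of the header the driver presents for a given negotiation result. */
static size_t vnet_hdr_size(bool legacy, bool mrg_rxbuf)
{
	size_t full = sizeof(struct vnet_hdr_sketch);

	/* Legacy drivers drop the 2-byte num_buffers field without MRG_RXBUF. */
	if (legacy && !mrg_rxbuf)
		return full - sizeof(uint16_t);
	return full;
}

/* Modern devices must report num_buffers = 1 when MRG_RXBUF is off. */
static void vnet_hdr_finalize(struct vnet_hdr_sketch *hdr, bool legacy,
			      bool mrg_rxbuf)
{
	assert(hdr != NULL);
	if (!legacy && !mrg_rxbuf)
		hdr->num_buffers = 1;	/* endianness conversion omitted */
}
```

Only the legacy case without VIRTIO_NET_F_MRG_RXBUF yields the shorter header; every other combination uses the full layout.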
|
|
The conversion of vnet header fields will be more difficult when
supporting the virtio ANY_LAYOUT feature. Since the uip backend doesn't
use the vnet header, and since tap can handle that conversion itself,
offload it to tap.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-12-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Existing IOV functions don't take the iovec size as parameter. This is
unfortunate because when parsing buffers split into header and body,
callers may want to know where the body starts in the iovec, after copying
the header. Add a function that does the same as memcpy_fromiovec, but
also allows iterating over the iovec.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-11-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
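A minimal sketch of such a helper follows, assuming a signature that takes the iovec pointer and count by reference so the caller can continue where the copy stopped. The name `iov_copy_and_advance` is hypothetical; the function actually added by the patch may differ in name and signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy up to len bytes from an iovec array into buf, advancing *iov and
 * *iovcount past the consumed data. Returns the number of bytes that could
 * NOT be copied (0 when the whole request was satisfied).
 */
static size_t iov_copy_and_advance(void *buf, struct iovec **iov,
				   size_t *iovcount, size_t len)
{
	char *dst = buf;

	assert(buf && iov && iovcount);

	while (len && *iovcount) {
		struct iovec *cur = *iov;
		size_t n = cur->iov_len < len ? cur->iov_len : len;

		memcpy(dst, cur->iov_base, n);
		dst += n;
		len -= n;

		/* Consume the copied bytes from the current element. */
		cur->iov_base = (char *)cur->iov_base + n;
		cur->iov_len -= n;

		/* Move on once the element is exhausted. */
		if (cur->iov_len == 0) {
			(*iov)++;
			(*iovcount)--;
		}
	}
	return len;
}
```

After copying a header this way, `*iov` points at the first byte of the body, wherever the driver chose to split the buffer.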
|
|
Now that devices have a status callback, they don't use
set_guest_features() anymore. The negotiated feature set is available in
struct virtio_device. Remove the callback from all devices.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-10-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Remove unused set_status() callback
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-9-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Some legacy virtio drivers expect to read the device-specific config in
guest endianness (2.5.3 "Legacy Interface: A Note on Device
Configuration Space endian-ness").
Kvmtool doesn't know the guest endianness until it can probe a VCPU. So
the config fields start in host endianness, and are swapped once the
guest is running. Currently this is done in set_guest_features(), but
that is too late because the driver is allowed to read config fields
before setting feature bits (2.5.2 "Device Requirements: Device
Configuration Space"). In addition some devices don't swap the fields,
and those that do swap the fields do it every time the guest writes the
feature register, which can't work if a device gets reset more than
once.
Initialize the config on device reset. Do it on every reset because in
theory multiple guests could run with different endianness during the
lifetime of the device.
Notes:
* the balloon device uses little-endian (5.5.4.0.0.1 "Legacy Interface:
Device configuration layout").
* the vsock device was introduced after virtio 0.9.5, hence doesn't
describe a legacy interface, but the Linux driver allows using the
legacy transport, and always reads the 64-bit guest_cid field as
little-endian.
* the specification does not describe the 9p device, but the Linux
driver uses guest-endian helpers.
* the specification does not explicitly forbid a driver from reading the
configuration at any time, but a driver must follow the sequence from
3.1.1 "Driver Requirements: Device Initialization", where the driver
is allowed to read the config after setting the DRIVER status bit. It
should therefore be safe to keep dealing with guest endianness only on
device reset, and not on the first config access.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-8-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
At the moment device-specific config access is tailored for a Linux
guest, which performs every access in 8 bits. But config access can have
any size, and modern virtio drivers must use the size of the accessed
field. Add helpers that generalize config accesses.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-7-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Modern virtio devices can use separate buffers for descriptors, available
and used rings. They can also use 64-bit addresses instead of 44-bit.
Rework the virtqueue initialization function to support modern virtio.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-6-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
All virtio devices perform the same set of operations when initializing
their virtqueues. Move it to virtio core.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-5-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The core already tells us whether a device is being started or stopped.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-4-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Don't test for VIRTIO__STATUS_STOP right after setting it.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-3-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Not all toolchains used to know about VIRTIO_CONFIG_S_NEEDS_RESET, so we
left it out of the status mask. Now that we include our own version of
virtio_config.h and we'll need it for virtio 1.0, add it back.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lore.kernel.org/r/20220607170239.120084-2-jean-philippe.brucker@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Fixes the following compilation issue:
include/linux/kernel.h:5:10: fatal error: asm/kernel.h: No such file
or directory
5 | #include "asm/kernel.h"
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Dao Lu <daolu@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Fixes: 0febaae00bb6 ("Add cpumask functions")
Link: https://lore.kernel.org/r/20220524180030.1848992-1-daolu@rivosinc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 45b4968e0de1 ("hw/serial: ARM/arm64: Use MMIO at higher addresses")
changed how the address for the UART is computed by using KVM_IOPORT_AREA.
The symbol is not defined for MIPS, which results in the following
compilation error:
hw/serial.c:21:27: error: ‘KVM_IOPORT_AREA’ undeclared here (not in a function); did you mean ‘KVM_MIPS_IOPORT_AREA’?
21 | #define serial_iobase_0 (KVM_IOPORT_AREA + 0x3f8)
| ^~~~~~~~~~~~~~~
hw/serial.c:29:27: note: in expansion of macro ‘serial_iobase_0’
29 | #define serial_iobase(nr) serial_iobase_##nr
| ^~~~~~~~~~~~~~
hw/serial.c:92:15: note: in expansion of macro ‘serial_iobase’
92 | .iobase = serial_iobase(0),
| ^~~~~~~~~~~~~
Before the commit, the serial was placed at addresses 0x3f8, 0x2f8,
0x3e8 and 0x2e8. However, MIPS puts the RAM at those addresses, up to
KVM_MMIO_START, which is 0x10000000. Meaning that serial device
emulation never worked, as those addresses were part of a valid memslot
representing memory. This has been the case since commit 7281a8db199b
("kvm tools, mips: Add MIPS support") from 2014.
A quick examination of the MIPS code reveals that the architecture relies
on hypercalls from the guest and the virtio console for input and output.
Since nobody complained about the missing serial device, assume that it is
indeed not needed and do not compile it for MIPS.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220525165704.186754-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 4639b72f61a3 ("arm64: Add --vcpu-affinity command line argument")
introduced the --vcpu-affinity command line argument to pin the VCPUs to a
given list of physical CPUs. Unfortunately, the affinity is set only for an
arm64 guest, leading to the following error when running a 32-bit guest on
a system with two or more PMUs:
KVM exit reason: 9 ("KVM_EXIT_FAIL_ENTRY")
Registers:
PC: 0x8000c608
PSTATE: 0x200000d3
SP_EL1: 0x0
LR: 0x0
*pc:
0x8000c608: 25 3f a0 e1 83 61 a0 e1
0x8000c610: 83 31 98 e7 04 10 82 e1
0x8000c618: 07 2c 81 e3 28 10 1b e5
0x8000c620: 03 20 82 e3 03 00 a0 e1
*lr:
Warning: unable to translate guest address 0x0 to host
0x00000000: <unknown>
0x00000008: <unknown>
0x00000010: <unknown>
0x00000018: <unknown>
# KVM compatibility warning.
virtio-net device was not detected.
While you have requested a virtio-net device, the guest kernel did not initialize it.
Please make sure that the guest kernel was compiled with CONFIG_VIRTIO_NET=y enabled in .config.
# KVM session ended normally.
Make the error go away by setting the affinity of the VCPUs for both 32-bit
and 64-bit guests.
Fixes: 4639b72f61a3 ("arm64: Add --vcpu-affinity command line argument")
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220525165704.186754-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit a08bb43a0c37 ("kvmtool: Copy Linux' up-to-date virtio headers")
copied in some of the virtio UAPI headers from the kernel tree, but
didn't include all of them, as we were relying on some of them being
provided by the distribution.
Now commit bc77bf49df6e ("stat: Add descriptions for new virtio_balloon
stat types") used some newer virtio balloon symbols, that some older
distros (e.g. Ubuntu 18.04) do not carry, which breaks compilation
there:
=======================
CC builtin-stat.o
builtin-stat.c: In function 'do_memstat':
builtin-stat.c:86:8: error: 'VIRTIO_BALLOON_S_HTLB_PGALLOC' undeclared (first use in this function); did you mean 'VIRTIO_BALLOON_S_AVAIL'?
case VIRTIO_BALLOON_S_HTLB_PGALLOC:
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VIRTIO_BALLOON_S_AVAIL
builtin-stat.c:86:8: note: each undeclared identifier is reported only once for each function it appears in
=======================
To fix this include the remaining virtio headers (those that we actually
need for kvmtool at the moment), from Linux v5.18.0.
Fixes: bc77bf49df6e ("stat: Add descriptions for new virtio_balloon stat types")
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220524150611.523910-5-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit a08bb43a0c37 ("kvmtool: Copy Linux' up-to-date virtio headers")
copied the kernel's virtio UAPI headers into the kvmtool tree, because
at the time some distros didn't include (all of) them in their kernel
headers package.
Let's update those copies, so that we can use newer features, if needed.
This syncs in the already existing copies of the headers from Linux
v5.18.0.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220524150611.523910-4-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We already have an update_headers.sh sync script, where we occasionally
update the KVM interface UAPI kernel headers into our tree.
So far this covered only the generic kvm.h, plus each architecture's
version of that file.
Commit bc77bf49df6e ("stat: Add descriptions for new virtio_balloon
stat types") used newer virtio symbols, which some older distros do not
include in their kernel headers package. To help fix this and to
avoid similar problems in the future, add the virtio headers to our sync
script, so that we can get the same, up-to-date versions of the headers
easily.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220524150611.523910-3-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
At the time we pulled in virtio_mmio.h from the kernel tree (commit
a08bb43a0c37c "kvmtool: Copy Linux' up-to-date virtio headers"), this was
not an official UAPI header file, so wasn't stable and was not shipped
with distributions.
This has changed with Linux commit 51be7a9a261c ("virtio_mmio: expose
header to userspace"), so we can now use that file officially.
However, before that the names of some symbols changed, so we have to
adjust their usage in our source.
This pulls in virtio_mmio.h from Linux v5.18.0.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20220524150611.523910-2-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch fixes an issue of having the stack be executable
for x86 builds by ensuring that the two objects bios-rom.o
and entry.o have the section .note.GNU-stack.
Suggested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-7-martin.b.radev@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch checks for overflows in QUEUE_NOTIFY and QUEUE_SEL in
the PCI and MMIO operation handling paths. Further, the return
value type of get_vq_count is changed from int to uint since a negative
value doesn't carry any semantic meaning.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-6-martin.b.radev@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The handling of VIRTIO_PCI_O_CONFIG is prone to buffer access overflows.
This patch sanitizes this operation by using the newly added virtio op
get_config_size. Any access which goes beyond the config structure's
size is prevented and a failure is returned.
Additionally, PCI accesses which span more than a single byte are prevented
and a warning is printed because the implementation does not currently
support the behavior correctly.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-5-martin.b.radev@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
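A bounds check of the kind described can be written so that the check itself cannot overflow. The sketch below is an assumption-laden illustration: `config_access_ok` is a hypothetical name, and it accepts multi-byte accesses, whereas the patch as described rejects them with a warning.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when a size-byte access at offset lies entirely inside a config
 * structure of config_size bytes. Comparing offset against
 * config_size - size avoids computing offset + size, which could wrap.
 */
static bool config_access_ok(size_t config_size, uint64_t offset,
			     uint32_t size)
{
	if (size == 0 || size > config_size)
		return false;
	return offset <= config_size - size;
}
```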
|
|
Per the Linux user API, the struct virtio_9p_config "tag" field contains
the non-NUL-terminated tag name, and this is how the tag name is
copied by kvmtool in virtio_9p__register(). However, the memory allocation
for the struct is off by one, as it allocates memory for the tag name and
the NUL byte. Fix it by reducing the allocation by exactly one byte.
This also matches how the struct is allocated by QEMU tagged v7.0.0 in
virtio_9p_get_config().
Suggested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/YnzhdgUwrLlqmzch@monolith.localdoman
Signed-off-by: Will Deacon <will@kernel.org>
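The allocation size in question can be sketched as follows. The struct mirrors the shape of a length-prefixed tag with a flexible array member; the names `p9_config_sketch` and `p9_config_alloc` are hypothetical, and endianness handling of the length field is omitted.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical config layout: 16-bit length followed by the tag bytes. */
struct p9_config_sketch {
	uint16_t tag_len;
	uint8_t  tag[];		/* not NUL-terminated */
};

/* Allocate exactly header + tag bytes, with no room for a terminator. */
static struct p9_config_sketch *p9_config_alloc(const char *tag_name,
						size_t *out_size)
{
	size_t len = strlen(tag_name);
	size_t size = sizeof(struct p9_config_sketch) + len;
	struct p9_config_sketch *cfg;

	if (len > UINT16_MAX)
		return NULL;

	cfg = calloc(1, size);
	if (!cfg)
		return NULL;

	cfg->tag_len = (uint16_t)len;
	memcpy(cfg->tag, tag_name, len);	/* copies no trailing NUL */
	*out_size = size;
	return cfg;
}
```

Keeping the reported config size equal to the allocation means a bounds-checked config read can never expose a stray byte past the tag.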
|
|
The PCI access size type is changed from a signed type
to an unsigned type since the size is never expected to
be negative, and the type also matches the type in the
signature of virtio_pci__io_mmio_callback.
This change simplifies size checking in the next patch.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-4-martin.b.radev@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch verifies that adding the addr and length arguments
from an MMIO op does not overflow. This is necessary because the
arguments are controlled by the VM. The length may be set to
an arbitrary value by using the rep prefix.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-3-martin.b.radev@gmail.com
[will: Drop redundant o/f check in virtio_mmio_device_specific() per Alex]
Signed-off-by: Will Deacon <will@kernel.org>
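One way to express such a check without triggering the overflow it is looking for is shown below. This is a sketch only; `mmio_range_overflows` is a hypothetical name, and the condition used in kvmtool itself may be written differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when addr + len is not representable in 64 bits. The comparison
 * is rearranged so that no addition takes place before the check.
 */
static bool mmio_range_overflows(uint64_t addr, uint32_t len)
{
	return addr > UINT64_MAX - len;
}
```

A caller would reject the access when this returns true, before using `addr + len` in any range lookup.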
|
|
Add a macro that prints a warning only once. This is useful where a
warning helps with debugging but repeating it would pollute the log.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Link: https://lore.kernel.org/r/20220509203940.754644-2-martin.b.radev@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
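A warn-once macro is commonly built around a static flag local to each call site. The sketch below shows that pattern; the names `warn_sketch` and `warn_once_sketch` and the `warnings_emitted` counter are hypothetical and exist only to make the behaviour observable here, they are not the identifiers added by the patch.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

static unsigned int warnings_emitted;	/* demonstration only */

/* Plain warning helper: prints to stderr and counts the message. */
static void warn_sketch(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	fputs("Warning: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	va_end(ap);
	warnings_emitted++;
}

/* Each expansion gets its own static flag, so each call site warns once. */
#define warn_once_sketch(...)				\
	do {						\
		static bool warned_;			\
		if (!warned_) {				\
			warned_ = true;			\
			warn_sketch(__VA_ARGS__);	\
		}					\
	} while (0)
```

Note that the flag is not protected against concurrent callers; a multi-threaded user could see the message more than once.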
|
|
Unknown types would print the value with no descriptive text at all.
Add descriptions for all known stat types, and a default description
when the type is unknown.
Signed-off-by: Keir Fraser <keirf@google.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220520143706.550169-3-keirf@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The collect_stats hook dereferences the stats virtio queue without
checking that it has been initialised.
Signed-off-by: Keir Fraser <keirf@google.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220520143706.550169-2-keirf@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
KVM doesn't support the combination of MTE and an AArch32 guest, so do
not even try.
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220520123844.127733-1-vladimir.murzin@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add a new command line argument, --vcpu-affinity, to set the CPU affinity
for the VCPUs. The affinity is expressed as a cpulist and will apply to all
VCPU threads.
This gives the user a second option for choosing the PMU on a heterogeneous
system. The PMU setup code, when --vcpu-affinity is specified, will search
for the PMU associated with the CPUs specified with this command line
argument instead of the PMU associated with the CPU on which the main
thread is executing.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-12-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
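For readers unfamiliar with the cpulist format (for example "0-3,6"), the sketch below parses one into a bitmask. It is a simplified illustration: `cpulist_parse_sketch` is a hypothetical name, and the 64-CPU limit is an assumption of this example, not a property of kvmtool's parser.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a cpulist such as "0-3,6" into a bitmask covering CPUs 0..63.
 * Returns 0 on success and -1 on malformed input.
 */
static int cpulist_parse_sketch(const char *s, uint64_t *mask)
{
	*mask = 0;
	if (!*s)
		return -1;

	while (*s) {
		char *end;
		unsigned long lo, hi;

		lo = hi = strtoul(s, &end, 10);
		if (end == s)
			return -1;

		/* Optional "-hi" part of a range. */
		if (*end == '-') {
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo > hi || hi > 63)
			return -1;

		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;

		/* A comma must be followed by another entry. */
		if (*end == ',') {
			end++;
			if (!*end)
				return -1;
		} else if (*end) {
			return -1;
		}
		s = end;
	}
	return 0;
}
```

The resulting mask could then be applied to every VCPU thread, which is the behaviour the option describes.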
|
|
The KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_V3_SET_PMU) VCPU ioctl is
used to assign a physical PMU to the events that KVM creates when emulating
the PMU for that VCPU. This is useful on heterogeneous systems, when there
is more than one hardware PMU present. All VCPUs must have the same PMU
assigned.
The assumption that is made in the implementation is that the user will pin
the kvmtool process on a set of CPUs that share the same PMU. This allows
kvmtool to set the same PMU for all VCPUs from the main thread, instead of
in the individual VCPU threads. If a VCPU thread migrates to a CPU which
has a different PMU than the CPU on which the main thread was executing
when the PMU was set, the KVM_RUN ioctl will fail with kvm_run.exit_reason
set to KVM_EXIT_FAIL_ENTRY, and kvm_run.fail_entry will be populated with
the physical CPU ID on which the VCPU tried to execute.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-11-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-10-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add a handful of cpumask functions, some of which will be used when
dealing with different PMUs on heterogeneous systems.
The maximum number of CPUs in a system, NR_CPUS, which dictates the size of
the cpumask, has been taken from the Kconfig file for each architecture,
from Linux version 5.16.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-9-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
By the time kvmtool generates the DTB node for the PMU, the
KVM_ARM_VCPU_PMU_V3 VCPU feature is already set by kvm_cpu__arch_init().
KVM refuses to run a VCPU if the PMU hasn't been initialized. A PMU
cannot be initialized if the interrupt ID hasn't been set by userspace.
As a consequence, kvmtool will get an error if the interrupt ID has not
been set or if the PMU has not been initialized:
KVM_RUN failed: Invalid argument
To make debugging easier, exit with an error message as soon as one of the
PMU ioctls fails, instead of waiting until the VCPU is first run.
To avoid the repetition of assigning a new kvm_device_attr struct in the
main body of pmu__generate_fdt_nodes(), which hinders readability of the
function, move the struct to set_pmu_attr().
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-8-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
KVM for aarch32 does not exist anymore, PMUv3 is a hardware feature
present only on aarch64 CPUs, the command line option to enable the
feature for a VCPU is aarch64 specific, the PMU code is called only from
an aarch64 function and it compiles to an empty stub when ARCH=arm.
There is no reason to have the PMUv3 emulation code in the common code
area for arm and arm64, so move it to the arm64 directory, where it can
be expanded in the future without fear of breaking aarch32 support.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-7-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The ARM_VCPU_FEATURE_FLAGS() macro sets a feature bit in a rather
convoluted way: if cpu_id is 0, then bit KVM_ARM_VCPU_POWER_OFF is 0,
otherwise it is set to 1. There's really no need for this indirection,
especially considering that the macro has been changed to return the same
value for both the arm and arm64 architectures. Replace it with a simple
conditional statement in kvm_cpu__arch_init(), which makes it clearer to
understand.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
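The "simple conditional statement" can be pictured roughly as below. This is a self-contained sketch: `SKETCH_VCPU_POWER_OFF` and `struct vcpu_init_sketch` are local stand-ins for the KVM UAPI definitions so the example compiles without kernel headers, and `select_power_state` is a hypothetical helper, not the code in kvm_cpu__arch_init().

```c
#include <stdint.h>

/* Stand-in for the KVM_ARM_VCPU_POWER_OFF feature bit number. */
#define SKETCH_VCPU_POWER_OFF	0

/* Stand-in for the VCPU init structure handed to KVM. */
struct vcpu_init_sketch {
	uint32_t target;
	uint32_t features[7];
};

/* VCPU 0 boots the guest; every other VCPU starts powered off. */
static void select_power_state(struct vcpu_init_sketch *init,
			       unsigned long cpu_id)
{
	if (cpu_id > 0)
		init->features[0] |= 1U << SKETCH_VCPU_POWER_OFF;
}
```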
|
|
KVM_CAP_ARM_EL1_32BIT and KVM_CAP_ARM_PMU_V3 are arm64 specific features.
They are set based on arm64 specific command line options and they target
arm64 hardware features. It makes little sense for kvmtool to set the
features in the code that is shared between arm and arm64. Move the logic
to set the feature bits to the arch specific function
kvm_cpu__select_features(), which is already used by arm64 to set other
arm64 specific features.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Trying to build a source file which included bitops.h, but didn't also
bring in the definition for __WORDSIZE (by including limits.h, for example)
would result in the following error:
include/linux/bitops.h:8:23: error: ‘__WORDSIZE’ undeclared (first use in this function)
8 | #define BITS_PER_LONG __WORDSIZE
| ^~~~~~~~~~
The symbol is defined in the bits/wordsize.h header file, include it.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add missing header stdbool.h to avoid errors like this one, which can
happen if the including file doesn't include stdbool.h:
include/linux/err.h:33:15: error: type defaults to ‘int’ in declaration of ‘bool’ [-Werror=implicit-int]
33 | static inline bool __must_check IS_ERR(__force const void *ptr)
| ^~~~
include/linux/err.h:33:15: error: variable ‘bool’ declared ‘inline’ [-Werror]
include/linux/err.h:33:1: error: ‘warn_unused_result’ attribute only applies to function types [-Werror=attributes]
33 | static inline bool __must_check IS_ERR(__force const void *ptr)
| ^~~~~~
include/linux/err.h:33:33: error: expected ‘,’ or ‘;’ before ‘IS_ERR’
33 | static inline bool __must_check IS_ERR(__force const void *ptr)
| ^~~~~~
include/linux/err.h:38:15: error: type defaults to ‘int’ in declaration of ‘bool’ [-Werror=implicit-int]
38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
| ^~~~
include/linux/err.h:38:15: error: variable ‘bool’ declared ‘inline’ [-Werror]
include/linux/err.h:38:1: error: ‘warn_unused_result’ attribute only applies to function types [-Werror=attributes]
38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
| ^~~~~~
include/linux/err.h:38:15: error: redundant redeclaration of ‘bool’ [-Werror=redundant-decls]
38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
| ^~~~
include/linux/err.h:33:15: note: previous declaration of ‘bool’ was here
33 | static inline bool __must_check IS_ERR(__force const void *ptr)
| ^~~~
include/linux/err.h:38:33: error: expected ‘,’ or ‘;’ before ‘IS_ERR_OR_NULL’
38 | static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
| ^~~~~~~~~~~~~~
include/linux/err.h: In function ‘PTR_ERR_OR_ZERO’:
include/linux/err.h:58:6: error: implicit declaration of function ‘IS_ERR’ [-Werror=implicit-function-declaration]
58 | if (IS_ERR(ptr))
| ^~~~~~
include/linux/err.h:58:6: error: nested extern declaration of ‘IS_ERR’ [-Werror=nested-externs]
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220412133231.35355-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
MTE has been supported in Linux since commit 673638f434ee ("KVM: arm64:
Expose KVM_ARM_CAP_MTE"), add support for it in kvmtool. MTE is enabled by
default.
Enabling the MTE capability incurs a cost, both in time (for each
translation fault the tags need to be cleared), and in space (the tags need
to be saved when a physical page is swapped out). This overhead is expected
to be negligible for most users, but for those cases where it matters
(like performance benchmarks), a --disable-mte option has been added.
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220328103328.18768-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220328103328.18768-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The stolen time option is available only for aarch64 and is enabled by
default. Move the option that disables stolen time functionality into the
arch-specific path.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220324154304.2572891-1-sebastianene@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This reverts commit bc0b99a2a74047707db73ba057743febf458fd90.
Thanks to some digging from Andre [1], we know that kvmtool commit
bc0b99a2a740 ("kvm tools: Filter out CPU vendor string") was intended
to work around a guest kernel bug resulting from kernel commit
5bbc097d8904 ("x86, amd: Disable GartTlbWlkErr when BIOS forgets it").
Critically, KVM does not implement the MC4 mask MSR and instead injects
a #GP into the guest. On guest kernels without commit d47cc0db8fd6
("x86, amd: Use _safe() msr access for GartTlbWlk disable code") this is
unexpected and causes a kernel oops.
Since the kernel has taken the position to fix the bug in the guest and
not KVM, there is no need for CPU vendor string filtering in kvmtool.
Vendor string filtering is highly problematic for feature discovery,
both in the kernel and userspace. As Andre noted, glibc depends on the
vendor string to discover CPU features at runtime [2]. This has been
generally innocuous, but as distributions begin to raise the minimum ISA,
guest userspace will quickly crash and burn on kvmtool. Hiding the
vendor string also makes it impossible to test vendor-specific CPU
features in kvmtool guest kernels.
Given the fact that there are known dependencies in kernel and userspace
on the CPU vendor string, allow the guest to see the native CPU vendor
string. This has the potential to break certain guest kernels of 2011
vintage when running on an AMD Fam10h processor. Onus is on the guest to
update its kernel at this point.
Link: https://lore.kernel.org/kvm/20220311121042.010bbb30@donnerap.cambridge.arm.com/
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/cpu-features.c;h=514226b37889;hb=HEAD#l398
Reported-by: Dongli Si <sidongli1997@gmail.com>
Suggested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Link: https://lore.kernel.org/r/20220318204938.496840-1-oupton@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The command line argument disables the stolen time functionality when it is
specified.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220313161949.3565171-4-sebastianene@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch adds support for stolen time by sharing a memory region
with the guest which will be used by the hypervisor to store the stolen
time information. Reserve a 64KB MMIO memory region after the RTC peripheral
to be used by pvtime. The exact format of the structure stored by the
hypervisor is described in the ARM DEN0057A document.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220313161949.3565171-3-sebastianene@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Move the vCPU structure initialisation before the target->init() call to
keep a reference to the kvm structure during init().
This is required by the pvtime peripheral to reserve a memory region
while the vCPU is being initialised.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20220313161949.3565171-2-sebastianene@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The "msi-parent" PCI root complex property describes the MSI parent of the
root complex. When the VM is created with a GICv2 or GICv3 irqchip
(--irqchip=gicv3 or --irqchip=gicv2), there is no MSI controller present on
the system and the corresponding phandle is not generated, leaving the
"msi-parent" property to point to a non-existing phandle. Skip creating the
"msi-parent" property when no MSI controller exists.
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220214165830.69207-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When loading a kernel image, kvmtool is nice enough to print a message
informing the user where the file was loaded in guest memory, which is very
useful for debugging. Do the same for the firmware image.
Commit e1c7c62afc7b ("arm: turn pr_info() into pr_debug() messages")
changed various pr_info() into pr_debug() messages to stop kvmtool from
cluttering stdout. Do the same when printing where the FDT has been copied
when loading a firmware image.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220214165830.69207-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Linux, besides CPIO, supports 7 different compressed formats for the initrd
(gzip, bzip2, LZMA, XZ, LZO, LZ4, ZSTD), but kvmtool only recognizes one of
them.
Remove the initrd magic check because:
1. It doesn't bring much to the end user, as the Linux kernel still
complains if the initrd is in an unknown format.
2. --kernel can be used to load something that is not a Linux kernel (like
a kvm-unit-tests test), in which case a format which is not supported by
a Linux kernel can still be perfectly valid. For example, kvm-unit-tests
load the test environment as an initrd in plain ASCII format.
3. It cuts down on the maintenance effort when new formats are added to
the Linux kernel. Not a big deal, since that doesn't happen very often,
but it's still an effort with very little gain (see point #1 above).
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220214165830.69207-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
It appears that the way INTx is emulated is "slightly" out of spec
in kvmtool. We happily inject an edge interrupt, even if the spec
mandates a level.
This doesn't change much for either the guest or userspace (only
KVM will have a bit more work tracking the EOI), but at least
this is correct.
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Sami Mujawar <sami.mujawar@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220131160242.2665191-1-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When kvmtool boots a kernel, the dmesg will print the following message:
[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 APIC: 30
Fix this by setting initial_apicid to the correct value, cpu_id.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220216113735.52240-2-songmuchun@bytedance.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When dev_hdr->dev_num is greater than one, the initialization of last_addr
is wrong. Fix it.
Fixes: f83cd16 ("kvm tools: irq: replace the x86 irq rbtree with the PCI device tree")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20220216113735.52240-1-songmuchun@bytedance.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch extends FDT generation to generate a PCI host DT node.
Of course, PCI host for Guest/VM is not useful at the moment
because it's mostly for PCI pass-through and we don't have
IOMMU and interrupt routing available for KVM RISC-V. In future,
we might be able to use PCI host for VirtIO PCI transport or
other software emulated PCI devices.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-9-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The kernel KVM RISC-V module will forward certain SBI calls
to user space. These forwarded SBI calls will usually be the
SBI calls which cannot be emulated in kernel space such as
PUTCHAR and GETCHAR calls.
This patch extends kvm_cpu__handle_exit() to handle SBI calls
forwarded to user space.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-8-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
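A minimal sketch of how user space might dispatch the forwarded console calls mentioned above. The struct and function names are hypothetical stand-ins for the SBI exit data that KVM passes to user space, and the extension IDs are assumed to be the legacy SBI console IDs. It is not kvmtool's actual handler.

```c
#include <stdio.h>

/* Legacy SBI extension IDs for the console calls (assumed here). */
#define SBI_EXT_0_1_CONSOLE_PUTCHAR	0x1
#define SBI_EXT_0_1_CONSOLE_GETCHAR	0x2

/* Hypothetical stand-in for the SBI exit information handed to user space. */
struct sbi_exit {
	unsigned long extension_id;
	unsigned long args[6];
	unsigned long ret[2];
};

/*
 * Emulate one forwarded SBI call. Returns 0 if the call was handled and
 * -1 if user space does not know how to emulate it.
 */
int handle_sbi_exit(struct sbi_exit *sbi, FILE *in, FILE *out)
{
	int ch;

	switch (sbi->extension_id) {
	case SBI_EXT_0_1_CONSOLE_PUTCHAR:
		fputc((int)sbi->args[0], out);
		sbi->ret[0] = 0;
		return 0;
	case SBI_EXT_0_1_CONSOLE_GETCHAR:
		ch = fgetc(in);
		/* Report -1 to the guest when no character is available. */
		sbi->ret[0] = (ch == EOF) ? (unsigned long)-1 : (unsigned long)ch;
		return 0;
	default:
		return -1;
	}
}
```

Passing the streams as parameters keeps the sketch easy to exercise without a terminal; a real handler would talk to the emulated console instead.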
|
|
We generate FDT at runtime for RISC-V Guest/VM so that KVMTOOL users
don't have to pass FDT separately via command-line parameters.
Also, we provide "--dump-dtb <filename>" command-line option to dump
generated FDT into a file for debugging purposes.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-7-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The PLIC (platform level interrupt controller) manages peripheral
interrupts in RISC-V world. The per-CPU interrupts are managed
using CPU CSRs hence virtualized in-kernel by KVM RISC-V.
This patch adds PLIC device emulation for KVMTOOL RISC-V.
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
[For PLIC context CLAIM register emulation]
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-6-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch implements kvm_cpu__<xyz> Guest/VM VCPU arch functions.
These functions mostly deal with:
1. VCPU allocation and initialization
2. VCPU reset
3. VCPU show/dump code
4. VCPU show/dump registers
We also save RISC-V ISA, XLEN, and TIMEBASE frequency for each VCPU
so that it can be later used for generating Guest/VM FDT.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-5-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch implements all kvm__arch_<xyz> Guest/VM arch functions.
These functions mostly deal with:
1. Guest/VM RAM initialization
2. Updating terminals on character read
3. Loading kernel and initrd images
Firmware loading is not implemented currently because initially we
will be booting kernel directly without any bootloader. In future,
we will certainly support firmware loading.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-4-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This patch adds initial skeletal KVMTOOL RISC-V support which
just compiles for RV32 and RV64 hosts.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-3-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We sync-up all ABI headers with Linux-5.16-rc1 so that RISC-V
specific changes in include/linux/kvm.h are available.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Link: https://lore.kernel.org/r/20211119124515.89439-2-anup.patel@wdc.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Command 'lvm version' works incorrectly.
It is expected to print:
# ./lvm version
# kvm tool [KVMTOOLS_VERSION]
but the KVMTOOLS_VERSION is missed:
# ./lvm version
# kvm tool
The KVMTOOLS_VERSION is defined in the KVMTOOLS-VERSION-FILE file which
is included at the end of the Makefile. CFLAGS is a 'simply expanded
variable', which means its value is only scanned once, at the point of
assignment. The definition of KVMTOOLS_VERSION at the end of the Makefile
is therefore never seen by CFLAGS, and '-DKVMTOOLS_VERSION=' remains empty.
I fixed the bug by moving the '-include $(OUTPUT)KVMTOOLS-VERSION-FILE'
before the CFLAGS.
Signed-off-by: haibiao.xiao <xiaohaibiao331@outlook.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211210030708.288066-1-haibiao.xiao@zstack.io
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The interrupt pin cell in "interrupt-map" property
is defined only for legacy interrupts with a valid
range in [1-4] corresponding to INTA#..INTD#. And the
PCI endpoint devices that support advanced interrupt
mechanisms like MSI or MSI-X should not have an entry
with value 0 in "interrupt-map". This patch takes
care of this problem by avoiding redundant entries.
Signed-off-by: Sathyam Panda <sathyam.panda@arm.com>
Reviewed-by: Vivek Kumar Gautam <vivek.gautam@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211111120231.5468-1-sathyam.panda@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
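The filtering rule from the commit above can be illustrated with a short sketch. The struct and helper names are hypothetical; the only fact taken from the commit message is that valid legacy interrupt pins are 1 to 4 and that devices with pin 0 should get no "interrupt-map" entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a PCI device. */
struct pci_dev {
	uint8_t dev_num;
	uint8_t irq_pin;	/* 0 = no INTx, 1..4 = INTA#..INTD# */
};

/* Only pins 1..4 describe a legacy interrupt. */
static bool has_legacy_irq(const struct pci_dev *dev)
{
	return dev->irq_pin >= 1 && dev->irq_pin <= 4;
}

/*
 * Count the "interrupt-map" entries that would be generated. Devices
 * without a valid legacy interrupt pin are skipped.
 */
size_t count_irq_map_entries(const struct pci_dev *devs, size_t n)
{
	size_t i, entries = 0;

	for (i = 0; i < n; i++) {
		if (!has_legacy_irq(&devs[i]))
			continue;
		entries++;
	}
	return entries;
}
```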
|
|
When allocating MMIO space for the MSI-X table, kvmtool rounds the
allocation to the host's page size to make it as easy as possible for the
guest to map the table to a page, if it wants to (and doesn't do BAR
reassignment, like the x86 architecture for example). However, the host's
page size can differ from the guest's on architectures which support
multiple page sizes. For example, arm64 supports three different page sizes,
and it is possible for the host to be using 4k pages, while the guest is
using 64k pages.
To make sure the allocation is always aligned to a guest's page size, round
it up to the maximum architectural page size. Do the same for the pending
bit array if it lives in its own BAR.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-8-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
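As a rough illustration of the rounding described above, the sketch below aligns an allocation to the largest page size the guest could be using. The macro and function names are hypothetical, and 64k is assumed to be the maximum architectural page size, as it is on arm64.

```c
#include <stdint.h>

/* Assumption for this sketch: 64k is the largest page size a guest can use. */
#define MAX_PAGE_SIZE	0x10000ULL

/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Size of the MMIO allocation that backs an MSI-X table of table_bytes bytes. */
uint64_t msix_mmio_alloc_size(uint64_t table_bytes)
{
	return ALIGN_UP(table_bytes, MAX_PAGE_SIZE);
}
```

With these assumptions a 64-byte table still receives a full 64k allocation, so a guest using any supported page size can map it on a page of its own.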
|
|
Now that we keep track of the real size of MSIX table and PBA, print an
error when the guest tries to write to an offset which is not inside the
correct regions.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-7-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When creating the MSIX table and PBA, kvmtool rounds up the table and
pending bit array sizes to the host's page size. Unfortunately, when doing
that, it doesn't take into account that the new size can exceed the device
BAR size, leading to hard to diagnose errors for certain configurations.
One theoretical example: PBA and table in the same 4k BAR, host's page size
is 4k. In this case, table->size = 4k, pba->size = 4k, map_size = 4k, which
means that pba->guest_phys_addr = table->guest_phys_addr + 4k, which is
outside of the 4k MMIO range allocated for both structures.
Another example, this time a real-world error that I encountered: happens
with a 64k host booting a 4k guest, an RTL8168 PCIE NIC assigned to the
guest. In this case, kvmtool sets table->size = 64k (because it's rounded
to the host's page size) and pba->size = 64k.
Truncated output of lspci -vv on the host:
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
Subsystem: TP-LINK Technologies Co., Ltd. TG-3468 Gigabit PCI Express Network Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 255
Region 0: I/O ports at 1000 [size=256]
Region 2: Memory at 40000000 (64-bit, non-prefetchable) [size=4K]
Region 4: Memory at 100000000 (64-bit, prefetchable) [size=16K]
[..]
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000800
[..]
When booting the guest:
[..]
[ 0.207444] pci-host-generic 40000000.pci: host bridge /pci ranges:
[ 0.208564] pci-host-generic 40000000.pci: IO 0x0000000000..0x000000ffff -> 0x0000000000
[ 0.209857] pci-host-generic 40000000.pci: MEM 0x0050000000..0x007fffffff -> 0x0050000000
[ 0.211184] pci-host-generic 40000000.pci: ECAM at [mem 0x40000000-0x4fffffff] for [bus 00]
[ 0.212625] pci-host-generic 40000000.pci: PCI host bridge to bus 0000:00
[ 0.213647] pci_bus 0000:00: root bus resource [bus 00]
[ 0.214429] pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
[ 0.215355] pci_bus 0000:00: root bus resource [mem 0x50000000-0x7fffffff]
[ 0.216676] pci 0000:00:00.0: [10ec:8168] type 00 class 0x020000
[ 0.223771] pci 0000:00:00.0: reg 0x10: [io 0x6200-0x62ff]
[ 0.239765] pci 0000:00:00.0: reg 0x18: [mem 0x50010000-0x50010fff]
[ 0.244595] pci 0000:00:00.0: reg 0x20: [mem 0x50000000-0x50003fff]
[ 0.246331] pci 0000:00:01.0: [1af4:1000] type 00 class 0x020000
[ 0.247278] pci 0000:00:01.0: reg 0x10: [io 0x6300-0x63ff]
[ 0.248212] pci 0000:00:01.0: reg 0x14: [mem 0x50020000-0x500200ff]
[ 0.249172] pci 0000:00:01.0: reg 0x18: [mem 0x50020400-0x500207ff]
[ 0.250450] pci 0000:00:02.0: [1af4:1001] type 00 class 0x018000
[ 0.251392] pci 0000:00:02.0: reg 0x10: [io 0x6400-0x64ff]
[ 0.252351] pci 0000:00:02.0: reg 0x14: [mem 0x50020800-0x500208ff]
[ 0.253312] pci 0000:00:02.0: reg 0x18: [mem 0x50020c00-0x50020fff]
[ 0.254760] pci 0000:00:00.0: BAR 4: assigned [mem 0x50000000-0x50003fff] (1)
[ 0.255805] pci 0000:00:00.0: BAR 2: assigned [mem 0x50004000-0x50004fff] (2)
Warning: [10ec:8168] Error activating emulation for BAR 2
Warning: [10ec:8168] Error activating emulation for BAR 2
[ 0.260432] pci 0000:00:01.0: BAR 2: assigned [mem 0x50005000-0x500053ff]
Warning: [1af4:1000] Error activating emulation for BAR 2
Warning: [1af4:1000] Error activating emulation for BAR 2
[ 0.261469] pci 0000:00:02.0: BAR 2: assigned [mem 0x50005400-0x500057ff]
Warning: [1af4:1001] Error activating emulation for BAR 2
Warning: [1af4:1001] Error activating emulation for BAR 2
[ 0.262499] pci 0000:00:00.0: BAR 0: assigned [io 0x1000-0x10ff]
[ 0.263415] pci 0000:00:01.0: BAR 0: assigned [io 0x1100-0x11ff]
[ 0.264462] pci 0000:00:01.0: BAR 1: assigned [mem 0x50005800-0x500058ff]
Warning: [1af4:1000] Error activating emulation for BAR 1
Warning: [1af4:1000] Error activating emulation for BAR 1
[ 0.265481] pci 0000:00:02.0: BAR 0: assigned [io 0x1200-0x12ff]
[ 0.266397] pci 0000:00:02.0: BAR 1: assigned [mem 0x50005900-0x500059ff]
Warning: [1af4:1001] Error activating emulation for BAR 1
Warning: [1af4:1001] Error activating emulation for BAR 1
[ 0.267892] EINJ: ACPI disabled.
[ 0.269922] virtio-pci 0000:00:01.0: virtio_pci: leaving for legacy driver
[ 0.271118] virtio-pci 0000:00:02.0: virtio_pci: leaving for legacy driver
[ 0.274122] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[ 0.275930] printk: console [ttyS0] disabled
[ 0.276669] 1000000.U6_16550A: ttyS0 at MMIO 0x1000000 (irq = 13, base_baud = 115200) is a 16550A
[ 0.278058] printk: console [ttyS0] enabled
[ 0.278058] printk: console [ttyS0] enabled
[ 0.279304] printk: bootconsole [ns16550a0] disabled
[ 0.279304] printk: bootconsole [ns16550a0] disabled
[ 0.281252] 1001000.U6_16550A: ttyS1 at MMIO 0x1001000 (irq = 14, base_baud = 115200) is a 16550A
[ 0.282842] 1002000.U6_16550A: ttyS2 at MMIO 0x1002000 (irq = 15, base_baud = 115200) is a 16550A
[ 0.284611] 1003000.U6_16550A: ttyS3 at MMIO 0x1003000 (irq = 16, base_baud = 115200) is a 16550A
[ 0.286094] SuperH (H)SCI(F) driver initialized
[ 0.286868] msm_serial: driver initialized
[ 0.287890] [drm] radeon kernel modesetting enabled.
[ 0.288826] cacheinfo: Unable to detect cache hierarchy for CPU 0
[ 0.293321] loop: module loaded
KVM_SET_GSI_ROUTING: Invalid argument
At (1), the guest writes 0x50000000 into BAR 4 of the NIC (which holds
the MSIX table and PBA), expecting that it will cover only 16k of address
space (the BAR size), up to 0x50003fff, inclusive. On the host side, in
vfio_pci_bar_activate(), kvmtool will actually register for MMIO
emulation the region 0x50000000-0x5000ffff (64k in total) for the MSIX
table and 0x50010000-0x5001ffff (another 64k) for the PBA (kvmtool set
table->size and pba->size to 64k when it aligned them to the host's page
size).
Then at step (2), the guest writes the next available address (from its
point of view) into BAR 2 of the NIC, which is 0x50004000. On the host
side, the PCI emulation layer will search all the regions that overlap with
the BAR address range (0x50004000-0x50004fff) and will find none because,
just like the guest, it uses the BAR size to check for overlaps. When
vfio_pci_bar_activate() is reached, kvmtool will try to register memory for
this region, but it is already registered for the MSIX table emulation and
fails.
The same scenario repeats for every following memory BAR, because the MSIX
table and PBA use memory from 0x50000000 to 0x5001ffff.
The error at the end, which finally terminates the VM, is caused by the
guest trying to write to a totally different BAR, which vfio-pci
interprets as a write to the MSI-X table because it falls in the 64k region
that was registered for emulation. The IRQ ID is not a valid SPI number and
gicv2m_update_routing() returns an error (and sets errno to EINVAL).
Fix this by aligning the table and PBA size to 8 bytes to allow for
qword accesses, like PCI 3.0 mandates.
For the sake of simplicity, the PBA offset in a BAR, in case of a shared
BAR, is kept the same as the offset of the physical device. One hopes that
the device respects the recommendations set forth in PCI LOCAL BUS
SPECIFICATION, REV. 3.0, section "MSI-X Capability and Table Structures"
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
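The arithmetic behind the fix above can be checked with a small sketch. The helper names are hypothetical; the entry size of 16 bytes and the one pending bit per vector come from the MSI-X layout, and the numbers in the usage note are the ones from the lspci output (Count=4, PBA at offset 0x800, 16K BAR).

```c
#include <stdbool.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE	16	/* bytes per MSI-X table entry */

/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Table size: one 16-byte entry per vector, aligned to 8 bytes. */
uint64_t msix_table_size(unsigned int nr_entries)
{
	return ALIGN_UP((uint64_t)nr_entries * MSIX_ENTRY_SIZE, 8);
}

/* PBA size: one pending bit per vector, in whole bytes, aligned to 8 bytes. */
uint64_t msix_pba_size(unsigned int nr_entries)
{
	return ALIGN_UP(((uint64_t)nr_entries + 7) / 8, 8);
}

/* True if [offset, offset + size) lies entirely inside a BAR of bar_size bytes. */
bool fits_in_bar(uint64_t offset, uint64_t size, uint64_t bar_size)
{
	return offset <= bar_size && size <= bar_size - offset;
}
```

For the RTL8168 in the example, msix_table_size(4) is 64 bytes and msix_pba_size(4) is 8 bytes, so both structures fit in the 16K BAR, whereas a 64k table size does not pass fits_in_bar() for the same BAR.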
|
|
The MSI-X capability defines a PBA offset, which is the offset of the PBA
array in the BAR that holds the array.
kvmtool uses the field "pba_offset" in struct msix_cap (which represents
the MSIX capability) to refer to the [PBA offset:BAR] field of the
capability; and the field "offset" in the struct vfio_pci_msix_pba to refer
to the offset of the PBA array in the device descriptor created by the VFIO
driver.
As we're getting ready to add yet another field that represents an offset
to struct vfio_pci_msix_pba, try to avoid ambiguities by renaming the
struct's "offset" field to "fd_offset".
No functional change intended.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Evaluate the "pci_hdr" argument before attempting to dereference a field.
This fixes cryptic errors like this one, which came about during a
debugging session:
vfio/pci.c: In function 'vfio_pci_bar_activate':
include/kvm/pci.h:18:40: error: invalid type argument of '->' (have 'struct pci_device_header')
pr_warning("[%04x:%04x] " fmt, pci_hdr->vendor_id, pci_hdr->device_id, ##__VA_ARGS__)
^~
vfio/pci.c:482:3: note: in expansion of macro 'pci_dev_warn'
pci_dev_warn(&vdev->pci.hdr, "%s: BAR4\n", __func__);
This is caused by the operator precedence rules in C, where pointer
dereference via "->" has a higher precedence than taking the address with the
ampersand symbol. When the macro is substituted, it becomes
&vdev->pci.hdr->vendor_id and it dereferences vdev->pci.hdr, which is not a
pointer, instead of dereferencing &vdev->pci.hdr, which is a pointer, and
quite likely what the author intended.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
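The precedence problem described above is easy to reproduce in isolation. In the sketch below, pci_dev_id() is a hypothetical macro (it is not the pci_dev_warn() macro from kvmtool) and the structs are trimmed-down stand-ins; the point is only that parenthesising the macro argument makes the address-of operator apply before "->".

```c
#include <stdint.h>

/* Trimmed-down stand-ins for the real structures. */
struct pci_device_header {
	uint16_t vendor_id;
	uint16_t device_id;
};

struct vfio_device {
	struct {
		struct pci_device_header hdr;
	} pci;
};

/*
 * The argument is wrapped in parentheses, so an expression such as
 * &vdev->pci.hdr is evaluated as a whole before "->" is applied.
 */
#define pci_dev_id(pci_hdr) \
	(((uint32_t)(pci_hdr)->vendor_id << 16) | (pci_hdr)->device_id)

uint32_t vfio_dev_id(struct vfio_device *vdev)
{
	/* Expands to (&vdev->pci.hdr)->vendor_id, which is well formed. */
	return pci_dev_id(&vdev->pci.hdr);
}
```

Without the parentheses around pci_hdr in the macro body, the call in vfio_dev_id() would fail to compile with the same "invalid type argument of '->'" error shown in the commit message.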
|
|
assert.h is included twice, keep only one instance.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In case of an error when updating the routing table entries,
irq__update_msix_route() uses perror to print an error message.
gicv2m_update_routing() doesn't set errno, and instead returns the value
that errno should have had, which can lead to failure messages like this:
KVM_SET_GSI_ROUTING: Success
Set errno in gicv2m_update_routing() to avoid such messages in the future.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20211012132510.42134-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
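The two error-reporting conventions involved can be contrasted in a short sketch. Both functions and the SPI range are hypothetical; they only illustrate why perror() prints "Success" when the callee returns an error code without touching errno.

```c
#include <errno.h>

/* Hypothetical range of valid interrupt numbers, for illustration only. */
#define SPI_BASE	32
#define NR_SPIS		64

/* Returns the error code but leaves errno untouched: perror() is misled. */
int update_routing_returns_code(unsigned int irq)
{
	if (irq < SPI_BASE || irq >= SPI_BASE + NR_SPIS)
		return -EINVAL;
	return 0;
}

/* Sets errno before failing, which is what a perror() caller relies on. */
int update_routing_sets_errno(unsigned int irq)
{
	if (irq < SPI_BASE || irq >= SPI_BASE + NR_SPIS) {
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

A caller that reacts to a failure with perror("KVM_SET_GSI_ROUTING") gets "Invalid argument" only with the second variant.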
|
|
kvmtool complains loudly when it parses the kernel header and doesn't find
what it expects, but unless it outright fails to read the kernel image, it
will copy the image in the guest memory at the default offset of 0x80000.
There's no technical reason to stop the user from loading payloads other
than a Linux kernel with the --kernel option. These payloads can behave
just like a kernel and can use an initrd (which is not possible with
--firmware), but don't have the kernel header (like kvm-unit-tests), and
the warnings kvmtool emits can be confusing for this type of payload.
Change the warnings to debug statements, which can be enabled via the
--debug kvmtool command line option, to make them disappear for these cases
where they aren't really relevant.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-11-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit fd0a05bd27dd ("arm64: Obtain text offset from kernel image") added
support for getting the kernel offset from the kernel header. The code
checks for the kernel header magic number, and if not found, prints a
warning and continues searching for the kernel offset in the image.
The -k/--kernel option can be used to load things which are not a Linux
kernel, but behave like one, like a kvm-unit-tests test. The tests don't
have a valid kernel header, and because kvmtool insists on searching for
the offset, creating a virtual machine can fail with this message:
$ ./vm run -c2 -m256 -k ../kvm-unit-tests/arm/cache.flat
# lkvm run -k ../kvm-unit-tests/arm/cache.flat -m 256 -c 2 --name guest-7529
Warning: Kernel image magic not matching
Warning: unable to translate host address 0x910100a502a00085 to guest
Fatal: kernel image too big to contain in guest memory.
The host address is a random number read from the test binary from the
location where text_offset is found in the kernel header. Before the
commit, the test was executing just fine:
$ ./vm run -c2 -m256 -k ../kvm-unit-tests/arm/cache.flat
# lkvm run -k ../kvm-unit-tests/arm/cache.flat -m 256 -c 2 --name guest-8105
INFO: IDC-DIC: dcache clean to PoU required
INFO: IDC-DIC: icache invalidation to PoU required
PASS: IDC-DIC: code generation
SUMMARY: 1 tests
Change kvm__arch_get_kern_offset() so it returns the default text_offset
value if the kernel image magic number is not found, making it possible
again to use something other than a Linux kernel with --kernel.
Reported-by: Vivek Kumar Gautam <vivek.gautam@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-10-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
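A minimal sketch of the fallback described above, assuming the arm64 Image header layout from the Linux kernel's booting documentation and a little-endian host. The function name kern_offset() and the struct are illustrative, not kvmtool's kvm__arch_get_kern_offset().

```c
#include <stdint.h>
#include <string.h>

#define ARM64_IMAGE_MAGIC	0x644d5241u	/* "ARM\x64" read as little endian */
#define DEFAULT_TEXT_OFFSET	0x80000ULL

/* arm64 Image header layout (64 bytes), assumed from the kernel documentation. */
struct arm64_image_header {
	uint32_t code0;
	uint32_t code1;
	uint64_t text_offset;
	uint64_t image_size;
	uint64_t flags;
	uint64_t res2;
	uint64_t res3;
	uint64_t res4;
	uint32_t magic;
	uint32_t res5;
};

/*
 * Return the text offset from the header if the magic matches, and the
 * default offset for any payload that does not look like a Linux kernel.
 */
uint64_t kern_offset(const void *image, size_t len)
{
	struct arm64_image_header hdr;

	if (len < sizeof(hdr))
		return DEFAULT_TEXT_OFFSET;

	/* Copy out of the image to avoid unaligned accesses. */
	memcpy(&hdr, image, sizeof(hdr));
	if (hdr.magic != ARM64_IMAGE_MAGIC)
		return DEFAULT_TEXT_OFFSET;

	return hdr.text_offset;
}
```

With this shape, a kvm-unit-tests binary without the magic number is simply placed at the default offset instead of having a random field interpreted as text_offset.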
|
|
kvmtool attempts to make it as easy as possible for the user to run a VM
by doing a few different things: it tries to create a rootfs filesystem in
a directory if no disk or initrd is set by the user, and it adds various
parameters to the kernel command line based on the VM configuration
options.
While this is generally very useful, today there isn't any way for the user
to prohibit this behaviour, even though there are situations where this
might not be desirable, like, for example: loading something which is not a
kernel (kvm-unit-tests comes to mind, which expects test parameters on the
kernel command line); the kernel has a built-in initramfs and there is no
need to generate the root filesystem, or it is not possible; and what is
probably the most important use case, when the user is actively trying to
break things for testing purposes.
Add a --nodefaults command line argument which disables everything that
cannot be disabled via another command line switch. The purpose of this
knob is not to disable the default options for arguments that can be set
via the kvmtool command line, but rather to inhibit behaviour that cannot
be disabled otherwise.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-8-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The real kernel command line is gradually generated in kvm_cmd_run_init()
and it is interspersed with the initialization code. This means that both
the code that generates the command line and the rest of the code are
unnecessarily difficult to follow and to modify. Move the code that
generates the command line to one function, to make it easier to
understand, and to declutter kvm_cmd_run_init().
No functional change intended.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-7-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A user can specify multiple disk images using the --disk/-d argument. The
callback for the argument ends up calling
disk/core.c::disk_img_name_parser(), which increments
kvm->cfg.image_count for each disk image.
Immediately after parsing the arguments in kvm_cmd_run_init(),
kvm->nr_disks is set to kvm->cfg.image_count, effectively making
kvm->nr_disks an alias for kvm->cfg.image_count, as image_count is never
changed afterward.
Later on, the core disk code uses kvm->cfg.image_count when opening all the
disk images, but kvm->nr_disks when closing them, which is inconsistent,
but technically correct since they represent the same thing and have the
same value.
Let's remove all this confusing usage and use only kvm->nr_disks to
represent the number of disk images specified by the user.
While this technically means that kvmtool now supports up to INT_MAX disk
images, in practice this is limited by MAX_DISK_IMAGES, which is equal to
four. Which means there are no functional changes.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-6-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
kvm_cmd_run_init() is a complex function which parses the command line
arguments, configures various aspects of a VM (the size of the RAM, the
number of CPUs, the network, the active console, the kernel command line,
creates a custom rootfs, etc), and after the recent patches, also does a
few checks against mutually exclusive kvmtool arguments.
Make the function just that little bit easier to read by moving the
argument validation into a separate function.
No functional change intended.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
kvm->vmlinux is used by symbol.c on x86 to translate a PC address to a
kernel symbol when kvmtool exits unexpectedly. When the --firmware argument
is used, a kernel image is not used for the VM, and the vmlinux file has no
relevance in this case.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The firmware image is copied into the guest memory with the arch specific
function kvm__load_firmware() in kvm__init(). That function ignores the
initrd file, if the user specified one. Let the user know that the file is
ignored by KVM and the --initrd argument does nothing with --firmware.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
If the user specifies both the --kernel and the --firmware arguments,
--firmware takes precedence and --kernel is silently ignored. Since kvmtool
has no way of knowing what the user really intended, and guessing that
--firmware is the right argument might prove to be quite unexpected for the
user, be vocal about the incompatibility and refuse to create the VM.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210923144505.60776-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
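The check described above amounts to rejecting one combination of options up front. The sketch below uses a hypothetical configuration struct and function name, and the error text is only an example.

```c
#include <stddef.h>

/* Hypothetical subset of the VM configuration filled in by option parsing. */
struct vm_config {
	const char *kernel_filename;
	const char *firmware_filename;
};

/*
 * Returns NULL if the configuration is acceptable, or a message that
 * explains the conflict so the caller can print it and refuse to start.
 */
const char *validate_config(const struct vm_config *cfg)
{
	if (cfg->kernel_filename && cfg->firmware_filename)
		return "--kernel and --firmware cannot be specified together";
	return NULL;
}
```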
|
|
Since 45d3b59e8c45 ("kvm tools: Increase amount of possible interrupts
per PCI device"), the number of MSI-X vectors has gone from 4 to 33.
However, the corresponding storage hasn't been upgraded, and writing
to the MSI-X table is a pretty risky business. Now that the Linux
kernel writes to *all* MSI-X entries before doing anything else
with the device, kvmtool dies a horrible death.
Fix it by properly defining the size of the MSI-X bar, and make
Linux great again.
This includes some fixes to the PBA region decoding, as well as minor
cleanups to make this code a bit more maintainable.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20210827115405.1981529-1-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
There is some value in keeping the IPA space small, as it reduces
the size of the stage-2 page tables.
Let's compute the required space at VM creation time, and inform
the kernel of our requirements.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20210822152526.1291918-4-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
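One way to compute the required IPA space is to find the smallest number of address bits that covers the end of the guest's memory map. The sketch below is an illustration under two assumptions: a 32-bit lower bound on the IPA size, and that the bit count is passed in the low byte of the VM type, as the KVM_VM_TYPE_ARM_IPA_SIZE() macro in linux/kvm.h does. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the kernel does not accept IPA sizes below 32 bits. */
#define MIN_IPA_BITS	32

/* Smallest number of address bits that covers every byte below end. */
unsigned int ipa_bits_for(uint64_t end)
{
	unsigned int bits = MIN_IPA_BITS;

	while (bits < 64 && (UINT64_C(1) << bits) < end)
		bits++;
	return bits;
}

/* Encode the IPA size in the low byte of the VM type. */
unsigned long vm_type_for_ipa(unsigned int bits)
{
	return bits & 0xff;
}
```

A guest whose memory map ends below 4GB keeps the 32-bit minimum, while one that ends at 1TB needs 40 bits.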
|
|
Instead of just asking for the default VM size, request the maximum
IPA size from the kernel, and use this at VM creation time.
The IPA space is parametrized accordingly.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20210822152526.1291918-3-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Most architectures pass a fixed value for their VM type. However,
arm64 uses it as a parameter describing the size of the guest's
physical address space.
In order to support this, introduce a kvm__get_vm_type() helper
that only returns KVM_VM_TYPE for now.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oupton@google.com>
Link: https://lore.kernel.org/r/20210822152526.1291918-2-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
It turns out that some Linux drivers (like Realtek R8169) fall back to a
device-specific configuration method if the device is not PCI Express
capable:
[ 1.433825] r8169 0000:00:00.0 enp0s0: No native access to PCI extended config space, falling back to CSI
Add the PCI Express Capability Structure and populate it for assigned
devices, as this is how the Linux PCI driver determines if a device is PCI
Express capable.
Because we don't emulate a PCI Express link, a root complex or any slot
related properties, the PCI Express capability is kept as small as possible
by ignoring those fields.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210713170631.155595-5-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
PCI Express comes with an extended addressing scheme, which directly
translates into a bigger device configuration space (256->4096 bytes)
and bigger PCI configuration space (16->256 MB), as well as mandatory
capabilities (power management [1] and PCI Express capability [2]).
However, our virtio PCI implementation implements version 0.9 of the
protocol and it still uses transitional PCI device ID's, so we have
opted to omit the mandatory PCI Express capabilities. For VFIO, the power
management and PCI Express capability are left for a subsequent patch.
[1] PCI Express Base Specification Revision 1.1, section 7.6
[2] PCI Express Base Specification Revision 1.1, section 7.8
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20210713170631.155595-4-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
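The sizes quoted above follow directly from the addressing scheme: 256 buses, 32 devices per bus and 8 functions per device, each with its own configuration space. The sketch below works through that arithmetic and shows the usual ECAM offset layout; the function names are hypothetical.

```c
#include <stdint.h>

#define PCI_MAX_BUSES		256
#define PCI_DEVS_PER_BUS	32
#define PCI_FNS_PER_DEV		8

/* Total configuration window for a given per-function config space size. */
uint64_t cfg_window_size(uint64_t per_fn_bytes)
{
	return (uint64_t)PCI_MAX_BUSES * PCI_DEVS_PER_BUS *
	       PCI_FNS_PER_DEV * per_fn_bytes;
}

/* ECAM offset: bus in bits 27:20, device in 19:15, function in 14:12. */
uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
	return ((uint64_t)bus << 20) |
	       ((uint64_t)(dev & 0x1f) << 15) |
	       ((uint64_t)(fn & 0x7) << 12) |
	       (reg & 0xfff);
}
```

cfg_window_size(256) gives 16 MB for conventional PCI and cfg_window_size(4096) gives 256 MB for PCI Express.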
|
|
Print a more helpful debugging message when a MMIO device hasn't set a
function to generate an FDT node instead of causing a segmentation fault by
dereferencing a NULL pointer.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210713170631.155595-3-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
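The guard described above is a plain NULL check on a function pointer before calling it. The sketch below uses hypothetical type and field names and prints to stderr where the real code would use its debug logging helper.

```c
#include <stdio.h>

/* Hypothetical callback type that writes a device's FDT node. */
typedef void (*generate_fdt_node_fn)(void *fdt);

/* Hypothetical, trimmed-down MMIO device descriptor. */
struct mmio_device {
	const char *name;
	generate_fdt_node_fn generate_fdt_node;
};

/*
 * Returns 0 if the node was generated and -1 if the device did not
 * register a generator, instead of jumping through a NULL pointer.
 */
int emit_fdt_node(const struct mmio_device *dev, void *fdt)
{
	if (!dev->generate_fdt_node) {
		fprintf(stderr, "Debug: no FDT node generator for MMIO device %s\n",
			dev->name);
		return -1;
	}

	dev->generate_fdt_node(fdt);
	return 0;
}
```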
|
|
The device tree code passes the function generate_irq_prop() to MMIO
devices to create the "interrupts" property. The typedef fdt_irq_fn is the
type used to pass the function to the device. It makes more sense for the
typedef to be in fdt.h with the rest of the device tree functions, so move
it there.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210713170631.155595-2-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
pmu__generate_fdt_nodes() checks if the host has support for PMU in a guest
and prints a warning if that's not the case. However, this check is too
late because the function is called after the VCPU has been created, and
VCPU creation fails if KVM_CAP_ARM_PMU_V3 is not available with a rather
unhelpful error:
$ ./vm run -c1 -m64 -f selftest.flat --pmu
# lkvm run --firmware selftest.flat -m 64 -c 1 --name guest-1039
Info: Placing fdt at 0x80200000 - 0x80210000
Fatal: Unable to initialise vcpu
Move the check for KVM_CAP_ARM_PMU_V3 to kvm_cpu__arch_init() before the
VCPU is created so the user can get a more useful error message. This
also matches the behaviour of KVM_CAP_ARM_EL1_32BIT.
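As a rough illustration of checking capabilities before the VCPU is
created, here is a minimal sketch. It assumes a file descriptor that
accepts KVM_CHECK_EXTENSION; the helper names and the PMU message are
made up for this example and are not kvmtool's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: true if the KVM fd reports the capability. */
static bool kvm_supports(int kvm_fd, long cap)
{
	assert(cap > 0);
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap) > 0;
}

/*
 * Hypothetical pre-flight check, meant to run before KVM_CREATE_VCPU.
 * Returns 0 if every requested feature is available, -1 otherwise.
 */
static int check_vcpu_features(int kvm_fd, bool want_pmu, bool want_aarch32)
{
	if (want_aarch32 && !kvm_supports(kvm_fd, KVM_CAP_ARM_EL1_32BIT)) {
		fprintf(stderr, "Fatal: 32bit guests are not supported\n");
		return -1;
	}
	if (want_pmu && !kvm_supports(kvm_fd, KVM_CAP_ARM_PMU_V3)) {
		fprintf(stderr, "Fatal: PMU is not supported by the host\n");
		return -1;
	}
	return 0;
}
```

A failing ioctl() is treated the same as an unsupported capability, so the
user gets a specific message instead of the generic VCPU init failure.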
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210415131725.105675-1-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The "run" command accepts a new option (--vsock <cid>) which specifies the
guest CID. For instance:
$ lkvm run --kernel ./bzImage --disk test --vsock 3
This can be tested with nc-vsock: https://github.com/stefanha/nc-vsock.
In the guest:
# modprobe vsock
# nc-vsock -l 1234
In the host:
# modprobe vhost_vsock
# nc-vsock 3 1234
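For readers who want to do the same from their own program instead of
nc-vsock, a minimal sketch of building the address for the host side,
assuming the Linux AF_VSOCK socket family; vsock_addr() is a hypothetical
helper, not something this patch adds:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Hypothetical helper: address of a service listening inside the guest. */
static struct sockaddr_vm vsock_addr(unsigned int cid, unsigned int port)
{
	struct sockaddr_vm addr;

	assert(cid != VMADDR_CID_ANY);
	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = cid;	/* the value given to --vsock, 3 above */
	addr.svm_port = port;	/* the port nc-vsock listens on, 1234 above */
	return addr;
}
```

The result would be passed to connect() on a socket created with
socket(AF_VSOCK, SOCK_STREAM, 0).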
This patch comes from the early submission of G. Campana. On this basis,
I fixed the compilation errors and runtime crashes. Thanks for the work
done by G. Campana.
https://patchwork.kernel.org/patch/9542313/
Signed-off-by: G. Campana <gcampana+kvm@quarkslab.com>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20200915094402.107988-1-tianjia.zhang@linux.alibaba.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Using the RTC device at its legacy I/O address as set by IBM in 1981
was a kludge we used for simplicity on ARM platforms as well.
However this imposes problems due to their missing alignment and overlap
with the PCI I/O address space.
Now that we can switch a device easily between using ioports and
MMIO, let's move the RTC out of the first 4K of memory on ARM platforms.
That should be transparent for well behaved guests, since the change is
naturally reflected in the device tree.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-23-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Using the UART devices at their legacy I/O addresses as set by IBM in
1981 was a kludge we used for simplicity on ARM platforms as well.
However this imposes problems due to their missing alignment and overlap
with the PCI I/O address space.
Now that we can switch a device easily between using ioports and MMIO,
let's move the UARTs out of the first 4K of memory on ARM platforms.
That should be transparent for well behaved guests, since the change is
naturally reflected in the device tree. Even "earlycon" keeps working,
as the stdout-path property is adjusted automatically.
People providing direct earlycon parameters via the command line need to
adjust it to: "earlycon=uart,mmio,0x1000000".
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-22-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The hardcoded memory map we expose to a guest is currently described
using a series of partially interconnected preprocessor constants,
which is hard to read and follow.
In preparation for moving the UART and RTC to some different MMIO
region, document the current map with some ASCII art, and clean up the
definition of the sections.
This changes the only internally used value of ARM_MMIO_AREA, to better
align with its actual meaning and future extensions.
No functional change.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-21-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that all users of the dedicated ioport trap handler interface are
gone, we can retire the code associated with it.
This removes ioport.c and ioport.h, along with removing prototypes from
other header files.
This also transfers the responsibility for port I/O trap handling
entirely into the new routine in mmio.c.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-20-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide an emulation function compatible with the MMIO prototype.
Merge the existing _in and _out handlers to adhere to that MMIO
interface, and register these using the new registration function.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-19-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide an emulation function compatible with the MMIO prototype.
Adjust the existing MMIO callback routine to automatically determine
the region this trap came through, and call the existing I/O handlers.
Register the ioport region using the new registration function.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-18-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the vfio device has a trap handler adhering to the MMIO fault
handler prototype, let's switch over to the joint registration routine.
This allows us to get rid of the ioport shim routines.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-17-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide an emulation function compatible with the MMIO prototype.
Adjust the I/O port trap handler to use that new function, and provide
shims to implement the old ioport interface, for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-16-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the serial device has a trap handler adhering to the MMIO fault
handler prototype, let's switch over to the joint registration routine.
This allows us to get rid of the ioport shim routines.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-15-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide an emulation function compatible with the MMIO prototype.
Adjust the trap handler to use that new function, and provide shims to
implement the old ioport interface, for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-14-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
To be able to use the VESA device with the new generic I/O trap handler,
we need to use the different MMIO handler callback routine.
Replace the existing dummy in and out handlers with a joint dummy
MMIO handler, and register this using the new registration function.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-13-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the RTC device has a trap handler adhering to the MMIO fault
handler prototype, let's switch over to the joint registration routine.
This allows us to get rid of the ioport shim routines.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-12-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide emulation functions compatible with the MMIO prototype.
Merge the two different trap handlers into one function, checking for
read/write and data/index register inside.
Adjust the trap handlers to use that new function, and provide shims to
implement the old ioport interface, for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-11-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the x86 I/O ports have trap handlers adhering to the MMIO fault
handler prototype, let's switch over to the joint registration routine.
This allows us to get rid of the ioport shim routines.
Since the debug output was done in ioport.c, we would lose this
functionality when moving over to the MMIO handlers. So bring this back
here explicitly, by introducing debug_io().
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-10-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide emulation functions compatible with the MMIO
prototype.
Adjust the trap handlers to use that new function, and provide shims to
implement the old ioport interface, for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-9-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the PC keyboard has a trap handler adhering to the MMIO fault
handler prototype, let's switch over to the joint registration routine.
This allows us to get rid of the ioport shim routines.
Make the kbd_init() function static on the way.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-8-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
With the planned retirement of the special ioport emulation code, we
need to provide an emulation function compatible with the MMIO
prototype.
Adjust the trap handler to use that new function, and provide shims to
implement the old ioport interface, for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-7-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The i8042 is clearly an 8-bit era device, so there is little room for
32-bit registers.
Clean up the data types used.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-6-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In their core functionality MMIO and I/O port traps are not really
different, yet we still have two totally separate code paths for
handling them. Devices need to decide on one conduit or need to provide
different handler functions for each of them.
Extend the existing MMIO emulation to also cover ioport handlers.
This just adds another RB tree root for holding the I/O port handlers,
but otherwise uses the same tree population and lookup code.
"ioport" or "mmio" just become a flag in the registration function.
Provide wrappers to not break existing users, and allow an easy
transition for the existing ioport handlers.
This also means that ioport handlers now can use the same emulation
callback prototype as MMIO handlers, which means we have to migrate them
over. To allow a smooth transition, we hook up the new I/O emulate
function to the end of the existing ioport emulation code.
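To illustrate the idea of one registration path with the bus type as a
flag, here is a minimal sketch. It swaps the RB trees for small fixed-size
tables to stay short, and every name in it is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified callback prototype shared by MMIO and ioport handlers. */
typedef void (*trap_fn)(uint64_t addr, uint8_t *data, uint32_t len,
			bool is_write, void *ptr);

enum bus_type { BUS_MMIO, BUS_IOPORT };

struct trap_region {
	uint64_t base, len;
	trap_fn fn;
	void *ptr;
};

#define MAX_REGIONS 16
/* One table per bus type, standing in for one RB tree root per type. */
static struct trap_region regions[2][MAX_REGIONS];
static size_t nr_regions[2];

static bool register_trap(enum bus_type bus, uint64_t base, uint64_t len,
			  trap_fn fn, void *ptr)
{
	assert(fn != NULL && len > 0);
	if (nr_regions[bus] == MAX_REGIONS)
		return false;
	regions[bus][nr_regions[bus]++] =
		(struct trap_region){ base, len, fn, ptr };
	return true;
}

/* Same lookup code for both bus types; only the table differs. */
static bool emulate_trap(enum bus_type bus, uint64_t addr, uint8_t *data,
			 uint32_t len, bool is_write)
{
	for (size_t i = 0; i < nr_regions[bus]; i++) {
		struct trap_region *r = &regions[bus][i];

		if (addr >= r->base && addr - r->base < r->len) {
			r->fn(addr, data, len, is_write, r->ptr);
			return true;
		}
	}
	return false;
}

/* Example device: a one-byte scratch register. */
static void scratch_io(uint64_t addr, uint8_t *data, uint32_t len,
		       bool is_write, void *ptr)
{
	uint8_t *reg = ptr;

	(void)addr;
	(void)len;
	if (is_write)
		*reg = data[0];
	else
		data[0] = *reg;
}
```

A device that wants to be reachable on both buses registers the same
handler twice, once per bus type.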
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-5-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The ioport routines support a special way of registering FDT node
generator functions. There is no reason to have this separate from the
already existing way via the device header.
Now that the only user of this special ioport variety has been
transferred, we can retire this code, to simplify ioport handling.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-4-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
At the moment we use the .generate_fdt_node member of the ioport ops
structure to store the function pointer for the FDT node generator
function. ioport__register() will then put a wrapper and this pointer
into the device header.
The serial device is the only device making use of this special ioport
feature, so let's move this over to using the device header directly.
This will allow us to get rid of this .generate_fdt_node member in the
ops and simplify the code.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-3-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since x86 had a special need for registering tons of special I/O ports,
we had an ioport__setup_arch() callback, to allow each architecture
to do the same. As it turns out no one uses it besides x86, so we remove
that unnecessary abstraction.
The generic function was registered via a device_base_init() call, so
we just do the same for the x86 specific function only, and can remove
the unneeded ioport__setup_arch().
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210315153350.19988-2-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
KVM host support for the arm architecture was removed in commit
541ad0150ca4 ("arm: Remove 32bit KVM host support"). When trying to sync
KVM headers we get this error message:
$ util/update_headers.sh /path/to/linux
cp: cannot stat '/path/to/linux/arch/arm/include/uapi/asm/kvm.h': No such file or directory
Do not attempt to copy the KVM headers for that architecture.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20200810153828.216821-1-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The guest programs used_event in the avail ring to let the host know when
it wants a notification from the device. The host notifies the guest when
the used ring index passes used_event. It is possible for the guest to
submit a buffer, and then go into uninterruptible sleep waiting for this
notification.
The virtio-blk guest driver, in the notification callback virtblk_done(),
increments the last known used ring index, then sets used_event to this
value, which means it will get a notification after the next buffer is
consumed by the host. virtblk_done() exits after the value of the used
ring idx has been propagated from the host thread.
On the host side, the virtio-blk device increments the used ring index,
then compares it to used_event to decide if a notification should be sent.
This is a common communication pattern between two threads, called store
buffer. Memory barriers are needed in order for the pattern to work
correctly, otherwise it is possible for the host to miss sending a required
notification.
Initial state: vring.used.idx = 2, vring.used_event = 1 (idx passes
used_event, which means kvmtool notifies the guest).
GUEST (in virtblk_done())                 | KVMTOOL (in virtio_blk_complete())
                                          |
(increment vq->last_used_idx = 2)         |
// virtqueue_enable_cb_prepare_split():   | // virt_queue__used_idx_advance():
write vring.used_event = 2                | write vring.used.idx = 3
// virtqueue_poll():                      |
mb()                                      | wmb()
// virtqueue_poll_split():                | // virt_queue__should_signal():
read vring.used.idx = 2                   | read vring.used_event = 1
// virtblk_done() exits.                  | // No notification.
The write memory barrier on the host side is not enough to prevent
reordering of the read in the kvmtool thread, which can lead to the guest
thread waiting forever for IO to complete. Replace it with a full memory
barrier to get the correct store buffer pattern described in the Linux
litmus test SB+fencembonceonces.litmus, which forbids both threads reading
the initial values.
Also move the barrier into virtio_queue__should_signal(), because the barrier
is needed for notifications to work correctly, and it makes more sense to
have it in the function that determines if the host should notify the
guest.
Reported-by: Anvay Virkar <anvay.virkar@arm.com>
Suggested-by: Anvay Virkar <anvay.virkar@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20200804145317.51633-1-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
While introducing new code to extract the kernel offset from the
image, commit fd0a05b ("arm64: Obtain text offset from kernel image")
introduced a regression where something such as:
./lkvm run -c 8 -p earlycon <(zcat /boot/vmlinuz-5.8.0-rc5-00172-ga161216e31ba)
now fails to load the kernel, as the file descriptor is not seekable.
Let's assume the good old 0x80000 offset when the seek syscall fails,
with a warning for good measure.
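A minimal sketch of the fallback, assuming the text offset is the
little-endian 64-bit field at byte 8 of the image. kernel_text_offset() is
a hypothetical helper and it skips the other header checks:

```c
#define _POSIX_C_SOURCE 200809L	/* for pread() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define DEFAULT_TEXT_OFFSET 0x80000ULL

/* Hypothetical helper: text offset from the image, or the default. */
static uint64_t kernel_text_offset(int fd)
{
	uint8_t raw[8];
	uint64_t offset = 0;
	off_t cur;

	assert(fd >= 0);
	cur = lseek(fd, 0, SEEK_CUR);
	if (cur == (off_t)-1) {
		/* Pipes and process substitution end up here (ESPIPE). */
		fprintf(stderr,
			"Warning: image not seekable, assuming offset 0x%llx\n",
			(unsigned long long)DEFAULT_TEXT_OFFSET);
		return DEFAULT_TEXT_OFFSET;
	}

	/* pread() leaves the file position untouched for the later load. */
	if (pread(fd, raw, sizeof(raw), cur + 8) != (ssize_t)sizeof(raw))
		return DEFAULT_TEXT_OFFSET;

	for (int i = 7; i >= 0; i--)
		offset = (offset << 8) | raw[i];
	return offset;
}
```

Feeding it the read end of a pipe, as the zcat example does, takes the
warning path and returns 0x80000.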
Fixes: fd0a05b ("arm64: Obtain text offset from kernel image")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200716120801.2996-1-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When the host doesn't support 32bit guests, kvmtool fails
without a proper message about what is wrong. For example:
$ lkvm run -c 1 Image --aarch32
# lkvm run -k Image -m 256 -c 1 --name guest-105618
Fatal: Unable to initialise vcpu
Given that there is no other easy way to check if the host supports 32bit
guests, it is always good to report this by checking the capability, rather
than leaving the users to hunt this down by looking at the code!
After this patch:
$ lkvm run -c 1 Image --aarch32
# lkvm run -k Image -m 256 -c 1 --name guest-105695
Fatal: 32bit guests are not supported
Reported-by: Sami Mujawar <sami.mujawar@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200701142002.51654-1-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Recent changes made to Linux 5.8 have highlighted that kvmtool
hardcodes the text offset instead of reading it from the arm64
image itself.
To address this, import the image header structure into kvmtool
and do the right thing. 32bit guests are still loaded to their
usual locations.
While we're at it, check the image magic and default the text
offset to 0x80000 when image_size is 0, as described in the
kernel's booting.rst document.
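A minimal sketch of that logic, assuming the 64-byte header layout from
booting.rst (text_offset at byte 8, image_size at byte 16, magic at byte
56, all little endian). The function and macro names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARM64_IMAGE_MAGIC	0x644d5241u	/* "ARM\x64" */
#define DEFAULT_TEXT_OFFSET	0x80000ULL

static uint64_t le_read(const uint8_t *p, int bytes)
{
	uint64_t v = 0;

	for (int i = bytes - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * hdr points at the first 64 bytes of the Image.
 * Returns 0 and fills *offset on success, -1 if the magic does not match.
 */
static int arm64_text_offset(const uint8_t hdr[64], uint64_t *offset)
{
	assert(hdr != NULL && offset != NULL);

	if (le_read(hdr + 56, 4) != ARM64_IMAGE_MAGIC)
		return -1;

	if (le_read(hdr + 16, 8) == 0)		/* image_size == 0 */
		*offset = DEFAULT_TEXT_OFFSET;
	else
		*offset = le_read(hdr + 8, 8);	/* text_offset */
	return 0;
}
```

What to do when the magic is missing is left to the caller here.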
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20200608152801.1415902-1-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
KVM_SET_USER_MEMORY_REGION will fail if the guest physical address is
not aligned to the page size. However, it is legal for a guest to
program an address which isn't aligned to the page size. Trap and
emulate MMIO accesses to the region when that happens.
Without this patch, when assigning a Seagate Barracuda hard drive to a
VM I was seeing these errors:
[ 0.286029] pci 0000:00:00.0: BAR 0: assigned [mem 0x41004600-0x4100467f]
Error: 0000:01:00.0: failed to register region with KVM
Error: [1095:3132] Error activating emulation for BAR 0
[..]
[ 10.561794] irq 13: nobody cared (try booting with the "irqpoll" option)
[ 10.563122] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-seattle-00009-g909b20467ed1 #133
[ 10.563124] Hardware name: linux,dummy-virt (DT)
[ 10.563126] Call trace:
[ 10.563134] dump_backtrace+0x0/0x140
[ 10.563137] show_stack+0x14/0x20
[ 10.563141] dump_stack+0xbc/0x100
[ 10.563146] __report_bad_irq+0x48/0xd4
[ 10.563148] note_interrupt+0x288/0x378
[ 10.563151] handle_irq_event_percpu+0x80/0x88
[ 10.563153] handle_irq_event+0x44/0xc8
[ 10.563155] handle_fasteoi_irq+0xb4/0x160
[ 10.563157] generic_handle_irq+0x24/0x38
[ 10.563159] __handle_domain_irq+0x60/0xb8
[ 10.563162] gic_handle_irq+0x50/0xa0
[ 10.563164] el1_irq+0xb8/0x180
[ 10.563166] arch_cpu_idle+0x10/0x18
[ 10.563170] do_idle+0x204/0x290
[ 10.563172] cpu_startup_entry+0x20/0x40
[ 10.563175] rest_init+0xd4/0xe0
[ 10.563180] arch_call_rest_init+0xc/0x14
[ 10.563182] start_kernel+0x420/0x44c
[ 10.563183] handlers:
[ 10.563650] [<000000001e474803>] sil24_interrupt
[ 10.564559] Disabling IRQ #13
[..]
[ 11.832916] ata1: spurious interrupt (slot_stat 0x0 active_tag -84148995 sactive 0x0)
[ 12.045444] ata_ratelimit: 1 callbacks suppressed
With this patch, I don't see the errors and the device works as
expected.
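The decision itself is a simple alignment test. A minimal sketch, with a
hypothetical helper name and the page size passed in by the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the BAR address can be handed to
 * KVM_SET_USER_MEMORY_REGION, false if accesses have to be trapped
 * and emulated instead.
 */
static bool bar_is_page_aligned(uint64_t guest_phys, uint64_t page_size)
{
	/* page_size must be a power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (guest_phys & (page_size - 1)) == 0;
}
```

For the BAR in the log above, 0x41004600 with 4K pages, the helper returns
false, so that region would take the trap-and-emulate path.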
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-13-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
PCI now supports configurable BARs. Get rid of the no longer needed,
Linux-only, fdt property.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-12-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
BARs are used by the guest to configure the access to the PCI device by
writing the address to which the device will respond. The basic idea for
adding support for reassignable BARs is straightforward: deactivate
emulation for the memory region described by the old BAR value, and
activate emulation for the new region.
BAR reassignment can be done while device access is enabled and memory
regions for different devices can overlap as long as no access is made to
the overlapping memory regions. This means that it is legal for the BARs of
two distinct devices to point to an overlapping memory region, and indeed,
this is how Linux does resource assignment at boot. To account for this
situation, the simple algorithm described above is enhanced to scan for all
devices and:
- Deactivate emulation for any BARs that might overlap with the new BAR
value.
- Enable emulation for any BARs that were overlapping with the old value
after the BAR has been updated.
Activating/deactivating emulation of a memory region has side effects. In
order to prevent the execution of the same callback twice we now keep track
of the state of the region emulation. For example, this can happen if we
program a BAR with an address that overlaps a second BAR, thus deactivating
emulation for the second BAR, and then we disable all region accesses to
the second BAR by writing to the command register.
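The algorithm can be sketched as follows. This is a simplified model with
hypothetical names. It ignores the command register and, as an assumption
of the sketch, leaves a BAR deactivated if it overlaps both the old and
the new value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bar {
	uint64_t base;
	uint64_t size;
	bool active;		/* is emulation enabled for the region? */
	unsigned toggles;	/* how often the callback actually ran */
};

static bool overlap(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
	return a < b + b_size && b < a + a_size;
}

/* Tracks the state so the same callback never runs twice in a row. */
static void set_active(struct bar *bar, bool active)
{
	if (bar->active == active)
		return;
	bar->active = active;
	bar->toggles++;
}

static void reassign_bar(struct bar *bars, size_t nr, size_t idx,
			 uint64_t new_base)
{
	struct bar *bar;
	uint64_t old_base;
	size_t i;

	assert(bars != NULL && idx < nr);
	bar = &bars[idx];
	old_base = bar->base;

	/* Deactivate the old region and anything the new value overlaps. */
	set_active(bar, false);
	for (i = 0; i < nr; i++)
		if (i != idx && overlap(new_base, bar->size,
					bars[i].base, bars[i].size))
			set_active(&bars[i], false);

	bar->base = new_base;
	set_active(bar, true);

	/* Re-enable BARs that only overlapped the old value. */
	for (i = 0; i < nr; i++) {
		if (i == idx)
			continue;
		if (overlap(old_base, bar->size, bars[i].base, bars[i].size) &&
		    !overlap(new_base, bar->size, bars[i].base, bars[i].size))
			set_active(&bars[i], true);
	}
}
```

Moving one BAR on top of another and then away again deactivates and
reactivates the second BAR exactly once each.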
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-11-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
During configuration of the BAR addresses, a Linux guest disables and
enables access to I/O and memory space. When access is disabled, we don't
stop emulating the memory regions described by the BARs. Now that we have
callbacks for activating and deactivating emulation for a BAR region,
let's use that to stop emulation when access is disabled, and
re-activate it when access is re-enabled.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-10-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Implement callbacks for activating and deactivating emulation for a BAR
region. This is in preparation for allowing a guest operating system to
enable and disable access to I/O or memory space, or to reassign the
BARs.
The emulated vesa device framebuffer isn't designed to allow stopping and
restarting at arbitrary points in the guest execution. Furthermore, on x86,
the kernel will not change the BAR addresses, which on bare metal are
programmed by the firmware, so take the easy way out and refuse to
activate/deactivate emulation for the BAR regions. We also take this
opportunity to make the vesa emulation code more consistent by moving all
static variable definitions into one place, at the top of the file.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-9-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A vesa device is used by the SDL, GTK or VNC framebuffers. Don't allow the
user to specify more than one of these options because kvmtool will create
identical vesa devices and bad things will happen:
$ ./lkvm run -c2 -m2048 -k bzImage --sdl --gtk
# lkvm run -k bzImage -m 2048 -c 2 --name guest-10159
Error: device region [d0000000-d012bfff] would overlap device region [d0000000-d012bfff]
*** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 ***
*** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 ***
*** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 ***
======= Backtrace: =========
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fae0ed447e5]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fae0ed4d37a]
(+0x777e5)[0x7fae0ed447e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fae0ed447e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fae0ed4d37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fae0ed5153c]
*** Error in `./lkvm': free(): invalid pointer: 0x00007fad78002e40 ***
/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_string_free+0x3b)[0x7fae0f814dab]
/lib/x86_64-linux-gnu/libglib-2.0.so.0(g_string_free+0x3b)[0x7fae0f814dab]
/usr/lib/x86_64-linux-gnu/libgtk-3.so.0(+0x21121c)[0x7fae1023321c]
/usr/lib/x86_64-linux-gnu/libgtk-3.so.0(+0x21121c)[0x7fae1023321c]
======= Backtrace: =========
Aborted (core dumped)
The vesa device is explicitly created during the initialization phase of
the above framebuffers. Also remove the superfluous check for their
existence.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-8-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
After writing to the device fd as part of the PCI configuration space
emulation, we read back from the device to make sure that the write
finished. The value is read back into the PCI configuration space and
afterwards, the same value is copied by the PCI emulation code. Let's
read from the device fd into a temporary variable, to prevent this
double write.
The double write is harmless in itself. But when we implement
reassignable BARs, we need to keep track of the old BAR value, and the
VFIO code is overwriting it.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-7-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
From PCI Local Bus Specification Revision 3.0, section 3.8 "64-Bit Bus
Extension":
"The bandwidth requirements for I/O and configuration transactions cannot
justify the added complexity, and, therefore, only memory transactions
support 64-bit data transfers".
Further down, the spec also describes the possible responses of a target
which has been requested to do a 64-bit transaction. Limit the transaction
to the lower 32 bits, to match the second accepted behaviour.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-6-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Let's be consistent and reserve ioports when we are configuring the BAR,
not when we map it, just like we do with mmio regions.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-5-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The struct virtio_pci fields port_addr, mmio_addr and msix_io_block
represent the same addresses that are written in the corresponding BARs.
Remove this duplication of information and always use the address from the
BAR. This will make our life a lot easier when we add support for
reassignable BARs, because we won't have to update the fields on each BAR
change.
No functional changes.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-4-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We're going to be checking the BAR type, the address written to it and if
access to memory or I/O space is enabled quite often when we add support
for reassignable BARs; make our life easier by adding helpers for it.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-3-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
kvmtool uses brlock for protecting accesses to the ioport and mmio
red-black trees. brlock allows concurrent reads, but only one writer, which
is assumed not to be a VCPU thread (for more information see commit
0b907ed2eaec ("kvm tools: Add a brlock")). This is done by issuing a
compiler barrier on read and pausing the entire virtual machine on writes.
When KVM_BRLOCK_DEBUG is defined, brlock instead uses a pthread read/write
lock.
When we implement reassignable BARs, the mmio or ioport mapping will
be done as a result of a VCPU mmio access. When brlock is a pthread
read/write lock, it means that we will try to acquire a write lock with the
read lock already held by the same VCPU and we will deadlock. When it's
not, a VCPU will have to call kvm__pause, which means the virtual machine
will stay paused forever.
Let's avoid all this by using a mutex and reference counting the red-black
tree entries. This way we can guarantee that we won't unregister a node
that another thread is currently using for emulation.
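The scheme can be sketched as follows, with the tree itself left out. The
struct and function names are hypothetical, and a real implementation
would also free the node once it is unregistered:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct trap_node {
	unsigned int refcount;	/* threads emulating through this node */
	bool remove;		/* unregister requested while in use */
	bool registered;	/* still reachable through the tree */
};

static pthread_mutex_t trap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup: take a reference under the lock, emulate without the lock. */
static bool node_get(struct trap_node *node)
{
	bool ok;

	assert(node != NULL);
	pthread_mutex_lock(&trap_lock);
	ok = node->registered && !node->remove;
	if (ok)
		node->refcount++;
	pthread_mutex_unlock(&trap_lock);
	return ok;
}

/* Drop a reference. Returns true if this call did a deferred removal. */
static bool node_put(struct trap_node *node)
{
	bool removed = false;

	assert(node != NULL);
	pthread_mutex_lock(&trap_lock);
	assert(node->refcount > 0);
	if (--node->refcount == 0 && node->remove) {
		node->registered = false;
		removed = true;
	}
	pthread_mutex_unlock(&trap_lock);
	return removed;
}

/* Remove now if unused, else defer to the last node_put(). */
static bool node_unregister(struct trap_node *node)
{
	bool removed = false;

	assert(node != NULL);
	pthread_mutex_lock(&trap_lock);
	if (node->refcount == 0) {
		node->registered = false;
		removed = true;
	} else {
		node->remove = true;
	}
	pthread_mutex_unlock(&trap_lock);
	return removed;
}
```

Since the lock is not held while the handler runs, a VCPU can unregister
or register regions from inside an emulation callback without deadlocking.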
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/1589470709-4104-2-git-send-email-alexandru.elisei@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
GCC 10.1 generates a warning in net/uip/csum.c about exceeding a buffer
limit in a memcpy operation:
------------------
In function 'memcpy',
inlined from 'uip_csum_udp' at net/uip/csum.c:58:3:
/usr/include/aarch64-linux-gnu/bits/string_fortified.h:34:10: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
34 | return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from net/uip/csum.c:1:
net/uip/csum.c: In function 'uip_csum_udp':
include/kvm/uip.h:132:6: note: at offset 0 to object 'sport' with size 2 declared here
132 | u16 sport;
------------------
This warning originates from the code taking the address of the "sport"
member, then using that with some pointer arithmetic in a memcpy call.
GCC now sees that the object is only a u16, so copying 12 bytes into it
cannot be any good.
It's somewhat debatable whether this is a legitimate warning, as there
is enough storage at that place, and we knowingly use the struct and
its variable-sized member at the end.
However we can also rewrite the code, to not abuse the "&" operation of
some *member*, but take the address of the struct itself.
This makes the code less dodgy, and indeed appeases GCC 10.
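As an illustration only (struct pkt and copy_from_sport are hypothetical and
much simpler than the real uip structures), the rewritten access pattern
looks roughly like this: the copy starts from the struct's own address plus
the member offset, so the compiler no longer treats a 2-byte object as the
memcpy operand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet layout with a variable-sized member at the end. */
struct pkt {
	uint8_t  eth[14];
	uint16_t sport;
	uint16_t dport;
	uint16_t len;
	uint16_t csum;
	uint8_t  payload[];
};

/* Copy n bytes starting at the 'sport' field into dst. */
static void copy_from_sport(uint8_t *dst, const struct pkt *p, size_t n)
{
	/* Start from the struct itself instead of from &p->sport. */
	memcpy(dst, (const uint8_t *)p + offsetof(struct pkt, sport), n);
}
```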
Reported-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20200518125649.216416-1-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
On arm and arm64 we expose the Motorola RTC emulation to the guest,
but never advertised this in the device tree.
EDK-2 seems to rely on this device, but on its hardcoded address. To
make this more future-proof, add a DT node with the address in it.
EDK-2 can then read the proper address from there, and we can change
this address later (with the flexible memory layout).
Please note that an arm64 Linux kernel is not ready to use this device,
there are some include files missing under arch/arm64 to compile the
driver. I hacked this up in the kernel, just to verify this DT snippet
is correct, but don't see much value in enabling this properly in
Linux.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Link: https://lore.kernel.org/r/20200514094553.135663-1-andre.przywara@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
So far the (legacy) IRQ line for a PCI device is allocated in devices.c,
which should actually not take care of that. Since we allocate all other
device specific resources in the actual device emulation code, the IRQ
should not be something special.
Remove the PCI specific code from devices.c, and move the IRQ line
allocation to the PCI code.
This drops the IRQ line from the VESA device, since it does not use one.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
At the moment we trap *every* access to the flash memory, even when we
are in array read mode (which just directly copies from the storage
array to the guest).
To improve performance, allow cacheable mappings, and avoid fatal traps
on unsupported instructions (on ARM), export a read-only memslot to the
guest when the flash is in read-array mode. A guest then does not need to
trap on read accesses.
A write command (which always traps) will revoke this mapping if the
read mode changes.
This reduces the number of read traps from more than 800,000 to a few
hundred when booting into the UEFI shell.
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A KVM memslot has a flags field, which allows marking a region as
read-only.
Add another memory type bit to allow kvmtool-internal users to map a
write-protected region. Write access would trap and can be handled by
the MMIO emulation, which should register on the same guest address
region.
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When we want to map a device region into the guest address space, first we
perform an mmap on the device fd. The resulting VMA is a mapping between
host userspace addresses and physical addresses associated with the device.
Next, we create a memslot, which populates the stage 2 table with the
mappings between guest physical addresses and the device physical addresses.
However, when we want to unmap the device from the guest address space, we
only call munmap, which destroys the VMA and the stage 2 mappings, but
doesn't destroy the memslot and kvmtool's internal mem_bank structure
associated with the memslot.
This has been perfectly fine so far, because we only unmap a device region
when we exit kvmtool. This will change when we add support for
reassignable BARs, and we will have to unmap vfio regions as the guest
kernel writes new addresses in the BARs. This can lead to two possible
problems:
- We refuse to create a valid BAR mapping because of a stale mem_bank
structure which belonged to a previously unmapped region.
- It is possible that the mmap in vfio_map_region returns the same address
that was used to create a memslot, but was unmapped by vfio_unmap_region.
Guest accesses to the device memory will fault because the stage 2
mappings are missing, and this can lead to performance degradation.
Let's do the right thing and destroy the memslot and the mem_bank struct
associated with it when we unmap a vfio region. Set host_addr to NULL after
the munmap call so we won't try to unmap an address which is currently used
by the process for something else if vfio_unmap_region gets called twice.
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The EDK II UEFI firmware implementation requires some storage for the EFI
variables, which is typically some flash storage.
Since this is already supported on the EDK II side, we add a CFI flash
emulation to kvmtool.
This is backed by a file, specified via the --flash or -F command line
option. Any flash writes done by the guest will immediately be reflected
into this file (kvmtool mmap's the file).
The flash will be limited to the largest power-of-2 size that fits in the
file, so only the first 2 MB of a 3 MB file will be used.
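A sketch of the size computation implied by the 3 MB example, assuming the
file size is simply rounded down to a power of two. round_down_pow2 is a
hypothetical helper name, not necessarily what the emulation uses.

```c
#include <stdint.h>

/* Largest power of two that is <= size; returns 0 for size == 0. */
static uint64_t round_down_pow2(uint64_t size)
{
	uint64_t p = 1;

	if (size == 0)
		return 0;
	/* Compare against size / 2 so that p * 2 can never overflow. */
	while (p <= size / 2)
		p *= 2;
	return p;
}
```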
This implements a CFI flash using the "Intel/Sharp extended command
set", as specified in:
- JEDEC JESD68.01
- JEDEC JEP137B
- Intel Application Note 646
Some gaps in those specs have been filled by looking at real devices and
other implementations (QEMU, Linux kernel driver).
At the moment this relies on DT to advertise the base address of the
flash memory (mapped into the MMIO address space) and is only enabled
for ARM/ARM64. The emulation itself is architecture agnostic, though.
This is one missing piece toward a working UEFI boot with kvmtool on
ARM guests, the other is to provide writable PCI BARs, which is WIP.
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Raphael Gault <raphael.gault@arm.com>
[Andre: rewriting and fixing]
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
At the moment the IRQ line for a virtio-mmio device is assigned in the
generic device__register() routine in devices.c, by calling back into
virtio-mmio.c. This does not only sound slightly convoluted, but also
breaks when we try to register an MMIO device that is not a virtio-mmio
device. In this case container_of will return a bogus pointer (as it
assumes a struct virtio_mmio), and the IRQ allocation routine will
corrupt some data in the device_header (for instance the first byte
of the "data" pointer).
Simply assign the IRQ directly in virtio_mmio_init(), before calling
device__register(). This avoids the problem and looks actually much more
straightforward.
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A PCI device with a MSI capability enabling Multiple MSI messages
(through the Multiple Message Enable field in the Message Control
register[6:4]) is expected to drive the Message Data lower bits (number
determined by the number of selected vectors) to generate the
corresponding MSI message writes on the PCI bus.
Therefore, KVM expects the MSI data lower bits (a number of
bits that depend on bits [6:4] of the Message Control
register - which in turn control the number of vectors
allocated) to be set-up by kvmtool while programming the
MSI IRQ routing entries to make sure the MSI entries can
actually be demultiplexed by KVM and IRQ routes set-up
accordingly so that when the actual HW fires an interrupt, KVM can
route it to the correct entry in the interrupt controller
(and set-up a correct passthrough route for directly
injected interrupt).
Current kvmtool code does not set-up Message data entries
correctly for multi-MSI vectors - the data field is left
as programmed in the MSI capability by the guest for all
vector entries, triggering IRQs misrouting.
Fix it.
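To make the expected encoding concrete, here is a sketch under the
assumption that the vector index simply replaces the low bits of the Message
Data value. The macro and function names are illustrative, not kvmtool's.

```c
#include <stdint.h>

/* Multiple Message Enable field, Message Control register bits [6:4]. */
#define MSI_FLAGS_QSIZE_MASK	0x0070u
#define MSI_FLAGS_QSIZE_SHIFT	4

/* Number of enabled vectors: 2^MME. */
static unsigned int msi_nr_vectors(uint16_t ctrl)
{
	return 1u << ((ctrl & MSI_FLAGS_QSIZE_MASK) >> MSI_FLAGS_QSIZE_SHIFT);
}

/* Message Data for vector 'vec': low log2(nr) bits carry the index. */
static uint16_t msi_vector_data(uint16_t data, uint16_t ctrl, unsigned int vec)
{
	unsigned int nr = msi_nr_vectors(ctrl);

	return (uint16_t)((data & ~(nr - 1)) | (vec & (nr - 1)));
}
```

With four vectors enabled and a guest-programmed data value of 0x4123, the
routing entries would carry 0x4120, 0x4121, 0x4122 and 0x4123.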
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
If we try to register a range of ports which overlaps with another, already
registered, I/O ports region then device emulation for that region will not
work anymore. There's nothing sane that the ioport emulation layer can do
in this case so refuse to allocate the port. This matches the behavior of
kvm__register_mmio.
There's no need to protect allocating a new ioport struct with a lock, so
move the lock to protect the actual ioport insertion in the tree.
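The refusal can be pictured with a small stand-alone registry. This is a
sketch only: it uses a fixed array instead of the red-black tree, omits the
locking, and range_register is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

struct port_range {
	uint16_t first;
	uint16_t last;	/* inclusive */
};

#define MAX_RANGES 16

static struct port_range ranges[MAX_RANGES];
static int nr_ranges;

/* Register [first, first + count - 1]; refuse any overlap. */
static int range_register(uint16_t first, uint16_t count)
{
	uint32_t last = (uint32_t)first + count - 1;
	int i;

	if (count == 0 || last > UINT16_MAX)
		return -EINVAL;

	/* Two inclusive ranges overlap if each starts before the other ends. */
	for (i = 0; i < nr_ranges; i++)
		if (first <= ranges[i].last && ranges[i].first <= last)
			return -EEXIST;

	if (nr_ranges == MAX_RANGES)
		return -ENOSPC;

	ranges[nr_ranges].first = first;
	ranges[nr_ranges].last = (uint16_t)last;
	nr_ranges++;
	return 0;
}
```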
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Implemented BARs have a non-zero address and a size. Let's set the size
for BAR 0.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Failing to mmap or to create a memslot means that device emulation
will not work; don't ignore it.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
An error returned by device__register, kvm__register_mmio and
ioport__register means that the device will
not be emulated properly. Annotate the functions with __must_check, so we
get a compiler warning when this error is ignored.
And fix several instances where the caller returns 0 even if the function
failed.
Also make sure the ioport emulation code uses ioport_remove consistently.
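For reference, a sketch of the annotation pattern. The macro is defined
locally here, and register_device and init_device are made-up functions, not
the ones touched by this patch.

```c
#include <errno.h>

/* GCC and Clang warn when the result of such a function is discarded. */
#define __must_check __attribute__((warn_unused_result))

/* Made-up registration function that can fail. */
static __must_check int register_device(int id)
{
	return id < 0 ? -EINVAL : 0;
}

/* Caller propagates the error instead of returning 0 unconditionally. */
static int init_device(int id)
{
	int ret;

	ret = register_device(id);
	if (ret < 0)
		return ret;

	return 0;
}
```

Calling register_device(id) as a bare statement would trigger the
unused-result warning at compile time.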
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Don't ignore an error in the bus specific initialization function in
virtio_init; don't ignore the result of virtio_init; and don't return 0
in virtio_blk__init and virtio_scsi__init when we encounter an error.
Hopefully this will save some developer's time debugging faulty virtio
devices in a guest.
To take advantage of the cleanup function virtio_blk__exit, move appending
the new device to the list before the call to virtio_init. Change
virtio_net__exit to free all allocated net_dev devices on exit,
matching what virtio_blk__exit does.
To safeguard against this in the future, virtio_init has been annotated
with the compiler attribute warn_unused_result.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Don't try to configure a BAR if there is no region associated with it.
Also move the variable declarations from inside the loop to the start of
the function for consistency.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
To get the size of the expansion ROM, software writes 0xfffff800 to the
expansion ROM BAR in the PCI configuration space. PCI emulation executes
the optional configuration space write callback that a device can implement
before emulating this write.
kvmtool's implementation of VFIO doesn't have support for emulating
expansion ROMs. However, the callback writes the guest value to the
hardware BAR, and then it reads it back to the emulated BAR to make sure
the write has completed successfully.
After this, we return to regular PCI emulation and because the BAR is no
longer 0, we write back to the BAR the value that the guest used to get the
size. As a result, the guest will think that the ROM size is 0x800 after
the subsequent read and we end up unintentionally exposing to the guest a
BAR which we don't emulate.
Let's fix this by ignoring writes to the expansion ROM BAR.
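The 0x800 figure follows from the usual BAR sizing arithmetic. The helper
below is illustrative only (rom_bar_size is a hypothetical name); it shows
how a guest would derive the size from the value it reads back.

```c
#include <stdint.h>

/* Address bits of the expansion ROM BAR, bits [31:11]. */
#define ROM_ADDR_MASK 0xfffff800u

/* Size decoded from the value read back after the sizing write. */
static uint32_t rom_bar_size(uint32_t readback)
{
	uint32_t addr_bits = readback & ROM_ADDR_MASK;

	if (addr_bits == 0)
		return 0;	/* BAR not implemented */

	/* Size is the two's complement of the writable address bits. */
	return ~addr_bits + 1;
}
```

A readback of 0xfffff800, which is what the guest sees when its own write
is stored unchanged, decodes to a size of 0x800.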
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Not all devices have the bottom 32 bits of a 64 bit BAR in an even
numbered BAR. For example, on an NVIDIA Quadro P400, BARs 1 and 3 are
64bit. Remove this assumption.
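A sketch of a BAR walk that does not depend on the index parity: the type
field of each memory BAR decides whether the next register is the upper
half. The constants mirror the PCI BAR layout, and find_64bit_bars is a
hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_BARS			6
#define BAR_SPACE_IO		0x1u	/* bit 0 set: I/O BAR */
#define BAR_MEM_TYPE_MASK	0x6u	/* bits [2:1]: memory type */
#define BAR_MEM_TYPE_64		0x4u	/* 64-bit memory BAR */

static bool bar_is_64bit(uint32_t bar)
{
	return !(bar & BAR_SPACE_IO) &&
	       (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}

/* Bitmask of BAR indices that hold the low half of a 64-bit BAR. */
static unsigned int find_64bit_bars(const uint32_t bars[NR_BARS])
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < NR_BARS; i++) {
		if (!bar_is_64bit(bars[i]))
			continue;
		mask |= 1u << i;
		i++;	/* the next register is the upper 32 bits */
	}
	return mask;
}
```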
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
kvmtool assumes that the BAR that holds the address for the MSIX table
and PBA structure has a size which is equal to their total size and it
allocates memory from MMIO space accordingly. However, when
initializing the BARs, the BAR size is set to the region size reported
by VFIO. When the physical BAR size is greater than the mmio space that
kvmtool allocates, we can have a situation where the BAR overlaps with
another BAR, in which case kvmtool will fail to map the memory. This was
found when trying to do PCI passthrough with a PCIe Realtek r8168 NIC,
when the guest was also using virtio-block and virtio-net devices:
[..]
[ 0.197926] PCI: OF: PROBE_ONLY enabled
[ 0.198454] pci-host-generic 40000000.pci: host bridge /pci ranges:
[ 0.199291] pci-host-generic 40000000.pci: IO 0x00007000..0x0000ffff -> 0x00007000
[ 0.200331] pci-host-generic 40000000.pci: MEM 0x41000000..0x7fffffff -> 0x41000000
[ 0.201480] pci-host-generic 40000000.pci: ECAM at [mem 0x40000000-0x40ffffff] for [bus 00]
[ 0.202635] pci-host-generic 40000000.pci: PCI host bridge to bus 0000:00
[ 0.203535] pci_bus 0000:00: root bus resource [bus 00]
[ 0.204227] pci_bus 0000:00: root bus resource [io 0x0000-0x8fff] (bus address [0x7000-0xffff])
[ 0.205483] pci_bus 0000:00: root bus resource [mem 0x41000000-0x7fffffff]
[ 0.206456] pci 0000:00:00.0: [10ec:8168] type 00 class 0x020000
[ 0.207399] pci 0000:00:00.0: reg 0x10: [io 0x0000-0x00ff]
[ 0.208252] pci 0000:00:00.0: reg 0x18: [mem 0x41002000-0x41002fff]
[ 0.209233] pci 0000:00:00.0: reg 0x20: [mem 0x41000000-0x41003fff]
[ 0.210481] pci 0000:00:01.0: [1af4:1000] type 00 class 0x020000
[ 0.211349] pci 0000:00:01.0: reg 0x10: [io 0x0100-0x01ff]
[ 0.212118] pci 0000:00:01.0: reg 0x14: [mem 0x41003000-0x410030ff]
[ 0.212982] pci 0000:00:01.0: reg 0x18: [mem 0x41003200-0x410033ff]
[ 0.214247] pci 0000:00:02.0: [1af4:1001] type 00 class 0x018000
[ 0.215096] pci 0000:00:02.0: reg 0x10: [io 0x0200-0x02ff]
[ 0.215863] pci 0000:00:02.0: reg 0x14: [mem 0x41003400-0x410034ff]
[ 0.216723] pci 0000:00:02.0: reg 0x18: [mem 0x41003600-0x410037ff]
[ 0.218105] pci 0000:00:00.0: can't claim BAR 4 [mem 0x41000000-0x41003fff]: address conflict with 0000:00:00.0 [mem 0x41002000-0x41002fff]
[..]
Guest output of lspci -vv:
00:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
Subsystem: TP-LINK Technologies Co., Ltd. TG-3468 Gigabit PCI Express Network Adapter
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 16
Region 0: I/O ports at 0000 [size=256]
Region 2: Memory at 41002000 (64-bit, non-prefetchable) [size=4K]
Region 4: Memory at 41000000 (64-bit, prefetchable) [size=16K]
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00001000
Let's fix this by allocating an amount of MMIO memory equal to the size
of the BAR that contains the MSIX table and/or PBA.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Currently, callbacks for memory BAR 1 call the IO port emulation. This
means that the memory BAR needs I/O Space to be enabled whenever Memory
Space is enabled.
Refactor the code so the two types of BARs are independent. Also, unify
ioport/mmio callback arguments so that they all receive a virtio_device.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[Cosmetic changes wrt where local variables are initialized]
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|