Age | Commit message | Author | Files | Lines
2023-11-28 | proc_cpuinfo: Fix bit shift for socket bitmask (HEAD, master) | Tony Luck | 1 | -1/+1
If there are more than 32 sockets, "1 << s" doesn't work. Use "1L << s". Signed-off-by: Tony Luck <tony.luck@intel.com>
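The overflow can be shown in isolation. This is a minimal sketch, assuming the socket mask is a 64-bit value; `socket_bit` is a hypothetical helper, not a function from proc_cpuinfo.c. It uses `uint64_t` instead of `1L` so the width does not depend on the size of `long` on the build platform.

```c
#include <stdint.h>

/* Hypothetical helper: return a mask with only bit `s` set.
 * A plain int constant ("1 << s") cannot be shifted past the width
 * of int, so the constant is widened to 64 bits before shifting. */
static uint64_t socket_bit(unsigned int s)
{
	return UINT64_C(1) << s;	/* valid for s in 0..63 */
}
```

On LP64 Linux, `1L << s` produces the same values, because `long` is 64 bits wide there.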
2023-10-24 | proc_cpuinfo: Add sanity check for number of sockets | Tony Luck | 1 | -0/+4
A misconfigured system appeared to have a huge number of sockets. The code here did not handle this gracefully as the bitmask of sockets is only 64 bits wide, so the extra sockets were not counted. It doesn't seem worth changing the code to support more sockets as such systems do not exist. Just check, warn, and exit if a socket > 63 is found. Signed-off-by: Tony Luck <tony.luck@intel.com>
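A range check of this kind might look like the sketch below. `record_socket` and `MAX_SOCKET` are hypothetical names, and the helper returns an error instead of exiting so that the caller decides how to stop.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_SOCKET 63	/* highest socket id that fits in a 64-bit mask */

/* Hypothetical helper: record socket `s` in *mask.
 * Returns 0 on success, or -1 after a warning if the id does not fit. */
static int record_socket(uint64_t *mask, long s)
{
	if (s < 0 || s > MAX_SOCKET) {
		fprintf(stderr, "socket id %ld out of range 0..%d\n", s, MAX_SOCKET);
		return -1;
	}
	*mask |= UINT64_C(1) << s;
	return 0;
}
```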
2023-07-19 | einj_mem_uc: Check if kernel has CMCI disabled | Tony Luck | 1 | -0/+16
On Intel there is a race between a memory controller reporting with CMCI that it saw an error and the consumption of an uncorrected error being reported with a machine check. If the CMCI wins the race, Linux takes the page offline before any consumption can occur. Thus there may be no machine check. Some users want to explicitly test the #MC recovery case. They disable CMCI in the kernel with the boot flag "mce=no_cmci". In this case there will always be a machine check. But the test reports "fail" because it was expecting to see a CMCI. Add a check to see if CMCI is disabled. If it is, mask out the F_CMCI expectation. Signed-off-by: Tony Luck <tony.luck@intel.com>
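One way to detect the boot flag is to look for it as a whole word in the kernel command line. The sketch below assumes the caller has already read `/proc/cmdline` into a string; `cmdline_has_option` is a hypothetical helper, and the real check in einj_mem_uc may be written differently.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `opt` appears as a whole
 * whitespace-separated word in `cmdline` (e.g. the text of /proc/cmdline). */
static bool cmdline_has_option(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	if (len == 0)
		return false;
	while ((p = strstr(p, opt)) != NULL) {
		bool starts = (p == cmdline) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' ||
			    p[len] == '\t' || p[len] == '\n';

		if (starts && ends)
			return true;
		p++;	/* keep searching after a partial match */
	}
	return false;
}
```

A caller could then clear the CMCI expectation when `cmdline_has_option(text, "mce=no_cmci")` is true.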
2023-06-12 | einj_mem_uc: Delete the checks for "advanced RAS" CPU models | Tony Luck | 1 | -50/+0
These made sense in the early days of this tool when only a few CPU models supported recovery from poisoned memory consumption. But more new models support recovery than do not. Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-06-12 | einj_mem_uc: support error injection on AMD EPYC platform | Shuai Xue | 1 | -0/+5
AMD EPYC CPUs also support APEI EINJ error injection. Tested on AMD Milan and Genoa. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-06-09 | einj_pcie_err: support PCIe error injection through EINJ | Shuai Xue | 5 | -2/+145
Support PCIe error injection, e.g. fatal error, through APEI EINJ interface. Tested on ARM platform (Alibaba Yitian 710) and X86 platform (Intel Sapphire Rapids). Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-04-24 | einj.h: add a header file to declare common EINJ related operations | Shuai Xue | 8 | -285/+212
Lots of files declare the same EINJ related macros like EINJ_ETYPE and functions like wfile(), and include the same header files. To simplify the code and make it easier to maintain, move all common EINJ related operations to a header file. [Tony: Move the code out of einj.h and into new file einj.c] Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-03-06 | einj_mem_uc: add extra arguments to support guest error injection | Shuai Xue | 2 | -6/+117
To support guest error injection, add two extra arguments:
- '-j': skip error injection. This step should be done on the host with the host physical address, since the host creates the GPA->HPA mappings for the guest.
- '-k': kick off the trigger by writing a file from a remote machine (the host).

The steps to inject a guest error are:

STEP 1: start a VM with a stdio monitor, which allows giving complex commands to the QEMU emulator.

    qemu-system-aarch64 -enable-kvm \
        -cpu host \
        -M virt,gic-version=3 \
        -m 8G \
        -d guest_errors \
        -rtc base=localtime,clock=host \
        -smp cores=2,threads=2,sockets=2 \
        -object memory-backend-ram,id=mem0,size=4G \
        -object memory-backend-ram,id=mem1,size=4G \
        -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
        -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
        -bios /usr/share/AAVMF/AAVMF_CODE.fd \
        -drive driver=qcow2,media=disk,cache=writeback,if=virtio,id=alinu1_rootfs,file=/path/to/image.qcow2 \
        -netdev user,id=n1,hostfwd=tcp::5555-:22 \
        -serial telnet:localhost:4321,server,nowait \
        -device virtio-net-pci,netdev=n1 \
        -monitor stdio
    QEMU 7.2.0 monitor - type 'help' for more information
    (qemu) VNC server running on 127.0.0.1:5900

STEP 2: log in to the guest and install ras-tools, then run `einj_mem_uc` to allocate a page in userspace and dump the virtual and physical address of the page. The `-j` is to skip error injection and `-k` is to wait for a kick.

    $ ./einj_mem_uc single -j -k
    0: single vaddr = 0xffffbd88c400 paddr = 151f21400

STEP 3: run the command `gpa2hpa` in the QEMU monitor and it will print the host physical address at which the guest's physical address addr is mapped.

    (qemu) gpa2hpa 0x151f21400
    Host physical address for 0x151f21400 (mem1) is 0x935757400

STEP 4: on the host, inject an uncorrected error via the APEI interface at the finally translated host physical address.

    echo 0x949a84400 > /sys/kernel/debug/apei/einj/param1
    echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
    echo 0x0 > /sys/kernel/debug/apei/einj/flags
    echo 0x10 > /sys/kernel/debug/apei/einj/error_type
    echo 1 > /sys/kernel/debug/apei/einj/notrigger
    echo 1 > /sys/kernel/debug/apei/einj/error_inject

STEP 5: then kick `einj_mem_uc` to trigger the error by writing "trigger_start". In this example, the kick is done on the host.

    ssh -p 5555 root@localhost "echo trigger > ~/trigger_start"

STEP 6: we will observe that the QEMU process exits.

    (qemu) qemu-system-aarch64: Hardware memory error!

Signed-off-by: zhangyangzeyu.zyzy <xiaoque@linux.alibaba.com> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
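The echo sequence in STEP 4 can also be driven from C. The following is a sketch only: `einj_write` and `inject_uc_notrigger` are hypothetical helpers (they are not the wfile() function from ras-tools), and the directory is a parameter so the code can be exercised against a scratch directory instead of the real debugfs path. It also assumes the EINJ debugfs files accept hexadecimal text with a `0x` prefix, as the echo commands suggest.

```c
#include <stdio.h>

/* Hypothetical helper: write `value` in hex to the file `dir`/`name`.
 * Returns 0 on success, -1 on any failure. */
static int einj_write(const char *dir, const char *name, unsigned long long value)
{
	char path[512];
	FILE *fp;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	fp = fopen(path, "w");
	if (fp == NULL)
		return -1;
	ok = fprintf(fp, "0x%llx\n", value) > 0;
	if (fclose(fp) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Mirror the STEP 4 sequence for host physical address `hpa`.
 * `dir` would normally be "/sys/kernel/debug/apei/einj". */
static int inject_uc_notrigger(const char *dir, unsigned long long hpa)
{
	if (einj_write(dir, "param1", hpa) ||
	    einj_write(dir, "param2", 0xfffffffffffff000ULL) ||
	    einj_write(dir, "flags", 0x0) ||
	    einj_write(dir, "error_type", 0x10) ||
	    einj_write(dir, "notrigger", 1) ||
	    einj_write(dir, "error_inject", 1))
		return -1;
	return 0;
}
```

Writing to the real EINJ directory requires root and a mounted debugfs; the order of writes matches the commit message, with `error_inject` last.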
2023-02-27 | einj_mem_uc: Support kernels before 3.14 | Bixuan Cui | 1 | -0/+6
Running einj_mem_uc on a 3.10 kernel fails: ./einj_mem_uc: cannot open '/sys/kernel/debug/apei/einj/flags' The 'flags' file was added by 3482fb5e0c1c (ACPI, APEI, EINJ: Changes to the ACPI/APEI/EINJ debugfs interface) in the 3.14 kernel. Add a kernel version check. Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
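A version check of this kind can be sketched as a comparison on the release string that uname(2) reports. `kernel_at_least` is a hypothetical helper; how einj_mem_uc actually obtains and compares the version is not shown in the commit message.

```c
#include <stdio.h>

/* Hypothetical helper: true if a release string such as "3.10.0-1160.el7"
 * (the `release` field filled in by uname(2)) is at least major.minor. */
static int kernel_at_least(const char *release, int major, int minor)
{
	int maj, min;

	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return 0;	/* unparsable: treat as an old kernel */
	return maj > major || (maj == major && min >= minor);
}
```

With this, the 'flags' file would only be written when `kernel_at_least(release, 3, 14)` is true.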
2023-02-07 | proc_cpuinfo: fix the bug that modelnum is always zero | Bixuan Cui | 1 | -1/+1
Fixes: 65d692c5ce8e (einj_mem_uc: Count Ice Lake Xeon as "advanced RAS") Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-01-30 | einj_mem_uc: Add new test case for overflow | Bixuan Cui | 1 | -0/+34
Trigger two UCEs by reading from two target addresses at the same time. The OVER (error overflow) bit will be set and the result is probably fatal. Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2023-01-12 | einj_mem_uc: fix compilation error about trigger_share | Shuai Xue | 1 | -1/+2
Some compilers (GCC 9.2.1) complain: einj_mem_uc.c: In function ‘trigger_share’: einj_mem_uc.c:656:3: error: a label can only be part of a statement and a declaration is not a statement 656 | char *p = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fileno(pcfile), 0); | ^~~~ make: *** [<builtin>: einj_mem_uc.o] Error 1 Make all declarations precede all statements within the block. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
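Before C23, a label (including a `case` label) had to be followed by a statement, and a declaration is not a statement. The sketch below shows two common ways around this; `pick` is a hypothetical function unrelated to the real trigger_share() body.

```c
/* Hypothetical example of declaring variables around labels. */
static int pick(int which)
{
	int value;	/* declared before any label, as the commit does */

	switch (which) {
	case 0:
		/* Writing "int value = 10;" directly after this label is
		 * what compilers following the pre-C23 grammar reject. */
		value = 10;
		break;
	default: {
		/* Alternative: open a block so the declaration is legal. */
		int fallback = -1;

		value = fallback;
		break;
	}
	}
	return value;
}
```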
2023-01-05 | einj_mem_uc: add new test case for share memory | Bixuan Cui | 1 | -2/+34
Shared memory at the target address is read by two tasks. [Tony: Fix some indenting. Rename page_cache_alloc() to map_file_alloc() now it is used for another test] Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-29 | vtop: unify all cases with the same vtop() function | Shuai Xue | 9 | -52/+38
There are multiple implementations of the vtop() function. Remove the extra copies and use the one from proc_pagemap.c. Suggested-by: Luck, Tony <tony.luck@intel.com> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-29 | victim: add a victim to provide target injection memory under user context | Shuai Xue | 3 | -2/+372
Victim works under user context and provides a target memory chunk for error injection. It can be used for all kinds of error types, including corrected errors and uncorrected errors (IFU/DCU). Here is a simple example for DCU: mmap one page of memory, return its starting address, and then translate the virtual address to a physical address. A caller such as a shell script can inject a UC error (error type 0x10 in the EINJ table) on the returned physical address. Meanwhile, victim continues to read/write the returned memory space to trigger the DCU error ASAP. NOTE: this workload is borrowed from mce-test. Thanks to the original authors, Tony Luck, Gong Chen, Wen Jin. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-29 | memattr: move the test case out of driver directory | Shuai Xue | 7 | -40/+25
Move the memattr test case out of driver directory and rename it with a generic name. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-28 | einj_mem_uc: relax vendor id check | Shuai Xue | 1 | -2/+3
Some firmware supports advanced RAS but does not fill in the vendor id, e.g. Kunpeng BIOS v1.91. Users complain that they cannot use ras-tools directly. Therefore, relax the vendor id check and just print a warning. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-04 | einj_mem_uc: Implement trigger_prefetch() for x86 | Tony Luck | 1 | -0/+3
Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-11-04 | einj_mem_uc: add a case to trigger prefetch | Shuai Xue | 2 | -0/+17
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-20 | mca-recover: Fix compilation warning about fgets() return value | Tony Luck | 1 | -1/+2
Some compilers complain: mca-recover.c:135:9: warning: ignoring return value of ‘fgets’ \ declared with attribute ‘warn_unused_result’ [-Wunused-result] Check the return value (even though it doesn't really matter). Signed-off-by: Tony Luck <tony.luck@intel.com>
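Checking the result of fgets() silences the warning and also leaves the buffer in a defined state at end-of-file. This is a sketch under the assumption that an empty string is an acceptable fallback; `read_line` is a hypothetical helper, not code from mca-recover.c.

```c
#include <stdio.h>

/* Hypothetical helper: read one line from `fp` into `buf` (size >= 1).
 * Returns 1 if a line was read, 0 on end-of-file or error. */
static int read_line(FILE *fp, char *buf, int size)
{
	if (fgets(buf, size, fp) == NULL) {
		buf[0] = '\0';	/* leave a well-defined empty string */
		return 0;
	}
	return 1;
}
```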
2022-10-19 | einj_mem_uc: Better handling of "-c" and copy-on-write test | Tony Luck | 1 | -1/+10
If the copy-on-write test passes, then the child process gets a SIGBUS and longjmp's back to the main loop. If "-c" had specified to repeat the test, both parent and child will go back around. Later loops will also include the grand-children and cousins! Set a flag to break out of the loop for the child of a copy-on-write test. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-18 | hornet: extend ptrace with PTRACE_GETREGSET on arm64 platform | Shuai Xue | 1 | -10/+47
hornet uses ptrace(2) with the PTRACE_GETREGS request to read the tracee's general-purpose registers, but that does not work on the arm64 platform. To support hornet on both X86 and arm64 platforms, retrieve rip or PC in an architecture-dependent way when PTRACE_GETREGS is not available. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | einj_mem-uc: Add new test case for kernel copy-on-write | Tony Luck | 1 | -2/+36
Someday I'd like to fix this case in the kernel. For now just create a test case to generate the issue. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | einj_mem_uc: Error return from mmap(2) is not NULL | Tony Luck | 1 | -1/+1
Check for MAP_FAILED instead. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | einj_mem_uc: Add missing argument to error message | Tony Luck | 1 | -1/+1
Compiler complains (correctly) that the format string specifies two additional arguments, but only one is present. Fix it. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | Add LICENSE. | Shuai Xue | 1 | -0/+339
These tools are all under GPLv2. Add the missing LICENSE. Tony: cherry-picked from https://gitee.com/anolis/ras-tools.git Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | README: add a brief introduction of ras-tools | Shuai Xue | 1 | -0/+21
Tony: Cherry-picked from https://gitee.com/anolis/ras-tools.git with the last paragraph about being a clone of this repo dropped. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-10-13 | Merge from https://gitee.com/anolis/ras-tools.git | Tony Luck | 12 | -26/+1565
Lots of bug fixes & cleanups. Plus ARM support! Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-09-25 | ras-tolerance: overwrite error severity to a lower level at runtime | Shuai Xue | 4 | -0/+269
When a hardware error occurs for a non-corrected RAS event, the kernel can take different actions. If the severity is fatal, the kernel panics immediately. This driver allows overwriting the error severity with a lower level at runtime, recoverable by default. It is useful for testing. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-09-23 | memattr: add a test suite to poison specific memory attribute | Shuai Xue | 5 | -0/+816
This patch adds:
- pgprot_drv: a driver that allows a user-space program to mmap a buffer of contiguous physical memory with a specific memory attribute.
- test.c: a test case to poison the remapped memory.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-09-21 | ras-tools: Add SPDX license tags | Tony Luck | 12 | -0/+24
These tools are all under GPLv2. Add the missing tags. Reported-by: Jiaqi Yan <jiaqiyan@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-09-15 | add thread cases | Bixuan Cui | 2 | -1/+35
Single read by two threads to target address at the same time. Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com>
2022-09-14 | einj_mem_uc: enhance sig action to explicitly print si_code | Shuai Xue | 1 | -1/+1
The current sig action only prints the fault address and restores the previously saved environment, so we cannot tell the reason for the SIGBUS. Therefore, explicitly print si_code: 4 for BUS_MCEERR_AR and 5 for BUS_MCEERR_AO. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
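A handler can turn the numeric si_code into a readable name before printing it. This is a sketch: `sigbus_code_name` is a hypothetical helper, and the fallback definitions use the values 4 and 5 given in the commit message in case the system headers do not expose the constants.

```c
#include <signal.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional */
#endif

/* Hypothetical helper: name the SIGBUS si_code values used for
 * machine-check memory errors. */
static const char *sigbus_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "BUS_MCEERR_AR";
	case BUS_MCEERR_AO:
		return "BUS_MCEERR_AO";
	default:
		return "other";
	}
}
```

Inside an `SA_SIGINFO` handler, the argument would be `si->si_code` from the `siginfo_t` pointer.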
2022-09-05 | einj_mem_uc: trigger single with an offset | Shuai Xue | 1 | -13/+15
The Advanced ECC X4 employs symbol-based Reed-Solomon encoding. One symbol is 8 bits, the message (data) length is 32 symbols (256 bits), and the ECC parity length is 4 symbols (32 bits). When we inject a UC error, the platform may poison only 32 symbols; in other words, only half of a cacheline is poisoned. Therefore, add a parameter to trigger with an offset, e.g.:
    ./einj_mem_uc single            # equivalent to ./einj_mem_uc -z 0 single
    ./einj_mem_uc -z 32 single
In this scenario, the former will signal an exception while the latter will not. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-30 | einj_mem_uc: Wait for injection to take effect before triggering | Shuai Xue | 1 | -0/+1
Einj interrupt may be a SPI on arm64 and could be dispatched to any core, so the current process could run trigger action before the injection takes effect. Add a sleep to wait for injection completion, and then trigger. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-30 | einj_mem_uc: add cases for platform specific | Shuai Xue | 1 | -0/+122
Add platform-specific cases, including CMN, GIC, SMMU, etc. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-30 | einj_mem_uc: add a case for hugetlb page | Shuai Xue | 1 | -0/+43
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-30 | einj_mem_uc: add cases to inject processor error | Shuai Xue | 1 | -0/+51
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: add a case to trigger LLC UCE on arm64 | Shuai Xue | 1 | -2/+13
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: add explicitly str, strb and strh case for Arm64 | Shuai Xue | 1 | -0/+68
Add cases to explicitly trigger write with STR, STRB, and STRH instruction on Arm64 platform. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: add a z flag to trigger write with an offset | Shuai Xue | 1 | -2/+12
On some platforms, writes to different offsets within the poisoned cacheline behave differently. Add a z flag so that we can trigger a write with an offset. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: explicitly print step when inject and trigger error | Shuai Xue | 1 | -1/+20
The error injection mechanism is a two-step process. First inject the error, then perform some actions to trigger it. When the system is in early kill mode, the trigger step is not needed. Explicitly print which step is running, so we can tell how the error occurs. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: implement memcpy in assembly on Arm64 | Shuai Xue | 1 | -0/+12
There is only an X86 assembly version of memcpy. Add an Arm64 version for the memcpy case. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: check advanced RAS support by vendor id | Shuai Xue | 1 | -0/+38
The vendor interface of EINJ provides vendor_id, device_id, rev_id, etc. Check advanced RAS support by vendor id. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: add Sflag as condition when check configuration | Shuai Xue | 1 | -2/+2
It is not necessary to check the configuration when using madvise(2) to simulate poison on a page. Add Sflag as a condition. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-25 | einj_mem_uc: surround arch dependent code with target arch macros | Shuai Xue | 2 | -19/+62
einj_mem_uc injects Memory Uncorrectable non-fatal errors through the APEI Error INJection (EINJ) interface, which is arch independent. However, einj_mem_uc fails to compile due to arch dependent configuration checks. Simply surround the arch dependent code with target arch macros to avoid the compile error, so we can debug and test with this tool on both X86 and arm platforms. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
2022-08-04 | einj_mem_uc: Add "-i" flag to skip reporting of CMCI interrupts | Adam Vaughn | 1 | -7/+13
AMD systems don't use CMCI in the same way as Intel systems. Add a flag to skip reporting of CMCI counts. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-06-27 | ras-tools: Add count argument to rep_ce_page | Tony Luck | 1 | -2/+6
Optional argument to rep_ce_page for how many times to inject a corrected error to the target page. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-06-06 | ras-tools: New test "rep_ce_page" | Tony Luck | 2 | -2/+90
This test injects and consumes corrected errors from a single page until either the page is taken offline (and replaced) by the OS, or a limit of 30 tries is reached. Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-05-11 | einj_mem_uc: Wait longer for patrol scrub CMCI | Tony Luck | 1 | -7/+10
It may take up to 20 seconds for the patrol scrubber to restart and scan to the specific location where the error was injected. Add a flag to indicate that the test should wait for much longer to check for patrol scrub CMCI. Also change the message to print the actual delay in units of seconds, not microseconds (since the values are large enough that this is a more human readable format). Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-04-25 | hornet: fix the missed page offline when ptrace detached | 启瑞 | 1 | -3/+7
When we detach the ptraced process, we miss isolating the poisoned page. Add a goto statement to make sure the page is isolated. Signed-off-by: 启瑞 <qirui.001@bytedance.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-03-23 | einj_mem_uc: Fix parsing of available_error_types | Tony Luck | 1 | -1/+1
Commit 38f47153c2c1 ("Check the injected error type available before write error type") didn't skip the extra text on the end of each line when checking whether a specific error type is supported. Reported-by: Liu Xinpeng <liuxp11@chinatelecom.cn> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-03-21 | Add #include <string.h> to proc_interrupt.c | Tony Luck | 1 | -0/+1
Compiler is grousing about missing prototypes for strncmp(). Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-03-21 | Check the injected error type available before write error type | Liu Xinpeng | 1 | -1/+33
before:
    0: llc vaddr = 0x7fee6b865400 paddr = f90e0eb400
    ./einj_mem_uc: write error on '/sys/kernel/debug/apei/einj/error_type'
after:
    0: llc vaddr = 0x7f86e6bac400 paddr = f915477400
    ./einj_mem_uc: no support for error type: 0x2
[Tony: re-word error message] Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn> Signed-off-by: Tony Luck <tony.luck@intel.com>
2022-01-20 | vtop: Fix check on number of arguments | Yizhan Xu | 1 | -1/+1
The number of arguments must be three. Signed-off-by: Yizhan Xu <yizhan.xu@intel.com>
2021-11-28 | einj_mem_uc: Fix vtop failed in "instr" test case | Zhongyu Gao | 1 | -0/+3
An error occurs during the execution of the einj_mem_uc "instr" test and exits. The error message shows that the instr memory page is not found, and the einj_mem_uc exits. The detailed log is as follows: $ ./einj_mem_uc -c 10 -f instr page not present 0: instr vaddr = 0x403000 paddr = ffffffffffffffff ~/einj_mem_uc : write error on '/sys/kernel/debug/apei/einj/error_inject' Test Environment: OS Version: CentOS Linux release 7.9.2009 (Core) Kernel Version: 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 Solution: Call the dosum function in instr_alloc to load the dosum memory page in advance to prevent vtop conversion failure. Signed-off-by: Zhongyu Gao <gzy@sangfor.com.cn> Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-11-15 | einj_mem_uc: Patrol scrub might be SRAO or UCNA | Tony Luck | 1 | -4/+24
Starting with Icelake Xeon patrol scrub errors are signalled using CMCI with a UCNA signature instead of a machine check with an SRAO signature. Add a new flag "F_EITHER" to indicate that CMCI or MCE (but not both) is an acceptable response for the patrol scrub test. Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-07-20 | einj_mem_uc: Count Ice Lake Xeon as "advanced RAS" | Tony Luck | 2 | -5/+14
All SKUs of Ice Lake Xeon support the memory recovery advanced RAS feature. Just check for the model number instead of looking for Platinum/Gold in the model name. Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-03-25 | einj_mem_uc: Add a case for kernel accessing a poisoned futex(2) operand | Tony Luck | 1 | -1/+26
[Also changed flags for the copyin case to remove F_SIGBUS] Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-02-23 | einj_mem_uc: Fix typos in trigger_copyin | Aili Yao | 1 | -2/+2
1.In if check, ret should be compared to memcpy_size; 2.In else branch, correct the fprintf parameter order. Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-01-29 | einj_mem_uc: Add "-S" flag for MADV_HWPOISON page offline | Tony Luck | 1 | -7/+17
Instead of using ACPI/EINJ to inject a real error, use madvise(2) to simulate poison on a page. Note that this only has any effect for tests that use inject_uc() for injection ... and doesn't really match the behavior with a real injection. [Idea to use MADV_HWPOISON from Aili Yao <yaoaili@kingsoft.com>] Signed-off-by: Tony Luck <tony.luck@intel.com>
2021-01-29 | Add .gitignore file | Tony Luck | 1 | -0/+8
For ".o" files and executables
2020-12-09 | mca-recover: Make sure we consume poison at right point in code | Tony Luck | 1 | -3/+8
It seems that modern compilers have become smart enough to not just blindly read from "buf" at the point where the C code says "i = buf[0];" Throw in a function call, and a volatile cast, to make sure the consumption happens at the right point in code flow. [Also fix one duplicated "0x" on output, and one missing "0x"] Signed-off-by: Tony Luck <tony.luck@intel.com>
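The idea of pairing a function call with a volatile cast can be sketched as below. `consume_byte` is a hypothetical name; the actual function added to mca-recover.c may differ. The volatile-qualified access tells the compiler that the load itself is an observable effect, so it is not dropped or merged with other reads of the buffer.

```c
/* Hypothetical helper: force a real load from `p` at the point of call. */
static char consume_byte(const char *p)
{
	return *(const volatile char *)p;
}
```

A caller would replace `i = buf[0];` with `i = consume_byte(buf);` at the place where the poison should be consumed.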
2020-10-26 | vtop: Multiply by "pagesize" instead of hardcode shift by 12 | Tony Luck | 4 | -4/+4
When converting a page frame number to an address there is a hard-coded shift by 12 but the mask for the low order bits is computed based on the "pagesize" variable. Fix this inconsistency by swapping out the shift for a multiply. Reported-by: 葛士建 <geshijian@bytedance.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
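The conversion can be sketched as a pure function over a /proc/pid/pagemap entry, where bits 0-54 hold the page frame number and bit 63 marks the page as present. `entry_to_paddr` and the two macros are hypothetical names, not the vtop() code itself; returning all ones for a missing page is an assumption of this sketch.

```c
#include <stdint.h>

#define PM_PFN_MASK	((UINT64_C(1) << 55) - 1)	/* bits 0-54: page frame number */
#define PM_PRESENT	(UINT64_C(1) << 63)		/* bit 63: page present in RAM */

/* Hypothetical helper: combine a pagemap entry with the page offset of
 * `vaddr`. Returns UINT64_MAX if the page is not present. */
static uint64_t entry_to_paddr(uint64_t entry, uint64_t vaddr, uint64_t pagesize)
{
	if (!(entry & PM_PRESENT))
		return UINT64_MAX;
	/* Multiply by pagesize so the PFN scaling and the offset mask agree. */
	return (entry & PM_PFN_MASK) * pagesize + (vaddr & (pagesize - 1));
}
```

With a 64 KiB page size, a hard-coded shift by 12 would place the frame at the wrong address, while the multiply stays consistent with the offset mask.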
2020-09-03 | einj_mem_uc: Fix file descriptor leak for copyin test | Tony Luck | 1 | -4/+11
The copyin test fails at just over 1000 iterations because it can no longer open new file descriptors. The problem is that the test opens a file, and closes it at the end of the function. But since successful recovery sends a SIGBUS, the close is never executed. Make the file descriptor global and close it in the main() function after the normal or SIGBUS return paths. Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-08-24 | einj_mem_uc: Apply runup/size parameters from "-m" argument to copyin test | Tony Luck | 1 | -2/+2
In order to force different code paths in the kernel, allow user to adjust start/size of kernel copy so that the poison is consumed at different points in the copy. Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-08-04 | einj_mem_uc: Copyout test gets SIGSEGV | Tony Luck | 1 | -2/+9
Bug in the code. The page_cache_alloc() function uses a local "FILE *pcfile" instead of the global one. Result is that the trigger step gets a NULL dereference accessing the global pcfile. Reported-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-07-27 | einj_mem_uc: Print errno or byte count for unsuccessful write | Tony Luck | 1 | -2/+3
If the write fails then print out the system error code. If it partially fails, print how many bytes were written. Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-06-17 | einj_mem_uc: new test case for copyout | Tony Luck | 1 | -0/+43
Create a file and write a page of data to it. Map that file and use an address in the mapped range to inject an error. Trigger the error by issuing a read(2) system call which will make the kernel copy from the page cache copy of the data (which has been injected with a UC error). Signed-off-by: Tony Luck <tony.luck@intel.com>
2020-06-02 | einj_mem_uc: Add explicit "Test passed/failed" messages and exit status | Tony Luck | 1 | -0/+7
Validation teams using this test would find it easier to build into a test script if it reported success/fail both with a message, and by exit code. Reported-by: Jun J Li <jun.j.li@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2018-11-12 | Add new test program to validate LMCE feature | Jin Wen | 2 | -2/+450
Design different cases to validate LMCE feature: 1. multi thread run on same or different cores; 2. inject memory error into one same address or two different addresses; 3. trigger IFU or DCU error individually. Note that injecting errors on the same core will likely result in undefined behavior as logical processors sharing a core also share machine check banks that log recoverable machine checks. Signed-off-by: Jin Wen <wenx.jin@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2018-03-14 | hornet: Fix some issues with addition of ptrace support | Jin Wen | 1 | -14/+13
1) Fix check_ptrace macro to stringify the "req" argument in error message 2) Set lo & hi so that pickaddr() will print reasonable range with "-v" 3) Replace NULL with empty string in verbose print 4) Remove unused "pagesize" variable. Signed-off-by: Jin Wen <wenx.jin@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2018-03-13 | hornet: Add "-P pid" flag to stop process using ptrace | Jin Wen | 2 | -10/+54
Picking instruction addresses inside running processes is rather hit or miss. We may pick an address for injection that is never executed. Using ptrace(2) to stop the process we can find the precise address of the next instruction to be executed and thus guarantee that we will immediately hit the injected address when we resume running the process. Signed-off-by: Jin Wen <wenx.jin@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2017-12-13 | Add "Gold" to list of strings to check to see if platform supports error recovery | Tony Luck | 1 | -0/+2
Both Platinum and Gold Skylake SKUs support advanced RAS recovery features. Reported-by: Youquan Song Signed-off-by: Tony Luck <tony.luck@intel.com>
2017-08-15 | einj_mem_uc: Update check for cpu models that can recover | Tony Luck | 1 | -1/+10
Skylake (Xeon-SP) doesn't use the "E7-" model name convention. Signed-off-by: Tony Luck <tony.luck@intel.com>
2017-07-26 | einj_mem_uc: Add test case to mlock(2) the target page. | Tony Luck | 1 | -0/+23
2017-05-01 | Fixup mca-recover | Tony Luck | 1 | -22/+19
I'd messed with this to do something with repeated recoveries. But that doesn't match with the use case that we've been explaining to people. Go back to just one recovery. Signed-off-by: Tony Luck <tony.luck@intel.com>
2016-08-18 | Add extra "-m" argument to provide options for memcpy test | Tony Luck | 2 | -9/+55
2016-03-01 | Add "llc" option to inject processor uncorrected non-fatal and trigger LLC writeback | Tony Luck | 2 | -14/+61
2016-01-16 | Don't wait for a fixed interval for CMCIs to be counted | Tony Luck | 1 | -2/+20
Instead of hard coding a fixed large time to sleep before checking how many CMCIs were logged, just sleep in 100us increments until we see at least the expected number. Give up waiting after 1000 such rechecks. If we blocked for any more than 100us, then report the actual time. Original patch by Carl Sapp. Modified to use gettimeofday() to report actual delay. Signed-off-by: Tony Luck <tony.luck@intel.com>
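The polling loop described here can be sketched as follows. `wait_for_count` and `read_long` are hypothetical names; in the real tool the count would come from re-reading /proc/interrupts, which this sketch replaces with a callback so it stays self-contained.

```c
#include <sys/time.h>
#include <unistd.h>

/* Stand-in callback for the sketch: the "count" is a long the caller owns. */
static long read_long(void *arg)
{
	return *(long *)arg;
}

/* Hypothetical helper: poll `read_count(arg)` every 100us until it reaches
 * `expected`, giving up after 1000 rechecks.
 * Returns the elapsed time in microseconds, or -1 if the count never arrived. */
static long wait_for_count(long (*read_count)(void *), void *arg, long expected)
{
	struct timeval start, now;
	int tries;

	gettimeofday(&start, NULL);
	for (tries = 0; tries < 1000; tries++) {
		if (read_count(arg) >= expected) {
			gettimeofday(&now, NULL);
			return (now.tv_sec - start.tv_sec) * 1000000L +
			       (now.tv_usec - start.tv_usec);
		}
		usleep(100);
	}
	return -1;
}
```

The returned elapsed time lets the caller report the actual delay only when it exceeds the 100us granularity, as the commit message describes.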
2016-01-12 | Increase delay before re-reading /proc/interrupts. 1ms wasn't enough | Tony Luck | 1 | -1/+1
for all cpus to wake from C6. Bump to 10ms. Reported-by: Carl Sapp Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-12-31 | Add some new error testing toys: | Tony Luck | 10 | -0/+1052
cmcistorm - inject a bunch of corrected errors, then trigger them all quickly
hornet - inject a UC memory error into some other process
einj_mem_uc - inject a UC error and then trigger it in one of a variety of ways.
2014-03-18 | Add vtop.c - for finding physical address in arbitrary process. | Tony Luck | 1 | -0/+81
2014-03-11 | Add example recovery application | Tony Luck | 1 | -0/+146
This is pretty trivial - just shows how to setup a SIGBUS handler for recoverable machine checks. Injection of the actual error is handled externally (e.g. using EINJ). Signed-off-by: Tony Luck <tony.luck@intel.com>