author | Haicheng Li <haicheng.li@intel.com> | 2009-10-27 10:52:57 +0800 |
---|---|---|
committer | Andi Kleen <ak@linux.intel.com> | 2009-10-27 05:32:41 +0100 |
commit | 661d6e0fbc5c61389a4d7328ed06ec5dbc574fd8 | |
tree | 3ece8850af18b3f71c517bd3f4b91da8a989e0d7 | |
parent | 6eec280c1bac479ed5f75481c275b7573f67d12b | |
HOWTO: documentation of MCE stress test suite.
Documentation of MCE stress test suite.
Reviewed-by: Jiajia Zheng <jiajia.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
-rw-r--r-- | README | 4 |
-rw-r--r-- | doc/stress-howto.txt | 340 |
2 files changed, 344 insertions, 0 deletions
@@ -80,6 +80,8 @@ bin/
     Some tools used by test drivers or test cases will be installed into this directory.
+stress/
+    Linux MCE stress test suite.
 
 Test Instruction
 ----------------
@@ -95,6 +97,8 @@ in doc sub-directory.
 doc/howto.txt: a more detailed HOWTO document
 
+doc/stress-howto.txt: Detailed HOWTO document for MCE stress test suite
+
 doc/cases/*.txt: Description of every test case, including test objective, code patch tested, reference and expected results
diff --git a/doc/stress-howto.txt b/doc/stress-howto.txt
new file mode 100644
index 0000000..f96a0ec
--- /dev/null
+++ b/doc/stress-howto.txt
@@ -0,0 +1,340 @@
+MCE Stress Test HOWTO
+====================
+
+Oct 10th, 2009
+
+Haicheng Li
+
+
+Abstract
+--------
+
+This document explains the design and structure of the MCE stress test suite,
+the kernel configurations and user space tools required for automated
+stress testing, as well as a usage guide.
+
+
+0. Quick Shortcut
+-----------------
+
+- Install the Linux kernel (2.6.32 or newer) with full MCA recovery support.
+  Make sure the following configuration options are enabled:
+
+    CONFIG_X86_MCE=y
+    CONFIG_MEMORY_FAILURE=y
+
+  With these two options enabled, you can do stress testing thru the madvise
+  syscall (sec 4.1).
+
+- Install the page-types tool (sec 3.3), which ships with the Linux kernel
+  source (2.6.32 or newer).
+
+    # cd $KERNEL_SRC/Documentation/vm/
+    # gcc -o page-types page-types.c
+    # cp page-types /usr/bin/
+
+- Get the latest LTP (Linux Test Project) image from http://ltp.sf.net. Refer
+  to INSTALL of LTP to install LTP on your machine.
+
+- Build and run stress testing
+
+    # make
+    # cd stress
+    # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR -N
+
+  Note that '-d $YOUR_PARTITION' is a mandatory option. The test creates
+  all temporary files on $YOUR_PARTITION, and error injection only
+  affects the pages associated with $YOUR_PARTITION. So you must provide a
+  free disk partition to the stress test driver!
+
+  This will do the stress testing thru the madvise syscall (sec 4.1). However,
+  there are more advanced test methods provided (sec 4.2, 4.3).
+
+Note, for all examples in the rest of this doc, it is assumed that $PWD is
+the stress subdir.
+
+1. Overview
+-----------
+
+The MCE stress test suite is a collection of tools and test scripts
+intended to stress test the Linux kernel MCA high level handlers,
+including HWPoison page recovery, soft page offline, and so on.
+
+In general, this test suite is designed to do stress testing thru various
+test interfaces, i.e. madvise syscall, HWPoison page injector, and APEI
+injector (see ACPI4.0 spec). It supports most of the popular
+Linux File Systems (FS); that is, there is an option for the user to specify
+which FS type they want the test to run on.
+
+If you just want to start testing as quickly as possible, you can skip
+sections 2 & 3 and go to section 4 directly.
+
+
+2. Design Details
+-----------------
+
+The MCE stress test suite consists of four parts: test driver, workload
+controller, customized workloads, and background workloads.
+
+The main test idea is described below:
+- Test driver launches various customized workloads to continuously generate
+  lots of pages with expected page states. Note, all of these workloads know
+  about their expected results that should not be affected by Linux MCE high
+  level handlers.
+- Then test driver injects MCE errors to these pages thru either madvise
+  syscall or HWPoison injector or APEI injector. While the Linux kernel handles
+  these MCE errors, all the workloads continue running normally.
+- After a long run, test driver will collect the test result of each
+  workload to see if any unexpected failures happened. In such a way, it can
+  decide if any bug is found.
+- If any system panic or FS corruption happens, that means there must be a
+  bug. It's the bottom line to decide if the test passes.
+
+2.1 Test Driver
+
+Test driver (a.k.a hwpoison.sh) drives the whole test procedure. It's
+responsible for managing test environment, setting up error injection
+interface, controlling test progress, launching workloads, injecting page
+errors, as well as recording test logs and reporting test result.
+
+For detailed usage of hwpoison.sh test driver, please refer to:
+# ./hwpoison.sh -h
+
+2.2 Workload Controller
+
+Workload controller needs to have various test workloads running in parallel
+and continuously within a required duration time. We select ltp-pan
+program of Linux Test Project (LTP) as the workload controller of this
+stress test suite.
+
+Test driver (hwpoison.sh) interacts with ltp-pan in the following ways:
+- hwpoison.sh generates a test config file that lists the workload type
+  to be launched by ltp-pan.
+- hwpoison.sh also passes test duration time and other workload specific
+  parameters to ltp-pan via test config file.
+- ltp-pan makes each workload run and finish in time, then test driver
+  can get the result of each workload via corresponding result files.
+- finally, hwpoison.sh will decide the overall test result based on each
+  workload result, and report the final result.
+
+2.3 Customized Workloads
+
+There are three types of customized workloads, which are intended to generate
+pages with various page states.
+
+* Type0: page-poisoning workload, meant to cover:
+  - anonymous page operations.
+  - file data operations.
+
+* Type1: fs-metadata workload, meant to cover:
+  - inode operations.
+
+* Type2: fs_type specific workload, meant to cover:
+  - extended functions of some special FS.
+
+2.4 Background Workloads
+
+LTP is selected as the background workload to simulate normal system
+operations in background while stress testing is running.
+
+Besides LTP, there are also some alternatives, like AIM. We might add more
+background workloads in future.
+
+2.5 Test Result
+
+How to determine that stress testing passes?
+- at least no kernel panic happens during stress testing.
+- fsck on the target disk at the end of stress testing should pass.
+- there is no failure found by customized workloads, especially for
+  page-poisoning workload.
+
+Where to get detailed test result?
+- When stress testing is done, the general test result is recorded in
+  result/hwpoison.result, and the general test log is in result/hwpoison.log.
+  However, you can specify them in the following way:
+  # hwpoison.sh -r $YOUR_RESULT -l $YOUR_LOG
+- The test result and test log of each workload are recorded as
+  log/$workload/$workload.result and log/$workload/$workload.log.
+  For example, for page-poisoning workload, its test result and test logs are
+  log/page-poisoning/page-poisoning.result and
+  log/page-poisoning/page-poisoning.log.
+- Besides, under each workload result dir, you can find other extra logs
+  like pan_log, pan_output, etc. These logs are generated by the ltp-pan
+  workload controller. Usually they can help you understand what has been
+  going on with ltp-pan while workload is running. Please refer to ltp-pan doc
+  for details.
+
+
+3. Tools
+--------
+
+3.1 page-poisoning
+
+It is the page-poisoning workload. page-poisoning workload is an extension of
+tinjpage test program with a multi-process model. It spawns thousands of
+processes that inject HWPoison error to various pages simultaneously thru
+madvise syscall. Then it checks if these errors get handled correctly,
+i.e. whether each test process receives or doesn't receive SIGBUS signal as
+expected.
+
+For more info about page-poisoning workload, please read through README file
+under stress/tools/page-poisoning/.
+
+3.2 fs-metadata
+
+It is the fs-metadata workload. fs-metadata is designed to test i-node
+operations with heavy workload and make sure every i-node operation gets
+the expected result. In detail, it first generates a huge directory
+hierarchy on the target disk, then it performs unlink operations on this
+directory hierarchy and duplicates a copy of the directory, finally it
+checks if these two directories are the same as expected.
+
+For more info about fs-metadata workload, please read through README file
+under stress/tools/fs-metadata/.
+
+3.3 page-types
+
+page-types is a tool to query the page type of every memory page in the
+system. We use it to filter out pages with required page types. Test will
+inject error to these pages via error injector, although the page filter
+of HWPoison handler in Linux Kernel will filter them out for a second
+time. Note, the reason we need to use page-types to do first time filtering
+is just about performance.
+
+To install page-types on your test machine:
+
+    # cd $KERNEL_SRC/Documentation/vm/
+    # gcc -o page-types page-types.c
+    # cp page-types /usr/bin/
+
+3.4 ltp-pan
+
+It's the workload controller of this stress test suite. In fact, ltp-pan
+is the test harness of LTP (Linux Test Project), and is included in
+LTP package. For more information, please refer to ltp-pan document of LTP.
+
+
+4. Usage Guide
+--------------
+
+This section shows you how to conduct the stress testing thru
+various test interfaces.
+
+As an example, we choose to run stress testing based on partition /dev/sda1
+for 1 hour. Note, we've installed LTP to /ltp.
+
+4.1 Stress Test thru Madvise Syscall.
+
+To run this stress testing, you need to strictly follow the test
+instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+    CONFIG_X86_MCE=y
+    CONFIG_MEMORY_FAILURE=y
+
+- build and run stress testing
+    # make
+    # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR
+
+* Example:
+
+- launch testing
+    # ./hwpoison.sh -d /dev/sda1 -M -t 3600
+
+- general test results
+    result: result/hwpoison.result
+    logs: result/hwpoison.log
+
+- detailed workload results
+    result: log/page-poisoning/page-poisoning.result
+    log: log/page-poisoning/page-poisoning.log
+
+4.2 Stress Test thru HWPoison Page Injector
+
+This is the default test method of this stress test suite.
+
+To run this stress testing, you need to strictly follow the test
+instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+    CONFIG_X86_MCE=y
+    CONFIG_MEMORY_FAILURE=y
+    CONFIG_DEBUG_KERNEL=y
+    CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+    CONFIG_HWPOISON_INJECT=y
+
+- build and run stress testing
+    # make
+    # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L
+
+* Example:
+
+- launch testing
+    # ./hwpoison.sh -d /dev/sda1 -t 3600 -L
+
+- general test results
+    result: result/hwpoison.result
+    logs: result/hwpoison.log
+
+- detailed workload results
+    fs-metadata result: log/fs-metadata/fs-metadata.result
+    fs-metadata log: log/fs-metadata/fs-metadata.log
+    ltp result: log/ltp/ltp.result
+    ltp log: log/ltp/ltp.log
+    fs-specific result: log/fs-specific/fs-specific.result
+    fs-specific log: log/fs-specific/fs-specific.log
+
+4.3 Stress Test thru APEI Injector
+
+To run this stress testing, you need to follow the test instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+    CONFIG_X86_MCE=y
+    CONFIG_X86_MCE_INTEL=y
+    CONFIG_MEMORY_FAILURE=y
+    CONFIG_ACPI_APEI=y
+    CONFIG_ACPI_APEI_EINJ=y
+
+- build and run stress testing
+    # make
+    # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L -A
+
+* Example:
+
+- launch testing
+    # ./hwpoison.sh -d /dev/sda1 -t 3600 -L -A
+
+- general test results
+    result: result/hwpoison.result
+    logs: result/hwpoison.log
+
+- detailed workload results
+    fs-metadata result: log/fs-metadata/fs-metadata.result
+    fs-metadata log: log/fs-metadata/fs-metadata.log
+    ltp result: log/ltp/ltp.result
+    ltp log: log/ltp/ltp.log
+    fs-specific result: log/fs-specific/fs-specific.result
+    fs-specific log: log/fs-specific/fs-specific.log
+
+
+5. FAQs
+-------
+
+Here is a collection of frequently asked questions:
+
+Q: How to tell test driver not to format my disk partition?
+A: Use the option '-N'.
+
+Q: Can three types of tests run on the same system simultaneously?
+A: No. There are limitations in Linux Kernel HWPoison page filtering.
+
+Q: Can I run this stress testing on multiple disks in parallel?
+A: Yes. But it requires updated Kernel patches for HWPoison page filtering.
+   Now, it just supports one same test with same pagetype flags specified.
+
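Each test method in section 4 of the HOWTO depends on a specific set of kernel options, so it may help to check them before launching hwpoison.sh. The following is a minimal sketch of such a check, not part of this commit or of mce-test: the function name missing_kconfig and the variable HWPOISON_OPTS are hypothetical, and the sketch assumes the options appear in the kernel config file as lines of the form OPTION=y, as the HOWTO lists them.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with mce-test): print every option
# that is NOT set to "y" in the given kernel config file.
# usage: missing_kconfig CONFIG_FILE OPTION...
missing_kconfig() {
    cfg=$1
    shift
    for opt in "$@"; do
        # The HOWTO lists each option as OPTION=y, so match that exact line.
        grep -q "^${opt}=y\$" "$cfg" || echo "$opt"
    done
}

# Options taken from section 4.2 (HWPoison page injector) of the HOWTO.
HWPOISON_OPTS="CONFIG_X86_MCE CONFIG_MEMORY_FAILURE CONFIG_DEBUG_KERNEL
CONFIG_CGROUP_MEM_RES_CTLR_SWAP CONFIG_HWPOISON_INJECT"
```

A possible use is `missing_kconfig /boot/config-$(uname -r) $HWPOISON_OPTS`, where the config file location is an assumption that varies between distributions. An empty output would mean all listed options are enabled; an option built as a module (=m) is reported as missing by this sketch.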