author Haicheng Li <haicheng.li@intel.com> 2009-10-27 10:52:57 +0800
committer Andi Kleen <ak@linux.intel.com> 2009-10-27 05:32:41 +0100
commit 661d6e0fbc5c61389a4d7328ed06ec5dbc574fd8
tree 3ece8850af18b3f71c517bd3f4b91da8a989e0d7
parent 6eec280c1bac479ed5f75481c275b7573f67d12b
HOWTO: documentation of MCE stress test suite.

Documentation of MCE stress test suite.

Reviewed-by: Jiajia Zheng <jiajia.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
-rw-r--r-- README 4
-rw-r--r-- doc/stress-howto.txt 340
2 files changed, 344 insertions, 0 deletions
diff --git a/README b/README
index 193ff15..e4a7cfb 100644
--- a/README
+++ b/README
@@ -80,6 +80,8 @@ bin/
Some tools used by test drivers or test cases will be
installed into this directory.
+stress/
+ Linux MCE stress test suite.
Test Instruction
----------------
@@ -95,6 +97,8 @@ in doc sub-directory.
doc/howto.txt: a more detailed HOWTO document
+doc/stress-howto.txt: Detailed HOWTO document for MCE stress test suite
+
doc/cases/*.txt: Description of every test case, including test
objective, code patch tested, reference and
expected results
diff --git a/doc/stress-howto.txt b/doc/stress-howto.txt
new file mode 100644
index 0000000..f96a0ec
--- /dev/null
+++ b/doc/stress-howto.txt
@@ -0,0 +1,340 @@
+MCE Stress Test HOWTO
+====================
+
+Oct 10th, 2009
+
+Haicheng Li
+
+
+Abstract
+--------
+
+This document explains the design and structure of the MCE stress test suite,
+the kernel configuration and user space tools required for automated
+stress testing, and how to use the suite.
+
+
+0. Quick Shortcut
+-----------------
+
+- Install the Linux kernel (2.6.32 or newer) with full MCA recovery support.
+ Make sure the following configuration options are enabled:
+
+ CONFIG_X86_MCE=y
+ CONFIG_MEMORY_FAILURE=y
+
+ With these two options enabled, you can do stress testing thru madvise
+ syscall (sec 4.1).
+
+- Install the page-types tool (sec 3.3), which ships with the Linux kernel
+ source (2.6.32 or newer).
+
+ # cd $KERNEL_SRC/Documentation/vm/
+ # gcc -o page-types page-types.c
+ # cp page-types /usr/bin/
+
+- Get the latest LTP (Linux Test Project) package from http://ltp.sf.net. Refer
+ to the INSTALL file of LTP to install LTP on your machine.
+
+- Build and run stress testing
+
+ # make
+ # cd stress
+ # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR -N
+
+ Note that '-d $YOUR_PARTITION' is a mandatory option. The test creates
+ all temporary files on $YOUR_PARTITION, and error injection only
+ affects the pages associated with $YOUR_PARTITION. So you must provide a
+ free disk partition to the stress test driver!
+
+ This will do the stress testing thru madvise syscall (sec 4.1). However,
+ there are more advanced test methods provided (sec 4.2, 4.3).
+
+Note, all examples in the rest of this doc assume that $PWD is
+the stress subdir.
+
+1. Overview
+-----------
+
+The MCE stress test suite is a collection of tools and test scripts
+intended to stress test the Linux kernel MCA high level handlers,
+which include HWPoison page recovery, soft page offline, and so on.
+
+In general, this test suite is designed to do stress testing thru various
+test interfaces, namely madvise syscall, HWPoison page injector, and APEI
+injector (see the ACPI 4.0 spec). It supports most popular Linux
+file systems (FS): there is an option for the user to specify which
+FS type the test should run on.
+
+If you just want to start testing as quickly as possible, you can skip
+sections 2 and 3 and go directly to section 4.
+
+
+2. Design Details
+-----------------
+
+The MCE stress test suite consists of four parts: test driver, workload
+controller, customized workloads, and background workloads.
+
+The main test idea is as follows:
+- Test driver launches various customized workloads to continuously generate
+ lots of pages with expected page states. Note, all of these workloads know
+ their expected results, which should not be affected by Linux MCE high
+ level handlers.
+- Then test driver injects MCE errors into these pages thru either madvise
+ syscall, HWPoison injector, or APEI injector. While the Linux kernel handles
+ these MCE errors, all the workloads continue running normally.
+- After a long run, test driver collects the test result of each
+ workload to see if any unexpected failure happened. In this way, it can
+ decide whether a bug has been found.
+- If any system panic or FS corruption happens, there must be a
+ bug. This is the bottom line for deciding whether the test passes.
+
+2.1 Test Driver
+
+Test driver (a.k.a. hwpoison.sh) drives the whole test procedure. It's
+responsible for managing the test environment, setting up the error injection
+interface, controlling test progress, launching workloads, injecting page
+errors, as well as recording test logs and reporting the test result.
+
+For detailed usage of hwpoison.sh test driver, please refer to:
+# ./hwpoison.sh -h
+
+2.2 Workload Controller
+
+Workload controller needs to keep various test workloads running in parallel
+and continuously for a required duration. We select the ltp-pan
+program of Linux Test Project (LTP) as the workload controller of this
+stress test suite.
+
+Test driver (hwpoison.sh) interacts with ltp-pan in the following ways:
+- hwpoison.sh generates a test config file that lists the workload types
+ to be launched by ltp-pan.
+- hwpoison.sh also passes the test duration and other workload specific
+ parameters to ltp-pan via the test config file.
+- ltp-pan makes each workload run and finish in time, then test driver
+ can get the result of each workload via the corresponding result files.
+- finally, hwpoison.sh decides the overall test result based on each
+ workload result, and reports the final result.
+
+2.3 Customized Workloads
+
+There are three types of customized workloads, which are intended to generate
+pages with various page states.
+
+* Type0: page-poisoning workload, meant to cover:
+ - anonymous pages operations.
+ - file data operations.
+
+* Type1: fs-metadata workload, meant to cover:
+ - inode operations.
+
+* Type2: fs_type specific workload, meant to cover:
+ - extended functions of some special FS.
+
+2.4 Background Workloads
+
+LTP is selected as the background workload to simulate normal system
+operations in background while stress testing is running.
+
+Besides LTP, there are also some alternatives, like AIM. We might add more
+background workloads in the future.
+
+2.5 Test Result
+
+How to determine whether stress testing passes?
+- at least, no kernel panic happens during stress testing.
+- fsck on the target disk at the end of stress testing passes.
+- no failure is found by the customized workloads, especially by the
+ page-poisoning workload.
+
+Where to get detailed test results?
+- When stress testing is done, the general test result is recorded in
+ result/hwpoison.result, and the general test log is in result/hwpoison.log.
+ However, you can specify other locations in the following way:
+ # hwpoison.sh -r $YOUR_RESULT -l $YOUR_LOG
+- The test result and test log of each workload are recorded as
+ log/$workload/$workload.result and log/$workload/$workload.log.
+ For example, for the page-poisoning workload, the test result and test log
+ are log/page-poisoning/page-poisoning.result and
+ log/page-poisoning/page-poisoning.log.
+- Besides, under each workload result dir, you can find extra logs
+ such as pan_log and pan_output. These logs are generated by the ltp-pan
+ workload controller. Usually they can help you understand what has been
+ going on with ltp-pan while the workload is running. Please refer to the
+ ltp-pan doc for details.
+
+
+3. Tools
+--------
+
+3.1 page-poisoning
+
+This is the page-poisoning workload. It is an extension of the
+tinjpage test program with a multi-process model. It spawns thousands of
+processes that inject HWPoison errors into various pages simultaneously thru
+madvise syscall. Then it checks whether these errors get handled correctly,
+i.e. whether each test process receives or doesn't receive a SIGBUS signal as
+expected.
+
+For more info about the page-poisoning workload, please read the README file
+under stress/tools/page-poisoning/.
+
+3.2 fs-metadata
+
+This is the fs-metadata workload. fs-metadata is designed to test inode
+operations under heavy workload and make sure every inode operation gets
+the expected result. In detail, it first generates a huge directory
+hierarchy on the target disk, then performs unlink operations on this
+directory hierarchy and duplicates a copy of the directory, and finally
+checks whether the two directories are the same, as expected.
+
+For more info about the fs-metadata workload, please read the README file
+under stress/tools/fs-metadata/.
+
+3.3 page-types
+
+page-types is a tool to query the page type of every memory page in the
+system. We use it to select pages with the required page types. The test
+injects errors into these pages via the error injector, and the page filter
+of the HWPoison handler in the Linux kernel then filters them a second
+time. Note, the only reason to do the first round of filtering with
+page-types is performance.
+
+To install page-types on your test machine:
+
+ # cd $KERNEL_SRC/Documentation/vm/
+ # gcc -o page-types page-types.c
+ # cp page-types /usr/bin/
+
+3.4 ltp-pan
+
+It's the workload controller of this stress test suite. In fact, ltp-pan
+is the test harness of LTP (Linux Test Project), and is included in
+LTP package. For more information, please refer to ltp-pan document of LTP.
+
+
+4. Usage Guide
+--------------
+
+This section shows how to conduct the stress testing thru
+various test interfaces.
+
+As an example, we run stress testing on partition /dev/sda1
+for 1 hour. Note, we've installed LTP to /ltp.
+
+4.1 Stress Test thru Madvise Syscall
+
+To run this stress testing, you need to strictly follow the test
+instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+ CONFIG_X86_MCE=y
+ CONFIG_MEMORY_FAILURE=y
+
+- build and run stress testing
+ # make
+ # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR
+
+* Example:
+
+- launch testing
+ # ./hwpoison.sh -d /dev/sda1 -M -t 3600
+
+- general test results
+ result: result/hwpoison.result
+ logs: result/hwpoison.log
+
+- detailed workload results
+ result: log/page-poisoning/page-poisoning.result
+ log: log/page-poisoning/page-poisoning.log
+
+4.2 Stress Test thru HWPoison Page Injector
+
+This is the default test method of this stress test suite.
+
+To run this stress testing, you need to strictly follow the test
+instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+ CONFIG_X86_MCE=y
+ CONFIG_MEMORY_FAILURE=y
+ CONFIG_DEBUG_KERNEL=y
+ CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+ CONFIG_HWPOISON_INJECT=y
+
+- build and run stress testing
+ # make
+ # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L
+
+* Example:
+
+- launch testing
+ # ./hwpoison.sh -d /dev/sda1 -t 3600 -L
+
+- general test results
+ result: result/hwpoison.result
+ logs: result/hwpoison.log
+
+- detailed workload results
+ fs-metadata result: log/fs-metadata/fs-metadata.result
+ fs-metadata log: log/fs-metadata/fs-metadata.log
+ ltp result: log/ltp/ltp.result
+ ltp log: log/ltp/ltp.log
+ fs-specific result: log/fs-specific/fs-specific.result
+ fs-specific log: log/fs-specific/fs-specific.log
+
+4.3 Stress Test thru APEI Injector
+
+To run this stress testing, you need to follow the test instructions below.
+
+* Test instructions:
+
+- make sure the following kernel options are enabled:
+ CONFIG_X86_MCE=y
+ CONFIG_X86_MCE_INTEL=y
+ CONFIG_MEMORY_FAILURE=y
+ CONFIG_ACPI_APEI=y
+ CONFIG_ACPI_APEI_EINJ=y
+
+- build and run stress testing
+ # make
+ # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L -A
+
+* Example:
+
+- launch testing
+ # ./hwpoison.sh -d /dev/sda1 -t 3600 -L -A
+
+- general test results
+ result: result/hwpoison.result
+ logs: result/hwpoison.log
+
+- detailed workload results
+ fs-metadata result: log/fs-metadata/fs-metadata.result
+ fs-metadata log: log/fs-metadata/fs-metadata.log
+ ltp result: log/ltp/ltp.result
+ ltp log: log/ltp/ltp.log
+ fs-specific result: log/fs-specific/fs-specific.result
+ fs-specific log: log/fs-specific/fs-specific.log
+
+
+5. FAQs
+-------
+
+Here is a collection of frequently asked questions:
+
+Q: How do I tell the test driver not to format my disk partition?
+A: Use the option '-N'.
+
+Q: Can the three types of tests run on the same system simultaneously?
+A: No. There are limitations in Linux Kernel HWPoison page filtering.
+
+Q: Can I run this stress testing on multiple disks in parallel?
+A: Yes. But it requires updated kernel patches for HWPoison page filtering.
+ For now, it only supports running the same test with the same pagetype flags
+ specified.
+
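Each test method in section 4 of the HOWTO starts from a list of kernel
options that must be enabled. As a rough sketch only, the check could be
scripted as below before launching hwpoison.sh. The helper name check_kconfig
is hypothetical and is not part of mce-test, and the kernel config file
location is an assumption that varies between distributions.

```shell
#!/bin/sh
# Hypothetical helper (not part of mce-test): report which of the given
# kernel options are not set to "y" in a kernel config file.
# Usage: check_kconfig CONFIG_FILE OPTION...
# Returns 0 if every option is enabled, 1 otherwise.
check_kconfig() {
    cfg=$1
    shift
    missing=0
    for opt in "$@"; do
        # Match only an exact "OPTION=y" line.
        if grep -q "^${opt}=y\$" "$cfg"; then
            echo "ok:      $opt"
        else
            echo "missing: $opt"
            missing=1
        fi
    done
    return $missing
}

# Example for the madvise method (sec 4.1). The config path is an assumption;
# some systems expose the running kernel's config as /proc/config.gz instead.
# check_kconfig "/boot/config-$(uname -r)" CONFIG_X86_MCE CONFIG_MEMORY_FAILURE
```

For the HWPoison page injector or APEI injector methods, pass the longer
option lists from sec 4.2 or sec 4.3 instead.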