********************************************************************************
*                                                                              *
*                                                                              *
*                              Documentation                                   *
*                                                                              *
*                                                                              *
********************************************************************************


The motivation behind this suite is to exercise functions and regions of the
mm/ subsystem of the Linux kernel that are of interest to us. Many more
regions remain uncovered by the test cases.


################################################################################
#                                  KERNEL CONFIGURATION                        #
################################################################################

The test suite was developed with the help of the gcov code coverage analysis
tool. Gcov can be enabled as a configuration option in Linux kernel 2.6 and
upwards. Gcov gives a per-line execution count for the source files. The
supporting tool versions that were used in the development of the suite are

gcov   - 4.6.3
gcc    - 4.6.3
Kernel - 3.4

The test kernel needs to be configured with the following options set

        CONFIG_LOCK_STAT=y

        CONFIG_GCOV_KERNEL=y
        CONFIG_GCOV_PROFILE_ALL=y

        CONFIG_PERF_EVENTS=y
        CONFIG_FTRACE_SYSCALLS=y

        CONFIG_TRANSPARENT_HUGEPAGE=y

Once the test kernel has been compiled and installed, a debugfs is mounted on
/sys/kernel/debug. Writing to the file /sys/kernel/debug/gcov/reset resets all
the counters. The directory /sys/kernel/debug/gcov/ also has a link to the
build directory on the test system. For more information about setting up
gcov, consult the gcov documentation.


################################################################################
#                                    FILES                                     #
################################################################################

hw_vars: This file is the interface between the system and the suite and
provides system configuration information such as total memory, number of
CPUs, nr_hugepages etc. All of the /proc/meminfo output can be extracted
through this file.

This file also has functions to create and delete sparse files in the
$SPARSE_ROOT directory, which is set to /tmp/vm-scalability. Most of the cases
work on these files. The sparse root can be made btrfs, ext4 or xfs by
suitably editing the hw_vars file.
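
As an illustration of what such a sparse file is, here is a minimal C sketch
that creates one by extending a file's size without writing any data (the path
and size are illustrative; hw_vars itself does this from the shell):

    /*
     * Minimal sketch: a sparse file is created by growing the file's
     * logical size without writing data, so no disk blocks are
     * allocated and reads of the hole return zeroes.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/vm-scalability/sparse-demo",
                          O_CREAT | O_RDWR, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (ftruncate(fd, 1ULL << 30)) {   /* 1GB logical size */
                    perror("ftruncate");
                    return 1;
            }
            close(fd);
            return 0;
    }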

run_cases: This file first resets the counters and then executes all the test
cases one by one. Some of the functions are invoked only during startup and
are reset before the cases are run. Hence some cases need to enable them again
via the case script before calling the executables, so that the necessary
invoking functions are accounted for in the source_file.gcov. Similarly, some
of the functions are invoked dynamically when a system configuration value
like nr_hugepages changes during the execution of the program. Cases like that
have been covered by resetting such parameters from within the program.

cases-*: These files call the executable files with suitable options. Some of
them have a multithreaded option while some don't.


################################################################################
#                                    USAGE                                     #
################################################################################

cd /path/to/suite/directory
make all
./run
gcov -o /path/to/build/directory/with/the/.gcno <source_file>.c

The last command produces the source_file.c.gcov file, which has the coverage
information for each line of the mm/ source files.

Note: scripts to automatically gather the gcno/gcda files and extract the
source_file.gcov coverage data files are available in the gcov documentation.

The cases in the suite call an executable file with options. Most of the cases
work on usemem. Some of the cases that call other executables have been
written in separate files in order to modularise the code, and have been named
based on the kernel functionality they exercise.

Some of the cases merely call trivial system calls and do not do anything else.
They can be extended suitably as per need.

Some cases like case-migrate, case-mbind etc. need a NUMA setup. This was
achieved using the numa=fake=<value> kernel boot option, where the value is
the number of nodes to be emulated. The suite was tested with value = 2, which
is the minimum for inter-node page migration. The cases that require the NUMA
setup need to be linked with the -lnuma flag, and libnuma has to be installed
on the system. The executables that these cases call have been taken from the
numactl documentation and slightly modified. They have been found to work on a
2 node numa-emulated machine.

Cases which require sysfs parameters to be set using echo <value> >
sysfs_parameter may need tweaking based on the system configuration. The
default values used in the case scripts may not scale well when system
parameters are scaled. For example, on systems with more memory,
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan may need to be
set to a higher value, or scan_sleep_millisecs needs to be reduced, or both.
Failure to scale the values may result in disproportionate or sometimes no
observable coverage in the corresponding functions.
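
As a hedged illustration, such a tunable can also be set from C; the value
below is arbitrary, not a recommendation:

    /* Sketch: equivalent of "echo 4096 > .../pages_to_scan". */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                            "khugepaged/pages_to_scan", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            fprintf(f, "4096\n");   /* arbitrary example value */
            return fclose(f) ? 1 : 0;
    }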

Cases can be run individually using

./case-name

with the suite directory as pwd. The scripts work under this assumption.
Also, care has to be taken to make sure that the sparse root is mounted. The
run_cases script takes care of mounting the sparse partition before running
the scripts.

Hugepages are assumed to be 2MB in size.


################################################################################
#                                  WARNING                                     #
################################################################################

The coverage analysis with gcov enabled by setting the

CONFIG_GCOV_KERNEL=y
CONFIG_GCOV_PROFILE_ALL=y

configuration options profiles the entire kernel. Hence the system boot time
is considerably increased and the system runs a little slower too. Enabling
these configuration options on high-end server systems has been observed to
cause boot problems or unstable kernels, with or without a lot of errors.

################################################################################
#                                  CASE DESCRIPTION                            #
################################################################################

case-000-anon:
Fill 1/3 of total memory with anonymous pages by creating an anonymous
memory region and continuously writing to it. This test is used to exercise
the kernel's page fault handler and memory allocation.
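
A minimal sketch of the pattern this case exercises (the size here is fixed
and illustrative; the real case derives it from total memory):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SIZE (256UL << 20)      /* 256MB, illustrative */

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* the first write to each page triggers a page fault and
               an anonymous page allocation */
            memset(p, 1, SIZE);
            return 0;
    }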

case-000-shm:
Fill 1/3 of total memory by continuously writing to a file that is hosted
on a tmpfs; the file is accessed with mmap.

The above two test cases are meant to eat memory; they should be used
together with other test cases.

Anonymous page related:

case-anon-cow-seq/rand:
The parent allocates a portion of anonymous memory and then forks several
child processes; these child processes write data sequentially/randomly to
that memory region. This test is used to exercise the kernel's copy-on-write
functionality.
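
A minimal sketch of the COW pattern, assuming fixed illustrative sizes and
child counts:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SIZE (64UL << 20)       /* illustrative */

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(p, 1, SIZE);             /* parent populates */
            for (int i = 0; i < 4; i++) {
                    if (fork() == 0) {
                            /* each child write faults a COW copy */
                            memset(p, 2, SIZE);
                            _exit(0);
                    }
            }
            while (wait(NULL) > 0)          /* reap the children */
                    ;
            return 0;
    }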

case-anon-cow-seq/rand-mt:
Since threads share the same memory space, no COW happens here. This test is
used to make sure no COW occurs, and thus its performance should be much
better than the process version above.

case-anon-r-rand(-mt):
mmap an anonymous region of the specified size and read at random positions to
trigger page faults. Since this is a read-only test and the kernel will always
use the same zero page for all the faults, this test checks whether this fast
path works.

The -mt version uses threads instead of processes, so it can also exercise the
page table lock.
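
A minimal sketch of this read-only fault path; the size is illustrative:

    #include <stdio.h>
    #include <sys/mman.h>

    #define SIZE (64UL << 20)

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            unsigned long sum = 0;

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* each read fault maps the shared zero page */
            for (unsigned long i = 0; i < SIZE; i += 4096)
                    sum += p[i];
            printf("sum = %lu\n", sum);     /* always 0 */
            return 0;
    }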

case-anon-r-seq(-mt):
Sequential reads instead of random reads compared to the above test. This
should have the same effect, since the mapped physical page, i.e. the zero
page, is always accessible; whether it is read sequentially or randomly, only
page faults occur and no IO happens.

case-anon-rx-seq-mt:
Almost the same as case-anon-r-seq-mt; the only difference is that it does the
allocation before the test starts. The allocation is actually an mmap of a
huge anonymous region, and that alone shouldn't take much time. This test is
meant to see the speed difference between prefault (this case) and
non-prefault (the above case).

case-anon-rx-rand-mt:
Read randomly instead of sequentially compared to case-anon-rx-seq-mt. Since
the region is preallocated, it shouldn't matter whether it is read randomly or
sequentially, so this test case is used to verify that.

case-anon-w-seq/rand:
Start N tasks; each mmaps an anonymous region of 1/2N of the whole system
memory size and writes sequentially/randomly to that region. This will trigger
page faults and memory allocation.

case-anon-w-seq/rand-mt:
Use threads instead of tasks compared to the above test case.

case-anon-wx-seq/rand-mt:
Preallocate, i.e. mmap the anonymous memory region before the test starts,
compared to the above test case.

case-direct-write:
Open a file with O_DIRECT and then continuously write data to it till the
world ends. The file is a sparse file.
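
A minimal sketch of the O_DIRECT pattern, with a bounded loop here for
illustration (the path, sizes and alignment are illustrative; O_DIRECT
requires a suitably aligned buffer and a filesystem that supports it):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (1UL << 20)    /* 1MB per write */

    int main(void)
    {
            void *buf;
            int fd = open("/tmp/vm-scalability/direct-demo",
                          O_CREAT | O_WRONLY | O_DIRECT, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* O_DIRECT buffers must be aligned; page size is safe */
            if (posix_memalign(&buf, 4096, BUF_SIZE)) {
                    fprintf(stderr, "posix_memalign failed\n");
                    return 1;
            }
            memset(buf, 1, BUF_SIZE);
            for (int i = 0; i < 16; i++)    /* bounded, unlike the case */
                    if (write(fd, buf, BUF_SIZE) != (ssize_t)BUF_SIZE) {
                            perror("write");
                            return 1;
                    }
            close(fd);
            return 0;
    }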

case-fork:
Start 20000 processes that do nothing. This test is used to test fork
performance.

case-fork-sleep:
Almost the same as the above test, except that each task sleeps 10s before
exiting. The sleep is used to make sure there are many processes alive at the
same time.
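
A minimal sketch of both fork cases; the process count is reduced here for
illustration:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            for (int i = 0; i < 100; i++) { /* the case uses 20000 */
                    pid_t pid = fork();

                    if (pid == 0) {
                            /* sleep(10) here for the -sleep variant */
                            _exit(0);
                    }
                    if (pid < 0)
                            perror("fork");
            }
            while (wait(NULL) > 0)          /* reap all children */
                    ;
            return 0;
    }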

case-hugetlb:
Try to allocate 1/3 of the whole memory size as huge pages by manipulating the
/proc/sys/vm/nr_hugepages file, and then free them all.

case-ksm:
Start one process per node; each process mmaps (read-write) a private
anonymous region of MemTotal/1000 size, populating it at mmap time, then uses
madvise to set MERGEABLE to trigger KSM; it sleeps 1 minute and then disables
MERGEABLE. The populate will produce a lot of zero pages, so KSM should have a
huge effect. We can measure CPU consumption and how much memory gets freed.
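
A minimal sketch of the madvise calls involved (size and sleep time are
illustrative; ksmd must be running, i.e. /sys/kernel/mm/ksm/run set to 1):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SIZE (64UL << 20)

    int main(void)
    {
            /* MAP_POPULATE pre-faults the region with zeroed pages,
               which are prime candidates for KSM merging */
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                           -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            if (madvise(p, SIZE, MADV_MERGEABLE))
                    perror("madvise MERGEABLE");
            sleep(60);              /* give ksmd time to scan and merge */
            if (madvise(p, SIZE, MADV_UNMERGEABLE))
                    perror("madvise UNMERGEABLE");
            return 0;
    }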

case-ksm-hugepages:
Start N processes; each mmaps a read-write private anonymous region of size
MemTotal/2N. Transparent hugepage is set to 'always' in this case, so all
allocations are done in hugepages (zeroed hugepages). The test then sets
MERGEABLE for this region, which should trigger KSM. After the set, it sleeps
30 seconds and then clears the MERGEABLE flag for the region, and the test is
done.
Again, we can measure CPU consumption and how much memory gets freed.

File related:
case-lru-file-mmap-read/rand:
Fork N processes; each process mmaps a separate file of size ROTATE_SIZE/N and
reads its data sequentially/randomly to exercise LRU-related functions.

case-lru-file-mmap-write:
Almost the same as the above test, except doing write instead of read.

case-lru-file-readonce:
Quite similar to case-lru-file-mmap-read, except that it uses dd to read the
file.

case-lru-file-readtwice:
The file is read twice instead of once compared to the above test.

case-lru-memcg:
Set the memcg memory limit to 1/3 of the total memory size and then use
case-lru-file-readonce to do the test.

case-lru-shm:
Start N tasks; each creates a sparse file on a tmpfs with a size of MemTotal,
then reads 1/2N of its data. So the N processes will fill memory half full
after they are all done. It doesn't matter that the tasks have exited, since
the read portion of the files will still occupy memory.

case-lru-shm-rand:
Almost the same as the above test case except the read is done in random
order instead of sequentially.

case-mbind:
Start N processes; each allocates anonymous pages amounting to 1/5N of
MemFree, then moves these pages to node 0 with numa_move_pages and mbind
(which seems redundant here, since mbind can also move existing pages), and
uses get_mempolicy to verify that these pages are indeed on the desired node,
printing error messages if not. It then uses mbind to move these pages to
node 1 and verifies again with get_mempolicy. This test checks whether
mbind's ability to move existing pages works.
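
A minimal sketch of the mbind/get_mempolicy round trip (compile with -lnuma;
assumes at least 2 NUMA nodes, e.g. numa=fake=2; sizes are illustrative):

    #include <numaif.h>             /* mbind, get_mempolicy */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (16UL * 4096)       /* 16 pages, illustrative */

    int main(void)
    {
            char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            unsigned long mask = 1UL << 1;  /* nodemask: node 1 only */
            int node = -1;

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(p, 1, LEN);      /* fault the pages in first */

            /* MPOL_MF_MOVE migrates the already-allocated pages */
            if (mbind(p, LEN, MPOL_BIND, &mask, 8 * sizeof(mask),
                      MPOL_MF_MOVE))
                    perror("mbind");

            /* ask which node the first page actually lives on */
            if (get_mempolicy(&node, NULL, 0, p,
                              MPOL_F_NODE | MPOL_F_ADDR))
                    perror("get_mempolicy");
            printf("page 0 is on node %d\n", node);
            return 0;
    }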

case-migrate:
/* FIXME: */ migratepages is missing.

case-migrate-across-nodes:
Start $nr_node processes; each allocates anonymous pages amounting to 1/2.5N
of MemFree, then uses numa_move_pages to move these pages to node 1 and
verifies that the move succeeded; it then uses numa_migrate_pages to migrate
these pages to node 0 and verifies that the migration succeeded. This test
checks whether migrate_pages works as expected. KSM is enabled before the
test; the impact is unclear.

case-mincore:
Start N threads; each thread has a separate file as its backing store. Mmap
that file and read MemTotal/N bytes of data at random positions (which means
some parts of the memory space will never be touched). After this, use the
mincore system call to see how many pages are in core. Then the test exits.
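
A minimal sketch of the mincore check (the path and sizes are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGES 16
    #define PAGE  4096UL

    int main(void)
    {
            unsigned char vec[PAGES];
            volatile char c;
            int fd = open("/tmp/vm-scalability/mincore-demo",
                          O_CREAT | O_RDWR, 0644);
            int resident = 0;
            char *p;

            if (fd < 0 || ftruncate(fd, PAGES * PAGE)) {
                    perror("setup");
                    return 1;
            }
            p = mmap(NULL, PAGES * PAGE, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            for (unsigned long i = 0; i < PAGES; i += 2)
                    c = p[i * PAGE];        /* touch every other page */
            if (mincore(p, PAGES * PAGE, vec)) {
                    perror("mincore");
                    return 1;
            }
            for (int i = 0; i < PAGES; i++)
                    resident += vec[i] & 1; /* low bit = resident */
            printf("%d of %d pages resident\n", resident, PAGES);
            return 0;
    }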

case-mlock:
Start N processes; each allocates 1/3N of the reclaimable memory
((nr_free_page + nr_file_page)*PAGE_SIZE) with mmap and then uses mlock to
lock all the allocated space into memory. The mlock system call also causes
the memory to be actually allocated.
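
A minimal sketch of the mlock step (the size is illustrative; this may need
root or a raised RLIMIT_MEMLOCK):

    #include <stdio.h>
    #include <sys/mman.h>

    #define SIZE (16UL << 20)

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* mlock faults the pages in and pins them in memory */
            if (mlock(p, SIZE)) {
                    perror("mlock");
                    return 1;
            }
            munlock(p, SIZE);
            return 0;
    }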

case-mmap-pread-seq/rand:
Create a sparse file with a size of 4T and then start N processes to mmap the
file and read its content sequentially/randomly. This is used to cause pressure
on the LRU.

case-mmap-pread-seq/rand-mt:
Uses threads to do the read instead of processes compared to the above case.

case-mmap-xread-seq/rand-mt:
Preallocate, so that all the threads use the same preallocated memory space
instead of allocating their own. This should cause less pressure on the LRU
list compared to the above test case.

case-msync:
Create N sparse files, each with a size of $MemTotal. For each sparse file,
start a process to write 1/2N of the sparse file's size. After the write, do
an msync to make sure the changes in memory have reached the file.

case-msync-mt:
Create a sparse file with a size of $MemTotal. Before creating N threads,
preallocate and prefault 1/2 of the memory space with mmap, using this sparse
file as backing store; the N threads then all write data there using the
preallocated space. When this is done, use msync to flush the changes back to
the file.
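
A minimal sketch of the write-then-msync pattern common to both cases (the
path and size are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SIZE (8UL << 20)

    int main(void)
    {
            int fd = open("/tmp/vm-scalability/msync-demo",
                          O_CREAT | O_RDWR, 0644);
            char *p;

            if (fd < 0 || ftruncate(fd, SIZE)) {
                    perror("setup");
                    return 1;
            }
            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(p, 1, SIZE);             /* dirty the shared mapping */
            if (msync(p, SIZE, MS_SYNC))    /* flush changes to the file */
                    perror("msync");
            return 0;
    }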

case-shm-pread-seq/rand:
Start N processes to read sequentially/randomly a file hosted on a tmpfs. The
file's size is the same as $MemTotal and the read size is half the file's
size. The end result is that the file will occupy 1/2 of $MemTotal.

case-shm-pread-seq/rand-mt:
Use threads instead of processes compared to the above case. This will
generate some pressure on the page table lock.

case-shm-xread-seq/rand:
Preallocate the space using mmap before forking the N processes to do the
read, compared to case-shm-pread-seq/rand. The difference between them is
that the preallocation saves these processes an mmap call.

case-shm-xread-seq/rand-mt:
Use threads instead of processes compared to the above case. Since threads
share the same VM, the page faults for some pages may occur concurrently, so
this test may be able to exercise page fault scalability.