kernel/git/cohuck/linux.git - Cornelia Huck's fork of linux.git

Age	Commit message (Collapse)	Author	Files	Lines
2017-07-10	Merge branch 'akpm' (patches from Andrew)HEAD master	Linus Torvalds	103	-1249/+1538
	Merge more updates from Andrew Morton: - most of the rest of MM - KASAN updates - lib/ updates - checkpatch updates - some binfmt_elf changes - various misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits) kernel/exit.c: avoid undefined behaviour when calling wait4() kernel/signal.c: avoid undefined behaviour in kill_something_info binfmt_elf: safely increment argv pointers s390: reduce ELF_ET_DYN_BASE powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB arm64: move ELF_ET_DYN_BASE to 4GB / 4MB arm: move ELF_ET_DYN_BASE to 4MB binfmt_elf: use ELF_ET_DYN_BASE only for PIE fs, epoll: short circuit fetching events if thread has been killed checkpatch: improve multi-line alignment test checkpatch: improve macro reuse test checkpatch: change format of --color argument to --color[=WHEN] checkpatch: silence perl 5.26.0 unescaped left brace warnings checkpatch: improve tests for multiple line function definitions checkpatch: remove false warning for commit reference checkpatch: fix stepping through statements with $stat and ctx_statement_block checkpatch: [HLP]LIST_HEAD is also declaration checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t checkpatch: improve the unnecessary OOM message test lib/bsearch.c: micro-optimize pivot position calculation ...
2017-07-10	kernel/exit.c: avoid undefined behaviour when calling wait4()	zhongjiang	1	-0/+4
	wait4(-2147483648, 0x20, 0, 0xdd0000) triggers: UBSAN: Undefined behaviour in kernel/exit.c:1651:9 The related calltrace is as follows: negation of -2147483648 cannot be represented in type 'int': CPU: 9 PID: 16482 Comm: zj Tainted: G B ---- ------- 3.10.0-327.53.58.71.x86_64+ #66 Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA , BIOS CTSAV036 04/27/2011 Call Trace: dump_stack+0x19/0x1b ubsan_epilogue+0xd/0x50 __ubsan_handle_negate_overflow+0x109/0x14e SyS_wait4+0x1cb/0x1e0 system_call_fastpath+0x16/0x1b Exclude the overflow to avoid the UBSAN warning. Link: http://lkml.kernel.org/r/1497264618-20212-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhongjiang <zhongjiang@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	kernel/signal.c: avoid undefined behaviour in kill_something_info	zhongjiang	1	-0/+4
	When running kill(72057458746458112, 0) in userspace I hit the following issue. UBSAN: Undefined behaviour in kernel/signal.c:1462:11 negation of -2147483648 cannot be represented in type 'int': CPU: 226 PID: 9849 Comm: test Tainted: G B ---- ------- 3.10.0-327.53.58.70.x86_64_ubsan+ #116 Hardware name: Huawei Technologies Co., Ltd. RH8100 V3/BC61PBIA, BIOS BLHSV028 11/11/2014 Call Trace: dump_stack+0x19/0x1b ubsan_epilogue+0xd/0x50 __ubsan_handle_negate_overflow+0x109/0x14e SYSC_kill+0x43e/0x4d0 SyS_kill+0xe/0x10 system_call_fastpath+0x16/0x1b Add code to avoid the UBSAN detection. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/1496670008-59084-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhongjiang <zhongjiang@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	binfmt_elf: safely increment argv pointers	Kees Cook	1	-11/+9
	When building the argv/envp pointers, the envp is needlessly pre-incremented instead of just continuing after the argv pointers are finished. In some (likely impossible) race where the strings could be changed from userspace between copy_strings() and here, it might be possible to confuse the envp position. Instead, just use sp like everything else. Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Rik van Riel <riel@redhat.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	s390: reduce ELF_ET_DYN_BASE	Kees Cook	1	-8/+7
	Now that explicitly executed loaders are loaded in the mmap region, we have more freedom to decide where we position PIE binaries in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. On 32-bit use 4MB, which is the traditional x86 minimum load location, likely to avoid historically requiring a 4MB page table entry when only a portion of the first 4MB would be used (since the NULL address is avoided). For s390 the position could be 0x10000, but that is needlessly close to the NULL address. Link: http://lkml.kernel.org/r/1498154792-49952-5-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Pratyush Anand <panand@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB	Kees Cook	1	-6/+7
	Now that explicitly executed loaders are loaded in the mmap region, we have more freedom to decide where we position PIE binaries in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. On 32-bit use 4MB, which is the traditional x86 minimum load location, likely to avoid historically requiring a 4MB page table entry when only a portion of the first 4MB would be used (since the NULL address is avoided). Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Pratyush Anand <panand@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	arm64: move ELF_ET_DYN_BASE to 4GB / 4MB	Kees Cook	1	-6/+6
	Now that explicitly executed loaders are loaded in the mmap region, we have more freedom to decide where we position PIE binaries in the address space to avoid possible collisions with mmap or stack regions. For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. On 32-bit use 4MB, to match ARM. This could be 0x8000, the standard ET_EXEC load address, but that is needlessly close to the NULL address, and anyone running arm compat PIE will have an MMU, so the tight mapping is not needed. Link: http://lkml.kernel.org/r/1498251600-132458-4-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	arm: move ELF_ET_DYN_BASE to 4MB	Kees Cook	1	-6/+2
	Now that explicitly executed loaders are loaded in the mmap region, we have more freedom to decide where we position PIE binaries in the address space to avoid possible collisions with mmap or stack regions. 4MB is chosen here mainly to have parity with x86, where this is the traditional minimum load location, likely to avoid historically requiring a 4MB page table entry when only a portion of the first 4MB would be used (since the NULL address is avoided). For ARM the position could be 0x8000, the standard ET_EXEC load address, but that is needlessly close to the NULL address, and anyone running PIE on 32-bit ARM will have an MMU, so the tight mapping is not needed. Link: http://lkml.kernel.org/r/1498154792-49952-2-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Pratyush Anand <panand@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	binfmt_elf: use ELF_ET_DYN_BASE only for PIE	Kees Cook	2	-14/+58
	The ELF_ET_DYN_BASE position was originally intended to keep loaders away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2 /bin/cat" might cause the subsequent load of /bin/cat into where the loader had been loaded.) With the advent of PIE (ET_DYN binaries with an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial portion of the address space is unused. For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are loaded above the mmap region. This means they can be made to collide (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with pathological stack regions. Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap region in all cases, and will now additionally avoid programs falling back to the mmap region by enforcing MAP_FIXED for program loads (i.e. if it would have collided with the stack, now it will fail to load instead of falling back to the mmap region). To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP) are loaded into the mmap region, leaving space available for either an ET_EXEC binary with a fixed location or PIE being loaded into mmap by the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE, which means architectures can now safely lower their values without risk of loaders colliding with their subsequently loaded programs. For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and suggestions on how to implement this solution. Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Pratyush Anand <panand@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	fs, epoll: short circuit fetching events if thread has been killed	David Rientjes	1	-0/+10
	We've encountered zombies that are waiting for a thread to exit that are looping in ep_poll() almost endlessly although there is a pending SIGKILL as a result of a group exit. This happens because we always find ep_events_available() and fetch more events and never are able to check for signal_pending() that would break from the loop and return -EINTR. Special case fatal signals and break immediately to guarantee that we loop to fetch more events and delay making a timely exit. It would also be possible to simply move the check for signal_pending() higher than checking for ep_events_available(), but there have been no reports of delayed signal handling other than SIGKILL preventing zombies from exiting that would be fixed by this. It fixes an issue for us where we have witnessed zombies sticking around for at least O(minutes), but considering the code has been like this forever and nobody else has complained that I have found, I would simply queue it up for 4.12. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: improve multi-line alignment test	Joe Perches	1	-1/+1
	The current test fails to warn about improper alignment with code like foo->bar = func(arg1, arg2); because foo->bar is not a single identifier. Convert the $Ident to $Lval which allows for multiple dereferences. Link: http://lkml.kernel.org/r/01c35b9b6a12a415e57746d45d589bfaad39952a.1498841563.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: improve macro reuse test	Joe Perches	1	-6/+6
	checkpatch reports a false positive when using token pasting argument multiple times in a macro. Fix it. Miscellanea: o Make the $tmp variable name used in the macro argument tests a bit more descriptive Link: http://lkml.kernel.org/r/cf434ae7602838388c7cb49d42bca93ab88527e7.1498483044.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: change format of --color argument to --color[=WHEN]	John Brooks	1	-6/+29
	The boolean --color argument did not offer the ability to force colourized output even if stdout is not a terminal. Change the format of the argument to the familiar --color[=WHEN] construct as seen in common Linux utilities such as git, ls and dmesg, which allows the user to specify whether to colourize output "always", "never", or "auto" when the output is a terminal. The default is "auto". The old command-line uses of --color and --no-color are unchanged. Link: http://lkml.kernel.org/r/efe43bdbad400f39ba691ae663044462493b0773.1496799721.git.joe@perches.com Signed-off-by: John Brooks <john@fastquake.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: silence perl 5.26.0 unescaped left brace warnings	Cyril Bur	1	-3/+3
	As of perl 5, version 26, subversion 0 (v5.26.0) some new warnings have occurred when running checkpatch. Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s){ <-- HERE \s/ at scripts/checkpatch.pl line 3544. Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s){ <-- HERE \s/ at scripts/checkpatch.pl line 3885. Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/^(\+.*(?:do\|\))){ <-- HERE / at scripts/checkpatch.pl line 4374. It seems perfectly reasonable to do as the warning suggests and simply escape the left brace in these three locations. Link: http://lkml.kernel.org/r/20170607060135.17384-1-cyrilbur@gmail.com Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Acked-by: Joe Perches <joe@perches.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: improve tests for multiple line function definitions	Joe Perches	1	-1/+25
	Add a block that identifies multiple line function definitions. Save the function name into $context_function to improve the embedded function name test. Look for misplaced open brace on the function definition. Emit an OPEN_BRACE error when the function definition is similar to void foo(int arg1, int arg2) { Miscellanea: o Remove the $realfile test in function declaration w/o named arguments test o Comment the function declaration w/o named arguments test Link: http://lkml.kernel.org/r/de620ed6ebab75fdfa323741ada2134a0f545892.1496835238.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Tested-by: David Kershner <david.kershner@unisys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: remove false warning for commit reference	Heinrich Schuchardt	1	-1/+3
	Checkpatch warns of an incorrect commit reference style for any hexadecimal number of 12 digits and more. Numbers of 12 digits are not necessarily commit ids. For an example provoking the problem see https://patchwork.kernel.org/patch/9170897/ Checkpatch should only warn if the number refers to an existing commit. Link: http://lkml.kernel.org/r/20170607184008.5869-1-xypron.glpk@gmx.de Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: fix stepping through statements with $stat and ctx_statement_block	Joe Perches	1	-1/+1
	Fix the off-by-one in the suppression of lines in a statement block. This means that for multiple line statements like foo(bar, baz, qux); $stat has been inspected first correctly for the entire statement, and subsequently incorrectly just for qux); This fix will help make tracking appropriate indentation a little easier. Link: http://lkml.kernel.org/r/71b25979c90412133c717084036c9851cd2b7bcb.1496862585.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: [HLP]LIST_HEAD is also declaration	Steffen Maier	1	-1/+1
	Fixes the following false warning among others for LLIST_HEAD and PLIST_HEAD: WARNING: Missing a blank line after declarations #71: FILE: drivers/s390/scsi/zfcp_fsf.c:422: + struct hlist_node *tmp; + HLIST_HEAD(remove_queue); Link: http://lkml.kernel.org/r/20170614133512.89425-1-maier@linux.vnet.ibm.com Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t	Joe Perches	1	-0/+11
	For consistency, MAINTAINERS entries should be an upper case letter, then a colon, then a tab, then the value. Warn when an entry doesn't have this form. --fix it too. Link: http://lkml.kernel.org/r/9aaaf03ec10adf3888b5e98dd2176b7fe9b5fad8.1496343345.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	checkpatch: improve the unnecessary OOM message test	Joe Perches	1	-1/+1
	Use the context around a patch to avoid missing some candidates. Link: http://lkml.kernel.org/r/865e874fbae5decc331a849bd8d71c325db6bc80.1496343345.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/bsearch.c: micro-optimize pivot position calculation	Sergey Senozhatsky	1	-10/+12
	There is a slightly faster way (in terms of the number of instructions being used) to calculate the position of a middle element, preserving integer overflow safeness. ./scripts/bloat-o-meter lib/bsearch.o.old lib/bsearch.o.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24) function old new delta bsearch 122 98 -24 TEST INT array of size 100001, elements [0..100000]. gcc 7.1, Os, x86_64. a) bsearch() of existing key "100001 - 2": BASE ==== $ perf stat ./a.out Performance counter stats for './a.out': 619.445196 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 133 page-faults:u # 0.215 K/sec 1,949,517,279 cycles:u # 3.147 GHz (83.06%) 181,017,938 stalled-cycles-frontend:u # 9.29% frontend cycles idle (83.05%) 82,959,265 stalled-cycles-backend:u # 4.26% backend cycles idle (67.02%) 4,355,706,383 instructions:u # 2.23 insn per cycle # 0.04 stalled cycles per insn (83.54%) 1,051,539,242 branches:u # 1697.550 M/sec (83.54%) 15,263,381 branch-misses:u # 1.45% of all branches (83.43%) 0.620082548 seconds time elapsed PATCHED ======= $ perf stat ./a.out Performance counter stats for './a.out': 475.097316 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 135 page-faults:u # 0.284 K/sec 1,487,467,717 cycles:u # 3.131 GHz (82.95%) 186,537,162 stalled-cycles-frontend:u # 12.54% frontend cycles idle (82.93%) 28,797,869 stalled-cycles-backend:u # 1.94% backend cycles idle (67.10%) 3,807,564,203 instructions:u # 2.56 insn per cycle # 0.05 stalled cycles per insn (83.57%) 1,049,344,291 branches:u # 2208.693 M/sec (83.60%) 5,485 branch-misses:u # 0.00% of all branches (83.58%) 0.475760235 seconds time elapsed b) bsearch() of un-existing key "100001 + 2": BASE ==== $ perf stat ./a.out Performance counter stats for './a.out': 499.244480 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 132 page-faults:u # 0.264 K/sec 1,571,194,855 cycles:u # 3.147 GHz (83.18%) 13,450,980 stalled-cycles-frontend:u # 0.86% frontend cycles idle (83.18%) 21,256,072 stalled-cycles-backend:u # 1.35% backend cycles idle (66.78%) 4,171,197,909 instructions:u # 2.65 insn per cycle # 0.01 stalled cycles per insn (83.68%) 1,009,175,281 branches:u # 2021.405 M/sec (83.79%) 3,122 branch-misses:u # 0.00% of all branches (83.37%) 0.499871144 seconds time elapsed PATCHED ======= $ perf stat ./a.out Performance counter stats for './a.out': 399.023499 task-clock:u (msec) # 0.998 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 134 page-faults:u # 0.336 K/sec 1,245,793,991 cycles:u # 3.122 GHz (83.39%) 11,529,273 stalled-cycles-frontend:u # 0.93% frontend cycles idle (83.46%) 12,116,311 stalled-cycles-backend:u # 0.97% backend cycles idle (66.92%) 3,679,710,005 instructions:u # 2.95 insn per cycle # 0.00 stalled cycles per insn (83.47%) 1,009,792,625 branches:u # 2530.660 M/sec (83.46%) 2,590 branch-misses:u # 0.00% of all branches (83.12%) 0.399733539 seconds time elapsed Link: http://lkml.kernel.org/r/20170607150457.5905-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/extable.c: use bsearch() library function in search_extable()	Thomas Meyer	8	-56/+63
	[thomas@m3y3r.de: v3: fix arch specific implementations] Link: http://lkml.kernel.org/r/1497890858.12931.7.camel@m3y3r.de Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/rhashtable.c: use kvzalloc() in bucket_table_alloc() when possible	Michal Hocko	1	-4/+3
	bucket_table_alloc() can be currently called with GFP_KERNEL or GFP_ATOMIC. For the former we basically have an open coded kvzalloc() while the later only uses kzalloc(). Let's simplify the code a bit by the dropping the open coded path and replace it with kvzalloc(). Link: http://lkml.kernel.org/r/20170531155145.17111-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/interval_tree_test.c: allow full tree search	Davidlohr Bueso	1	-5/+10
	... such that a user can specify visiting all the nodes in the tree (intersects with the world). This is a nice opposite from the very basic default query which is a single point. Link: http://lkml.kernel.org/r/20170518174936.20265-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/interval_tree_test.c: allow users to limit scope of endpoint	Davidlohr Bueso	1	-10/+13
	Add a 'max_endpoint' parameter such that users may easily limit the size of the intervals that are randomly generated. Link: http://lkml.kernel.org/r/20170518174936.20265-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/interval_tree_test.c: make test options module parameters	Davidlohr Bueso	1	-17/+40
	Allows for more flexible debugging. Link: http://lkml.kernel.org/r/20170518174936.20265-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/interval_tree_test.c: allow the module to be compiled-in	Davidlohr Bueso	1	-1/+1
	Patch series "lib/interval_tree_test: some debugging improvements". Here are some patches that update the interval_tree_test module allowing users to pass finer grained options to run the actual test. This patch (of 4): It is a tristate after all, and also serves well for quick debugging. Link: http://lkml.kernel.org/r/20170518174936.20265-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/kstrtox.c: use "unsigned int" more	Alexey Dobriyan	1	-4/+6
	gcc does generates stupid code sign extending data back and forth. Help by using "unsigned int". add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-61 (-61) function old new delta _parse_integer 128 123 -5 It _still_ does generate useless MOVSX but I don't know how to delete it: 0000000000000070 <_parse_integer>: ... a0: 89 c2 mov edx,eax a2: 83 e8 30 sub eax,0x30 a5: 83 f8 09 cmp eax,0x9 a8: 76 11 jbe bb <_parse_integer+0x4b> aa: 83 ca 20 or edx,0x20 ad: 0f be c2 ===> movsx eax,dl <=== useless b0: 8d 50 9f lea edx,[rax-0x61] b3: 83 fa 05 cmp edx,0x5 Patch also helps on embedded archs which generally only like "int". On arm "and 0xff" is generated which is waste because all values used in comparisons are positive. Link: http://lkml.kernel.org/r/20170514194720.GB32563@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/kstrtox.c: delete end-of-string test	Alexey Dobriyan	1	-1/+1
	Standard "while (*s)" test is unnecessary because NUL won't pass valid-digit test anyway. Save one branch per parsed character. Link: http://lkml.kernel.org/r/20170514193756.GA32563@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	bitmap: use memcmp optimisation in more situations	Matthew Wilcox	2	-3/+2
	Commit 7dd968163f7c ("bitmap: bitmap_equal memcmp optimization") was rather more restrictive than necessary; we can use memcmp() to implement bitmap_equal() as long as the number of bits can be proved to be a multiple of 8. And architectures other than s390 may be able to make good use of this optimisation. [arnd@arndb.de: fix build: add a memcmp() declaration] Link: http://lkml.kernel.org/r/20170630153908.3439707-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170628153221.11322-5-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	include/linux/bitmap.h: turn bitmap_set and bitmap_clear into memset when ↵	Matthew Wilcox	1	-0/+6
	possible Several callers have constant 'start' and an 'nbits' that is a multiple of 8, so we can turn them into calls to memset. We don't need the entirety of 'start' and 'nbits' to be constant, we just need to know whether they're divisible by 8. Link: http://lkml.kernel.org/r/20170628153221.11322-4-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	bitmap: optimise bitmap_set and bitmap_clear of a single bit	Matthew Wilcox	3	-10/+24
	We have eight users calling bitmap_clear for a single bit and seventeen calling bitmap_set for a single bit. Rather than fix all of them to call __clear_bit or __set_bit, turn bitmap_clear and bitmap_set into inline functions and make this special case efficient. Link: http://lkml.kernel.org/r/20170628153221.11322-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	lib/test_bitmap.c: add optimisation tests	Matthew Wilcox	1	-0/+32
	Patch series "Bitmap optimisations", v2. These three bitmap patches use more efficient specialisations when the compiler can figure out that it's safe to do so. Thanks to Rasmus's eagle eyes, a nasty bug in v1 was avoided, and I've added a test case which would have caught it. This patch (of 4): This version of the test is actually a no-op; the next patch will enable it. Link: http://lkml.kernel.org/r/20170628153221.11322-2-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	MAINTAINERS: give proc sysctl some maintainer love	Luis R. Rodriguez	1	-0/+11
	We poke at proc sysctl enough that really we should declare it maintained. We'll just be Cc'd and sending updates / ACK'ing changes through akpm's tree. Link: http://lkml.kernel.org/r/20170524231305.8649-1-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	kernel/kallsyms.c: replace all_var with IS_ENABLED(CONFIG_KALLSYMS_ALL)	Masahiro Yamada	1	-8/+2
	'all_var' looks like a variable, but is actually a macro. Use IS_ENABLED(CONFIG_KALLSYMS_ALL) for clarification. Link: http://lkml.kernel.org/r/1497577591-3434-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	kernel/groups.c: use sort library function	Rasmus Villemoes	1	-24/+11
	setgroups is not exactly a hot path, so we might as well use the library function instead of open-coding the sorting. Saves ~150 bytes. Link: http://lkml.kernel.org/r/1497301378-22739-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	kernel/ksysfs.c: constify attribute_group structures.	Arvind Yadav	1	-1/+1
	attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. File size before: text data bss dec hex filename 1120 544 16 1680 690 kernel/ksysfs.o File size After adding 'const': text data bss dec hex filename 1160 480 16 1656 678 kernel/ksysfs.o Link: http://lkml.kernel.org/r/aa224b3cc923fdbb3edd0c41b2c639c85408c9e8.1498737347.git.arvind.yadav.cs@gmail.com Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Petr Tesarik <ptesarik@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	ARM: fix rd_size declaration	Bart Van Assche	3	-2/+5
	The global variable 'rd_size' is declared as 'int' in source file arch/arm/kernel/atags_parse.c and as 'unsigned long' in drivers/block/brd.c. Fix this inconsistency. Additionally, remove the declarations of rd_image_start, rd_prompt and rd_doload from parse_tag_ramdisk() since these duplicate existing declarations in <linux/initrd.h>. Link: http://lkml.kernel.org/r/20170627065024.12347-1-bart.vanassche@wdc.com Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Jason Yan <yanaijie@huawei.com> Cc: Zhaohongjiang <zhaohongjiang@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	bug: split BUILD_BUG stuff out into <linux/build_bug.h>	Ian Abbott	2	-73/+85
	Including <linux/bug.h> pulls in a lot of bloat from <asm/bug.h> and <asm-generic/bug.h> that is not needed to call the BUILD_BUG() family of macros. Split them out into their own header, <linux/build_bug.h>. Also correct some checkpatch.pl errors for the BUILD_BUG_ON_ZERO() and BUILD_BUG_ON_NULL() macros by adding parentheses around the bitfield widths that begin with a minus sign. Link: http://lkml.kernel.org/r/20170525120316.24473-6-abbotti@mev.co.uk Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	linux/bug.h: correct "space required before that '-'"	Ian Abbott	1	-2/+2
	Correct these checkpatch.pl errors: \|ERROR: space required before that '-' (ctx:OxO) \|#37: FILE: include/linux/bug.h:37: \|+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); })) \|ERROR: space required before that '-' (ctx:OxO) \|#38: FILE: include/linux/bug.h:38: \|+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); })) I decided to wrap the bitfield expressions that begin with minus signs in parentheses rather than insert spaces before the minus signs. Link: http://lkml.kernel.org/r/20170525120316.24473-5-abbotti@mev.co.uk Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Kees Cook <keescook@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	linux/bug.h: correct "(foo)" should be "(foo )"	Ian Abbott	1	-1/+1
	Correct this checkpatch.pl error: \|ERROR: "(foo)" should be "(foo )" \|#19: FILE: include/linux/bug.h:19: \|+#define BUILD_BUG_ON_NULL(e) ((void*)0) Link: http://lkml.kernel.org/r/20170525120316.24473-4-abbotti@mev.co.uk Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Kees Cook <keescook@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	linux/bug.h: correct formatting of block comment	Ian Abbott	1	-4/+6
	Correct these checkpatch.pl warnings: \|WARNING: Block comments use * on subsequent lines \|#34: FILE: include/linux/bug.h:34: \|+/* Force a compilation error if condition is true, but also produce a \|+ result (of value 0 and type size_t), so the expression can be used \|WARNING: Block comments use a trailing / on a separate line \|#36: FILE: include/linux/bug.h:36: \|+ aren't permitted). / Link: http://lkml.kernel.org/r/20170525120316.24473-3-abbotti@mev.co.uk Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Kees Cook <keescook@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	asm-generic/bug.h: declare struct pt_regs; before function prototype	Ian Abbott	1	-0/+1
	This series of patches splits BUILD_BUG related macros out of "include/linux/bug.h" into new file "include/linux/build_bug.h" (patch 5), and changes the pointer type checking in the `container_of()` macro to deal with pointers of array type better (patch 6). Patches 1 to 4 are prerequisites. Patches 2, 3, 4, and 5 have been inserted since the previous version of this patch series. Patch 6 here corresponds to v3 and v4's patch 2. Patch 1 was a prerequisite in v3 of this series to avoid a lot of warnings when <linux/bug.h> was included by <linux/kernel.h>. That is no longer relevant for v5 of the series, but I left it in because it was acked by a Arnd Bergmann and Michal Nazarewicz. Patches 2, 3, and 4 are some checkpatch clean-ups on "include/linux/bug.h" before splitting out the BUILD_BUG stuff in patch 5. Patch 5 splits the BUILD_BUG related macros out of "include/linux/bug.h" into new file "include/linux/build_bug.h" because including <linux/bug.h> in "include/linux/kernel.h" would result in build failures due to circular dependencies. Patch 6 changes the pointer type checking by `container_of()` to avoid some incompatible pointer warnings when the dereferenced pointer has array type. 1) asm-generic/bug.h: declare struct pt_regs; before function prototype 2) linux/bug.h: correct formatting of block comment 3) linux/bug.h: correct "(foo)" should be "(foo )" 4) linux/bug.h: correct "space required before that '-'" 5) bug: split BUILD_BUG stuff out into <linux/build_bug.h> 6) kernel.h: handle pointers to arrays better in container_of() This patch (of 6): The declaration of `__warn()` has `struct pt_regs *regs` as one of its parameters. This can result in compiler warnings if `struct regs` is not already declared. Add an empty declaration of `struct pt_regs` to avoid the warnings. Link: http://lkml.kernel.org/r/20170525120316.24473-2-abbotti@mev.co.uk Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	fs/proc/generic.c: switch to ida_simple_get/remove	Heiner Kallweit	1	-25/+7
	The code can be much simplified by switching to ida_simple_get/remove. Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	frv: cmpxchg: implement cmpxchg64()	Will Deacon	1	-0/+1
	FRV supports 64-bit cmpxchg, which is provided by the arch code as __cmpxchg_64 and subsequently used to implement atomic64_cmpxchg. This patch hooks up the generic cmpxchg64 API using the same function, which also provides default definitions of the relaxed, acquire and release variants. This fixes the build when COMPILE_TEST=y and IOMMU_IO_PGTABLE_LPAE=y. Link: http://lkml.kernel.org/r/1499084670-6996-1-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	frv: use generic fb.h	Tobias Klauser	2	-12/+1
	The arch uses a verbatim copy of the asm-generic version and does not add any own implementations to the header, so use asm-generic/fb.h instead of duplicating code. Link: http://lkml.kernel.org/r/20170517083307.1697-1-tklauser@distanz.ch Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	frv: remove wrapper header for asm/device.h	Tobias Klauser	2	-7/+1
	frv's asm/device.h is merely including asm-generic/device.h. Thus, the arch specific header can be omitted and the generic header can be used directly. Link: http://lkml.kernel.org/r/20170517124915.26904-1-tklauser@distanz.ch Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	kasan: make get_wild_bug_type() static	Colin Ian King	1	-1/+1
	The helper function get_wild_bug_type() does not need to be in global scope, so make it static. Cleans up sparse warning: "symbol 'get_wild_bug_type' was not declared. Should it be static?" Link: http://lkml.kernel.org/r/20170622090049.10658-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/kasan/kasan.c: rename XXX_is_zero to XXX_is_nonzero	Joonsoo Kim	1	-7/+7
	They return positive value, that is, true, if non-zero value is found. Rename them to reduce confusion. Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/kasan: add support for memory hotplug	Andrey Ryabinin	2	-6/+35
	KASAN doesn't happen work with memory hotplug because hotplugged memory doesn't have any shadow memory. So any access to hotplugged memory would cause a crash on shadow check. Use memory hotplug notifier to allocate and map shadow memory when the hotplugged memory is going online and free shadow after the memory offlined. Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Potapenko <glider@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	arm64/kasan: don't allocate extra shadow memory	Andrey Ryabinin	1	-7/+1
	We used to read several bytes of the shadow memory in advance. Therefore additional shadow memory mapped to prevent crash if speculative load would happen near the end of the mapped shadow memory. Now we don't have such speculative loads, so we no longer need to map additional shadow memory. Link: http://lkml.kernel.org/r/20170601162338.23540-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	x86/kasan: don't allocate extra shadow memory	Andrey Ryabinin	1	-6/+1
	We used to read several bytes of the shadow memory in advance. Therefore additional shadow memory mapped to prevent crash if speculative load would happen near the end of the mapped shadow memory. Now we don't have such speculative loads, so we no longer need to map additional shadow memory. Link: http://lkml.kernel.org/r/20170601162338.23540-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Potapenko <glider@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/kasan: get rid of speculative shadow checks	Andrey Ryabinin	1	-82/+16
	For some unaligned memory accesses we have to check additional byte of the shadow memory. Currently we load that byte speculatively to have only single load + branch on the optimistic fast path. However, this approach has some downsides: - It's unaligned access, so this prevents porting KASAN on architectures which doesn't support unaligned accesses. - We have to map additional shadow page to prevent crash if speculative load happens near the end of the mapped memory. This would significantly complicate upcoming memory hotplug support. I wasn't able to notice any performance degradation with this patch. So these speculative loads is just a pain with no gain, let's remove them. Link: http://lkml.kernel.org/r/20170601162338.23540-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/kasan/kasan_init.c: use kasan_zero_pud for p4d table	Joonsoo Kim	1	-0/+12
	There is missing optimization in zero_p4d_populate() that can save some memory when mapping zero shadow. Implement it like as others. Link: http://lkml.kernel.org/r/1494829255-23946-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/zsmalloc: simplify zs_max_alloc_size handling	Jerome Marchand	1	-37/+15
	Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented zsmalloc to allocate an object of the maximal size (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly complicated. This patch replaces the dynamic calculation of zs_size_classes at init time by a compile time calculation that uses the DIV_ROUND_UP() macro already used in get_size_class_index(). [akpm@linux-foundation.org: use min_t] Link: http://lkml.kernel.org/r/20170630114859.1979-1-jmarchan@redhat.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Mahendran Ganesh <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	zram: constify attribute_group structures.	Arvind Yadav	1	-1/+1
	attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. File size before: text data bss dec hex filename 8293 841 4 9138 23b2 drivers/block/zram/zram_drv.o File size After adding 'const': text data bss dec hex filename 8357 777 4 9138 23b2 drivers/block/zram/zram_drv.o Link: http://lkml.kernel.org/r/65680c1c4d85818f7094cbfa31c91bf28185ba1b.1499061182.git.arvind.yadav.cs@gmail.com Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: disallow early_pfn_to_nid on configurations which do not implement it	Michal Hocko	1	-0/+1
	early_pfn_to_nid will return node 0 if both HAVE_ARCH_EARLY_PFN_TO_NID and HAVE_MEMBLOCK_NODE_MAP are disabled. It seems we are safe now because all architectures which support NUMA define one of them (with an exception of alpha which however has CONFIG_NUMA marked as broken) so this works as expected. It can get silently and subtly broken too easily, though. Make sure we fail the compilation if NUMA is enabled and there is no proper implementation for this function. If that ever happens we know that either the specific configuration is invalid and the fix should either disable NUMA or enable one of the above configs. Link: http://lkml.kernel.org/r/20170704075803.15979-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/memory-hotplug: switch locking to a percpu rwsem	Thomas Gleixner	2	-75/+16
	Andrey reported a potential deadlock with the memory hotplug lock and the cpu hotplug lock. The reason is that memory hotplug takes the memory hotplug lock and then calls stop_machine() which calls get_online_cpus(). That's the reverse lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c The problem has been there forever. The reason why this was never reported is that the cpu hotplug locking had this homebrewn recursive reader writer semaphore construct which due to the recursion evaded the full lock dep coverage. The memory hotplug code copied that construct verbatim and therefor has similar issues. Three steps to fix this: 1) Convert the memory hotplug locking to a per cpu rwsem so the potential issues get reported proper by lockdep. 2) Lock the online cpus in mem_hotplug_begin() before taking the memory hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc code to avoid recursive locking. 3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this by invoking lru_add_drain_all_cpuslocked() instead. Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: swap: provide lru_add_drain_all_cpuslocked()	Thomas Gleixner	2	-3/+9
	The rework of the cpu hotplug locking unearthed potential deadlocks with the memory hotplug locking code. The solution for these is to rework the memory hotplug locking code as well and take the cpu hotplug lock before the memory hotplug lock in mem_hotplug_begin(), but this will cause a recursive locking of the cpu hotplug lock when the memory hotplug code calls lru_add_drain_all(). Split out the inner workings of lru_add_drain_all() into lru_add_drain_all_cpuslocked() so this function can be invoked from the memory hotplug code with the cpu hotplug lock held. Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: use dedicated helper to access rlimit value	Krzysztof Opasiak	1	-3/+2
	Use rlimit() helper instead of manually writing whole chain from current task to rlim_cur. Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	fs/dcache.c: fix spin lockup issue on nlru->lock	Sahitya Tummala	1	-2/+3
	__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer duration if there are more number of items in the lru list. As per the current code, it can hold the spin lock for upto maximum UINT_MAX entries at a time. So if there are more number of items in the lru list, then "BUG: spinlock lockup suspected" is observed in the below path: spin_bug+0x90 do_raw_spin_lock+0xfc _raw_spin_lock+0x28 list_lru_add+0x28 dput+0x1c8 path_put+0x20 terminate_walk+0x3c path_lookupat+0x100 filename_lookup+0x6c user_path_at_empty+0x54 SyS_faccessat+0xd0 el0_svc_naked+0x24 This nlru->lock is acquired by another CPU in this path - d_lru_shrink_move+0x34 dentry_lru_isolate_shrink+0x48 __list_lru_walk_one.isra.10+0x94 list_lru_walk_node+0x40 shrink_dcache_sb+0x60 do_remount_sb+0xbc do_emergency_remount+0xb0 process_one_work+0x228 worker_thread+0x2e0 kthread+0xf4 ret_from_fork+0x10 Fix this lockup by reducing the number of entries to be shrinked from the lru list to 1024 at once. Also, add cond_resched() before processing the lru list again. Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/list_lru.c: fix list_lru_count_node() to be race free	Sahitya Tummala	2	-8/+7
	list_lru_count_node() iterates over all memcgs to get the total number of entries on the node but it can race with memcg_drain_all_list_lrus(), which migrates the entries from a dead cgroup to another. This can return incorrect number of entries from list_lru_count_node(). Fix this by keeping track of entries per node and simply return it in list_lru_count_node(). Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/mmap.c: expand_downwards: don't require the gap if !vm_prev	Oleg Nesterov	1	-7/+3
	expand_stack(vma) fails if address < stack_guard_gap even if there is no vma->vm_prev. I don't think this makes sense, and we didn't do this before the recent commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas"). We do not need a gap in this case, any address is fine as long as security_mmap_addr() doesn't object. This also simplifies the code, we know that address >= prev->vm_end and thus underflow is not possible. Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack	Michal Hocko	1	-2/+4
	Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has introduced a regression in some rust and Java environments which are trying to implement their own stack guard page. They are punching a new MAP_FIXED mapping inside the existing stack Vma. This will confuse expand_{downwards,upwards} into thinking that the stack expansion would in fact get us too close to an existing non-stack vma which is a correct behavior wrt safety. It is a real regression on the other hand. Let's work around the problem by considering PROT_NONE mapping as a part of the stack. This is a gros hack but overflowing to such a mapping would trap anyway an we only can hope that usespace knows what it is doing and handle it propely. Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/balloon_compaction.c: enqueue zero page to balloon device	zhenwei.pi	1	-1/+1
	presently pages in the balloon device have random value, and these pages will be scanned by ksmd on the host. They usually cannot be merged. Enqueue zero pages will resolve this problem. Link: http://lkml.kernel.org/r/1498698637-26389-1-git-send-email-zhenwei.pi@youruncloud.com Signed-off-by: zhenwei.pi <zhenwei.pi@youruncloud.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	cma: fix calculation of aligned offset	Doug Berger	1	-9/+6
	The align_offset parameter is used by bitmap_find_next_zero_area_off() to represent the offset of map's base from the previous alignment boundary; the function ensures that the returned index, plus the align_offset, honors the specified align_mask. The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") has the cma driver calculate the offset to the next alignment boundary. In most cases, the base alignment is greater than that specified when making allocations, resulting in a zero offset whether we align up or down. In the example given with the commit, the base alignment (8MB) was half the requested alignment (16MB) so the math also happened to work since the offset is 8MB in both directions. However, when requesting allocations with an alignment greater than twice that of the base, the returned index would not be correctly aligned. Also, the align_order arguments of cma_bitmap_aligned_mask() and cma_bitmap_aligned_offset() should not be negative so the argument type was made unsigned. Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com Signed-off-by: Angus Clark <angus@angusclark.org> Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Angus Clark <angus@angusclark.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Shiraz Hashim <shashim@codeaurora.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()	John Hubbard	1	-3/+0
	__remove_zone() sets up up zone_type, but never uses it for anything. This does not cause a warning, due to the (necessary) use of -Wno-unused-but-set-variable. However, it's noise, so just delete it. Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: document highmem_is_dirtyable sysctl	Michal Hocko	1	-0/+20
	It seems that there are still people using 32b kernels which a lot of memory and the IO tend to suck a lot for them by default. Mostly because writers are throttled too when the lowmem is used. We have highmem_is_dirtyable to work around that issue but it seems we never bothered to document it. Let's do it now, finally. Link: http://lkml.kernel.org/r/20170626093200.18958-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alkis Georgopoulos <alkisg@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	include/linux/backing-dev.h: simplify wb_stat_sum	Nikolay Borisov	1	-14/+1
	wb_stat_sum() disables interrupts and calls __wb_stat_sum() which eventually calls __percpu_counter_sum(). However, the percpu routine is already irq-safe. Simplify the code a bit by making wb_stat_sum() directly call percpu_counter_sum_positive() and not disable interrupts. Also remove the now-uneeded __wb_stat_sum() which was just a wrapper over percpu_counter_sum_positive(). Link: http://lkml.kernel.org/r/1498230681-29103-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	include/linux/mmzone.h: remove ancient/ambiguous comment	Nikolay Borisov	1	-5/+2
	Currently pg_data_t is just a struct which describes a NUMA node memory layout. Let's keep the comment simple and remove ambiguity. Link: http://lkml.kernel.org/r/1498220534-22717-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/swap_slots.c: don't disable preemption while taking the per-CPU cache	Sebastian Andrzej Siewior	1	-3/+2
	get_cpu_var() disables preemption and returns the per-CPU version of the variable. Disabling preemption is useful to ensure atomic access to the variable within the critical section. In this case however, after the per-CPU version of the variable is obtained the ->free_lock is acquired. For that reason it seems the raw accessor could be used. It only seems that ->slots_ret should be retested (because with disabled preemption this variable can not be set to NULL otherwise). This popped up during PREEMPT-RT testing because it tries to take spinlocks in a preempt disabled section. In RT, spinlocks can sleep. Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ying Huang <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback	Rasmus Villemoes	1	-4/+7
	Since current_order starts as MAX_ORDER-1 and is then only decremented, the second half of the loop condition seems superfluous. However, if order is 0, we may decrement current_order past 0, making it UINT_MAX. This is obviously too subtle ([1], [2]). Since we need to add some comment anyway, change the two variables to signed, making the counting-down for loop look more familiar, and apparently also making gcc generate slightly smaller code. [1] https://lkml.org/lkml/2016/6/20/493 [2] https://lkml.org/lkml/2017/6/19/345 [akpm@linux-foundation.org: fix up reject fixupping] Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: Hao Lee <haolee.swjtu@gmail.com> Acked-by: Wei Yang <weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	fs/proc/task_mmu.c: remove obsolete comment in show_map_vma()	Vasily Averin	1	-1/+0
	After commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") we do not hide stack guard page in /proc/<pid>/maps Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: drop useless local parameters of __register_one_node()	Dou Liyang	1	-7/+2
	__register_one_node() initializes local parameters "p_node" & "parent" for register_node(). But, register_node() does not use them. Remove the related code of "parent" node, cleanup __register_one_node() and register_node(). Link: http://lkml.kernel.org/r/1498013846-20149-1-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: avoid taking zone lock in pagetypeinfo_showmixed()	Vinayak Menon	2	-11/+19
	pagetypeinfo_showmixedcount_print is found to take a lot of time to complete and it does this holding the zone lock and disabling interrupts. In some cases it is found to take more than a second (On a 2.4GHz,8Gb RAM,arm64 cpu). Avoid taking the zone lock similar to what is done by read_page_owner, which means possibility of inaccurate results. Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration	Michal Hocko	1	-9/+1
	new_page is yet another duplication of the migration callback which has to handle hugetlb migration specially. We can safely use the generic new_page_nodemask for the same purpose. Please note that gigantic hugetlb pages do not need any special handling because alloc_huge_page_nodemask will make sure to check pages in all per node pools. The reason this was done previously was that alloc_huge_page_node treated NO_NUMA_NODE and a specific node differently and so alloc_huge_page_node(nid) would check on this specific node. Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	hugetlb: add support for preferred node to alloc_huge_page_nodemask	Michal Hocko	3	-47/+48
	alloc_huge_page_nodemask tries to allocate from any numa node in the allowed node mask starting from lower numa nodes. This might lead to filling up those low NUMA nodes while others are not used. We can reduce this risk by introducing a concept of the preferred node similar to what we have in the regular page allocator. We will start allocating from the preferred nid and then iterate over all allowed nodes in the zonelist order until we try them all. This is mimicing the page allocator logic except it operates on per-node mempools. dequeue_huge_page_vma already does this so distill the zonelist logic into a more generic dequeue_huge_page_nodemask and use it in alloc_huge_page_nodemask. This will allow us to use proper per numa distance fallback also for alloc_huge_page_node which can use alloc_huge_page_nodemask now and we can get rid of alloc_huge_page_node helper which doesn't have any user anymore. Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, hugetlb: unclutter hugetlb allocation layers	Michal Hocko	2	-105/+30
	Patch series "mm, hugetlb: allow proper node fallback dequeue". While working on a hugetlb migration issue addressed in a separate patchset[1] I have noticed that the hugetlb allocations from the preallocated pool are quite subotimal. [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org There is no fallback mechanism implemented and no notion of preferred node. I have tried to work around it but Vlastimil was right to push back for a more robust solution. It seems that such a solution is to reuse zonelist approach we use for the page alloctor. This series has 3 patches. The first one tries to make hugetlb allocation layers more clear. The second one implements the zonelist hugetlb pool allocation and introduces a preferred node semantic which is used by the migration callbacks. The last patch is a clean up. This patch (of 3): Hugetlb allocation path for fresh huge pages is unnecessarily complex and it mixes different interfaces between layers. __alloc_buddy_huge_page is the central place to perform a new allocation. It checks for the hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is all good except that __alloc_buddy_huge_page pushes vma and address down the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with two different allocation modes - one for memory policy and other node specific (or to make it more obscure node non-specific) requests. This just screams for a reorganization. This patch pulls out all the vma specific handling up to __alloc_buddy_huge_page_with_mpol where it belongs. __alloc_buddy_huge_page will get nodemask argument and __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the page allocator. In short: __alloc_buddy_huge_page_with_mpol - memory policy handling __alloc_buddy_huge_page - overcommit handling and accounting __hugetlb_alloc_buddy_huge_page - page allocator layer Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop is not really needed because the page allocator already handles the cpusets update. Finally __hugetlb_alloc_buddy_huge_page had a special case for node specific allocations (when no policy is applied and there is a node given). This has relied on __GFP_THISNODE to not fallback to a different node. alloc_huge_page_node is the only caller which relies on this behavior so move the __GFP_THISNODE there. Not only does this remove quite some code it also should make those layers easier to follow and clear wrt responsibilities. Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/oom_kill.c: add tracepoints for oom reaper-related events	Roman Gushchin	2	-0/+87
	During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo "oom:mark_victim" > set_event $ echo "oom:wake_reaper" >> set_event $ echo "oom:skip_task_reaping" >> set_event $ echo "oom:start_task_reaping" >> set_event $ echo "oom:finish_task_reaping" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	userfaultfd: non-cooperative: add madvise() event for MADV_FREE request	Mike Rapoport	1	-20/+22
	MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd monitor. The monitor has to stop handling #PF events in the range being freed. We are reusing userfaultfd_remove callback along with the logic required to re-get and re-validate the VMA which may change or disappear because userfaultfd_remove releases mmap_sem. Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/truncate.c: fix THP handling in invalidate_mapping_pages()	Jan Kara	1	-2/+8
	The condition checking for THP straddling end of invalidated range is wrong - it checks 'index' against 'end' but 'index' has been already advanced to point to the end of THP and thus the condition can never be true. As a result THP straddling 'end' has been fully invalidated. Given the nature of invalidate_mapping_pages(), this could be only performance issue. In fact, we are lucky the condition is wrong because if it was ever true, we'd leave locked page behind. Fix the condition checking for THP straddling 'end' and also properly unlock the page. Also update the comment before the condition to explain why we decide not to invalidate the page as it was not clear to me and I had to ask Kirill. Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/hugetlb.c: replace memfmt with string_get_size	Matthew Wilcox	1	-14/+5
	The hugetlb code has its own function to report human-readable sizes. Convert it to use the shared string_get_size() function. This will lead to a minor difference in user visible output (MiB/GiB instead of MB/GB), but some would argue that's desirable anyway. Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit()	Michal Hocko	1	-1/+1
	Alice has reported the following UBSAN splat: UBSAN: Undefined behaviour in mm/memcontrol.c:661:17 signed integer overflow: -2147483644 - 2147483525 cannot be represented in type 'long int' CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4 Hardware name: XXXXXX, BIOS YYYYYY Call Trace: dump_stack+0x59/0x87 ubsan_epilogue+0xe/0x40 handle_overflow+0xbb/0xf0 __ubsan_handle_sub_overflow+0x12/0x20 memcg_check_events.isra.36+0x223/0x360 mem_cgroup_commit_charge+0x55/0x140 wp_page_copy+0x34e/0xb80 do_wp_page+0x1e6/0x1300 handle_mm_fault+0x88b/0x1990 __do_page_fault+0x2de/0x8a0 do_page_fault+0x1a/0x20 error_code+0x67/0x6c The reason is that we subtract two signed types. Let's fix this by truly mimicing time_after and cast the result of the subtraction. Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Alice Ferrazzi <alicef@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, hugetlb: schedule when potentially allocating many hugepages	David Rientjes	1	-0/+2
	A few hugetlb allocators loop while calling the page allocator and can potentially prevent rescheduling if the page allocator slowpath is not utilized. Conditionally schedule when large numbers of hugepages can be allocated. Anshuman: "Fixes a task which was getting hung while writing like 10000 hugepages (16MB on POWER8) into /proc/sys/vm/nr_hugepages." Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: unify new_node_page and alloc_migrate_target	Michal Hocko	3	-26/+19
	Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") has duplicated a large part of alloc_migrate_target with some hotplug specific special casing. To be more precise it tried to enfore the allocation from a different node than the original page. As a result the two function diverged in their shared logic, e.g. the hugetlb allocation strategy. Let's unify the two and express different NUMA requirements by the given nodemask. new_node_page will simply exclude the node it doesn't care about and alloc_migrate_target will use all the available nodes. alloc_migrate_target will then learn to migrate hugetlb pages more sanely and use preallocated pool when possible. Please note that alloc_migrate_target used to call alloc_page resp. alloc_pages_current so the memory policy of the current context which is quite strange when we consider that it is used in the context of alloc_contig_range which just tries to migrate pages which stand in the way. Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	hugetlb, memory_hotplug: prefer to use reserved pages for migration	Michal Hocko	3	-7/+31
	new_node_page will try to use the origin's next NUMA node as the migration destination for hugetlb pages. If such a node doesn't have any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to allocate a surplus page instead. This is quite subotpimal for any configuration when hugetlb pages are no distributed to all NUMA nodes evenly. Say we have a hotplugable node 4 and spare hugetlb pages are node 0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000 /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0 Now we consume the whole pool on node 4 and try to offline this node. All the allocated pages should be moved to node0 which has enough preallocated pages to hold them. With the current implementation offlining very likely fails because hugetlb allocations during runtime are much less reliable. Fix this by reusing the nodemask which excludes migration source and try to find a first node which has a page in the preallocated pool first and fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is consumed. [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub] Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, memory_hotplug: simplify empty node mask handling in new_node_page	Michal Hocko	1	-9/+10
	new_node_page tries to allocate the target page on a different NUMA node than the source page. This makes sense in most cases during the hotplug because we are likely to offline the whole numa node. But there are cases where there are no other nodes to fallback (e.g. when offlining parts of the only existing node) and we have to fallback to allocating from the source node. The current code does that but it can be simplified by checking the nmask and updating it before we even try to allocate rather than special casing it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, memory_hotplug: support movable_node for hotpluggable nodes	Michal Hocko	2	-6/+25
	movable_node kernel parameter allows making hotpluggable NUMA nodes to put all the hotplugable memory into movable zone which allows more or less reliable memory hotremove. At least this is the case for the NUMA nodes present during the boot (see find_zone_movable_pfns_for_nodes). This is not the case for the memory hotplug, though. echo online > /sys/devices/system/memory/memoryXYZ/state will default to a kernel zone (usually ZONE_NORMAL) unless the particular memblock is already in the movable zone range which is not the case normally when onlining the memory from the udev rule context for a freshly hotadded NUMA node. The only option currently is to have a special udev rule to echo online_movable to all memblocks belonging to such a node which is rather clumsy. Not to mention this is inconsistent as well because what ended up in the movable zone during the boot will end up in a kernel zone after hotremove & hotadd without special care. It would be nice to reuse memblock_is_hotpluggable but the runtime hotplug doesn't have that information available because the boot and hotplug paths are not shared and it would be really non trivial to make them use the same code path because the runtime hotplug doesn't play with the memblock allocator at all. Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if movable_node is enabled and the range doesn't overlap with the existing normal zone. This should provide a reasonable default onlining strategy. Strictly speaking the semantic is not identical with the boot time initialization because find_zone_movable_pfns_for_nodes covers only the hotplugable range as described by the BIOS/FW. From my experience this is usually a full node though (except for Node0 which is special and never goes away completely). If this turns out to be a problem in the real life we can tweak the code to store hotplug flag into memblocks but let's keep this simple now. Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	zram: use __sysfs_match_string() helper	Andy Shevchenko	1	-6/+4
	Use __sysfs_match_string() helper instead of open coded variant. Link: http://lkml.kernel.org/r/20170609120835.22156-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/migrate.c: stabilise page count when migrating transparent hugepages	Will Deacon	1	-13/+2
	When migrating a transparent hugepage, migrate_misplaced_transhuge_page guards itself against a concurrent fastgup of the page by checking that the page count is equal to 2 before and after installing the new pmd. If the page count changes, then the pmd is reverted back to the original entry, however there is a small window where the new (possibly writable) pmd is installed and the underlying page could be written by userspace. Restoring the old pmd could therefore result in loss of data. This patch fixes the problem by freezing the page count whilst updating the page tables, which protects against a concurrent fastgup without the need to restore the old pmd in the failure case (since the page count can no longer change under our feet). Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	include/linux/page_ref.h: ensure page_ref_unfreeze is ordered against prior ↵	Will Deacon	1	-0/+1
	accesses page_ref_freeze and page_ref_unfreeze are designed to be used as a pair, wrapping a critical section where struct pages can be modified without having to worry about consistency for a concurrent fast-GUP. Whilst page_ref_freeze has full barrier semantics due to its use of atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which doesn't provide any barrier semantics and allows the operation to be reordered with respect to page modifications in the critical section. This patch ensures that page_ref_unfreeze is ordered after any critical section updates, by invoking smp_mb() prior to the atomic_set. Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: always enable thp for dax mappings	Dan Williams	3	-5/+11
	The madvise policy for transparent huge pages is meant to avoid unwanted allocations of transparent huge pages. It allows a policy of disabling the extra memory pressure and effort to arrange for a huge page when it is not needed. DAX by definition never incurs this overhead since it is statically allocated. The policy choice makes even less sense for device-dax which tries to guarantee a given tlb-fault size. Specifically, the following setting: echo never > /sys/kernel/mm/transparent_hugepage/enabled ...violates that guarantee and silently disables all device-dax instances with a 2M or 1G alignment. So, let's avoid that non-obvious side effect by force enabling thp for dax mappings in all cases. It is worth noting that the reason this uses vma_is_dax(), and the resulting header include changes, is that previous attempts to add a VM_DAX flag were NAKd. Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: improve readability of transparent_hugepage_enabled()	Dan Williams	1	-12/+29
	Turn the macro into a static inline and rewrite the condition checks for better readability in preparation for adding another condition. [ross.zwisler@linux.intel.com: fix logic to make conversion equivalent] [akpm@linux-foundation.org: resolve vs mm-make-pr_set_thp_disable-immediately-active.patch] [akpm@linux-foundation.org: include coredump.h for MMF_DISABLE_THP] Link: http://lkml.kernel.org/r/149739530612.20686.14760671150202647861.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACK	Steven Rostedt (VMware)	1	-1/+1
	After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have been evaluated: # cat /sys/kernel/debug/tracing/enum_map (which will soon be renamed to eval_map) And it showed some interesting results: [..] ZONE_MOVABLE 3 (oom) ZONE_NORMAL 2 (oom) ZONE_DMA32 1 (oom) ZONE_DMA 0 (oom) 3 3 (oom) 2 2 (oom) 1 1 (oom) COMPACT_PRIO_ASYNC 2 (oom) COMPACT_PRIO_SYNC_LIGHT 1 (oom) COMPACT_PRIO_SYNC_FULL 0 (oom) [..] ZONE_DMA 0 (vmscan) 3 3 (vmscan) 2 2 (vmscan) 1 1 (vmscan) COMPACT_PRIO_ASYNC 2 (vmscan) [..] ZONE_DMA 0 (kmem) 3 3 (kmem) 2 2 (kmem) 1 1 (kmem) COMPACT_PRIO_ASYNC 2 (kmem) [..] ZONE_DMA 0 (compaction) 3 3 (compaction) 2 2 (compaction) 1 1 (compaction) COMPACT_PRIO_ASYNC 2 (compaction) [..] The name within the parenthesis are the trace systems that the enum/eval maps are associated with. When there's a number evaluated to another number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define and not an enum. As #defines get converted normally, they are not needed to be evaluated. Each of the above trace systems with the number to number evaluation included the file include/trace/events/mmflags.h which has: /* High-level compaction status feedback */ #define COMPACTION_FAILED 1 #define COMPACTION_WITHDRAWN 2 #define COMPACTION_PROGRESS 3 [..] #define COMPACTION_FEEDBACK \ EM(COMPACTION_FAILED, "failed") \ EM(COMPACTION_WITHDRAWN, "withdrawn") \ EMe(COMPACTION_PROGRESS, "progress") Which is still needed for the __print_symbolic() usage in the trace_event. But it is not needed to be evaluated. Removing the evaluation part removes the unnecessary evaluations of numbers to numbers. Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/hugetlb.c: warn the user when issues arise on boot due to hugepages	Liam R. Howlett	1	-12/+24
	When the user specifies too many hugepages or an invalid default_hugepagesz the communication to the user is implicit in the allocation message. This patch adds a warning when the desired page count is not allocated and prints an error when the default_hugepagesz is invalid on boot. During boot hugepages will allocate until there is a fraction of the hugepage size left. That is, we allocate until either the request is satisfied or memory for the pages is exhausted. When memory for the pages is exhausted, it will most likely lead to the system failing with the OOM manager not finding enough (or anything) to kill (unless you're using really big hugepages in the order of 100s of MB or in the GBs). The user will most likely see the OOM messages much later in the boot sequence than the implicitly stated message. Worse yet, you may even get an OOM for each processor which causes many pages of OOMs on modern systems. Although these messages will be printed earlier than the OOM messages, at least giving the user errors and warnings will highlight the configuration as an issue. I'm trying to point the user in the right direction by providing a more robust statement of what is failing. During the sysctl or echo command, the user can check the results much easier than if the system hangs during boot and the scenario of having nothing to OOM for kernel memory is highly unlikely. Mike said: "Before sending out this patch, I asked Liam off list why he was doing it. Was it something he just thought would be useful? Or, was there some type of user situation/need. He said that he had been called in to assist on several occasions when a system OOMed during boot. In almost all of these situations, the user had grossly misconfigured huge pages. DB users want to pre-allocate just the right amount of huge pages, but sometimes they can be really off. In such situations, the huge page init code just allocates as many huge pages as it can and reports the number allocated. There is no indication that it quit allocating because it ran out of memory. Of course, a user could compare the number in the message to what they requested on the command line to determine if they got all the huge pages they requested. The thought was that it would be useful to at least flag this situation. That way, the user might be able to better relate the huge page allocation failure to the OOM. I'm not sure if the e-mail discussion made it obvious that this is something he has seen on several occasions. I see Michal's point that this will only flag the situation where someone configures huge pages very badly. And, a more extensive look at the situation of misconfiguring huge pages might be in order. But, this has happened on several occasions which led to the creation of this patch" [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration] Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/cma.c: warn if the CMA area could not be activated	Anshuman Khandual	1	-2/+3
	While activating a CMA area we check to make sure that all the PFNs in the range are inside the same zone. This is a requirement for alloc_contig_range() to work. Any CMA area failing the check is disabled for good. This happens silently right now making all future cma_alloc() allocations failure inevitable. Here we add an error message stating that the CMA area could not be activated which makes it easier to explain any future cma_alloc() failures on it. While in there, change the bail out goto label from 'err' to 'not_in_zone' which makes more sense. Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	vmalloc: show lazy-purged vma info in vmallocinfo	Yisheng Xie	1	-1/+9
	When ioremap a 67112960 bytes vm_area with the vmallocinfo: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap we get the result: 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER': if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); So it makes idiot like me a litte puzzled why this was a jump the vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving 0xed000000-0xf1000000 as a big hole. This patch is to show all of vm_area, including vmas which are freeing but still in the vmap_area_list, to make it more clear about why we will get 0xf1000000-0xf5001000 in the above case. And we will get a vmallocinfo like: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap [..] 0xece7c000-0xece7e000 8192 unpurged vm_area 0xece7e000-0xece83000 20480 vm_map_ram 0xf0099000-0xf00aa000 69632 vm_map_ram after this patch. Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/memcontrol: exclude @root from checks in mem_cgroup_low	Sean Christopherson	1	-18/+32
	Make @root exclusive in mem_cgroup_low; it is never considered low when looked at directly and is not checked when traversing the tree. In effect, @root is handled identically to how root_mem_cgroup was previously handled by mem_cgroup_low. If @root is not excluded from the checks, a cgroup underneath @root will never be considered low during targeted reclaim of @root, e.g. due to memory.current > memory.high, unless @root is misconfigured to have memory.low > memory.high. Excluding @root enables using memory.low to prioritize memory usage between cgroups within a subtree of the hierarchy that is limited by memory.high or memory.max, e.g. when ROOT owns @root's controls but delegates the @root directory to a USER so that USER can create and administer children of @root. For example, given cgroup A with children B and C: A / \ B C and 1. A/memory.current > A/memory.high 2. A/B/memory.current < A/B/memory.low 3. A/C/memory.current >= A/C/memory.low As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we should reclaim from 'C' until 'A' is no longer high or until we can no longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low and we will reclaim indiscriminately from both 'B' and 'C'. Here is the test I used to confirm the bug and the patch. 20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test #!/bin/bash x62mb=$((62<<20)) x66mb=$((66<<20)) x94mb=$((94<<20)) x98mb=$((98<<20)) setup() { set -e if [[ -n $DEBUG ]]; then set -x fi trap teardown EXIT HUP INT TERM if [[ ! -e /mnt/1gb.swap ]]; then sudo fallocate -l 1G /mnt/1gb.swap > /dev/null sudo mkswap /mnt/1gb.swap > /dev/null fi if ! swapon --show=NAME \| grep -q "/mnt/1gb.swap"; then sudo swapon /mnt/1gb.swap fi if [[ ! -e /cgroup/cgroup.controllers ]]; then sudo mount -t cgroup2 none /cgroup fi grep -q memory /cgroup/cgroup.controllers sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control" sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control" sudo sh -c "echo '96m' > /cgroup/A/memory.high" mkdir /cgroup/A/0 mkdir /cgroup/A/1 echo 64m > /cgroup/A/0/memory.low } teardown() { set +e trap - EXIT HUP INT TERM if [[ -z $1 ]]; then printf "\n" printf "%0.s" {1..35} printf "\nFAILED!\n\n" tail /cgroup/A//memory.current printf "%0.s" {1..35} printf "\n\n" fi ps \| grep stress \| tr -s ' ' \| cut -f 2 -d ' ' \| xargs -I % kill % sleep 2 if [[ -e /cgroup/A/0 ]]; then rmdir /cgroup/A/0 fi if [[ -e /cgroup/A/1 ]]; then rmdir /cgroup/A/1 fi if [[ -e /cgroup/A ]]; then sudo rmdir /cgroup/A fi } stress_test() { sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs" stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null & sudo sh -c "echo $$ > /cgroup/cgroup.procs" sleep 1 # A/0 should be consuming more memory than A/1 [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]] # A/0 should be consuming ~64mb [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]] # A should cumulatively be consuming ~96mb [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]] # Stop the stressors ps \| grep stress \| tr -s ' ' \| cut -f 2 -d ' ' \| xargs -I % kill % } teardown 1 setup for ((i=1;i<=$1;i++)); do printf "ITERATION $i of $1 - stress_test 0 1" stress_test 0 1 printf "\x1b[2K\r" printf "ITERATION $i of $1 - stress_test 1 0" stress_test 1 0 printf "\x1b[2K\r" printf "ITERATION $i of $1 - PASSED\n" done teardown 1 echo PASSED! 20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10 Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: make PR_SET_THP_DISABLE immediately active	Michal Hocko	6	-9/+17
	PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any existing mapping because it only updated mm->def_flags which is a template for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE flag set. This can be quite surprising for all those applications which do not do prctl(); fork() & exec() and want to control their own THP behavior. Another usecase when the immediate semantic of the prctl might be useful is a combination of pre- and post-copy migration of containers with CRIU. In this case CRIU populates a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and CRIU expects to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the expected page faults do not occur. In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a temporary mechanism for enabling/disabling THP process wide. Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is tested when decision whether to use huge pages is taken either during page fault of at the time of THP collapse. It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously. Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE") Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, vmpressure: pass-through notification support	David Rientjes	2	-41/+128
	By default, vmpressure events are not pass-through, i.e. they propagate up through the memcg hierarchy until an event notifier is found for any threshold level. This presents a difficulty when a thread waiting on a read(2) for a vmpressure event cannot distinguish between local memory pressure and memory pressure in a descendant memcg, especially when that thread may not control the memcg hierarchy. Consider a user-controlled child memcg with a smaller limit than a top-level memcg controlled by the "Activity Manager" specified in Documentation/cgroup-v1/memory.txt. It may register for memory pressure notification for descendant memcgs to make a policy decision: oom kill a low priority job, increase the limit, decrease other limits, etc. If it registers for memory pressure notification on the top-level memcg, it currently cannot distinguish between memory pressure in its own memcg or a descendant memcg, which is user-controlled. Conversely, if a user registers for memory pressure notification on their own descendant memcg, the Activity Manager does not receive any pressure notification for that child memcg hierarchy. Vmpressure events are not received for ancestor memcgs if the memcg experiencing pressure have notifiers registered, perhaps outside the knowledge of the thread waiting on read(2) at the top level. Both of these are consequences of vmpressure notification not being pass-through. This implements a pass-through behavior for vmpressure events. When writing to control.event_control, vmpressure event handlers may optionally specify a mode. There are two new modes: - "hierarchy": always propagate memory pressure events up the hierarchy regardless if descendant memcgs have their own notifiers registered, and - "local": only receive notifications when the memcg for which the event is registered experiences memory pressure. Of course, processes may register for one notification of "low,local", for example, and another for "low". If no mode is specified, the current behavior is maintained for backwards compatibility. See the change to Documentation/cgroup-v1/memory.txt for full specification. [dan.carpenter@oracle.com: free the same pointer we allocated] Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Anton Vorontsov <anton@enomsg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hwpoison: introduce idenfity_page_state	Naoya Horiguchi	1	-32/+25
	Factoring duplicate code into a function. Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hugetlb: delete dequeue_hwpoisoned_huge_page()	Naoya Horiguchi	3	-50/+0
	dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it. Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error	Naoya Horiguchi	2	-40/+64
	Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to keep the error hugepage away from the system, which is OK but not good enough because the hugepage still has a refcount and unpoison doesn't work on the error hugepage (PageHWPoison flags are cleared but pages are still leaked.) And there's "wasting health subpages" issue too. This patch reworks on me_huge_page() to solve these issues. For hugetlb file, recently we have truncating code so let's use it in hugetlbfs specific ->error_remove_page(). For anonymous hugepage, it's helpful to dissolve the error page after freeing it into free hugepage list. Migration entry and PageHWPoison in the head page prevent the access to it. TODO: dissolve_free_huge_page() can fail but we don't considered it yet. It's not critical (and at least no worse that now) because in such case the error hugepage just stays in free hugepage list without being dissolved. By virtue of PageHWPoison in head page, it's never allocated to processes. [akpm@linux-foundation.org: fix unused var warnings] Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hwpoison: introduce memory_failure_hugetlb()	Naoya Horiguchi	1	-52/+82
	memory_failure() is a big function and hard to maintain. Handling hugetlb- and non-hugetlb- case in a single function is not good, so this patch separates PageHuge() branch into a new function, which saves many PageHuge() check. Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: soft-offline: dissolve free hugepage if soft-offlined	Naoya Horiguchi	1	-1/+1
	Now we have code to rescue most of healthy pages from a hwpoisoned hugepage. So let's apply it to soft_offline_free_page too. Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hugetlb: soft-offline: dissolve source hugepage after successful migration	Anshuman Khandual	4	-9/+39
	Currently hugepage migrated by soft-offline (i.e. due to correctable memory errors) is contained as a hugepage, which means many non-error pages in it are unreusable, i.e. wasted. This patch solves this issue by dissolving source hugepages into buddy. As done in previous patch, PageHWPoison is set only on a head page of the error hugepage. Then in dissoliving we move the PageHWPoison flag to the raw error page so that all healthy subpages return back to buddy. [arnd@arndb.de: fix warnings: replace some macros with inline functions] Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hwpoison: change PageHWPoison behavior on hugetlb pages	Naoya Horiguchi	2	-72/+24
	We'd like to narrow down the error region in memory error on hugetlb pages. However, currently we set PageHWPoison flags on all subpages in the error hugepage and add # of subpages to num_hwpoison_pages, which doesn't fit our purpose. So this patch changes the behavior and we only set PageHWPoison on the head page then increase num_hwpoison_pages only by 1. This is a preparation for narrow-down part which comes in later patches. Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache()	Naoya Horiguchi	1	-3/+5
	We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page now, but it's not enough because later code doesn't handle hugetlb properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the end of this function fires for hugetlb, which makes no sense. So we should return immediately for hugetlb pages. Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm: hugetlb: prevent reuse of hwpoisoned free hugepages	Naoya Horiguchi	2	-3/+1
	Patch series "mm: hwpoison: fixlet for hugetlb migration". This patchset updates the hwpoison/hugetlb code to address 2 reported issues. One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First half was already fixed in mainline, and another half about hugetlb cases are solved in this series. Another issue is "narrow-down error affected region into a single 4kB page instead of a whole hugetlb page" issue, which was tried by Anshuman (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com) and I updated it to apply it more widely. This patch (of 9): We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages as we did before. So current dequeue_huge_page_node() doesn't work as intended because it still uses is_migrate_isolate_page() for this check. This patch fixes it with PageHWPoison flag. Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	fs/buffer.c: make bh_lru_install() more efficient	Eric Biggers	1	-28/+15
	To install a buffer_head into the cpu's LRU queue, bh_lru_install() would construct a new copy of the queue and then memcpy it over the real queue. But it's easily possible to do the update in-place, which is faster and simpler. Some work can also be skipped if the buffer_head was already in the queue. As a microbenchmark I timed how long it takes to run sb_getblk() 10,000,000 times alternating between BH_LRU_SIZE + 1 blocks. Effectively, this benchmarks looking up buffer_heads that are in the page cache but not in the LRU: Before this patch: 1.758s After this patch: 1.653s This patch also removes about 350 bytes of compiled code (on x86_64), partly due to removal of the memcpy() which was being inlined+unrolled. Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/zsmalloc.c: fix -Wunneeded-internal-declaration warning	Nick Desaulniers	1	-1/+1
	is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is only compiled in as a runtime check when CONFIG_DEBUG_VM is set, otherwise is checked at compile time and not actually compiled in. Fixes the following warning, found with Clang: mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration] static int is_first_page(struct page *page) ^ Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference	Gustavo A. R. Silva	1	-1/+1
	The NULL check at line 1226: if (!pgdat), implies that pointer pgdat might be NULL. rollback_node_hotadd() dereferences this pointer. Add NULL check to avoid a potential NULL pointer dereference. Addresses-Coverity-ID: 1369133 Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, vmscan: avoid thrashing anon lru when free + file is low	David Rientjes	1	-2/+11
	The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do not swap anon pages just because free+file is low'") reintroduces is to prefer swapping anonymous memory rather than trashing the file lru. If the anonymous inactive lru for the set of eligible zones is considered low, however, or the length of the list for the given reclaim priority does not allow for effective anonymous-only reclaiming, then avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the small list and leave unreclaimed memory on the file lrus. If the inactive list is insufficient, fallback to balanced reclaim so the file lru doesn't remain untouched. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTE	Yevgen Pronenko	1	-2/+2
	The preferred strategy to define debugfs attributes is to use the DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe(). Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	mm, page_alloc: fallback to smallest page when not stealing whole pageblock	Vlastimil Babka	1	-9/+44
	Since commit 3bc48f96cf11 ("mm, page_alloc: split smallest stolen page in fallback") we pick the smallest (but sufficient) page of all that have been stolen from a pageblock of different migratetype. However, there are cases when we decide not to steal the whole pageblock. Practically in the current implementation it means that we are trying to fallback for a MIGRATE_MOVABLE allocation of order X, go through the freelists from MAX_ORDER-1 down to X, and find free page of order Y. If Y is less than pageblock_order / 2, we decide not to steal all pages from the pageblock. When Y > X, it means we are potentially splitting a larger page than we need, as there might be other pages of order Z, where X <= Z < Y. Since Y is already too small to steal whole pageblock, picking smallest available Z will result in the same decision and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE pageblock. This patch therefore changes the fallback algorithm so that in the situation described above, we switch the fallback search strategy to go from order X upwards to find the smallest suitable fallback. In theory there shouldn't be a downside of this change wrt fragmentation. This has been tested with mmtests' stress-highalloc performing GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint statistics: 4.12.0-rc2 4.12.0-rc2 1-kernel4 2-kernel4 Page alloc extfrag event 25640976 69680977 Extfrag fragmenting 25621086 69661364 Extfrag fragmenting for unmovable 74409 73204 Extfrag fragmenting unmovable placed with movable 69003 67684 Extfrag fragmenting unmovable placed with reclaim. 5406 5520 Extfrag fragmenting for reclaimable 6398 8467 Extfrag fragmenting reclaimable placed with movable 869 884 Extfrag fragmenting reclaimable placed with unmov. 5529 7583 Extfrag fragmenting for movable 25540279 69579693 Since we force movable allocations to steal the smallest available page (which we then practially always split), we steal less per fallback, so the number of fallbacks increases and steals potentially happen from different pageblocks. This is however not an issue for movable pages that can be compacted. Importantly, the "unmovable placed with movable" statistics is lower, which is the result of less fragmentation in the unmovable pageblocks. The effect on reclaimable allocation is a bit unclear. Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	swap: add block io poll in swapin path	Shaohua Li	5	-11/+33
	For fast flash disk, async IO could introduce overhead because of context switch. block-mq now supports IO poll, which improves performance and latency a lot. swapin is a good place to use this technique, because the task is waiting for the swapin page to continue execution. In my virtual machine, directly read 4k data from a NVMe with iopoll is about 60% better than that without poll. With iopoll support in swapin patch, my microbenchmark (a task does random memory write) is about 10%~25% faster. CPU utilization increases a lot though, 2x and even 3x CPU utilization. This will depend on disk speed. While iopoll in swapin isn't intended for all usage cases, it's a win for latency sensistive workloads with high speed swap disk. block layer has knob to control poll in runtime. If poll isn't enabled in block layer, there should be no noticeable change in swapin. I got a chance to run the same test in a NVMe with DRAM as the media. In simple fio IO test, blkpoll boosts 50% performance in single thread test and ~20% in 8 threads test. So this is the base line. In above swap test, blkpoll boosts ~27% performance in single thread test. blkpoll uses 2x CPU time though. If we enable hybid polling, the performance gain has very slight drop but CPU time is only 50% worse than that without blkpoll. Also we can adjust parameter of hybid poll, with it, the CPU time penality is reduced further. In 8 threads test, blkpoll doesn't help though. The performance is similar to that without blkpoll, but cpu utilization is similar too. There is lock contention in swap path. The cpu time spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse than that without it. The swapin readahead might read several pages in in the same time and form a big IO request. Since the IO will take longer time, it doesn't make sense to do poll, so the patch only does iopoll for single page swapin. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c Signed-off-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	Merge tag 'devprop-4.13-rc1' of ↵	Linus Torvalds	10	-207/+502
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull device properties framework updates from Rafael Wysocki: "These mostly rearrange the device properties core code and add a few helper functions to it as a foundation for future work. Specifics: - Rearrange the core device properties code by moving the code specific to each supported platform configuration framework (ACPI, DT and build-in) into a separate file (Sakari Ailus). - Add helper functions for accessing device properties in a firmware-agnostic way (Sakari Ailus, Kieran Bingham)" * tag 'devprop-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: device property: Add fwnode_graph_get_port_parent device property: Add FW type agnostic fwnode_graph_get_remote_node device property: Introduce fwnode_device_is_available() device property: Move fwnode graph ops to firmware specific locations device property: Move FW type specific functionality to FW specific files ACPI: Constify argument to acpi_device_is_present()
2017-07-10	Merge tag 'acpi-extra-4.13-rc1' of ↵	Linus Torvalds	9	-18/+74
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI updates from Rafael Wysocki: "These fix the ACPI SPCR table handling and add a workaround for APM X-Gene 8250 UART on top of that, fix two ACPI hotplug issues related to hot-remove failures, add a missing "static" to one function and constify some attribute_group structures. Specifics: - Fix the ACPI code handling the SPCR table to check access width of MMIO regions and add a workaround for APM X-Gene 8250 UART to use 32-bit MMIO accesses with its register (Loc Ho). - Fix two ACPI-based hotplug issues related to the handling of hot-remove failures on the OS side (Chun-Yi Lee). - Constify attribute_group structures in a few places (Arvind Yadav). - Make one local function static (Colin Ian King)" * tag 'acpi-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / DPTF: constify attribute_group structures ACPI / LPSS: constify attribute_group structures ACPI: BGRT: constify attribute_group structures ACPI / power: constify attribute_group structures ACPI / scan: Indicate to platform when hot remove returns busy ACPI / bus: handle ACPI hotplug schedule errors completely ACPI / osi: Make local function acpi_osi_dmi_linux() static ACPI: SPCR: Workaround for APM X-Gene 8250 UART 32-alignment errata ACPI: SPCR: Use access width to determine mmio usage
2017-07-10	Merge tag 'pm-extra-4.13-rc1' of ↵	Linus Torvalds	4	-5/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These revert one recent change in the generic power domains framework, fix a recently introduced build issue in there and constify attribute_group structures in some places. Specifics: - Revert a recent change in the generic power domains (genpd) framework that led to regressions and turned out the be misguided (Rafael Wysocki). - Fix a recently introduced build issue in the generic power domains (genpd) framework (Arnd Bergmann). - Constify attribute_group structures in the PM core, the cpufreq stats code and in intel_pstate (Arvind Yadav)" * tag 'pm-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: constify attribute_group structures cpufreq: cpufreq_stats: constify attribute_group structures PM / sleep: constify attribute_group structures PM / Domains: provide pm_genpd_poweroff_noirq() stub Revert "PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device"
2017-07-10	Merge tag 'for-f2fs-4.13' of ↵	Linus Torvalds	21	-664/+1559
	git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've added new features such as disk quota and statx, and modified internal bio management flow to merge more IOs depending on block types. We've also made internal threads freezeable for Android battery life. In addition to them, there are some patches to avoid lock contention as well as a couple of deadlock conditions. Enhancements: - support usrquota, grpquota, and statx - manage DATA/NODE typed bios separately to serialize more IOs - modify f2fs_lock_op/wio_mutex to avoid lock contention - prevent lock contention in migratepage Bug fixes: - fix missing load of written inode flag - fix worst case victim selection in GC - freezeable GC and discard threads for Android battery life - sanitize f2fs metadata to deal with security hole - clean up sysfs-related code and docs" * tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits) f2fs: support plain user/group quota f2fs: avoid deadlock caused by lock order of page and lock_op f2fs: use spin_{,un}lock_irq{save,restore} f2fs: relax migratepage for atomic written page f2fs: don't count inode block in in-memory inode.i_blocks Revert "f2fs: fix to clean previous mount option when remount_fs" f2fs: do not set LOST_PINO for renamed dir f2fs: do not set LOST_PINO for newly created dir f2fs: skip ->writepages for {mete,node}_inode during recovery f2fs: introduce __check_sit_bitmap f2fs: stop gc/discard thread in prior during umount f2fs: introduce reserved_blocks in sysfs f2fs: avoid redundant f2fs_flush after remount f2fs: report # of free inodes more precisely f2fs: add ioctl to do gc with target block address f2fs: don't need to check encrypted inode for partial truncation f2fs: measure inode.i_blocks as generic filesystem f2fs: set CP_TRIMMED_FLAG correctly f2fs: require key for truncate(2) of encrypted file f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c ...
2017-07-10	Merge branches 'acpi-spcr', 'acpi-osi', 'acpi-bus', 'acpi-scan' and 'acpi-misc'	Rafael J. Wysocki	9	-18/+74
	* acpi-spcr: ACPI: SPCR: Workaround for APM X-Gene 8250 UART 32-alignment errata ACPI: SPCR: Use access width to determine mmio usage * acpi-osi: ACPI / osi: Make local function acpi_osi_dmi_linux() static * acpi-bus: ACPI / bus: handle ACPI hotplug schedule errors completely * acpi-scan: ACPI / scan: Indicate to platform when hot remove returns busy * acpi-misc: ACPI / DPTF: constify attribute_group structures ACPI / LPSS: constify attribute_group structures ACPI: BGRT: constify attribute_group structures ACPI / power: constify attribute_group structures
2017-07-10	Merge branches 'pm-domains', 'pm-sleep' and 'pm-cpufreq'	Rafael J. Wysocki	4	-5/+6
	* pm-domains: PM / Domains: provide pm_genpd_poweroff_noirq() stub Revert "PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device" * pm-sleep: PM / sleep: constify attribute_group structures * pm-cpufreq: cpufreq: intel_pstate: constify attribute_group structures cpufreq: cpufreq_stats: constify attribute_group structures
2017-07-10	Fix up over-eager 'wait_queue_t' renaming	Linus Torvalds	5	-20/+20
	Commit ac6424b981bc ("sched/wait: Rename wait_queue_t => wait_queue_entry_t") had scripted the renaming incorrectly, and didn't actually check that the 'wait_queue_t' was a full token. As a result, it also triggered on 'wait_queue_token', and renamed that to 'wait_queue_entry_token' entry in the autofs4 packet structure definition too. That was entirely incorrect, and not intended. The end result built fine when building just the kernel - because everything had been renamed consistently there - but caused problems in user space because the "struct autofs_packet_missing" type is exported as part of the uapi. This scripts it all back again: git grep -lw wait_queue_entry_token \| xargs sed -i 's/wait_queue_entry_token/wait_queue_token/g' and checks the end result. Reported-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Ingo Molnar <mingo@kernel.org> Fixes: ac6424b981bc ("sched/wait: Rename wait_queue_t => wait_queue_entry_t") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10	Merge tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi	Linus Torvalds	8	-214/+533
	Pull IPMI updates from Corey Minyard: "Some small fixes for IPMI, and one medium sized changed. The medium sized change is adding a platform device for IPMI entries in the DMI table. Otherwise there is no auto loading for IPMI devices if they are only in the DMI table" * tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi: ipmi:ssif: Add missing unlock in error branch char: ipmi: constify bmc_dev_attr_group and bmc_device_type ipmi:ssif: Check dev before setting drvdata ipmi: Convert DMI handling over to a platform device ipmi: Create a platform device for a DMI-specified IPMI interface ipmi: use rcu lock around call to intf->handlers->sender() ipmi:ssif: Use i2c_adapter_id instead of adapter->nr ipmi: Use the proper default value for register size in ACPI ipmi_ssif: remove redundant null check on array client->adapter->name ipmi/watchdog: fix watchdog timeout set on reboot ipmi_ssif: unlock on allocation failure
2017-07-10	Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds	112	-1855/+2116
	Pull XFS updates from Darrick Wong: "Here are some changes for you for 4.13. For the most part it's fixes for bugs and deadlock problems, and preparation for online fsck in some future merge window. - Avoid quotacheck deadlocks - Fix transaction overflows when bunmapping fragmented files - Refactor directory readahead - Allow admin to configure if ASSERT is fatal - Improve transaction usage detail logging during overflows - Minor cleanups - Don't leak log items when the log shuts down - Remove double-underscore typedefs - Various preparation for online scrubbing - Introduce new error injection configuration sysfs knobs - Refactor dq_get_next to use extent map directly - Fix problems with iterating the page cache for unwritten data - Implement SEEK_{HOLE,DATA} via iomap - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA - Don't use MAXPATHLEN to check on-disk symlink target lengths" * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits) xfs: don't crash on unexpected holes in dir/attr btrees xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN xfs: fix contiguous dquot chunk iteration livelock xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA vfs: Add iomap_seek_hole and iomap_seek_data helpers vfs: Add page_cache_seek_hole_data helper xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent xfs: Check for m_errortag initialization in xfs_errortag_test xfs: grab dquots without taking the ilock xfs: fix semicolon.cocci warnings xfs: Don't clear SGID when inheriting ACLs xfs: free cowblocks and retry on buffered write ENOSPC xfs: replace log_badcrc_factor knob with error injection tag xfs: convert drop_writes to use the errortag mechanism xfs: remove unneeded parameter from XFS_TEST_ERROR xfs: expose errortag knobs via sysfs xfs: make errortag a per-mountpoint structure xfs: free uncommitted transactions during log recovery xfs: don't allow bmap on rt files ...
2017-07-10	Merge branch 'nowait-aio-btrfs-fixup' of ↵	Linus Torvalds	1	-12/+14
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "This fixes a user-visible bug introduced by the nowait-aio patches merged in this cycle" * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: nowait aio: Correct assignment of pos
2017-07-10	Merge branch 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds	1	-4/+4
	Pull copy_iter fix from Al Viro. [ Al used entirely the wrong return value. Oopsie. ] 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix brown paperbag bug in inlined copy_..._iter()
2017-07-10	Merge branch 'for-linus' of ↵	Linus Torvalds	30	-427/+798
	git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: - open/close tracking improvements from Dmitry Torokhov - battery support improvements in Wacom driver from Jason Gerecke - Win8 support fixes from Benjamin Tissories and Hans de Geode - misc fixes to Intel-ISH driver from Arnd Bergmann - support for quite a few new devices and small assorted fixes here and there * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (35 commits) HID: intel-ish-hid: Enable Gemini Lake ish driver HID: intel-ish-hid: Enable Cannon Lake ish driver HID: wacom: fix mistake in printk HID: multitouch: optimize the sticky fingers timer HID: multitouch: fix rare Win 8 cases when the touch up event gets missing HID: multitouch: use BIT macro HID: Add driver for Retrode2 joypad adapter HID: multitouch: Add support for Google Rose Touchpad HID: multitouch: Support PTP Stick and Touchpad device HID: core: don't use negative operands when shift HID: apple: Use country code to detect ISO keyboards HID: remove no longer used hid->open field greybus: hid: remove custom locking from gb_hid_open/close HID: usbhid: remove custom locking from usbhid_open/close HID: i2c-hid: remove custom locking from i2c_hid_open/close HID: serialize hid_hw_open and hid_hw_close HID: usbhid: do not rely on hid->open when deciding to do IO HID: hiddev: use hid_hw_power instead of usbhid_get/put_power HID: hiddev: use hid_hw_open/close instead of usbhid_open/close HID: asus: Add support for Zen AiO MD-5110 keyboard ...
2017-07-10	btrfs: nowait aio: Correct assignment of pos	Goldwyn Rodrigues	1	-12/+14
	Assigning pos for usage early messes up in append mode, where the pos is re-assigned in generic_write_checks(). Assign pos later to get the correct position to write from iocb->ki_pos. Since check_can_nocow also uses the value of pos, we shift generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT are present in generic_write_checks(), so checking for IOCB_NOWAIT is enough. Also, put locking sequence in the fast path. This fixes a user visible bug, as reported: "apparently breaks several shell related features on my system. In zsh history stopped working, because no new entries are added anymore. I fist noticed the issue when I tried to build mplayer. It uses a shell script to generate a help_mp.h file: [...] Here is a simple testcase: % echo "foo" >> test % echo "foo" >> test % cat test foo % " Fixes: edf064e7c6fe ("btrfs: nowait aio support") CC: Jens Axboe <axboe@kernel.dk> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Link: https://lkml.kernel.org/r/20170704042306.GA274@x4 Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-10	fix brown paperbag bug in inlined copy_..._iter()	Al Viro	1	-4/+4
	"copied nothing" == "return 0", not "return full size". Fixes: aa28de275a24 "iov_iter/hardening: move object size checks to inlined part" Spotted-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-10	Merge branches 'for-4.13/multitouch', 'for-4.13/retrode', ↵	Jiri Kosina	865	-5245/+8217
	'for-4.13/transport-open-close-consolidation', 'for-4.13/upstream' and 'for-4.13/wacom' into for-linus
2017-07-10	Merge branches 'for-4.13/ish' and 'for-4.13/ite' into for-linus	Jiri Kosina	15	-53/+125
	Conflicts: drivers/hid/hid-core.c
2017-07-10	Merge branches 'for-4.13/apple' and 'for-4.13/asus' into for-linus	Jiri Kosina	5	-38/+63
	Conflicts: drivers/hid/hid-core.c
2017-07-09	Merge tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linux	Linus Torvalds	903	-25523/+359485
	Pull drm updates from Dave Airlie: "This is the main pull request for the drm, I think I've got one later driver pull for mediatek SoC driver, I'm undecided on if it needs to go to you yet. Otherwise summary below: Core drm: - Atomic add driver private objects - Deprecate preclose hook in modern drivers - MST bandwidth tracking - Use kvmalloc in more places - Add mode_valid hook for crtc/encoder/bridge - Reduce sync_file construction time - Documentation updates - New DRM synchronisation object support New drivers: - pl111 - pl111 CLCD display controller Panel: - Innolux P079ZCA panel driver - Add NL12880B20-05, NL192108AC18-02D, P320HVN03 panels - panel-samsung-s6e3ha2: Add s6e3hf2 panel support i915: - SKL+ watermark fixes - G4x/G33 reset improvements - DP AUX backlight improvements - Buffer based GuC/host communication - New getparam for (sub)slice infomation - Cannonlake and Coffeelake initial patches - Execbuf optimisations radeon/amdgpu: - Lots of Vega10 bug fixes - Preliminary raven support - KIQ support for compute rings - MEC queue management rework - DCE6 Audio support - SR-IOV improvements - Better radeon/amdgpu selection support nouveau: - HDMI stereoscopic support - Display code rework for >= GM20x GPUs msm: - GEM rework for fine-grained locking - Per-process pagetable work - HDMI fixes for Snapdragon 820. vc4: - Remove 256MB CMA limit from vc4 - Add out-fence support - Add support for cygnus - Get/set tiling ioctls support - Add T-format tiling support for scanout zte: - add VGA support. etnaviv: - Thermal throttle support for newer GPUs - Restore userspace buffer cache performance - dma-buf sync fix stm: - add stm32f429 display support exynos: - Rework vblank handling - Fixup sw-trigger code sun4i: - V3s display engine support - HDMI support for older SoCs - Preliminary work on dual-pipeline SoCs. rcar-du: - VSP work imx-drm: - Remove counter load enable from PRE - Double read/write reduction flag support tegra: - Documentation for the host1x and drm driver. - Lots of staging ioctl fixes due to grate project work. omapdrm: - dma-buf fence support - TILER rotation fixes" * tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linux: (1270 commits) drm: Remove unused drm_file parameter to drm_syncobj_replace_fence() drm/amd/powerplay: fix bug fail to remove sysfs when rmmod amdgpu. amdgpu: Set cik/si_support to 1 by default if radeon isn't built drm/amdgpu/gfx9: fix driver reload with KIQ drm/amdgpu/gfx8: fix driver reload with KIQ drm/amdgpu: Don't call amd_powerplay_destroy() if we don't have powerplay drm/ttm: Fix use-after-free in ttm_bo_clean_mm drm/amd/amdgpu: move get memory type function from early init to sw init drm/amdgpu/cgs: always set reference clock in mode_info drm/amdgpu: fix vblank_time when displays are off drm/amd/powerplay: power value format change for Vega10 drm/amdgpu/gfx9: support the amdgpu.disable_cu option drm/amd/powerplay: change PPSMC_MSG_GetCurrPkgPwr for Vega10 drm/amdgpu: Make amdgpu_cs_parser_init static (v2) drm/amdgpu/cs: fix a typo in a comment drm/amdgpu: Fix the exported always on CU bitmap drm/amdgpu/gfx9: gfx_v9_0_enable_gfx_static_mg_power_gating() can be static drm/amdgpu/psp: upper_32_bits/lower_32_bits for address setup drm/amd/powerplay/cz: print message if smc message fails drm/amdgpu: fix typo in amdgpu_debugfs_test_ib_init ...
2017-07-09	afs: Add metadata xattrs	David Howells	8	-2/+138
	Add xattrs to allow the user to get/set metadata in lieu of having pioctl() available. The following xattrs are now available: - "afs.cell" The name of the cell in which the vnode's volume resides. - "afs.fid" The volume ID, vnode ID and vnode uniquifier of the file as three hex numbers separated by colons. - "afs.volume" The name of the volume in which the vnode resides. For example: # getfattr -d -m ".*" /mnt/scratch getfattr: Removing leading '/' from absolute path names # file: mnt/scratch afs.cell="mycell.myorg.org" afs.fid="10000b:1:1" afs.volume="scratch" Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09	afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directories	Marc Dionne	1	-3/+2
	The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not be used to make access decisions for the directory itself. They are meant to control access for the objects contained in that directory. Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set. This would cause an incorrect access denied error for a directory with AFS_ACE_LOOKUP but not AFS_ACE_READ. The AFS_ACE_WRITE bit does not allow operations that modify the directory. For a directory with AFS_ACE_WRITE but neither AFS_ACE_INSERT nor AFS_ACE_DELETE, this would result in trying operations that would ultimately be denied by the server. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09	mqueue: fix a use-after-free in sys_mq_notify()	Cong Wang	1	-1/+3
	The retry logic for netlink_attachskb() inside sys_mq_notify() is nasty and vulnerable: 1) The sock refcnt is already released when retry is needed 2) The fd is controllable by user-space because we already release the file refcnt so we when retry but the fd has been just closed by user-space during this small window, we end up calling netlink_detachskb() on the error path which releases the sock again, later when the user-space closes this socket a use-after-free could be triggered. Setting 'sock' to NULL here should be sufficient to fix it. Reported-by: GeneBlue <geneblue.mail@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09	Merge branch 'x86-urgent-for-linus' of ↵	Linus Torvalds	7	-56/+65
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "The x86 updates contain: - A fix for a longstanding PAT bug, where PAT was reported on CPUs that do not support it, which leads to wrong caching attributes and missing MTRR updates - Prevent overwriting of the e820 firmware table, which causes kexec kernels to lose the fake mptable which is stored there. - Cleanup of the UV/BAU code, removing unused code and making local functions static" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table x86/boot/e820: Rename the e820_table_firmware to e820_table_kexec x86/boot/e820: Avoid overwriting e820_table_firmware x86/mm/pat: Don't report PAT on CPUs that don't support it x86/platform/uv/BAU: Minor cleanup, make some local functions static
2017-07-09	Merge branch 'timers-urgent-for-linus' of ↵	Linus Torvalds	2	-3/+13
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers fixlet from Thomas Gleixner: "Add Frederic Weisbecker as NOHZ/dyntick maintainer" [ And an unmentioned and unrelated typo fix in the same commit? Hmm.. ] * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add Frederic Weisbecker as nohz/dyntics maintainer
2017-07-09	Merge branch 'smp-urgent-for-linus' of ↵	Linus Torvalds	1	-18/+19
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp/hotplug fix from Thomas Gleixner: "A single fix for a brown paperbag bug: The unparking of the initial percpu threads of an upcoming CPU happens right now on the idle task, but that's wrong as the unpark function might sleep. Move it to the control CPU." * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp/hotplug: Move unparking of percpu threads to the control CPU
2017-07-09	Merge branch 'sched-urgent-for-linus' of ↵	Linus Torvalds	7	-113/+165
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "This scheduler update provides: - The (hopefully) final fix for the vtime accounting issues which were around for quite some time - Use types known to user space in UAPI headers to unbreak user space builds - Make load balancing respect the current scheduling domain again instead of evaluating unrelated CPUs" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/headers/uapi: Fix linux/sched/types.h userspace compilation errors sched/fair: Fix load_balance() affinity redo path sched/cputime: Accumulate vtime on top of nsec clocksource sched/cputime: Move the vtime task fields to their own struct sched/cputime: Rename vtime fields sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime vtime, sched/cputime: Remove vtime_account_user() Revert "sched/cputime: Refactor the cputime_adjust() code"
2017-07-09	Merge branch 'perf-urgent-for-linus' of ↵	Linus Torvalds	6	-23/+30
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A couple of fixes for perf and kprobes: - Add he missing exclude_kernel attribute for the precise_ip level so !CAP_SYS_ADMIN users get the proper results. - Warn instead of failing completely when perf has no unwind support for a particular architectiure built in. - Ensure that jprobes are at function entry and not at some random place" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kprobes: Ensure that jprobe probepoints are at function entry kprobes: Simplify register_jprobes() kprobes: Rename [arch_]function_offset_within_entry() to [arch_]kprobe_on_func_entry() perf unwind: Do not fail due to missing unwind support perf evsel: Set attr.exclude_kernel when probing max attr.precise_ip
2017-07-09	Merge branch 'locking-urgent-for-linus' of ↵	Linus Torvalds	2	-2/+3
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: - Fix the EINTR logic in rwsem-spinlock to avoid double locking by a writer and a reader - Add a missing include to qspinlocks * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/qspinlock: Explicitly include asm/prefetch.h locking/rwsem-spinlock: Fix EINTR branch in __down_write_common()
2017-07-09	Merge branch 'irq-urgent-for-linus' of ↵	Linus Torvalds	7	-27/+75
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: - A few fixes mopping up the fallout of the big irq overhaul - Move the interrupt resource management logic out of the spin locked, irq disabled region to avoid unnecessary restrictions of the resource callbacks - Preparation for reworking the per cpu irq request function. * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqdomain: Allow ACPI device nodes to be used as irqdomain identifiers genirq/debugfs: Remove redundant NULL pointer check genirq: Allow to pass the IRQF_TIMER flag with percpu irq request genirq/timings: Move free timings out of spinlocked region genirq: Move irq resource handling out of spinlocked region genirq: Add mutex to irq desc to serialize request/free_irq() genirq: Move bus locking into __setup_irq() genirq: Force inlining of __irq_startup_managed to prevent build failure genirq/debugfs: Fix build for !CONFIG_IRQ_DOMAIN
2017-07-09	Merge branch 'core-urgent-for-linus' of ↵	Linus Torvalds	1	-2/+11
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fix from Thomas Gleixner: "A fix to the objtool sibling call detection logic to distinguish normal jumps inside a function from a real sibling call" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix sibling call detection logic
2017-07-09	Merge tag 'ext4_for_linus' of ↵	Linus Torvalds	30	-513/+2081
	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first major feature for ext4 this merge window is the largedir feature, which allows ext4 directories to support over 2 billion directory entries (assuming ~64 byte file names; in practice, users will run into practical performance limits first.) This feature was originally written by the Lustre team, and credit goes to Artem Blagodarenko from Seagate for getting this feature upstream. The second major major feature allows ext4 to support extended attribute values up to 64k. This feature was also originally from Lustre, and has been enhanced by Tahsin Erdogan from Google with a deduplication feature so that if multiple files have the same xattr value (for example, Windows ACL's stored by Samba), only one copy will be stored on disk for encoding and caching efficiency. We also have the usual set of bug fixes, cleanups, and optimizations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits) ext4: fix spelling mistake: "prellocated" -> "preallocated" ext4: fix __ext4_new_inode() journal credits calculation ext4: skip ext4_init_security() and encryption on ea_inodes fs: generic_block_bmap(): initialize all of the fields in the temp bh ext4: change fast symlink test to not rely on i_blocks ext4: require key for truncate(2) of encrypted file ext4: don't bother checking for encryption key in ->mmap() ext4: check return value of kstrtoull correctly in reserved_clusters_store ext4: fix off-by-one fsmap error on 1k block filesystems ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry() ext4: return EIO on read error in ext4_find_entry ext4: forbid encrypting root directory ext4: send parallel discards on commit completions ext4: avoid unnecessary stalls in ext4_evict_inode() ext4: add nombcache mount option ext4: strong binding of xattr inode references ext4: eliminate xattr entry e_hash recalculation for removes ext4: reserve space for xattr entries/names quota: add get_inode_usage callback to transfer multi-inode charges ext4: xattr inode deduplication ...
2017-07-09	Merge tag 'fscrypt_for_linus' of ↵	Linus Torvalds	10	-70/+182
	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Add support for 128-bit AES and some cleanups to fscrypt" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: make ->dummy_context() return bool fscrypt: add support for AES-128-CBC fscrypt: inline fscrypt_free_filename()
2017-07-09	Merge branch 'waitid-fix' of ↵	Linus Torvalds	1	-5/+12
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull waitid fix from Al Viro. * 'waitid-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix waitid(2) breakage
2017-07-08	f2fs: support plain user/group quota	Chao Yu	8	-40/+454
	This patch adds to support plain user/group quota. Change Note by Jaegeuk Kim. - Use f2fs page cache for quota files in order to consider garbage collection. so, quota files are not tolerable for sudden power-cuts, so user needs to do quotacheck. - setattr() calls dquot_transfer which will transfer inode->i_blocks. We can't reclaim that during f2fs_evict_inode(). So, we need to count node blocks as well in order to match i_blocks with dquot's space. Note that, Chao wrote a patch to count inode->i_blocks without inode block. (f2fs: don't count inode block in in-memory inode.i_blocks) - in f2fs_remount, we need to make RW in prior to dquot_resume. - handle fault_injection case during f2fs_quota_off_umount - TODO: Project quota Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-08	Merge tag 'pci-v4.13-changes' of ↵	Linus Torvalds	90	-822/+3786
	git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add sysfs max_link_speed/width, current_link_speed/width (Wong Vee Khee) - make host bridge IRQ mapping much more generic (Matthew Minter, Lorenzo Pieralisi) - convert most drivers to pci_scan_root_bus_bridge() (Lorenzo Pieralisi) - mutex sriov_configure() (Jakub Kicinski) - mutex pci_error_handlers callbacks (Christoph Hellwig) - split ->reset_notify() into ->reset_prepare()/reset_done() (Christoph Hellwig) - support multiple PCIe portdrv interrupts for MSI as well as MSI-X (Gabriele Paoloni) - allocate MSI/MSI-X vector for Downstream Port Containment (Gabriele Paoloni) - fix MSI IRQ affinity pre/post/min_vecs issue (Michael Hernandez) - test INTx masking during enumeration, not at run-time (Piotr Gregor) - avoid using device_may_wakeup() for runtime PM (Rafael J. Wysocki) - restore the status of PCI devices across hibernation (Chen Yu) - keep parent resources that start at 0x0 (Ard Biesheuvel) - enable ECRC only if device supports it (Bjorn Helgaas) - restore PRI and PASID state after Function-Level Reset (CQ Tang) - skip DPC event if device is not present (Keith Busch) - check domain when matching SMBIOS info (Sujith Pandel) - mark Intel XXV710 NIC INTx masking as broken (Alex Williamson) - avoid AMD SB7xx EHCI USB wakeup defect (Kai-Heng Feng) - work around long-standing Macbook Pro poweroff issue (Bjorn Helgaas) - add Switchtec "running" status flag (Logan Gunthorpe) - fix dra7xx incorrect RW1C IRQ register usage (Arvind Yadav) - modify xilinx-nwl IRQ chip for legacy interrupts (Bharat Kumar Gogada) - move VMD SRCU cleanup after bus, child device removal (Jon Derrick) - add Faraday clock handling (Linus Walleij) - configure Rockchip MPS and reorganize (Shawn Lin) - limit Qualcomm TLP size to 2K (hardware issue) (Srinivas Kandagatla) - support Tegra MSI 64-bit addressing (Thierry Reding) - use Rockchip normal (not privileged) register bank (Shawn Lin) - add HiSilicon Kirin SoC PCIe controller driver (Xiaowei Song) - add Sigma Designs Tango SMP8759 PCIe controller driver (Marc Gonzalez) - add MediaTek PCIe host controller support (Ryder Lee) - add Qualcomm IPQ4019 support (John Crispin) - add HyperV vPCI protocol v1.2 support (Jork Loeser) - add i.MX6 regulator support (Quentin Schulz) * tag 'pci-v4.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (113 commits) PCI: tango: Add Sigma Designs Tango SMP8759 PCIe host bridge support PCI: Add DT binding for Sigma Designs Tango PCIe controller PCI: rockchip: Use normal register bank for config accessors dt-bindings: PCI: Add documentation for MediaTek PCIe PCI: Remove __pci_dev_reset() and pci_dev_reset() PCI: Split ->reset_notify() method into ->reset_prepare() and ->reset_done() PCI: xilinx: Make of_device_ids const PCI: xilinx-nwl: Modify IRQ chip for legacy interrupts PCI: vmd: Move SRCU cleanup after bus, child device removal PCI: vmd: Correct comment: VMD domains start at 0x10000, not 0x1000 PCI: versatile: Add local struct device pointers PCI: tegra: Do not allocate MSI target memory PCI: tegra: Support MSI 64-bit addressing PCI: rockchip: Use local struct device pointer consistently PCI: rockchip: Check for clk_prepare_enable() errors during resume MAINTAINERS: Remove Wenrui Li as Rockchip PCIe driver maintainer PCI: rockchip: Configure RC's MPS setting PCI: rockchip: Reconfigure configuration space header type PCI: rockchip: Split out rockchip_pcie_cfg_configuration_accesses() PCI: rockchip: Move configuration accesses into rockchip_pcie_cfg_atu() ...
2017-07-08	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md	Linus Torvalds	9	-47/+92
	Pull MD update from Shaohua Li: - fixed deadlock in MD suspend and a potential bug in bio allocation (Neil Brown) - fixed signal issue (Mikulas Patocka) - fixed typo in FailFast test (Guoqing Jiang) - other trival fixes * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: MD: fix sleep in atomic MD: fix a null dereference md: use a separate bio_set for synchronous IO. md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE md/raid1: remove unused bio in sync_request_write md/raid10: fix FailFast test for wrong device md: don't use flush_signals in userspace processes md: fix deadlock between mddev_suspend() and md_write_start()
2017-07-08	Merge branch 'for-linus' of ↵	Linus Torvalds	34	-71/+1349
	git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a new driver for STM FingerTip touchscreen - a new driver for D-Link DIR-685 touch keys - updated list of supported devices in xpad driver - other assorted updates and fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (23 commits) MAINTAINERS: update input subsystem patterns Input: introduce KEY_ASSISTANT Input: xpad - sync supported devices with XBCD Input: xpad - sync supported devices with 360Controller Input: xen-kbdfront - use string constants from PV protocol Input: stmfts - mark all PM functions as __maybe_unused Input: add support for the STMicroelectronics FingerTip touchscreen Input: add D-Link DIR-685 touchkeys driver Input: s3c2410_ts - handle return value of clk_prepare_enable Input: axp20x-pek - add wakeup support Input: synaptics-rmi4 - use %phN to form F34 configuration ID Input: synaptics-rmi4 - change a char type to u8 Input: sparse-keymap - remove sparse_keymap_free() Input: tsc2007 - move header file out of I2C realm Input: mms114 - move header file out of I2C realm Input: mcs - move header file out of I2C realm Input: lm8323 - move header file out of I2C realm Input: elantech - force relative mode on a certain module Input: elan_i2c - add support for fetching chip type on newer hardware Input: elan_i2c - check if device is there before really probing ...
2017-07-08	Merge tag 'dmaengine-4.13-rc1' of git://git.infradead.org/users/vkoul/slave-dma	Linus Torvalds	34	-443/+3026
	Pull dmaengine updates from Vinod Koul: - removal of AVR32 support in dw driver as AVR32 is gone - new driver for Broadcom stream buffer accelerator (SBA) RAID driver - add support for Faraday Technology FTDMAC020 in amba-pl08x driver - IOMMU support in pl330 driver - updates to bunch of drivers * tag 'dmaengine-4.13-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (36 commits) dmaengine: qcom_hidma: correct API violation for submit dmaengine: zynqmp_dma: Remove max len check in zynqmp_dma_prep_memcpy dmaengine: tegra-apb: Really fix runtime-pm usage dmaengine: fsl_raid: make of_device_ids const. dmaengine: qcom_hidma: allow ACPI/DT parameters to be overridden dmaengine: fsldma: set BWC, DAHTS and SAHTS values correctly dmaengine: Kconfig: Simplify the help text for MXS_DMA dmaengine: pl330: Delete unused functions dmaengine: Replace WARN_TAINT_ONCE() with pr_warn_once() dmaengine: Kconfig: Extend the dependency for MXS_DMA dmaengine: mxs: Use %zu for printing a size_t variable dmaengine: ste_dma40: Cleanup scatterlist layering violations dmaengine: imx-dma: cleanup scatterlist layering violations dmaengine: use proper name for the R-Car SoC dmaengine: imx-sdma: Fix compilation warning. dmaengine: imx-sdma: Handle return value of clk_prepare_enable dmaengine: pl330: Add IOMMU support to slave tranfers dmaengine: DW DMAC: Handle return value of clk_prepare_enable dmaengine: pl08x: use GENMASK() to create bitmasks dmaengine: pl08x: Add support for Faraday Technology FTDMAC020 ...
2017-07-08	Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm	Linus Torvalds	11	-18/+185
	Pull ARM updates from Russell King: - add support for ftrace-with-registers, which is needed for kgraft and other ftrace tools - support for mremap() for the sigpage/vDSO so that checkpoint/restore can work - add timestamps to each line of the register dump output - remove the unused KTHREAD_SIZE from nommu - align the ARM bitops APIs with the generic API (using unsigned long pointers rather than void pointers) - make the configuration of userspace Thumb support an expert option so that we can default it on, and avoid some hard to debug userspace crashes * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO ARM: 8679/1: bitops: Align prototypes to generic API ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS ARM: make configuration of userspace Thumb support an expert option ARM: 8673/1: Fix __show_regs output timestamps
2017-07-08	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next	Linus Torvalds	24	-492/+890
	Pull sparc updates from David Miller: 1) Queued spinlocks and rwlocks for sparc64, from Babu Moger. 2) Some const'ification from Arvind Yadav. 3) LDC/VIO driver infrastructure changes to facilitate future upcoming drivers, from Jag Raman. 4) Initialize sched_clock() et al. early so that the initial printk timestamps are all done while the implementation is available and functioning. From Pavel Tatashin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: (38 commits) sparc: kernel: pmc: make of_device_ids const. sparc64: fix typo in property sparc64: add port_id to VIO device metadata sparc64: Enhance search for VIO device in MDESC sparc64: enhance VIO device probing sparc64: check if a client is allowed to register for MDESC notifications sparc64: remove restriction on VIO device name size sparc64: refactor code to obtain cfg_handle property from MDESC sparc64: add MDESC node name property to VIO device metadata sparc64: mdesc: use __GFP_REPEAT action modifier for VM allocation sparc64: expand MDESC interface sparc64: skip handshake for LDC channels in RAW mode sparc64: specify the device class in VIO version info. packet sparc64: ensure VIO operations are defined while being used sparc: kernel: apc: make of_device_ids const sparc/time: make of_device_ids const sparc64: broken %tick frequency on spitfire cpus sparc64: use prom interface to get %stick frequency sparc64: optimize functions that access tick sparc64: add hot-patched and inlined get_tick() ...
2017-07-08	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds	30	-65/+92
	Pull networking fixes from David Miller: "Mostly fixing some light fallout from the changes that went into the merge window. 1) Fix memory leaks on network namespace teardown in netfilter, from Liping Zhang. 2) When comparing ipv6 nexthops, we have to take the lightweight tunnel state into account as well. From David Ahern. 3) Fix socket option object length check in the new TLS code, from Matthias Rosenfelder. 4) Fix memory leak in nfp driver flower support, from Jakub Kicinski. 5) Several netlink attribute validation fixes in cfg80211, from Srinivas Dasari. 6) Fix context array leak in virtio_net, from Jason Wang. 7) SKB use after free in hns driver, from Yusheng Lin. 8) Fix socket leak on accept() in RDS, from Sowmini Varadhan. Also add a WARN_ON() to sock_graft() so other protocol stacks don't trip over this as well" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) net: ethernet: mediatek: remove useless code in mtk_probe() mpls: fix uninitialized in_label var warning in mpls_getroute doc: SKB_GSO_[IPIP\|SIT] have been replaced bonding: avoid NETDEV_CHANGEMTU event when unregistering slave net/sock: add WARN_ON(parent->sk) in sock_graft() rds: tcp: use sock_create_lite() to create the accept socket net: hns: Fix a skb used after free bug net: hns: Fix a wrong op phy C45 code net: macb: Adding Support for Jumbo Frames up to 10240 Bytes in SAMA5D3 net: Update networking MAINTAINERS entry. virtio-net: fix leaking of ctx array cfg80211: Validate frequencies nested in NL80211_ATTR_SCAN_FREQUENCIES cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE cfg80211: Check if NAN service ID is of expected size cfg80211: Check if PMKID attribute is of expected size arcnet: com20020-pci: Fix an error handling path in 'com20020pci_probe()' nfp: flower: add missing clean up call to avoid memory leaks vrf: fix bug_on triggered by rx when destroying a vrf ptp: dte: Use LL suffix for 64-bit constants sctp: set the value of flowi6_oif to sk_bound_dev_if to make sctp_v6_get_dst to find the correct route entry. ...
2017-07-08	Merge branch 'work.misc' of ↵	Linus Torvalds	10	-46/+61
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc filesystem updates from Al Viro: "Assorted normal VFS / filesystems stuff..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dentry name snapshots Make statfs properly return read-only state after emergency remount fs/dcache: init in_lookup_hashtable minix: Deinline get_block, save 2691 bytes fs: Reorder inode_owner_or_capable() to avoid needless fs: warn in case userspace lied about modprobe return
2017-07-08	Merge branch 'for-spi' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds	1	-31/+11
	Pull spi uaccess delousing from Al Viro: "Getting rid of pointless __get_user() and friends in drivers/spi. [ the only reason it's on a separate branch is that I hoped it would be picked by spi folks; looks like mail asking them to grab it got lost and I hadn't followed up on that ]" * 'for-spi' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: spidev: quit messing with access_ok()
2017-07-08	Merge branch 'work.__copy_in_user' of ↵	Linus Torvalds	2	-16/+9
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull __copy_in_user removal from Al Viro: "There used to be 6 places in the entire tree calling __copy_in_user(), all of them bogus. Four got killed off in work.drm branch, this takes care of the remaining ones and kills the definition of that sucker" * 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill __copy_in_user() sanitize do_i2c_smbus_ioctl()
2017-07-08	fix waitid(2) breakage	Al Viro	1	-5/+12
	We lose the distinction between "found a PID" and "nothing, but that's not an error" a bit too early in waitid(). Easily fixed, fortunately... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Fixes: 67d7ddded322 ("waitid(2): leave copyout of siginfo to syscall itself") Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-08	net: ethernet: mediatek: remove useless code in mtk_probe()	Gustavo A. R. Silva	1	-5/+0
	Remove useless local variables _match_, _soc_ and the code related. Notice that const struct of_device_id of_mtk_match[] = { { .compatible = "mediatek,mt2701-eth" }, {}, }; So match->data is NULL. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	mpls: fix uninitialized in_label var warning in mpls_getroute	Roopa Prabhu	1	-4/+8
	Fix the below warning generated by static checker: net/mpls/af_mpls.c:2111 mpls_getroute() error: uninitialized symbol 'in_label'." Fixes: 397fc9e5cefe ("mpls: route get support") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	doc: SKB_GSO_[IPIP\|SIT] have been replaced	Nicolas Dichtel	1	-1/+1
	Those enum values don't exist anymore. Fixes: 7e13318daa4a ("net: define gso types for IPx over IPv4 and IPv6") CC: Tom Herbert <tom@herbertland.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	bonding: avoid NETDEV_CHANGEMTU event when unregistering slave	WANG Cong	3	-7/+12
	As Hongjun/Nicolas summarized in their original patch: " When a device changes from one netns to another, it's first unregistered, then the netns reference is updated and the dev is registered in the new netns. Thus, when a slave moves to another netns, it is first unregistered. This triggers a NETDEV_UNREGISTER event which is caught by the bonding driver. The driver calls bond_release(), which calls dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in the old netns). " This is a very special case, because the device is being unregistered no one should still care about the NETDEV_CHANGEMTU event triggered at this point, we can avoid broadcasting this event on this path, and avoid touching inetdev_event()/addrconf_notify() path. It requires to export __dev_set_mtu() to bonding driver. Reported-by: Hongjun Li <hongjun.li@6wind.com> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	Merge branch 'rds-tcp-sock_graft-leak'	David S. Miller	2	-1/+2
	Sowmini Varadhan says: ==================== rds-tcp: sock_graft() leak Following up on the discussion at https://www.spinics.net/lists/netdev/msg442859.html - make rds_tcp_accept_one() call sock_create_lite() - add a WARN_ON() to sock_graft() Tested by running an infinite while() loop that does (module-load; rds-stress; module-unload) and monitors TCP slabinfo while the test is running. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	net/sock: add WARN_ON(parent->sk) in sock_graft()	Sowmini Varadhan	1	-0/+1
	sock_graft() unilaterally sets up parent->sk based on the assumption that the existing parent->sk is null. If this condition is not true, then the existing parent->sk would be leaked, so add a WARN_ON() to alert callers who may fall in this category. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	rds: tcp: use sock_create_lite() to create the accept socket	Sowmini Varadhan	1	-1/+1
	There are two problems with calling sock_create_kern() from rds_tcp_accept_one() 1. it sets up a new_sock->sk that is wasteful, because this ->sk is going to get replaced by inet_accept() in the subsequent ->accept() 2. The new_sock->sk is a leaked reference in sock_graft() which expects to find a null parent->sk Avoid these problems by calling sock_create_lite(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	Merge branch 'hns-fixes'	David S. Miller	3	-16/+14
	Lin Yun Sheng says: ==================== Bugfixs for hns ethernet driver This patchset fix skb used after free and C45 op code issues in hns driver. Patch V2: 1. Remove ndev->feature checking in TX description patch. 2. Add Fixes: Tag in patch description. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	net: hns: Fix a skb used after free bug	Yunsheng Lin	2	-15/+13
	skb maybe freed in hns_nic_net_xmit_hw() and return NETDEV_TX_OK, which cause hns_nic_net_xmit to use a freed skb. BUG: KASAN: use-after-free in hns_nic_net_xmit_hw+0x62c/0x940... [17659.112635] alloc_debug_processing+0x18c/0x1a0 [17659.117208] __slab_alloc+0x52c/0x560 [17659.120909] kmem_cache_alloc_node+0xac/0x2c0 [17659.125309] __alloc_skb+0x6c/0x260 [17659.128837] tcp_send_ack+0x8c/0x280 [17659.132449] __tcp_ack_snd_check+0x9c/0xf0 [17659.136587] tcp_rcv_established+0x5a4/0xa70 [17659.140899] tcp_v4_do_rcv+0x27c/0x620 [17659.144687] tcp_prequeue_process+0x108/0x170 [17659.149085] tcp_recvmsg+0x940/0x1020 [17659.152787] inet_recvmsg+0x124/0x180 [17659.156488] sock_recvmsg+0x64/0x80 [17659.160012] SyS_recvfrom+0xd8/0x180 [17659.163626] __sys_trace_return+0x0/0x4 [17659.167506] INFO: Freed in kfree_skbmem+0xa0/0xb0 age=23 cpu=1 pid=13 [17659.174000] free_debug_processing+0x1d4/0x2c0 [17659.178486] __slab_free+0x240/0x390 [17659.182100] kmem_cache_free+0x24c/0x270 [17659.186062] kfree_skbmem+0xa0/0xb0 [17659.189587] __kfree_skb+0x28/0x40 [17659.193025] napi_gro_receive+0x168/0x1c0 [17659.197074] hns_nic_rx_up_pro+0x58/0x90 [17659.201038] hns_nic_rx_poll_one+0x518/0xbc0 [17659.205352] hns_nic_common_poll+0x94/0x140 [17659.209576] net_rx_action+0x458/0x5e0 [17659.213363] __do_softirq+0x1b8/0x480 [17659.217062] run_ksoftirqd+0x64/0x80 [17659.220679] smpboot_thread_fn+0x224/0x310 [17659.224821] kthread+0x150/0x170 [17659.228084] ret_from_fork+0x10/0x40 BUG: KASAN: use-after-free in hns_nic_net_xmit+0x8c/0xc0... [17751.080490] __slab_alloc+0x52c/0x560 [17751.084188] kmem_cache_alloc+0x244/0x280 [17751.088238] __build_skb+0x40/0x150 [17751.091764] build_skb+0x28/0x100 [17751.095115] __alloc_rx_skb+0x94/0x150 [17751.098900] __napi_alloc_skb+0x34/0x90 [17751.102776] hns_nic_rx_poll_one+0x180/0xbc0 [17751.107097] hns_nic_common_poll+0x94/0x140 [17751.111333] net_rx_action+0x458/0x5e0 [17751.115123] __do_softirq+0x1b8/0x480 [17751.118823] run_ksoftirqd+0x64/0x80 [17751.122437] smpboot_thread_fn+0x224/0x310 [17751.126575] kthread+0x150/0x170 [17751.129838] ret_from_fork+0x10/0x40 [17751.133454] INFO: Freed in kfree_skbmem+0xa0/0xb0 age=19 cpu=7 pid=43 [17751.139951] free_debug_processing+0x1d4/0x2c0 [17751.144436] __slab_free+0x240/0x390 [17751.148051] kmem_cache_free+0x24c/0x270 [17751.152014] kfree_skbmem+0xa0/0xb0 [17751.155543] __kfree_skb+0x28/0x40 [17751.159022] napi_gro_receive+0x168/0x1c0 [17751.163074] hns_nic_rx_up_pro+0x58/0x90 [17751.167041] hns_nic_rx_poll_one+0x518/0xbc0 [17751.171358] hns_nic_common_poll+0x94/0x140 [17751.175585] net_rx_action+0x458/0x5e0 [17751.179373] __do_softirq+0x1b8/0x480 [17751.183076] run_ksoftirqd+0x64/0x80 [17751.186691] smpboot_thread_fn+0x224/0x310 [17751.190826] kthread+0x150/0x170 [17751.194093] ret_from_fork+0x10/0x40 Fixes: 13ac695e7ea1 ("net:hns: Add support of Hip06 SoC to the Hislicon Network Subsystem") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: lipeng <lipeng321@huawei.com> Reported-by: Jun He <hjat2005@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	net: hns: Fix a wrong op phy C45 code	Yunsheng Lin	1	-1/+1
	As the user manual described, the second step to write to C45 phy by mdio should be data, but not address. Here we should fix this issue. Fixes: 5b904d39406a ("net: add Hisilicon Network Subsystem MDIO support") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: lipeng <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	net: macb: Adding Support for Jumbo Frames up to 10240 Bytes in SAMA5D3	vishnuvardhan	1	-1/+2
	As per the SAMA5D3 device specification it supports Jumbo frames. But the suggested flag and length of bytes it supports was not updated in this driver config_structure. The maximum jumbo frames the device supports : 10240 bytes as per the device spec. While changing the MTU value greater than 1500, it threw error: sudo ifconfig eth1 mtu 9000 SIOCSIFMTU: Invalid argument Add this support to driver so that it works as expected and designed. Signed-off-by: vishnuvardhan <vardhanraj4143@gmail.com> [nicolas.ferre@microchip.com: modify slightly commit msg] Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08	sched/headers/uapi: Fix linux/sched/types.h userspace compilation errors	Dmitry V. Levin	1	-8/+8
	Consistently use types provided by <linux/types.h> to fix the following linux/sched/types.h userspace compilation errors: /usr/include/linux/sched/types.h:57:2: error: unknown type name 'u32' u32 size; ... u64 sched_period; Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # v4.12 Fixes: e2d1e2aec572 ("sched/headers: Move various ABI definitions to <uapi/linux/sched/types.h>") Link: http://lkml.kernel.org/r/20170705162328.GA11026@altlinux.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-08	kprobes: Ensure that jprobe probepoints are at function entry	Naveen N. Rao	1	-2/+6
	Similar to commit 90ec5e89e393c ("kretprobes: Ensure probe location is at function entry"), ensure that the jprobe probepoint is at function entry. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a4525af6c5a42df385efa31251246cf7cca73598.1499443367.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-08	kprobes: Simplify register_jprobes()	Naveen N. Rao	1	-14/+16
	Re-factor jprobe registration functions as the current version is getting too unwieldy. Move the actual jprobe registration to register_jprobe() and re-organize code accordingly. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/089cae4bfe73767f765291ee0e6fb0c3d240e5f1.1499443367.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-08	kprobes: Rename [arch_]function_offset_within_entry() to ↵	Naveen N. Rao	4	-8/+8
	[arch_]kprobe_on_func_entry() Rename function_offset_within_entry() to scope it to kprobe namespace by using kprobe_ prefix, and to also simplify it. Suggested-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/3aa6c7e2e4fb6e00f3c24fa306496a66edb558ea.1499443367.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-08	locking/qspinlock: Explicitly include asm/prefetch.h	Stafford Horne	1	-0/+1
	In architectures that use qspinlock, like x86, prefetch is loaded indirectly via the asm/qspinlock.h include. On other architectures, like OpenRISC, which may want to use asm-generic/qspinlock.h the built will fail without the asm/prefetch.h include. Fix this by including directly. Signed-off-by: Stafford Horne <shorne@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170707195658.23840-1-shorne@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-08	objtool: Fix sibling call detection logic	Josh Poimboeuf	1	-2/+11
	With some configs, objtool reports the following warning: arch/x86/kernel/ftrace.o: warning: objtool: ftrace_modify_code_direct()+0x2d: sibling call from callable instruction with modified stack frame The instruction it's complaining about isn't actually a sibling call. It's just a normal jump to an address inside the function. Objtool thought it was a sibling call because the instruction's jump_dest wasn't initialized because the function was supposed to be ignored due to its use of sync_core(). Objtool ended up validating the function instead of ignoring it because it didn't properly recognize a sibling call to the function. So fix the sibling call logic. Also add a warning to catch ignored functions being validated so we'll get a more useful error message next time. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/96cc8ecbcdd8cb29ddd783817b4af918a6a171b0.1499437107.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-07	Merge branch 'work.read_write' of ↵	Linus Torvalds	1	-2/+4
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull read/write fix from Al Viro: "file_start_write()/file_end_write() got mixed into vfs_iter_write() by accident; that's a deadlock for all existing callers - they already do that, some - quite a bit outside. Easily fixed, fortunately" * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: move file_{start,end}_write() out of do_iter_write()
2017-07-07	Merge branch 'uaccess-work.iov_iter' of ↵	Linus Torvalds	5	-75/+178
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter hardening from Al Viro: "This is the iov_iter/uaccess/hardening pile. For one thing, it trims the inline part of copy_to_user/copy_from_user to the minimum that does need to be inlined - object size checks, basically. For another, it sanitizes the checks for iov_iter primitives. There are 4 groups of checks: access_ok(), might_fault(), object size and KASAN. - access_ok() had been verified by whoever had set the iov_iter up. However, that has happened in a function far away, so proving that there's no path to actual copying bypassing those checks is hard and proving that iov_iter has not been buggered in the meanwhile is also not pleasant. So we want those redone in actual copyin/copyout. - might_fault() is better off consolidated - we know whether it needs to be checked as soon as we enter iov_iter primitive and observe the iov_iter flavour. No need to wait until the copyin/copyout. The call chains are short enough to make sure we won't miss anything - in fact, it's more robust that way, since there are cases where we do e.g. forced fault-in before getting to copyin/copyout. It's not quite what we need to check (in particular, combination of iovec-backed and set_fs(KERNEL_DS) is almost certainly a bug, not a cause to skip checks), but that's for later series. For now let's keep might_fault(). - KASAN checks belong in copyin/copyout - at the same level where other iov_iter flavours would've hit them in memcpy(). - object size checks should apply to all iov_iter flavours, not just iovec-backed ones. There are two groups of primitives - one gets the kernel object described as pointer + size (copy_to_iter(), etc.) while another gets it as page + offset + size (copy_page_to_iter(), etc.) For the first group the checks are best done where we actually have a chance to find the object size. In other words, those belong in inline wrappers in uio.h, before calling into iov_iter.c. Same kind as we have for inlined part of copy_to_user(). For the second group there is no object to look at - offset in page is just a number, it bears no type information. So we do them in the common helper called by iov_iter.c primitives of that kind. All it currently does is checking that we are not trying to access outside of the compound page; eventually we might want to add some sanity checks on the page involved. So the things we need in copyin/copyout part of iov_iter.c do not quite match anything in uaccess.h (we want no zeroing, we do want access_ok() and KASAN and we want no might_fault() or object size checks done on that level). OTOH, these needs are simple enough to provide a couple of helpers (static in iov_iter.c) doing just what we need..." * 'uaccess-work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: iov_iter: saner checks on copyin/copyout iov_iter: sanity checks for copy to/from page primitives iov_iter/hardening: move object size checks to inlined part copy_{to,from}_user(): consolidate object size checks copy_{from,to}_user(): move kasan checks and might_fault() out-of-line
2017-07-07	exec: Limit arg stack to at most 75% of _STK_LIM	Kees Cook	1	-5/+6
	To avoid pathological stack usage or the need to special-case setuid execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07	Merge tag 'for-linus-v4.13-2' of ↵	Linus Torvalds	24	-63/+572
	git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07	xfs: don't crash on unexpected holes in dir/attr btrees	Darrick J. Wong	4	-5/+5
	In quite a few places we call xfs_da_read_buf with a mappedbno that we don't control, then assume that the function passes back either an error code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno maps to a hole, we get a return code of zero and a NULL buffer, which means that we crash if we actually try to use that buffer pointer. This happens immediately when we set the buffer type for transaction context. Therefore, check that we have no error code and a non-NULL bp before trying to use bp. This patch is a follow-up to an incomplete fix in 96a3aefb8ffde231 ("xfs: don't crash if reading a directory results in an unexpected hole"). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-07	Merge tag 'for-linus-v4.13-1' of ↵	Linus Torvalds	12	-26/+23
	git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling fixes from Jeff Layton: "The main rationale for all of these changes is to tighten up writeback error reporting to userland. There are many ways now that writeback errors can be lost, such that fsync/fdatasync/msync return 0 when writeback actually failed. This pile contains a small set of cleanups and writeback error handling fixes that I was able to break off from the main pile (#2). Two of the patches in this pile are trivial. The exceptions are the patch to fix up error handling in write_one_page, and the patch to make JFS pay attention to write_one_page errors" * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove call_fsync helper function mm: clean up error handling in write_one_page JFS: do not ignore return code from write_one_page() mm: drop "wait" parameter from write_one_page()
2017-07-07	Merge tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds	8	-24/+218
	Pull cifs fixes from Steve French: "First set of CIFS/SMB3 fixes for the merge window. Also improves POSIX character mapping for SMB3" * tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6: CIFS: fix circular locking dependency cifs: set oparms.create_options rather than or'ing in CREATE_OPEN_BACKUP_INTENT cifs: Do not modify mid entry after submitting I/O in cifs_call_async CIFS: add SFM mapping for 0x01-0x1F cifs: hide unused functions cifs: Use smb 2 - 3 and cifsacl mount options getacl functions cifs: prototype declaration and definition for smb 2 - 3 and cifsacl mount options CIFS: add CONFIG_CIFS_DEBUG_KEYS to dump encryption keys cifs: set mapping error when page writeback fails in writepage or launder_pages SMB3: Enable encryption for SMB3.1.1
2017-07-07	Merge tag 'gfs2-4.13.fixes.addendum' of ↵	Linus Torvalds	2	-2/+10
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 fix from Bob Peterson: "Sorry for the additional merge request, but Andreas discovered this problem soon after you processed our last gfs2 merge. This fixes a regression introduced by a patch we did in mid-2015 (commit 88ffbf3e037e: "GFS2: Use resizable hash table for glocks"), so best to get it fixed. Some code was reverted that should not have been. The patch from Andreas Gruenbacher just re-adds code that had been there originally" * tag 'gfs2-4.13.fixes.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix glock rhashtable rcu bug
2017-07-07	dentry name snapshots	Al Viro	6	-42/+48
	take_dentry_name_snapshot() takes a safe snapshot of dentry name; if the name is a short one, it gets copied into caller-supplied structure, otherwise an extra reference to external name is grabbed (those are never modified). In either case the pointer to stable string is stored into the same structure. dentry must be held by the caller of take_dentry_name_snapshot(), but may be freely dropped afterwards - the snapshot will stay until destroyed by release_dentry_name_snapshot(). Intended use: struct name_snapshot s; take_dentry_name_snapshot(&s, dentry); ... access s.name ... release_dentry_name_snapshot(&s); Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name to pass down with event. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07	Merge branch 'for-linus' of ↵	Linus Torvalds	12	-133/+261
	git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer fixes from James Morris: "Bugfixes for TPM and SELinux" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: IB/core: Fix static analysis warning in ib_policy_change_task IB/core: Fix uninitialized variable use in check_qp_port_pkey_settings tpm: do not suspend/resume if power stays on tpm: use tpm2_pcr_read() in tpm2_do_selftest() tpm: use tpm_buf functions in tpm2_pcr_read() tpm_tis: make ilb_base_addr static tpm: consolidate the TPM startup code tpm: Enable CLKRUN protocol for Braswell systems tpm/tpm_crb: fix priv->cmd_size initialisation tpm: fix a kernel memory leak in tpm-sysfs.c tpm: Issue a TPM2_Shutdown for TPM2 devices. Add "shutdown" to "struct class".
2017-07-08	net: Update networking MAINTAINERS entry.	David S. Miller	1	-2/+0
	James and Patrick haven't been active in years. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-07	Merge tag 'kbuild-thinar-v4.13' of ↵	Linus Torvalds	12	-72/+93
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild thin archives updates from Masahiro Yamada: "Thin archives migration by Nicholas Piggin. THIN_ARCHIVES has been available for a while as an optional feature only for PowerPC architecture, but we do not need two different intermediate-artifact schemes. Using thin archives instead of conventional incremental linking has various advantages: - save disk space for builds - speed-up building a little - fix some link issues (for example, allyesconfig on ARM) due to more flexibility for the final linking - work better with dead code elimination we are planning As discussed before, this migration has been done unconditionally so that any problems caused by this will show up with "git bisect". With testing with 0-day and linux-next, some architectures actually showed up problems, but they were trivial and all fixed now" * tag 'kbuild-thinar-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: tile: remove unneeded extra-y in Makefile kbuild: thin archives make default for all archs x86/um: thin archives build fix tile: thin archives fix linking ia64: thin archives fix linking sh: thin archives fix linking kbuild: handle libs-y archives separately from built-in.o archives kbuild: thin archives use P option to ar kbuild: thin archives final link close --whole-archives option ia64: remove unneeded extra-y in Makefile.gate tile: fix dependency and .*.cmd inclusion for incremental build sparc64: Use indirect calls in hamming weight stubs
2017-07-07	Merge tag 'kbuild-misc-v4.13' of ↵	Linus Torvalds	25	-26/+37
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull misc Kbuild updates from Masahiro Yamada: - Use more portable shebang for Perl scripts - Remove trailing spaces from GCC version in kernel log - Make initramfs generation deterministic * tag 'kbuild-misc-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: create deterministic initramfs directory listings scripts/mkcompile_h: Remove trailing spaces from compiler version scripts: Switch to more portable Perl shebang
2017-07-07	Merge tag 'kbuild-v4.13' of ↵	Linus Torvalds	12	-104/+67
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Clean up Makefiles and scripts - Improve clang support - Remove unneeded genhdr-y syntax - Remove unneeded cc-option-align macro - Introduce __cc-option macro and use it to fix x86 boot code compiler flags * tag 'kbuild-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: improve comments on KBUILD_SRC x86/build: Specify stack alignment for clang x86/build: Use __cc-option for boot code compiler options kbuild: Add __cc-option macro kbuild: remove cc-option-align kbuild: replace genhdr-y with generated-y kbuild: clang: Disable 'address-of-packed-member' warning kbuild: remove duplicated arch/*/include/generated/uapi include path kbuild: speed up checksyscalls.sh kbuild: simplify silent build (-s) detection
2017-07-07	Merge tag 'linux-kselftest-4.13-rc1-update' of ↵	Linus Torvalds	42	-509/+1011
	git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest updates from Shuah Khan: "This update consists of: - TAP13 framework and changes to some tests to convert to TAP13. Converting kselftest output to standard format will help identify run to run differences and pin point failures easily. TAP13 format has been in use for several years and the output is human friendly. Please find the specification: https://testanything.org/tap-version-13-specification.html Credit goes to Tim Bird for recommending TAP13 as a suitable format, and to Grag KH for kick starting the work with help from Paul Elder and Alice Ferrazzi The first phase of the TAp13 conversion is included in this update. Future updates will include updates to rest of the tests. - Masami Hiramatsu fixed ftrace to run on 4.9 stable kernels. - Kselftest documnetation has been converted to ReST format. Document now has a new home under Documentation/dev-tools. - kselftest_harness.h is now available for general use as a result of Mickaël Salaün's work. - Several fixes to skip and/or fail tests gracefully on older releases" * tag 'linux-kselftest-4.13-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (48 commits) selftests: membarrier: use ksft_* var arg msg api selftests: breakpoints: breakpoint_test_arm64: convert test to use TAP13 selftests: breakpoints: step_after_suspend_test use ksft_* var arg msg api selftests: breakpoint_test: use ksft_* var arg msg api kselftest: add ksft_print_msg() function to output general information kselftest: make ksft_* output functions variadic selftests/capabilities: Fix the test_execve test selftests: intel_pstate: add .gitignore selftests: fix memory-hotplug test selftests: add missing test name in memory-hotplug test selftests: check percentage range for memory-hotplug test selftests: check hot-pluggagble memory for memory-hotplug test selftests: typo correction for memory-hotplug test selftests: ftrace: Use md5sum to take less time of checking logs tools/testing/selftests/sysctl: Add pre-check to the value of writes_strict kselftest.rst: do some adjustments after ReST conversion selftest/net/Makefile: Specify output with $(OUTPUT) selftest/intel_pstate/aperf: Use LDLIBS instead of LDFLAGS selftest/memfd/Makefile: Fix build error selftests: lib: Skip tests on missing test modules ...
2017-07-07	Merge tag 'openrisc-for-linus' of git://github.com/openrisc/linux	Linus Torvalds	5	-5/+6
	Pull OpenRISC updates from Stafford Horne: "Openrisc fixes for this 4.13 merge window, there is not really much here: - include cleanups, one with should reduce build time slightly - switch to new toolchain to new (>2 year old) toolchain prefix" * tag 'openrisc-for-linus' of git://github.com/openrisc/linux: openrisc: defconfig: Cleanup from old Kconfig options openrisc: explicitly include linux/bug.h in asm/fixmap.h openrisc: Switch to use export.h instead of module.h openrisc: Change toolchain from or32- to or1k-
2017-07-07	Merge tag 'powerpc-4.13-1' of ↵	Linus Torvalds	161	-1082/+4328
	git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for STRICT_KERNEL_RWX on 64-bit server CPUs. - Platform support for FSP2 (476fpe) board - Enable ZONE_DEVICE on 64-bit server CPUs. - Generic & powerpc spin loop primitives to optimise busy waiting - Convert VDSO update function to use new update_vsyscall() interface - Optimisations to hypercall/syscall/context-switch paths - Improvements to the CPU idle code on Power8 and Power9. As well as many other fixes and improvements. Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung Bauermann, Yang Li" * tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits) powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix powerpc/mm/hash: Implement mark_rodata_ro() for hash powerpc/vmlinux.lds: Align __init_begin to 16M powerpc/lib/code-patching: Use alternate map for patch_instruction() powerpc/xmon: Add patch_instruction() support for xmon powerpc/kprobes/optprobes: Use patch_instruction() powerpc/kprobes: Move kprobes over to patch_instruction() powerpc/mm/radix: Fix execute permissions for interrupt_vectors powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp() powerpc/64s: Blacklist rtas entry/exit from kprobes powerpc/64s: Blacklist functions invoked on a trap powerpc/64s: Un-blacklist system_call() from kprobes powerpc/64s: Move system_call() symbol to just after setting MSR_EE powerpc/64s: Blacklist system_call() and system_call_common() from kprobes powerpc/64s: Convert .L__replay_interrupt_return to a local label powerpc64/elfv1: Only dereference function descriptor for non-text symbols cxl: Export library to support IBM XSL powerpc/dts: Use #include "..." to include local DT powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8 ...
2017-07-07	vfs: fix flock compat thinko	Linus Torvalds	1	-15/+15
	Michael Ellerman reported that commit 8c6657cb50cb ("Switch flock copyin/copyout primitives to copy_{from,to}_user()") broke his networking on a bunch of PPC machines (64-bit kernel, 32-bit userspace). The reason is a brown-paper bug by that commit, which had the arguments to "copy_flock_fields()" in the wrong order, breaking the compat handling for file locking. Apparently very few people run 32-bit user space on x86 any more, so the PPC people got the honor of noticing this "feature". Michael also sent a minimal diff that just changed the order of the arguments in that macro. This is not that minimal diff. This not only changes the order of the arguments in the macro, it also changes them to be pointers (to be consistent with all the other uses of those pointers), and makes the functions that do all of this also have the proper "const" attribution on the source pointers in order to make issues like that (using the source as a destination) be really obvious. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07	Merge tag 'usb-4.13-rc1' of ↵	Linus Torvalds	5	-0/+21
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some remaining USB fixes for 4.13-rc1. They were originally scheduled for 4.12-final, but I didn't send them to you in time. Because of that, they were in a separate branch from the larger USB set of patches, so here they are in a separate pull request. Nothing major here a all, just three small patches: - some usb-serial new device ids - xhci bugfix for some crazy AMD hardware All of these have been in linux-next for a long time with no reported issues" * tag 'usb-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: xhci: Limit USB2 port wake support for AMD Promontory hosts USB: serial: qcserial: new Sierra Wireless EM7305 device ID USB: serial: option: add two Longcheer device ids
2017-07-07	Merge tag 'backlight-next-4.13' of ↵	Linus Torvalds	6	-9/+14
	git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Core Framework: - Report correct error status to user Fix-ups: - Move Backlight headers out of I2C (adp8860, adp8870)" * tag 'backlight-next-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: video: adp8870: move header file out of I2C realm backlight: adp8860: Move header file out of I2C realm backlight: Report error on failure
2017-07-07	Merge (most of) tag 'mfd-next-4.13' of ↵	Linus Torvalds	37	-183/+985
	git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "New Drivers: - Intel Cherry Trail Whiskey Cove PMIC - TI LP87565 PMIC New Device Support: - Add support for Cannonlake to intel-lpss-pci - Add support for Simatic IOT2000 to intel_quark_i2c_gpio New Functionality: - Add Regulator support (axp20x) Fix-ups: - Rework IRQ handling (intel_soc_pmic_bxtwc, rtsx_pcr, cros_ec) - Remove unused/unwelcome code (ipaq-micro, wm831x-core, da9062-core) - Provide deregistration on unbind (rn5t618) - Rework DT code/documentation (arizona) - Constify things (fsl-imx25-tsadc) - MAINTAINERS updates (DA9062/61) - Kconfig configuration adaptions (INTEL_SOC_PMIC, MFD_AXP20X_I2C) - Switch to DMI matching (intel_quark_i2c_gpio) - Provide an appropriate level of error checking (wm831x-{i2c,spi}, twl4030-irq, tc6393xb) - Make use of devm_* (resource handling) calls (intel_soc_pmic_bxtwc, stm32-timers, atmel-flexcom, cros_ec, fsl-imx25-tsadc, exynos-lpass, palmas, qcom-spmi-pmic, smsc-ece1099, motorola-cpcap)" [ Skipped the last commit in that series that added eight thousand lines of pointless repeated register definitions. - Linus ] * tag 'mfd-next-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (38 commits) mfd: Add LP87565 PMIC support mfd: cros_ec: Free IRQ on exit dt-bindings: vendor-prefixes: Add arctic to vendor prefix mfd: da9061: Fix to remove BBAT_CONT register from chip model mfd: da9061: Fix to remove BBAT_CONT register from chip model mfd: axp20x-i2c: Document that this must be builtin on x86 mfd: Add Cherry Trail Whiskey Cove PMIC driver mfd: tc6393xb: Handle return value of clk_prepare_enable mfd: intel_quark_i2c_gpio: Add support for SIMATIC IOT2000 platform mfd: intel_quark_i2c_gpio: Use dmi_system_id table for retrieving frequency mfd: motorola-cpcap: Use devm_of_platform_populate() mfd: smsc-ece: Use devm_of_platform_populate() mfd: qcom-spmi-pmic: Use devm_of_platform_populate() mfd: palmas: Use devm_of_platform_populate() mfd: exynos: Use devm_of_platform_populate() mfd: fsl-imx25: Use devm_of_platform_populate() mfd: cros_ec: Use devm_of_platform_populate() mfd: atmel: Use devm_of_platform_populate() mfd: stm32-timers: Use devm_of_platform_populate() mfd: intel_soc_pmic: Select designware i2c-bus driver ...
2017-07-07	Merge tag 'gpio-v4.13-1' of ↵	Linus Torvalds	49	-459/+1507
	git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.13 series. Some administrativa: I have a slew of 8250 serial patches and the new IOT2040 serial+GPIO driver coming in through this tree, along with a whole bunch of Exar 8250 fixes. These are ACKed by Greg and also hit drivers/platform/* where they are ACKed by Andy Shevchenko. Speaking about drivers/platform/* there is also a bunch of ACPI stuff coming through that route, again ACKed by Andy. The MCP23S08 changes are coming in here as well. You already have the commits in your tree, so this is just a result of sharing an immutable branch between pin control and GPIO. Core: - Export add/remove for lookup tables so that modules can export GPIO descriptor tables. - Handle GPIO sleep states: it is now possible to flag that a GPIO line may loose its state during suspend/resume of the system to save power. This is used in the Wolfson Micro Arizona driver. - ACPI-based GPIO was tightened up a lot around the edges. - Use bitmap_fill() to speed up a loop. New drivers: - Exar XRA1403 SPI-based GPIO. - MVEBU driver now supports Armada 7K and 8K. - LP87565 PMIC GPIO. - Renesas R-CAR R8A7743 (RZ/G1M). - The new IOT2040 8250 serial/GPIO also comes in through this changeset. Substantial driver changes: - Seriously fix the Exar 8250 GPIO portions to work. - The MCP23S08 was moved out to a pin control driver. - Convert MEVEBU to use regmap for register access. - Drop Vulcan support from the Broadcom driver. - Serious cleanup and improvement of the mockup driver, giving us a better test coverage. Misc: - Lots of janitorial clean up. - A bunch of documentation fixes" * tag 'gpio-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (70 commits) serial: exar: Add support for IOT2040 device gpio-exar/8250-exar: Make set of exported GPIOs configurable platform: Accept const properties serial: exar: Factor out platform hooks gpio-exar/8250-exar: Rearrange gpiochip parenthood gpio: exar: Fix iomap request gpio-exar/8250-exar: Do not even instantiate a GPIO device for Commtech cards serial: uapi: Add support for bus termination gpio: rcar: Add R8A7743 (RZ/G1M) support gpio: gpio-wcove: Fix GPIO control register offset calculation gpio: lp87565: Add support for GPIO gpio: dwapb: fix missing first irq for edgeboth irq type MAINTAINERS: Take maintainership for GPIO ACPI support gpio: exar: Fix reading of directions and values gpio: exar: Allocate resources on behalf of the platform device gpio-exar/8250-exar: Fix passing in of parent PCI device gpio: mockup: use devm_kcalloc() where applicable gpio: mockup: add myself as author gpio: mockup: improve the error message gpio: mockup: don't return magic numbers from probe() ...
2017-07-08	openrisc: defconfig: Cleanup from old Kconfig options	Krzysztof Kozlowski	1	-1/+0
	Remove old, dead Kconfig option INET_LRO. It is gone since commit 7bbf3cae65b6 ("ipv4: Remove inet_lro library"). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Stafford Horne <shorne@gmail.com>