aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sys.c
AgeCommit message (Collapse)AuthorFilesLines
12 dayspowerpc/dexcr: Add DEXCR prctl interfaceBenjamin Gray1-0/+16
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like. The interface is a pair of getter/setter prctl's that work on a single aspect at a time (multiple aspects at once is more difficult if there are different rules applied for each aspect, now or in future). The getter shows the current state of the process config, and the setter allows setting/clearing the aspect. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240417112325.728010-5-bgray@linux.ibm.com
2024-03-26prctl: generalize PR_SET_MDWE support check to be per-archZev Weiss1-2/+5
Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported". I noticed after a recent kernel update that my ARM926 system started segfaulting on any execve() after calling prctl(PR_SET_MDWE). After some investigation it appears that ARMv5 is incapable of providing the appropriate protections for MDWE, since any readable memory is also implicitly executable. The prctl_set_mdwe() function already had some special-case logic added disabling it on PARISC (commit 793838138c15, "prctl: Disable prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that check to use an arch_*() function, and (2) adds a corresponding override for ARM to disable MDWE on pre-ARMv6 CPUs. With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can succeed instead of unconditionally failing; on ARMv6 the prctl works as it did previously. [0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/ This patch (of 2): There exist systems other than PARISC where MDWE may not be feasible to support; rather than cluttering up the generic code with additional arch-specific logic let's add a generic function for checking MDWE support and allow each arch to override it as needed. Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sam James <sam@gentoo.org> Cc: Stefan Roesch <shr@devkernel.io> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> [6.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07getrusage: use sig->stats_lock rather than lock_task_sighand()Oleg Nesterov1-3/+13
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call getrusage() at the same time and the process has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS * NR_THREADS) time. Change getrusage() to use sig->stats_lock, it was specifically designed for this type of use. This way it runs lockless in the likely case. TODO: - Change do_task_stat() to use sig->stats_lock too, then we can remove spin_lock_irq(siglock) in wait_task_zombie(). - Turn sig->stats_lock into seqcount_rwlock_t, this way the readers in the slow mode won't exclude each other. See https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/ - stats_lock has to disable irqs because ->siglock can be taken in irq context, it would be very nice to change __exit_signal() to avoid the siglock->stats_lock dependency. Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()Oleg Nesterov1-15/+19
Patch series "getrusage: use sig->stats_lock", v2. This patch (of 2): thread_group_cputime() does its own locking, we can safely shift thread_group_cputime_adjusted() which does another for_each_thread loop outside of ->siglock protected section. This is also preparation for the next patch which changes getrusage() to use stats_lock instead of siglock, thread_group_cputime() takes the same lock. With the current implementation recursive read_seqbegin_or_lock() is fine, thread_group_cputime() can't enter the slow mode if the caller holds stats_lock, yet this looks more safe and better performance-wise. Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-18prctl: Disable prctl(PR_SET_MDWE) on pariscHelge Deller1-0/+4
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute functionality, but fails on parisc which still needs executable stacks in certain combinations of gcc/glibc/kernel. Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until userspace has catched up. Signed-off-by: Helge Deller <deller@gmx.de> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Sam James <sam@gentoo.org> Closes: https://github.com/systemd/systemd/issues/29775 Tested-by: Sam James <sam@gentoo.org> Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/ Cc: <stable@vger.kernel.org> # v6.3+
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds1-21/+20
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-10-06mm: add a NO_INHERIT flag to the PR_SET_MDWE prctlFlorent Revest1-6/+26
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children. To implement this no-inherit mode, the tag in current->mm->flags must be absent from MMF_INIT_MASK. This means that the encoding for "MDWE but without inherit" is different in the prctl than in the mm flags. This leads to a bit of bit-mangling in the prctl implementation. Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04getrusage: use __for_each_thread()Oleg Nesterov1-3/+1
do/while_each_thread should be avoided when possible. Plus this change allows to avoid lock_task_sighand(), we can use rcu and/or sig->stats_lock instead. Link: https://lkml.kernel.org/r/20230909172629.GA20454@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04getrusage: add the "signal_struct *sig" local variableOleg Nesterov1-18/+19
No functional changes, cleanup/preparation. Link: https://lkml.kernel.org/r/20230909172554.GA20441@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17prctl: move PR_GET_AUXV out of PR_MCE_KILLMiguel Ojeda1-5/+5
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was applied [1]. Thus move it out of the switch, to the place the patch added it. In the recently released v6.4 kernel some user could, in principle, be already using this feature by mapping the right page and passing the PR_GET_AUXV constant as a pointer: prctl(PR_MCE_KILL, PR_GET_AUXV, ...) So this does change the behavior for users. We could keep the bug since the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET) do not overlap. However, v6.4 may be recent enough (2 weeks old) that moving the lines (rather than just adding a new case) does not break anybody? Moreover, the documentation in man-pages was just committed today [2]. Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace") Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1] Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2] Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-08riscv: Add prctl controls for userspace vector managementAndy Chiu1-0/+12
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit: * PR_RISCV_V_SET_CONTROL: control the permission to use Vector at next, or all following execve for a thread. Turning off a thread's Vector live is not possible since libraries may have registered ifunc that may execute Vector instructions. * PR_RISCV_V_GET_CONTROL: get the same permission setting for the current thread, and the setting for following execve(s). Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Reviewed-by: Greentime Hu <greentime.hu@sifive.com> Reviewed-by: Vincent Chen <vincent.chen@sifive.com> Link: https://lore.kernel.org/r/20230605110724.21391-22-andy.chiu@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-05-02mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0David Hildenbrand1-9/+3
Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2. (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code. This patch (of 3): Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1. Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Stefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: add new api to enable ksm per processStefan Roesch1-0/+27
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18prctl: add PR_GET_AUXV to copy auxv to userspaceJosh Triplett1-0/+15
If a library wants to get information from auxv (for instance, AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable or ideal: - Be main or the pre-main startup code, and grub through the stack above main. Doesn't work for a library. - Call libc getauxval. Not ideal for libraries that are trying to be libc-independent and/or don't otherwise require anything from other libraries. - Open and read /proc/self/auxv. Doesn't work for libraries that may run in arbitrarily constrained environments that may not have /proc mounted (e.g. libraries that might be used by an init program or a container setup tool). - Assume you're on the main thread and still on the original stack, and try to walk the stack upwards, hoping to find auxv. Extremely bad idea. - Ask the caller to pass auxv in for you. Not ideal for a user-friendly library, and then your caller may have the same problem. Add a prctl that copies current->mm->saved_auxv to a userspace buffer. Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()Ondrej Mosnacek1-29/+40
Linux Security Modules (LSMs) that implement the "capable" hook will usually emit an access denial message to the audit log whenever they "block" the current task from using the given capability based on their security policy. The occurrence of a denial is used as an indication that the given task has attempted an operation that requires the given access permission, so the callers of functions that perform LSM permission checks must take care to avoid calling them too early (before it is decided if the permission is actually needed to perform the requested operation). The __sys_setres[ug]id() functions violate this convention by first calling ns_capable_setid() and only then checking if the operation requires the capability or not. It means that any caller that has the capability granted by DAC (task's capability set) but not by MAC (LSMs) will generate a "denied" audit record, even if is doing an operation for which the capability is not required. Fix this by reordering the checks such that ns_capable_setid() is checked last and -EPERM is returned immediately if it returns false. While there, also do two small optimizations: * move the capability check before prepare_creds() and * bail out early in case of a no-op. Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds1-0/+33
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-02mm: implement memory-deny-write-execute as a prctlJoey Gouly1-0/+33
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2. The background to this is that systemd has a configuration option called MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim is to prevent a user task from inadvertently creating an executable mapping that is (or was) writeable. Since such BPF filter is stateless, it cannot detect mappings that were previously writeable but subsequently changed to read-only. Therefore the filter simply rejects any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support (Branch Target Identification), the dynamic loader cannot change an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For libraries, it can resort to unmapping and re-mapping but for the main executable it does not have a file descriptor. The original bug report in the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries - [4]. This series adds in-kernel support for this feature as a prctl PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE | PROT_EXEC mappings. Like the systemd BPF filter it also denies adding PROT_EXEC to mappings. However unlike the BPF filter it only denies it if the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC -> PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF filter. This patch (of 2): The aim of such policy is to prevent a user task from creating an executable mapping that is also writeable. An example of mmap() returning -EACCESS if the policy is enabled: mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); Similarly, mprotect() would return -EACCESS below: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); The BPF filter that systemd MDWE uses is stateless, and disallows mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to be enabled if it was already PROT_EXEC, which allows the following case: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); where PROT_BTI enables branch tracking identification on arm64. Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com Signed-off-by: Joey Gouly <joey.gouly@arm.com> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jeremy Linton <jeremy.linton@arm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Mark Brown <broonie@kernel.org> Cc: nd <nd@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-21prlimit: do_prlimit needs to have a speculation checkGreg Kroah-Hartman1-0/+2
do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepath from being used as a spectre "gadget" a barrier needs to be added after checking the range. Reported-by: Jordy Zomer <jordyzomer@google.com> Tested-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-10Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Huawei reported that when they updated their kernel from 4.4 to something much newer, some userspace code they had broke, the culprit being the accidental removal of O_NONBLOCK from /dev/random way back in 5.6. It's been gone for over 2 years now and this is the first we've heard of it, but userspace breakage is userspace breakage, so O_NONBLOCK is now back. - Use randomness from hardware RNGs much more often during early boot, at the same interval that crng reseeds are done, from Dominik. - A semantic change in hardware RNG throttling, so that the hwrng framework can properly feed random.c with randomness from hardware RNGs that aren't specifically marked as creditable. A related patch coming to you via Herbert's hwrng tree depends on this one, not to compile, but just to function properly, so you may want to merge this PULL before that one. - A fix to clamp credited bits from the interrupts pool to the size of the pool sample. This is mainly just a theoretical fix, as it'd be pretty hard to exceed it in practice. - Oracle reported that InfiniBand TCP latency regressed by around 10-15% after a change a few cycles ago made at the request of the RT folks, in which we hoisted a somewhat rare operation (1 in 1024 times) out of the hard IRQ handler and into a workqueue, a pretty common and boring pattern. It turns out, though, that scheduling a worker from there has overhead of its own, whereas scheduling a timer on that same CPU for the next jiffy amortizes better and doesn't incur the same overhead. I also eliminated a cache miss by moving the work_struct (and subsequently, the timer_list) to below a critical cache line, so that the more critical members that are accessed on every hard IRQ aren't split between two cache lines. - The boot-time initialization of the RNG has been split into two approximate phases: what we can accomplish before timekeeping is possible and what we can accomplish after. This winds up being useful so that we can use RDRAND to seed the RNG before CONFIG_SLAB_FREELIST_RANDOM=y systems initialize slabs, in addition to other early uses of randomness. The effect is that systems with RDRAND (or a bootloader seed) will never see any warnings at all when setting CONFIG_WARN_ALL_UNSEEDED_RANDOM=y. And kfence benefits from getting a better seed of its own. - Small systems without much entropy sometimes wind up putting some truncated serial number read from flash into hostname, so contribute utsname changes to the RNG, without crediting. - Add smaller batches to serve requests for smaller integers, and make use of them when people ask for random numbers bounded by a given compile-time constant. This has positive effects all over the tree, most notably in networking and kfence. - The original jitter algorithm intended (I believe) to schedule the timer for the next jiffy, not the next-next jiffy, yet it used mod_timer(jiffies + 1), which will fire on the next-next jiffy, instead of what I believe was intended, mod_timer(jiffies), which will fire on the next jiffy. So fix that. - Fix a comment typo, from William. * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: random: clear new batches when bringing new CPUs online random: fix typos in get_random_bytes() comment random: schedule jitter credit for next jiffy, not in two jiffies prandom: make use of smaller types in prandom_u32_max random: add 8-bit and 16-bit batches utsname: contribute changes to RNG random: use init_utsname() instead of utsname() kfence: use better stack hash seed random: split initialization into early step and later step random: use expired timer rather than wq for mixing fast pool random: avoid reading two cache lines on irq randomness random: clamp credited irq bits to maximum mixed random: throttle hwrng writes if no entropy is credited random: use hwgenerator randomness more frequently at early boot random: restore O_NONBLOCK support
2022-10-09Merge tag 'ucount-rlimits-cleanups-for-v5.19' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts update from Eric Biederman: "Split rlimit and ucount values and max values After the ucount rlimit code was merged a bunch of small but siginificant bugs were found and fixed. At the time it was realized that part of the problem was that while the ucount rlimits were very similar to the oridinary ucounts (in being nested counts with limits) the semantics were slightly different and the code would be less error prone if there was less sharing. This is the long awaited cleanup that should hopefully keep things more comprehensible and less error prone for whoever needs to touch that code next" * tag 'ucount-rlimits-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Split rlimit and ucount values and max values
2022-09-29utsname: contribute changes to RNGJason A. Donenfeld1-0/+3
On some small machines with little entropy, a quasi-unique hostname is sometimes a relevant factor. I've seen, for example, 8 character alpha-numeric serial numbers. In addition, the time at which the hostname is set is usually a decent measurement of how long early boot took. So, call add_device_randomness() on new hostnames, which feeds its arguments to the RNG in addition to a fresh cycle counter. Low cost hooks like this never hurt and can only ever help, and since this costs basically nothing for an operation that is never a fast path, this is an overall easy win. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18ucounts: Split rlimit and ucount values and max valuesAlexey Gladkov1-1/+1
Since the semantics of maximum rlimit values are different, it would be better not to mix ucount and rlimit values. This will prevent the error of using inc_count/dec_ucount for rlimit parameters. This patch also renames the functions to emphasize the lack of connection between rlimit and ucount. v3: - Fix BUG:KASAN:use-after-free_in_dec_ucount. v2: - Fix the array-index-out-of-bounds that was found by the lkp project. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-04-22arm64/sme: Implement vector length configuration prctl()sMark Brown1-0/+12
As for SVE provide a prctl() interface which allows processes to configure their SME vector length. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220419112247.711548-12-broonie@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-03-08prlimit: do not grab the tasklist_lockBarret Rhoden1-11/+14
Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck for workloads that also must grab the tasklist_lock for waiting, killing, and cloning. The tasklist_lock was grabbed to protect tsk->sighand from disappearing (becoming NULL). tsk->signal was already protected by holding a reference to tsk. update_rlimit_cpu() assumed tsk->sighand != NULL. With this commit, it attempts to lock_task_sighand(). However, this means that update_rlimit_cpu() can fail. This only happens when a task is exiting. Note that during exec, sighand may *change*, but it will not be NULL. Prior to this commit, the do_prlimit() ensured that update_rlimit_cpu() would not fail by read locking the tasklist_lock and checking tsk->sighand != NULL. If update_rlimit_cpu() fails, there may be other tasks that are not exiting that share tsk->signal. However, the group_leader is the last task to be released, so if we cannot update_rlimit_cpu(group_leader), then the entire process is exiting. The only other caller of update_rlimit_cpu() is selinux_bprm_committing_creds(). It has tsk == current, so update_rlimit_cpu() cannot fail (current->sighand cannot disappear until current exits). This change resulted in a 14% speedup on a microbenchmark where parents kill and wait on their children, and children getpriority, setpriority, and getrlimit. Signed-off-by: Barret Rhoden <brho@google.com> Link: https://lkml.kernel.org/r/20220106172041.522167-4-brho@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-03-08prlimit: make do_prlimit() staticBarret Rhoden1-57/+59
There are no other callers in the kernel. Fixed up a comment format and whitespace issue when moving do_prlimit() higher in sys.c. Signed-off-by: Barret Rhoden <brho@google.com> Link: https://lkml.kernel.org/r/20220106172041.522167-3-brho@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-03-05mm: refactor vm_area_struct::anon_vma_name usage codeSuren Baghdasaryan1-7/+12
Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Colin Cross <ccross@google.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Xiaofeng Cao <caoxiaofeng@yulong.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-17ucounts: Move RLIMIT_NPROC handling after set_userEric W. Biederman1-5/+14
During set*id() which cred->ucounts to charge the the current process to is not known until after set_cred_ucounts. So move the RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call flag_nproc_exceeded after set_cred_ucounts. This is very much an arbitrary subset of the places where we currently change the RLIMIT_NPROC accounting, designed to preserve the existing logic. Fixing the existing logic will be the subject of another series of changes. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220216155832.680775-4-ebiederm@xmission.com Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in ↵Eric W. Biederman1-2/+1
set_user Solar Designer <solar@openwall.com> wrote: > I'm not aware of anyone actually running into this issue and reporting > it. The systems that I personally know use suexec along with rlimits > still run older/distro kernels, so would not yet be affected. > > So my mention was based on my understanding of how suexec works, and > code review. Specifically, Apache httpd has the setting RLimitNPROC, > which makes it set RLIMIT_NPROC: > > https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc > > The above documentation for it includes: > > "This applies to processes forked from Apache httpd children servicing > requests, not the Apache httpd children themselves. This includes CGI > scripts and SSI exec commands, but not any processes forked from the > Apache httpd parent, such as piped logs." > > In code, there are: > > ./modules/generators/mod_cgid.c: ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/generators/mod_cgi.c: ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/filters/mod_ext_filter.c: rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc); > > For example, in mod_cgi.c this is in run_cgi_child(). > > I think this means an httpd child sets RLIMIT_NPROC shortly before it > execs suexec, which is a SUID root program. suexec then switches to the > target user and execs the CGI script. > > Before 2863643fb8b9, the setuid() in suexec would set the flag, and the > target user's process count would be checked against RLIMIT_NPROC on > execve(). After 2863643fb8b9, the setuid() in suexec wouldn't set the > flag because setuid() is (naturally) called when the process is still > running as root (thus, has those limits bypass capabilities), and > accordingly execve() would not check the target user's process count > against RLIMIT_NPROC. In commit 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to make it more consistent with fork. Unfortunately because of call site differences those capable calls were checking the credentials of the user before set*id() instead of after set*id(). This breaks enforcement of RLIMIT_NPROC for applications that set the rlimit and then call set*id() while holding a full set of capabilities. The capabilities are only changed in the new credential in security_task_fix_setuid(). The code in apache suexec appears to follow this pattern. Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") where this check was added describes the targes of this capability check as: 2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then calls setuid, the setuid should fail if the user would then be running more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't. This patch adds an appropriate test. With this patch, and per-user process limit imposed in cgiwrap really works. So the original use case of this check also appears to match the broken pattern. Restore the enforcement of RLIMIT_NPROC by removing the bad capable checks added in set_user. This unfortunately restores the inconsistent state the code has been in for the last 11 years, but dealing with the inconsistencies looks like a larger problem. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/ Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com Link: https://lkml.kernel.org/r/20220216155832.680775-1-ebiederm@xmission.com Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Reviewed-by: Solar Designer <solar@openwall.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-01-20Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-8/+8
Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ...
2022-01-20kernel/sys.c: only take tasklist_lock for get/setpriority(PRIO_PGRP)Davidlohr Bueso1-8/+8
PRIO_PGRP needs the tasklist_lock mainly to serialize vs setpgid(2), to protect against any concurrent change_pid(PIDTYPE_PGID) that can move the task from one hlist to another while iterating. However, the remaining can only rely only on RCU: PRIO_PROCESS only does the task lookup and never iterates over tasklist and we already have an rcu-aware stable pointer. PRIO_USER is already racy vs setuid(2) so with creds being rcu protected, we can end up seeing stale data. When removing the tasklist_lock there can be a race with (i) fork but this is benign as the child's nice is inherited and the new task is not observable by the user yet either, hence the return semantics do not differ. And (ii) a race with exit, which is a small window and can cause us to miss a task which was removed from the list and it had the highest nice. Similarly change the buggy do_each_thread/while_each_thread combo in PRIO_USER for the rcu-safe for_each_process_thread flavor, which doesn't make use of next_thread/p->thread_group. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20211210182250.43734-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15mm: add a field to store names for private anonymous memoryColin Cross1-0/+63
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-7/+0
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08prctl: allow to setup brk for et_dyn executablesCyrill Gorcunov1-7/+0
Keno Fischer reported that when a binray loaded via ld-linux-x the prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays before mm:end_data. For example a test program shows | # ~/t | | start_code 401000 | end_code 401a15 | start_stack 7ffce4577dd0 | start_data 403e10 | end_data 40408c | start_brk b5b000 | sbrk(0) b5b000 and when executed via ld-linux | # /lib64/ld-linux-x86-64.so.2 ~/t | | start_code 7fc25b0a4000 | end_code 7fc25b0c4524 | start_stack 7fffcc6b2400 | start_data 7fc25b0ce4c0 | end_data 7fc25b0cff98 | start_brk 55555710c000 | sbrk(0) 55555710c000 This of course prevent criu from restoring such programs. Looking into how kernel operates with brk/start_brk inside brk() syscall I don't see any problem if we allow to setup brk/start_brk without checking for end_data. Even if someone pass some weird address here on a purpose then the worst possible result will be an unexpected unmapping of existing vma (own vma, since prctl works with the callers memory) but test for RLIMIT_DATA is still valid and a user won't be able to gain more memory in case of expanding VMAs via new values shipped with prctl call. Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Keno Fischer <keno@juliacomputing.com> Acked-by: Andrey Vagin <avagin@gmail.com> Tested-by: Andrey Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-04Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linuxLinus Torvalds1-32/+1
Pull MAP_DENYWRITE removal from David Hildenbrand: "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove VM_DENYWRITE. There are some (minor) user-visible changes: - We no longer deny write access to shared libaries loaded via legacy uselib(); this behavior matches modern user space e.g. dlopen(). - We no longer deny write access to the elf interpreter after exec completed, treating it just like shared libraries (which it often is). - We always deny write access to the file linked via /proc/pid/exe: sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file cannot be denied, and write access to the file will remain denied until the link is effectivel gone (exec, termination, sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file. Cross-compiled for a bunch of architectures (alpha, microblaze, i386, s390x, ...) and verified via ltp that especially the relevant tests (i.e., creat07 and execve04) continue working as expected" * tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux: fs: update documentation of get_write_access() and friends mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() mm: remove VM_DENYWRITE binfmt: remove in-tree usage of MAP_DENYWRITE kernel/fork: always deny write access to current MM exe_file kernel/fork: factor out replacing the current MM exe_file binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
2021-09-03kernel/fork: factor out replacing the current MM exe_fileDavid Hildenbrand1-32/+1
Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2021-08-12set_user: add capability check when rlimit(RLIMIT_NPROC) exceedsRan Xiaokai1-1/+2
in copy_process(): non root users but with capability CAP_SYS_RESOURCE or CAP_SYS_ADMIN will clean PF_NPROC_EXCEEDED flag even rlimit(RLIMIT_NPROC) exceeds. Add the same capability check logic here. Align the permission checks in copy_process() and set_user(). In copy_process() CAP_SYS_RESOURCE or CAP_SYS_ADMIN capable users will be able to circumvent and clear the PF_NPROC_EXCEEDED flag whereas they aren't able to the same in set_user(). There's no obvious logic to this and trying to unearth the reason in the thread didn't go anywhere. The gist seems to be that this code wants to make sure that a program can't successfully exec if it has gone through a set*id() transition while exceeding its RLIMIT_NPROC. A capable but non-INIT_USER caller getting PF_NPROC_EXCEEDED set during a set*id() transition wouldn't be able to exec right away if they still exceed their RLIMIT_NPROC at the time of exec. So their exec would fail in fs/exec.c: if ((current->flags & PF_NPROC_EXCEEDED) && is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { retval = -EAGAIN; goto out_ret; } However, if the caller were to fork() right after the set*id() transition but before the exec while still exceeding their RLIMIT_NPROC then they would get PF_NPROC_EXCEEDED cleared (while the child would inherit it): retval = -EAGAIN; if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { if (p->real_cred->user != INIT_USER && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) goto bad_fork_free; } current->flags &= ~PF_NPROC_EXCEEDED; which means a subsequent exec by the capable caller would now succeed even though they could still exceed their RLIMIT_NPROC limit. This seems inconsistent. Allow a CAP_SYS_ADMIN or CAP_SYS_RESOURCE capable user to avoid PF_NPROC_EXCEEDED as they already can in copy_process(). Cc: peterz@infradead.org, tglx@linutronix.de, linux-kernel@vger.kernel.org, Ran Xiaokai <ran.xiaokai@zte.com.cn>, , , Link: https://lore.kernel.org/r/20210728072629.530435-1-ran.xiaokai@zte.com.cn Cc: Neil Brown <neilb@suse.de> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-06-28Merge branch 'for-linus' of ↵Linus Torvalds1-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace rlimit handling update from Eric Biederman: "This is the work mainly by Alexey Gladkov to limit rlimits to the rlimits of the user that created a user namespace, and to allow users to have stricter limits on the resources created within a user namespace." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: cred: add missing return error code when set_cred_ucounts() failed ucounts: Silence warning in dec_rlimit_ucounts ucounts: Set ucount_max to the largest positive value the type can hold kselftests: Add test to check for rlimit changes in different user namespaces Reimplement RLIMIT_MEMLOCK on top of ucounts Reimplement RLIMIT_SIGPENDING on top of ucounts Reimplement RLIMIT_MSGQUEUE on top of ucounts Reimplement RLIMIT_NPROC on top of ucounts Use atomic_t for ucounts reference counting Add a reference to ucounts for each cred Increase size of ucounts to atomic_long_t
2021-05-12sched: prctl() core-scheduling interfaceChris Hyser1-0/+5
This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID). The value of core scheduling isn't that tasks don't share a core, 'nosmt' can do that. The value lies in exploiting all the sharing opportunities that exist to recover possible lost performance and that requires a degree of flexibility in the API. From a security perspective (and there are others), the thread, process and process group distinction is an existent hierarchal categorization of tasks that reflects many of the security concerns about 'data sharing'. For example, protecting against cache-snooping by a thread that can just read the memory directly isn't all that useful. With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a mechanism to create and share cookies. CREATE/SHARE_TO specify a target pid with enum pidtype used to specify the scope of the targeted tasks. For example, PIDTYPE_TGID will share the cookie with the process and all of it's threads as typically desired in a security scenario. API: prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie) prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL) where 'tgtpid/srcpid == 0' implies the current process and pidtype is kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}. For return values, EINVAL, ENOMEM are what they say. ESRCH means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access to tgtpid/srcpid. ENODEV indicates your machines lacks SMT. [peterz: complete rewrite] Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
2021-05-07kernel/sys.c: fix typoXiaofeng Cao1-7/+7
change 'infite' to 'infinite' change 'concurent' to 'concurrent' change 'memvers' to 'members' change 'decendants' to 'descendants' change 'argumets' to 'arguments' Link: https://lkml.kernel.org/r/20210316112904.10661-1-cxfcosmos@gmail.com Signed-off-by: Xiaofeng Cao <caoxiaofeng@yulong.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30Reimplement RLIMIT_NPROC on top of ucountsAlexey Gladkov1-1/+1
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never fork the service wants to set RLIMIT_NPROC=1. service-A \- program (uid=1000, container1, rlimit_nproc=1) \- program (uid=1000, container2, rlimit_nproc=1) The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails since user X already has one running process. We cannot use existing inc_ucounts / dec_ucounts because they do not allow us to exceed the maximum for the counter. Some rlimits can be overlimited by root or if the user has the appropriate capability. Changelog v11: * Change inc_rlimit_ucounts() which now returns top value of ucounts. * Drop inc_rlimit_ucounts_and_test() because the return code of inc_rlimit_ucounts() can be checked. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30Add a reference to ucounts for each credAlexey Gladkov1-0/+12
For RLIMIT_NPROC and some other rlimits the user_struct that holds the global limit is kept alive for the lifetime of a process by keeping it in struct cred. Adding a pointer to ucounts in the struct cred will allow to track RLIMIT_NPROC not only for user in the system, but for user in the user_namespace. Updating ucounts may require memory allocation which may fail. So, we cannot change cred.ucounts in the commit_creds() because this function cannot fail and it should always return 0. For this reason, we modify cred.ucounts before calling the commit_creds(). Changelog v6: * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This error was caused by the fact that cred_alloc_blank() left the ucounts pointer empty. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-13arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)Peter Collingbourne1-0/+16
This change introduces a prctl that allows the user program to control which PAC keys are enabled in a particular task. The main reason why this is useful is to enable a userspace ABI that uses PAC to sign and authenticate function pointers and other pointers exposed outside of the function, while still allowing binaries conforming to the ABI to interoperate with legacy binaries that do not sign or authenticate pointers. The idea is that a dynamic loader or early startup code would issue this prctl very early after establishing that a process may load legacy binaries, but before executing any PAC instructions. This change adds a small amount of overhead to kernel entry and exit due to additional required instruction sequences. On a DragonBoard 845c (Cortex-A75) with the powersave governor, the overhead of similar instruction sequences was measured as 4.9ns when simulating the common case where IA is left enabled, or 43.7ns when simulating the uncommon case where IA is disabled. These numbers can be seen as the worst case scenario, since in more realistic scenarios a better performing governor would be used and a newer chip would be used that would support PAC unlike Cortex-A75 and would be expected to be faster than Cortex-A75. On an Apple M1 under a hypervisor, the overhead of the entry/exit instruction sequences introduced by this patch was measured as 0.3ns in the case where IA is left enabled, and 33.0ns in the case where IA is disabled. Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5 Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-03-14prctl: fix PR_SET_MM_AUXV kernel stack leakAlexey Dobriyan1-1/+1
Doing a prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1); will copy 1 byte from userspace to (quite big) on-stack array and then stash everything to mm->saved_auxv. AT_NULL terminator will be inserted at the very end. /proc/*/auxv handler will find that AT_NULL terminator and copy original stack contents to userspace. This devious scheme requires CAP_SYS_RESOURCE. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-25Merge tag 'kbuild-v5.12' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Fix false-positive build warnings for ARCH=ia64 builds - Optimize dictionary size for module compression with xz - Check the compiler and linker versions in Kconfig - Fix misuse of extra-y - Support DWARF v5 debug info - Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x exceeded the limit - Add generic syscall{tbl,hdr}.sh for cleanups across arches - Minor cleanups of genksyms - Minor cleanups of Kconfig * tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits) initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD kbuild: remove deprecated 'always' and 'hostprogs-y/m' kbuild: parse C= and M= before changing the working directory kbuild: reuse this-makefile to define abs_srctree kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig kconfig: omit --oldaskconfig option for 'make config' kconfig: fix 'invalid option' for help option kconfig: remove dead code in conf_askvalue() kconfig: clean up nested if-conditionals in check_conf() kconfig: Remove duplicate call to sym_get_string_value() Makefile: Remove # characters from compiler string Makefile: reuse CC_VERSION_TEXT kbuild: check the minimum linker version in Kconfig kbuild: remove ld-version macro scripts: add generic syscallhdr.sh scripts: add generic syscalltbl.sh arch: syscalls: remove $(srctree)/ prefix from syscall tables arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work gen_compile_commands: prune some directories kbuild: simplify access to the kernel's version ...
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-16kbuild: simplify access to the kernel's versionSasha Levin1-1/+1
Instead of storing the version in a single integer and having various kernel (and userspace) code how it's constructed, export individual (major, patchlevel, sublevel) components and simplify kernel code that uses it. This should also make it easier on userspace. Signed-off-by: Sasha Levin <sashal@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-01-29fs: Remove dcookies supportViresh Kumar1-1/+0
The dcookies stuff was only used by the kernel's old oprofile code. Now that oprofile's support is removed from the kernel, there is no need for dcookies as well. Remove it. Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Robert Richter <rric@kernel.org> Acked-by: William Cohen <wcohen@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de>
2021-01-24fs: add file and path permissions helpersChristian Brauner1-1/+1
Add two simple helpers to check permissions on a file and path respectively and convert over some callers. It simplifies quite a few codepaths and also reduces the churn in later patches quite a bit. Christoph also correctly points out that this makes codepaths (e.g. ioctls) way easier to follow that would otherwise have to do more complex argument passing than necessary. Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-02kernel: Implement selective syscall userspace redirectionGabriel Krisman Bertazi1-0/+5
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-10-25kernel/sys.c: fix prototype of prctl_get_tid_address()Rasmus Villemoes1-3/+3
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in fact a "pointer to (pointer to int in userspace) in userspace". So sparse rightfully complains about passing a kernel pointer to put_user(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linuxLinus Torvalds1-5/+5
Pull SafeSetID updates from Micah Morton: "The changes are mostly contained to within the SafeSetID LSM, with the exception of a few 1-line changes to change some ns_capable() calls to ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that is examined by SafeSetID code and nothing else in the kernel. The changes to SafeSetID internally allow for setting up GID transition security policies, as already existed for UIDs" * tag 'safesetid-5.10' of git://github.com/micah-morton/linux: LSM: SafeSetID: Fix warnings reported by test bot LSM: SafeSetID: Add GID security policy handling LSM: Signal to SafeSetID when setting group IDs
2020-10-16kernel/sys.c: replace do_brk with do_brk_flags in comment of prctl_set_mm_map()Liao Pingfang1-1/+1
Replace do_brk with do_brk_flags in comment of prctl_set_mm_map(), since do_brk was removed in following commit. Fixes: bb177a732c4369 ("mm: do not bug_on on incorrect length in __mm_populate()") Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/1600650751-43127-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13LSM: Signal to SafeSetID when setting group IDsThomas Cedeno1-5/+5
For SafeSetID to properly gate set*gid() calls, it needs to know whether ns_capable() is being called from within a sys_set*gid() function or is being called from elsewhere in the kernel. This allows SafeSetID to deny CAP_SETGID to restricted groups when they are attempting to use the capability for code paths other than updating GIDs (e.g. setting up userns GID mappings). This is the identical approach to what is currently done for CAP_SETUID. NOTE: We also add signaling to SafeSetID from the setgroups() syscall, as we have future plans to restrict a process' ability to set supplementary groups in addition to what is added in this series for restricting setting of the primary group. Signed-off-by: Thomas Cedeno <thomascedeno@google.com> Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-07-19prctl: exe link permission error changed from -EINVAL to -EPERMNicolas Viennot1-1/+1
This brings consistency with the rest of the prctl() syscall where -EPERM is returned when failing a capability check. Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com> Signed-off-by: Adrian Reber <areber@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Link: https://lore.kernel.org/r/20200719100418.2112740-7-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exeNicolas Viennot1-4/+7
Originally, only a local CAP_SYS_ADMIN could change the exe link, making it difficult for doing checkpoint/restore without CAP_SYS_ADMIN. This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN for permitting changing the exe link. The following describes the history of the /proc/self/exe permission checks as it may be difficult to understand what decisions lead to this point. * [1] May 2012: This commit introduces the ability of changing /proc/self/exe if the user is CAP_SYS_RESOURCE capable. In the related discussion [2], no clear thread model is presented for what could happen if the /proc/self/exe changes multiple times, or why would the admin be at the mercy of userspace. * [3] Oct 2014: This commit introduces a new API to change /proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE, but instead checks if the current user is root (uid=0) in its local namespace. In the related discussion [4] it is said that "Controlling exe_fd without privileges may turn out to be dangerous. At least things like tomoyo examine it for making policy decisions (see tomoyo_manager())." * [5] Dec 2016: This commit removes the restriction to change /proc/self/exe at most once. The related discussion [6] informs that the audit subsystem relies on the exe symlink, presumably audit_log_d_path_exe() in kernel/audit.c. * [7] May 2017: This commit changed the check from uid==0 to local CAP_SYS_ADMIN. No discussion. * [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace is demonstrated Overall, the concrete points that were made to retain capability checks around changing the exe symlink is that tomoyo_manager() and audit_log_d_path_exe() uses the exe_file path. Christian Brauner said that relying on /proc/<pid>/exe being immutable (or guarded by caps) in a sake of security is a bit misleading. It can only be used as a hint without any guarantees of what code is being executed once execve() returns to userspace. Christian suggested that in the future, we could call audit_log() or similar to inform the admin of all exe link changes, instead of attempting to provide security guarantees via permission checks. However, this proposed change requires the understanding of the security implications in the tomoyo/audit subsystems. [1] b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file") [2] https://lore.kernel.org/patchwork/patch/292515/ [3] f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation") [4] https://lore.kernel.org/patchwork/patch/479359/ [5] 3fb4afd9a504 ("prctl: remove one-shot limitation for changing exe link") [6] https://lore.kernel.org/patchwork/patch/697304/ [7] 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file") [8] https://github.com/nviennot/run_as_exe Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com> Signed-off-by: Adrian Reber <areber@redhat.com> Link: https://lore.kernel.org/r/20200719100418.2112740-6-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-14Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of ↵Linus Torvalds1-1/+14
git://github.com/micah-morton/linux Pull SafeSetID update from Micah Morton: "Add additional LSM hooks for SafeSetID SafeSetID is capable of making allow/deny decisions for set*uid calls on a system, and we want to add similar functionality for set*gid calls. The work to do that is not yet complete, so probably won't make it in for v5.8, but we are looking to get this simple patch in for v5.8 since we have it ready. We are planning on the rest of the work for extending the SafeSetID LSM being merged during the v5.9 merge window" * tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux: security: Add LSM hooks to set*gid syscalls
2020-06-14security: Add LSM hooks to set*gid syscallsThomas Cedeno1-1/+14
The SafeSetID LSM uses the security_task_fix_setuid hook to filter set*uid() syscalls according to its configured security policy. In preparation for adding analagous support in the LSM for set*gid() syscalls, we add the requisite hook here. Tested by putting print statements in the security_task_fix_setgid hook and seeing them get hit during kernel boot. Signed-off-by: Thomas Cedeno <thomascedeno@google.com> Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse1-2/+2
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse1-9/+9
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge updates from Andrew Morton: "A few little subsystems and a start of a lot of MM patches. Subsystems affected by this patch series: squashfs, ocfs2, parisc, vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup, swap, memcg, pagemap, memory-failure, vmalloc, kasan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits) kasan: move kasan_report() into report.c mm/mm_init.c: report kasan-tag information stored in page->flags ubsan: entirely disable alignment checks under UBSAN_TRAP kasan: fix clang compilation warning due to stack protector x86/mm: remove vmalloc faulting mm: remove vmalloc_sync_(un)mappings() x86/mm/32: implement arch_sync_kernel_mappings() x86/mm/64: implement arch_sync_kernel_mappings() mm/ioremap: track which page-table levels were modified mm/vmalloc: track which page-table levels were modified mm: add functions to track page directory modifications s390: use __vmalloc_node in stack_alloc powerpc: use __vmalloc_node in alloc_vm_stack arm64: use __vmalloc_node in arch_alloc_vmap_stack mm: remove vmalloc_user_node_flags mm: switch the test_vmalloc module to use __vmalloc_node mm: remove __vmalloc_node_flags_caller mm: remove both instances of __vmalloc_node_flags mm: remove the prot argument to __vmalloc_node mm: remove the pgprot argument to __vmalloc ...
2020-06-02mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLENeilBrown1-1/+1
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-25compat sysinfo(2): don't bother with field-by-field copyoutAl Viro1-16/+17
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-03sys/sysinfo: Respect boottime inside time namespaceCyril Hrubis1-0/+2
The sysinfo() syscall includes uptime in seconds but has no correction for time namespaces which makes it inconsistent with the /proc/uptime inside of a time namespace. Add the missing time namespace adjustment call. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <dima@arista.com> Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz
2020-01-28prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaimMike Christie1-0/+25
There are several storage drivers like dm-multipath, iscsi, tcmu-runner, amd nbd that have userspace components that can run in the IO path. For example, iscsi and nbd's userspace deamons may need to recreate a socket and/or send IO on it, and dm-multipath's daemon multipathd may need to send SG IO or read/write IO to figure out the state of paths and re-set them up. In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the memalloc_*_save/restore functions to control the allocation behavior, but for userspace we would end up hitting an allocation that ended up writing data back to the same device we are trying to allocate for. The device is then in a state of deadlock, because to execute IO the device needs to allocate memory, but to allocate memory the memory layers want execute IO to the device. Here is an example with nbd using a local userspace daemon that performs network IO to a remote server. We are using XFS on top of the nbd device, but it can happen with any FS or other modules layered on top of the nbd device that can write out data to free memory. Here a nbd daemon helper thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute a request. This kicks off a reclaim operation which results in a WRITE to the nbd device and the nbd thread calling back into the mm layer. [ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000 [ 1626.609193] Call Trace: [ 1626.609195] ? __schedule+0x29b/0x630 [ 1626.609197] ? wait_for_completion+0xe0/0x170 [ 1626.609198] schedule+0x30/0xb0 [ 1626.609200] schedule_timeout+0x1f6/0x2f0 [ 1626.609202] ? blk_finish_plug+0x21/0x2e [ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410 [ 1626.609206] ? wait_for_completion+0xe0/0x170 [ 1626.609208] wait_for_completion+0x108/0x170 [ 1626.609210] ? wake_up_q+0x70/0x70 [ 1626.609212] ? __xfs_buf_submit+0x12e/0x250 [ 1626.609214] ? xfs_bwrite+0x25/0x60 [ 1626.609215] xfs_buf_iowait+0x22/0xf0 [ 1626.609218] __xfs_buf_submit+0x12e/0x250 [ 1626.609220] xfs_bwrite+0x25/0x60 [ 1626.609222] xfs_reclaim_inode+0x2e8/0x310 [ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300 [ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40 [ 1626.609228] super_cache_scan+0x152/0x1a0 [ 1626.609231] do_shrink_slab+0x12c/0x2d0 [ 1626.609233] shrink_slab+0x9c/0x2a0 [ 1626.609235] shrink_node+0xd7/0x470 [ 1626.609237] do_try_to_free_pages+0xbf/0x380 [ 1626.609240] try_to_free_pages+0xd9/0x1f0 [ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30 [ 1626.609251] ? ___slab_alloc+0x238/0x560 [ 1626.609254] __alloc_pages_nodemask+0x30c/0x350 [ 1626.609259] skb_page_frag_refill+0x97/0xd0 [ 1626.609274] sk_page_frag_refill+0x1d/0x80 [ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0 [ 1626.609304] tcp_sendmsg+0x27/0x40 [ 1626.609307] sock_sendmsg+0x54/0x60 [ 1626.609308] ___sys_sendmsg+0x29f/0x320 [ 1626.609313] ? sock_poll+0x66/0xb0 [ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0 [ 1626.609320] ? ep_send_events_proc+0xe6/0x230 [ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0 [ 1626.609324] ? ep_read_events_proc+0xc0/0xc0 [ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20 [ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230 [ 1626.609329] ? __hrtimer_init+0xb0/0xb0 [ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20 [ 1626.609334] ? ep_poll+0x26c/0x4a0 [ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0 [ 1626.609339] ? release_sock+0x43/0x90 [ 1626.609341] ? _raw_spin_unlock_bh+0xa/0x20 [ 1626.609342] __sys_sendmsg+0x47/0x80 [ 1626.609347] do_syscall_64+0x5f/0x1c0 [ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0 [ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9 This patch adds a new prctl command that daemons can use after they have done their initial setup, and before they start to do allocations that are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags so both userspace block and FS threads can use it to avoid the allocation recursion and try to prevent from being throttled while writing out data to free up memory. Signed-off-by: Mike Christie <mchristi@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Masato Suzuki <masato.suzuki@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-12-04kernel/sys.c: avoid copying possible padding bytes in copy_to_userJoe Perches1-1/+3
Initialization is not guaranteed to zero padding bytes so use an explicit memset instead to avoid leaking any kernel content in any possible padding bytes. Link: http://lkml.kernel.org/r/dfa331c00881d61c8ee51577a082d8bebd61805c.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15y2038: rusage: use __kernel_old_timevalArnd Bergmann1-2/+2
There are two 'struct timeval' fields in 'struct rusage'. Unfortunately the definition of timeval is now ambiguous when used in user space with a libc that has a 64-bit time_t, and this also changes the 'rusage' definition in user space in a way that is incompatible with the system call interface. While there is no good solution to avoid all ambiguity here, change the definition in the kernel headers to be compatible with the kernel ABI, using __kernel_old_timeval as an unambiguous base type. In previous discussions, there was also a plan to add a replacement for rusage based on 64-bit timestamps and nanosecond resolution, i.e. 'struct __kernel_timespec'. I have patches for that as well, if anyone thinks we should do that. Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-17Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-13/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-09-16Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds1-14/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu-feature updates from Ingo Molnar: - Rework the Intel model names symbols/macros, which were decades of ad-hoc extensions and added random noise. It's now a coherent, easy to follow nomenclature. - Add new Intel CPU model IDs: - "Tiger Lake" desktop and mobile models - "Elkhart Lake" model ID - and the "Lightning Mountain" variant of Airmont, plus support code - Add the new AVX512_VP2INTERSECT instruction to cpufeatures - Remove Intel MPX user-visible APIs and the self-tests, because the toolchain (gcc) is not supporting it going forward. This is the first, lowest-risk phase of MPX removal. - Remove X86_FEATURE_MFENCE_RDTSC - Various smaller cleanups and fixes * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) x86/cpu: Update init data for new Airmont CPU model x86/cpu: Add new Airmont variant to Intel family x86/cpu: Add Elkhart Lake to Intel family x86/cpu: Add Tiger Lake to Intel family x86: Correct misc typos x86/intel: Add common OPTDIFFs x86/intel: Aggregate microserver naming x86/intel: Aggregate big core graphics naming x86/intel: Aggregate big core mobile naming x86/intel: Aggregate big core client naming x86/cpufeature: Explain the macro duplication x86/ftrace: Remove mcount() declaration x86/PCI: Remove superfluous returns from void functions x86/msr-index: Move AMD MSRs where they belong x86/cpu: Use constant definitions for CPU models lib: Remove redundant ftrace flag removal x86/crash: Remove unnecessary comparison x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE() x86: Remove X86_FEATURE_MFENCE_RDTSC x86/mpx: Remove MPX APIs ...
2019-08-28posix-cpu-timers: Get rid of zero checksThomas Gleixner1-9/+0
Deactivation of the expiry cache is done by setting all clock caches to 0. That requires to have a check for zero in all places which update the expiry cache: if (cache == 0 || new < cache) cache = new; Use U64_MAX as the deactivated value, which allows to remove the zero checks when updating the cache and reduces it to the obvious check: if (new < cache) cache = new; This also removes the weird workaround in do_prlimit() which was required to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing in 0 to the posix CPU timer code would have effectively disarmed it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de
2019-08-28rlimit: Rewrite non-sensical RLIMIT_CPU commentThomas Gleixner1-4/+3
The comment above the function which arms RLIMIT_CPU in the posix CPU timer code makes no sense at all. It claims that the kernel does not return an error code when it rejected the attempt to set RLIMIT_CPU. That's clearly bogus as the code does an error check and the rlimit is only set and activated when the permission checks are ok. In case of a rejection an appropriate error code is returned. This is a historical and outdated comment which got dragged along even when the rlimit handling code was rewritten. Replace it with an explanation why the setup function is not called when the rlimit value is RLIM_INFINITY and how the 'disarming' is handled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de
2019-08-20arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused argumentsCatalin Marinas1-0/+4
Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored for future extensions. Acked-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-08-06arm64: Introduce prctl() options to control the tagged user addresses ABICatalin Marinas1-0/+12
It is not desirable to relax the ABI to allow tagged user addresses into the kernel indiscriminately. This patch introduces a prctl() interface for enabling or disabling the tagged ABI with a global sysctl control for preventing applications from enabling the relaxed ABI (meant for testing user-space prctl() return error checking without reconfiguring the kernel). The ABI properties are inherited by threads of the same application and fork()'ed children but cleared on execve(). A Kconfig option allows the overall disabling of the relaxed ABI. The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle MTE-specific settings like imprecise vs precise exceptions. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-07-22x86/mpx: Remove MPX APIsDave Hansen1-14/+2
MPX is being removed from the kernel due to a lack of support in the toolchain going forward (gcc). The first step is to remove the userspace-visible ABIs so that applications will stop using it. The most visible one are the enable/disable prctl()s. Remove them first. This is the most minimal and least invasive change needed to ensure that apps stop using MPX with new kernels. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com
2019-06-01prctl_set_mm: downgrade mmap_sem to read lockMichal Koutný1-2/+9
The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap semaphore taken.") added synchronization of reading argument/environment boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided the coarse use of mmap_sem in similar situations. But there still remained two places that (mis)use mmap_sem. get_cmdline should also use arg_lock instead of mmap_sem when it reads the boundaries. The second place that should use arg_lock is in prctl_set_mm. By protecting the boundaries fields with the arg_lock, we can downgrade mmap_sem to reader lock (analogous to what we already do in prctl_set_mm_map). [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Co-developed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01prctl_set_mm: refactor checks from validate_prctl_mapMichal Koutný1-26/+25
Despite comment of validate_prctl_map claims there are no capability checks, it is not completely true since commit 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file"). Extract the check out of the function and make the function perform purely arithmetic checks. This patch should not change any behavior, it is mere refactoring for following patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14kernel/sys.c: prctl: fix false positive in validate_prctl_map()Cyrill Gorcunov1-1/+1
While validating new map we require the @start_data to be strictly less than @end_data, which is fine for regular applications (this is why this nit didn't trigger for that long). These members are set from executable loaders such as elf handers, still it is pretty valid to have a loadable data section with zero size in file, in such case the start_data is equal to end_data once kernel loader finishes. As a result when we're trying to restore such programs the procedure fails and the kernel returns -EINVAL. From the image dump of a program: | "mm_start_code": "0x400000", | "mm_end_code": "0x8f5fb4", | "mm_start_data": "0xf1bfb0", | "mm_end_data": "0xf1bfb0", Thus we need to change validate_prctl_map from strictly less to less or equal operator use. Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan Fixes: f606b77f1a9e3 ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation") Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Andrey Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge more updates from Andrew Morton: - some of the rest of MM - various misc things - dynamic-debug updates - checkpatch - some epoll speedups - autofs - rapidio - lib/, lib/lzo/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) samples/mic/mpssd/mpssd.h: remove duplicate header kernel/fork.c: remove duplicated include include/linux/relay.h: fix percpu annotation in struct rchan arch/nios2/mm/fault.c: remove duplicate include unicore32: stop printing the virtual memory layout MAINTAINERS: fix GTA02 entry and mark as orphan mm: create the new vm_fault_t type arm, s390, unicore32: remove oneliner wrappers for memblock_alloc() arch: simplify several early memory allocations openrisc: simplify pte_alloc_one_kernel() sh: prefer memblock APIs returning virtual address microblaze: prefer memblock API returning virtual address powerpc: prefer memblock APIs returning virtual address lib/lzo: separate lzo-rle from lzo lib/lzo: implement run-length encoding lib/lzo: fast 8-byte copy on arm64 lib/lzo: 64-bit CTZ on arm64 lib/lzo: tidy-up ifdefs ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size ipc: annotate implicit fall through ...
2019-03-07kernel/sys: annotate implicit fall throughMathieu Malaterre1-0/+1
There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: kernel/sys.c:1748:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Link: http://lkml.kernel.org/r/20190114203347.17530-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-25LSM: add SafeSetID module that gates setid callsMicah Morton1-5/+5
This change ensures that the set*uid family of syscalls in kernel/sys.c (setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with the CAP_OPT_INSETID flag, so capability checks in the security_capable hook can know whether they are being called from within a set*uid syscall. This change is a no-op by itself, but is needed for the proposed SafeSetID LSM. Signed-off-by: Micah Morton <mortonm@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.morris@microsoft.com>
2019-01-14kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymoreJonathan Neuschäfer1-1/+2
UNAME26 is a mechanism to report Linux's version as 2.6.x, for compatibility with old/broken software. Due to the way it is implemented, it would have to be updated after 5.0, to keep the resulting versions unique. Linus Torvalds argued: "Do we actually need this? I'd rather let it bitrot, and just let it return random versions. It will just start again at 2.4.60, won't it? Anybody who uses UNAME26 for a 5.x kernel might as well think it's still 4.x. The user space is so old that it can't possibly care about differences between 4.x and 5.x, can it? The only thing that matters is that it shows "2.4.<largeenough>", which it will do regardless" Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03Remove 'type' argument from access_ok() functionLinus Torvalds1-1/+1
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-13arm64: add prctl control for resetting ptrauth keysKristina Martsenko1-0/+8
Add an arm64-specific prctl to allow a thread to reinitialize its pointer authentication keys to random values. This can be useful when exec() is not used for starting new processes, to ensure that different processes still have different keys. Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-20kernel/sys.c: remove duplicated includeYueHaibing1-3/+0
Link: http://lkml.kernel.org/r/20180821133424.18716-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-24Merge branch 'userns-linus' of ↵Linus Torvalds1-50/+45
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "This is a set of four fairly obvious bug fixes: - a switch from d_find_alias to d_find_any_alias because the xattr code perversely takes a dentry - two mutex vs copy_to_user fixes from Jann Horn - a fix to use a sanitized size not the size userspace passed in from Christian Brauner" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: getxattr: use correct xattr length sys: don't hold uts_sem while accessing userspace memory userns: move user access out of the mutex cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
2018-08-11sys: don't hold uts_sem while accessing userspace memoryJann Horn1-50/+45
Holding uts_sem as a writer while accessing userspace memory allows a namespace admin to stall all processes that attempt to take uts_sem. Instead, move data through stack buffers and don't access userspace memory while uts_sem is held. Cc: stable@vger.kernel.org Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-06-19sysinfo: Remove get_monotonic_boottime()Arnd Bergmann1-2/+2
get_monotonic_boottime() is deprecated because it uses the old 'timespec' structure. This replaces one of the last callers with a call to ktime_get_boottime. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: y2038@lists.linaro.org Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Link: https://lkml.kernel.org/r/20180618150114.849216-1-arnd@arndb.de
2018-06-07mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_structYang Shi1-2/+8
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25kernel/sys.c: fix potential Spectre v1 issueGustavo A. R. Silva1-0/+5
`resource' can be controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) Fix this by sanitizing *resource* before using it to index current->signal->rlim Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-03nospec: Allow getting/setting on non-current taskKees Cook1-4/+5
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than current. This is needed both for /proc/$pid/status queries and for seccomp (since thread-syncing can trigger seccomp in non-current threads). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-03prctl: Add speculation control prctlsThomas Gleixner1-0/+22
Add two new prctls to control aspects of speculation related vulnerabilites and their mitigations to provide finer grained control over performance impacting mitigations. PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature which is selected with arg2 of prctl(2). The return value uses bit 0-2 with the following meaning: Bit Define Description 0 PR_SPEC_PRCTL Mitigation can be controlled per task by PR_SET_SPECULATION_CTRL 1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is disabled 2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is enabled If all bits are 0 the CPU is not affected by the speculation misfeature. If PR_SPEC_PRCTL is set, then the per task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail. PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which is selected by arg2 of prctl(2) per task. arg3 is used to hand in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. The common return values are: EINVAL prctl is not implemented by the architecture or the unused prctl() arguments are not 0 ENODEV arg2 is selecting a not supported speculation misfeature PR_SET_SPECULATION_CTRL has these additional return values: ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE ENXIO prctl control of the selected speculation misfeature is disabled The first supported controlable speculation misfeature is PR_SPEC_STORE_BYPASS. Add the define so this can be shared between architectures. Based on an initial patch from Tim Chen and mostly rewritten. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-04-02kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()Dominik Brodowski1-1/+6
Using this helper allows us to avoid the in-kernel call to the sys_setsid() syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_setsid(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.cDominik Brodowski1-8/+50
Using these helpers allows us to avoid the in-kernel calls to these syscalls: sys_setregid(), sys_setgid(), sys_setreuid(), sys_setuid(), sys_setresuid(), sys_setresgid(), sys_setfsuid(), and sys_setfsgid(). The ksys_ prefix denotes that these function are meant as a drop-in replacement for the syscall. In particular, they use the same calling convention. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02kernel: add do_getpgid() helper; remove internal call to sys_getpgid()Dominik Brodowski1-2/+7
Using the do_getpgid() helper removes an in-kernel call to the sys_getpgid() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2017-12-14fix typo in assignment of fs default overflow gidWolffhardt Schwabe1-1/+1
The patch remains without practical effect since both macros carry identical values. Still, it might become a problem in the future if (for whatever reason) the default overflow uid and gid differ. The DEFAULT_FS_OVERFLOWGID macro was previously unused. Signed-off-by: Wolffhardt Schwabe <wolffhardt.schwabe@fau.de> Signed-off-by: Anatoliy Cherepantsev <anatoliy.cherepantsev@fau.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-11-15Merge tag 'arm64-upstream' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The big highlight is support for the Scalable Vector Extension (SVE) which required extensive ABI work to ensure we don't break existing applications by blowing away their signal stack with the rather large new vector context (<= 2 kbit per vector register). There's further work to be done optimising things like exception return, but the ABI is solid now. Much of the line count comes from some new PMU drivers we have, but they're pretty self-contained and I suspect we'll have more of them in future. Plenty of acronym soup here: - initial support for the Scalable Vector Extension (SVE) - improved handling for SError interrupts (required to handle RAS events) - enable GCC support for 128-bit integer types - remove kernel text addresses from backtraces and register dumps - use of WFE to implement long delay()s - ACPI IORT updates from Lorenzo Pieralisi - perf PMU driver for the Statistical Profiling Extension (SPE) - perf PMU driver for Hisilicon's system PMUs - misc cleanups and non-critical fixes" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits) arm64: Make ARMV8_DEPRECATED depend on SYSCTL arm64: Implement __lshrti3 library function arm64: support __int128 on gcc 5+ arm64/sve: Add documentation arm64/sve: Detect SVE and activate runtime support arm64/sve: KVM: Hide SVE from CPU features exposed to guests arm64/sve: KVM: Treat guest SVE use as undefined instruction execution arm64/sve: KVM: Prevent guests from using SVE arm64/sve: Add sysctl to set the default vector length for new processes arm64/sve: Add prctl controls for userspace vector length management arm64/sve: ptrace and ELF coredump support arm64/sve: Preserve SVE registers around EFI runtime service calls arm64/sve: Preserve SVE registers around kernel-mode NEON use arm64/sve: Probe SVE capabilities and usable vector lengths arm64: cpufeature: Move sys_caps_initialised declarations arm64/sve: Backend logic for setting the vector length arm64/sve: Signal handling support arm64/sve: Support vector length resetting for new processes arm64/sve: Core task context handling arm64/sve: Low-level CPU setup ...
2017-11-03arm64/sve: Add prctl controls for userspace vector length managementDave Martin1-0/+12
This patch adds two arm64-specific prctls, to permit userspace to control its vector length: * PR_SVE_SET_VL: set the thread's SVE vector length and vector length inheritance mode. * PR_SVE_GET_VL: get the same information. Although these prctls resemble instruction set features in the SVE architecture, they provide additional control: the vector length inheritance mode is Linux-specific and nothing to do with the architecture, and the architecture does not permit EL0 to set its own vector length directly. Both can be used in portable tools without requiring the use of SVE instructions. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alex Bennée <alex.bennee@linaro.org> [will: Fixed up prctl constants to avoid clash with PDEATHSIG] Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-20prctl: Allow local CAP_SYS_ADMIN changing exe_fileKirill Tkhai1-6/+2
During checkpointing and restore of userspace tasks we bumped into the situation, that it's not possible to restore the tasks, which user namespace does not have uid 0 or gid 0 mapped. People create user namespace mappings like they want, and there is no a limitation on obligatory uid and gid "must be mapped". So, if there is no uid 0 or gid 0 in the mapping, it's impossible to restore mm->exe_file of the processes belonging to this user namespace. Also, there is no a workaround. It's impossible to create a temporary uid/gid mapping, because only one write to /proc/[pid]/uid_map and gid_map is allowed during a namespace lifetime. If there is an entry, then no more mapings can't be written. If there isn't an entry, we can't write there too, otherwise user task won't be able to do that in the future. The patch changes the check, and looks for CAP_SYS_ADMIN instead of zero uid and gid. This allows to restore a task independently of its user namespace mappings. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Serge Hallyn <serge@hallyn.com> CC: "Eric W. Biederman" <ebiederm@xmission.com> CC: Oleg Nesterov <oleg@redhat.com> CC: Michal Hocko <mhocko@suse.com> CC: Andrei Vagin <avagin@openvz.org> CC: Cyrill Gorcunov <gorcunov@openvz.org> CC: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com> CC: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-07-12fix a braino in compat_sys_getrlimit()Al Viro1-1/+1
Reported-and-tested-by: Meelis Roos <mroos@linux.ee> Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: make PR_SET_THP_DISABLE immediately activeMichal Hocko1-3/+3
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any existing mapping because it only updated mm->def_flags which is a template for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE flag set. This can be quite surprising for all those applications which do not do prctl(); fork() & exec() and want to control their own THP behavior. Another usecase when the immediate semantic of the prctl might be useful is a combination of pre- and post-copy migration of containers with CRIU. In this case CRIU populates a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and CRIU expects to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the expected page faults do not occur. In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a temporary mechanism for enabling/disabling THP process wide. Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is tested when decision whether to use huge pages is taken either during page fault of at the time of THP collapse. It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously. Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE") Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06Merge branch 'misc.compat' of ↵Linus Torvalds1-1/+99
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc compat stuff updates from Al Viro: "This part is basically untangling various compat stuff. Compat syscalls moved to their native counterparts, getting rid of quite a bit of double-copying and/or set_fs() uses. A lot of field-by-field copyin/copyout killed off. - kernel/compat.c is much closer to containing just the copyin/copyout of compat structs. Not all compat syscalls are gone from it yet, but it's getting there. - ipc/compat_mq.c killed off completely. - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to drivers/block/floppy.c where they belong. Yes, there are several drivers that implement some of the same ioctls. Some are m68k and one is 32bit-only pmac. drivers/block/floppy.c is the only one in that bunch that can be built on biarch" * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mqueue: move compat syscalls to native ones usbdevfs: get rid of field-by-field copyin compat_hdio_ioctl: get rid of set_fs() take floppy compat ioctls to sodding floppy.c ipmi: get rid of field-by-field __get_user() ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one rt_sigtimedwait(): move compat to native select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap() put_compat_rusage(): switch to copy_to_user() sigpending(): move compat to native getrlimit()/setrlimit(): move compat to native times(2): move compat to native compat_{get,put}_bitmap(): use unsafe_{get,put}_user() fb_get_fscreeninfo(): don't bother with do_fb_ioctl() do_sigaltstack(): lift copying to/from userland into callers take compat_sys_old_getrlimit() to native syscall trim __ARCH_WANT_SYS_OLD_GETRLIMIT
2017-06-09getrlimit()/setrlimit(): move compat to nativeAl Viro1-0/+48
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09times(2): move compat to nativeAl Viro1-1/+27
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27take compat_sys_old_getrlimit() to native syscallAl Viro1-0/+24
... and sanitize the ifdefs in there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21wait4(2)/waitid(2): separate copying rusage to userlandAl Viro1-10/+6
New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(), sys_wait4() and their compat variants switched to those. Copying struct rusage to userland is left to syscall itself. For compat_sys_wait4() that eliminates the use of set_fs() completely. For compat_sys_waitid() it's still needed (for siginfo handling); that will change shortly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-05Merge branch 'for-linus' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn't quite make the merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Fix unbalanced hard link numbers signal: Make kill_proc_info static rlimit: Properly call security_task_setrlimit signal: Remove unused definition of sig_user_definied ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET ipc: Remove unused declaration of recompute_msgmni posix-timers: Correct sanity check in posix_cpu_nsleep sysctl: Remove dead register_sysctl_root
2017-04-21rlimit: Properly call security_task_setrlimitEric W. Biederman1-2/+1
Modify do_prlimit to call security_task_setrlimit passing the task whose rlimit we are changing not the tsk->group_leader. In general this should not matter as the lsms implementing security_task_setrlimit apparmor and selinux both examine the task->cred to see what should be allowed on the destination task. That task->cred is shared between tasks created with CLONE_THREAD unless thread keyrings are in play, in which case both apparmor and selinux create duplicate security contexts. So the only time when it will matter which thread is passed to security_task_setrlimit is if one of the threads of a process performs an operation that changes only it's credentials. At which point if a thread has done that we don't want to hide that information from the lsms. So fix the call of security_task_setrlimit. With the removal of tsk->group_leader this makes the code slightly faster, more comprehensible and maintainable. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-03-06prlimit,security,selinux: add a security hook for prlimitStephen Smalley1-12/+18
When SELinux was first added to the kernel, a process could only get and set its own resource limits via getrlimit(2) and setrlimit(2), so no MAC checks were required for those operations, and thus no security hooks were defined for them. Later, SELinux introduced a hook for setlimit(2) with a check if the hard limit was being changed in order to be able to rely on the hard limit value as a safe reset point upon context transitions. Later on, when prlimit(2) was added to the kernel with the ability to get or set resource limits (hard or soft) of another process, LSM/SELinux was not updated other than to pass the target process to the setrlimit hook. This resulted in incomplete control over both getting and setting the resource limits of another process. Add a new security_task_prlimit() hook to the check_prlimit_permission() function to provide complete mediation. The hook is only called when acting on another task, and only if the existing DAC/capability checks would allow access. Pass flags down to the hook to indicate whether the prlimit(2) call will read, write, or both read and write the resource limits of the target process. The existing security_task_setrlimit() hook is left alone; it continues to serve a purpose in supporting the ability to make decisions based on the old and/or new resource limit values when setting limits. This is consistent with the DAC/capability logic, where check_prlimit_permission() performs generic DAC/capability checks for acting on another task, while do_prlimit() performs a capability check based on a comparison of the old and new resource limits. Fix the inline documentation for the hook to match the code. Implement the new hook for SELinux. For setting resource limits, we reuse the existing setrlimit permission. Note that this does overload the setrlimit permission to mean the ability to set the resource limit (soft or hard) of another process or the ability to change one's own hard limit. For getting resource limits, a new getrlimit permission is defined. This was not originally defined since getrlimit(2) could only be used to obtain a process' own limits. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-02sched/headers: Prepare to move cputime functionality from <linux/sched.h> ↵Ingo Molnar1-0/+1
into <linux/sched/cputime.h> Introduce a trivial, mostly empty <linux/sched/cputime.h> header to prepare for the moving of cputime functionality out of sched.h. Update all code that relies on these facilities. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/task.h> We are going to split <linux/sched/task.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/stat.h> We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/stat.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/coredump.h> We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/autogroup.h> We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/autogroup.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/loadavg.h> We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which will have to be picked up from a couple of .c files. Create a trivial placeholder <linux/sched/topology.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-23Merge branch 'for-linus' of ↵Linus Torvalds1-0/+22
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
2017-02-03prctl: propagate has_child_subreaper flag to every descendantPavel Tikhomirov1-0/+22
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. * Also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance v4: small commit message fix Fixes: ebec18a6d3aa ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-01sched/cputime: Convert task/group cputime to nsecsFrederic Weisbecker1-8/+8
Now that most cputime readers use the transition API which return the task cputime in old style cputime_t, we can safely store the cputime in nsecs. This will eventually make cputime statistics less opaque and more granular. Back and forth convertions between cputime_t and nsecs in order to deal with cputime_t random granularity won't be needed anymore. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-10/+0
Merge updates from Andrew Morton: - various misc bits - most of MM (quite a lot of MM material is awaiting the merge of linux-next dependencies) - kasan - printk updates - procfs updates - MAINTAINERS - /lib updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits) init: reduce rootwait polling interval time to 5ms binfmt_elf: use vmalloc() for allocation of vma_filesz checkpatch: don't emit unified-diff error for rename-only patches checkpatch: don't check c99 types like uint8_t under tools checkpatch: avoid multiple line dereferences checkpatch: don't check .pl files, improve absolute path commit log test scripts/checkpatch.pl: fix spelling checkpatch: don't try to get maintained status when --no-tree is given lib/ida: document locking requirements a bit better lib/rbtree.c: fix typo in comment of ____rb_erase_color lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM MAINTAINERS: add drm and drm/i915 irc channels MAINTAINERS: add "C:" for URI for chat where developers hang out MAINTAINERS: add drm and drm/i915 bug filing info MAINTAINERS: add "B:" for URI where to file bugs get_maintainer: look for arbitrary letter prefixes in sections printk: add Kconfig option to set default console loglevel printk/sound: handle more message headers printk/btrfs: handle more message headers printk/kdb: handle more message headers ...
2016-12-12prctl: remove one-shot limitation for changing exe linkStanislav Kinsburskiy1-10/+0
This limitation came with the reason to remove "another way for malicious code to obscure a compromised program and masquerade as a benign process" by allowing "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2 This explanation doesn't look sufficient. The only thing "exe" link is indicating is the file, used to execve, which is basically nothing and not reliable immediately after process has returned from execve system call. Moreover, to use this feture, all the mappings to previous exe file have to be unmapped and all the new exe file permissions must be satisfied. Which means, that changing exe link is very similar to calling execve on the binary. The need to remove this limitations comes from migration of NFS mount point, which is not accessible during restore and replaced by other file system. Because of this exe link has to be changed twice. [akpm@linux-foundation.org: fix up comment] Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: John Stultz <john.stultz@linaro.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-16posix-timers: Make them configurableNicolas Pitre1-1/+2
Some embedded systems have no use for them. This removes about 25KB from the kernel binary size when configured out. Corresponding syscalls are routed to a stub logging the attempt to use those syscalls which should be enough of a clue if they were disabled without proper consideration. They are: timer_create, timer_gettime: timer_getoverrun, timer_settime, timer_delete, clock_adjtime, setitimer, getitimer, alarm. The clock_settime, clock_gettime, clock_getres and clock_nanosleep syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast majority of use cases with very little code. Signed-off-by: Nicolas Pitre <nico@linaro.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: linux-kbuild@vger.kernel.org Cc: netdev@vger.kernel.org Cc: Michal Marek <mmarek@suse.com> Cc: Edward Cree <ecree@solarflare.com> Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-23prctl: make PR_SET_THP_DISABLE wait for mmap_sem killableMichal Hocko1-1/+2
PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17timer: convert timer_slack_ns from unsigned long to u64John Stultz1-1/+4
This patchset introduces a /proc/<pid>/timerslack_ns interface which would allow controlling processes to be able to set the timerslack value on other processes in order to save power by avoiding wakeups (Something Android currently does via out-of-tree patches). The first patch tries to fix the internal timer_slack_ns usage which was defined as a long, which limits the slack range to ~4 seconds on 32bit systems. It converts it to a u64, which provides the same basically unlimited slack (500 years) on both 32bit and 64bit machines. The second patch introduces the /proc/<pid>/timerslack_ns interface which allows the full 64bit slack range for a task to be read or set on both 32bit and 64bit machines. With these two patches, on a 32bit machine, after setting the slack on bash to 10 seconds: $ time sleep 1 real 0m10.747s user 0m0.001s sys 0m0.005s The first patch is a little ugly, since I had to chase the slack delta arguments through a number of functions converting them to u64s. Let me know if it makes sense to break that up more or not. Other than that things are fairly straightforward. This patch (of 2): The timer_slack_ns value in the task struct is currently a unsigned long. This means that on 32bit applications, the maximum slack is just over 4 seconds. However, on 64bit machines, its much much larger (~500 years). This disparity could make application development a little (as well as the default_slack) to a u64. This means both 32bit and 64bit systems have the same effective internal slack range. Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify the interface as a unsigned long, so we preserve that limitation on 32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger then what can be stored by an unsigned long. This patch also modifies hrtimer functions which specified the slack delta as a unsigned long. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20prctl: take mmap sem for writing to protect against othersMateusz Guzik1-10/+10
An unprivileged user can trigger an oops on a kernel with CONFIG_CHECKPOINT_RESTORE. proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env start/end values. These get sanity checked as follows: BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); These can be changed by prctl_set_mm. Turns out also takes the semaphore for reading, effectively rendering it useless. This results in: kernel BUG at fs/proc/base.c:240! invalid opcode: 0000 [#1] SMP Modules linked in: virtio_net CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000 RIP: proc_pid_cmdline_read+0x520/0x530 RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206 RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246 RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050 R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600 FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0 Call Trace: __vfs_read+0x37/0x100 vfs_read+0x82/0x130 SyS_read+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RIP proc_pid_cmdline_read+0x520/0x530 ---[ end trace 97882617ae9c6818 ]--- Turns out there are instances where the code just reads aformentioned values without locking whatsoever - namely environ_read and get_cmdline. Interestingly these functions look quite resilient against bogus values, but I don't believe this should be relied upon. The first patch gets rid of the oops bug by grabbing mmap_sem for writing. The second patch is optional and puts locking around aformentioned consumers for safety. Consumers of other fields don't seem to benefit from similar treatment and are left untouched. This patch (of 2): The code was taking the semaphore for reading, which does not protect against readers nor concurrent modifications. The problem could cause a sanity checks to fail in procfs's cmdline reader, resulting in an OOPS. Note that some functions perform an unlocked read of various mm fields, but they seem to be fine despite possible modificaton. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anshuman Khandual <anshuman.linux@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06pidns: fix set/getpriority and ioprio_set/get in PRIO_USER modeBen Segall1-2/+2
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by: Ben Segall <bsegall@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ambrose Feinstein <ambrose@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-10vfs: Commit to never having exectuables on proc and sysfs.Eric W. Biederman1-2/+1
Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-06-25prctl: more prctl(PR_SET_MM_*) checksAlexey Dobriyan1-67/+91
Individual prctl(PR_SET_MM_*) calls do some checking to maintain a consistent view of mm->arg_start et al fields, but not enough. In particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/ PR_SET_MM_ENV_END only check that the address lies in an existing VMA, but don't check that the start address is lower than the end address _at all_. Consolidate all consistency checks, so there will be no difference in the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls. The program below makes both ARGV and ENVP areas be reversed. It makes /proc/$PID/cmdline show garbage (it doesn't oops by luck). #include <sys/mman.h> #include <sys/prctl.h> #include <unistd.h> enum {PAGE_SIZE=4096}; int main(void) { void *p; p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); #define PR_SET_MM 35 #define PR_SET_MM_ARG_START 8 #define PR_SET_MM_ARG_END 9 #define PR_SET_MM_ENV_START 10 #define PR_SET_MM_ENV_END 11 prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0); prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0); prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0); prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0); pause(); return 0; } [akpm@linux-foundation.org: tidy code, tweak comment] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-09x86/mpx: Clean up the code by not passing a task pointer around when unnecessaryDave Hansen1-4/+4
The MPX code can only work on the current task. You can not, for instance, enable MPX management in another process or thread. You can also not handle a fault for another process or thread. Despite this, we pass a task_struct around prolifically. This patch removes all of the task struct passing for code paths where the code can not deal with another task (which turns out to be all of them). This has no functional changes. It's just a cleanup. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave@sr71.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bp@alien8.de Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-17prctl: avoid using mmap_sem for exe_file serializationDavidlohr Bueso1-19/+28
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead of calling set_mm_exe_file() which requires some form of serialization -- mmap_sem in this case. For archs that do not have atomic rmw instructions we still fallback to a spinlock alternative, so this should always be safe. As such, we only need the mmap_sem for looking up the backing vm_file, which can be done sharing the lock. Naturally, this means we need to manually deal with both the new and old file reference counting, and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can probably be deleted in the future anyway. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15kernel: conditionally support non-root users, groups and capabilitiesIulia Manda1-0/+2
There are a lot of embedded systems that run most or all of their functionality in init, running as root:root. For these systems, supporting multiple users is not necessary. This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for non-root users, non-root groups, and capabilities optional. It is enabled under CONFIG_EXPERT menu. When this symbol is not defined, UID and GID are zero in any possible case and processes always have all capabilities. The following syscalls are compiled out: setuid, setregid, setgid, setreuid, setresuid, getresuid, setresgid, getresgid, setgroups, getgroups, setfsuid, setfsgid, capget, capset. Also, groups.c is compiled out completely. In kernel/capability.c, capable function was moved in order to avoid adding two ifdef blocks. This change saves about 25 KB on a defconfig build. The most minimal kernels have total text sizes in the high hundreds of kB rather than low MB. (The 25k goes down a bit with allnoconfig, but not that much. The kernel was booted in Qemu. All the common functionalities work. Adding users/groups is not possible, failing with -ENOSYS. Bloat-o-meter output: add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28kernel/sys.c: fix UNAME26 for 4.0Jon DeVree1-1/+2
There's a uname workaround for broken userspace which can't handle kernel versions of 3.x. Update it for 4.x. Signed-off-by: Jon DeVree <nuxi@vault24.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-21Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds1-0/+12
Pull MIPS updates from Ralf Baechle: "This is the main pull request for MIPS: - a number of fixes that didn't make the 3.19 release. - a number of cleanups. - preliminary support for Cavium's Octeon 3 SOCs which feature up to 48 MIPS64 R3 cores with FPU and hardware virtualization. - support for MIPS R6 processors. Revision 6 of the MIPS architecture is a major revision of the MIPS architecture which does away with many of original sins of the architecture such as branch delay slots. This and other changes in R6 require major changes throughout the entire MIPS core architecture code and make up for the lion share of this pull request. - finally some preparatory work for eXtendend Physical Address support, which allows support of up to 40 bit of physical address space on 32 bit processors" [ Ahh, MIPS can't leave the PAE brain damage alone. It's like every CPU architect has to make that mistake, but pee in the snow by changing the TLA. But whether it's called PAE, LPAE or XPA, it's horrid crud - Linus ] * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits) MIPS: sead3: Corrected get_c0_perfcount_int MIPS: mm: Remove dead macro definitions MIPS: OCTEON: irq: add CIB and other fixes MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs. MIPS: OCTEON: More OCTEONIII support MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits. MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup. MIPS: OCTEON: Update octeon-model.h code for new SoCs. MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h MIPS: OCTEON: Implement the core-16057 workaround MIPS: OCTEON: Delete unused COP2 saving code MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register MIPS: OCTEON: Save and restore CP2 SHA3 state MIPS: OCTEON: Fix FP context save. MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs MIPS: boot: Provide more uImage options MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h MIPS: ip22-gio: Remove legacy suspend/resume support mips: pci: Add ifdef around pci_proc_domain ...
2015-02-12MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPSPaul Burton1-0/+12
Userland code may be built using an ABI which permits linking to objects that have more restrictive floating point requirements. For example, userland code may be built to target the O32 FPXX ABI. Such code may be linked with other FPXX code, or code built for either one of the more restrictive FP32 or FP64. When linking with more restrictive code, the overall requirement of the process becomes that of the more restrictive code. The kernel has no way to know in advance which mode the process will need to be executed in, and indeed it may need to change during execution. The dynamic loader is the only code which will know the overall required mode, and so it needs to have a means to instruct the kernel to switch the FP mode of the process. This patch introduces 2 new options to the prctl syscall which provide such a capability. The FP mode of the process is represented as a simple bitmask combining a number of mode bits mirroring those present in the hardware. Userland can either retrieve the current FP mode of the process: mode = prctl(PR_GET_FP_MODE); or modify the current FP mode of the process: err = prctl(PR_SET_FP_MODE, new_mode); Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Matthew Fortune <matthew.fortune@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/8899/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2015-01-22x86, mpx: Strictly enforce empty prctl() argsDave Hansen1-0/+4
Description from Michael Kerrisk. He suggested an identical patch to one I had already coded up and tested. commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure that unused arguments are zero, as is done in many existing prctl()s and as should be done for all new prctl()s. This patch adds the required checks. Suggested-by: Andy Lutomirski <luto@amacapital.net> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen1-0/+12
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave Hansen) - Various sched/idle refinements for better idle handling (Nicolas Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot) - sched/numa updates and optimizations (Rik van Riel) - sysbench speedup (Vincent Guittot) - capacity calculation cleanups/refactoring (Vincent Guittot) - Various cleanups to thread group iteration (Oleg Nesterov) - Double-rq-lock removal optimization and various refactorings (Kirill Tkhai) - various sched/deadline fixes ... and lots of other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/dl: Use dl_bw_of() under rcu_read_lock_sched() sched/fair: Delete resched_cpu() from idle_balance() sched, time: Fix build error with 64 bit cputime_t on 32 bit systems sched: Improve sysbench performance by fixing spurious active migration sched/x86: Fix up typo in topology detection x86, sched: Add new topology for multi-NUMA-node CPUs sched/rt: Use resched_curr() in task_tick_rt() sched: Use rq->rd in sched_setaffinity() under RCU read lock sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' sched: Use dl_bw_of() under RCU read lock sched/fair: Remove duplicate code from can_migrate_task() sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW sched: print_rq(): Don't use tasklist_lock sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock() sched: Fix the task-group check in tg_has_rt_tasks() sched/fair: Leverage the idle state info when choosing the "idlest" cpu sched: Let the scheduler see CPU idle states sched/deadline: Fix inter- exclusive cpusets migrations sched/deadline: Clear dl_entity params when setscheduling to different class sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault() ...
2014-10-09kernel/sys.c: compat sysinfo syscall: fix undefined behaviorScotty Bauer1-1/+1
Fix undefined behavior and compiler warning by replacing right shift 32 with upper_32_bits macro Signed-off-by: Scotty Bauer <sbauer@eng.utah.edu> Cc: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09kernel/sys.c: whitespace fixesvishnu.ps1-128/+137
Fix minor errors and warning messages in kernel/sys.c. These errors were reported by checkpatch while working with some modifications in sys.c file. Fixing this first will help me to improve my further patches. ERROR: trailing whitespace - 9 ERROR: do not use assignment in if condition - 4 ERROR: spaces required around that '?' (ctx:VxO) - 10 ERROR: switch and case should be at the same indent - 3 total 26 errors & 3 warnings fixed. Signed-off-by: vishnu.ps <vishnu.ps@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: use VM_BUG_ON_MM where possibleSasha Levin1-1/+1
Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operationCyrill Gorcunov1-1/+189
During development of c/r we've noticed that in case if we need to support user namespaces we face a problem with capabilities in prctl(PR_SET_MM, ...) call, in particular once new user namespace is created capable(CAP_SYS_RESOURCE) no longer passes. A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make more intensive test for sanity of values and same time allow us to support checkpoint/restore of user namespaces. Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of prctl_mm_map structure which carries all the members to be updated. prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size) struct prctl_mm_map { __u64 start_code; __u64 end_code; __u64 start_data; __u64 end_data; __u64 start_brk; __u64 brk; __u64 start_stack; __u64 arg_start; __u64 arg_end; __u64 env_start; __u64 env_end; __u64 *auxv; __u32 auxv_size; __u32 exe_fd; }; All members except @exe_fd correspond ones of struct mm_struct. To figure out which available values these members may take here are meanings of the members. - start_code, end_code: represent bounds of executable code area - start_data, end_data: represent bounds of data area - start_brk, brk: used to calculate bounds for brk() syscall - start_stack: used when accounting space needed for command line arguments, environment and shmat() syscall - arg_start, arg_end, env_start, env_end: represent memory area supplied for command line arguments and environment variables - auxv, auxv_size: carries auxiliary vector, Elf format specifics - exe_fd: file descriptor number for executable link (/proc/self/exe) Thus we apply the following requirements to the values 1) Any member except @auxv, @auxv_size, @exe_fd is rather an address in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr) interval. 2) While @[start|end]_code and @[start|end]_data may point to an nonexisting VMAs (say a program maps own new .text and .data segments during execution) the rest of members should belong to VMA which must exist. 3) Addresses must be ordered, ie @start_ member must not be greater or equal to appropriate @end_ member. 4) As in regular Elf loading procedure we require that @start_brk and @brk be greater than @end_data. 5) If RLIMIT_DATA rlimit is set to non-infinity new values should not exceed existing limit. Same applies to RLIMIT_STACK. 6) Auxiliary vector size must not exceed existing one (which is predefined as AT_VECTOR_SIZE and depends on architecture). 7) File descriptor passed in @exe_file should be pointing to executable file (because we use existing prctl_set_mm_exe_file_locked helper it ensures that the file we are going to use as exe link has all required permission granted). Now about where these members are involved inside kernel code: - @start_code and @end_code are used in /proc/$pid/[stat|statm] output; - @start_data and @end_data are used in /proc/$pid/[stat|statm] output, also they are considered if there enough space for brk() syscall result if RLIMIT_DATA is set; - @start_brk shown in /proc/$pid/stat output and accounted in brk() syscall if RLIMIT_DATA is set; also this member is tested to find a symbolic name of mmap event for perf system (we choose if event is generated for "heap" area); one more aplication is selinux -- we test if a process has PROCESS__EXECHEAP permission if trying to make heap area being executable with mprotect() syscall; - @brk is a current value for brk() syscall which lays inside heap area, it's shown in /proc/$pid/stat. When syscall brk() succesfully provides new memory area to a user space upon brk() completion the mm::brk is updated to carry new value; Both @start_brk and @brk are actively used in /proc/$pid/maps and /proc/$pid/smaps output to find a symbolic name "heap" for VMA being scanned; - @start_stack is printed out in /proc/$pid/stat and used to find a symbolic name "stack" for task and threads in /proc/$pid/maps and /proc/$pid/smaps output, and as the same as with @start_brk -- perf system uses it for event naming. Also kernel treat this member as a start address of where to map vDSO pages and to check if there is enough space for shmat() syscall; - @arg_start, @arg_end, @env_start and @env_end are printed out in /proc/$pid/stat. Another access to the data these members represent is to read /proc/$pid/environ or /proc/$pid/cmdline. Any attempt to read these areas kernel tests with access_process_vm helper so a user must have enough rights for this action; - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly speaking kernel doesn't care much about which exactly data is sitting there because it is solely for userspace; - @exe_fd is referred from /proc/$pid/exe and when generating coredump. We uses prctl_set_mm_exe_file_locked helper to update this member, so exe-file link modification remains one-shot action. Still note that updating exe-file link now doesn't require sys-resource capability anymore, after all there is no much profit in preventing setup own file link (there are a number of ways to execute own code -- ptrace, ld-preload, so that the only reliable way to find which exactly code is executed is to inspect running program memory). Still we require the caller to be at least user-namespace root user. I believe the old interface should be deprecated and ripped off in a couple of kernel releases if no one against. To test if new interface is implemented in the kernel one can pass PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently supported struct prctl_mm_map. [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Andrew Vagin <avagin@openvz.org> Tested-by: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_fileCyrill Gorcunov1-10/+11
Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out and rename the helper to prctl_set_mm_exe_file_locked(). This will allow to reuse this function in a next patch. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: use may_adjust_brk helperCyrill Gorcunov1-7/+4
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-08time, signal: Protect resource use statistics with seqlockRik van Riel1-2/+0
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability issues on large systems, due to both functions being serialized with a lock. The lock protects against reporting a wrong value, due to a thread in the task group exiting, its statistics reporting up to the signal struct, and that exited task's statistics being counted twice (or not at all). Protecting that with a lock results in times() and clock_gettime() being completely serialized on large systems. This can be fixed by using a seqlock around the events that gather and propagate statistics. As an additional benefit, the protection code can be moved into thread_group_cputime(), slightly simplifying the calling functions. In the case of posix_cpu_clock_get_task() things can be simplified a lot, because the calling function already ensures that the task sticks around, and the rest is now taken care of in thread_group_cputime(). This way the statistics reporting code can run lockless. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-18sched: move no_new_privs into new atomic flagsKees Cook1-2/+2
Since seccomp transitions between threads requires updates to the no_new_privs flag to be atomic, the flag must be part of an atomic flag set. This moves the nnp flag into a separate task field, and introduces accessors. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-05-22sched: Consolidate open coded implementations of nice level frobbing into ↵Dongsheng Yang1-3/+3
nice_to_rlimit() and rlimit_to_nice() Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-07mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLEAlex Thorlton1-0/+15
Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It also adds a prctl control which allows us to set the THP disable bit in mm->def_flags so that VMs will pick up the setting as they are created. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-22sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICEDongsheng Yang1-4/+4
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Robin Holt <holt@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-23kernel/sys.c: k_getrusage() can use while_each_thread()Oleg Nesterov1-2/+1
Change k_getrusage() to use while_each_thread(), no changes in the compiled code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Sameer Nanda <snanda@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: kill task_struct->did_execOleg Nesterov1-3/+2
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually exclusive. The patch kills ->did_exec because it has a single user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13kernel/sys.c: remove obsolete #include <linux/kexec.h>Geert Uytterhoeven1-1/+0
Commit 15d94b82565e ("reboot: move shutdown/reboot related functions to kernel/reboot.c") moved all kexec-related functionality to kernel/reboot.c, so kernel/sys.c no longer needs to include <linux/kexec.h>. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-30userns: Kill nsown_capable it makes the wrong thing easyEric W. Biederman1-10/+10
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and CAP_SETGID. For the existing users it doesn't noticably simplify things and from the suggested patches I have seen it encourages people to do the wrong thing. So remove nsown_capable. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-07-09reboot: move shutdown/reboot related functions to kernel/reboot.cRobin Holt1-331/+0
This patch is preparatory. It moves reboot related syscall, etc functions from kernel/sys.c to kernel/reboot.c. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09reboot: remove -stable friendly PF_THREAD_BOUND defineRobin Holt1-5/+0
Remove the prior patch's #define for easier backporting to the stable releases. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03exit.c: unexport __set_special_pids()Oleg Nesterov1-1/+12
Move __set_special_pids() from exit.c to sys.c close to its single caller and make it static. And rename it to set_special_pids(), another helper with this name has gone away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/sys.c:do_sysinfo(): use get_monotonic_boottime()Oleg Nesterov1-2/+1
Change do_sysinfo() to use get_monotonic_boottime() instead of do_posix_clock_monotonic_gettime() + monotonic_to_bootbased(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/sys.c: sys_reboot(): fix malformed panic messageliguang1-1/+1
If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will stay on the same line with the next message, so append a '\n'. Signed-off-by: liguang <lig.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12reboot: rigrate shutdown/reboot to boot cpuRobin Holt1-3/+26
We recently noticed that reboot of a 1024 cpu machine takes approx 16 minutes of just stopping the cpus. The slowdown was tracked to commit f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()"). The current implementation does all the work of hot removing the cpus before halting the system. We are switching to just migrating to the boot cpu and then continuing with shutdown/reboot. This also has the effect of not breaking x86's command line parameter for specifying the reboot cpu. Note, this code was shamelessly copied from arch/x86/kernel/reboot.c with bits removed pertaining to the reboot_cpu command line parameter. Signed-off-by: Robin Holt <holt@sgi.com> Tested-by: Shawn Guo <shawn.guo@linaro.org> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds1-0/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull compat cleanup from Al Viro: "Mostly about syscall wrappers this time; there will be another pile with patches in the same general area from various people, but I'd rather push those after both that and vfs.git pile are in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: syscalls.h: slightly reduce the jungles of macros get rid of union semop in sys_semctl(2) arguments make do_mremap() static sparc: no need to sign-extend in sync_file_range() wrapper ppc compat wrappers for add_key(2) and request_key(2) are pointless x86: trim sys_ia32.h x86: sys32_kill and sys32_mprotect are pointless get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC merge compat sys_ipc instances consolidate compat lookup_dcookie() convert vmsplice to COMPAT_SYSCALL_DEFINE switch getrusage() to COMPAT_SYSCALL_DEFINE switch epoll_pwait to COMPAT_SYSCALL_DEFINE convert sendfile{,64} to COMPAT_SYSCALL_DEFINE switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect make HAVE_SYSCALL_WRAPPERS unconditional consolidate cond_syscall and SYSCALL_ALIAS declarations teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long get rid of duplicate logics in __SC_....[1-6] definitions
2013-04-30kernel/sys.c: make prctl(PR_SET_MM) generally availableAmnon Shiloh1-8/+2
The purpose of this patch is to allow privileged processes to set their own per-memory memory-region fields: start_code, end_code, start_data, end_data, start_brk, brk, start_stack, arg_start, arg_end, env_start, env_end. This functionality is needed by any application or package that needs to reconstruct Linux processes, that is, to start them in any way other than by means of an "execve()" from an executable file. This includes: 1. Restoring processes from a checkpoint-file (by all potential user-level checkpointing packages, not only CRIU's). 2. Restarting processes on another node after process migration. 3. Starting duplicated copies of a running process (for reliability and high-availablity). 4. Starting a process from an executable format that is not supported by Linux, thus requiring a "manual execve" by a user-level utility. 5. Similarly, starting a process from a networked and/or crypted executable that, for confidentiality, licensing or other reasons, may not be written to the local file-systems. The code that does that was already included in the Linux kernel by the CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is normally disabled. The patch removes those ifdefs. Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30kernel/timer.c: move some non timer related syscalls to kernel/sys.cStephen Rothwell1-0/+211
Andrew Morton noted: akpm3:/usr/src/25> grep SYSCALL kernel/timer.c SYSCALL_DEFINE1(alarm, unsigned int, seconds) SYSCALL_DEFINE0(getpid) SYSCALL_DEFINE0(getppid) SYSCALL_DEFINE0(getuid) SYSCALL_DEFINE0(geteuid) SYSCALL_DEFINE0(getgid) SYSCALL_DEFINE0(getegid) SYSCALL_DEFINE0(gettid) SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info) COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info) Only one of those should be in kernel/timer.c. Who wrote this thing? [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-08PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()Huacai Chen1-1/+2
As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core subsystems PM) say, syscore_ops operations should be carried with one CPU on-line and interrupts disabled. However, after commit f96972f2d (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()), syscore_shutdown() is called before disable_nonboot_cpus(), so break the rules. We have a MIPS machine with a 8259A PIC, and there is an external timer (HPET) linked at 8259A. Since 8259A has been shutdown too early (by syscore_shutdown()), disable_nonboot_cpus() runs without timer interrupt, so it hangs and reboot fails. This patch call syscore_shutdown() a little later (after disable_nonboot_cpus()) to avoid reboot failure, this is the same way as poweroff does. For consistency, add disable_nonboot_cpus() to kernel_halt(). Signed-off-by: Huacai Chen <chenhc@lemote.com> Cc: <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-03-22poweroff: change orderly_poweroff() to use schedule_work()Oleg Nesterov1-25/+32
David said: Commit 6c0c0d4d1080 ("poweroff: fix bug in orderly_poweroff()") apparently fixes one bug in orderly_poweroff(), but introduces another. The comments on orderly_poweroff() claim it can be called from any context - and indeed we call it from interrupt context in arch/powerpc/platforms/pseries/ras.c for example. But since that commit this is no longer safe, since call_usermodehelper_fns() is not safe in interrupt context without the UMH_NO_WAIT option. orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is sleepable. Move the "force" logic into __orderly_poweroff() and change orderly_poweroff() to use the global poweroff_work which simply calls __orderly_poweroff(). While at it, remove the unneeded "int argc" and change argv_split() to use GFP_KERNEL. We use the global "bool poweroff_force" to pass the argument, this can obviously affect the previous request if it is pending/running. So we only allow the "false => true" transition assuming that the pending "true" should succeed anyway. If schedule_work() fails after that we know that work->func() was not called yet, it must see the new value. This means that orderly_poweroff() becomes async even if we do not run the command and always succeeds, schedule_work() can only fail if the work is already pending. We can export __orderly_poweroff() and change the non-atomic callers which want the old semantics. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by: David Gibson <david@gibson.dropbear.id.au> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Feng Hong <hongfeng@marvell.com> Cc: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-03switch getrusage() to COMPAT_SYSCALL_DEFINEAl Viro1-0/+14
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27usermodehelper: cleanup/fix __orderly_poweroff() && argv_free()Oleg Nesterov1-8/+2
__orderly_poweroff() does argv_free() if call_usermodehelper_fns() returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was not triggered by the failing call_usermodehelper_setup(), in this case both __orderly_poweroff() and argv_cleanup() can do kfree(). Kill argv_cleanup() and change __orderly_poweroff() to call argv_free() unconditionally like do_coredump() does. This info->cleanup() is not needed (and wrong) since 6c0c0d4d "fix bug in orderly_poweroff() which did the UMH_NO_WAIT => UMH_WAIT_EXEC change, we can rely on the fact that CLONE_VFORK can't return until do_execve() succeeds/fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: hongfeng <hongfeng@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-25Merge branch 'for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace and namespace infrastructure changes from Eric W Biederman: "This set of changes starts with a few small enhnacements to the user namespace. reboot support, allowing more arbitrary mappings, and support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the user namespace root. I do my best to document that if you care about limiting your unprivileged users that when you have the user namespace support enabled you will need to enable memory control groups. There is a minor bug fix to prevent overflowing the stack if someone creates way too many user namespaces. The bulk of the changes are a continuation of the kuid/kgid push down work through the filesystems. These changes make using uids and gids typesafe which ensures that these filesystems are safe to use when multiple user namespaces are in use. The filesystems converted for 3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The changes for these filesystems were a little more involved so I split the changes into smaller hopefully obviously correct changes. XFS is the only filesystem that remains. I was hoping I could get that in this release so that user namespace support would be enabled with an allyesconfig or an allmodconfig but it looks like the xfs changes need another couple of days before it they are ready." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits) cifs: Enable building with user namespaces enabled. cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t cifs: Convert struct cifs_sb_info to use kuids and kgids cifs: Modify struct smb_vol to use kuids and kgids cifs: Convert struct cifsFileInfo to use a kuid cifs: Convert struct cifs_fattr to use kuid and kgids cifs: Convert struct tcon_link to use a kuid. cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t cifs: Convert from a kuid before printing current_fsuid cifs: Use kuids and kgids SID to uid/gid mapping cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc cifs: Use BUILD_BUG_ON to validate uids and gids are the same size cifs: Override unmappable incoming uids and gids nfsd: Enable building with user namespaces enabled. nfsd: Properly compare and initialize kuids and kgids nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids nfsd: Modify nfsd4_cb_sec to use kuids and kgids nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion nfsd: Convert nfsxdr to use kuids and kgids nfsd: Convert nfs3xdr to use kuids and kgids ...
2013-02-22new helper: file_inode(file)Al Viro1-4/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-21sys_prctl(): coding-style cleanupAndrew Morton1-145/+143
Remove a tabstop from the switch statement, in the usual fashion. A few instances of weirdwrapping were removed as a result. Cc: Chen Gang <gang.chen@asianux.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21sys_prctl(): arg2 is unsigned long which is never < 0Chen Gang1-1/+3
arg2 will never < 0, for its type is 'unsigned long' Also, use the provided macros. Signed-off-by: Chen Gang <gang.chen@asianux.com> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-26userns: Allow unprivileged rebootLi Zefan1-2/+3
In a container with its own pid namespace and user namespace, rebooting the system won't reboot the host, but terminate all the processes in it and thus have the container shutdown, so it's safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-28cputime: Rename thread_group_times to thread_group_cputime_adjustedFrederic Weisbecker1-3/+3
We have thread_group_cputime() and thread_group_times(). The naming doesn't provide enough information about the difference between these two APIs. To lower the confusion, rename thread_group_times() to thread_group_cputime_adjusted(). This name better suggests that it's a version of thread_group_cputime() that does some stabilization on the raw cputime values. ie here: scale on top of CFS runtime stats and bound lower value for monotonicity. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-10-19use clamp_t in UNAME26 fixKees Cook1-1/+1
The min/max call needed to have explicit types on some architectures (e.g. mn10300). Use clamp_t instead to avoid the warning: kernel/sys.c: In function 'override_release': kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default] Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19kernel/sys.c: fix stack memory content leak via UNAME26Kees Cook1-5/+7
Calling uname() with the UNAME26 personality set allows a leak of kernel stack contents. This fixes it by defensively calculating the length of copy_to_user() call, making the len argument unsigned, and initializing the stack buffer to zero (now technically unneeded, but hey, overkill). CVE-2012-0957 Reported-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Brad Spengler <spender@grsecurity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06poweroff: fix bug in orderly_poweroff()hongfeng1-1/+1
orderly_poweroff is trying to poweroff platform in two steps: step 1: Call user space application to poweroff step 2: If user space poweroff fail, then do a force power off if force param is set. The bug here is, step 1 is always successful with param UMH_NO_WAIT, which obey the design goal of orderly_poweroff. We have two choices here: UMH_WAIT_EXEC which means wait for the exec, but not the process; UMH_WAIT_PROC which means wait for the process to complete. we need to trade off the two choices: If using UMH_WAIT_EXEC, there is potential issue comments by Serge E. Hallyn: The exec will have started, but may for whatever (very unlikely) reason fail. If using UMH_WAIT_PROC, there is potential issue comments by Eric W. Biederman: If the caller is not running in a kernel thread then we can easily get into a case where the user space caller will block waiting for us when we are waiting for the user space caller. Thanks for their excellent ideas, based on the above discussion, we finally choose UMH_WAIT_EXEC, which is much more safe, if the user application really fails, we just complain the application itself, it seems a better choice here. Signed-off-by: Feng Hong <hongfeng@marvell.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()Shawn Guo1-0/+1
As kernel_power_off() calls disable_nonboot_cpus(), we may also want to have kernel_restart() call disable_nonboot_cpus(). Doing so can help machines that require boot cpu be the last alive cpu during reboot to survive with kernel restart. This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine requires that the restart routine be run on the primary cpu rather than secondary ones. Otherwise, the secondary core running the restart routine will fail to come to online after reboot. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-26switch simple cases of fget_light to fdgetAl Viro1-8/+8
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26switch prctl_set_mm_exe_file() to fget_light()Al Viro1-4/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-30kernel/sys.c: avoid argv_free(NULL)Andrew Morton1-19/+25
If argv_split() failed, the code will end up calling argv_free(NULL). Fix it up and clean things up a bit. Addresses Coverity report 703573. Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Alan Cox <alan@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30prctl: remove redunant assignment of "error" to zeroSasikantha babu1-11/+2
Just setting the "error" to error number is enough on failure and It doesn't require to set "error" variable to zero in each switch case, since it was already initialized with zero. And also removed return 0 in switch case with break statement Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge E. Hallyn <serge@hallyn.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11c/r: prctl: less paranoid prctl_set_mm_exe_file()Konstantin Khlebnikov1-6/+10
"no other files mapped" requirement from my previous patch (c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too paranoid, it forbids operation even if there mapped one shared-anon vma. Let's check that current mm->exe_file already unmapped, in this case exe_file symlink already outdated and its changing is reasonable. Plus, this patch fixes exit code in case operation success. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20c/r: prctl: Move PR_GET_TID_ADDRESS to a proper placeCyrill Gorcunov1-3/+3
During merging of PR_GET_TID_ADDRESS patch the code has been misplaced (it happened to appear under PR_MCE_KILL) in result noone can use this option. Fix it by moving code snippet to a proper place. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignmentCyrill Gorcunov1-14/+0
In commit b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps") the stack allocated via clone() is marked in /proc/<pid>/maps as [stack:%d] thus it might be out of the former mm->start_stack/end_stack values (and even has some custom VMA flags set). So to be able to restore mm->start_stack/end_stack drop vma flags test, but still require the underlying VMA to exist. As always note this feature is under CONFIG_CHECKPOINT_RESTORE and requires CAP_SYS_RESOURCE to be granted. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07c/r: prctl: add ability to get clear_tid_addressCyrill Gorcunov1-0/+13
Zero is written at clear_tid_address when the process exits. This functionality is used by pthread_join(). We already have sys_set_tid_address() to change this address for the current task but there is no way to obtain it from user space. Without the ability to find this address and dump it we can't restore pthread'ed apps which call pthread_join() once they have been restored. This patch introduces the PR_GET_TID_ADDRESS prctl option which allows the current process to obtain own clear_tid_address. This feature is available iif CONFIG_CHECKPOINT_RESTORE is set. [akpm@linux-foundation.org: fix prctl numbering] Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pedro Alves <palves@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07c/r: prctl: add minimal address test to PR_SET_MMCyrill Gorcunov1-1/+1
Make sure the address being set is greater than mmap_min_addr (as suggested by Kees Cook). Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removalKonstantin Khlebnikov1-12/+19
A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file"). After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until final mmput(), it never becomes NULL while task is alive. We can check for other mapped files in mm instead of checking mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in order to forbid second changing of mm->exe_file. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31c/r: prctl: add ability to set new mm_struct::exe_fileCyrill Gorcunov1-0/+56
When we do restore we would like to have a way to setup a former mm_struct::exe_file so that /proc/pid/exe would point to the original executable file a process had at checkpoint time. For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a file descriptor which will be set as a source for new /proc/$pid/exe symlink. Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE vmas present for current process, simply because this feature is a special to C/R and mm::num_exe_file_vmas become meaningless after that. To minimize the amount of transition the /proc/pid/exe symlink might have, this feature is implemented in one-shot manner. Thus once changed the symlink can't be changed again. This should help sysadmins to monitor the symlinks over all process running in a system. In particular one could make a snapshot of processes and ring alarm if there unexpected changes of /proc/pid/exe's in a system. Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and the caller must have CAP_SYS_RESOURCE capability granted, otherwise the request to change symlink will be rejected. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31c/r: prctl: extend PR_SET_MM to set up more mm_struct entriesCyrill Gorcunov1-51/+83
During checkpoint we dump whole process memory to a file and the dump includes process stack memory. But among stack data itself, the stack carries additional parameters such as command line arguments, environment data and auxiliary vector. So when we do restore procedure and once we've restored stack data itself we need to setup mm_struct::arg_start/end, env_start/end, so restored process would be able to find command line arguments and environment data it had at checkpoint time. The same applies to auxiliary vector. For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START | ENV_END | AUXV) codes are introduced. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31kmod: convert two call sites to call_usermodehelper_fns()Boaz Harrosh1-11/+8
Both kernel/sys.c && security/keys/request_key.c where inlining the exact same code as call_usermodehelper_fns(); So simply convert these sites to directly use call_usermodehelper_fns(). Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31sethostname/setdomainname: notify userspace when there is a change in ↵Sasikantha babu1-2/+2
uts_kern_table sethostname() and setdomainname() notify userspace on failure (without modifying uts_kern_table). Change things so that we only notify userspace on success, when uts_kern_table was actually modified. Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: WANG Cong <amwang@redhat.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds1-92/+174
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-03userns: Convert ptrace, kill, set_priority permission checks to work with ↵Eric W. Biederman1-10/+8
kuids and kgids Update the permission checks to use the new uid_eq and gid_eq helpers and remove the now unnecessary user_ns equality comparison. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert setting and getting uid and gid system calls to use kuid and ↵Eric W. Biederman1-67/+149
kgid Convert setregid, setgid, setreuid, setuid, setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid, getuid, geteuid, getgid, getegid, waitpid, waitid, wait4. Convert userspace uids and gids into kuids and kgids before being placed on struct cred. Convert struct cred kuids and kgids into userspace uids and gids when returning them. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Store uid and gid values in struct cred with kuid_t and kgid_t typesEric W. Biederman1-17/+9
cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-14seccomp: add system call filtering using BPFWill Drewry1-1/+1
[This patch depends on luto@mit.edu's no_new_privs patch: https://lkml.org/lkml/2012/1/30/264 The whole series including Andrew's patches can be found here: https://github.com/redpig/linux/tree/seccomp Complete diff here: https://github.com/redpig/linux/compare/1dc65fed...seccomp ] This patch adds support for seccomp mode 2. Mode 2 introduces the ability for unprivileged processes to install system call filtering policy expressed in terms of a Berkeley Packet Filter (BPF) program. This program will be evaluated in the kernel for each system call the task makes and computes a result based on data in the format of struct seccomp_data. A filter program may be installed by calling: struct sock_fprog fprog = { ... }; ... prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog); The return value of the filter program determines if the system call is allowed to proceed or denied. If the first filter program installed allows prctl(2) calls, then the above call may be made repeatedly by a task to further reduce its access to the kernel. All attached programs must be evaluated before a system call will be allowed to proceed. Filter programs will be inherited across fork/clone and execve. However, if the task attaching the filter is unprivileged (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This ensures that unprivileged tasks cannot attach filters that affect privileged tasks (e.g., setuid binary). There are a number of benefits to this approach. A few of which are as follows: - BPF has been exposed to userland for a long time - BPF optimization (and JIT'ing) are well understood - Userland already knows its ABI: system call numbers and desired arguments - No time-of-check-time-of-use vulnerable data accesses are possible. - system call arguments are loaded on access only to minimize copying required for system call policy decisions. Mode 2 support is restricted to architectures that enable HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on syscall_get_arguments(). The full desired scope of this feature will add a few minor additional requirements expressed later in this series. Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be the desired additional functionality. No architectures are enabled in this patch. Signed-off-by: Will Drewry <wad@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reviewed-by: Indan Zupancic <indan@nul.nu> Acked-by: Eric Paris <eparis@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> v18: - rebase to v3.4-rc2 - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org) - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu) - add a comment for get_u32 regarding endianness (akpm@) - fix other typos, style mistakes (akpm@) - added acked-by v17: - properly guard seccomp filter needed headers (leann@ubuntu.com) - tighten return mask to 0x7fff0000 v16: - no change v15: - add a 4 instr penalty when counting a path to account for seccomp_filter size (indan@nul.nu) - drop the max insns to 256KB (indan@nul.nu) - return ENOMEM if the max insns limit has been hit (indan@nul.nu) - move IP checks after args (indan@nul.nu) - drop !user_filter check (indan@nul.nu) - only allow explicit bpf codes (indan@nul.nu) - exit_code -> exit_sig v14: - put/get_seccomp_filter takes struct task_struct (indan@nul.nu,keescook@chromium.org) - adds seccomp_chk_filter and drops general bpf_run/chk_filter user - add seccomp_bpf_load for use by net/core/filter.c - lower max per-process/per-hierarchy: 1MB - moved nnp/capability check prior to allocation (all of the above: indan@nul.nu) v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com) - removed copy_seccomp (keescook@chromium.org,indan@nul.nu) - reworded the prctl_set_seccomp comment (indan@nul.nu) v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com) - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu) - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu) - pare down Kconfig doc reference. - extra comment clean up v10: - seccomp_data has changed again to be more aesthetically pleasing (hpa@zytor.com) - calling convention is noted in a new u32 field using syscall_get_arch. This allows for cross-calling convention tasks to use seccomp filters. (hpa@zytor.com) - lots of clean up (thanks, Indan!) v9: - n/a v8: - use bpf_chk_filter, bpf_run_filter. update load_fns - Lots of fixes courtesy of indan@nul.nu: -- fix up load behavior, compat fixups, and merge alloc code, -- renamed pc and dropped __packed, use bool compat. -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch dependencies v7: (massive overhaul thanks to Indan, others) - added CONFIG_HAVE_ARCH_SECCOMP_FILTER - merged into seccomp.c - minimal seccomp_filter.h - no config option (part of seccomp) - no new prctl - doesn't break seccomp on systems without asm/syscall.h (works but arg access always fails) - dropped seccomp_init_task, extra free functions, ... - dropped the no-asm/syscall.h code paths - merges with network sk_run_filter and sk_chk_filter v6: - fix memory leak on attach compat check failure - require no_new_privs || CAP_SYS_ADMIN prior to filter installation. (luto@mit.edu) - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com) - cleaned up Kconfig (amwang@redhat.com) - on block, note if the call was compat (so the # means something) v5: - uses syscall_get_arguments (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org) - uses union-based arg storage with hi/lo struct to handle endianness. Compromises between the two alternate proposals to minimize extra arg shuffling and account for endianness assuming userspace uses offsetof(). (mcgrathr@chromium.org, indan@nul.nu) - update Kconfig description - add include/seccomp_filter.h and add its installation - (naive) on-demand syscall argument loading - drop seccomp_t (eparis@redhat.com) v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS - now uses current->no_new_privs (luto@mit.edu,torvalds@linux-foundation.com) - assign names to seccomp modes (rdunlap@xenotime.net) - fix style issues (rdunlap@xenotime.net) - reworded Kconfig entry (rdunlap@xenotime.net) v3: - macros to inline (oleg@redhat.com) - init_task behavior fixed (oleg@redhat.com) - drop creator entry and extra NULL check (oleg@redhat.com) - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com) - adds tentative use of "always_unprivileged" as per torvalds@linux-foundation.org and luto@mit.edu v2: - (patch 2 only) Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-14Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privsAndy Lutomirski1-0/+10
With this change, calling prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) disables privilege granting operations at execve-time. For example, a process will not be able to execute a setuid binary to change their uid or gid if this bit is set. The same is true for file capabilities. Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that LSMs respect the requested behavior. To determine if the NO_NEW_PRIVS bit is set, a task may call prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0); It returns 1 if set and 0 if it is not set. If any of the arguments are non-zero, it will return -1 and set errno to -EINVAL. (PR_SET_NO_NEW_PRIVS behaves similarly.) This functionality is desired for the proposed seccomp filter patch series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the system call behavior for itself and its child tasks without being able to impact the behavior of a more privileged task. Another potential use is making certain privileged operations unprivileged. For example, chroot may be considered "safe" if it cannot affect privileged tasks. Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is set and AppArmor is in use. It is fixed in a subsequent patch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Will Drewry <wad@chromium.org> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> v18: updated change desc v17: using new define values as per 3.4 Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-07userns: Disassociate user_struct from the user_namespace.Eric W. Biederman1-11/+23
Modify alloc_uid to take a kuid and make the user hash table global. Stop holding a reference to the user namespace in struct user_struct. This simplifies the code and makes the per user accounting not care about which user namespace a uid happens to appear in. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-07userns: Add kuid_t and kgid_t and associated infrastructure in uidgid.hEric W. Biederman1-2/+0
Start distinguishing between internal kernel uids and gids and values that userspace can use. This is done by introducing two new types: kuid_t and kgid_t. These types and their associated functions are infrastructure are declared in the new header uidgid.h. Ultimately there will be a different implementation of the mapping functions for use with user namespaces. But to keep it simple we introduce the mapping functions first to separate the meat from the mechanical code conversions. Export overflowuid and overflowgid so we can use from_kuid_munged and from_kgid_munged in modular code. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-07userns: Use cred->user_ns instead of cred->user->user_nsEric W. Biederman1-4/+4
Optimize performance and prepare for the removal of the user_ns reference from user_struct. Remove the slow long walk through cred->user->user_ns and instead go straight to cred->user_ns. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>