aboutsummaryrefslogtreecommitdiffstats
path: root/mm/memory.c
AgeCommit message (Collapse)AuthorFilesLines
6 hoursmm: simplify and improve print_vma_addr() outputLinus Torvalds1-13/+6
Use '%pD' to print out the filename, and print out the actual offset within the file too, rather than just what the virtual address of the mapping is (which doesn't tell you anything about any mapping offsets). Also, use the exact vma_lookup() instead of find_vma() - the latter looks up any vma _after_ the address, which is of questionable value (yes, maybe you fell off the beginning, but you'd be more likely to fall off the end). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 daysMerge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds1-134/+171
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-05mm: optimise vmf_anon_prepare() for VMAs without an anon_vmaMatthew Wilcox (Oracle)1-4/+9
If the mmap_lock can be taken for read, we can call __anon_vma_prepare() while holding it, saving ourselves a trip back through the fault handler. Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: delay the check for a NULL anon_vmaMatthew Wilcox (Oracle)1-11/+18
Instead of checking the anon_vma early in the fault path where all page faults pay the cost, delay it until we know we're going to need the anon_vma to be filled in. This will have a slight negative effect on the first fault in an anonymous VMA, but it shortens every other page fault. It also makes the code slightly cleaner as the anon and file backed fault handling look more similar. The Intel kernel test bot reports a 3x improvement in vm-scalability throughput with the small-allocs-mt test. This is clearly an extreme situation that won't be replicated in any real-world workload, but it's a nice win. https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: simplify thp_vma_allowable_orderMatthew Wilcox1-4/+6
Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()Kefeng Wang1-5/+5
Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the unnecessary FAULT_FLAG_ORIG_PTE_VALID check/pte_marker_entry_uffd_wp() in most pagefault, note, the function vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions, the difference is shown below, perf date, perf report -i perf.data.before | grep vmf 0.17% 0.13% lat_pagefault [kernel.kallsyms] [k] vmf_orig_pte_uffd_wp.part.0.isra.0 perf report -i perf.data.after | grep vmf lat_pagefault -W 5 -N 5 /tmp/XXX latency before after diff average(8 tests) 0.262675 0.2600375 -0.0026375 Although it's a small, but the uffd_wp is a new feature than previous kernel, when the vma is not registered with UFFD_WP, let's avoid to execute the new logical, also adding __always_inline attribute to vmf_orig_pte_uffd_wp(), which make set_pte_range() only check VM_UFFD_WP flags without the function call. In addition, directly call the vmf_orig_pte_uffd_wp() in do_anonymous_page() and set_pte_range() to save an uffd_wp variable. Link: https://lkml.kernel.org/r/20240422030039.3293568-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/memory: add any_dirty optional pointer to folio_pte_batch()Lance Yang1-2/+2
This commit adds the any_dirty pointer as an optional parameter to folio_pte_batch() function. By using both the any_young and any_dirty pointers, madvise_free can make smarter decisions about whether to clear the PTEs when marking large folios as lazyfree. Link: https://lkml.kernel.org/r/20240418134435.6092-4-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: move mm counter updating out of set_pte_range()Kefeng Wang1-3/+5
Patch series "mm: batch mm counter updating in filemap_map_pages()", v3. Let's batch mm counter updating to accelerate filemap_map_pages(). This patch (of 2): In order to support batch mm counter updating in filemap_map_pages(), move mm counter updating out of set_pte_range(), the folios are file from filemap, and distinguish folios by vmf->flags and vma->vm_flags from another caller finish_fault(). Link: https://lkml.kernel.org/r/20240412064751.119015-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240412064751.119015-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback countersBarry Song1-0/+5
Patch series "mm: add per-order mTHP alloc and swpout counters", v6. The patchset introduces a framework to facilitate mTHP counters, starting with the allocation and swap-out counters. Currently, only four new nodes are appended to the stats directory for each mTHP size. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats anon_fault_alloc anon_fault_fallback anon_fault_fallback_charge anon_swpout anon_swpout_fallback These nodes are crucial for us to monitor the fragmentation levels of both the buddy system and the swap partitions. In the future, we may consider adding additional nodes for further insights. This patch (of 4): Profiling a system blindly with mTHP has become challenging due to the lack of visibility into its operations. Presenting the success rate of mTHP allocations appears to be pressing need. Recently, I've been experiencing significant difficulty debugging performance improvements and regressions without these figures. It's crucial for us to understand the true effectiveness of mTHP in real-world scenarios, especially in systems with fragmented memory. This patch establishes the framework for per-order mTHP counters. It begins by introducing the anon_fault_alloc and anon_fault_fallback counters. Additionally, to maintain consistency with thp_fault_fallback_charge in /proc/vmstat, this patch also tracks anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP. Incorporating additional counters should now be straightforward as well. Link: https://lkml.kernel.org/r/20240412114858.407208-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/memory: use folio_mapcount() in zap_present_folio_ptes()David Hildenbrand1-2/+1
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In zap_present_folio_ptes(), let's simply check the folio mapcount(). If there is some issue, it will underflow at some point either way when unmapping. As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP"), we already documented "If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.". There is no change for small folios. For large folios, we'll now catch more underflows when batch-unmapping, because instead of only testing the mapcount of the first subpage, we'll test if the folio mapcount underflows. Link: https://lkml.kernel.org/r/20240409192301.907377-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: follow_pte() improvementsDavid Hildenbrand1-5/+15
follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO VMAs. Let's perform some more sanity checks to make this exported function harder to abuse. Further, extend the doc a bit, it still focuses on the KVM use case with MMU notifiers. Drop the KVM+follow_pfn() comment, follow_pfn() is no more, and we have other users nowadays. Also extend the doc regarding refcounted pages and the interaction with MMU notifiers. KVM is one example that uses MMU notifiers and can deal with refcounted pages properly. VFIO is one example that doesn't use MMU notifiers, and to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)). Protection changes are less of a concern for users like VFIO: the behavior is similar to longterm-pinning a page, and getting the PTE protection changed afterwards. The primary concern with refcounted pages is use-after-free, which callers should be aware of. Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: pass VMA instead of MM to follow_pte()David Hildenbrand1-7/+8
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLDRyan Roberts1-2/+2
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is and move to the next pte in the range and continue work on the proceeding folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would have likely lead to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficent than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()Ryan Roberts1-6/+7
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: support multi-size THP numa balancingBaolin Wang1-12/+50
Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable. Allow scanning mTHP: Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages") skips shared CoW pages' NUMA page migration to avoid shared data segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") change to use page_count() to avoid GUP pages migration, that will also skip the mTHP numa scanning. Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP issue, although there is still a GUP race, the issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile, use the folio_likely_mapped_shared() to skip shared CoW pages though this is not a precise sharers count. To check if the folio is shared, ideally we want to make sure every page is mapped to the same process, but doing that seems expensive and using the estimated mapcount seems can work when running autonuma benchmark. Allow migrating mTHP: As mentioned in the previous thread[1], large folios (including THP) are more susceptible to false sharing issues among threads than 4K base page, leading to pages ping-pong back and forth during numa balancing, which is currently not easy to resolve. Therefore, as a start to support mTHP numa balancing, we can follow the PMD mapped THP's strategy, that means we can reuse the 2-stage filter in should_numa_migrate_memory() to check if the mTHP is being heavily contended among threads (through checking the CPU id and pid of the last access) to avoid false sharing at some degree. Thus, we can restore all PTE maps upon the first hint page fault of a large folio to follow the PMD mapped THP's strategy. In the future, we can continue to optimize the NUMA balancing algorithm to avoid the false sharing issue with large folios as much as possible. Performance data: Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum Base: 2024-03-25 mm-unstable branch Enable mTHP to run autonuma-benchmark mTHP:16K Base Patched numa01 numa01 224.70 143.48 numa01_THREAD_ALLOC numa01_THREAD_ALLOC 118.05 47.43 numa02 numa02 13.45 9.29 numa02_SMT numa02_SMT 14.80 7.50 mTHP:64K Base Patched numa01 numa01 216.15 114.40 numa01_THREAD_ALLOC numa01_THREAD_ALLOC 115.35 47.41 numa02 numa02 13.24 9.25 numa02_SMT numa02_SMT 14.67 7.34 mTHP:128K Base Patched numa01 numa01 205.13 144.45 numa01_THREAD_ALLOC numa01_THREAD_ALLOC 112.93 41.88 numa02 numa02 13.16 9.18 numa02_SMT numa02_SMT 14.81 7.49 [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/ [baolin.wang@linux.alibaba.com: v3] Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: factor out the numa mapping rebuilding into a new helperBaolin Wang1-7/+15
Patch series "support multi-size THP numa balancing", v2. This patchset tries to support mTHP numa balancing, as a simple solution to start, the NUMA balancing algorithm for mTHP will follow the THP strategy as the basic support. Please find details in each patch. This patch (of 2): To support large folio's numa balancing, factor out the numa mapping rebuilding into a new helper as a preparation. Link: https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback casesBarry Song1-0/+3
Fallback rates surpassing 90% have been observed on phones utilizing 64KiB CONT-PTE mTHP. In these scenarios, when one out of every 16 PTEs fails to allocate large folios, the remaining 15 PTEs fallback. Consequently, invoking vma_thp_gfp_mask seems redundant in such cases. Furthermore, abstaining from its use can also contribute to improved code readability. Link: https://lkml.kernel.org/r/20240329073750.20012-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: init_mlocked_on_free_v3York Jasper Niebuhr1-0/+6
Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory so the real-world system overhead should be close to zero. Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the according memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/20240329145605.149917-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: York Jasper Niebuhr <yjnworkstation@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: introduce vma_pgtable_walk_{begin|end}()Peter Xu1-0/+12
Introduce per-vma begin()/end() helpers for pgtable walks. This is a preparation work to merge hugetlb pgtable walkers with generic mm. The helpers need to be called before and after a pgtable walk, will start to be needed if the pgtable walker code supports hugetlb pages. It's a hook point for any type of VMA, but for now only hugetlb uses it to stablize the pgtable pages from getting away (due to possible pmd unsharing). Link: https://lkml.kernel.org/r/20240327152332.950956-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: move follow_phys to arch/x86/mm/pat/memtype.cChristoph Hellwig1-32/+0
follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. Move it there and hardcode the two arguments that get the same values passed by both callers. [david@redhat.com: conflict resolutions] Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: remove follow_pfnChristoph Hellwig1-34/+2
Remove follow_pfn now that the last user is gone. Link: https://lkml.kernel.org/r/20240324234542.2038726-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()David Hildenbrand1-1/+1
Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively", not the exact estimate of sharers. Let's consolidate and unify the condition users are checking. While at it clarify the semantics and extend the discussion on the fuzziness. Use the "likely mapped shared" terminology to better express what the (adjusted) function actually checks. Whether a partially-mappable folio is more likely to not be partially mapped than partially mapped is debatable. In the future, we might be able to improve our estimate for partially-mappable folios, though. Note that we will now consistently detect "mapped shared" only if the first subpage is actually mapped multiple times. When the first subpage is not mapped, we will consistently detect it as "mapped exclusively". This change should currently only affect the usage in madvise_free_pte_range() and queue_folios_pte_range() for large folios: if the first page was already unmapped, we would have skipped the folio. [david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup] Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25arm64: mm: swap: support THP_SWAP on hardware with MTEBarry Song1-1/+1
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with MTE as the MTE code works with the assumption tags save/restore is always handling a folio with only one page. The limitation should be removed as more and more ARM64 SoCs have this feature. Co-existence of MTE and THP_SWAP becomes more and more important. This patch makes MTE tags saving support large folios, then we don't need to split large folios into base pages for swapping out on ARM64 SoCs with MTE any more. arch_prepare_to_swap() should take folio rather than page as parameter because we support THP swap-out as a whole. It saves tags for all pages in a large folio. As now we are restoring tags based-on folio, in arch_swap_restore(), we may increase some extra loops and early-exitings while refaulting a large folio which is still in swapcache in do_swap_page(). In case a large folio has nr pages, do_swap_page() will only set the PTE of the particular page which is causing the page fault. Thus do_swap_page() runs nr times, and each time, arch_swap_restore() will loop nr times for those subpages in the folio. So right now the algorithmic complexity becomes O(nr^2). Once we support mapping large folios in do_swap_page(), extra loops and early-exitings will decrease while not being completely removed as a large folio might get partially tagged in corner cases such as, 1. a large folio in swapcache can be partially unmapped, thus, MTE tags for the unmapped pages will be invalidated; 2. users might use mprotect() to set MTEs on a part of a large folio. arch_thp_swp_supported() is dropped since ARM64 MTE was the only one who needed it. Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Steven Price <steven.price@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm/mempolicy: use numa_node_id() instead of cpu_to_node()Donet Tom1-3/+5
Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy:, v4. This patchset is to optimize the cross-socket memory access with MPOL_PREFERRED_MANY policy. To test this patch we ran the following test on a 3 node system. Node 0 - 2GB - Tier 1 Node 1 - 11GB - Tier 1 Node 6 - 10GB - Tier 2 Below changes are made to memcached to set the memory policy, It select Node0 and Node1 as preferred nodes. #include <numaif.h> #include <numa.h> unsigned long nodemask; int ret; nodemask = 0x03; ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING, &nodemask, 10); /* If MPOL_F_NUMA_BALANCING isn't supported, * fall back to MPOL_PREFERRED_MANY */ if (ret < 0 && errno == EINVAL){ printf("set mem policy normal\n"); ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10); } if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } Test Procedure: =============== 1. Make sure memory tiring and demotion are enabled. 2. Start memcached. # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7 -d -s "/tmp/memcached.sock" 3. Run memtier_benchmark to store 3200000 keys. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1 --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024 4. Start a memory eater on node 0 and 1. This will demote all memcached pages to node 6. 5. Make sure all the memcached pages got demoted to lower tier by reading /proc/<memcaced PID>/numa_maps. # cat /proc/2771/numa_maps --- default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 --- 6. Kill memory eater. 7. Read the pgpromote_success counter. 8. Start reading the keys by running memtier_benchmark. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R --key-minimum=1 --key-maximum=3200000 -n allkeys --threads=64 -c 1 -R -x 6 9. Read the pgpromote_success counter. Test Results: ============= Without Patch ------------------ 1. pgpromote_success before test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 pgpromote_success after test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency ------------------------------------------------------------------ Sets 0.00 --- --- --- --- Gets 305792.03 305791.93 0.10 0.18949 0.16700 Waits 0.00 --- --- --- --- Totals 305792.03 305791.93 0.10 0.18949 0.16700 ====================================== p99 Latency p99.9 Latency KB/sec ------------------------------------- --- --- 0.00 0.44700 1.71100 11542.69 --- --- --- 0.44700 1.71100 11542.69 With Patch --------------- 1. pgpromote_success before test Node 0: pgpromote_success 5 Node 1: pgpromote_success 89386 pgpromote_success after test Node 0: pgpromote_success 57895 Node 1: pgpromote_success 141463 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ==================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency -------------------------------------------------------------------- Sets 0.00 --- --- --- --- Gets 521942.24 521942.07 0.17 0.11459 0.10300 Waits 0.00 --- --- --- --- Totals 521942.24 521942.07 0.17 0.11459 0.10300 ======================================= p99 Latency p99.9 Latency KB/sec --------------------------------------- --- --- 0.00 0.23100 0.31900 19701.68 --- --- --- 0.23100 0.31900 19701.68 Test Result Analysis: ===================== 1. With patch we could observe pages are getting promoted. 2. Memtier-benchmark results shows that, with the patch, performance has increased more than 50%. Ops/sec without fix - 305792.03 Ops/sec with fix - 521942.24 This patch (of 2): Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is quicker. smp_processor_id is guaranteed to be stable in the 'mpol_misplaced()' function because it is called with ptl held. lockdep_assert_held was added to ensure that. No functional change in this patch. [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment] Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm/treewide: replace pXd_huge() with pXd_leaf()Peter Xu1-1/+1
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-19Merge x86 bugfixes from Linux 6.9-rc3Paolo Bonzini1-1/+7
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12mm: replace set_pte_at_notify() with just set_pte_at()Paolo Bonzini1-6/+1
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-05x86/mm/pat: fix VM_PAT handling in COW mappingsDavid Hildenbrand1-0/+4
PAT handling won't do the right thing in COW mappings: the first PTE (or, in fact, all PTEs) can be replaced during write faults to point at anon folios. Reliably recovering the correct PFN and cachemode using follow_phys() from PTEs will not work in COW mappings. Using follow_phys(), we might just get the address+protection of the anon folio (which is very wrong), or fail on swap/nonswap entries, failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and track_pfn_copy(), not properly calling free_pfn_range(). In free_pfn_range(), we either wouldn't call memtype_free() or would call it with the wrong range, possibly leaking memory. To fix that, let's update follow_phys() to refuse returning anon folios, and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings if we run into that. We will now properly handle untrack_pfn() with COW mappings, where we don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if the first page was replaced by an anon folio, though: we'd have to store the cachemode in the VMA to make this work, likely growing the VMA size. For now, lets keep it simple and let track_pfn_copy() just fail in that case: it would have failed in the past with swap/nonswap entries already, and it would have done the wrong thing with anon folios. Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn(): <--- C reproducer ---> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <liburing.h> int main(void) { struct io_uring_params p = {}; int ring_fd; size_t size; char *map; ring_fd = io_uring_setup(1, &p); if (ring_fd < 0) { perror("io_uring_setup"); return 1; } size = p.sq_off.array + p.sq_entries * sizeof(unsigned); /* Map the submission queue ring MAP_PRIVATE */ map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, ring_fd, IORING_OFF_SQ_RING); if (map == MAP_FAILED) { perror("mmap"); return 1; } /* We have at least one page. Let's COW it. */ *map = 0; pause(); return 0; } <--- C reproducer ---> On a system with 16 GiB RAM and swap configured: # ./iouring & # memhog 16G # killall iouring [ 301.552930] ------------[ cut here ]------------ [ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100 [ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g [ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1 [ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4 [ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100 [ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000 [ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282 [ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047 [ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200 [ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000 [ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000 [ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000 [ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000 [ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0 [ 301.565725] PKRU: 55555554 [ 301.565944] Call Trace: [ 301.566148] <TASK> [ 301.566325] ? untrack_pfn+0xf4/0x100 [ 301.566618] ? __warn+0x81/0x130 [ 301.566876] ? untrack_pfn+0xf4/0x100 [ 301.567163] ? report_bug+0x171/0x1a0 [ 301.567466] ? handle_bug+0x3c/0x80 [ 301.567743] ? exc_invalid_op+0x17/0x70 [ 301.568038] ? asm_exc_invalid_op+0x1a/0x20 [ 301.568363] ? untrack_pfn+0xf4/0x100 [ 301.568660] ? untrack_pfn+0x65/0x100 [ 301.568947] unmap_single_vma+0xa6/0xe0 [ 301.569247] unmap_vmas+0xb5/0x190 [ 301.569532] exit_mmap+0xec/0x340 [ 301.569801] __mmput+0x3e/0x130 [ 301.570051] do_exit+0x305/0xaf0 ... Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Wupeng Ma <mawupeng1@huawei.com> Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines") Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3") Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26mm/memory: fix missing pte marker for !page on pte zapsPeter Xu1-1/+3
Commit 0cf18e839f64 of large folio zap work broke uffd-wp. Now mm's uffd unit test "wp-unpopulated" will trigger this WARN_ON_ONCE(). The WARN_ON_ONCE() asserts that an VMA cannot be registered with userfaultfd-wp if it contains a !normal page, but it's actually possible. One example is an anonymous vma, register with uffd-wp, read anything will install a zero page. Then when zap on it, this should trigger. What's more, removing that WARN_ON_ONCE may not be enough either, because we should also not rely on "whether it's a normal page" to decide whether pte marker is needed. For example, one can register wr-protect over some DAX regions to track writes when UFFD_FEATURE_WP_ASYNC enabled, in which case it can have page==NULL for a devmap but we may want to keep the marker around. Link: https://lkml.kernel.org/r/20240313213107.235067-1-peterx@redhat.com Fixes: 0cf18e839f64 ("mm/memory: handle !page case in zap_present_pte() separately") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-13mm: prohibit the last subpage from reusing the entire large folioBarry Song1-0/+10
In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire large folio, resulting in the waste of (nr_pages - 1) pages. This wasted memory remains allocated until it is either unmapped or memory reclamation occurs. The following small program can serve as evidence of this behavior main() { #define SIZE 1024 * 1024 * 1024UL void *p = malloc(SIZE); memset(p, 0x11, SIZE); if (fork() == 0) _exit(0); memset(p, 0x12, SIZE); printf("done\n"); while(1); } For example, using a 1024KiB mTHP by: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled (1) w/o the patch, it takes 2GiB, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 84 5692 0 17 5669 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 2149 3627 0 19 3605 Swap: 0 0 0 (2) w/ the patch, it takes 1GiB only, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 89 5687 0 17 5664 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 1122 4655 0 17 4632 Swap: 0 0 0 This patch migrates the last subpage to a small folio and immediately returns the large folio to the system. It benefits both memory availability and anti-fragmentation. Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06mm: memory: fix shift-out-of-bounds in fault_around_bytes_setKefeng Wang1-1/+2
The rounddown_pow_of_two(0) is undefined, so val = 0 is not allowed in the fault_around_bytes_set(), and leads to shift-out-of-bounds, UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13 shift exponent 4294967295 is too large for 64-bit type 'long unsigned int' CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: dump_backtrace+0x94/0xec show_stack+0x18/0x24 dump_stack_lvl+0x78/0x90 dump_stack+0x18/0x24 ubsan_epilogue+0x10/0x44 __ubsan_handle_shift_out_of_bounds+0x98/0x134 fault_around_bytes_set+0xa4/0xb0 simple_attr_write_xsigned.isra.0+0xe4/0x1ac simple_attr_write+0x18/0x24 debugfs_attr_write+0x4c/0x98 vfs_write+0xd0/0x4b0 ksys_write+0x6c/0xfc __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x104 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xdc el0t_64_sync_handler+0xc0/0xc4 el0t_64_sync+0x190/0x194 ---[ end trace ]--- Fix it by setting the minimum val to PAGE_SIZE. Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.com Fixes: 53d36a56d8c4 ("mm: prefer fault_around_pages to fault_around_bytes") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Yue Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/all/CAEkJfYPim6DQqW1GqCiHLdh2-eweqk1fGyXqs3JM+8e1qGge8w@mail.gmail.com/ Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06mm: make folio_pte_batch available outside of mm/memory.cBarry Song1-76/+0
madvise, mprotect and some others might need folio_pte_batch to check if a range of PTEs are completely mapped to a large folio with contiguous physical addresses. Let's make it available in mm/internal.h. While at it, add proper kernel doc and sanity-check more input parameters using two additional VM_WARN_ON_FOLIO(). [21cnbao@gmail.com: build fix] Link: https://lkml.kernel.org/r/CAGsJ_4wWzG-37D82vqP_zt+Fcbz+URVe5oXLBc4M5wbN8A_gpQ@mail.gmail.com [david@redhat.com: improve the doc for the exported func] Link: https://lkml.kernel.org/r/20240227104201.337988-1-21cnbao@gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm/memory.c: do_numa_page(): remove a redundant page table readJohn Hubbard1-6/+6
do_numa_page() is reading from the same page table entry, twice, while holding the page table lock: once while checking that the pte hasn't changed, and again in order to modify the pte. Instead, just read the pte once, and save it in the same old_pte variable that already exists. This has no effect on behavior, other than to provide a tiny potential improvement to performance, by avoiding the redundant memory read (which the compiler cannot elide, due to READ_ONCE()). Also improve the associated comments nearby. Link: https://lkml.kernel.org/r/20240228034151.459370-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: convert free_swap_cache() to take a folioMatthew Wilcox (Oracle)1-1/+1
All but one caller already has a folio, so convert free_page_and_swap_cache() to have a folio and remove the call to page_folio(). Link: https://lkml.kernel.org/r/20240227174254.710559-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm/memory: change vmf_anon_prepare() to be non-staticVishal Moola (Oracle)1-1/+1
Patch series "Handle hugetlb faults under the VMA lock", v2. It is generally safe to handle hugetlb faults under the VMA lock. The only time this is unsafe is when no anon_vma has been allocated to this vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to bailout if necessary. This should only happen for the first hugetlb page in the vma. Additionally, this patchset begins to use struct vm_fault within hugetlb_fault(). This works towards cleaning up hugetlb code, and should significantly reduce the number of arguments passed to functions. The last patch in this series may cause ltp hugemmap10 to "fail". This is because vmf_anon_prepare() may bailout with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. The rest of the ltp testcases pass. This patch (of 2): In order to handle hugetlb faults under the VMA lock, hugetlb can use vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it to be a non-static function so it can be used within hugetlb as well. Link: https://lkml.kernel.org/r/20240221234732.187629-6-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20240221234732.187629-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: add pte_batch_hint() to reduce scanning in folio_pte_batch()Ryan Roberts1-7/+12
Some architectures (e.g. arm64) can tell from looking at a pte, if some follow-on ptes also map contiguous physical memory with the same pgprot. (for arm64, these are contpte mappings). Take advantage of this knowledge to optimize folio_pte_batch() so that it can skip these ptes when scanning to create a batch. By default, if an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so the changes are optimized out and the behaviour is as before. arm64 will opt-in to providing this hint in the next patch, which will greatly reduce the cost of ptep_get() when scanning a range of contptes. Link: https://lkml.kernel.org/r/20240215103205.2607016-16-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: optimize unmap/zap with PTE-mapped THPDavid Hildenbrand1-26/+66
Similar to how we optimized fork(), let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio. Most infrastructure we need for batching (mmu gather, rmap) is already there. We only have to add get_and_clear_full_ptes() and clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to process a PTE range. We won't bother sanity-checking the mapcount of all subpages, but only check the mapcount of the first subpage we process. If there is a real problem hiding somewhere, we can trigger it simply by using small folios, or when we zap single pages of a large folio. Ideally, we had that check in rmap code (including for delayed rmap), but then we cannot print the PTE. Let's keep it simple for now. If we ever have a cheap folio_mapcount(), we might just want to check for underflows there. To keep small folios as fast as possible force inlining of a specialized variant using __always_inline with nr=1. Link: https://lkml.kernel.org/r/20240214204435.167852-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: factor out zapping folio pte into zap_present_folio_pte()David Hildenbrand1-21/+32
Let's prepare for further changes by factoring it out into a separate function. Link: https://lkml.kernel.org/r/20240214204435.167852-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: further separate anon and pagecache folio handling in ↵David Hildenbrand1-5/+11
zap_present_pte() We don't need up-to-date accessed-dirty information for anon folios and can simply work with the ptent we already have. Also, we know the RSS counter we want to update. We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() + zap_install_uffd_wp_if_needed() after updating the folio and RSS. While at it, only call zap_install_uffd_wp_if_needed() if there is even any chance that pte_install_uffd_wp_if_needed() would do *something*. That is, just don't bother if uffd-wp does not apply. Link: https://lkml.kernel.org/r/20240214204435.167852-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: handle !page case in zap_present_pte() separatelyDavid Hildenbrand1-11/+11
We don't need uptodate accessed/dirty bits, so in theory we could replace ptep_get_and_clear_full() by an optimized ptep_clear_full() function. Let's rely on the provided pte. Further, there is no scenario where we would have to insert uffd-wp markers when zapping something that is not a normal page (i.e., zeropage). Add a sanity check to make sure this remains true. should_zap_folio() no longer has to handle NULL pointers. This change replaces 2/3 "!page/!folio" checks by a single "!page" one. Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to detect shadow stack entries. But for shadow stack entries, the HW dirty bit (in combination with non-writable PTEs) is set by software. So for the arch_check_zapped_pte() check, we don't have to sync against HW setting the HW dirty bit concurrently, it is always set. Link: https://lkml.kernel.org/r/20240214204435.167852-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: factor out zapping of present pte into zap_present_pte()David Hildenbrand1-41/+53
Patch series "mm/memory: optimize unmap/zap with PTE-mapped THP", v3. This series is based on [1]. Similar to what we did with fork(), let's implement PTE batching during unmap/zap when processing PTE-mapped THPs. We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch, (c) perform batch PTE setting/updates and (d) perform TLB entry removal once per batch. Ryan was previously working on this in the context of cont-pte for arm64, int latest iteration [2] with a focus on arm6 with cont-pte only. This series implements the optimization for all architectures, independent of such PTE bits, teaches MMU gather/TLB code to be fully aware of such large-folio-pages batches as well, and amkes use of our new rmap batching function when removing the rmap. To achieve that, we have to enlighten MMU gather / page freeing code (i.e., everything that consumes encoded_page) to process unmapping of consecutive pages that all belong to the same large folio. I'm being very careful to not degrade order-0 performance, and it looks like I managed to achieve that. While this series should -- similar to [1] -- be beneficial for adding cont-pte support on arm64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount. Independent of all that, this series results in a speedup during munmap() and similar unmapping (process teardown, MADV_DONTNEED on larger ranges) with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]). On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for munmap() in seconds (shorter is better): Folio Size | mm-unstable | New | Change --------------------------------------------- 4KiB | 0.058110 | 0.057715 | - 1% 16KiB | 0.044198 | 0.035469 | -20% 32KiB | 0.034216 | 0.023522 | -31% 64KiB | 0.029207 | 0.018434 | -37% 128KiB | 0.026579 | 0.014026 | -47% 256KiB | 0.025130 | 0.011756 | -53% 512KiB | 0.024292 | 0.010703 | -56% 1024KiB | 0.023812 | 0.010294 | -57% 2048KiB | 0.023785 | 0.009910 | -58% [1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com This patch (of 10): Let's prepare for further changes by factoring out processing of present PTEs. Link: https://lkml.kernel.org/r/20240214204435.167852-1-david@redhat.com Link: https://lkml.kernel.org/r/20240214204435.167852-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: ignore writable bit in folio_pte_batch()David Hildenbrand1-6/+24
... and conditionally return to the caller if any PTE except the first one is writable. fork() has to make sure to properly write-protect in case any PTE is writable. Other users (e.g., page unmaping) are expected to not care. Link: https://lkml.kernel.org/r/20240129124649.189745-16-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()David Hildenbrand1-5/+31
Let's always ignore the accessed/young bit: we'll always mark the PTE as old in our child process during fork, and upcoming users will similarly not care. Ignore the dirty bit only if we don't want to duplicate the dirty bit into the child process during fork. Maybe, we could just set all PTEs in the child dirty if any PTE is dirty. For now, let's keep the behavior unchanged, this can be optimized later if required. Ignore the soft-dirty bit only if the bit doesn't have any meaning in the src vma, and similarly won't have any in the copied dst vma. For now, we won't bother with the uffd-wp bit. Link: https://lkml.kernel.org/r/20240129124649.189745-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: optimize fork() with PTE-mapped THPDavid Hildenbrand1-19/+93
Let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio, and all other PTE bits besides the PFNs are equal. We will optimize folio_pte_batch() separately, to ignore selected PTE bits. This patch is based on work by Ryan Roberts. Use __always_inline for __copy_present_ptes() and keep the handling for single PTEs completely separate from the multi-PTE case: we really want the compiler to optimize for the single-PTE case with small folios, to not degrade performance. Note that PTE batching will never exceed a single page table and will always stay within VMA boundaries. Further, processing PTE-mapped THP that maybe pinned and have PageAnonExclusive set on at least one subpage should work as expected, but there is room for improvement: We will repeatedly (1) detect a PTE batch (2) detect that we have to copy a page (3) fall back and allocate a single page to copy a single page. For now we won't care as pinned pages are a corner case, and we should rather look into maintaining only a single PageAnonExclusive bit for large folios. Link: https://lkml.kernel.org/r/20240129124649.189745-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: pass PTE to copy_present_pte()David Hildenbrand1-4/+5
We already read it, let's just forward it. This patch is based on work by Ryan Roberts. [david@redhat.com: fix the hmm "exclusive_cow" selftest] Link: https://lkml.kernel.org/r/13f296b8-e882-47fd-b939-c2141dc28717@redhat.com Link: https://lkml.kernel.org/r/20240129124649.189745-13-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: factor out copying the actual PTE in copy_present_pte()David Hildenbrand1-30/+33
Let's prepare for further changes. Link: https://lkml.kernel.org/r/20240129124649.189745-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: memory: move mem_cgroup_charge() into alloc_anon_folio()Kefeng Wang1-8/+8
The GFP flags from vma_thp_gfp_mask() according to user configuration only used for large folio allocation but not for memory cgroup charge, and GFP_KERNEL is used for both order-0 and large order folio when memory cgroup charge at present. However, mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way. In addition to checking gfpflags_allow_blocking(), it pays attention to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within this memcg do not exceed their quotas. So we'd better to move mem_cgroup_charge() into alloc_anon_folio(), 1) it will make us to allocate as much as possible large order folio, because we could try the next order if mem_cgroup_charge() fails, although the memcg's memory usage is close to its limits. 2) using same GFP flags for allocation and charge is to be consistent with PMD THP firstly, in addition, according to GFP flag returned from vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT could make us skip direct reclaim, _GFP_NORETRY will make us skip mem_cgroup_oom() and won't trigger memory cgroup oom from large order(order <= COSTLY_ORDER) folio charging. Link: https://lkml.kernel.org/r/20240122011612.501029-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240117103954.2756050-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: convert mm_counter_file() to take a folioKefeng Wang1-5/+5
Now all callers of mm_counter_file() have a folio, convert mm_counter_file() to take a folio. Saves a call to compound_head() hidden inside PageSwapBacked(). Link: https://lkml.kernel.org/r/20240111152429.3374566-11-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: convert mm_counter() to take a folioKefeng Wang1-5/+5
Now all callers of mm_counter() have a folio, convert mm_counter() to take a folio. Saves a call to compound_head() hidden inside PageAnon(). Link: https://lkml.kernel.org/r/20240111152429.3374566-10-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: convert to should_zap_page() to should_zap_folio()Kefeng Wang1-14/+17
Make should_zap_page() take a folio and rename it to should_zap_folio() as preparation for converting mm counter functions to take a folio. Saves a call to compound_head() hidden inside PageAnon(). [wangkefeng.wang@huawei.com: fix used-uninitialized warning] Link: https://lkml.kernel.org/r/962a7993-fce9-4de8-85cd-25e290f25736@huawei.com Link: https://lkml.kernel.org/r/20240111152429.3374566-9-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: use pfn_swap_entry_folio() in copy_nonpresent_pte()Kefeng Wang1-2/+2
Call pfn_swap_entry_folio() as preparation for converting mm counter functions to take a folio. Link: https://lkml.kernel.org/r/20240111152429.3374566-8-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: memory: use nth_page() in clear/copy_subpage()Kefeng Wang1-4/+5
The clear and copy of huge gigantic page has converted to use nth_page() to handle the possible discontinuous struct page(SPARSEMEM without VMEMMAP), but not change for the non-gigantic part, fix it too. Link: https://lkml.kernel.org/r/20231229082207.60235-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-20mm/swap: fix race when skipping swapcacheKairui Song1-0/+20
When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads swapin the same entry at the same time, they get different pages (A, B). Before one thread (T0) finishes the swapin and installs page (A) to the PTE, another thread (T1) could finish swapin of page (B), swap_free the entry, then swap out the possibly modified page reusing the same entry. It breaks the pte_same check in (T0) because PTE value is unchanged, causing ABA problem. Thread (T0) will install a stalled page (A) into the PTE and cause data corruption. One possible callstack is like this: CPU0 CPU1 ---- ---- do_swap_page() do_swap_page() with same entry <direct swapin path> <direct swapin path> <alloc page A> <alloc page B> swap_read_folio() <- read to page A swap_read_folio() <- read to page B <slow on later locks or interrupt> <finished swapin first> ... set_pte_at() swap_free() <- entry is free <write to page B, now page A stalled> <swap out page B to same swap entry> pte_same() <- Check pass, PTE seems unchanged, but page A is stalled! swap_free() <- page B content lost! set_pte_at() <- staled page A installed! And besides, for ZRAM, swap_free() allows the swap device to discard the entry content, so even if page (B) is not modified, if swap_read_folio() on CPU0 happens later than swap_free() on CPU1, it may also cause data loss. To fix this, reuse swapcache_prepare which will pin the swap entry using the cache flag, and allow only one thread to swap it in, also prevent any parallel code from putting the entry in the cache. Release the pin after PT unlocked. Racers just loop and wait since it's a rare and very short event. A schedule_timeout_uninterruptible(1) call is added to avoid repeated page faults wasting too much CPU, causing livelock or adding too much noise to perf statistics. A similar livelock issue was described in commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead") Reproducer: This race issue can be triggered easily using a well constructed reproducer and patched brd (with a delay in read path) [1]: With latest 6.8 mainline, race caused data loss can be observed easily: $ gcc -g -lpthread test-thread-swap-race.c && ./a.out Polulating 32MB of memory region... Keep swapping out... Starting round 0... Spawning 65536 workers... 32746 workers spawned, wait for done... Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! Round 0 Failed, 15 data loss! This reproducer spawns multiple threads sharing the same memory region using a small swap device. Every two threads updates mapped pages one by one in opposite direction trying to create a race, with one dedicated thread keep swapping out the data out using madvise. The reproducer created a reproduce rate of about once every 5 minutes, so the race should be totally possible in production. After this patch, I ran the reproducer for over a few hundred rounds and no data loss observed. Performance overhead is minimal, microbenchmark swapin 10G from 32G zram: Before: 10934698 us After: 11157121 us Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) [kasong@tencent.com: v4] Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <21cnbao@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-12mm/memory: Use exception ip to search exception tablesJiaxun Yang1-2/+2
On architectures with delay slot, instruction_pointer() may differ from where exception was triggered. Use exception_ip we just introduced to search exception tables to get rid of the problem. Fixes: 4bce37a68ff8 ("mips/mm: Convert to using lock_mm_and_find_vma()") Reported-by: Xi Ruoyao <xry111@xry111.site> Link: https://lore.kernel.org/r/75e9fd7b08562ad9b456a5bdaacb7cc220311cc9.camel@xry111.site/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2024-01-25mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range()David Hildenbrand1-1/+1
The correct folio replacement for "set_page_dirty()" is "folio_mark_dirty()", not "folio_set_dirty()". Using the latter won't properly inform the FS using the dirty_folio() callback. This has been found by code inspection, but likely this can result in some real trouble when zapping dirty PTEs that point at clean pagecache folios. Yuezhang Mo said: "Without this fix, testing the latest exfat with xfstests, test cases generic/029 and generic/030 will fail." Link: https://lkml.kernel.org/r/20240122171751.272074-1-david@redhat.com Fixes: c46265030b0f ("mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lkml.kernel.org/r/2445cedb-61fb-422c-8bfb-caf0a2beed62@arm.com Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds1-94/+198
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2023-12-29mm: remove unnecessary ia64 code and commentKefeng Wang1-3/+1
IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"), remove unnecessary ia64 special mm code and comment too. Link: https://lkml.kernel.org/r/20231222070203.2966980-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/memory: page_try_dup_anon_rmap() -> folio_try_dup_anon_rmap_pte()David Hildenbrand1-3/+5
Let's convert copy_nonpresent_pte(). While at it, perform some more folio conversion. Link: https://lkml.kernel.org/r/20231220224504.646757-37-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/rmap: convert page_dup_file_rmap() to folio_dup_file_rmap_[pte|ptes|pmd]()David Hildenbrand1-1/+1
Let's convert page_dup_file_rmap() like the other rmap functions. As there is only a single caller, convert that single caller right away and remove page_dup_file_rmap(). Add folio_dup_file_rmap_ptes() right away, we want to perform rmap baching during fork() soon. Link: https://lkml.kernel.org/r/20231220224504.646757-34-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()David Hildenbrand1-10/+13
Let's convert zap_pte_range() and closely-related tlb_flush_rmap_batch(). While at it, perform some more folio conversion in zap_pte_range(). Link: https://lkml.kernel.org/r/20231220224504.646757-29-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/memory: page_add_anon_rmap() -> folio_add_anon_rmap_pte()David Hildenbrand1-4/+7
Let's convert restore_exclusive_pte() and do_swap_page(). While at it, perform some folio conversion in restore_exclusive_pte(). Link: https://lkml.kernel.org/r/20231220224504.646757-21-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()David Hildenbrand1-6/+8
Let's convert insert_page_into_pte_locked() and do_set_pmd(). While at it, perform some folio conversion. Link: https://lkml.kernel.org/r/20231220224504.646757-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/rmap: convert folio_add_file_rmap_range() into ↵David Hildenbrand1-1/+1
folio_add_file_rmap_[pte|ptes|pmd]() Let's get rid of the compound parameter and instead define explicitly which mappings we're adding. That is more future proof, easier to read and harder to mess up. Use an enum to express the granularity internally. Make the compiler always special-case on the granularity by using __always_inline. Replace the "compound" check by a switch-case that will be removed by the compiler completely. Add plenty of sanity checks with CONFIG_DEBUG_VM. Replace the folio_test_pmd_mappable() check by a config check in the caller and sanity checks. Convert the single user of folio_add_file_rmap_range(). While at it, consistently use "int" instead of "unisgned int" in rmap code when dealing with mapcounts and the number of pages. This function design can later easily be extended to PUDs and to batch PMDs. Note that for now we don't support anything bigger than PMD-sized folios (as we cleanly separated hugetlb handling). Sanity checks will catch if that ever changes. Next up is removing page_remove_rmap() along with its "compound" parameter and smilarly converting all other rmap functions. Link: https://lkml.kernel.org/r/20231220224504.646757-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert swap_readpage() to swap_read_folio()Matthew Wilcox (Oracle)1-2/+2
All callers have a folio, so pass it in, saving two calls to compound_head(). Link: https://lkml.kernel.org/r/20231213215842.671461-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: remove some calls to page_add_new_anon_rmap()Matthew Wilcox (Oracle)1-1/+1
We already have the folio in these functions, we just need to use it. folio_add_new_anon_rmap() didn't exist at the time they were converted to folios. Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: convert ksm_might_need_to_copy() to work on foliosMatthew Wilcox (Oracle)1-4/+7
Patch series "Finish two folio conversions". Most callers of page_add_new_anon_rmap() and lru_cache_add_inactive_or_unevictable() have been converted to their folio equivalents, but there are still a few stragglers. There's a bit of preparatory work in ksm and unuse_pte(), but after that it's pretty mechanical. This patch (of 9): Accept a folio as an argument and return a folio result. Removes a call to compound_head() in do_swap_page(), and prevents folio & page from getting out of sync in unuse_pte(). Reviewed-by: David Hildenbrand <david@redhat.com> [willy@infradead.org: fix smatch warning] Link: https://lkml.kernel.org/r/ZXnPtblC6A1IkyAB@casper.infradead.org [david@redhat.com: only adjust the page if the folio changed] Link: https://lkml.kernel.org/r/6a8f2110-fa91-4c10-9eae-88315309a6e3@redhat.com Link: https://lkml.kernel.org/r/20231211162214.2146080-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231211162214.2146080-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm: fix unmap_mapping_range high bits shift bugJiajun Xie1-2/+2
The bug happens when highest bit of holebegin is 1, suppose holebegin is 0x8000000111111000, after shift, hba would be 0xfff8000000111111, then vma_interval_tree_foreach would look it up fail or leads to the wrong result. error call seq e.g.: - mmap(..., offset=0x8000000111111000) |- syscall(mmap, ... unsigned long, off): |- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT); here pgoff is correctly shifted to 0x8000000111111, but pass 0x8000000111111000 as holebegin to unmap would then cause terrible result, as shown below: - unmap_mapping_range(..., loff_t const holebegin) |- pgoff_t hba = holebegin >> PAGE_SHIFT; /* hba = 0xfff8000000111111 unexpectedly */ The issue happens in Heterogeneous computing, where the device(e.g. gpu) and host share the same virtual address space. A simple workflow pattern which hit the issue is: /* host */ 1. userspace first mmap a file backed VA range with specified offset. e.g. (offset=0x800..., mmap return: va_a) 2. write some data to the corresponding sys page e.g. (va_a = 0xAABB) /* device */ 3. gpu workload touches VA, triggers gpu fault and notify the host. /* host */ 4. reviced gpu fault notification, then it will: 4.1 unmap host pages and also takes care of cpu tlb (use unmap_mapping_range with offset=0x800...) 4.2 migrate sys page to device 4.3 setup device page table and resolve device fault. /* device */ 5. gpu workload continued, it accessed va_a and got 0xAABB. 6. gpu workload continued, it wrote 0xBBCC to va_a. /* host */ 7. userspace access va_a, as expected, it will: 7.1 trigger cpu vm fault. 7.2 driver handling fault to migrate gpu local page to host. 8. userspace then could correctly get 0xBBCC from va_a 9. done But in step 4.1, if we hit the bug this patch mentioned, then userspace would never trigger cpu fault, and still get the old value: 0xAABB. Making holebegin unsigned first fixes the bug. Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com Signed-off-by: Jiajun Xie <jiajun.xie.sh@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20mm/memory: replace kmap() with kmap_local_page()Fabio M. De Francesco1-3/+2
kmap() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap() with kmap_local_page() in mm/memory.c. There are two main problems with kmap(): (1) It comes with an overhead as the mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap's pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page-faults, and can be called from any context (including interrupts). It is faster than kmap() in kernels with HIGHMEM enabled. The tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and still valid. Obviously, thread locality implies that the kernel virtual addresses returned by kmap_local_page() are only valid in the context of the callers (i.e., they cannot be handed to other threads). The use of kmap_local_page() in mm/memory.c does not break the above-mentioned assumption, so it is allowed and preferred. Link: https://lkml.kernel.org/r/20231215084417.2002370-1-fabio.maria.de.francesco@linux.intel.com Link: https://lkml.kernel.org/r/20231214081039.1919328-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20mm: thp: support allocation of anonymous multi-size THPRyan Roberts1-9/+100
Introduce the logic to allow THP to be configured (through the new sysfs interface we just added) to allocate large folios to back anonymous memory, which are larger than the base page size but smaller than PMD-size. We call this new THP extension "multi-size THP" (mTHP). mTHP continues to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default, but can be enabled at runtime by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled (see documentation in previous commit). The long term aim is to change the default to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. [ryan.roberts@arm.com: resolve some multi-size THP review nits] Link: https://lkml.kernel.org/r/20231214160251.3574571-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20mm: thp: introduce multi-size THP sysfs interfaceRyan Roberts1-3/+3
In preparation for adding support for anonymous multi-size THP, introduce new sysfs structure that will be used to control the new behaviours. A new directory is added under transparent_hugepage for each supported THP size, and contains an `enabled` file, which can be set to "inherit" (to inherit the global setting), "always", "madvise" or "never". For now, the kernel still only supports PMD-sized anonymous THP, so only 1 directory is populated. The first half of the change converts transhuge_vma_suitable() and hugepage_vma_check() so that they take a bitfield of orders for which the user wants to determine support, and the functions filter out all the orders that can't be supported, given the current sysfs configuration and the VMA dimensions. The resulting functions are renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders() respectively. Convenience functions that take a single, unencoded order and return a boolean are also defined as thp_vma_suitable_order() and thp_vma_allowable_order(). The second half of the change implements the new sysfs interface. It has been done so that each supported THP size has a `struct thpsize`, which describes the relevant metadata and is itself a kobject. This is pretty minimal for now, but should make it easy to add new per-thpsize files to the interface if needed in future (e.g. per-size defrag). Rather than keep the `enabled` state directly in the struct thpsize, I've elected to directly encode it into huge_anon_orders_[always|madvise|inherit] bitfields since this reduces the amount of work required in thp_vma_allowable_orders() which is called for every page fault. See Documentation/admin-guide/mm/transhuge.rst, as modified by this commit, for details of how the new sysfs interface works. [ryan.roberts@arm.com: fix build warning when CONFIG_SYSFS is disabled] Link: https://lkml.kernel.org/r/20231211125320.3997543-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm: memory: use folio_prealloc() in wp_page_copy()Kefeng Wang1-15/+7
Use folio_prealloc() helper to simplify code a bit. Link: https://lkml.kernel.org/r/20231118023232.1409103-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm: memory: use a folio in do_cow_fault()Kefeng Wang1-10/+6
Use folio_prealloc() helper and convert to use a folio in do_cow_fault(), which save five compound_head() calls. Link: https://lkml.kernel.org/r/20231118023232.1409103-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm: memory: rename page_copy_prealloc() to folio_prealloc()Kefeng Wang1-4/+9
Let's rename page_copy_prealloc() to folio_prealloc(), which could be reused in more functons, as it maybe zero the new page, pass a new need_zero to it, and call the vma_alloc_zeroed_movable_folio() if need_zero is true. Link: https://lkml.kernel.org/r/20231118023232.1409103-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm: memory: use a folio in validate_page_before_insert()Kefeng Wang1-2/+5
Use a folio in validate_page_before_insert() to save two compound_head() calls. Link: https://lkml.kernel.org/r/20231118023232.1409103-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10mm/memory: use kmap_local_page() in __wp_page_copy_user()Fabio M. De Francesco1-2/+4
kmap_atomic() has been deprecated in favor of kmap_local_{folio,page}. Therefore, replace kmap_atomic() with kmap_local_page in __wp_page_copy_user(). kmap_atomic() disables preemption in !PREEMPT_RT kernels and unconditionally disables also page-faults. My limited knowledge of the implementation of __wp_page_copy_user() makes me think that the latter side effect is still needed here, but kmap_local_page() is implemented not to disable page-faults. So, in addition to the conversion to local mapping, add explicit pagefault_disable() / pagefault_enable() between mapping and un-mapping. Link: https://lkml.kernel.org/r/20231120142418.6977-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10mm: convert __do_fault() to use a folioMatthew Wilcox (Oracle)1-10/+10
Convert vmf->page to a folio as soon as we're going to use it. This fixes a bug if the fault handler returns a tail page with hardware poison; tail pages have an invalid page->index, so we would fail to unmap the page from the page tables. We actually have to unmap the entire folio (or mapping_evict_folio() will fail), so use unmap_mapping_folio() instead. This also saves various calls to compound_head() hidden in lock_page(), put_page(), etc. Link: https://lkml.kernel.org/r/20231108182809.602073-3-willy@infradead.org Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10fork: use __mt_dup() to duplicate maple tree in dup_mmap()Peng Zhang1-1/+6
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient. The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n). Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n). As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point. There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different number of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and the this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance. 21 121 221 421 821 1621 3221 6421 12821 25621 51221 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42% [1] https://github.com/kdlucas/byte-unixbench/tree/master Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06mm/memory.c:zap_pte_range() print bad swap entryAndrew Morton1-0/+1
We have a report of this WARN() triggering. Let's print the offending swp_entry_t to help diagnosis. Link: https://lkml.kernel.org/r/000000000000b0e576060a30ee3b@google.com Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: use folio_xchg_last_cpupid() in wp_page_reuse()Kefeng Wang1-9/+10
Convert to use folio_xchg_last_cpupid() in wp_page_reuse(), and remove page variable. Since now only normal and PMD-mapped page is handled by numa balancing, it's enough to only update the entire folio's last cpupid. Link: https://lkml.kernel.org/r/20231018140806.2783514-19-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: convert wp_page_reuse() and finish_mkwrite_fault() to take a folioKefeng Wang1-9/+10
Saves one compound_head() call, also in preparation for page_cpupid_xchg_last() conversion. Link: https://lkml.kernel.org/r/20231018140806.2783514-18-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: make finish_mkwrite_fault() staticKefeng Wang1-1/+1
Make finish_mkwrite_fault static since it is not used outside of memory.c. Link: https://lkml.kernel.org/r/20231018140806.2783514-17-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: memory: use folio_last_cpupid() in do_numa_page()Kefeng Wang1-1/+1
Convert to use folio_last_cpupid() in do_numa_page(). Link: https://lkml.kernel.org/r/20231018140806.2783514-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm/gup: adapt get_user_page_vma_remote() to never return NULLLorenzo Stoakes1-2/+1
get_user_pages_remote() will never return 0 except in the case of FOLL_NOWAIT being specified, which we explicitly disallow. This simplifies error handling for the caller and avoids the awkwardness of dealing with both errors and failing to pin. Failing to pin here is an error. Link: https://lkml.kernel.org/r/00319ce292d27b3aae76a0eb220ce3f528187508.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: make __access_remote_vm() staticLorenzo Stoakes1-2/+2
Patch series "various improvements to the GUP interface", v2. A series of fixes to simplify and improve the GUP interface with an eye to providing groundwork to future improvements:- * __access_remote_vm() and access_remote_vm() are functionally identical, so make the former static such that in future we can potentially change the external-facing implementation details of this function. * Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and simplify things by defining INTERNAL_GUP_FLAGS to check against. * Adjust __get_user_pages_locked() to explicitly treat a failure to pin any pages as an error in all circumstances other than FOLL_NOWAIT being specified, bringing it in line with the nommu implementation of this function. * (With many thanks to Arnd who suggested this in the first instance) Update get_user_page_vma_remote() to explicitly only return a page or an error, simplifying the interface and avoiding the questionable IS_ERR_OR_NULL() pattern. This patch (of 4): access_remote_vm() passes through parameters to __access_remote_vm() directly, so remove the __access_remote_vm() function from mm.h and use access_remote_vm() in the one caller that needs it (ptrace_access_vm()). This allows future adjustments to the GUP-internal __access_remote_vm() function while keeping the access_remote_vm() function stable. Link: https://lkml.kernel.org/r/cover.1696288092.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/f7877c5039ce1c202a514a8aeeefc5cdd5e32d19.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18memory: move exclusivity detection in do_wp_page() into ↵David Hildenbrand1-43/+45
wp_can_reuse_anon_folio() Let's clean up do_wp_page() a bit, removing two labels and making it a easier to read. wp_can_reuse_anon_folio() now only operates on the whole folio. Move the SetPageAnonExclusive() out into do_wp_page(). No need to do this under page lock -- the page table lock is sufficient. Link: https://lkml.kernel.org/r/20231002142949.235104-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()David Hildenbrand1-1/+1
Let's convert it to consume a folio. [akpm@linux-foundation.org: fix kerneldoc] Link: https://lkml.kernel.org/r/20231002142949.235104-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap()David Hildenbrand1-0/+1
Patch series "mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()". Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the callers handle PageAnonExclusive. I'm including cleanup patch #3 because it fits into the picture and can be done cleaner by the conversion. This patch (of 3): Let's move it into the caller: there is a difference between whether an anon folio can only be mapped by one process (e.g., into one VMA), and whether it is truly exclusive (e.g., no references -- including GUP -- from other processes). Further, for large folios the page might not actually be pointing at the head page of the folio, so it better be handled in the caller. This is a preparation for converting page_move_anon_rmap() to consume a folio. Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: handle write faults to RO pages under the VMA lockMatthew Wilcox (Oracle)1-7/+6
I think this is a pretty rare occurrence, but for consistency handle faults with the VMA lock held the same way that we handle other faults with the VMA lock held. Link: https://lkml.kernel.org/r/20231006195318.4087158-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: handle read faults under the VMA lockMatthew Wilcox (Oracle)1-4/+3
Most file-backed faults are already handled through ->map_pages(), but if we need to do I/O we'll come this way. Since filemap_fault() is now safe to be called under the VMA lock, we can handle these faults under the VMA lock now. Link: https://lkml.kernel.org/r/20231006195318.4087158-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: handle COW faults under the VMA lockMatthew Wilcox (Oracle)1-7/+5
If the page is not currently present in the page tables, we need to call the page fault handler to find out which page we're supposed to COW, so we need to both check that there is already an anon_vma and that the fault handler doesn't need the mmap_lock. Link: https://lkml.kernel.org/r/20231006195318.4087158-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: handle shared faults under the VMA lockMatthew Wilcox (Oracle)1-4/+18
There are many implementations of ->fault and some of them depend on mmap_lock being held. All vm_ops that implement ->map_pages() end up calling filemap_fault(), which I have audited to be sure it does not rely on mmap_lock. So (for now) key off ->map_pages existing as a flag to indicate that it's safe to call ->fault while only holding the vma lock. Link: https://lkml.kernel.org/r/20231006195318.4087158-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm: call wp_page_copy() under the VMA lockMatthew Wilcox (Oracle)1-13/+26
It is usually safe to call wp_page_copy() under the VMA lock. The only unsafe situation is when no anon_vma has been allocated for this VMA, and we have to look at adjacent VMAs to determine if their anon_vma can be shared. Since this happens only for the first COW of a page in this VMA, the majority of calls to wp_page_copy() do not need to fall back to the mmap_sem. Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will return RETRY if we currently hold the VMA lock and need to allocate an anon_vma. This lets us drop the check in do_wp_page(). Link: https://lkml.kernel.org/r/20231006195318.4087158-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18userfaultfd: UFFD_FEATURE_WP_ASYNCPeter Xu1-3/+25
Patch series "Implement IOCTL to get and optionally clear info about PTEs", v33. *Motivation* The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows GetWriteWatch() and ResetWriteWatch() syscalls [1]. The GetWriteWatch() retrieves the addresses of the pages that are written to in a region of virtual memory. This syscall is used in Windows applications and games etc. This syscall is being emulated in pretty slow manner in userspace. Our purpose is to enhance the kernel such that we translate it efficiently in a better way. Currently some out of tree hack patches are being used to efficiently emulate it in some kernels. We intend to replace those with these patches. So the whole gaming on Linux can effectively get benefit from this. It means there would be tons of users of this code. CRIU use case [2] was mentioned by Andrei and Danylo: > Use cases for migrating sparse VMAs are binaries sanitized with ASAN, > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of > shadow memory [4]. Being able to migrate such binaries allows to highly > reduce the amount of work needed to identify and fix post-migration > crashes, which happen constantly. Andrei defines the following uses of this code: * it is more granular and allows us to track changed pages more effectively. The current interface can clear dirty bits for the entire process only. In addition, reading info about pages is a separate operation. It means we must freeze the process to read information about all its pages, reset dirty bits, only then we can start dumping pages. The information about pages becomes more and more outdated, while we are processing pages. The new interface solves both these downsides. First, it allows us to read pte bits and clear the soft-dirty bit atomically. It means that CRIU will not need to freeze processes to pre-dump their memory. Second, it clears soft-dirty bits for a specified region of memory. It means CRIU will have actual info about pages to the moment of dumping them. * The new interface has to be much faster because basic page filtering is happening in the kernel. With the old interface, we have to read pagemap for each page. *Implementation Evolution (Short Summary)* From the definition of GetWriteWatch(), we feel like kernel's soft-dirty feature can be used under the hood with some additions like: * reset soft-dirty flag for only a specific region of memory instead of clearing the flag for the entire process * get and clear soft-dirty flag for a specific region atomically So we decided to use ioctl on pagemap file to read or/and reset soft-dirty flag. But using soft-dirty flag, sometimes we get extra pages which weren't even written. They had become soft-dirty because of VMA merging and VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were able to by-pass this short coming by ignoring VM_SOFTDIRTY until David reported that mprotect etc messes up the soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We discussed if we can revert these patches. But we could not reach to any conclusion. So at this point, I made couple of tries to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation: * [7] Correct the bug fixed wrongly back in 2014. It had potential to cause regression. We left it behind. * [8] Keep a list of soft-dirty part of a VMA across splits and merges. I got the reply don't increase the size of the VMA by 8 bytes. At this point, we left soft-dirty considering it is too much delicate and userfaultfd [9] seemed like the only way forward. From there onward, we have been basing soft-dirty emulation on userfaultfd wp feature where kernel resolves the faults itself when WP_ASYNC feature is used. It was straight forward to add WP_ASYNC feature in userfautlfd. Now we get only those pages dirty or written-to which are really written in reality. (PS There is another WP_UNPOPULATED userfautfd feature is required which is needed to avoid pre-faulting memory before write-protecting [9].) All the different masks were added on the request of CRIU devs to create interface more generic and better. [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com [3] https://github.com/google/sanitizers [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/ [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com This patch (of 6): Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows userfaultfd wr-protect faults to be resolved by the kernel directly. It can be used like a high accuracy version of soft-dirty, without vma modifications during tracking, and also with ranged support by default rather than for a whole mm when reset the protections due to existence of ioctl(UFFDIO_WRITEPROTECT). Several goals of such a dirty tracking interface: 1. All types of memory should be supported and tracable. This is nature for soft-dirty but should mention when the context is userfaultfd, because it used to only support anon/shmem/hugetlb. The problem is for a dirty tracking purpose these three types may not be enough, and it's legal to track anything e.g. any page cache writes from mmap. 2. Protections can be applied to partial of a memory range, without vma split/merge fuss. The hope is that the tracking itself should not affect any vma layout change. It also helps when reset happens because the reset will not need mmap write lock which can block the tracee. 3. Accuracy needs to be maintained. This means we need pte markers to work on any type of VMA. One could question that, the whole concept of async dirty tracking is not really close to fundamentally what userfaultfd used to be: it's not "a fault to be serviced by userspace" anymore. However, using userfaultfd-wp here as a framework is convenient for us in at least: 1. VM_UFFD_WP vma flag, which has a very good name to suite something like this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new feature bit to identify from a sync version of uffd-wp registration. 2. PTE markers logic can be leveraged across the whole kernel to maintain the uffd-wp bit as long as an arch supports, this also applies to this case where uffd-wp bit will be a hint to dirty information and it will not go lost easily (e.g. when some page cache ptes got zapped). 3. Reuse ioctl(UFFDIO_WRITEPROTECT) interface for either starting or resetting a range of memory, while there's no counterpart in the old soft-dirty world, hence if this is wanted in a new design we'll need a new interface otherwise. We can somehow understand that commonality because uffd-wp was fundamentally a similar idea of write-protecting pages just like soft-dirty. This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so far WP_ASYNC seems to not usable if without WP_UNPOPULATE. This also gives us chance to modify impl of WP_ASYNC just in case it could be not depending on WP_UNPOPULATED anymore in the future kernels. It's also fine to imply that because both features will rely on PTE_MARKER_UFFD_WP config option, so they'll show up together (or both missing) in an UFFDIO_API probe. vma_can_userfault() now allows any VMA if the userfaultfd registration is only about async uffd-wp. So we can track dirty for all kinds of memory including generic file systems (like XFS, EXT4 or BTRFS). One trick worth mention in do_wp_page() is that we need to manually update vmf->orig_pte here because it can be used later with a pte_same() check - this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags. The major defect of this approach of dirty tracking is we need to populate the pgtables when tracking starts. Soft-dirty doesn't do it like that. It's unwanted in the case where the range of memory to track is huge and unpopulated (e.g., tracking updates on a 10G file with mmap() on top, without having any page cache installed yet). One way to improve this is to allow pte markers exist for larger than PTE level for PMD+. That will not change the interface if to implemented, so we can leave that for later. Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com Signed-off-by: Peter Xu <peterx@redhat.com> Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrei Vagin <avagin@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Miroslaw <emmir@google.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Gofman <pgofman@codeweavers.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yun Zhou <yun.zhou@windriver.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.Andrew Morton1-5/+8
2023-10-18hugetlbfs: close race between MADV_DONTNEED and page faultRik van Riel1-5/+8
Malloc libraries, like jemalloc and tcalloc, take decisions on when to call madvise independently from the code in the main application. This sometimes results in the application page faulting on an address, right after the malloc library has shot down the backing memory with MADV_DONTNEED. Usually this is harmless, because we always have some 4kB pages sitting around to satisfy a page fault. However, with hugetlbfs systems often allocate only the exact number of huge pages that the application wants. Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of any lock taken on the page fault path, which can open up the following race condition: CPU 1 CPU 2 MADV_DONTNEED unmap page shoot down TLB entry page fault fail to allocate a huge page killed with SIGBUS free page Fix that race by pulling the locking from __unmap_hugepage_final_range into helper functions called from zap_page_range_single. This ensures page faults stay locked out of the MADV_DONTNEED VMA until the huge pages have actually been freed. Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16mm: mempolicy: make mpol_misplaced() to take a folioKefeng Wang1-1/+1
In preparation for large folio numa balancing, make mpol_misplaced() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16mm: memory: make numa_migrate_prep() to take a folioKefeng Wang1-5/+4
In preparation for large folio numa balancing, make numa_migrate_prep() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16mm: memory: use a folio in do_numa_page()Kefeng Wang1-17/+17
Numa balancing only try to migrate non-compound page in do_numa_page(), use a folio in it to save several compound_head calls, note we use folio_estimated_sharers(), it is enough to check the folio sharers since only normal page is handled, if large folio numa balancing is supported, a precise folio sharers check would be used, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16mm: memory: add vm_normal_folio_pmd()Kefeng Wang1-0/+10
Patch series "mm: convert numa balancing functions to use a folio", v2. do_numa_pages() only handles non-compound pages, and only PMD-mapped THPs are handled in do_huge_pmd_numa_page(). But a large, PTE-mapped folio will be supported so let's convert more numa balancing functions to use/take a folio in preparation for that, no functional change intended for now. This patch (of 6): The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio(), which allow them to completely replace the struct page variables with struct folio variables. Link: https://lkml.kernel.org/r/20230921074417.24004-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230921074417.24004-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()Kefeng Wang1-1/+1
At present, numa balance only support base page and PMD-mapped THP, but we will expand to support to migrate large folio/pte-mapped THP in the future, it is better to make migrate_misplaced_page() to take a folio instead of a page, and rename it to migrate_misplaced_folio(), it is a preparation, also this remove several compound_head() calls. Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: remove remnants of SPLIT_RSS_COUNTINGMateusz Guzik1-2/+0
The feature got retired in f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"), but the patch failed to fully clean it up. Link: https://lkml.kernel.org/r/20230823170556.2281747-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-31Merge tag 'x86_shstk_for_6.6-rc1' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds1-151/+190
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-24mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)1-4/+4
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: allow ->huge_fault() to be called without the mmap_lock heldMatthew Wilcox (Oracle)1-20/+2
Remove the checks for the VMA lock being held, allowing the page fault path to call into the filesystem instead of retrying with the mmap_lock held. This will improve scalability for DAX page faults. Also update the documentation to match (and fix some other changes that have happened recently). Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: remove checks for pte_indexMatthew Wilcox (Oracle)1-16/+1
Since pte_index is always defined, we don't need to check whether it's defined or not. Delete the slow version that doesn't depend on it and remove the #define since nobody needs to test for it. Link: https://lkml.kernel.org/r/20230819031837.3160096-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Christian Dietrich <stettberger@dokucode.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm/swap: inline folio_set_swap_entry() and folio_swap_entry()David Hildenbrand1-1/+1
Let's simply work on the folio directly and remove the helpers. Link: https://lkml.kernel.org/r/20230821160849.531668-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Chris Li <chrisl@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm/swap: stop using page->private on tail pages for THP_SWAPDavid Hildenbrand1-1/+1
Patch series "mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups". This series stops using page->private on tail pages for THP_SWAP, replaces folio->private by folio->swap for swapcache folios, and starts using "new_folio" for tail pages that we are splitting to remove the usage of page->private for swapcache handling completely. This patch (of 4): Let's stop using page->private on tail pages, making it possible to just unconditionally reuse that field in the tail pages of large folios. The remaining usage of the private field for THP_SWAP is in the THP splitting code (mm/huge_memory.c), that we'll handle separately later. Update the THP_SWAP documentation and sanity checks in mm_types.h and __split_huge_page_tail(). [david@redhat.com: stop using page->private on tail pages for THP_SWAP] Link: https://lkml.kernel.org/r/6f0a82a3-6948-20d9-580b-be1dbf415701@redhat.com Link: https://lkml.kernel.org/r/20230821160849.531668-1-david@redhat.com Link: https://lkml.kernel.org/r/20230821160849.531668-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: call update_mmu_cache_range() in more page fault handling pathsMatthew Wilcox (Oracle)1-7/+8
Pass the vm_fault to the architecture to help it make smarter decisions about which PTEs to insert into the TLB. Link: https://lkml.kernel.org/r/20230802151406.3735276-39-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: convert do_set_pte() to set_pte_range()Yin Fengwei1-13/+24
set_pte_range() allows to setup page table entries for a specific range. It takes advantage of batched rmap update for large folio. It now takes care of calling update_mmu_cache_range(). Link: https://lkml.kernel.org/r/20230802151406.3735276-37-willy@infradead.org Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: use flush_icache_pages() in do_set_pmd()Matthew Wilcox (Oracle)1-3/+1
Push the iteration over each page down to the architectures (many can flush the entire THP without iteration). Link: https://lkml.kernel.org/r/20230802151406.3735276-34-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: handle userfaults under VMA lockSuren Baghdasaryan1-8/+1
Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [surenb@google.com: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: handle swap page faults under per-VMA lockSuren Baghdasaryan1-6/+10
When page fault is handled under per-VMA lock protection, all swap page faults are retried with mmap_lock because folio_lock_or_retry has to drop and reacquire mmap_lock if folio could not be immediately locked. Follow the same pattern as mmap_lock to drop per-VMA lock when waiting for folio and retrying once folio is available. With this obstacle removed, enable do_swap_page to operate under per-VMA lock protection. Drivers implementing ops->migrate_to_ram might still rely on mmap_lock, therefore we have to fall back to mmap_lock in that particular case. Note that the only time do_swap_page calls synchronous swap_readpage is when SWP_SYNCHRONOUS_IO is set, which is only set for QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem). Therefore we don't sleep in this path, and there's no need to drop the mmap or per-VMA lock. Link: https://lkml.kernel.org/r/20230630211957.1341547-6-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: change folio_lock_or_retry to use vm_fault directlySuren Baghdasaryan1-8/+6
Change folio_lock_or_retry to accept vm_fault struct and return the vm_fault_t directly. Link: https://lkml.kernel.org/r/20230630211957.1341547-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETEDSuren Baghdasaryan1-0/+12
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means mmap_lock has been released. However with per-VMA locks behavior is different and the caller should still release it. To make the rules consistent for the caller, drop the per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns VM_FAULT_COMPLETED for now. [willy@infradead.org: fix riscv] Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm/memory.c: fix mismergeMatthew Wilcox1-1/+1
Fix a build issue. Link: https://lkml.kernel.org/r/ZNerqcNS4EBJA/2v@casper.infradead.org Fixes: 4aaa60dad4d1 ("mm: allow per-VMA locks on file-backed VMAs") Signed-off-by: Matthew Wilcox <willy@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308121909.XNYBtqNI-lkp@intel.com/ Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: convert ptlock_free() to use ptdescsVishal Moola (Oracle)1-2/+2
This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-11-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: convert ptlock_alloc() to use ptdescsVishal Moola (Oracle)1-2/+2
This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: replace mmap with vma write lock assertions when operating on a vmaSuren Baghdasaryan1-1/+1
Vma write lock assertion always includes mmap write lock assertion and additional vma lock checks when per-VMA locks are enabled. Replace weaker mmap_assert_write_locked() assertions with stronger vma_assert_write_locked() ones when we are operating on a vma which is expected to be locked. Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/memory.c: fix some kernel-doc commentsYang Li1-1/+2
Add description of @mas and @tree_end, remove @mt in unmap_vmas(). to silence the warnings: mm/memory.c:1837: warning: Function parameter or member 'mas' not described in 'unmap_vmas' mm/memory.c:1837: warning: Function parameter or member 'tree_end' not described in 'unmap_vmas' mm/memory.c:1837: warning: Excess function parameter 'mt' description in 'unmap_vmas' Link: https://lkml.kernel.org/r/20230727015558.69554-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5996 Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: avoid 'might_sleep()' in get_mmap_lock_carefully()Linus Torvalds1-4/+1
This might_sleep() goes back a long time: it was originally introduced way back when by commit 010060741ad3 ("x86: add might_sleep() to do_page_fault()"), and made it into the generic VM code when the x86 fault path got re-organized and generalized in commit c2508ec5a58d ("mm: introduce new 'lock_mm_and_find_vma()' page fault helper"). However, it turns out that the placement of that might_sleep() has always been rather questionable simply because it's not only a debug statement to warn about sleeping in contexts that shouldn't sleep (which was the original reason for adding it), but it also implies a voluntary scheduling point. That, in turn, is less than desirable for two reasons: (a) it ends up being done after we successfully got the mmap_lock, so just as we got the lock we will now eagerly schedule away and increase lock contention and (b) this is all very possibly part of the "oops, things went horribly wrong" path and we just haven't figured that out yet After all, the whole _reason_ for having that get_mmap_lock_carefully() rather than just doing the obvious mmap_read_lock() is because this code wants to deal somewhat gracefully with potential kernel wild pointer bugs. So then a voluntary scheduling point here is simply not a good idea. We could certainly turn the 'might_sleep()' into a '__might_sleep()' and make it be just the debug check that it was originally intended to be. But even that seems questionable in the wild kernel pointer case - which again is part of the whole point of this code. The problem wouldn't be about the _sleeping_ part of the page fault, but about a bad kernel access. The fact that that bad kernel access might happen in a section that you shouldn't sleep in is secondary. So it really ends up being the case that this is simply entirely the wrong place to do this debug check and related scheduling point at all. So let's just remove the check entirely. It's been around for over a decade, it has served its purpose. The re-schedule will happen at return to user space anyway for the normal case, and the warning - if we even need it - might be better off done as a special case for "page fault from kernel mode" once we've dealt with any potential kernel oopses where the oops is the relevant thing, not some artificial "scheduling while atomic" test. Reported-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/lkml/20230820104303.2083444-1-mjguzik@gmail.com/ Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-18mm: handle faults that merely update the accessed bit under the VMA lockMatthew Wilcox (Oracle)1-6/+17
Move FAULT_FLAG_VMA_LOCK check out of handle_pte_fault(). This should have a significant performance improvement for mmaped files. Write faults (on read-only shared pages) still take the mmap lock as we do not want to audit all the implementations of ->pfn_mkwrite() and ->page_mkwrite(). However write-faults on private mappings are handled under the VMA lock. [willy@infradead.org: address "suspicious RCU usage" warning] Link: https://lkml.kernel.org/r/ZMK7jwpI4uD6tKrF@casper.infradead.org Link: https://lkml.kernel.org/r/20230724185410.1124082-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: handle swap and NUMA PTE faults under the VMA lockMatthew Wilcox (Oracle)1-6/+6
Move the FAULT_FLAG_VMA_LOCK check down in handle_pte_fault(). This is probably not a huge win in its own right, but is a nicely separable bit from the next patch. Link: https://lkml.kernel.org/r/20230724185410.1124082-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: run the fault-around code under the VMA lockMatthew Wilcox (Oracle)1-5/+5
The map_pages fs method should be safe to run under the VMA lock instead of the mmap lock. This should have a measurable reduction in contention on the mmap lock. Link: https://lkml.kernel.org/r/20230724185410.1124082-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: move FAULT_FLAG_VMA_LOCK check down from do_fault()Matthew Wilcox (Oracle)1-5/+15
Perform the check at the start of do_read_fault(), do_cow_fault() and do_shared_fault() instead. Should be no performance change from the last commit. Link: https://lkml.kernel.org/r/20230724185410.1124082-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: move FAULT_FLAG_VMA_LOCK check down in handle_pte_fault()Matthew Wilcox (Oracle)1-5/+11
Call do_pte_missing() under the VMA lock ... then immediately retry in do_fault(). Link: https://lkml.kernel.org/r/20230724185410.1124082-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: handle some PMD faults under the VMA lockMatthew Wilcox (Oracle)1-14/+25
Push the VMA_LOCK check down from __handle_mm_fault() to handle_pte_fault(). Once again, we refuse to call ->huge_fault() with the VMA lock held, but we will wait for a PMD migration entry with the VMA lock held, handle NUMA migration and set the accessed bit. We were already doing this for anonymous VMAs, so it should be safe. Link: https://lkml.kernel.org/r/20230724185410.1124082-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: handle PUD faults under the VMA lockMatthew Wilcox (Oracle)1-13/+24
Postpone checking the VMA_LOCK flag until we've attempted to handle faults on PUDs. There's a mild upside to this patch in that we'll allocate the page tables while under the VMA lock rather than the mmap lock, reducing the hold time on the mmap lock, since the retry will find the page tables already populated. The real purpose here is to make a commit that shows we don't call ->huge_fault under the VMA lock. We do now handle setting the accessed bit on a PUD fault under the VMA lock, but that doesn't seem likely to be a measurable difference. Link: https://lkml.kernel.org/r/20230724185410.1124082-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()Matthew Wilcox (Oracle)1-9/+9
Handle a little more of the page fault path outside the mmap sem. The hugetlb path doesn't need to check whether the VMA is anonymous; the VM_HUGETLB flag is only set on hugetlbfs VMAs. There should be no performance change from the previous commit; this is simply a step to ease bisection of any problems. Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: allow per-VMA locks on file-backed VMAsMatthew Wilcox (Oracle)1-5/+5
Remove the TCP layering violation by allowing per-VMA locks on all VMAs. The fault path will immediately fail in handle_mm_fault(). There may be a small performance reduction from this patch as a little unnecessary work will be done on each page fault. See later patches for the improvement. Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: change do_vmi_align_munmap() tracking of VMAs to removeLiam R. Howlett1-9/+7
The majority of the calls to munmap a vm range is within a single vma. The maple tree is able to store a single entry at 0, with a size of 1 as a pointer and avoid any allocations. Change do_vmi_align_munmap() to store the VMAs being munmap()'ed into a tree indexed by the count. This will leverage the ability to store the first entry without a node allocation. Storing the entries into a tree by the count and not the vma start and end means changing the functions which iterate over the entries. Update unmap_vmas() and free_pgtables() to take a maple state and a tree end address to support this functionality. Passing through the same maple state to unmap_vmas() and free_pgtables() means the state needs to be reset between calls. This happens in the static unmap_region() and exit_mmap(). Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mmu_notifiers: don't invalidate secondary TLBs as part of ↵Alistair Popple1-6/+2
mmu_notifier_invalidate_range_end() Secondary TLBs are now invalidated from the architecture specific TLB invalidation functions. Therefore there is no need to explicitly notify or invalidate as part of the range end functions. This means we can remove mmu_notifier_invalidate_range_end_only() and some of the ptep_*_notify() functions. Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: pass folio into do_page_mkwrite()Sidhartha Kumar1-4/+3
Saves one implicit call to compound_head(). I'm not sure if I should change the name of the function to do_folio_mkwrite() and update the description comment to reference a folio as the vm_op is still called page_mkwrite. Link: https://lkml.kernel.org/r/20230711053544.156617-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: make PTE_MARKER_SWAPIN_ERROR more generalAxel Rasmussen1-6/+9
Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD", v4. This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4 for a detailed description of the feature. This patch (of 8): Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement UFFDIO_POISON, so make some various preparations for that: First, rename it to just PTE_MARKER_POISONED. The "SWAPIN" can be confusing since we're going to re-use it for something not really related to swap. This can be particularly confusing for things like hugetlbfs, which doesn't support swap whatsoever. Also rename some various helper functions. Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap. But, since we're going to re-use it, we want it to go ahead and copy it just like non-hugetlbfs memory does today. Since the code to do this is more complicated now, pull it out into a helper which can be re-used in both places. While we're at it, also make it slightly more explicit in its handling of e.g. uffd wp markers. For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an error entry, return VM_FAULT_HWPOISON. For most cases this change doesn't matter, e.g. a userspace program would receive a SIGBUS either way. But for UFFDIO_POISON, this change will let KVM guests get an MCE out of the box, instead of giving a SIGBUS to the hypervisor and requiring it to somehow inject an MCE. Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today because the lack of swap support means we'll never end up with such a PTE anyway, but this behavior will be needed once such entries *can* show up via UFFDIO_POISON. Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert do_read_fault() to use foliosSidhartha Kumar1-2/+4
Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert do_shared_fault() to foliosSidhartha Kumar1-4/+7
Saves three implicit calls to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert wp_page_shared() to use foliosSidhartha Kumar1-11/+11
Saves six implicit calls to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm/memory: convert do_page_mkwrite() to use foliosSidhartha Kumar1-5/+5
Saves one implicit call to compound_head(). Link: https://lkml.kernel.org/r/20230706163847.403202-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: fix some kernel-doc commentsYang Li1-1/+2
Add description of @mm_wr_locked and @mm. to silence the warnings: mm/memory.c:1716: warning: Function parameter or member 'mm_wr_locked' not described in 'unmap_vmas' mm/memory.c:5110: warning: Function parameter or member 'mm' not described in 'mm_account_fault' Link: https://lkml.kernel.org/r/20230707090034.125511-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: add ksm zero pages for each processxu xin1-2/+2
As the number of ksm zero pages is not included in ksm_merging_pages per process when enabling use_zero_pages, it's unclear of how many actual pages are merged by KSM. To let users accurately estimate their memory demands when unsharing KSM zero-pages, it's necessary to show KSM zero- pages per process. In addition, it help users to know the actual KSM profit because KSM-placed zero pages are also benefit from KSM. since unsharing zero pages placed by KSM accurately is achieved, then tracking empty pages merging and unmerging is not a difficult thing any longer. Since we already have /proc/<pid>/ksm_stat, just add the information of 'ksm_zero_pages' in it. Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18ksm: count all zero pages placed by KSMxu xin1-1/+4
As pages_sharing and pages_shared don't include the number of zero pages merged by KSM, we cannot know how many pages are zero pages placed by KSM when enabling use_zero_pages, which leads to KSM not being transparent with all actual merged pages by KSM. In the early days of use_zero_pages, zero-pages was unable to get unshared by the ways like MADV_UNMERGEABLE so it's hard to count how many times one of those zeropages was then unmerged. But now, unsharing KSM-placed zero page accurately has been achieved, so we can easily count both how many times a page full of zeroes was merged with zero-page and how many times one of those pages was then unmerged. and so, it helps to estimate memory demands when each and every shared page could get unshared. So we add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number of all zero pages placed by KSM. Meanwhile, we update the Documentation. Link: https://lkml.kernel.org/r/20230613030934.185944-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: use a folio in fault_dirty_shared_page()ZhangPeng1-8/+8
We can replace four implicit calls to compound_head() with one by using folio. Link: https://lkml.kernel.org/r/20230701032853.258697-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: increase usage of folio_next_index() helperSidhartha Kumar1-1/+1
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-09mm: Fix access_remote_vm() regression on tagged addressesKirill A. Shutemov1-0/+3
GDB uses /proc/PID/mem to access memory of the target process. GDB doesn't untag addresses manually, but relies on kernel to do the right thing. mem_rw() of procfs uses access_remote_vm() to get data from the target process. It worked fine until recent changes in __access_remote_vm() that now checks if there's VMA at target address using raw address. Untag the address before looking up the VMA. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Christina Schimpe <christina.schimpe@intel.com> Fixes: eee9c708cc89 ("gup: avoid stack expansion warning for known-good case") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-27mm: lock_vma_under_rcu() must check vma->anon_vma under vma lockJann Horn1-12/+16
lock_vma_under_rcu() tries to guarantee that __anon_vma_prepare() can't be called in the VMA-locked page fault path by ensuring that vma->anon_vma is set. However, this check happens before the VMA is locked, which means a concurrent move_vma() can concurrently call unlink_anon_vmas(), which disassociates the VMA's anon_vma. This means we can get UAF in the following scenario: THREAD 1 THREAD 2 ======== ======== <page fault> lock_vma_under_rcu() rcu_read_lock() mas_walk() check vma->anon_vma mremap() syscall move_vma() vma_start_write() unlink_anon_vmas() <syscall end> handle_mm_fault() __handle_mm_fault() handle_pte_fault() do_pte_missing() do_anonymous_page() anon_vma_prepare() __anon_vma_prepare() find_mergeable_anon_vma() mas_walk() [looks up VMA X] munmap() syscall (deletes VMA X) reusable_anon_vma() [called on freed VMA X] This is a security bug if you can hit it, although an attacker would have to win two races at once where the first race window is only a few instructions wide. This patch is based on some previous discussion with Linus Torvalds on the security list. Cc: stable@vger.kernel.org Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-11mm: Warn on shadow stack memory in wrong vmaRick Edgecombe1-0/+1
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One sharp edge is that PTEs that are both Write=0 and Dirty=1 are treated as shadow by the CPU, but this combination used to be created by the kernel on x86. Previous patches have changed the kernel to now avoid creating these PTEs unless they are for shadow stack memory. In case any missed corners of the kernel are still creating PTEs like this for non-shadow stack memory, and to catch any re-introductions of the logic, warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow stack VMAs when they are being zapped. This won't catch transient cases but should have decent coverage. In order to check if a PTE is shadow stack in core mm code, add two arch breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack specific code to be kept in arch/x86. Only do the check if shadow stack is supported by the CPU and configured because in rare cases older CPUs may write Dirty=1 to a Write=0 CPU on older CPUs. This check is handled in pte_shstk()/pmd_shstk(). Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com
2023-07-11mm: Make pte_mkwrite() take a VMARick Edgecombe1-2/+2
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous work pte_mkwrite() renamed pte_mkwrite_novma() and converted callers that don't have a VMA were to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com
2023-07-08mm: call arch_swap_restore() from do_swap_page()Peter Collingbourne1-0/+7
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved the call to swap_free() before the call to set_pte_at(), which meant that the MTE tags could end up being freed before set_pte_at() had a chance to restore them. Fix it by adding a call to the arch_swap_restore() hook before the call to swap_free(). Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965 Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") Signed-off-by: Peter Collingbourne <pcc@google.com> Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com> Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/ Acked-by: David Hildenbrand <david@redhat.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Steven Price <steven.price@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-29gup: avoid stack expansion warning for known-good caseLinus Torvalds1-0/+4
In commit a425ac5365f6 ("gup: add warning if some caller would seem to want stack expansion") I added a temporary warning to catch any strange GUP users that would be impacted by the fact that GUP no longer extends the stack. But it turns out that the warning is most easily triggered through __access_remote_vm(), that already knows to expand the stack - it just does it *after* calling GUP. So the warning is easy to trigger by just running gdb (or similar) and accessing things remotely under the stack. This just adds a temporary extra "expand stack early" to avoid the warning for the already converted case - not because the warning is bad, but because getting the warning for this known good case would then hide any subsequent warnings for any actually interesting cases. Let's try to remember to revert this change when we remove the warnings. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28Merge branch 'expand-stack'Linus Torvalds1-12/+138
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds1-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-28mm: fix __access_remote_vm() GUP failure caseLinus Torvalds1-0/+1
Commit ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") removed the vma argument from GUP handling, and instead added a helper function (get_user_page_vma_remote()) that looks it up separately using 'vma_lookup()'. And then converted existing users that needed a vma to use the helper instead. However, the helper function intentionally acts exactly like the old get_user_pages_remote() did, and only fills in 'vma' on successful page lookup. Fine so far. However, __access_remote_vm() wants the vma even for the unsuccessful case, and used to do a vma = vma_lookup(mm, addr); explicitly to look it up when the get_user_page() failed. However, that conversion commit incorrectly removed that vma lookup, thinking that get_user_page_vma_remote() would have done it. Not so. So add the vma_lookup() back in. Fixes: ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-27mm: always expand the stack with the mmap write lock heldLinus Torvalds1-1/+9
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended. Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64 Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24mm: make find_extend_vma() fail if write lock not heldLiam R. Howlett1-1/+1
Make calls to extend_vma() and find_extend_vma() fail if the write lock is required. To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order. Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24mm: make the page fault mmap locking killableLinus Torvalds1-4/+2
This is done as a separate patch from introducing the new lock_mm_and_find_vma() helper, because while it's an obvious change, it's not what x86 used to do in this area. We already abort the page fault on fatal signals anyway, so why should we wait for the mmap lock only to then abort later? With the new helper function that returns without the lock held on failure anyway, this is particularly easy and straightforward. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24mm: introduce new 'lock_mm_and_find_vma()' page fault helperLinus Torvalds1-0/+121
.. and make x86 use it. This basically extracts the existing x86 "find and expand faulting vma" code, but extends it to also take the mmap lock for writing in case we actually do need to expand the vma. We've historically short-circuited that case, and have some rather ugly special logic to serialize the stack segment expansion (since we only hold the mmap lock for reading) that doesn't match the normal VM locking. That slight violation of locking worked well, right up until it didn't: the maple tree code really does want proper locking even for simple extension of an existing vma. So extract the code for "look up the vma of the fault" from x86, fix it up to do the necessary write locking, and make it available as a helper function for other architectures that can use the common helper. Note: I say "common helper", but it really only handles the normal stack-grows-down case. Which is all architectures except for PA-RISC and IA64. So some rare architectures can't use the helper, but if they care they'll just need to open-code this logic. It's also worth pointing out that this code really would like to have an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a read-lock (for the common case) to taking the write lock (for having to extend the vma) in the normal single-threaded situation where there is no other locking activity. But that _is_ all the very uncommon special case, so while it would be nice to have such an operation, it probably doesn't matter in reality. I did put in the skeleton code for such a possible future expansion, even if it only acts as pseudo-documentation for what we're doing. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-23mm: remove references to pagevecMatthew Wilcox (Oracle)1-3/+3
Most of these should just refer to the LRU cache rather than the data structure used to implement the LRU cache. Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: ptep_get() conversionRyan Roberts1-46/+54
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19perf/core: allow pte_offset_map() to failHugh Dickins1-2/+4
In rare transient cases, not yet made possible, pte_offset_map() and pte_offet_map_lock() may not find a page table: handle appropriately. [hughd@google.com: __wp_page_copy_user(): don't call update_mmu_tlb() with NULL] Link: https://lkml.kernel.org/r/1a4db221-7872-3594-57ce-42369945ec8d@google.com Link: https://lkml.kernel.org/r/a194441b-63f3-adb6-5964-7ca3171ae7c2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/memory: handle_pte_fault() use pte_offset_map_nolock()Hugh Dickins1-25/+13
handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which corresponds to vmf.pte, instead of pte_lockptr() being used later, when there's a chance that the pmd entry might have changed, perhaps to none, or to a huge pmd, with no split ptlock in its struct page. Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock() will handle that case by failing. Update the "morph" comment above, looking forward to when shmem or file collapse to THP may not take mmap_lock for write (or not at all). do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but refresh it when refreshing vmf->pte. do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(), so adjust that to set vmf->ptl by pte_offset_map_nolock(). Link: https://lkml.kernel.org/r/c1107654-3929-60ac-223e-6877cbb86065@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/memory: allow pte_offset_map[_lock]() to failHugh Dickins1-86/+81
copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail; but with a comment on some further assumptions that are being made there. zap_pte_range() and zap_pmd_range(): adjust their interaction so that a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad(). Allow pte_offset_map_lock() to fail in many functions. Update comment on calling pte_alloc() in do_anonymous_page(). Remove redundant calls to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range() and copy_pmd_range(), those do simplify the next level down. Link: https://lkml.kernel.org/r/bb548d50-e99a-f29e-eab1-a43bef2a1287@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: use pmdp_get_lockless() without surplus barrier()Hugh Dickins1-12/+2
Patch series "mm: allow pte_offset_map[_lock]() to fail", v2. What is it all about? Some mmap_lock avoidance i.e. latency reduction. Initially just for the case of collapsing shmem or file pages to THPs; but likely to be relied upon later in other contexts e.g. freeing of empty page tables (but that's not work I'm doing). mmap_write_lock avoidance when collapsing to anon THPs? Perhaps, but again that's not work I've done: a quick attempt was not as easy as the shmem/file case. I would much prefer not to have to make these small but wide-ranging changes for such a niche case; but failed to find another way, and have heard that shmem MADV_COLLAPSE's usefulness is being limited by that mmap_write_lock it currently requires. These changes (though of course not these exact patches) have been in Google's data centre kernel for three years now: we do rely upon them. What is this preparatory series about? The current mmap locking will not be enough to guard against that tricky transition between pmd entry pointing to page table, and empty pmd entry, and pmd entry pointing to huge page: pte_offset_map() will have to validate the pmd entry for itself, returning NULL if no page table is there. What to do about that varies: sometimes nearby error handling indicates just to skip it; but in many cases an ACTION_AGAIN or "goto again" is appropriate (and if that risks an infinite loop, then there must have been an oops, or pfn 0 mistaken for page table, before). Given the likely extension to freeing empty page tables, I have not limited this set of changes to a THP config; and it has been easier, and sets a better example, if each site is given appropriate handling: even where deeper study might prove that failure could only happen if the pmd table were corrupted. Several of the patches are, or include, cleanup on the way; and by the end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and pte_offset_map_lock() then handle those original races and more. Most uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking its place. This patch (of 32): Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more reliable result with PAE (or READ_ONCE as before without PAE); and remove the unnecessary extra barrier()s which got left behind in its callers. HOWEVER: Note the small print in linux/pgtable.h, where it was designed specifically for fast GUP, and depends on interrupts being disabled for its full guarantee: most callers which have been added (here and before) do NOT have interrupts disabled, so there is still some need for caution. Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-18tcp: Use per-vma locking for receive zerocopyArjun Roy1-3/+4
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by: Arjun Roy <arjunroy@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09mm: fix failure to unmap pte on highmem systemsRyan Roberts1-4/+2
The loser of a race to service a pte for a device private entry in the swap path previously unlocked the ptl, but failed to unmap the pte. This only affects highmem systems since unmapping a pte is a noop on non-highmem systems. Link: https://lkml.kernel.org/r/20230602092949.545577-5-ryan.roberts@arm.com Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm/gup: remove vmas parameter from get_user_pages_remote()Lorenzo Stoakes1-10/+10
The only instances of get_user_pages_remote() invocations which used the vmas parameter were for a single page which can instead simply look up the VMA directly. In particular:- - __update_ref_ctr() looked up the VMA but did nothing with it so we simply remove it. - __access_remote_vm() was already using vma_lookup() when the original lookup failed so by doing the lookup directly this also de-duplicates the code. We are able to perform these VMA operations as we already hold the mmap_lock in order to be able to call get_user_pages_remote(). As part of this work we add get_user_page_vma_remote() which abstracts the VMA lookup, error handling and decrementing the page reference count should the VMA lookup fail. This forms part of a broader set of patches intended to eliminate the vmas parameter altogether. [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64) Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390) Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: do not increment pgfault stats when page fault handler retriesSuren Baghdasaryan1-19/+27
If the page fault handler requests a retry, we will count the fault multiple times. This is a relatively harmless problem as the retry paths are not often requested, and the only user-visible problem is that the fault counter will be slightly higher than it should be. Nevertheless, userspace only took one fault, and should not see the fact that the kernel had to retry the fault multiple times. Move page fault accounting into mm_account_fault() and skip incomplete faults which will be accounted upon completion. Link: https://lkml.kernel.org/r/20230419175836.3857458-1-surenb@google.com Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Peter Xu <peterx@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm: hwpoison: support recovery from HugePage copy-on-write faultsLiu Shixin1-21/+38
copy-on-write of hugetlb user pages with uncorrectable errors will result in a kernel crash. This is because the copy is performed in kernel mode and in general we can not handle accessing memory with such errors while in kernel mode. Commit a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults") introduced the routine copy_user_highpage_mc() to gracefully handle copying of user pages with uncorrectable errors. However, the separate hugetlb copy-on-write code paths were not modified as part of commit a873dfe1032a. Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so that they can also gracefully handle uncorrectable errors in user pages. This involves changing the hugetlb specific routine copy_user_large_folio() from type void to int so that it can return an error. Modify the hugetlb userfaultfd code in the same way so that it can return -EHWPOISON if it encounters an uncorrectable error. Link: https://lkml.kernel.org/r/20230413131349.2524210-1-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm: convert copy_user_huge_page() to copy_user_large_folio()ZhangPeng1-14/+14
Replace copy_user_huge_page() with copy_user_large_folio(). copy_user_large_folio() does the same as copy_user_huge_page(), but takes in folios instead of pages. Remove pages_per_huge_page from copy_user_large_folio(), because we can get that from folio_nr_pages(dst). Convert copy_user_gigantic_page() to take in folios. Link: https://lkml.kernel.org/r/20230410133932.32288-6-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18userfaultfd: convert copy_huge_page_from_user() to copy_folio_from_user()ZhangPeng1-12/+11
Replace copy_huge_page_from_user() with copy_folio_from_user(). copy_folio_from_user() does the same as copy_huge_page_from_user(), but takes in a folio instead of a page. Convert page_kaddr to kaddr in copy_folio_from_user() to do indenting cleanup. Link: https://lkml.kernel.org/r/20230410133932.32288-4-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18userfaultfd: use kmap_local_page() in copy_huge_page_from_user()ZhangPeng1-8/+6
kmap() and kmap_atomic() are being deprecated in favor of kmap_local_page() which is appropriate for any thread local context.[1] Let's replace the kmap() and kmap_atomic() with kmap_local_page() in copy_huge_page_from_user(). When allow_pagefault is false, disable page faults to prevent potential deadlock.[2] [1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/ [2] https://lkml.kernel.org/r/20221025220136.2366143-1-ira.weiny@intel.com Link: https://lkml.kernel.org/r/20230410133932.32288-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changesAndrew Morton1-1/+15
2023-04-05sched/numa: enhance vma scanning logicRaghavendra K T1-0/+3
During Numa scanning make sure only relevant vmas of the tasks are scanned. Before: All the tasks of a process participate in scanning the vma even if they do not access vma in it's lifespan. Now: Except cases of first few unconditional scans, if a process do not touch vma (exluding false positive cases of PID collisions) tasks no longer scan all vma Logic used: 1) 6 bits of PID used to mark active bit in vma numab status during fault to remember PIDs accessing vma. (Thanks Mel) 2) Subsequently in scan path, vma scanning is skipped if current PID had not accessed vma. 3) First two times we do allow unconditional scan to preserve earlier behaviour of scanning. Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch to store pid information and Peter Zijlstra <peterz@infradead.org> (Usage of test and set bit) Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: introduce per-VMA lock statisticsSuren Baghdasaryan1-0/+2
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra statistics about handling page fault under VMA lock. Link: https://lkml.kernel.org/r/20230227173632.3292573-29-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: prevent userfaults to be handled under per-vma lockSuren Baghdasaryan1-0/+9
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault handling under VMA lock and retry holding mmap_lock. This can be handled more gracefully in the future. Link: https://lkml.kernel.org/r/20230227173632.3292573-28-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: prevent do_swap_page from handling page faults under VMA lockSuren Baghdasaryan1-0/+5
Due to the possibility of do_swap_page dropping mmap_lock, abort fault handling under VMA lock and retry holding mmap_lock. This can be handled more gracefully in the future. Link: https://lkml.kernel.org/r/20230227173632.3292573-27-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: fall back to mmap_lock if vma->anon_vma is not yet setSuren Baghdasaryan1-0/+4
When vma->anon_vma is not set, page fault handler will set it by either reusing anon_vma of an adjacent VMA if VMAs are compatible or by allocating a new one. find_mergeable_anon_vma() walks VMA tree to find a compatible adjacent VMA and that requires not only the faulting VMA to be stable but also the tree structure and other VMAs inside that tree. Therefore locking just the faulting VMA is not enough for this search. Fall back to taking mmap_lock when vma->anon_vma is not set. This situation happens only on the first page fault and should not affect overall performance. Link: https://lkml.kernel.org/r/20230227173632.3292573-25-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: introduce lock_vma_under_rcu to be used from arch-specific codeSuren Baghdasaryan1-0/+46
Introduce lock_vma_under_rcu function to lookup and lock a VMA during page fault handling. When VMA is not found, can't be locked or changes after being locked, the function returns NULL. The lookup is performed under RCU protection to prevent the found VMA from being destroyed before the VMA lock is acquired. VMA lock statistics are updated according to the results. For now only anonymous VMAs can be searched this way. In other cases the function returns NULL. Link: https://lkml.kernel.org/r/20230227173632.3292573-24-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: conditionally write-lock VMA in free_pgtablesSuren Baghdasaryan1-1/+5
Normally free_pgtables needs to lock affected VMAs except for the case when VMAs were isolated under VMA write-lock. munmap() does just that, isolating while holding appropriate locks and then downgrading mmap_lock and dropping per-VMA locks before freeing page tables. Add a parameter to free_pgtables for such scenario. Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: hold the RCU read lock over calls to ->map_pagesMatthew Wilcox (Oracle)1-3/+8
Prevent filesystems from doing things which sleep in their map_pages method. This is in preparation for a pagefault path protected only by RCU. Link: https://lkml.kernel.org/r/20230327174515.1811532-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: remove unused vmf_insert_mixed_prot()Lorenzo Stoakes1-39/+18
Patch series "Remove drm/ttm-specific mm changes". Functionality was added specifically for the DRM TTM driver to support mapping memory for VM_MIXEDMAP VMAs with customised protection flags, however this has now been rolled back as issues were found with this approach. This series removes the mm changes too, retaining some of the useful comments. This patch (of 3): The sole user of vmf_insert_mixed_prot(), the drm ttm module, stopped using this in commit f91142c62161 ("drm/ttm: nuke VM_MIXEDMAP on BO mappings v3") citing use of VM_MIXEDMAP in this case being terribly broken. Remove this now-dead code and references to it, but retain the useful description of the prot != vma->vm_page_prot case, moving it to vmf_insert_pfn_prot() instead. Link: https://lkml.kernel.org/r/cover.1678661628.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/a069644388e6f1593a7020d15840e6fc9f39bcaf.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: prefer fault_around_pages to fault_around_bytesLorenzo Stoakes1-11/+15
All use of this value is now at page granularity, so specify the variable as such too. This simplifies the logic. We maintain the debugfs entry to ensure that there are no user-visible changes. Link: https://lkml.kernel.org/r/4995bad07fe9baa51c786fa0d81819dddfb57654.1679089214.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: refactor do_fault_around()Lorenzo Stoakes1-22/+16
Patch series "Refactor do_fault_around()" Refactor do_fault_around() to avoid bitwise tricks and rather difficult to follow logic. Additionally, prefer fault_around_pages to fault_around_bytes as the operations are performed at a base page granularity. This patch (of 2): The existing logic is confusing and fails to abstract a number of bitwise tricks. Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index and represent the address range using PTE offsets, which naturally make it clear that the operation is intended to occur within only a single PTE and prevent spanning of more than one page table. We rely on the fact that fault_around_bytes will always be page-aligned, at least one page in size, a power of two and that it will not exceed PAGE_SIZE * PTRS_PER_PTE in size (i.e. the address space mapped by a PTE). These are all guaranteed by fault_around_bytes_set(). Link: https://lkml.kernel.org/r/cover.1679089214.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/d125db1c3665a63b80cea29d56407825482e2262.1679089214.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm/uffd: UFFD_FEATURE_WP_UNPOPULATEDPeter Xu1-14/+42
Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4. The new feature bit makes anonymous memory acts the same as file memory on userfaultfd-wp in that it'll also wr-protect none ptes. It can be useful in two cases: (1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot, so pre-fault can be replaced by enabling this flag and speed up protections (2) It helps to implement async uffd-wp mode that Muhammad is working on [1] It's debatable whether this is the most ideal solution because with the new feature bit set, wr-protect none pte needs to pre-populate the pgtables to the last level (PAGE_SIZE). But it seems fine so far to service either purpose above, so we can leave optimizations for later. The series brings pte markers to anonymous memory too. There's some change in the common mm code path in the 1st patch, great to have some eye looking at it, but hopefully they're still relatively straightforward. This patch (of 2): This is a new feature that controls how uffd-wp handles none ptes. When it's set, the kernel will handle anonymous memory the same way as file memory, by allowing the user to wr-protect unpopulated ptes. File memories handles none ptes consistently by allowing wr-protecting of none ptes because of the unawareness of page cache being exist or not. For anonymous it was not as persistent because we used to assume that we don't need protections on none ptes or known zero pages. One use case of such a feature bit was VM live snapshot, where if without wr-protecting empty ptes the snapshot can contain random rubbish in the holes of the anonymous memory, which can cause misbehave of the guest when the guest OS assumes the pages should be all zeros. QEMU worked it around by pre-populate the section with reads to fill in zero page entries before starting the whole snapshot process [1]. Recently there's another need raised on using userfaultfd wr-protect for detecting dirty pages (to replace soft-dirty in some cases) [2]. In that case if without being able to wr-protect none ptes by default, the dirty info can get lost, since we cannot treat every none pte to be dirty (the current design is identify a page dirty based on uffd-wp bit being cleared). In general, we want to be able to wr-protect empty ptes too even for anonymous. This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make uffd-wp handling on none ptes being consistent no matter what the memory type is underneath. It doesn't have any impact on file memories so far because we already have pte markers taking care of that. So it only affects anonymous. The feature bit is by default off, so the old behavior will be maintained. Sometimes it may be wanted because the wr-protect of none ptes will contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to anonymous), but also on creating the pgtables to store the pte markers. So there's potentially less chance of using thp on the first fault for a none pmd or larger than a pmd. The major implementation part is teaching the whole kernel to understand pte markers even for anonymously mapped ranges, meanwhile allowing the UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the new feature bit is set. Note that even if the patch subject starts with mm/uffd, there're a few small refactors to major mm path of handling anonymous page faults. But they should be straightforward. With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all the memory before wr-protect during taking a live snapshot. Quotting from Muhammad's test result here [3] based on a simple program [4]: (1) With huge page disabled echo madvise > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 1111453 (pre-fault 1101011) Test MADVISE: 278276 (pre-fault 266378) Test WP-UNPOPULATE: 11712 (2) With Huge page enabled echo always > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 22521 (pre-fault 22348) Test MADVISE: 4909 (pre-fault 4743) Test WP-UNPOPULATE: 14448 There'll be a great perf boost for no-thp case, while for thp enabled with extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE, but that's low possibility in reality, also the overhead was not reduced but postponed until a follow up write on any huge zero thp, so potentially it is faster by making the follow up writes slower. [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/ [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/ [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/ [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c [peterx@redhat.com: comment changes, oneliner fix to khugepaged] Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: take a page reference when removing device exclusive entriesAlistair Popple1-1/+15
Device exclusive page table entries are used to prevent CPU access to a page whilst it is being accessed from a device. Typically this is used to implement atomic operations when the underlying bus does not support atomic access. When a CPU thread encounters a device exclusive entry it locks the page and restores the original entry after calling mmu notifiers to signal drivers that exclusive access is no longer available. The device exclusive entry holds a reference to the page making it safe to access the struct page whilst the entry is present. However the fault handling code does not hold the PTL when taking the page lock. This means if there are multiple threads faulting concurrently on the device exclusive entry one will remove the entry whilst others will wait on the page lock without holding a reference. This can lead to threads locking or waiting on a folio with a zero refcount. Whilst mmap_lock prevents the pages getting freed via munmap() they may still be freed by a migration. This leads to warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount drops to zero. Fix this by trying to take a reference on the folio before locking it. The code already checks the PTE under the PTL and aborts if the entry is no longer there. It is also possible the folio has been unmapped, freed and re-allocated allowing a reference to be taken on an unrelated folio. This case is also detected by the PTE check and the folio is unlocked without further changes. Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()Gerald Schaefer1-1/+2
s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: memory: use folio_throttle_swaprate() in do_cow_fault()Kefeng Wang1-1/+1
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: memory: use folio_throttle_swaprate() in do_anonymous_page()Kefeng Wang1-1/+1
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: memory: use folio_throttle_swaprate() in wp_page_copy()Kefeng Wang1-1/+1
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: memory: use folio_throttle_swaprate() in page_copy_prealloc()Kefeng Wang1-1/+1
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: memory: use folio_throttle_swaprate() in do_swap_page()Kefeng Wang1-1/+1
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28x86/mm/pat: clear VM_PAT if copy_p4d_range failedMa Wupeng1-0/+1
Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20mm/uffd: fix comment in handling pte markersPeter Xu1-3/+1
The comment is obsolete after f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode", 2022-08-20). Remove it. Link: https://lkml.kernel.org/r/20230215205800.223549-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-10Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton1-0/+3
To pick up depended-upon changes
2023-02-09mm: introduce __vm_flags_mod and use it in untrack_pfnSuren Baghdasaryan1-6/+7
There are scenarios when vm_flags can be modified without exclusive mmap_lock, such as: - after VMA was isolated and mmap_lock was downgraded or dropped - in exit_mmap when there are no other mm users and locking is unnecessary Introduce __vm_flags_mod to avoid assertions when the caller takes responsibility for the required locking. Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for flags modification to avoid assertion. Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan1-3/+3
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09mm: hwpoison: support recovery from ksm_might_need_to_copy()Kefeng Wang1-0/+3
When the kernel copies a page from ksm_might_need_to_copy(), but runs into an uncorrectable error, it will crash since poisoned page is consumed by kernel, this is similar to the issue recently fixed by Copy-on-write poison recovery. When an error is detected during the page copy, return VM_FAULT_HWPOISON in do_swap_page(), and install a hwpoison entry in unuse_pte() when swapoff, which help us to avoid system crash. Note, memory failure on a KSM page will be skipped, but still call memory_failure_queue() to be consistent with general memory failure process, and we could support KSM page recovery in the feature. [wangkefeng.wang@huawei.com: enhance unuse_pte(), fix issue found by lkp] Link: https://lkml.kernel.org/r/20221213120523.141588-1-wangkefeng.wang@huawei.com [wangkefeng.wang@huawei.com: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()] Link: https://lkml.kernel.org/r/20230201074433.96641-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20221209072801.193221-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: use a folio in copy_present_pte()Matthew Wilcox (Oracle)1-6/+9
We still have to keep the page around because we need to know which page in the folio we're copying, but we can replace five implict calls to compound_head() with one. Link: https://lkml.kernel.org/r/20230116191813.2145215-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: use a folio in copy_pte_range()Matthew Wilcox (Oracle)1-26/+25
Allocate an order-0 folio instead of a page and pass it all the way down the call chain. Removes dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20230116191813.2145215-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert wp_page_copy() to use foliosMatthew Wilcox (Oracle)1-33/+32
Use new_folio instead of new_page throughout, because we allocated it and know it's an order-0 folio. Most old_page uses become old_folio, but use vmf->page where we need the precise page. Link: https://lkml.kernel.org/r/20230116191813.2145215-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert do_anonymous_page() to use a folioMatthew Wilcox (Oracle)1-11/+9
Removes six calls to compound_head(); some inline and some external. Link: https://lkml.kernel.org/r/20230116191813.2145215-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>