aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2020-10-22Merge tag 'ext4_for_linus' of ↵HEADmasterLinus Torvalds35-396/+4725
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The siginificant new ext4 feature this time around is Harshad's new fast_commit mode. In addition, thanks to Mauricio for fixing a race where mmap'ed pages that are being changed in parallel with a data=journal transaction commit could result in bad checksums in the failure that could cause journal replays to fail. Also notable is Ritesh's buffered write optimization which can result in significant improvements on parallel write workloads. (The kernel test robot reported a 330.6% improvement on fio.write_iops on a 96 core system using DAX) Besides that, we have the usual miscellaneous cleanups and bug fixes" Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: fix invalid inode checksum ext4: add fast commit stats in procfs ext4: add a mount opt to forcefully turn fast commits on ext4: fast commit recovery path jbd2: fast commit recovery path ext4: main fast-commit commit path jbd2: add fast commit machinery ext4 / jbd2: add fast commit initialization ext4: add fast_commit feature and handling for extended mount options doc: update ext4 and journalling docs to include fast commit feature ext4: Detect already used quota file early jbd2: avoid transaction reuse after reformatting ext4: use the normal helper to get the actual inode ext4: fix bs < ps issue reported with dioread_nolock mount opt ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() ext4: data=journal: fixes for ext4_page_mkwrite() jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable() ext4: use ext4_sb_bread() instead of sb_bread() ...
2020-10-22Merge branch 'work.set_fs' of ↵Linus Torvalds52-495/+346
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull initial set_fs() removal from Al Viro: "Christoph's set_fs base series + fixups" * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Allow a NULL pos pointer to __kernel_read fs: Allow a NULL pos pointer to __kernel_write powerpc: remove address space overrides using set_fs() powerpc: use non-set_fs based maccess routines x86: remove address space overrides using set_fs() x86: make TASK_SIZE_MAX usable from assembly code x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h lkdtm: remove set_fs-based tests test_bitmap: remove user bitmap tests uaccess: add infrastructure for kernel builds with set_fs() fs: don't allow splice read/write without explicit ops fs: don't allow kernel reads and writes without iter ops sysctl: Convert to iter interfaces proc: add a read_iter method to proc proc_ops proc: cleanup the compat vs no compat file ops proc: remove a level of indentation in proc_get_inode
2020-10-22Merge tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds44-1368/+1490
Pull nfsd updates from Bruce Fields: "The one new feature this time, from Anna Schumaker, is READ_PLUS, which has the same arguments as READ but allows the server to return an array of data and hole extents. Otherwise it's a lot of cleanup and bugfixes" * tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits) NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() sunrpc: raise kernel RPC channel buffer size svcrdma: fix bounce buffers for unaligned offsets and multiple pages nfsd: remove unneeded break net/sunrpc: Fix return value for sysctl sunrpc.transports NFSD: Encode a full READ_PLUS reply NFSD: Return both a hole and a data segment NFSD: Add READ_PLUS hole segment encoding NFSD: Add READ_PLUS data support NFSD: Hoist status code encoding into XDR encoder functions NFSD: Map nfserr_wrongsec outside of nfsd_dispatch NFSD: Remove the RETURN_STATUS() macro NFSD: Call NFSv2 encoders on error returns NFSD: Fix .pc_release method for NFSv2 NFSD: Remove vestigial typedefs NFSD: Refactor nfsd_dispatch() error paths NFSD: Clean up nfsd_dispatch() variables NFSD: Clean up stale comments in nfsd_dispatch() NFSD: Clean up switch statement in nfsd_dispatch() ...
2020-10-22Merge tag 'exfat-for-5.10-rc1' of ↵Linus Torvalds7-127/+71
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - Replace memcpy with structure assignment - Remove unneeded codes and use helper function i_blocksize() - Fix typos found by codespell * tag 'exfat-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: remove useless check in exfat_move_file() exfat: remove 'rwoffset' in exfat_inode_info exfat: replace memcpy with structure assignment exfat: remove useless directory scan in exfat_add_entry() exfat: eliminate dead code in exfat_find() exfat: use i_blocksize() to get blocksize exfat: fix misspellings using codespell tool
2020-10-22Merge tag '9p-for-5.10-rc1' of git://github.com/martinetd/linuxLinus Torvalds3-5/+5
Pull 9p updates from Dominique Martinet: "A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized variable warning) and code cleanup (xen)" * tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux: net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid 9p/xen: Fix format argument warning 9P: Cast to loff_t before multiplying
2020-10-21ext4: fix invalid inode checksumLuo Meng1-5/+6
During the stability test, there are some errors: ext4_lookup:1590: inode #6967: comm fsstress: iget: checksum invalid. If the inode->i_iblocks too big and doesn't set huge file flag, checksum will not be recalculated when update the inode information to it's buffer. If other inode marks the buffer dirty, then the inconsistent inode will be flushed to disk. Fix this problem by checking i_blocks in advance. Cc: stable@kernel.org Signed-off-by: Luo Meng <luomeng12@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20201020013631.3796673-1-luomeng12@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: add fast commit stats in procfsHarshad Shirwadkar3-1/+37
This commit adds a file in procfs that tracks fast commit related statistics. root@kvm-xfstests:/mnt# cat /proc/fs/ext4/vdc/fc_info fc stats: 7772 commits 15 ineligible 4083 numblks 2242us avg_commit_time Ineligible reasons: "Extended attributes changed": 0 "Cross rename": 0 "Journal flag changed": 0 "Insufficient memory": 0 "Swap boot": 0 "Resize": 0 "Dir renamed": 0 "Falloc range op": 0 "FC Commit Failed": 15 Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-10-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: add a mount opt to forcefully turn fast commits onHarshad Shirwadkar1-1/+5
This is a debug only mount option that forcefully turns fast commits on at mount time. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-9-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: fast commit recovery pathHarshad Shirwadkar14-131/+1821
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: fast commit recovery pathHarshad Shirwadkar3-4/+88
This patch adds fast commit recovery support in JBD2. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: main fast-commit commit pathHarshad Shirwadkar13-29/+1707
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: add fast commit machineryHarshad Shirwadkar4-1/+268
This functions adds necessary APIs needed in JBD2 layer for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4 / jbd2: add fast commit initializationHarshad Shirwadkar7-6/+122
This patch adds fast commit area trackers in the journal_t structure. These are initialized via the jbd2_fc_init() routine that this patch adds. This patch also adds ext4/fast_commit.c and ext4/fast_commit.h files for fast commit code that will be added in subsequent patches in this series. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: add fast_commit feature and handling for extended mount optionsHarshad Shirwadkar3-6/+30
We are running out of mount option bits. Add handling for using s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add ability to turn off the fast commit feature in Ext4. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21doc: update ext4 and journalling docs to include fast commit featureHarshad Shirwadkar2-0/+99
This patch adds necessary documentation for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-22exfat: remove useless check in exfat_move_file()Tetsuhiro Kohada1-5/+0
In exfat_move_file(), the identity of source and target directory has been checked by the caller. Also, it gets stream.start_clu from file dir-entry, which is an invalid determination. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: remove 'rwoffset' in exfat_inode_infoTetsuhiro Kohada5-20/+9
Remove 'rwoffset' in exfat_inode_info and replace it with the parameter of exfat_readdir(). Since rwoffset is referenced only by exfat_readdir(), it is not necessary a exfat_inode_info's member. Also, change cpos to point to the next of entry-set, and return the index of dir-entry via dir_entry->entry. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: replace memcpy with structure assignmentTetsuhiro Kohada3-14/+10
Use structure assignment instead of memcpy. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: remove useless directory scan in exfat_add_entry()Tetsuhiro Kohada1-10/+1
There is nothing in directory just created, so there is no need to scan. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: eliminate dead code in exfat_find()Tetsuhiro Kohada2-74/+47
The exfat_find_dir_entry() called by exfat_find() doesn't return -EEXIST. Therefore, the root-dir information setting is never executed. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: use i_blocksize() to get blocksizeXianting Tian1-1/+1
We alreday has the interface i_blocksize() to get blocksize, so use it. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-22exfat: fix misspellings using codespell toolNamjae Jeon3-3/+3
Sedat reported typos using codespell tool. $ codespell fs/exfat/*.c | grep -v iput fs/exfat/namei.c:293: upto ==> up to fs/exfat/nls.c:14: tabel ==> table $ codespell fs/exfat/*.h fs/exfat/exfat_fs.h:133: usally ==> usually Fix typos found by codespell. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-21Merge tag 'linux-watchdog-5.10-rc1' of ↵Linus Torvalds15-48/+374
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - Add Toshiba Visconti watchdog driver - it87_wdt: add IT8772 + IT8784 - several fixes and improvements * tag 'linux-watchdog-5.10-rc1' of git://www.linux-watchdog.org/linux-watchdog: watchdog: Add Toshiba Visconti watchdog driver watchdog: bindings: Add binding documentation for Toshiba Visconti watchdog device watchdog: it87_wdt: add IT8784 ID watchdog: sp5100_tco: Enable watchdog on Family 17h devices if disabled watchdog: sp5100: Fix definition of EFCH_PM_DECODEEN3 watchdog: renesas_wdt: support handover from bootloader watchdog: imx7ulp: Watchdog should continue running for wait/stop mode watchdog: rti: Simplify with dev_err_probe() watchdog: davinci: Simplify with dev_err_probe() watchdog: cadence: Simplify with dev_err_probe() watchdog: remove unneeded inclusion of <uapi/linux/sched/types.h> watchdog: Use put_device on error watchdog: Fix memleak in watchdog_cdev_register watchdog: imx7ulp: Strictly follow the sequence for wdog operations watchdog: it87_wdt: add IT8772 ID watchdog: pcwd_usb: Avoid GFP_ATOMIC where it is not needed drivers: watchdog: rdc321x_wdt: Fix race condition bugs
2020-10-21Merge tag 'rtc-5.10' of ↵Linus Torvalds21-368/+1403
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "A new driver this cycle is making the bulk of the changes and the rx8010 driver has been rework to use the modern APIs. Summary: Subsystem: - new generic DT properties: aux-voltage-chargeable, trickle-voltage-millivolt New driver: - Microcrystal RV-3032 Drivers: - ds1307: use aux-voltage-chargeable - r9701, rx8010: modernization of the driver - rv3028: fix clock output, trickle resistor values, RAM configuration registers" * tag 'rtc-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits) rtc: r9701: set range rtc: r9701: convert to devm_rtc_allocate_device rtc: r9701: stop setting RWKCNT rtc: r9701: remove useless memset rtc: r9701: stop setting a default time rtc: r9701: remove leftover comment rtc: rv3032: Add a driver for Microcrystal RV-3032 dt-bindings: rtc: rv3032: add RV-3032 bindings dt-bindings: rtc: add trickle-voltage-millivolt rtc: rv3028: ensure ram configuration registers are saved rtc: rv3028: factorize EERD bit handling rtc: rv3028: fix trickle resistor values rtc: rv3028: fix clock output support rtc: mt6397: Remove unused member dev rtc: rv8803: simplify the return expression of rv8803_nvram_write rtc: meson: simplify the return expression of meson_vrtc_probe rtc: rx8010: rename rx8010_init_client() to rx8010_init() rtc: ds1307: enable rx8130's backup battery, make it chargeable optionally rtc: ds1307: consider aux-voltage-chargeable rtc: ds1307: store previous charge default per chip ...
2020-10-21Merge branch 'dmi-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging Pull dmi update from Jean Delvare. * 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging: Replace HTTP links with HTTPS ones: DMI/SMBIOS SUPPORT
2020-10-21Merge branch 'i2c/for-5.10' of ↵Linus Torvalds41-920/+4009
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: - if a host can be a client, too, the I2C core can now use it to emulate SMBus HostNotify support (STM32 and R-Car added this so far) - also for client mode, a testunit has been added. It can create rare situations on the bus, so host controllers can be tested - a binding has been added to mark the bus as "single-master". This allows for better timeout detections - new driver for Mellanox Bluefield - massive refactoring of the Tegra driver - EEPROMs recognized by the at24 driver can now have custom names - rest is driver updates * 'i2c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (80 commits) Documentation: i2c: add testunit docs to index i2c: tegra: Improve driver module description i2c: tegra: Clean up whitespaces, newlines and indentation i2c: tegra: Clean up and improve comments i2c: tegra: Clean up printk messages i2c: tegra: Clean up variable names i2c: tegra: Improve formatting of variables i2c: tegra: Check errors for both positive and negative values i2c: tegra: Factor out hardware initialization into separate function i2c: tegra: Factor out register polling into separate function i2c: tegra: Factor out packet header setup from tegra_i2c_xfer_msg() i2c: tegra: Factor out error recovery from tegra_i2c_xfer_msg() i2c: tegra: Rename wait/poll functions i2c: tegra: Remove "dma" variable from tegra_i2c_xfer_msg() i2c: tegra: Remove redundant check in tegra_i2c_issue_bus_clear() i2c: tegra: Remove likely/unlikely from the code i2c: tegra: Remove outdated barrier() i2c: tegra: Clean up variable types i2c: tegra: Reorder location of functions in the code i2c: tegra: Clean up probe function ...
2020-10-21Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds26-443/+689
Pull ceph updates from Ilya Dryomov: - a patch that removes crush_workspace_mutex (myself). CRUSH computations are no longer serialized and can run in parallel. - a couple new filesystem client metrics for "ceph fs top" command (Xiubo Li) - a fix for a very old messenger bug that affected the filesystem, marked for stable (myself) - assorted fixups and cleanups throughout the codebase from Jeff and others. * tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits) libceph: clear con->out_msg on Policy::stateful_server faults libceph: format ceph_entity_addr nonces as unsigned libceph: fix ENTITY_NAME format suggestion libceph: move a dout in queue_con_delay() ceph: comment cleanups and clarifications ceph: break up send_cap_msg ceph: drop separate mdsc argument from __send_cap ceph: promote to unsigned long long before shifting ceph: don't SetPageError on readpage errors ceph: mark ceph_fmt_xattr() as printf-like for better type checking ceph: fold ceph_update_writeable_page into ceph_write_begin ceph: fold ceph_sync_writepages into writepage_nounlock ceph: fold ceph_sync_readpages into ceph_readpage ceph: don't call ceph_update_writeable_page from page_mkwrite ceph: break out writeback of incompatible snap context to separate function ceph: add a note explaining session reject error string libceph: switch to the new "osd blocklist add" command libceph, rbd, ceph: "blacklist" -> "blocklist" ceph: have ceph_writepages_start call pagevec_lookup_range_tag ceph: use kill_anon_super helper ...
2020-10-21NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copyDai Ngo8-8/+219
NFS_FS=y as dependency of CONFIG_NFSD_V4_2_INTER_SSC still have build errors and some configs with NFSD=m to get NFS4ERR_STALE error when doing inter server copy. Added ops table in nfs_common for knfsd to access NFS client modules. Fixes: 3ac3711adb88 ("NFSD: Fix NFS server build errors") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20Merge tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarrayLinus Torvalds11-35/+116
Pull XArray updates from Matthew Wilcox: - Fix the test suite after introduction of the local_lock - Fix a bug in the IDA spotted by Coverity - Change the API that allows the workingset code to delete a node - Fix xas_reload() when dealing with entries that occupy multiple indices - Add a few more tests to the test suite - Fix an unsigned int being shifted into an unsigned long * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray: XArray: Fix xas_create_range for ranges above 4 billion radix-tree: fix the comment of radix_tree_next_slot() XArray: Fix xas_reload for multi-index entries XArray: Add private interface for workingset node deletion XArray: Fix xas_for_each_conflict documentation XArray: Test marked multiorder iterations XArray: Test two more things about xa_cmpxchg ida: Free allocated bitmap in error path radix tree test suite: Fix compilation
2020-10-20Merge tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds36-502/+989
Pull NFS client updates from Anna Schumaker: "Stable Fixes: - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+ - Fix nfs_path in case of a rename retry - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag New features and improvements: - Replace dprintk() calls with tracepoints - Make cache consistency bitmap dynamic - Added support for the NFS v4.2 READ_PLUS operation - Improvements to net namespace uniquifier Other bugfixes and cleanups: - Remove redundant clnt pointer - Don't update timeout values on connection resets - Remove redundant tracepoints - Various cleanups to comments - Fix oops when trying to use copy_file_range with v4.0 source server - Improvements to flexfiles mirrors - Add missing 'local_lock=posix' mount option" * tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits) NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag NFSv4: Fix up RCU annotations for struct nfs_netns_client NFS: Only reference user namespace from nfs4idmap struct instead of cred nfs: add missing "posix" local_lock constant table definition NFSv4: Use the net namespace uniquifier if it is set NFSv4: Clean up initialisation of uniquified client id strings NFS: Decode a full READ_PLUS reply SUNRPC: Add an xdr_align_data() function NFS: Add READ_PLUS hole segment decoding SUNRPC: Add the ability to expand holes in data pages SUNRPC: Split out _shift_data_right_tail() SUNRPC: Split out xdr_realign_pages() from xdr_align_pages() NFS: Add READ_PLUS data segment support NFS: Use xdr_page_pos() in NFSv4 decode_getacl() SUNRPC: Implement a xdr_page_pos() function SUNRPC: Split out a function for setting current page NFS: fix nfs_path in case of a rename retry fs: nfs: return per memcg count for xattr shrinkers NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE nfs: remove incorrect fallthrough label ...
2020-10-20Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-blockLinus Torvalds7-301/+475
Pull io_uring updates from Jens Axboe: "A mix of fixes and a few stragglers. In detail: - Revert the bogus __read_mostly that we discussed for the initial pull request. - Fix a merge window regression with fixed file registration error path handling. - Fix io-wq numa node affinities. - Series abstracting out an io_identity struct, making it both easier to see what the personality items are, and also easier to to adopt more. Use this to cover audit logging. - Fix for read-ahead disabled block condition in async buffered reads, and using single page read-ahead to unify what generic_file_buffer_read() path is used. - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel) - Poll fix (Pavel)" * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits) io_uring: use blk_queue_nowait() to check if NOWAIT supported mm: use limited read-ahead to satisfy read mm: mark async iocb read as NOWAIT once some data has been copied io_uring: fix double poll mask init io-wq: inherit audit loginuid and sessionid io_uring: use percpu counters to track inflight requests io_uring: assign new io_identity for task if members have changed io_uring: store io_identity in io_uring_task io_uring: COW io_identity on mismatch io_uring: move io identity items into separate struct io_uring: rely solely on work flags to determine personality. io_uring: pass required context in as flags io-wq: assign NUMA node locality if appropriate io_uring: fix error path cleanup in io_sqe_files_register() Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly" io_uring: fix REQ_F_COMP_LOCKED by killing it io_uring: dig out COMP_LOCK from deep call chain io_uring: don't put a poll req under spinlock io_uring: don't unnecessarily clear F_LINK_TIMEOUT io_uring: don't set COMP_LOCKED if won't put ...
2020-10-20Merge tag 'for-v5.10' of ↵Linus Torvalds55-1383/+3938
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply and reset updates from Sebastian Reichel: "Power-supply core: - add wireless type - properly document current direction Battery/charger driver changes: - new fuel-gauge/charger driver for RN5T618/RN5T619 - new charger driver for BQ25980, BQ25975 and BQ25960 - bq27xxx-battery: add support for TI bq34z100 - gpio-charger: convert to GPIO descriptors - gpio-charger: add optional support for charge current limiting - max17040: add support for max17041, max17043, max17044 - max17040: add support for max17048, max17049, max17058, max17059 - smb347-charger: add DT support - smb247-charger: add SMB345 and SMB358 support - simple-battery: add temperature properties - lots of minor fixes, cleanups and DT binding YAML conversions Reset drivers: - ocelot: Add support for Sparx5" * tag 'for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (81 commits) power: reset: POWER_RESET_OCELOT_RESET should depend on Ocelot or Sparx5 power: supply: bq25980: Fix uninitialized wd_reg_val and overrun power: supply: ltc2941: Fix ptr to enum cast power: supply: test-power: revise parameter printing to use sprintf power: supply: charger-manager: fix incorrect check on charging_duration_ms power: supply: max17040: Fix ptr to enum cast power: supply: bq25980: Fix uninitialized wd_reg_val power: supply: bq25980: remove redundant zero check on ret power: reset: ocelot: Add support for Sparx5 dt-bindings: reset: ocelot: Add Sparx5 support power: supply: sbs-battery: keep error code when get_property() fails power: supply: bq25980: Add support for the BQ259xx family dt-binding: bq25980: Add the bq25980 flash charger power: supply: fix spelling mistake "unprecise" -> "imprecise" power: supply: test_power: add missing newlines when printing parameters by sysfs power: supply: pm2301: drop duplicated i2c_device_id power: supply: charger-manager: drop unused charger assignment power: supply: rt9455: skip 'struct acpi_device_id' when !CONFIG_ACPI power: supply: goldfish: skip 'struct acpi_device_id' when !CONFIG_ACPI power: supply: bq25890: skip 'struct acpi_device_id' when !CONFIG_ACPI ...
2020-10-20SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()Martijn de Gouw1-10/+17
When the passed token is longer than 4032 bytes, the remaining part of the token must be copied from the rqstp->rq_arg.pages. But the copy must make sure it happens in a consecutive way. With the existing code, the first memcpy copies 'length' bytes from argv->iobase, but since the header is in front, this never fills the whole first page of in_token->pages. The mecpy in the loop copies the following bytes, but starts writing at the next page of in_token->pages. This leaves the last bytes of page 0 unwritten. Symptoms were that users with many groups were not able to access NFS exports, when using Active Directory as the KDC. Signed-off-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com> Fixes: 5866efa8cbfb "SUNRPC: Fix svcauth_gss_proxy_init()" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20sunrpc: raise kernel RPC channel buffer sizeRoberto Bergantinos Corpas1-1/+1
Its possible that using AUTH_SYS and mountd manage-gids option a user may hit the 8k RPC channel buffer limit. This have been observed on field, causing unanswered RPCs on clients after mountd fails to write on channel : rpc.mountd[11231]: auth_unix_gid: error writing reply Userland nfs-utils uses a buffer size of 32k (RPC_CHAN_BUF_SIZE), so lets match those two. Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20Merge tag 'drm-next-2020-10-19' of git://anongit.freedesktop.org/drm/drmLinus Torvalds14-34/+72
Pull drm fixes from Dave Airlie: "Some fixes queued up already for i915 and amdgpu, I've also included the fix for the clang warning you've seen. i915: - set all unused color plane offsets to ~0xfff again (Ville) - fix TGL DKL PHY DP vswing handling (Ville) amdgpu: - DCN clang warning fix - eDP fix - BACO fix - kernel documentation fixes - SMU7 mclk fix - VCN1 hw bug workaround amdkfd: - kvfree vs kfree fix" * tag 'drm-next-2020-10-19' of git://anongit.freedesktop.org/drm/drm: drm/amd/display: Fix incorrect dsc force enable logic drm/amdkfd: Use kvfree in destroy_crat_image drm/amdgpu: vcn and jpeg ring synchronization drm/amd/pm: increase mclk switch threshold to 200 us docs: amdgpu: fix a warning when building the documentation drm/amd/display: kernel-doc: document force_timing_sync drm/amdgpu/swsmu: init the baco mutex in early_init drm/amd/display: Fix module load hangs when connected to an eDP drm/i915: Set all unused color plane offsets to ~0xfff again drm/i915: Fix TGL DKL PHY DP vswing handling
2020-10-20Merge tag 'iommu-fix-v5.10' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fix from Joerg Roedel: "Fix a build regression with !CONFIG_IOMMU_API" * tag 'iommu-fix-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Don't dereference iommu_device if IOMMU_API is not built
2020-10-20Merge tag 'for-linus-5.10b-rc1b-tag' of ↵Linus Torvalds19-165/+707
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - A single patch to fix the Xen security issue XSA-331 (malicious guests can DoS dom0 by triggering NULL-pointer dereferences or access to stale data). - A larger series to fix the Xen security issue XSA-332 (malicious guests can DoS dom0 by sending events at high frequency leading to dom0's vcpus being busy in IRQ handling for elongated times). * tag 'for-linus-5.10b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: block rogue events for some time xen/events: defer eoi in case of excessive number of events xen/events: use a common cpu hotplug hook for event channels xen/events: switch user event channels to lateeoi model xen/pciback: use lateeoi irq binding xen/pvcallsback: use lateeoi irq binding xen/scsiback: use lateeoi irq binding xen/netback: use lateeoi irq binding xen/blkback: use lateeoi irq binding xen/events: add a new "late EOI" evtchn framework xen/events: fix race in evtchn_fifo_unmask() xen/events: add a proper barrier to 2-level uevent unmasking xen/events: avoid removing an event channel while handling it
2020-10-20Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds34-112/+263
Pull ARM updates from Russell King: - handle inexact watchpoint addresses (Douglas Anderson) - decompressor serial debug cleanups (Linus Walleij) - update L2 cache prefetch bits (Guillaume Tucker) - add text offset and malloc size to the decompressor kexec data * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: add malloc size to decompressor kexec size structure ARM: add TEXT_OFFSET to decompressor kexec image structure ARM: 9007/1: l2c: fix prefetch bits init in L2X0_AUX_CTRL using DT values ARM: 9010/1: uncompress: Print the location of appended DTB ARM: 9009/1: uncompress: Enable debug in head.S ARM: 9008/1: uncompress: Drop excess whitespace print ARM: 9006/1: uncompress: Wait for ready and busy in debug prints ARM: 9005/1: debug: Select flow control for all debug UARTs ARM: 9004/1: debug: Split waituart to CTS and TXRDY ARM: 9003/1: uncompress: Delete unused debug macros ARM: 8997/2: hw_breakpoint: Handle inexact watchpoint addresses
2020-10-20Merge tag 'arc-5.10-rc1' of ↵Linus Torvalds36-1362/+15
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC updates from Vineet Gupta: "The bulk of ARC pull request is removal of EZChip NPS platform which was suffering from constant bitrot. In recent years EZChip has gone though multiple successive acquisitions and I guess things and people move on. I would like to take this opportunity to recognize and thank all those good folks (Gilad, Noam, Ofer...) for contributing major bits to ARC port (SMP, Big Endian). Summary: - drop support for EZChip NPS platform - misc other fixes" * tag 'arc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: arc: include/asm: fix typos of "themselves" ARC: SMP: fix typo and use "come up" instead of "comeup" ARC: [dts] fix the errors detected by dtbs_check arc: plat-hsdk: fix kconfig dependency warning when !RESET_CONTROLLER ARC: [plat-eznps]: Drop support for EZChip NPS platform
2020-10-20xen/events: block rogue events for some timeJuergen Gross2-6/+24
In order to avoid high dom0 load due to rogue guests sending events at high frequency, block those events in case there was no action needed in dom0 to handle the events. This is done by adding a per-event counter, which set to zero in case an EOI without the XEN_EOI_FLAG_SPURIOUS is received from a backend driver, and incremented when this flag has been set. In case the counter is 2 or higher delay the EOI by 1 << (cnt - 2) jiffies, but not more than 1 second. In order not to waste memory shorten the per-event refcnt to two bytes (it should normally never exceed a value of 2). Add an overflow check to evtchn_get() to make sure the 2 bytes really won't overflow. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: defer eoi in case of excessive number of eventsJuergen Gross5-32/+216
In case rogue guests are sending events at high frequency it might happen that xen_evtchn_do_upcall() won't stop processing events in dom0. As this is done in irq handling a crash might be the result. In order to avoid that, delay further inter-domain events after some time in xen_evtchn_do_upcall() by forcing eoi processing into a worker on the same cpu, thus inhibiting new events coming in. The time after which eoi processing is to be delayed is configurable via a new module parameter "event_loop_timeout" which specifies the maximum event loop time in jiffies (default: 2, the value was chosen after some tests showing that a value of 2 was the lowest with an only slight drop of dom0 network throughput while multiple guests performed an event storm). How long eoi processing will be delayed can be specified via another parameter "event_eoi_delay" (again in jiffies, default 10, again the value was chosen after testing with different delay values). This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: use a common cpu hotplug hook for event channelsJuergen Gross3-21/+47
Today only fifo event channels have a cpu hotplug callback. In order to prepare for more percpu (de)init work move that callback into events_base.c and add percpu_init() and percpu_deinit() hooks to struct evtchn_ops. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: switch user event channels to lateeoi modelJuergen Gross1-4/+3
Instead of disabling the irq when an event is received and enabling it again when handled by the user process use the lateeoi model. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/pciback: use lateeoi irq bindingJuergen Gross4-19/+56
In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving pcifront use the lateeoi irq binding for pciback and unmask the event channel only just before leaving the event handling function. Restructure the handling to support that scheme. Basically an event can come in for two reasons: either a normal request for a pciback action, which is handled in a worker, or in case the guest has finished an AER request which was requested by pciback. When an AER request is issued to the guest and a normal pciback action is currently active issue an EOI early in order to be able to receive another event when the AER request has been finished by the guest. Let the worker processing the normal requests run until no further request is pending, instead of starting a new worker ion that case. Issue the EOI only just before leaving the worker. This scheme allows to drop calling the generic function xen_pcibk_test_and_schedule_op() after processing of any request as the handling of both request types is now separated more cleanly. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/pvcallsback: use lateeoi irq bindingJuergen Gross1-30/+46
In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving pvcallsfront use the lateeoi irq binding for pvcallsback and unmask the event channel only after handling all write requests, which are the ones coming in via an irq. This requires modifying the logic a little bit to not require an event for each write request, but to keep the ioworker running until no further data is found on the ring page to be processed. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/scsiback: use lateeoi irq bindingJuergen Gross1-10/+13
In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving scsifront use the lateeoi irq binding for scsiback and unmask the event channel only just before leaving the event handling function. In case of a ring protocol error don't issue an EOI in order to avoid the possibility to use that for producing an event storm. This at once will result in no further call of scsiback_irq_fn(), so the ring_error struct member can be dropped and scsiback_do_cmd_fn() can signal the protocol error via a negative return value. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/netback: use lateeoi irq bindingJuergen Gross4-14/+86
In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving netfront use the lateeoi irq binding for netback and unmask the event channel only just before going to sleep waiting for new events. Make sure not to issue an EOI when none is pending by introducing an eoi_pending element to struct xenvif_queue. When no request has been consumed set the spurious flag when sending the EOI for an interrupt. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/blkback: use lateeoi irq bindingJuergen Gross2-8/+19
In order to reduce the chance for the system becoming unresponsive due to event storms triggered by a misbehaving blkfront use the lateeoi irq binding for blkback and unmask the event channel only after processing all pending requests. As the thread processing requests is used to do purging work in regular intervals an EOI may be sent only after having received an event. If there was no pending I/O request flag the EOI as spurious. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: add a new "late EOI" evtchn frameworkJuergen Gross2-17/+155
In order to avoid tight event channel related IRQ loops add a new framework of "late EOI" handling: the IRQ the event channel is bound to will be masked until the event has been handled and the related driver is capable to handle another event. The driver is responsible for unmasking the event channel via the new function xen_irq_lateeoi(). This is similar to binding an event channel to a threaded IRQ, but without having to structure the driver accordingly. In order to support a future special handling in case a rogue guest is sending lots of unsolicited events, add a flag to xen_irq_lateeoi() which can be set by the caller to indicate the event was a spurious one. This is part of XSA-332. Cc: stable@vger.kernel.org Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: fix race in evtchn_fifo_unmask()Juergen Gross1-4/+9
Unmasking a fifo event channel can result in unmasking it twice, once directly in the kernel and once via a hypercall in case the event was pending. Fix that by doing the local unmask only if the event is not pending. This is part of XSA-332. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
2020-10-20xen/events: add a proper barrier to 2-level uevent unmaskingJuergen Gross1-0/+2
A follow-up patch will require certain write to happen before an event channel is unmasked. While the memory barrier is not strictly necessary for all the callers, the main one will need it. In order to avoid an extra memory barrier when using fifo event channels, mandate evtchn_unmask() to provide write ordering. The 2-level event handling unmask operation is missing an appropriate barrier, so add it. Fifo event channels are fine in this regard due to using sync_cmpxchg(). This is part of XSA-332. Cc: stable@vger.kernel.org Suggested-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-20xen/events: avoid removing an event channel while handling itJuergen Gross1-5/+36
Today it can happen that an event channel is being removed from the system while the event handling loop is active. This can lead to a race resulting in crashes or WARN() splats when trying to access the irq_info structure related to the event channel. Fix this problem by using a rwlock taken as reader in the event handling loop and as writer when deallocating the irq_info structure. As the observed problem was a NULL dereference in evtchn_from_irq() make this function more robust against races by testing the irq_info pointer to be not NULL before dereferencing it. And finally make all accesses to evtchn_to_irq[row][col] atomic ones in order to avoid seeing partial updates of an array element in irq handling. Note that irq handling can be entered only for event channels which have been valid before, so any not populated row isn't a problem in this regard, as rows are only ever added and never removed. This is XSA-331. Cc: stable@vger.kernel.org Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reported-by: Jinoh Kang <luke1337@theori.io> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Wei Liu <wl@xen.org>
2020-10-19Merge tag 'riscv-for-linus-5.10-mw0' of ↵Linus Torvalds31-241/+1212
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: "A handful of cleanups and new features: - A handful of cleanups for our page fault handling - Improvements to how we fill out cacheinfo - Support for EFI-based systems" * tag 'riscv-for-linus-5.10-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (22 commits) RISC-V: Add page table dump support for uefi RISC-V: Add EFI runtime services RISC-V: Add EFI stub support. RISC-V: Add PE/COFF header for EFI stub RISC-V: Implement late mapping page table allocation functions RISC-V: Add early ioremap support RISC-V: Move DT mapping outof fixmap RISC-V: Fix duplicate included thread_info.h riscv/mm/fault: Set FAULT_FLAG_INSTRUCTION flag in do_page_fault() riscv/mm/fault: Fix inline placement in vmalloc_fault() declaration riscv: Add cache information in AUX vector riscv: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO riscv: Set more data to cacheinfo riscv/mm/fault: Move access error check to function riscv/mm/fault: Move FAULT_FLAG_WRITE handling in do_page_fault() riscv/mm/fault: Simplify mm_fault_error() riscv/mm/fault: Move fault error handling to mm_fault_error() riscv/mm/fault: Simplify fault error handling riscv/mm/fault: Move vmalloc fault handling to vmalloc_fault() riscv/mm/fault: Move bad area handling to bad_area() ...
2020-10-19Merge tag 'm68knommu-for-v5.10' of ↵Linus Torvalds6-559/+402
git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68knommu updates from Greg Ungerer: "A collection of fixes for 5.10: - switch to using asm-generic uaccess code - fix sparse warnings in signal code - fix compilation of ColdFire MMC support - support sysrq in ColdFire serial driver" * tag 'm68knommu-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: serial: mcf: add sysrq capability m68knommu: include SDHC support only when hardware has it m68knommu: fix sparse warnings in signal code m68knommu: switch to using asm-generic/uaccess.h
2020-10-19Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds37-509/+1023
Pull more xfs updates from Darrick Wong: "The second large pile of new stuff for 5.10, with changes even more monumental than last week! We are formally announcing the deprecation of the V4 filesystem format in 2030. All users must upgrade to the V5 format, which contains design improvements that greatly strengthen metadata validation, supports reflink and online fsck, and is the intended vehicle for handling timestamps past 2038. We're also deprecating the old Irix behavioral tweaks in September 2025. Coming along for the ride are two design changes to the deferred metadata ops subsystem. One of the improvements is to retain correct logical ordering of tasks and subtasks, which is a more logical design for upper layers of XFS and will become necessary when we add atomic file range swaps and commits. The second improvement to deferred ops improves the scalability of the log by helping the log tail to move forward during long-running operations. This reduces log contention when there are a large number of threads trying to run transactions. In addition to that, this fixes numerous small bugs in log recovery; refactors logical intent log item recovery to remove the last remaining place in XFS where we could have nested transactions; fixes a couple of ways that intent log item recovery could fail in ways that wouldn't have happened in the regular commit paths; fixes a deadlock vector in the GETFSMAP implementation (which improves its performance by 20%); and fixes serious bugs in the realtime growfs, fallocate, and bitmap handling code. Summary: - Deprecate the V4 filesystem format, some disused mount options, and some legacy sysctl knobs now that we can support dates into the 25th century. Note that removal of V4 support will not happen until the early 2030s. - Fix some probles with inode realtime flag propagation. - Fix some buffer handling issues when growing a rt filesystem. - Fix a problem where a BMAP_REMAP unmap call would free rt extents even though the purpose of BMAP_REMAP is to avoid freeing the blocks. - Strengthen the dabtree online scrubber to check hash values on child dabtree blocks. - Actually log new intent items created as part of recovering log intent items. - Fix a bug where quotas weren't attached to an inode undergoing bmap intent item recovery. - Fix a buffer overrun problem with specially crafted log buffer headers. - Various cleanups to type usage and slightly inaccurate comments. - More cleanups to the xattr, log, and quota code. - Don't run the (slower) shared-rmap operations on attr fork mappings. - Fix a bug where we failed to check the LSN of finobt blocks during replay and could therefore overwrite newer data with older data. - Clean up the ugly nested transaction mess that log recovery uses to stage intent item recovery in the correct order by creating a proper data structure to capture recovered chains. - Use the capture structure to resume intent item chains with the same log space and block reservations as when they were captured. - Fix a UAF bug in bmap intent item recovery where we failed to maintain our reference to the incore inode if the bmap operation needed to relog itself to continue. - Rearrange the defer ops mechanism to finish newly created subtasks of a parent task before moving on to the next parent task. - Automatically relog intent items in deferred ops chains if doing so would help us avoid pinning the log tail. This will help fix some log scaling problems now and will facilitate atomic file updates later. - Fix a deadlock in the GETFSMAP implementation by using an internal memory buffer to reduce indirect calls and copies to userspace, thereby improving its performance by ~20%. - Fix various problems when calling growfs on a realtime volume would not fully update the filesystem metadata. - Fix broken Kconfig asking about deprecated XFS when XFS is disabled" * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits) xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n xfs: fix high key handling in the rt allocator's query_range function xfs: annotate grabbing the realtime bitmap/summary locks in growfs xfs: make xfs_growfs_rt update secondary superblocks xfs: fix realtime bitmap/summary file truncation when growing rt volume xfs: fix the indent in xfs_trans_mod_dquot xfs: do the ASSERT for the arguments O_{u,g,p}dqpp xfs: fix deadlock and streamline xfs_getfsmap performance xfs: limit entries returned when counting fsmap records xfs: only relog deferred intent items if free space in the log gets low xfs: expose the log push threshold xfs: periodically relog deferred intent items xfs: change the order in which child and parent defer ops are finished xfs: fix an incore inode UAF in xfs_bui_recover xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering xfs: clean up bmap intent item recovery checking xfs: xfs_defer_capture should absorb remaining transaction reservation xfs: xfs_defer_capture should absorb remaining block reservations xfs: proper replay of deferred ops queued during log recovery xfs: remove XFS_LI_RECOVERED ...
2020-10-19Merge tag 'fuse-update-5.10' of ↵Linus Torvalds20-496/+2689
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-10-19Merge tag 'zonefs-5.10-rc1' of ↵Linus Torvalds3-13/+233
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs updates from Damien Le Moal: "Add an 'explicit-open' mount option to automatically issue a REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file is open for writing for the first time. This avoids 'insufficient zone resources' errors for write operations on some drives with limited zone resources or on ZNS drives with a limited number of active zones. From Johannes" * tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: document the explicit-open mount option zonefs: open/close zone on file open/close zonefs: provide no-lock zonefs_io_error variant zonefs: introduce helper for zone management
2020-10-19rtc: r9701: set rangeAlexandre Belloni1-5/+3
Set range and remove the set_time check. This is a classic BCD RTC. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-6-alexandre.belloni@bootlin.com
2020-10-19rtc: r9701: convert to devm_rtc_allocate_deviceAlexandre Belloni1-3/+3
This allows further improvement of the driver. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-5-alexandre.belloni@bootlin.com
2020-10-19rtc: r9701: stop setting RWKCNTAlexandre Belloni1-1/+0
tm_wday is never checked for validity and it is not read back in r9701_get_datetime. Avoid setting it to stop tripping static checkers: drivers/rtc/rtc-r9701.c:109 r9701_set_datetime() error: undefined (user controlled) shift '1 << dt->tm_wday' Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-4-alexandre.belloni@bootlin.com
2020-10-19rtc: r9701: remove useless memsetAlexandre Belloni1-2/+0
The RTC core already sets to zero the struct rtc_tie it passes to the driver, avoid doing it a second time. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-3-alexandre.belloni@bootlin.com
2020-10-19rtc: r9701: stop setting a default timeAlexandre Belloni1-22/+0
It doesn't make sense to set the RTC to a default value at probe time. Let the core handle invalid date and time. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-2-alexandre.belloni@bootlin.com
2020-10-19rtc: r9701: remove leftover commentAlexandre Belloni1-4/+0
Commit 22652ba72453 ("rtc: stop validating rtc_time in .read_time") removed the code but not the associated comment. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201015191135.471249-1-alexandre.belloni@bootlin.com
2020-10-19rtc: rv3032: Add a driver for Microcrystal RV-3032Alexandre Belloni3-0/+936
New driver for the Microcrystal RV-3032, including support for: - Date/time - Alarms - Low voltage detection - Trickle charge - Trimming - Clkout - RAM - EEPROM - Temperature sensor Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201013144110.1942218-3-alexandre.belloni@bootlin.com
2020-10-19dt-bindings: rtc: rv3032: add RV-3032 bindingsAlexandre Belloni1-0/+64
Document the Microcrystal RV-3032 device tree bindings Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20201013144110.1942218-2-alexandre.belloni@bootlin.com
2020-10-19dt-bindings: rtc: add trickle-voltage-millivoltAlexandre Belloni1-0/+6
Some RTCs have a trickle charge that is able to output different voltages depending on the type of the connected auxiliary power (battery, supercap, ...). Add a property allowing to specify the necessary voltage. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20201013144110.1942218-1-alexandre.belloni@bootlin.com
2020-10-19io_uring: use blk_queue_nowait() to check if NOWAIT supportedJeffle Xu1-1/+1
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") adds a new helper function blk_queue_nowait() to check if the bdev supports handling of REQ_NOWAIT or not. Since then bio-based dm device can also support REQ_NOWAIT, and currently only dm-linear supports that since commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target"). Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-19iommu/vt-d: Don't dereference iommu_device if IOMMU_API is not builtBartosz Golaszewski1-1/+1
Since commit c40aaaac1018 ("iommu/vt-d: Gracefully handle DMAR units with no supported address widths") dmar.c needs struct iommu_device to be selected. We can drop this dependency by not dereferencing struct iommu_device if IOMMU_API is not selected and by reusing the information stored in iommu->drhd->ignored instead. This fixes the following build error when IOMMU_API is not selected: drivers/iommu/intel/dmar.c: In function ‘free_iommu’: drivers/iommu/intel/dmar.c:1139:41: error: ‘struct iommu_device’ has no member named ‘ops’ 1139 | if (intel_iommu_enabled && iommu->iommu.ops) { ^ Fixes: c40aaaac1018 ("iommu/vt-d: Gracefully handle DMAR units with no supported address widths") Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Acked-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20201013073055.11262-1-brgl@bgdev.pl Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-10-19Merge tag 'drm-intel-next-fixes-2020-10-15' of ↵Dave Airlie2-13/+6
git://anongit.freedesktop.org/drm/drm-intel into drm-next - Set all unused color plane offsets to ~0xfff again (Ville) - Fix TGL DKL PHY DP vswing handling (Ville) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201015181453.GA2905280@intel.com
2020-10-19drm/amd/display: Fix incorrect dsc force enable logicEryk Brol1-1/+1
[Why] Missed removing a '!' which results in incorrect behavior [How] Remove the offending '!' Signed-off-by: Eryk Brol <eryk.brol@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201015194053.355335-1-eryk.brol@amd.com
2020-10-19Merge tag 'amd-drm-fixes-5.10-2020-10-14' of ↵Dave Airlie11-20/+65
git://people.freedesktop.org/~agd5f/linux into drm-next amd-drm-fixes-5.10-2020-10-14: amdgpu: - eDP fix - BACO fix - Kernel documentation fixes - SMU7 mclk fix - VCN1 hw bug workaround amdkfd: - kvfree vs kfree fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20201014195403.4558-1-alexander.deucher@amd.com
2020-10-18Merge tag 'linux-kselftest-kunit-5.10-rc1' of ↵Linus Torvalds16-110/+444
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull more Kunit updates from Shuah Khan: - add Kunit to kernel_init() and remove KUnit from init calls entirely. This addresses the concern that Kunit would not work correctly during late init phase. - add a linker section where KUnit can put references to its test suites. This is the first step in transitioning to dispatching all KUnit tests from a centralized executor rather than having each as its own separate late_initcall. - add a centralized executor to dispatch tests rather than relying on late_initcall to schedule each test suite separately. Centralized execution is for built-in tests only; modules will execute tests when loaded. - convert bitfield test to use KUnit framework - Documentation updates for naming guidelines and how kunit_test_suite() works. - add test plan to KUnit TAP format * tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE lib: kunit: add bitfield test conversion to KUnit Documentation: kunit: add a brief blurb about kunit_test_suite kunit: test: add test plan to KUnit TAP format init: main: add KUnit to kernel init kunit: test: create a single centralized executor for all tests vmlinux.lds.h: add linker section for KUnit test suites Documentation: kunit: Add naming guidelines
2020-10-18Merge tag 'core-rcu-2020-10-12' of ↵Linus Torvalds57-421/+1582
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: - Debugging for smp_call_function() - RT raw/non-raw lock ordering fixes - Strict grace periods for KASAN - New smp_call_function() torture test - Torture-test updates - Documentation updates - Miscellaneous fixes [ This doesn't actually pull the tag - I've dropped the last merge from the RCU branch due to questions about the series. - Linus ] * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits) smp: Make symbol 'csd_bug_count' static kernel/smp: Provide CSD lock timeout diagnostics smp: Add source and destination CPUs to __call_single_data rcu: Shrink each possible cpu krcp rcu/segcblist: Prevent useless GP start if no CBs to accelerate torture: Add gdb support rcutorture: Allow pointer leaks to test diagnostic code rcutorture: Hoist OOM registry up one level refperf: Avoid null pointer dereference when buf fails to allocate rcutorture: Properly synchronize with OOM notifier rcutorture: Properly set rcu_fwds for OOM handling torture: Add kvm.sh --help and update help message rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05 torture: Update initrd documentation rcutorture: Replace HTTP links with HTTPS ones locktorture: Make function torture_percpu_rwsem_init() static torture: document --allcpus argument added to the kvm.sh script rcutorture: Output number of elapsed grace periods rcutorture: Remove KCSAN stubs rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp() ...
2020-10-18Merge tag 'mailbox-v5.10' of ↵Linus Torvalds8-57/+506
git://git.linaro.org/landing-teams/working/fujitsu/integration Pull mailbox updates from Jassi Brar: - arm: implementation of mhu as a doorbell driver and conversion of dt-bindings to json-schema - mediatek: fix platform_get_irq error handling - bcm: convert tasklets to use new tasklet_setup api - core: fix race cause by hrtimer starting inappropriately * tag 'mailbox-v5.10' of git://git.linaro.org/landing-teams/working/fujitsu/integration: mailbox: avoid timer start from callback maiblox: mediatek: Fix handling of platform_get_irq() error mailbox: arm_mhu: Add ARM MHU doorbell driver mailbox: arm_mhu: Match only if compatible is "arm,mhu" dt-bindings: mailbox: add doorbell support to ARM MHU dt-bindings: mailbox : arm,mhu: Convert to Json-schema mailbox: bcm: convert tasklets to use new tasklet_setup() API
2020-10-18Merge branch 'for-5.10' of ↵Linus Torvalds11-26/+1118
git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux Pull coccinelle updates from Julia Lawall. * 'for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux: coccinelle: api: add kfree_mismatch script coccinelle: iterators: Add for_each_child.cocci script scripts: coccicheck: Change default condition for parallelism scripts: coccicheck: Add quotes to improve portability coccinelle: api: kfree_sensitive: print memset position coccinelle: misc: add flexible_array.cocci script coccinelle: api: add kvmalloc script scripts: coccicheck: Change default value for parallelism coccinelle: misc: add excluded_middle.cocci script scripts: coccicheck: Improve error feedback when coccicheck fails coccinelle: api: update kzfree script to kfree_sensitive coccinelle: misc: add uninitialized_var.cocci script coccinelle: ifnullfree: add vfree(), kvfree*() functions coccinelle: api: add kobj_to_dev.cocci script coccinelle: add patch rule for dma_alloc_coherent scripts: coccicheck: Add chain mode to list of modes
2020-10-18Merge branch 'akpm' (patches from Andrew)Linus Torvalds55-439/+592
Merge yet more updates from Andrew Morton: "Subsystems affected by this patch series: mm (memcg, migration, pagemap, gup, madvise, vmalloc), ia64, and misc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits) mm: remove duplicate include statement in mmu.c mm: remove the filename in the top of file comment in vmalloc.c mm: cleanup the gfp_mask handling in __vmalloc_area_node mm: remove alloc_vm_area x86/xen: open code alloc_vm_area in arch_gnttab_valloc xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv drm/i915: use vmap in i915_gem_object_map drm/i915: stop using kmap in i915_gem_object_map drm/i915: use vmap in shmem_pin_map zsmalloc: switch from alloc_vm_area to get_vm_area mm: allow a NULL fn callback in apply_to_page_range mm: add a vmap_pfn function mm: add a VM_MAP_PUT_PAGES flag for vmap mm: update the documentation for vfree mm/madvise: introduce process_madvise() syscall: an external memory hinting API pid: move pidfd_get_pid() to pid.c mm/madvise: pass mm to do_madvise selftests/vm: 10x speedup for hmm-tests binfmt_elf: take the mmap lock around find_extend_vma() mm/gup_benchmark: take the mmap lock around GUP ...
2020-10-18Merge tag 'for-linus-5.10-rc1' of ↵Linus Torvalds14-61/+91
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Improve support for non-glibc systems - Vector: Add support for scripting and dynamic tap devices - Various fixes for the vector networking driver - Various fixes for time travel mode * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: vector: Add dynamic tap interfaces and scripting um: Clean up stacktrace dump um: Fix incorrect assumptions about max pid length um: Remove dead usage of TIF_IA32 um: Remove redundant NULL check um: change sigio_spinlock to a mutex um: time-travel: Return the sequence number in ACK messages um: time-travel: Fix IRQ handling in time_travel_handle_message() um: Allow static linking for non-glibc implementations um: Some fixes to build UML with musl um: vector: Use GFP_ATOMIC under spin lock um: Fix null pointer dereference in vector_user_bpf
2020-10-18Merge tag 'for-linus-5.10-rc1-part2' of ↵Linus Torvalds7-4/+25
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull more ubi and ubifs updates from Richard Weinberger: "UBI: - Correctly use kthread_should_stop in ubi worker UBIFS: - Fixes for memory leaks while iterating directory entries - Fix for a user triggerable error message - Fix for a space accounting bug in authenticated mode" * tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: journal: Make sure to not dirty twice for auth nodes ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode ubi: check kthread_should_stop() after the setting of task state ubifs: dent: Fix some potential memory leaks while iterating entries ubifs: xattr: Fix some potential memory leaks while iterating entries
2020-10-18Merge tag 'for-linus-5.10-rc1' of ↵Linus Torvalds5-21/+34
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull ubifs updates from Richard Weinberger: - Kernel-doc fixes - Fixes for memory leaks in authentication option parsing * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: mount_ubifs: Release authentication resource in error handling path ubifs: Don't parse authentication mount options in remount process ubifs: Fix a memleak after dumping authentication mount options ubifs: Fix some kernel-doc warnings in tnc.c ubifs: Fix some kernel-doc warnings in replay.c ubifs: Fix some kernel-doc warnings in gc.c ubifs: Fix 'hash' kernel-doc warning in auth.c
2020-10-18mm: remove duplicate include statement in mmu.cTian Tao1-1/+0
asm/sections.h is included more than once, Remove the one that isn't necessary. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lkml.kernel.org/r/1600088607-17327-1-git-send-email-tiantao6@hisilicon.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: remove the filename in the top of file comment in vmalloc.cChristoph Hellwig1-2/+0
No point in having the filename inside the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: cleanup the gfp_mask handling in __vmalloc_area_nodeChristoph Hellwig1-12/+10
Patch series "two small vmalloc cleanups". This patch (of 2): __vmalloc_area_node currently has four different gfp_t variables to just express this simple logic: - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if suitable) for the underlying page allocation - use just the reclaim flags from the passed in mask plus __GFP_ZERO for allocating the page array Simplify this down to just use the pre-existing nested_gfp as-is for the page array allocation, and just the passed in gfp_mask for the page allocation, after conditionally ORing __GFP_HIGHMEM into it. This also makes the allocation warning a little more correct. Also initialize two variables at the time of declaration while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: remove alloc_vm_areaChristoph Hellwig3-59/+1
All users are gone now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18x86/xen: open code alloc_vm_area in arch_gnttab_vallocChristoph Hellwig1-7/+20
Replace the last call to alloc_vm_area with an open coded version using an iterator in struct gnttab_vm_area instead of the triple indirection magic in alloc_vm_area. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-11-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pvChristoph Hellwig1-14/+16
Replacing alloc_vm_area with get_vm_area_caller + apply_page_range allows to fill put the phys_addr values directly instead of doing another loop over all addresses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-10-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18drm/i915: use vmap in i915_gem_object_mapChristoph Hellwig2-68/+60
i915_gem_object_map implements fairly low-level vmap functionality in a driver. Split it into two helpers, one for remapping kernel memory which can use vmap, and one for I/O memory that uses vmap_pfn. The only practical difference is that alloc_vm_area prefeaults the vmalloc area PTEs, which doesn't seem to be required here for the kernel memory case (and could be added to vmap using a flag if actually required). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-9-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18drm/i915: stop using kmap in i915_gem_object_mapChristoph Hellwig1-5/+2
kmap for !PageHighmem is just a convoluted way to say page_address, and kunmap is a no-op in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-8-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18drm/i915: use vmap in shmem_pin_mapChristoph Hellwig1-58/+18
shmem_pin_map somewhat awkwardly reimplements vmap using alloc_vm_area and manual pte setup. The only practical difference is that alloc_vm_area prefeaults the vmalloc area PTEs, which doesn't seem to be required here (and could be added to vmap using a flag if actually required). Switch to use vmap, and use vfree to free both the vmalloc mapping and the page array, as well as dropping the references to each page. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18zsmalloc: switch from alloc_vm_area to get_vm_areaChristoph Hellwig1-2/+8
Just manually pre-fault the PTEs using apply_to_page_range. Co-developed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: allow a NULL fn callback in apply_to_page_rangeChristoph Hellwig1-7/+9
Besides calling the callback on each page, apply_to_page_range also has the effect of pre-faulting all PTEs for the range. To support callers that only need the pre-faulting, make the callback optional. Based on a patch from Minchan Kim <minchan@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: add a vmap_pfn functionChristoph Hellwig3-0/+49
Add a proper helper to remap PFNs into kernel virtual space so that drivers don't have to abuse alloc_vm_area and open coded PTE manipulation for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: add a VM_MAP_PUT_PAGES flag for vmapChristoph Hellwig2-2/+8
Add a flag so that vmap takes ownership of the passed in page array. When vfree is called on such an allocation it will put one reference on each page, and free the page array itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: update the documentation for vfreeMatthew Wilcox (Oracle)1-10/+11
Patch series "remove alloc_vm_area", v4. This series removes alloc_vm_area, which was left over from the big vmalloc interface rework. It is a rather arkane interface, basicaly the equivalent of get_vm_area + actually faulting in all PTEs in the allocated area. It was originally addeds for Xen (which isn't modular to start with), and then grew users in zsmalloc and i915 which seems to mostly qualify as abuses of the interface, especially for i915 as a random driver should not set up PTE bits directly. This patch (of 11): * Document that you can call vfree() on an address returned from vmap() * Remove the note about the minimum size -- the minimum size of a vmalloc allocation is one page * Add a Context: section * Fix capitalisation * Reword the prohibition on calling from NMI context to avoid a double negative Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim22-3/+117
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18pid: move pidfd_get_pid() to pid.cMinchan Kim3-19/+20
process_madvise syscall needs pidfd_get_pid function to translate pidfd to pid so this patch move the function to kernel/pid.c. Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/madvise: pass mm to do_madviseMinchan Kim3-16/+20
Patch series "introduce memory hinting API for external process", v9. Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that, application could give hints to kernel what memory range are preferred to be reclaimed. However, in some platform(e.g., Android), the information required to make the hinting decision is not known to the app. Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the concern, this patch introduces new syscall - process_madvise(2). Bascially, it's same with madvise(2) syscall but it has some differences. 1. It needs pidfd of target process to provide the hint 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment. Other hints in madvise will be opened when there are explicit requests from community to prevent unexpected bugs we couldn't support. 3. Only privileged processes can do something for other process's address space. For more detail of the new API, please see "mm: introduce external memory hinting API" description in this patchset. This patch (of 3): In upcoming patches, do_madvise will be called from external process context so we shouldn't asssume "current" is always hinted process's task_struct. Furthermore, we must not access mm_struct via task->mm, but obtain it via access_mm() once (in the following patch) and only use that pointer [1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers are safe, so we can use them further down the call stack. And let's pass current->mm as arguments of do_madvise so it shouldn't change existing behavior but prepare next patch to make review easy. [vbabka@suse.cz: changelog tweak] [minchan@kernel.org: use current->mm for io_uring] Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org [akpm@linux-foundation.org: fix it for upstream changes] [akpm@linux-foundation.org: whoops] [rdunlap@infradead.org: add missing includes] Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Dias <joaodias@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18selftests/vm: 10x speedup for hmm-testsJohn Hubbard1-1/+1
This patch reduces the running time for hmm-tests from about 10+ seconds, to just under 1.0 second, for an approximately 10x speedup. That brings it in line with most of the other tests in selftests/vm, which mostly run in < 1 sec. This is done with a one-line change that simply reduces the number of iterations of several tests, from 256, to 10. Thanks to Ralph Campbell for suggesting changing NTIMES as a way to get the speedup. Suggested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Link: https://lkml.kernel.org/r/20201003011721.44238-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18binfmt_elf: take the mmap lock around find_extend_vma()Jann Horn1-0/+3
create_elf_tables() runs after setup_new_exec(), so other tasks can already access our new mm and do things like process_madvise() on it. (At the time I'm writing this commit, process_madvise() is not in mainline yet, but has been in akpm's tree for some time.) While I believe that there are currently no APIs that would actually allow another process to mess up our VMA tree (process_madvise() is limited to MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm under which no syscalls have been executed yet), this seems like an accident waiting to happen. Let's make sure that we always take the mmap lock around GUP paths as long as another process might be able to see the mm. (Yes, this diff looks suspicious because we drop the lock before doing anything with `vma`, but that's because we actually don't do anything with it apart from the NULL check.) Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michel Lespinasse <walken@google.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/gup_benchmark: take the mmap lock around GUPJann Horn1-3/+12
To be safe against concurrent changes to the VMA tree, we must take the mmap lock around GUP operations (excluding the GUP-fast family of operations, which will take the mmap lock by themselves if necessary). This code is only for testing, and it's only reachable by root through debugfs, so this doesn't really have any impact; however, if we want to add lockdep asserts into the GUP path, we need to have clean locking here. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/mmap: add inline munmap_vma_range() for code readabilityLiam R. Howlett1-15/+33
There are two locations that have a block of code for munmapping a vma range. Change those two locations to use a function and add meaningful comments about what happens to the arguments, which was unclear in the previous code. Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/mmap: add inline vma_next() for readability of mmap codeLiam R. Howlett1-6/+20
There are three places that the next vma is required which uses the same block of code. Replace the block with a function and add comments on what happens in the case where NULL is encountered. Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/migrate: avoid possible unnecessary process right check in ↵Miaohe Lin1-28/+43
kernel_move_pages() There is no need to check if this process has the right to modify the specified process when they are same. And we could also skip the security hook call if a process is modifying its own pages. Add helper function to handle these. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christopher Lameter <cl@linux.com> Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/memory_hotplug: remove a wrapper for alloc_migration_target()Joonsoo Kim1-24/+22
To calculate the correct node to migrate the page for hotplug, we need to check node id of the page. Wrapper for alloc_migration_target() exists for this purpose. However, Vlastimil informs that all migration source pages come from a single node. In this case, we don't need to check the node id for each page and we don't need to re-set the target nodemask for each page by using the wrapper. Set up the migration_target_control once and use it for all pages. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/memory-failure: remove a wrapper for alloc_migration_target()Joonsoo Kim1-12/+6
There is a well-defined standard migration target callback. Use it directly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: enable kernel memcg accounting from interrupt contextsRoman Gushchin2-12/+13
If a memcg to charge can be determined (using remote charging API), there are no reasons to exclude allocations made from an interrupt context from the accounting. Such allocations will pass even if the resulting memcg size will exceed the hard limit, but it will affect the application of the memory pressure and an inability to put the workload under the limit will eventually trigger the OOM. To use active_memcg() helper, memcg_kmem_bypass() is moved back to memcontrol.c. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: prepare remote memcg charging infra for interrupt contextsRoman Gushchin2-16/+45
Remote memcg charging API uses current->active_memcg to store the currently active memory cgroup, which overwrites the memory cgroup of the current process. It works well for normal contexts, but doesn't work for interrupt contexts: indeed, if an interrupt occurs during the execution of a section with an active memcg set, all allocations inside the interrupt will be charged to the active memcg set (given that we'll enable accounting for allocations from an interrupt context). But because the interrupt might have no relation to the active memcg set outside, it's obviously wrong from the accounting prospective. To resolve this problem, let's add a global percpu int_active_memcg variable, which will be used to store an active memory cgroup which will be used from interrupt contexts. set_active_memcg() will transparently use current->active_memcg or int_active_memcg depending on the context. To make the read part simple and transparent for the caller, let's introduce two new functions: - struct mem_cgroup *active_memcg(void), - struct mem_cgroup *get_active_memcg(void). They are returning the active memcg if it's set, hiding all implementation details: where to get it depending on the current context. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: remove redundant checks from get_obj_cgroup_from_current()Roman Gushchin1-3/+0
There are checks for current->mm and current->active_memcg in get_obj_cgroup_from_current(), but these checks are redundant: memcg_kmem_bypass() called just above performs same checks. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()Roman Gushchin3-10/+9
Patch series "mm: kmem: kernel memory accounting in an interrupt context". This patchset implements memcg-based memory accounting of allocations made from an interrupt context. Historically, such allocations were passed unaccounted mostly because charging the memory cgroup of the current process wasn't an option. Also performance reasons were likely a reason too. The remote charging API allows to temporarily overwrite the currently active memory cgroup, so that all memory allocations are accounted towards some specified memory cgroup instead of the memory cgroup of the current process. This patchset extends the remote charging API so that it can be used from an interrupt context. Then it removes the fence that prevented the accounting of allocations made from an interrupt context. It also contains a couple of optimizations/code refactorings. This patchset doesn't directly enable accounting for any specific allocations, but prepares the code base for it. The bpf memory accounting will likely be the first user of it: a typical example is a bpf program parsing an incoming network packet, which allocates an entry in hashmap map to store some information. This patch (of 4): Currently memcg_kmem_bypass() is called before obtaining the current memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the number of call sites and allows further code simplifications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm, memcg: rework remote charging API to support nestingRoman Gushchin5-30/+22
Currently the remote memcg charging API consists of two functions: memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the memcg value, which overwrites the memcg of the current task. memalloc_use_memcg(target_memcg); <...> memalloc_unuse_memcg(); It works perfectly for allocations performed from a normal context, however an attempt to call it from an interrupt context or just nest two remote charging blocks will lead to an incorrect accounting. On exit from the inner block the active memcg will be cleared instead of being restored. memalloc_use_memcg(target_memcg); memalloc_use_memcg(target_memcg_2); <...> memalloc_unuse_memcg(); Error: allocation here are charged to the memcg of the current process instead of target_memcg. memalloc_unuse_memcg(); This patch extends the remote charging API by switching to a single function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg), which sets the new value and returns the old one. So a remote charging block will look like: old_memcg = set_active_memcg(target_memcg); <...> set_active_memcg(old_memcg); This patch is heavily based on the patch by Johannes Weiner, which can be found here: https://lkml.org/lkml/2020/5/28/806 . Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18ia64: fix build error with !COREDUMPKrzysztof Kozlowski1-1/+1
Fix linkage error when CONFIG_BINFMT_ELF is selected but CONFIG_COREDUMP is not: ia64-linux-ld: arch/ia64/kernel/elfcore.o: in function `elf_core_write_extra_phdrs': elfcore.c:(.text+0x172): undefined reference to `dump_emit' ia64-linux-ld: arch/ia64/kernel/elfcore.o: in function `elf_core_write_extra_data': elfcore.c:(.text+0x2b2): undefined reference to `dump_emit' Fixes: 1fcccbac89f5 ("elf coredump: replace ELF_CORE_EXTRA_* macros by functions") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200819064146.12529-1-krzk@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18ext4: Detect already used quota file earlyJan Kara1-0/+5
When we try to use file already used as a quota file again (for the same or different quota type), strange things can happen. At the very least lockdep annotations may be wrong but also inode flags may be wrongly set / reset. When the file is used for two quota types at once we can even corrupt the file and likely crash the kernel. Catch all these cases by checking whether passed file is already used as quota file and bail early in that case. This fixes occasional generic/219 failure due to lockdep complaint. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201015110330.28716-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2: avoid transaction reuse after reformattingchangfengnan1-12/+66
When ext4 is formatted with lazy_journal_init=1 and transactions from the previous filesystem are still on disk, it is possible that they are considered during a recovery after a crash. Because the checksum seed has changed, the CRC check will fail, and the journal recovery fails with checksum error although the journal is otherwise perfectly valid. Fix the problem by checking commit block time stamps to determine whether the data in the journal block is just stale or whether it is indeed corrupt. Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Fengnan Chang <changfengnan@hikvision.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201012164900.20197-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use the normal helper to get the actual inodeKaixu Xia1-2/+2
Here we use the READ_ONCE to fix race conditions in ->d_compare() and ->d_hash() when they are called in RCU-walk mode, seems we can use the normal helper d_inode_rcu() to get the actual inode. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/1602317416-1260-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix bs < ps issue reported with dioread_nolock mount optRitesh Harjani2-2/+2
left shifting m_lblk by blkbits was causing value overflow and hence it was not able to convert unwritten to written extent. So, make sure we typecast it to loff_t before do left shift operation. Also in func ext4_convert_unwritten_io_end_vec(), make sure to initialize ret variable to avoid accidentally returning an uninitialized ret. This patch fixes the issue reported in ext4 for bs < ps with dioread_nolock mount option. Fixes: c8cc88163f40df39e50c ("ext4: Add support for blocksize < pagesize in dioread_nolock") Cc: stable@kernel.org Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/af902b5db99e8b73980c795d84ad7bb417487e76.1602168865.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()Mauricio Faria de Oliveira2-11/+101
This implements journal callbacks j_submit|finish_inode_data_buffers() with different behavior for data=journal: to write-protect pages under commit, preventing changes to buffers writeably mapped to userspace. If a buffer's content changes between commit's checksum calculation and write-out to disk, it can cause journal recovery/mount failures upon a kernel crash or power loss. [ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support! [ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log [ 27.342716] JBD2: recovery failed [ 27.343316] EXT4-fs (loop0): error loading journal mount: /ext4: can't read superblock on /dev/loop0. In j_submit_inode_data_buffers() we write-protect the inode's pages with write_cache_pages() and redirty w/ writepage callback if needed. In j_finish_inode_data_buffers() there is nothing do to. And in order to use the callbacks, inodes are added to the inode list in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite(). In ext4_page_mkwrite() we must make sure that the buffers are attached to the transaction as jbddirty with write_end_fn(), as already done in __ext4_journalled_writepage(). Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reported-by: Dann Frazier <dann.frazier@canonical.com> Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201006004841.600488-5-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: data=journal: fixes for ext4_page_mkwrite()Mauricio Faria de Oliveira1-7/+44
These are two fixes for data journalling required by the next patch, discovered while testing it. First, the optimization to return early if all buffers are mapped is not appropriate for the next patch: The inode _must_ be added to the transaction's list in data=journal mode (so to write-protect pages on commit) thus we cannot return early there. Second, once that optimization to reduce transactions was disabled for data=journal mode, more transactions happened, and occasionally hit this warning message: 'JBD2: Spotted dirty metadata buffer'. Reason is, block_page_mkwrite() will set_buffer_dirty() before do_journal_get_write_access() that is there to prevent it. This issue was masked by the optimization. So, on data=journal use __block_write_begin() instead. This also requires page locking and len recalculation. (see block_page_mkwrite() for implementation details.) Finally, as Jan noted there is little sharing between data=journal and other modes in ext4_page_mkwrite(). However, a prototype of ext4_journalled_page_mkwrite() showed there still would be lots of duplicated lines (tens of) that didn't seem worth it. Thus this patch ends up with an ugly goto to skip all non-data journalling code (to avoid long indentations, but that can be changed..) in the beginning, and just a conditional in the transaction section. Well, we skip a common part to data journalling which is the page truncated check, but we do it again after ext4_journal_start() when we re-acquire the page lock (so not to acquire the page lock twice needlessly for data journalling.) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-4-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2, ext4, ocfs2: introduce/use journal callbacks ↵Mauricio Faria de Oliveira4-13/+50
j_submit|finish_inode_data_buffers() Introduce journal callbacks to allow different behaviors for an inode in journal_submit|finish_inode_data_buffers(). The existing users of the current behavior (ext4, ocfs2) are adapted to use the previously exported functions that implement the current behavior. Users are callers of jbd2_journal_inode_ranged_write|wait(), which adds the inode to the transaction's inode list with the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree. Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2, which builds fs/jbd2/commit.c and journal.c that define and export the functions, so we can call directly in ext4/ocfs2. Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()Mauricio Faria de Oliveira3-20/+22
Export functions that implement the current behavior done for an inode in journal_submit|finish_inode_data_buffers(). No functional change. Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-2-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()zhangyi (F)2-9/+31
Now we only use sb_bread_unmovable() to read superblock and descriptor block at mount time, so there is no opportunity that we need to clear buffer verified bit and also handle buffer write_io error bit. But for the sake of unification, let's introduce ext4_sb_bread_unmovable() to replace all sb_bread_unmovable(). After this patch, we stop using read helpers in fs/buffer.c. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-8-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use ext4_sb_bread() instead of sb_bread()zhangyi (F)2-7/+7
We have already remove open codes that invoke helpers provide by fs/buffer.c in all places reading metadata buffers. This patch switch to use ext4_sb_bread() to replace all sb_bread() helpers, which is ext4_read_bh() helper back end. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-7-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: introduce ext4_sb_breadahead_unmovable() to replace ↵zhangyi (F)3-2/+13
sb_breadahead_unmovable() If we readahead inode tables in __ext4_get_inode_loc(), it may bypass buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable() to handle this special case. This patch also replace sb_breadahead_unmovable() in ext4_fill_super() for the sake of unification. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()zhangyi (F)1-10/+1
We have already introduced ext4_buffer_uptodate() to re-set the uptodate bit on buffer which has been failed to write out to disk. Just remove the redundant codes and switch to use ext4_buffer_uptodate() in __ext4_get_inode_loc(). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use common helpers in all places reading metadata bufferszhangyi (F)9-54/+36
Revome all open codes that read metadata buffers, switch to use ext4_read_bh_*() common helpers. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: introduce new metadata buffer read helperszhangyi (F)2-0/+67
The previous patch add clear_buffer_verified() before we read metadata block from disk again, but it's rather easy to miss clearing of this bit because currently we read metadata buffer through different open codes (e.g. ll_rw_block(), bh_submit_read() and invoke submit_bh() directly). So, it's time to add common helpers to unify in all the places reading metadata buffers instead. This patch add 3 helpers: - ext4_read_bh_nowait(): async read metadata buffer if it's actually not uptodate, clear buffer_verified bit before read from disk. - ext4_read_bh(): sync version of read metadata buffer, it will wait until the read operation return and check the return status. - ext4_read_bh_lock(): try to lock the buffer before read buffer, it will skip reading if the buffer is already locked. After this patch, we need to use these helpers in all the places reading metadata buffer instead of different open codes. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924073337.861472-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: clear buffer verified flag if read meta block from diskzhangyi (F)5-1/+8
The metadata buffer is no longer trusted after we read it from disk again because it is not uptodate for some reasons (e.g. failed to write back). Otherwise we may get below memory corruption problem in ext4_ext_split()->memset() if we read stale data from the newly allocated extent block on disk which has been failed to async write out but miss verify again since the verified bit has already been set on the buffer. [ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000 ... [ 29.783317] Oops: 0002 [#2] SMP [ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800 [ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W [ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8 [ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364 [ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000 [ 29.789288] Workqueue: writeback wb_workfn [ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 29.790321] (flush-8:0) [ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0 [ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 29.792839] RIP: 0010:__memset+0x24/0x30 [ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033 [ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt ... [ 29.808149] Call Trace: [ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.810161] ext4_writepages+0xc85/0x17c0 ... Fix this by clearing buffer's verified bit if we read meta block from disk again. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: limit entries returned when counting fsmap recordsDarrick J. Wong1-0/+3
If userspace asked fsmap to try to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: 0c9ec4beecac ("ext4: support GETFSMAP ioctls") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20201001222148.GA49520@magnolia Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: make mb_check_counter per groupChunguang Xu2-5/+5
Make bb_check_counter per group, so each group has the same chance to be checked, which can expose errors more easily. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: delete invalid comments near mb_buddy_adjust_borderChunguang Xu1-3/+0
The comment near mb_buddy_adjust_border seems meaningless, just clear it. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix bdev write error check failed when mount fs with roZhang Xiaoxu1-11/+2
Consider a situation when a filesystem was uncleanly shutdown and the orphan list is not empty and a read-only mount is attempted. The orphan list cleanup during mount will fail with: ext4_check_bdev_write_error:193: comm mount: Error while async write back metadata This happens because sbi->s_bdev_wb_err is not initialized when mounting the filesystem in read only mode and so ext4_check_bdev_write_error() falsely triggers. Initialize sbi->s_bdev_wb_err unconditionally to avoid this problem. Fixes: bc71726c7257 ("ext4: abort the filesystem if failed to async write metadata buffer") Cc: stable@kernel.org Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200928020556.710971-1-zhangxiaoxu5@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: rename system_blks to s_system_blks inside ext4_sb_infoChunguang Xu3-9/+9
Rename system_blks to s_system_blks inside ext4_sb_info, keep the naming rules consistent with other variables, which is convenient for code reading and writing. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1600916623-544-2-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: rename journal_dev to s_journal_dev inside ext4_sb_infoChunguang Xu3-12/+12
Rename journal_dev to s_journal_dev inside ext4_sb_info, keep the naming rules consistent with other variables, which is convenient for code reading and writing. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1600916623-544-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2: fix the comment of struct jbd2_journal_handleHui Su1-2/+2
the struct name was modified long ago, but the comment still use struct handle_s. Signed-off-by: Hui Su <sh_def@163.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200922171231.GA53120@rlk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: add trace exit in exception path.Zhang Qilong2-2/+3
Missing trace exit in exception path of ext4_sync_file and ext4_ind_map_blocks. Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Link: https://lore.kernel.org/r/20200921124738.23352-1-zhangqilong3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: optimize file overwritesRitesh Harjani1-3/+15
In case if the file already has underlying blocks/extents allocated then we don't need to start a journal txn and can directly return the underlying mapping. Currently ext4_iomap_begin() is used by both DAX & DIO path. We can check if the write request is an overwrite & then directly return the mapping information. This could give a significant perf boost for multi-threaded writes specially random overwrites. On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement could be seen in random writes (overwrite). Also bcoz this optimizes away the spinlock contention during jbd2 slab cache allocation (jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed. Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: remove unused including <linux/version.h>Tian Tao1-1/+0
Remove including <linux/version.h> that don't need it. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Link: https://lore.kernel.org/r/1600397165-42873-1-git-send-email-tiantao6@hisilicon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix superblock checksum calculation raceConstantine Sapuntzakis1-0/+11
The race condition could cause the persisted superblock checksum to not match the contents of the superblock, causing the superblock to be considered corrupt. An example of the race follows. A first thread is interrupted in the middle of a checksum calculation. Then, another thread changes the superblock, calculates a new checksum, and sets it. Then, the first thread resumes and sets the checksum based on the older superblock. To fix, serialize the superblock checksum calculation using the buffer header lock. While a spinlock is sufficient, the buffer header is already there and there is precedent for locking it (e.g. in ext4_commit_super). Tested the patch by booting up a kernel with the patch, creating a filesystem and some files (including some orphans), and then unmounting and remounting the file system. Cc: stable@kernel.org Signed-off-by: Constantine Sapuntzakis <costa@purestorage.com> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200914161014.22275-1-costa@purestorage.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix error handling code in add_new_gdbDinghao Liu1-1/+3
When ext4_journal_get_write_access() fails, we should terminate the execution flow and release n_group_desc, iloc.bh, dind and gdb_bh. Cc: stable@kernel.org Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200829025403.3139-1-dinghao.liu@zju.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: disallow modifying DAX inode flag if inline_data has been setXiao Yang1-1/+1
inline_data is mutually exclusive to DAX so enabling both of them triggers the following issue: ------------------------------------------ # mkfs.ext4 -F -O inline_data /dev/pmem1 ... # mount /dev/pmem1 /mnt # echo 'test' >/mnt/file # lsattr -l /mnt/file /mnt/file Inline_Data # xfs_io -c "chattr +x" /mnt/file # xfs_io -c "lsattr -v" /mnt/file [dax] /mnt/file # umount /mnt # mount /dev/pmem1 /mnt # cat /mnt/file cat: /mnt/file: Numerical result out of range ------------------------------------------ Fixes: b383a73f2b83 ("fs/ext4: Introduce DAX inode flag") Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200828084330.15776-1-yangx.jy@cn.fujitsu.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: remove unused argument from ext4_(inc|dec)_countNikolay Borisov1-10/+10
The 'handle' argument is not used for anything so simply remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200826133116.11592-1-nborisov@suse.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: do not interpret high bytes if 64bit feature is disabledPetr Malat1-6/+8
Fields s_free_blocks_count_hi, s_r_blocks_count_hi and s_blocks_count_hi are not valid if EXT4_FEATURE_INCOMPAT_64BIT is not enabled and should be treated as zeroes. Signed-off-by: Petr Malat <oss@malat.biz> Link: https://lore.kernel.org/r/20200825150016.3363-1-oss@malat.biz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: delete duplicated words + other fixesRandy Dunlap5-6/+6
Delete repeated words in fs/ext4/. {the, this, of, we, after} Also change spelling of "xttr" in inline.c to "xattr" in 2 places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: flag as supporting buffered async readsJens Axboe1-1/+1
ext4 uses generic_file_read_iter(), which already supports this. Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/fb90cc2d-b12c-738f-21a4-dd7a8ae0556a@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix leaking sysfs kobject after failed mountEric Biggers1-0/+1
ext4_unregister_sysfs() only deletes the kobject. The reference to it needs to be put separately, like ext4_put_super() does. This addresses the syzbot report "memory leak in kobject_set_name_vargs (3)" (https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1). Reported-by: syzbot+9f864abad79fae7c17e1@syzkaller.appspotmail.com Fixes: 72ba74508b28 ("ext4: release sysfs kobject when failing to enable quotas on mount") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200922162456.93657-1-ebiggers@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: discard preallocations before releasing group lockJan Kara1-20/+13
ext4_mb_discard_group_preallocations() can be releasing group lock with preallocations accumulated on its local list. Thus although discard_pa_seq was incremented and concurrent allocating processes will be retrying allocations, it can happen that premature ENOSPC error is returned because blocks used for preallocations are not available for reuse yet. Make sure we always free locally accumulated preallocations before releasing group lock. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix dead loop in ext4_mb_new_blocksYe Bin1-1/+3
As we test disk offline/online with running fsstress, we find fsstress process is keeping running state. kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 .... kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114 ext4_mb_new_blocks repeat: ext4_mb_discard_preallocations_should_retry(sb, ac, &seq) freed = ext4_mb_discard_preallocations ext4_mb_discard_group_preallocations this_cpu_inc(discard_pa_seq); ---> freed == 0 seq_retry = ext4_get_discard_pa_seq_sum for_each_possible_cpu(__cpu) __seq += per_cpu(discard_pa_seq, __cpu); if (seq_retry != *seq) { *seq = seq_retry; ret = true; } As we see seq_retry is sum of discard_pa_seq every cpu, if ext4_mb_discard_group_preallocations return zero discard_pa_seq in this cpu maybe increase one, so condition "seq_retry != *seq" have always been met. Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we only increase discard_pa_seq when there is some PA to free. Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: implement swap_activate aops using iomapRitesh Harjani1-0/+11
After moving ext4's bmap to iomap interface, swapon functionality on files created using fallocate (which creates unwritten extents) are failing. This is since iomap_bmap interface returns 0 for unwritten extents and thus generic_swapfile_activate considers this as holes and hence bail out with below kernel msg :- [340.915835] swapon: swapfile has holes To fix this we need to implement ->swap_activate aops in ext4 which will use ext4_iomap_report_ops. Since we only need to return the list of extents so ext4_iomap_report_ops should be enough. Cc: stable@kernel.org Reported-by: Yuxuan Shui <yshuiv7@gmail.com> Fixes: ac58e4fb03f ("ext4: move ext4 bmap to use iomap infrastructure") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200904091653.1014334-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-17coccinelle: api: add kfree_mismatch scriptDenis Efremov1-0/+228
Check that alloc and free types of functions match each other. Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
2020-10-17mm: use limited read-ahead to satisfy readJens Axboe1-6/+14
For the case where read-ahead is disabled on the file, or if the cgroup is congested, ensure that we can at least do 1 page of read-ahead to make progress on the read in an async fashion. This could potentially be larger, but it's not needed in terms of functionality, so let's error on the side of caution as larger counts of pages may run into reclaim issues (particularly if we're congested). This makes sure we're not hitting the potentially sync ->readpage() path for IO that is marked IOCB_WAITQ, which could cause us to block. It also means we'll use the same path for IO, regardless of whether or not read-ahead happens to be disabled on the lower level device. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Hao_Xu <haoxu@linux.alibaba.com> [axboe: updated for new ractl API] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17mm: mark async iocb read as NOWAIT once some data has been copiedJens Axboe1-0/+8
Once we've copied some data for an iocb that is marked with IOCB_WAITQ, we should no longer attempt to async lock a new page. Instead make sure we return the copied amount, and let the caller retry, instead of returning -EIOCBQUEUED for a new page. This should only be possible with read-ahead disabled on the below device, and multiple threads racing on the same file. Haven't been able to reproduce on anything else. Cc: stable@vger.kernel.org # v5.9 Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()") Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17Merge tag 'perf-tools-for-v5.10-2020-10-15' of ↵Linus Torvalds164-5711/+10441
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools updates from Arnaldo Carvalho de Melo: - cgroup improvements for 'perf stat', allowing for compact specification of events and cgroups in the command line. - Support per thread topdown metrics in 'perf stat'. - Support sample-read topdown metric group in 'perf record' - Show start of latency in addition to its start in 'perf sched latency'. - Add min, max to 'perf script' futex-contention output, in addition to avg. - Allow usage of 'perf_event_attr->exclusive' attribute via the new ':e' event modifier. - Add 'snapshot' command to 'perf record --control', using it with Intel PT. - Support FIFO file names as alternative options to 'perf record --control'. - Introduce branch history "streams", to compare 'perf record' runs with 'perf diff' based on branch records and report hot streams. - Support PE executable symbol tables using libbfd, to profile, for instance, wine binaries. - Add filter support for option 'perf ftrace -F/--funcs'. - Allow configuring the 'disassembler_style' 'perf annotate' knob via 'perf config' - Update CascadelakeX and SkylakeX JSON vendor events files. - Add support for parsing perchip/percore JSON vendor events. - Add power9 hv_24x7 core level metric events. - Add L2 prefetch, ITLB instruction fetch hits JSON events for AMD zen1. - Enable Family 19h users by matching Zen2 AMD vendor events. - Use debuginfod in 'perf probe' when required debug files not found locally. - Display negative tid in non-sample events in 'perf script'. - Make GTK2 support opt-in - Add build test with GTK+ - Add missing -lzstd to the fast path feature detection - Add scripts to auto generate 'mmap', 'mremap' string<->id tables for use in 'perf trace'. - Show python test script in verbose mode. - Fix uncore metric expressions - Msan uninitialized use fixes. - Use condition variables in 'perf bench numa' - Autodetect python3 binary in systems without python2. - Support md5 build ids in addition to sha1. - Add build id 'perf test' regression test. - Fix printable strings in python3 scripts. - Fix off by ones in 'perf trace' in arches using libaudit. - Fix JSON event code for events referencing std arch events. - Introduce 'perf test' shell script for Arm CoreSight testing. - Add rdtsc() for Arm64 for used in the PERF_RECORD_TIME_CONV metadata event and in 'perf test tsc'. - 'perf c2c' improvements: Add "RMT Load Hit" metric, "Total Stores", fixes and documentation update. - Fix usage of reloc_sym in 'perf probe' when using both kallsyms and debuginfo files. - Do not print 'Metric Groups:' unnecessarily in 'perf list' - Refcounting fixes in the event parsing code. - Add expand cgroup event 'perf test' entry. - Fix out of bounds CPU map access when handling armv8_pmu events in 'perf stat'. - Add build-id injection 'perf bench' benchmark. - Enter namespace when reading build-id in 'perf inject'. - Do not load map/dso when injecting build-id speeding up the 'perf inject' process. - Add --buildid-all option to avoid processing all samples, just the mmap metadata events. - Add feature test to check if libbfd has buildid support - Add 'perf test' entry for PE binary format support. - Fix typos in power8 PMU vendor events JSON files. - Hide libtraceevent non API functions. * tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (113 commits) perf c2c: Update documentation for metrics reorganization perf c2c: Add metrics "RMT Load Hit" perf c2c: Correct LLC load hit metrics perf c2c: Change header for LLC local hit perf c2c: Use more explicit headers for HITM perf c2c: Change header from "LLC Load Hitm" to "Load Hitm" perf c2c: Organize metrics based on memory hierarchy perf c2c: Display "Total Stores" as a standalone metrics perf c2c: Display the total numbers continuously perf bench: Use condition variables in numa. perf jevents: Fix event code for events referencing std arch events perf diff: Support hot streams comparison perf streams: Report hot streams perf streams: Calculate the sum of total streams hits perf streams: Link stream pair perf streams: Compare two streams perf streams: Get the evsel_streams by evsel_idx perf streams: Introduce branch history "streams" perf intel-pt: Improve PT documentation slightly perf tools: Add support for exclusive groups/events ...
2020-10-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds224-4703/+5213
Pull rdma updates from Jason Gunthorpe: "A usual cycle for RDMA with a typical mix of driver and core subsystem updates: - Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma, hns, usnic, qib, qedr, cxgb4, hns, bnxt_re - Various rtrs fixes and updates - Bug fix for mlx4 CM emulation for virtualization scenarios where MRA wasn't working right - Use tracepoints instead of pr_debug in the CM code - Scrub the locking in ucma and cma to close more syzkaller bugs - Use tasklet_setup in the subsystem - Revert the idea that 'destroy' operations are not allowed to fail at the driver level. This proved unworkable from a HW perspective. - Revise how the umem API works so drivers make fewer mistakes using it - XRC support for qedr - Convert uverbs objects RWQ and MW to new the allocation scheme - Large queue entry sizes for hns - Use hmm_range_fault() for mlx5 On Demand Paging - uverbs APIs to inspect the GID table instead of sysfs - Move some of the RDMA code for building large page SGLs into lib/scatterlist" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits) RDMA/ucma: Fix use after free in destroy id flow RDMA/rxe: Handle skb_clone() failure in rxe_recv.c RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI RDMA: Explicitly pass in the dma_device to ib_register_device lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values IB/mlx4: Convert rej_tmout radix-tree to XArray RDMA/rxe: Fix bug rejecting all multicast packets RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt() RDMA/rxe: Remove duplicate entries in struct rxe_mr IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS IB/rdmavt: Fix sizeof mismatch MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl. RDMA/bnxt_re: Use rdma_umem_for_each_dma_block() RDMA/umem: Move to allocate SG table from pages lib/scatterlist: Add support in dynamic allocation of SG table from pages tools/testing/scatterlist: Show errors in human readable form tools/testing/scatterlist: Rejuvenate bit-rotten test RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces RDMA/uverbs: Expose the new GID query API to user space ...
2020-10-17Merge tag 'i3c/for-5.10' of ↵Linus Torvalds2-54/+94
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux Pull i3c updates from Boris Brezillon: - Fix DAA for the pre-reserved address case - Fix an error path in the cadence driver * tag 'i3c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux: i3c: master: Fix error return in cdns_i3c_master_probe() i3c: master: fix for SETDASA and DAA process i3c: master add i3c_master_attach_boardinfo to preserve boardinfo
2020-10-17Merge tag 'mtd/for-5.10' of ↵Linus Torvalds121-1076/+2316
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull MTD updates from Richard Weinberger: "NAND core changes: - Drop useless 'depends on' in Kconfig - Add an extra level in the Kconfig hierarchy - Trivial spellings - Dynamic allocation of the interface configurations - Dropping the default ONFI timing mode - Various cleanup (types, structures, naming, comments) - Hide the chip->data_interface indirection - Add the generic rb-gpios property - Add the ->choose_interface_config() hook - Introduce nand_choose_best_sdr_timings() - Use default values for tPROG_max and tBERS_max - Avoid redefining tR_max and tCCS_min - Add a helper to find the closest ONFI mode - bcm63xx MTD parsers: simplify CFE detection Raw NAND controller drivers changes: - fsl-upm: Deprecation of specific DT properties - fsl_upm: Driver rework and cleanup in favor of ->exec_op() - Ingenic: Cleanup ARRAY_SIZE() vs sizeof() use - brcmnand: ECC error handling on EDU transfers - brcmnand: Don't default to EDU transfers - qcom: Set BAM mode only if not set already - qcom: Avoid write to unavailable register - gpio: Driver rework in favor of ->exec_op() - tango: ->exec_op() conversion - mtk: ->exec_op() conversion Raw NAND chip drivers changes: - toshiba: Implement ->choose_interface_config() for TH58NVG2S3HBAI4 - toshiba: Implement ->choose_interface_config() for TC58NVG0S3E - toshiba: Implement ->choose_interface_config() for TC58TEG5DCLTA00 - hynix: Implement ->choose_interface_config() for H27UCG8T2ATR-BC HyperBus changes: - DMA support for TI's AM654 HyperBus controller driver. - HyperBus frontend driver for Renesas RPC-IF driver. SPI NOR core changes: - Support for Winbond w25q64jwm flash - Enable 4K sector support for mx25l12805d SPI NOR controller drivers changes: - intel-spi Add Alder Lake-S PCI ID MTD Core changes: - mtdoops: Don't run panic write twice - mtdconcat: Correctly handle panic write - Use DEFINE_SHOW_ATTRIBUTE" * tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (76 commits) mtd: hyperbus: Fix build failure when only RPCIF_HYPERBUS is enabled mtd: hyperbus: add Renesas RPC-IF driver Revert "mtd: spi-nor: Prefer asynchronous probe" mtd: parsers: bcm63xx: Do not make it modular mtd: spear_smi: Enable compile testing mtd: maps: vmu-flash: fix typos for struct memcard mtd: physmap: Add Baikal-T1 physically mapped ROM support mtd: maps: vmu-flash: simplify the return expression of probe_maple_vmu mtd: onenand: simplify the return expression of onenand_transfer_auto_oob mtd: rawnand: cadence: remove a redundant dev_err call mtd: rawnand: ams-delta: Fix non-OF build warning mtd: rawnand: Don't overwrite the error code from nand_set_ecc_soft_ops() mtd: rawnand: Introduce nand_set_ecc_on_host_ops() mtd: rawnand: atmel: Check return values for nand_read_data_op mtd: rawnand: vf610: Remove unused function vf610_nfc_transfer_size() mtd: rawnand: qcom: Simplify with dev_err_probe() mtd: rawnand: marvell: Fix and update kerneldoc mtd: rawnand: marvell: Simplify with dev_err_probe() mtd: rawnand: gpmi: Simplify with dev_err_probe() mtd: rawnand: atmel: Simplify with dev_err_probe() ...
2020-10-17Merge tag 'thermal-v5.10-rc1' of ↵Linus Torvalds22-82/+156
git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux Pull thermal updates from Daniel Lezcano: - Fix Kconfig typo "acces" -> "access" (Colin Ian King) - Use dev_error_probe() to simplify the error handling on imx and imx8 platforms (Anson Huang) - Use dedicated kobj_to_dev() instead of container_of() in the sysfs core code (Tian Tao) - Fix coding style by adding braces to a one line conditional statement on rcar (Geert Uytterhoeven) - Add DT binding documentation for the r8a774e1 platform and update the Kconfig description supporting RZ/G2 SoCs (Lad Prabhakar) - Simplify the return expression of stm_thermal_prepare on the stm32 platform (Qinglang Miao) - Fix the unit in the function documentation for the idle injection cooling device (Zhuguang Qing) - Remove an unecessary mutex_init() in the core code (Qinglang Miao) - Add support for keep alive events in the core code and the specific int340x (Srinivas Pandruvada) - Remove unused thermal zone variable in devfreq and cpufreq cooling devices (Zhuguang Qing) - Add the A100's THS controller support (Yangtao Li) - Add power management on the omap3's bandgap sensor (Adam Ford) - Fix a missing nlmsg_free in the netlink core error path (Jing Xiangfeng) * tag 'thermal-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: thermal: core: Adding missing nlmsg_free() in thermal_genl_sampling_temp() thermal: ti-soc-thermal: Enable addition power management thermal: sun8i: Add A100's THS controller support thermal: sun8i: add TEMP_CALIB_MASK for calibration data in sun50i_h6_ths_calibrate dt-bindings: thermal: sun8i: Add binding for A100's THS controller thermal: cooling: Remove unused variable *tz thermal: int340x: Add keep alive response method thermal: core: Add new event for sending keep alive notifications thermal: int340x: Provide notification for OEM variable change thermal: core: remove unnecessary mutex_init() thermal/idle_inject: Fix comment of idle_duration_us and name of latency_ns thermal: Kconfig: Update description for RCAR_GEN3_THERMAL config thermal: stm32: simplify the return expression of stm_thermal_prepare() dt-bindings: thermal: rcar-gen3-thermal: Add r8a774e1 support thermal: rcar_thermal: Add missing braces to conditional statement thermal: Use kobj_to_dev() instead of container_of() thermal: imx8mm: Use dev_err_probe() to simplify error handling thermal: imx: Use dev_err_probe() to simplify error handling drivers: thermal: Kconfig: fix spelling mistake "acces" -> "access"
2020-10-17io_uring: fix double poll mask initPavel Begunkov1-1/+3
__io_queue_proc() is used by both, poll reqs and apoll. Don't use req->poll.events to copy poll mask because for apoll it aliases with private data of the request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io-wq: inherit audit loginuid and sessionidJens Axboe4-1/+41
Make sure the async io-wq workers inherit the loginuid and sessionid from the original task, and restore them to unset once we're done with the async work item. While at it, disable the ability for kernel threads to write to their own loginuid. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: use percpu counters to track inflight requestsJens Axboe2-27/+30
Even though we place the req_issued and req_complete in separate cachelines, there's considerable overhead in doing the atomics particularly on the completion side. Get rid of having the two counters, and just use a percpu_counter for this. That's what it was made for, after all. This considerably reduces the overhead in __io_free_req(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: assign new io_identity for task if members have changedJens Axboe2-5/+17
This avoids doing a copy for each new async IO, if some parts of the io_identity has changed. We avoid reference counting for the normal fast path of nothing ever changing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: store io_identity in io_uring_taskJens Axboe2-10/+12
This is, by definition, a per-task structure. So store it in the task context, instead of doing carrying it in each io_kiocb. We're being a bit inefficient if members have changed, as that requires an alloc and copy of a new io_identity struct. The next patch will fix that up. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: COW io_identity on mismatchJens Axboe2-52/+171
If the io_identity doesn't completely match the task, then create a copy of it and use that. The existing copy remains valid until the last user of it has gone away. This also changes the personality lookup to be indexed by io_identity, instead of creds directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: move io identity items into separate structJens Axboe4-57/+64
io-wq contains a pointer to the identity, which we just hold in io_kiocb for now. This is in preparation for putting this outside io_kiocb. The only exception is struct files_struct, which we'll need different rules for to avoid a circular dependency. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: rely solely on work flags to determine personality.Jens Axboe2-22/+33
We solely rely on work->work_flags now, so use that for proper checking and clearing/dropping of various identity items. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: pass required context in as flagsJens Axboe3-64/+52
We have a number of bits that decide what context to inherit. Set up io-wq flags for these instead. This is in preparation for always having the various members set, but not always needing them for all requests. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io-wq: assign NUMA node locality if appropriateJens Axboe1-0/+1
There was an assumption that kthread_create_on_node() would properly set NUMA affinities in terms of CPUs allowed, but it doesn't. Make sure we do this when creating an io-wq context on NUMA. Cc: stable@vger.kernel.org Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: fix error path cleanup in io_sqe_files_register()Jens Axboe1-1/+2
syzbot reports the following crash: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 PID: 8927 Comm: syz-executor.3 Not tainted 5.9.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000118c000 CR3: 000000008e57d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45de59 Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4266f3bc78 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 00000000000083c0 RCX: 000000000045de59 RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000005 RBP: 000000000118bf68 R08: 0000000000000000 R09: 0000000000000000 R10: 40000000000000a1 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007fff2fa4f12f R14: 00007f4266f3c9c0 R15: 000000000118bf2c Modules linked in: ---[ end trace 2a40a195e2d5e6e6 ]--- RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000074a918 CR3: 000000008e57d000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is a copy of fget failure condition jumping to cleanup, but the cleanup requires ctx->file_data to be assigned. Assign it when setup, and ensure that we clear it again for the error path exit. Fixes: 5398ae698525 ("io_uring: clean file_data access in files_register") Reported-by: syzbot+f4ebcc98223dafd8991e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"Jens Axboe1-2/+2
This reverts commit 738277adc81929b3e7c9b63fec6693868cc5f931. This change didn't make a lot of sense, and as Linus reports, it actually fails on clang: /tmp/io_uring-dd40c4.s:26476: Warning: ignoring changed section attributes for .data..read_mostly The arrays are already marked const so, by definition, they are not just read-mostly, they are read-only. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: fix REQ_F_COMP_LOCKED by killing itPavel Begunkov1-80/+69
REQ_F_COMP_LOCKED is used and implemented in a buggy way. The problem is that the flag is set before io_put_req() but not cleared after, and if that wasn't the final reference, the request will be freed with the flag set from some other context, which may not hold a spinlock. That means possible races with removing linked timeouts and unsynchronised completion (e.g. access to CQ). Instead of fixing REQ_F_COMP_LOCKED, kill the flag and use task_work_add() to move such requests to a fresh context to free from it, as was done with __io_free_req_finish(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: dig out COMP_LOCK from deep call chainPavel Begunkov1-24/+9
io_req_clean_work() checks REQ_F_COMP_LOCK to pass this two layers up. Move the check up into __io_free_req(), so at least it doesn't looks so ugly and would facilitate further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: don't put a poll req under spinlockPavel Begunkov1-2/+1
Move io_put_req() in io_poll_task_handler() from under spinlock. This eliminates the need to use REQ_F_COMP_LOCKED, at the expense of potentially having to grab the lock again. That's still a better trade off than relying on the locked flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: don't unnecessarily clear F_LINK_TIMEOUTPavel Begunkov1-1/+0
If a request had REQ_F_LINK_TIMEOUT it would've been cleared in __io_kill_linked_timeout() by the time of __io_fail_links(), so no need to care about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: don't set COMP_LOCKED if won't putPavel Begunkov1-1/+1
__io_kill_linked_timeout() sets REQ_F_COMP_LOCKED for a linked timeout even if it can't cancel it, e.g. it's already running. It not only races with io_link_timeout_fn() for ->flags field, but also leaves the flag set and so io_link_timeout_fn() may find it and decide that it holds the lock. Hopefully, the second problem is potential. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17io_uring: Fix sizeof() mismatchColin Ian King1-1/+1
An incorrect sizeof() is being used, sizeof(file_data->table) is not correct, it should be sizeof(*file_data->table). Fixes: 5398ae698525 ("io_uring: clean file_data access in files_register") Signed-off-by: Colin Ian King <colin.king@canonical.com> Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-16mailbox: avoid timer start from callbackJassi Brar1-5/+7
If the txdone is done by polling, it is possible for msg_submit() to start the timer while txdone_hrtimer() callback is running. If the timer needs recheduling, it could already be enqueued by the time hrtimer_forward_now() is called, leading hrtimer to loudly complain. WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110 CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5 Hardware name: Libre Computer AML-S805X-AC (DT) Workqueue: events_freezable_power_ thermal_zone_device_check pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--) pc : hrtimer_forward+0xc4/0x110 lr : txdone_hrtimer+0xf8/0x118 [...] This can be fixed by not starting the timer from the callback path. Which requires the timer reloading as long as any message is queued on the channel, and not just when current tx is not done yet. Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling") Reported-by: Da Xue <da@libre.computer> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Jerome Brunet <jbrunet@baylibre.com> Tested-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
2020-10-16xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=nDarrick J. Wong1-0/+1
Pavel Machek complained that the question about supporting deprecated XFS v4 comes up even when XFS is disabled. This clearly makes no sense, so fix Kconfig. Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-10-16xfs: fix high key handling in the rt allocator's query_range functionDarrick J. Wong1-7/+4
Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key in the realtime bitmap is always one less than the number of rt extents, which means that the key clamp at the start of the function is wrong. The 4th argument to xfs_rtfind_forw is the highest rt extent that we want to probe, which means that passing 1 less than the high key is wrong. Finally, drop the rem variable that controls the loop because we can compare the iteration point (rtstart) against the high key directly. The sordid history of this function is that the original commit (fb3c3) incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter to xfs_rtfind_forw. This was wrong because the "high key" is supposed to be the largest key for which the caller wants result rows, not the key for the first row that could possibly be outside the range that the caller wants to see. A subsequent attempt (8ad56) to strengthen the parameter checking added incorrect clamping of the parameters to the number of rt blocks in the system (despite the bitmap functions all taking units of rt extents) to avoid querying ranges past the end of rt bitmap file but failed to fix the incorrect _rtfind_forw parameter. The original _rtfind_forw parameter error then survived the conversion of the startblock and blockcount fields to rt extents (a0e5c), and the most recent off-by-one fix (a3a37) thought it was patching a problem when the end of the rt volume is not in use, but none of these fixes actually solved the original problem that the author was confused about the "limit" argument to xfs_rtfind_forw. Sadly, all four of these patches were written by this author and even his own usage of this function and rt testing were inadequate to get this fixed quickly. Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap") Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks") Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units") Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-16Merge tag 'ovl-update-5.10' of ↵Linus Torvalds12-200/+446
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: - Improve performance for certain container setups by introducing a "volatile" mode - ioctl improvements - continue preparation for unprivileged overlay mounts * tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: use generic vfs_ioc_setflags_prepare() helper ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories ovl: rearrange ovl_can_list() ovl: enumerate private xattrs ovl: pass ovl_fs down to functions accessing private xattrs ovl: drop flags argument from ovl_do_setxattr() ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs ovl: use ovl_do_getxattr() for private xattr ovl: fold ovl_getxattr() into ovl_get_redirect_xattr() ovl: clean up ovl_getxattr() in copy_up.c duplicate ovl_getxattr() ovl: provide a mount option "volatile" ovl: check for incompatible features in work dir
2020-10-16Merge tag 'afs-fixes-20201016' of ↵Linus Torvalds12-172/+378
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull afs updates from David Howells: "A collection of fixes to fix afs_cell struct refcounting, thereby fixing a slew of related syzbot bugs: - Fix the cell tree in the netns to use an rwsem rather than RCU. There seem to be some problems deriving from the use of RCU and a seqlock to walk the rbtree, but it's not entirely clear what since there are several different failures being seen. Changing things to use an rwsem instead makes it more robust. The extra performance derived from using RCU isn't necessary in this case since the only time we're looking up a cell is during mount or when cells are being manually added. - Fix the refcounting by splitting the usage counter into a memory refcount and an active users counter. The usage counter was doing double duty, keeping track of whether a cell is still in use and keeping track of when it needs to be destroyed - but this makes the clean up tricky. Separating these out simplifies the logic. - Fix purging a cell that has an alias. A cell alias pins the cell it's an alias of, but the alias is always later in the list. Trying to purge in a single pass causes rmmod to hang in such a case. - Fix cell removal. If a cell's manager is requeued whilst it's removing itself, the manager will run again and re-remove itself, causing problems in various places. Follow Hillf Danton's suggestion to insert a more terminal state that causes the manager to do nothing post-removal. In additional to the above, two other changes: - Add a tracepoint for the cell refcount and active users count. This helped with debugging the above and may be useful again in future. - Downgrade an assertion to a print when a still-active server is seen during purging. This was happening as a consequence of incomplete cell removal before the servers were cleaned up" * tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Don't assert on unpurgeable server records afs: Add tracing for cell refcount and active user count afs: Fix cell removal afs: Fix cell purging with aliases afs: Fix cell refcounting by splitting the usage counter afs: Fix rapid cell addition/removal by not using RCU on cells tree
2020-10-16Merge tag 'f2fs-for-5.10-rc1' of ↵Linus Torvalds28-493/+1795
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've added new features such as zone capacity for ZNS and a new GC policy, ATGC, along with in-memory segment management. In addition, we could improve the decompression speed significantly by changing virtual mapping method. Even though we've fixed lots of small bugs in compression support, I feel that it becomes more stable so that I could give it a try in production. Enhancements: - suport zone capacity in NVMe Zoned Namespace devices - introduce in-memory current segment management - add standart casefolding support - support age threshold based garbage collection - improve decompression speed by changing virtual mapping method Bug fixes: - fix condition checks in some ioctl() such as compression, move_range, etc - fix 32/64bits support in data structures - fix memory allocation in zstd decompress - add some boundary checks to avoid kernel panic on corrupted image - fix disallowing compression for non-empty file - fix slab leakage of compressed block writes In addition, it includes code refactoring for better readability and minor bug fixes for compression and zoned device support" * tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits) f2fs: code cleanup by removing unnecessary check f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info f2fs: fix writecount false positive in releasing compress blocks f2fs: introduce check_swap_activate_fast() f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case f2fs: handle errors of f2fs_get_meta_page_nofail f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode f2fs: reject CASEFOLD inode flag without casefold feature f2fs: fix memory alignment to support 32bit f2fs: fix slab leak of rpages pointer f2fs: compress: fix to disallow enabling compress on non-empty file f2fs: compress: introduce cic/dic slab cache f2fs: compress: introduce page array slab cache f2fs: fix to do sanity check on segment/section count f2fs: fix to check segment boundary during SIT page readahead f2fs: fix uninit-value in f2fs_lookup f2fs: remove unneeded parameter in find_in_block() f2fs: fix wrong total_sections check and fsmeta check f2fs: remove duplicated code in sanity_check_area_boundary f2fs: remove unused check on version_bitmap ...
2020-10-16Merge tag 'docs/v5.10-1' of ↵Linus Torvalds309-2275/+1952
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull documentation updates from Mauro Carvalho Chehab: "A series of patches addressing warnings produced by make htmldocs. This includes: - kernel-doc markup fixes - ReST fixes - Updates at the build system in order to support newer versions of the docs build toolchain (Sphinx) After this series, the number of html build warnings should reduce significantly, and building with Sphinx 3.1 or later should now be supported (although it is still recommended to use Sphinx 2.4.4). As agreed with Jon, I should be sending you a late pull request by the end of the merge window addressing remaining issues with docs build, as there are a number of warning fixes that depends on pull requests that should be happening along the merge window. The end goal is to have a clean htmldocs build on Kernel 5.10. PS. It should be noticed that Sphinx 3.0 is not currently supported, as it lacks support for C domain namespaces. Such feature, needed in order to document uAPI system calls with Sphinx 3.x, was added only on Sphinx 3.1" * tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (75 commits) PM / devfreq: remove a duplicated kernel-doc markup mm/doc: fix a literal block markup workqueue: fix a kernel-doc warning docs: virt: user_mode_linux_howto_v2.rst: fix a literal block markup Input: sparse-keymap: add a description for @sw rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu nl80211: docs: add a description for s1g_cap parameter usb: docs: document altmode register/unregister functions kunit: test.h: fix a bad kernel-doc markup drivers: core: fix kernel-doc markup for dev_err_probe() docs: bio: fix a kerneldoc markup kunit: test.h: solve kernel-doc warnings block: bio: fix a warning at the kernel-doc markups docs: powerpc: syscall64-abi.rst: fix a malformed table drivers: net: hamradio: fix document location net: appletalk: Kconfig: Fix docs location dt-bindings: fix references to files converted to yaml memblock: get rid of a :c:type leftover math64.h: kernel-docs: Convert some markups into normal comments media: uAPI: buffer.rst: remove a left-over documentation ...
2020-10-16Merge tag 'trace-v5.10-2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix mismatch section of adding early trace events. Fixes the issue of a mismatch section that was missed due to gcc inlining the offending function, while clang did not (and reported the issue)" * tag 'trace-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove __init from __trace_early_add_new_event()
2020-10-16Merge tag 'printk-for-5.10-fixup' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: "Prevent overflow in the new lockless ringbuffer" * tag 'printk-for-5.10-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: ringbuffer: Wrong data pointer when appending small string
2020-10-16Merge tag 'kgdb-5.10-rc1' of ↵Linus Torvalds10-34/+101
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "A fairly modest set of changes for this cycle. Of particular note are an earlycon fix from Doug Anderson and my own changes to get kgdb/kdb to honour the kprobe blocklist. The later creates a safety rail that strongly encourages developers not to place breakpoints in, for example, arch specific trap handling code. Also included are a couple of small fixes and tweaks: an API update, eliminate a coverity dead code warning, improved handling of search during multi-line printk and a couple of typo corrections" * tag 'kgdb-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: Fix pager search for multi-line strings kernel: debug: Centralize dbg_[de]activate_sw_breakpoints kgdb: Add NOKPROBE labels on the trap handler functions kgdb: Honour the kprobe blocklist when setting breakpoints kernel/debug: Fix spelling mistake in debug_core.c kdb: Use newer api for tasklist scanning kgdb: Make "kgdbcon" work properly with "kgdb_earlycon" kdb: remove unnecessary null check of dbg_io_ops
2020-10-16Merge tag 'mips_5.10' of ↵Linus Torvalds165-3681/+1730
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS updates from Thomas Bogendoerfer: - removed support for PNX833x alias NXT_STB22x - included Ingenic SoC support into generic MIPS kernels - added support for new Ingenic SoCs - converted workaround selection to use Kconfig - replaced old boot mem functions by memblock_* - enabled COP2 usage in kernel for Loongson64 to make use of 16byte load/stores possible - cleanups and fixes * tag 'mips_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (92 commits) MIPS: DEC: Restore bootmem reservation for firmware working memory area MIPS: dec: fix section mismatch bcm963xx_tag.h: fix duplicated word mips: ralink: enable zboot support MIPS: ingenic: Remove CPU_SUPPORTS_HUGEPAGES MIPS: cpu-probe: remove MIPS_CPU_BP_GHIST option bit MIPS: cpu-probe: introduce exclusive R3k CPU probe MIPS: cpu-probe: move fpu probing/handling into its own file MIPS: replace add_memory_region with memblock MIPS: Loongson64: Clean up numa.c MIPS: Loongson64: Select SMP in Kconfig to avoid build error mips: octeon: Add Ubiquiti E200 and E220 boards MIPS: SGI-IP28: disable use of ll/sc in kernel MIPS: tx49xx: move tx4939_add_memory_regions into only user MIPS: pgtable: Remove used PAGE_USERIO define MIPS: alchemy: Share prom_init implementation MIPS: alchemy: Fix build breakage, if TOUCHSCREEN_WM97XX is disabled MIPS: process: include exec.h header in process.c MIPS: process: Add prototype for function arch_dup_task_struct MIPS: idle: Add prototype for function check_wait ...
2020-10-16Merge tag 's390-5.10-1' of ↵Linus Torvalds129-2162/+4094
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Remove address space overrides using set_fs() - Convert to generic vDSO - Convert to generic page table dumper - Add ARCH_HAS_DEBUG_WX support - Add leap seconds handling support - Add NVMe firmware-assisted kernel dump support - Extend NVMe boot support with memory clearing control and addition of kernel parameters - AP bus and zcrypt api code rework. Add adapter configure/deconfigure interface. Extend debug features. Add failure injection support - Add ECC secure private keys support - Add KASan support for running protected virtualization host with 4-level paging - Utilize destroy page ultravisor call to speed up secure guests shutdown - Implement ioremap_wc() and ioremap_prot() with MIO in PCI code - Various checksum improvements - Other small various fixes and improvements all over the code * tag 's390-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (85 commits) s390/uaccess: fix indentation s390/uaccess: add default cases for __put_user_fn()/__get_user_fn() s390/zcrypt: fix wrong format specifications s390/kprobes: move insn_page to text segment s390/sie: fix typo in SIGP code description s390/lib: fix kernel doc for memcmp() s390/zcrypt: Introduce Failure Injection feature s390/zcrypt: move ap_msg param one level up the call chain s390/ap/zcrypt: revisit ap and zcrypt error handling s390/ap: Support AP card SCLP config and deconfig operations s390/sclp: Add support for SCLP AP adapter config/deconfig s390/ap: add card/queue deconfig state s390/ap: add error response code field for ap queue devices s390/ap: split ap queue state machine state from device state s390/zcrypt: New config switch CONFIG_ZCRYPT_DEBUG s390/zcrypt: introduce msg tracking in zcrypt functions s390/startup: correct early pgm check info formatting s390: remove orphaned extern variables declarations s390/kasan: make sure int handler always run with DAT on s390/ipl: add support to control memory clearing for nvme re-IPL ...
2020-10-16lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILEVitor Massaru Iha1-4/+2
A build condition was missing around a compilation test, this compilation test comes from the original test_bitfield code. And removed unnecessary code for this test. Fixes: d2585f5164c2 ("lib: kunit: add bitfield test conversion to KUnit") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Vitor Massaru Iha <vitor@massaru.org> Link: https://lore.kernel.org/linux-next/20201015163056.56fcc835@canb.auug.org.au/ Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-10-16Merge tag 'powerpc-5.10-1' of ↵Linus Torvalds223-2491/+3245
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting it for powerpc, as well as a related fix for sparc. - Remove support for PowerPC 601. - Some fixes for watchpoints & addition of a new ptrace flag for detecting ISA v3.1 (Power10) watchpoint features. - A fix for kernels using 4K pages and the hash MMU on bare metal Power9 systems with > 16TB of RAM, or RAM on the 2nd node. - A basic idle driver for shallow stop states on Power10. - Tweaks to our sched domains code to better inform the scheduler about the hardware topology on Power9/10, where two SMT4 cores can be presented by firmware as an SMT8 core. - A series doing further reworks & cleanups of our EEH code. - Addition of a filter for RTAS (firmware) calls done via sys_rtas(), to prevent root from overwriting kernel memory. - Other smaller features, fixes & cleanups. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero, Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha, Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt, Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang Yingliang, zhengbin. * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits) Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed" selftests/powerpc: Fix eeh-basic.sh exit codes cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier powerpc/time: Make get_tb() common to PPC32 and PPC64 powerpc/time: Make get_tbl() common to PPC32 and PPC64 powerpc/time: Remove get_tbu() powerpc/time: Avoid using get_tbl() and get_tbu() internally powerpc/time: Make mftb() common to PPC32 and PPC64 powerpc/time: Rename mftbl() to mftb() powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S powerpc/32s: Rename head_32.S to head_book3s_32.S powerpc/32s: Setup the early hash table at all time. powerpc/time: Remove ifdef in get_dec() and set_dec() powerpc: Remove get_tb_or_rtc() powerpc: Remove __USE_RTC() powerpc: Tidy up a bit after removal of PowerPC 601. powerpc: Remove support for PowerPC 601 powerpc: Remove PowerPC 601 powerpc: Drop SYNC_601() ISYNC_601() and SYNC() powerpc: Remove CONFIG_PPC601_SYNC_FIX ...
2020-10-16svcrdma: fix bounce buffers for unaligned offsets and multiple pagesDan Aloni1-1/+2
This was discovered using O_DIRECT at the client side, with small unaligned file offsets or IOs that span multiple file pages. Fixes: e248aa7be86 ("svcrdma: Remove max_sge check at connect time") Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-16nfsd: remove unneeded breakTom Rix1-1/+0
Because every path through nfs4_find_file()'s switch does an explicit return, the break is not needed. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-16net/sunrpc: Fix return value for sysctl sunrpc.transportsArtur Molchanov1-1/+7
Fix returning value for sysctl sunrpc.transports. Return error code from sysctl proc_handler function proc_do_xprt instead of number of the written bytes. Otherwise sysctl returns random garbage for this key. Since v1: - Handle negative returned value from memory_read_from_buffer as an error Signed-off-by: Artur Molchanov <arturmolchanov@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-16Merge branch 'akpm' (patches from Andrew)Linus Torvalds161-1724/+2390
Merge more updates from Andrew Morton: "155 patches. Subsystems affected by this patch series: mm (dax, debug, thp, readahead, page-poison, util, memory-hotplug, zram, cleanups), misc, core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch, binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan, romfs, and fault-injection" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits) lib, uaccess: add failure injection to usercopy functions lib, include/linux: add usercopy failure capability ROMFS: support inode blocks calculation ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang sched.h: drop in_ubsan field when UBSAN is in trap mode scripts/gdb/tasks: add headers and improve spacing format scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command kernel/relay.c: drop unneeded initialization panic: dump registers on panic_on_warn rapidio: fix the missed put_device() for rio_mport_add_riodev rapidio: fix error handling path nilfs2: fix some kernel-doc warnings for nilfs2 autofs: harden ioctl table ramfs: fix nommu mmap with gaps in the page cache mm: remove the now-unnecessary mmget_still_valid() hack mm/gup: take mmap_lock in get_dump_page() binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot coredump: rework elf/elf_fdpic vma_dump_size() into common helper coredump: refactor page range dumping into common helper coredump: let dump_emit() bail out on short writes ...
2020-10-16lib, uaccess: add failure injection to usercopy functionsAlbert van der Linde4-2/+22
To test fault-tolerance of user memory access functions, introduce fault injection to usercopy functions. If a failure is expected return either -EFAULT or the total amount of bytes that were not copied. Signed-off-by: Albert van der Linde <alinde@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200831171733.955393-3-alinde@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16lib, include/linux: add usercopy failure capabilityAlbert van der Linde6-1/+76
Patch series "add fault injection to user memory access", v3. The goal of this series is to improve testing of fault-tolerance in usages of user memory access functions, by adding support for fault injection. syzkaller/syzbot are using the existing fault injection modes and will use this particular feature also. The first patch adds failure injection capability for usercopy functions. The second changes usercopy functions to use this new failure capability (copy_from_user, ...). The third patch adds get/put/clear_user failures to x86. This patch (of 3): Add a failure injection capability to improve testing of fault-tolerance in usages of user memory access functions. Add CONFIG_FAULT_INJECTION_USERCOPY to enable faults in usercopy functions. The should_fail_usercopy function is to be called by these functions (copy_from_user, get_user, ...) in order to fail or not. Signed-off-by: Albert van der Linde <alinde@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200831171733.955393-1-alinde@google.com Link: http://lkml.kernel.org/r/20200831171733.955393-2-alinde@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16ROMFS: support inode blocks calculationLibing Zhou1-0/+1
When use 'stat' tool to display file status, the 'Blocks' field always in '0', this is not good for tool 'du'(e.g.: busybox 'du'), it always output '0' size for the files under ROMFS since such tool calculates number of 512B Blocks. This patch calculates approx. number of 512B blocks based on inode size. Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200811052606.4243-1-libing.zhou@nokia-sbell.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for ClangGeorge Popescu2-1/+23
When the kernel is compiled with Clang, -fsanitize=bounds expands to -fsanitize=array-bounds and -fsanitize=local-bounds. Enabling -fsanitize=local-bounds with Clang has the unfortunate side-effect of inserting traps; this goes back to its original intent, which was as a hardening and not a debugging feature [1]. The same feature made its way into -fsanitize=bounds, but the traps remained. For that reason, -fsanitize=bounds was split into 'array-bounds' and 'local-bounds' [2]. Since 'local-bounds' doesn't behave like a normal sanitizer, enable it with Clang only if trapping behaviour was requested by CONFIG_UBSAN_TRAP=y. Add the UBSAN_BOUNDS_LOCAL config to Kconfig.ubsan to enable the 'local-bounds' option by default when UBSAN_TRAP is enabled. [1] http://lists.llvm.org/pipermail/llvm-dev/2012-May/049972.html [2] http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20131021/091536.html Suggested-by: Marco Elver <elver@google.com> Signed-off-by: George Popescu <georgepope@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Brazdil <dbrazdil@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lkml.kernel.org/r/20200922074330.2549523-1-georgepope@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16sched.h: drop in_ubsan field when UBSAN is in trap modeElena Petrova1-1/+1
in_ubsan field of task_struct is only used in lib/ubsan.c, which in its turn is used only `ifneq ($(CONFIG_UBSAN_TRAP),y)`. Removing unnecessary field from a task_struct will help preserve the ABI between vanilla and CONFIG_UBSAN_TRAP'ed kernels. In particular, this will help enabling bounds sanitizer transparently for Android's GKI. Signed-off-by: Elena Petrova <lenaptr@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Link: https://lkml.kernel.org/r/20200910134802.3160311-1-lenaptr@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16scripts/gdb/tasks: add headers and improve spacing formatRitesh Harjani1-4/+5
With the patch. <e.g. o/p> TASK PID COMM 0xffffffff82c2b8c0 0 swapper/0 0xffff888a0ba20040 1 systemd 0xffff888a0ba24040 2 kthreadd 0xffff888a0ba28040 3 rcu_gp w/o 0xffffffff82c2b8c0 <init_task> 0 swapper/0 0xffff888a0ba20040 1 systemd 0xffff888a0ba24040 2 kthreadd 0xffff888a0ba28040 3 rcu_gp Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Link: http://lkml.kernel.org/r/54c868c79b5fc364a8be7799891934a6fe6d1464.1597742951.git.riteshh@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts ↵Ritesh Harjani1-8/+7
command This is many times found useful while debugging some FS related issue. <e.g. output> mount super_block devname pathname fstype options 0xffff888a0bfa4b40 0xffff888a0bfc1000 none / rootfs rw 0 0 0xffff888a033f75c0 0xffff8889fcf65000 /dev/root / ext4 rw,relatime 0 0 0xffff8889fc8ce040 0xffff888a0bb51000 devtmpfs /dev devtmpfs rw,relatime 0 0 Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Link: http://lkml.kernel.org/r/a3c4177e1597b3e06d66d55e07d72c0c46a03571.1597742951.git.riteshh@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16kernel/relay.c: drop unneeded initializationSudip Mukherjee1-1/+1
The variable 'consumed' is initialized with the consumed count but immediately after that the consumed count is updated and assigned to 'consumed' again thus overwriting the previous value. So, drop the unneeded initialization. Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201005205727.1147-1-sudipm.mukherjee@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16panic: dump registers on panic_on_warnAlexey Kardashevskiy1-6/+6
Currently we print stack and registers for ordinary warnings but we do not for panic_on_warn which looks as oversight - panic() will reboot the machine but won't print registers. This moves printing of registers and modules earlier. This does not move the stack dumping as panic() dumps it. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Rafael Aquini <aquini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Link: https://lkml.kernel.org/r/20200804095054.68724-1-aik@ozlabs.ru Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16rapidio: fix the missed put_device() for rio_mport_add_riodevJing Xiangfeng1-1/+4
rio_mport_add_riodev() misses to call put_device() when the device already exists. Add the missed function call to fix it. Fixes: e8de370188d0 ("rapidio: add mport char device driver") Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Link: https://lkml.kernel.org/r/20200922072525.42330-1-jingxiangfeng@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>