aboutsummaryrefslogtreecommitdiffstats
path: root/fs/ext4/file.c
AgeCommit message (Collapse)AuthorFilesLines
4 daysMerge tag 'ext4_for_linus-6.10-rc1' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - more folio conversion patches - add support for FS_IOC_GETFSSYSFSPATH - mballoc cleaups and add more kunit tests - sysfs cleanups and bug fixes - miscellaneous bug fixes and cleanups * tag 'ext4_for_linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: fix error pointer dereference in ext4_mb_load_buddy_gfp() jbd2: add prefix 'jbd2' for 'shrink_type' jbd2: use shrink_type type instead of bool type for __jbd2_journal_clean_checkpoint_list() ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super() ext4: remove calls to to set/clear the folio error flag ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() jbd2: remove redundant assignement to variable err ext4: remove the redundant folio_wait_stable() ext4: fix potential unnitialized variable ext4: convert ac_buddy_page to ac_buddy_folio ext4: convert ac_bitmap_page to ac_bitmap_folio ext4: convert ext4_mb_init_cache() to take a folio ext4: convert bd_buddy_page to bd_buddy_folio ext4: convert bd_bitmap_page to bd_bitmap_folio ext4: open coding repeated check in next_linear_group ext4: use correct criteria name instead stale integer number in comment ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used ext4: keep "prefetch_grp" and "nr" consistent ...
2024-05-02ext4: replace deprecated strncpy with alternativesJustin Stitt1-2/+1
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. in file.c: s_last_mounted is marked as __nonstring meaning it does not need to be NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the string source to the byte array destination -- while also ensuring to pad with zeroes. in ioctl.c: We can drop the memset and size argument in favor of using the new 2-argument version of strscpy_pad() -- which was introduced with Commit e6584c3964f2f ("string: Allow 2-argument strscpy()"). This guarantees NUL-termination and NUL-padding on the destination buffer -- which seems to be a requirement judging from this comment: | static int ext4_ioctl_getlabel(struct ext4_sb_info *sbi, char __user *user_label) | { | char label[EXT4_LABEL_MAX + 1]; | | /* | * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because | * FSLABEL_MAX must include terminating null byte, while s_volume_name | * does not have to. | */ in super.c: s_first_error_func is marked as __nonstring meaning we can take the same approach as in file.c; just use strtomem_pad() Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02ext4: set FMODE_CAN_ODIRECT instead of a dummy direct_IO methodChristoph Hellwig1-1/+1
Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> [RH: Rebased to upstream] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/e5797bb597219a49043e53e4e90aa494b97dc328.1709215665.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner1-3/+3
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-18ext4: remove unnecessary parameter "needed" in ext4_discard_preallocationsKemeng Shi1-1/+1
The "needed" controls the number of ext4_prealloc_space to discard in ext4_discard_preallocations. Function ext4_discard_preallocations is supposed to discard all non-used preallocated blocks when "needed" is 0 and now ext4_discard_preallocations is always called with "needed" = 0. Remove unnecessary parameter "needed" and remove all non-used preallocated spaces in ext4_discard_preallocations to simplify the code. Note: If count of non-used preallocated spaces could be more than UINT_MAX, there was a memory leak as some non-used preallocated spaces are left ununsed and this commit will fix it. Otherwise, there is no behavior change. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-11-30ext4: fix warning in ext4_dio_write_end_io()Jan Kara1-6/+8
The syzbot has reported that it can hit the warning in ext4_dio_write_end_io() because i_size < i_disksize. Indeed the reproducer creates a race between DIO IO completion and truncate expanding the file and thus ext4_dio_write_end_io() sees an inconsistent inode state where i_disksize is already updated but i_size is not updated yet. Since we are careful when setting up DIO write and consider it extending (and thus performing the IO synchronously with i_rwsem held exclusively) whenever it goes past either of i_size or i_disksize, we can use the same test during IO completion without risking entering ext4_handle_inode_extension() without i_rwsem held. This way we make it obvious both i_size and i_disksize are large enough when we report DIO completion without relying on unreliable WARN_ON. Reported-by: <syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com> Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20231130095653.22679-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-10-31ext4: properly sync file size update after O_SYNC direct IOJan Kara1-88/+65
Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly sync file size update and thus if we crash at unfortunate moment, the file can have smaller size although O_SYNC IO has reported successful completion. The problem happens because update of on-disk inode size is handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus dio_complete() in particular) has returned and generic_file_sync() gets called by dio_complete(). Fix the problem by handling on-disk inode size update directly in our ->end_io completion handler. References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> CC: stable@vger.kernel.org Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20231013121350.26872-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-10-31ext4: fix racy may inline data check in dio writeBrian Foster1-7/+9
syzbot reports that the following warning from ext4_iomap_begin() triggers as of the commit referenced below: if (WARN_ON_ONCE(ext4_has_inline_data(inode))) return -ERANGE; This occurs during a dio write, which is never expected to encounter an inode with inline data. To enforce this behavior, ext4_dio_write_iter() checks the current inline state of the inode and clears the MAY_INLINE_DATA state flag to either fall back to buffered writes, or enforce that any other writers in progress on the inode are not allowed to create inline data. The problem is that the check for existing inline data and the state flag can span a lock cycle. For example, if the ilock is originally locked shared and subsequently upgraded to exclusive, another writer may have reacquired the lock and created inline data before the dio write task acquires the lock and proceeds. The commit referenced below loosens the lock requirements to allow some forms of unaligned dio writes to occur under shared lock, but AFAICT the inline data check was technically already racy for any dio write that would have involved a lock cycle. Regardless, lift clearing of the state bit to the same lock critical section that checks for preexisting inline data on the inode to close the race. Cc: stable@kernel.org Reported-by: syzbot+307da6ca5cb0d01d581a@syzkaller.appspotmail.com Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20231002185020.531537-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-31Merge tag 'ext4_for_linus-6.6-rc1' of ↵Linus Torvalds1-22/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many ext4 and jbd2 cleanups and bug fixes: - Cleanups in the ext4 remount code when going to and from read-only - Cleanups in ext4's multiblock allocator - Cleanups in the jbd2 setup/mounting code paths - Performance improvements when appending to a delayed allocation file - Miscellaneous syzbot and other bug fixes" * tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: fix slab-use-after-free in ext4_es_insert_extent() libfs: remove redundant checks of s_encoding ext4: remove redundant checks of s_encoding ext4: reject casefold inode flag without casefold feature ext4: use LIST_HEAD() to initialize the list_head in mballoc.c ext4: do not mark inode dirty every time when appending using delalloc ext4: rename s_error_work to s_sb_upd_work ext4: add periodic superblock update check ext4: drop dio overwrite only flag and associated warning ext4: add correct group descriptors and reserved GDT blocks to system zone ext4: remove unused function declaration ext4: mballoc: avoid garbage value from err ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() ext4: change the type of blocksize in ext4_mb_init_cache() ext4: fix unttached inode after power cut with orphan file feature enabled jbd2: correct the end of the journal recovery scan range ext4: ext4_get_{dev}_journal return proper error value ext4: cleanup ext4_get_dev_journal() and ext4_get_journal() jbd2: jbd2_journal_init_{dev,inode} return proper error return value jbd2: drop useless error tag in jbd2_journal_wipe() ...
2023-08-27ext4: drop dio overwrite only flag and associated warningBrian Foster1-15/+10
The commit referenced below opened up concurrent unaligned dio under shared locking for pure overwrites. In doing so, it enabled use of the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected -EAGAIN returns as an extra precaution, since ext4 does not retry writes in such cases. The flag itself is advisory in this case since ext4 checks for unaligned I/Os and uses appropriate locking up front, rather than on a retry in response to -EAGAIN. As it turns out, the warning check is susceptible to false positives because there are scenarios where -EAGAIN can be expected from lower layers without necessarily having IOCB_NOWAIT set on the iocb. For example, one instance of the warning has been seen where io_uring sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on the bio. This results in -EAGAIN if the block layer is unable to allocate a request, etc. [Note that there is an outstanding patch to untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on IOCB_NOWAIT, which would also address this instance of the warning.] Another instance of the warning has been reproduced by syzbot. A dio write is interrupted down in __get_user_pages_locked() waiting on the mm lock and returns -EAGAIN up the stack. If the iomap dio iteration layer has made no progress on the write to this point, -EAGAIN returns up to the filesystem and triggers the warning. This use of the overwrite flag in ext4 is precautionary and half-baked. I.e., ext4 doesn't actually implement overwrite checking in the iomap callbacks when the flag is set, so the only extra verification it provides are i_size checks in the generic iomap dio layer. Combined with the tendency for false positives, the added verification is not worth the extra trouble. Remove the flag, associated warning, and update the comments to document when concurrent unaligned dio writes are allowed and why said flag is not used. Cc: stable@kernel.org Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com Reported-by: Pengfei Xu <pengfei.xu@intel.com> Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-24mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)1-6/+5
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-29ext4: make ext4_forced_shutdown() take struct super_blockJan Kara1-7/+6
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-29Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-40/+46
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various cleanups and bug fixes in ext4's extent status tree, journalling, and block allocator subsystems. Also improve performance for parallel DIO overwrites" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (55 commits) ext4: avoid updating the superblock on a r/o mount if not needed jbd2: skip reading super block if it has been verified ext4: fix to check return value of freeze_bdev() in ext4_shutdown() ext4: refactoring to use the unified helper ext4_quotas_off() ext4: turn quotas off if mount failed after enabling quotas ext4: update doc about journal superblock description ext4: add journal cycled recording support jbd2: continue to record log between each mount jbd2: remove j_format_version jbd2: factor out journal initialization from journal_get_superblock() jbd2: switch to check format version in superblock directly jbd2: remove unused feature macros ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev' ext4: Fix reusing stale buffer heads from last failed mounting ext4: allow concurrent unaligned dio overwrites ext4: clean up mballoc criteria comments ext4: make ext4_zeroout_es() return void ext4: make ext4_es_insert_extent() return void ext4: make ext4_es_insert_delayed_block() return void ext4: make ext4_es_remove_extent() return void ...
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-8/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26ext4: allow concurrent unaligned dio overwritesBrian Foster1-40/+46
We've had reports of significant performance regression of sub-block (unaligned) direct writes due to the added exclusivity restrictions in ext4. The purpose of the exclusivity requirement for unaligned direct writes is to avoid data corruption caused by unserialized partial block zeroing in the iomap dio layer across overlapping writes. XFS has similar requirements for the same underlying reasons, yet doesn't suffer the extreme performance regression that ext4 does. The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY mode, which allows for optimistic submission of concurrent unaligned I/O and kicks back writes that require partial block zeroing such that they can be submitted in a safe, exclusive context. Since ext4 already performs most of these checks pre-submission, it can support something similar without necessarily relying on the iomap flag and associated retry mechanism. Update the dio write submission path to allow concurrent submission of unaligned direct writes that are purely overwrite and so will not require block zeroing. To improve readability of the various related checks, move the unaligned I/O handling down into ext4_dio_write_checks(), where the dio draining and force wait logic can immediately follow the locking requirement checks. Finally, the IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a precaution should the ext4 overwrite logic ever become inconsistent with the zeroing expectations of iomap dio. The performance improvement of sub-block direct write I/O is shown in the following fio test on a 64xcpu guest vm: Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting --overwrite=1 --thread --size=10G --filename=/mnt/fio --readwrite=write --ramp_time=10s --runtime=60s --numjobs=8 --blocksize=2k --iodepth=256 --allow_file_create=0 v6.2: write: IOPS=4328, BW=8724KiB/s v6.2 (patched): write: IOPS=801k, BW=1565MiB/s Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-09filemap: update ki_pos in generic_perform_writeChristoph Hellwig1-6/+3
All callers of generic_perform_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09backing_dev: remove current->backing_dev_infoChristoph Hellwig1-2/+0
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-24splice: Use filemap_splice_read() instead of generic_file_splice_read()David Howells1-1/+1
Replace pointers to generic_file_splice_read() with calls to filemap_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24ext4: Provide a splice-read wrapperDavid Howells1-1/+12
Provide a splice_read wrapper for Ext4. This does the inode shutdown check before proceeding. Splicing from DAX files and O_DIRECT fds is handled by the caller. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Theodore Ts'o <tytso@mit.edu> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Andreas Dilger <adilger.kernel@dilger.ca> cc: linux-ext4@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-19-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03fs: add FMODE_DIO_PARALLEL_WRITE flagJens Axboe1-1/+2
Some filesystems support multiple threads writing to the same file with O_DIRECT without requiring exclusive access to it. io_uring can use this hint to avoid serializing dio writes to this inode, instead allowing them to run in parallel. XFS and ext4 both fall into this category, so set the flag for both of them. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-28Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-12/+22
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improve performance for ext4 by allowing multiple process to perform direct I/O writes to preallocated blocks by using a shared inode lock instead of taking an exclusive lock. In addition, multiple bug fixes and cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix incorrect options show of original mount_opt and extend mount_opt2 ext4: Fix possible corruption when moving a directory ext4: init error handle resource before init group descriptors ext4: fix task hung in ext4_xattr_delete_inode jbd2: fix data missing when reusing bh which is ready to be checkpointed ext4: update s_journal_inum if it changes after journal replay ext4: fail ext4_iget if special inode unallocated ext4: fix function prototype mismatch for ext4_feat_ktype ext4: remove unnecessary variable initialization ext4: fix inode tree inconsistency caused by ENOMEM ext4: refuse to create ea block when umounted ext4: optimize ea_inode block expansion ext4: remove dead code in updating backup sb ext4: dio take shared inode lock when overwriting preallocated blocks ext4: don't show commit interval if it is zero ext4: use ext4_fc_tl_mem in fast-commit replay path ext4: improve xattr consistency checking and error reporting
2023-02-14ext4: dio take shared inode lock when overwriting preallocated blocksZhang Yi1-12/+22
In the dio write path, we only take shared inode lock for the case of aligned overwriting initialized blocks inside EOF. But for overwriting preallocated blocks, it may only need to split unwritten extents, this procedure has been protected under i_data_sem lock, it's safe to release the exclusive inode lock and take shared inode lock. This could give a significant speed up for multi-threaded writes. Test on Intel Xeon Gold 6140 and nvme SSD with below fio parameters. direct=1 ioengine=libaio iodepth=10 numjobs=10 runtime=60 rw=randwrite size=100G And the test result are: Before: bs=4k IOPS=11.1k, BW=43.2MiB/s bs=16k IOPS=11.1k, BW=173MiB/s bs=64k IOPS=11.2k, BW=697MiB/s After: bs=4k IOPS=41.4k, BW=162MiB/s bs=16k IOPS=41.3k, BW=646MiB/s bs=64k IOPS=13.5k, BW=843MiB/s Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221226062015.3479416-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-09mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan1-1/+1
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20fs: rename current get acl methodChristian Brauner1-1/+1
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-06Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first two changes involve files outside of fs/ext4: - submit_bh() can never return an error, so change it to return void, and remove the unused checks from its callers - fix I_DIRTY_TIME handling so it will be set even if the inode already has I_DIRTY_INODE Performance: - Always enable i_version counter (as btrfs and xfs already do). Remove some uneeded i_version bumps to avoid unnecessary nfs cache invalidations - Wake up journal waiters in FIFO order, to avoid some journal users from not getting a journal handle for an unfairly long time - In ext4_write_begin() allocate any necessary buffer heads before starting the journal handle - Don't try to prefetch the block allocation bitmaps for a read-only file system Bug Fixes: - Fix a number of fast commit bugs, including resources leaks and out of bound references in various error handling paths and/or if the fast commit log is corrupted - Avoid stopping the online resize early when expanding a file system which is less than 16TiB to a size greater than 16TiB - Fix apparent metadata corruption caused by a race with a metadata buffer head getting migrated while it was trying to be read - Mark the lazy initialization thread freezable to prevent suspend failures - Other miscellaneous bug fixes Cleanups: - Break up the incredibly long ext4_full_super() function by refactoring to move code into more understandable, smaller functions - Remove the deprecated (and ignored) noacl and nouser_attr mount option - Factor out some common code in fast commit handling - Other miscellaneous cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits) ext4: fix potential out of bound read in ext4_fc_replay_scan() ext4: factor out ext4_fc_get_tl() ext4: introduce EXT4_FC_TAG_BASE_LEN helper ext4: factor out ext4_free_ext_path() ext4: remove unnecessary drop path references in mext_check_coverage() ext4: update 'state->fc_regions_size' after successful memory allocation ext4: fix potential memory leak in ext4_fc_record_regions() ext4: fix potential memory leak in ext4_fc_record_modified_inode() ext4: remove redundant checking in ext4_ioctl_checkpoint jbd2: add miss release buffer head in fc_do_one_pass() ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts() ext4: remove useless local variable 'blocksize' ext4: unify the ext4 super block loading operation ext4: factor out ext4_journal_data_mode_check() ext4: factor out ext4_load_and_init_journal() ext4: factor out ext4_group_desc_init() and ext4_group_desc_free() ext4: factor out ext4_geometry_check() ext4: factor out ext4_check_feature_compatibility() ext4: factor out ext4_init_metadata_csum() ext4: factor out ext4_encoding_init() ...
2022-09-29ext4: avoid crash when inline data creation follows DIO writeJan Kara1-0/+6
When inode is created and written to using direct IO, there is nothing to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets truncated later to say 1 byte and written using normal write, we will try to store the data as inline data. This confuses the code later because the inode now has both normal block and inline data allocated and the confusion manifests for example as: kernel BUG at fs/ext4/inode.c:2721! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 359 Comm: repro Not tainted 5.19.0-rc8-00001-g31ba1e3b8305-dirty #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:ext4_writepages+0x363d/0x3660 RSP: 0018:ffffc90000ccf260 EFLAGS: 00010293 RAX: ffffffff81e1abcd RBX: 0000008000000000 RCX: ffff88810842a180 RDX: 0000000000000000 RSI: 0000008000000000 RDI: 0000000000000000 RBP: ffffc90000ccf650 R08: ffffffff81e17d58 R09: ffffed10222c680b R10: dfffe910222c680c R11: 1ffff110222c680a R12: ffff888111634128 R13: ffffc90000ccf880 R14: 0000008410000000 R15: 0000000000000001 FS: 00007f72635d2640(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565243379180 CR3: 000000010aa74000 CR4: 0000000000150eb0 Call Trace: <TASK> do_writepages+0x397/0x640 filemap_fdatawrite_wbc+0x151/0x1b0 file_write_and_wait_range+0x1c9/0x2b0 ext4_sync_file+0x19e/0xa00 vfs_fsync_range+0x17b/0x190 ext4_buffered_write_iter+0x488/0x530 ext4_file_write_iter+0x449/0x1b90 vfs_write+0xbcd/0xf40 ksys_write+0x198/0x2c0 __x64_sys_write+0x7b/0x90 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fix the problem by clearing EXT4_STATE_MAY_INLINE_DATA when we are doing direct IO write to a file. Cc: stable@kernel.org Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+bd13648a53ed6933ca49@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Tested-by: Tadeusz Struk<tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220727155753.13969-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-11ext4: support STATX_DIOALIGNEric Biggers1-16/+26
Add support for STATX_DIOALIGN to ext4, so that direct I/O alignment restrictions are exposed to userspace in a generic way. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220827065851.135710-5-ebiggers@kernel.org
2022-09-11fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGNEric Biggers1-2/+7
To prepare for STATX_DIOALIGN support, make two changes to fscrypt_dio_supported(). First, remove the filesystem-block-alignment check and make the filesystems handle it instead. It previously made sense to have it in fs/crypto/; however, to support STATX_DIOALIGN the alignment restriction would have to be returned to filesystems. It ends up being simpler if filesystems handle this part themselves, especially for f2fs which only allows fs-block-aligned DIO in the first place. Second, make fscrypt_dio_supported() work on inodes whose encryption key hasn't been set up yet, by making it set up the key if needed. This is required for statx(), since statx() doesn't require a file descriptor. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220827065851.135710-4-ebiggers@kernel.org
2022-05-16iomap: add per-iomap_iter private dataChristoph Hellwig1-2/+2
Allow the file system to keep state for all iterations. For now only wire it up for direct I/O as there is an immediate need for it there. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-01fs: Pass an iocb to generic_perform_write()Matthew Wilcox (Oracle)1-1/+1
We can extract both the file pointer and the pos from the iocb. This simplifies each caller as well as allowing generic_perform_write() to see more of the iocb contents in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-02-08ext4: support direct I/O with fscrypt using blk-cryptoEric Biggers1-4/+6
Encrypted files traditionally haven't supported DIO, due to the need to encrypt/decrypt the data. However, when the encryption is implemented using inline encryption (blk-crypto) instead of the traditional filesystem-layer encryption, it is straightforward to support DIO. Therefore, make ext4 support DIO on files that are using inline encryption. Since ext4 uses iomap for DIO, and fscrypt support was already added to iomap DIO, this just requires two small changes: - Let DIO proceed when supported, by checking fscrypt_dio_supported() instead of assuming that encrypted files never support DIO. - In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the length of the mapping in the rare case where a DUN discontiguity occurs in the middle of an extent. The iomap DIO implementation requires this, since it assumes that it can submit a bio covering (up to) the whole mapping, without checking fscrypt constraints itself. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-23ext4: use ext4_journal_start/stop for fast commit transactionsHarshad Shirwadkar1-4/+0
This patch drops all calls to ext4_fc_start_update() and ext4_fc_stop_update(). To ensure that there are no ongoing journal updates during fast commit, we also make jbd2_fc_begin_commit() lock journal for updates. This way we don't have to maintain two different transaction start stop APIs for fast commit and full commit. This patch doesn't remove the functions altogether since in future we want to have inode level locking for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-11-02Merge tag 'gfs2-v5.15-rc5-mmap-fault' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher: "Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic deadlock scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled" * tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix mmap + page fault deadlocks for direct I/O iov_iter: Introduce nofault flag to disable page faults gup: Introduce FOLL_NOFAULT flag to disable page faults iomap: Add done_before argument to iomap_dio_rw iomap: Support partial direct I/O on user copy failures iomap: Fix iomap_dio_rw return value for user copies gfs2: Fix mmap + page fault deadlocks for buffered I/O gfs2: Eliminate ip->i_gh gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Introduce flag for glock holder auto-demotion gfs2: Clean up function may_grant gfs2: Add wrapper for iomap_file_buffered_write iov_iter: Introduce fault_in_iov_iter_writeable iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} powerpc/kvm: Fix kvm_use_magic_page iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-10-24iomap: Add done_before argument to iomap_dio_rwAndreas Gruenbacher1-2/+3
Add a done_before argument to iomap_dio_rw that indicates how much of the request has already been transferred. When the request succeeds, we report that done_before additional bytes were tranferred. This is useful for finishing a request asynchronously when part of the request has already been completed synchronously. We'll use that to allow iomap_dio_rw to be used with page faults disabled: when a page fault occurs while submitting a request, we synchronously complete the part of the request that has already been submitted. The caller can then take care of the page fault and call iomap_dio_rw again for the rest of the request, passing in the number of bytes already tranferred. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-18block: switch polling to be bio basedChristoph Hellwig1-1/+1
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "In addition to some ext4 bug fixes and cleanups, this cycle we add the orphan_file feature, which eliminates bottlenecks when doing a large number of parallel truncates and file deletions, and move the discard operation out of the jbd2 commit thread when using the discard mount option, to better support devices with slow discard operations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: make the updating inode data procedure atomic ext4: remove an unnecessary if statement in __ext4_get_inode_loc() ext4: move inode eio simulation behind io completeion ext4: Improve scalability of ext4 orphan file handling ext4: Orphan file documentation ext4: Speedup ext4 orphan inode handling ext4: Move orphan inode handling into a separate file ext4: Support for checksumming from journal triggers ext4: fix race writing to an inline_data file while its xattrs are changing jbd2: add sparse annotations for add_transaction_credits() ext4: fix sparse warnings ext4: Make sure quota files are not grabbed accidentally ext4: fix e2fsprogs checksum failure for mounted filesystem ext4: if zeroout fails fall back to splitting the extent node ext4: reduce arguments of ext4_fc_add_dentry_tlv ext4: flush background discard kwork when retry allocation ext4: get discard out of jbd2 commit kthread contex ext4: remove the repeated comment of ext4_trim_all_free ext4: add new helper interface ext4_try_to_trim_range() ext4: remove the 'group' parameter of ext4_trim_extent ...
2021-08-30ext4: Support for checksumming from journal triggersJan Kara1-1/+2
JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-07-13ext4: Convert to use mapping->invalidate_lockJan Kara1-6/+7
Convert ext4 to use mapping->invalidate_lock instead of its private EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: <linux-ext4@vger.kernel.org> CC: Ted Tso <tytso@mit.edu> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2021-04-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-4/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New features for ext4 this cycle include support for encrypted casefold, ensure that deleted file names are cleared in directory blocks by zeroing directory entries when they are unlinked or moved as part of a hash tree node split. We also improve the block allocator's performance on a freshly mounted file system by prefetching block bitmaps. There are also the usual cleanups and bug fixes, including fixing a page cache invalidation race when there is mixed buffered and direct I/O and the block size is less than page size, and allow the dax flag to be set and cleared on inline directories" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits) ext4: wipe ext4_dir_entry2 upon file deletion ext4: Fix occasional generic/418 failure fs: fix reporting supported extra file attributes for statx() ext4: allow the dax flag to be set and cleared on inline directories ext4: fix debug format string warning ext4: fix trailing whitespace ext4: fix various seppling typos ext4: fix error return code in ext4_fc_perform_commit() ext4: annotate data race in jbd2_journal_dirty_metadata() ext4: annotate data race in start_this_handle() ext4: fix ext4_error_err save negative errno into superblock ext4: fix error code in ext4_commit_super ext4: always panic when errors=panic is specified ext4: delete redundant uptodate check for buffer ext4: do not set SB_ACTIVE in ext4_orphan_cleanup() ext4: make prefetch_block_bitmaps default ext4: add proc files to monitor new structures ext4: improve cr 0 / cr 1 group scanning ext4: add MB_NUM_ORDERS macro ext4: add mballoc stats proc file ...
2021-04-22ext4: Fix occasional generic/418 failureJan Kara1-4/+21
Eric has noticed that after pagecache read rework, generic/418 is occasionally failing for ext4 when blocksize < pagesize. In fact, the pagecache rework just made hard to hit race in ext4 more likely. The problem is that since ext4 conversion of direct IO writes to iomap framework (commit 378f32bab371), we update inode size after direct IO write only after invalidating page cache. Thus if buffered read sneaks at unfortunate moment like: CPU1 - write at offset 1k CPU2 - read from offset 0 iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); ext4_readpage(); ext4_handle_inode_extension() the read will zero out tail of the page as it still sees smaller inode size and thus page cache becomes inconsistent with on-disk contents with all the consequences. Fix the problem by moving inode size update into end_io handler which gets called before the page cache is invalidated. Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20210415155417.4734-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-12ext4: convert to fileattrMiklos Szeredi1-0/+2
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu>
2021-01-23iomap: pass a flags argument to iomap_dio_rwChristoph Hellwig1-3/+2
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2020-12-22ext4: don't leak old mountpoint samplesTheodore Ts'o1-1/+1
When the first file is opened, ext4 samples the mountpoint of the filesystem in 64 bytes of the super block. It does so using strlcpy(), this means that the remaining bytes in the super block string buffer are untouched. If the mount point before had a longer path than the current one, it can be reconstructed. Consider the case where the fs was mounted to "/media/johnjdeveloper" and later to "/". The super block buffer then contains "/\x00edia/johnjdeveloper". This case was seen in the wild and caused confusion how the name of a developer ands up on the super block of a filesystem used in production... Fix this by using strncpy() instead of strlcpy(). The superblock field is defined to be a fixed-size char array, and it is already marked using __nonstring in fs/ext4/ext4.h. The consumer of the field in e2fsprogs already assumes that in the case of a 64+ byte mount path, that s_last_mounted will not be NUL terminated. Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-12-22ext4: drop ext4_handle_dirty_super()Jan Kara1-1/+1
The wrapper is now useless since it does what ext4_handle_dirty_metadata() does. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: protect superblock modifications with a buffer lockJan Kara1-0/+3
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: make s_mount_flags modifications atomicHarshad Shirwadkar1-2/+2
Fast commit file system states are recorded in sbi->s_mount_flags. Fast commit expects these bit manipulations to be atomic. This patch adds helpers to make those modifications atomic. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: remove unnecessary fast commit calls from ext4_file_mmapHarshad Shirwadkar1-2/+0
Remove unnecessary calls to ext4_fc_start_update() and ext4_fc_stop_update() from ext4_file_mmap(). Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-17-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: main fast-commit commit pathHarshad Shirwadkar1-2/+8
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: flag as supporting buffered async readsJens Axboe1-1/+1
ext4 uses generic_file_read_iter(), which already supports this. Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/fb90cc2d-b12c-738f-21a4-dd7a8ae0556a@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improvements to ext4's block allocator performance for very large file systems, especially when the file system or files which are highly fragmented. There is a new mount option, prefetch_block_bitmaps which will pull in the block bitmaps and set up the in-memory buddy bitmaps when the file system is initially mounted. Beyond that, a lot of bug fixes and cleanups. In particular, a number of changes to make ext4 more robust in the face of write errors or file system corruptions" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: limit the length of per-inode prealloc list ext4: reorganize if statement of ext4_mb_release_context() ext4: add mb_debug logging when there are lost chunks ext4: Fix comment typo "the the". jbd2: clean up checksum verification in do_one_pass() ext4: change to use fallthrough macro ext4: remove unused parameter of ext4_generic_delete_entry function mballoc: replace seq_printf with seq_puts ext4: optimize the implementation of ext4_mb_good_group() ext4: delete invalid comments near ext4_mb_check_limits() ext4: fix typos in ext4_mb_regular_allocator() comment ext4: fix checking of directory entry validity for inline directories fs: prevent BUG_ON in submit_bh_wbc() ext4: correctly restore system zone info when remount fails ext4: handle add_system_zone() failure in ext4_setup_system_zone() ext4: fold ext4_data_block_valid_rcu() into the caller ext4: check journal inode extents more carefully ext4: don't allow overlapping system zones ext4: handle error of ext4_setup_system_zone() on remount ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp() ...
2020-08-19ext4: limit the length of per-inode prealloc listbrookxu1-1/+1
In the scenario of writing sparse files, the per-inode prealloc list may be very long, resulting in high overhead for ext4_mb_use_preallocated(). To circumvent this problem, we limit the maximum length of per-inode prealloc list to 512 and allow users to modify it. After patching, we observed that the sys ratio of cpu has dropped, and the system throughput has increased significantly. We created a process to write the sparse file, and the running time of the process on the fixed kernel was significantly reduced, as follows: Running time on unfixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m2.051s user 0m0.008s sys 0m2.026s Running time on fixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m0.471s user 0m0.004s sys 0m0.395s Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06ext4: do not block RWF_NOWAIT dio write on unallocated spaceJan Kara1-0/+4
Since commit 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") we don't properly bail out of RWF_NOWAIT direct IO write if underlying blocks are not allocated. Also ext4_dio_write_checks() does not honor RWF_NOWAIT when re-acquiring i_rwsem. Fix both issues. Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Cc: stable@kernel.org Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200708153516.9507-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06ext4: fix coding style in file.cDio Putra1-3/+2
Fixed a few coding style issues in file.c Signed-off-by: Dio Putra <dioput12@gmail.com> Link: https://lore.kernel.org/r/239fcd8f-d33f-8621-9e82-0416dd3f9c94@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-05iomap: fall back to buffered writes for invalidation failuresChristoph Hellwig1-0/+2
Failing to invalid the page cache means data in incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03ext4: don't block for O_DIRECT if IOCB_NOWAIT is setJens Axboe1-0/+6
Running with some debug patches to detect illegal blocking triggered the extend/unaligned condition in ext4. If ext4 needs to extend the file (and hence go to buffered IO), or if the app is doing unaligned IO, then ext4 asks the iomap code to wait for IO completion. If the caller asked for no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN instead. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: handle ext4_mark_inode_dirty errorsHarshad Shirwadkar1-2/+9
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return value may lead ext4 to ignore real failures that would result in corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail as soon as possible and return errors to the caller whenever appropriate. One of the possible scnearios when this bug could affected is that while creating a new inode, its directory entry gets added successfully but while writing the inode itself mark_inode_dirty returns error which is ignored. This would result in inconsistency that the directory entry points to a non-existent inode. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-05ext4: start to support iopoll methodXiaoguang Wang1-0/+1
Since commit "b1b4705d54ab ext4: introduce direct I/O read using iomap infrastructure", we can easily make ext4 support iopoll method, just use iomap_dio_iopoll(). Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Link: https://lore.kernel.org/r/20200207120758.2411-1-xiaoguang.wang@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: Optimize ext4 DIO overwritesJan Kara1-1/+4
Currently we start transaction for mapping every extent for writing using direct IO. This is unnecessary when we know we are overwriting already allocated blocks and the overhead of starting a transaction can be significant especially for multithreaded workloads doing small writes. Use iomap operations that avoid starting a transaction for direct IO overwrites. This improves throughput of 4k random writes - fio jobfile: [global] rw=randrw norandommap=1 invalidate=0 bs=4k numjobs=16 time_based=1 ramp_time=30 runtime=120 group_reporting=1 ioengine=psync direct=1 size=16G filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31 file_service_type=random nrfiles=32 from 3018MB/s to 4059MB/s in my test VM running test against simulated pmem device (note that before iomap conversion, this workload was able to achieve 3708MB/s because old direct IO path avoided transaction start for overwrites as well). For dax, the win is even larger improving throughput from 3042MB/s to 4311MB/s. Reported-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: Move to shared i_rwsem even without dioread_nolock mount optRitesh Harjani1-6/+3
We were using shared locking only in case of dioread_nolock mount option in case of DIO overwrites. This mount condition is not needed anymore with current code, since:- 1. No race between buffered writes & DIO overwrites. Since buffIO writes takes exclusive lock & DIO overwrites will take shared locking. Also DIO path will make sure to flush and wait for any dirty page cache data. 2. No race between buffered reads & DIO overwrites, since there is no block allocation that is possible with DIO overwrites. So no stale data exposure should happen. Same is the case between DIO reads & DIO overwrites. 3. Also other paths like truncate is protected, since we wait there for any DIO in flight to be over. Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: Start with shared i_rwsem in case of DIO instead of exclusiveRitesh Harjani1-49/+142
Earlier there was no shared lock in DIO read path. But this patch (16c54688592ce: ext4: Allow parallel DIO reads) simplified some of the locking mechanism while still allowing for parallel DIO reads by adding shared lock in inode DIO read path. But this created problem with mixed read/write workload. It is due to the fact that in DIO path, we first start with exclusive lock and only when we determine that it is a ovewrite IO, we downgrade the lock. This causes the problem, since we still have shared locking in DIO reads. So, this patch tries to fix this issue by starting with shared lock and then switching to exclusive lock only when required based on ext4_dio_write_checks(). Other than that, it also simplifies below cases:- 1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was abused in the sense that it was not really checking for AIO anywhere also it used to check for extending writes. So this API was renamed and simplified to ext4_unaligned_io() which actully only checks if the IO is really unaligned. Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial block and that will require serialization against other direct IOs in the same block. So we take a exclusive inode lock for any unaligned DIO. In case of AIO we also need to wait for any outstanding IOs to complete so that conversion from unwritten to written is completed before anyone try to map the overlapping block. Hence we take exclusive inode lock and also wait for inode_dio_wait() for unaligned DIO case. Please note since we are anyway taking an exclusive lock in unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO. 2. Added ext4_extending_io(). This checks if the IO is extending the file. 3. Added ext4_dio_write_checks(). In this we start with shared inode lock and only switch to exclusive lock if required. So in most cases with aligned, non-extending, dioread_nolock & overwrites, it tries to write with a shared lock. If not, then we restart the operation in ext4_dio_write_checks(), after acquiring exclusive lock. Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAITRitesh Harjani1-4/+6
Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: introduce direct I/O write using iomap infrastructureMatthew Bobrowski1-72/+174
This patch introduces a new direct I/O write path which makes use of the iomap infrastructure. All direct I/O writes are now passed from the ->write_iter() callback through to the new direct I/O handler ext4_dio_write_iter(). This function is responsible for calling into the iomap infrastructure via iomap_dio_rw(). Code snippets from the existing direct I/O write code within ext4_file_write_iter() such as, checking whether the I/O request is unaligned asynchronous I/O, or whether the write will result in an overwrite have effectively been moved out and into the new direct I/O ->write_iter() handler. The block mapping flags that are eventually passed down to ext4_map_blocks() from the *_get_block_*() suite of routines have been taken out and introduced within ext4_iomap_alloc(). For inode extension cases, ext4_handle_inode_extension() is effectively the function responsible for performing such metadata updates. This is called after iomap_dio_rw() has returned so that we can safely determine whether we need to potentially truncate any allocated blocks that may have been prepared for this direct I/O write. We don't perform the inode extension, or truncate operations from the ->end_io() handler as we don't have the original I/O 'length' available there. The ->end_io() however is responsible fo converting allocated unwritten extents to written extents. In the instance of a short write, we fallback and complete the remainder of the I/O using buffered I/O via ext4_buffered_write_iter(). The existing buffer_head direct I/O implementation has been removed as it's now redundant. [ Fix up ext4_dio_write_iter() per Jan's comments at https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ] Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: move inode extension check out from ext4_iomap_alloc()Matthew Bobrowski1-1/+23
Lift the inode extension/orphan list handling code out from ext4_iomap_alloc() and apply it within the ext4_dax_write_iter(). Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/fd5c84db25d5d0da87d97ed4c36fd844f57da759.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: move inode extension/truncate code out from ->iomap_end() callbackMatthew Bobrowski1-1/+88
In preparation for implementing the iomap direct I/O modifications, the inode extension/truncate code needs to be moved out from the ext4_iomap_end() callback. For direct I/O, if the current code remained, it would behave incorrrectly. Updating the inode size prior to converting unwritten extents would potentially allow a racing direct I/O read to find unwritten extents before being converted correctly. The inode extension/truncate code now resides within a new helper ext4_handle_inode_extension(). This function has been designed so that it can accommodate for both DAX and direct I/O extension/truncate operations. Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d41ffa26e20b15b12895812c3cad7c91a6a59bc6.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: introduce direct I/O read using iomap infrastructureMatthew Bobrowski1-2/+53
This patch introduces a new direct I/O read path which makes use of the iomap infrastructure. The new function ext4_do_read_iter() is responsible for calling into the iomap infrastructure via iomap_dio_rw(). If the read operation performed on the inode is not supported, which is checked via ext4_dio_supported(), then we simply fallback and complete the I/O using buffered I/O. Existing direct I/O read code path has been removed, as it is now redundant. Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/f98a6f73fadddbfbad0fc5ed04f712ca0b799f37.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05ext4: introduce new callback for IOMAP_REPORTMatthew Bobrowski1-2/+4
As part of the ext4_iomap_begin() cleanups that precede this patch, we also split up the IOMAP_REPORT branch into a completely separate ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the raionale for this change is to reduce the overall clutter within ext4_iomap_begin(). Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Added new ext4 debugging ioctls to allow userspace to get information about the state of the extent status cache. Dropped workaround for pre-1970 dates which were encoded incorrectly in pre-4.4 kernels. Since both the kernel correctly generates, and e2fsck detects and fixes this issue for the past four years, it'e time to drop the workaround. (Also, it's not like files with dates in the distant past were all that common in the first place.) A lot of miscellaneous bug fixes and cleanups, including some ext4 Documentation fixes. Also included are two minor bug fixes in fs/unicode" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) unicode: make array 'token' static const, makes object smaller unicode: Move static keyword to the front of declarations ext4: add missing bigalloc documentation. ext4: fix kernel oops caused by spurious casefold flag ext4: fix integer overflow when calculating commit interval ext4: use percpu_counters for extent_status cache hits/misses ext4: fix potential use after free after remounting with noblock_validity jbd2: add missing tracepoint for reserved handle ext4: fix punch hole for inline_data file systems ext4: rework reserved cluster accounting when invalidating pages ext4: documentation fixes ext4: treat buffers with write errors as containing valid data ext4: fix warning inside ext4_convert_unwritten_extents_endio ext4: set error return correctly when ext4_htree_store_dirent fails ext4: drop legacy pre-1970 encoding workaround ext4: add new ioctl EXT4_IOC_GET_ES_CACHE ext4: add a new ioctl EXT4_IOC_GETSTATE ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE jbd2: flush_descriptor(): Do not decrease buffer head's ref count ext4: remove unnecessary error check ...
2019-08-12ext4: add basic fs-verity supportEric Biggers1-0/+4
Add most of fs-verity support to ext4. fs-verity is a filesystem feature that enables transparent integrity protection and authentication of read-only files. It uses a dm-verity like mechanism at the file level: a Merkle tree is used to verify any block in the file in log(filesize) time. It is implemented mainly by helper functions in fs/verity/. See Documentation/filesystems/fsverity.rst for the full documentation. This commit adds all of ext4 fs-verity support except for the actual data verification, including: - Adding a filesystem feature flag and an inode flag for fs-verity. - Implementing the fsverity_operations to support enabling verity on an inode and reading/writing the verity metadata. - Updating ->write_begin(), ->write_end(), and ->writepages() to support writing verity metadata pages. - Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl(). ext4 stores the verity metadata (Merkle tree and fsverity_descriptor) past the end of the file, starting at the first 64K boundary beyond i_size. This approach works because (a) verity files are readonly, and (b) pages fully beyond i_size aren't visible to userspace but can be read/written internally by ext4 with only some relatively small changes to ext4. This approach avoids having to depend on the EA_INODE feature and on rearchitecturing ext4's xattr support to support paging multi-gigabyte xattrs into memory, and to support encrypting xattrs. Note that the verity metadata *must* be encrypted when the file is, since it contains hashes of the plaintext data. This patch incorporates work by Theodore Ts'o and Chandan Rajendra. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-11ext4: remove unnecessary error checkShi Siyuan1-2/+0
Remove unnecessary error check in ext4_file_write_iter(), because this check will be done in upcoming later function -- ext4_write_checks() -> generic_write_checks() Change-Id: I7b0ab27f693a50765c15b5eaa3f4e7c38f42e01e Signed-off-by: shisiyuan <shisiyuan@xiaomi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-07-18Merge tag 'libnvdimm-for-5.3' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Primarily just the virtio_pmem driver: - virtio_pmem The new virtio_pmem facility introduces a paravirtualized persistent memory device that allows a guest VM to use DAX mechanisms to access a host-file with host-page-cache. It arranges for MAP_SYNC to be disabled and instead triggers a host fsync() when a 'write-cache flush' command is sent to the virtual disk device. - Miscellaneous small fixups" * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: virtio_pmem: fix sparse warning xfs: disable map_sync for async flush ext4: disable map_sync for async flush dax: check synchronous mapping is supported dm: enable synchronous dax libnvdimm: add dax_dev sync flag virtio-pmem: Add virtio pmem driver libnvdimm: nd_region flush callback support libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-05ext4: disable map_sync for async flushPankaj Gupta1-4/+6
Dont support 'MAP_SYNC' with non-DAX files and DAX files with asynchronous dax_device. Virtio pmem provides asynchronous host page cache flush mechanism. We don't support 'MAP_SYNC' with virtio pmem and ext4. Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-06-09ext4: enforce the immutable flag on open filesTheodore Ts'o1-0/+4
According to the chattr man page, "a file with the 'i' attribute cannot be modified..." Historically, this was only enforced when the file was opened, per the rest of the description, "... and the file can not be opened in write mode". There is general agreement that we should standardize all file systems to prevent modifications even for files that were opened at the time the immutable flag is set. Eventually, a change to enforce this at the VFS layer should be landing in mainline. Until then, enforce this at the ext4 level to prevent xfstests generic/553 from failing. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: stable@kernel.org
2019-05-10ext4: fix data corruption caused by overlapping unaligned and aligned IOLukas Czerner1-0/+7
Unaligned AIO must be serialized because the zeroing of partial blocks of unaligned AIO can result in data corruption in case it's overlapping another in flight IO. Currently we wait for all unwritten extents before we submit unaligned AIO which protects data in case of unaligned AIO is following overlapping IO. However if a unaligned AIO is followed by overlapping aligned AIO we can still end up corrupting data. To fix this, we must make sure that the unaligned AIO is the only IO in flight by waiting for unwritten extents conversion not just before the IO submission, but right after it as well. This problem can be reproduced by xfstest generic/538 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2019-03-14ext4: fix data corruption caused by unaligned direct AIOLukas Czerner1-1/+1
Ext4 needs to serialize unaligned direct AIO because the zeroing of partial blocks of two competing unaligned AIOs can result in data corruption. However it decides not to serialize if the potentially unaligned aio is past i_size with the rationale that no pending writes are possible past i_size. Unfortunately if the i_size is not block aligned and the second unaligned write lands past i_size, but still into the same block, it has the potential of corrupting the previous unaligned write to the same block. This is (very simplified) reproducer from Frank // 41472 = (10 * 4096) + 512 // 37376 = 41472 - 4096 ftruncate(fd, 41472); io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376); io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472); io_submit(io_ctx, 1, &iocbs[1]); io_submit(io_ctx, 1, &iocbs[2]); io_getevents(io_ctx, 2, 2, events, NULL); Without this patch the 512B range from 40960 up to the start of the second unaligned write (41472) is going to be zeroed overwriting the data written by the first write. This is a data corruption. 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 * 0000a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 With this patch the data corruption is avoided because we will recognize the unaligned_aio and wait for the unwritten extent conversion. 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 * 0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 * 0000b200 Reported-by: Frank Sorenson <fsorenso@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO") Cc: stable@vger.kernel.org
2018-08-17dax: remove VM_MIXEDMAP for fsdax and device daxDave Jiang1-1/+1
This patch is reworked from an earlier patch that Dan has posted: https://patchwork.kernel.org/patch/10131727/ VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fallback to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against ndctl unit test. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap and no additional issues have been observed. Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-13ext4: do not update s_last_mounted of a frozen fsAmir Goldstein1-5/+9
If fs is frozen after mount and before the first file open, the update of s_last_mounted bypasses freeze protection and prints out a WARNING splat: $ mount /vdf $ fsfreeze -f /vdf $ cat /vdf/foo [ 31.578555] WARNING: CPU: 1 PID: 1415 at fs/ext4/ext4_jbd2.c:53 ext4_journal_check_start+0x48/0x82 [ 31.614016] Call Trace: [ 31.614997] __ext4_journal_start_sb+0xe4/0x1a4 [ 31.616771] ? ext4_file_open+0xb6/0x189 [ 31.618094] ext4_file_open+0xb6/0x189 If fs is frozen, skip s_last_mounted update. [backport hint: to apply to stable tree, need to apply also patches vfs: add the sb_start_intwrite_trylock() helper ext4: factor out helper ext4_sample_last_mounted()] Cc: stable@vger.kernel.org Fixes: bc0b0d6d69ee ("ext4: update the s_last_mounted field in the superblock") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2018-05-13ext4: factor out helper ext4_sample_last_mounted()Amir Goldstein1-36/+46
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2018-05-13fs: ext4: add new return type vm_fault_tSouptick Joarder1-3/+4
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
2018-01-07ext4: fix ENOSPC handling in DAX page fault handlerJan Kara1-2/+8
When allocation of underlying block for a page fault fails, we fail the fault with SIGBUS. However we may well hit ENOSPC just due to lots of free blocks being held by the running / committing transaction. So propagate the error from ext4_iomap_begin() and implement do standard allocation retry loop in ext4_dax_huge_fault(). Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-07dax: pass detailed error code from dax_iomap_fault()Jan Kara1-1/+1
Ext4 needs to pass through error from its iomap handler to the page fault handler so that it can properly detect ENOSPC and force transaction commit and retry the fault (and block allocation). Add argument to dax_iomap_fault() for passing such error. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-11-17Merge tag 'libnvdimm-for-4.15' of ↵Linus Torvalds1-6/+20
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and dax updates from Dan Williams: "Save for a few late fixes, all of these commits have shipped in -next releases since before the merge window opened, and 0day has given a build success notification. The ext4 touches came from Jan, and the xfs touches have Darrick's reviewed-by. An xfstest for the MAP_SYNC feature has been through a few round of reviews and is on track to be merged. - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable 'userspace flush' of persistent memory updates via filesystem-dax mappings. It arranges for any filesystem metadata updates that may be required to satisfy a write fault to also be flushed ("on disk") before the kernel returns to userspace from the fault handler. Effectively every write-fault that dirties metadata completes an fsync() before returning from the fault handler. The new MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag is validated as supported by the filesystem's ->mmap() file operation. - Add support for the standard ACPI 6.2 label access methods that replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This enables interoperability with environments that only implement the standardized methods. - Add support for the ACPI 6.2 NVDIMM media error injection methods. - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch last shutdown status, firmware update, SMART error injection, and SMART alarm threshold control. - Cleanup physical address information disclosures to be root-only. - Fix revalidation of the DIMM "locked label area" status to support dynamic unlock of the label area. - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA (system-physical-address) command and error injection commands. Acknowledgements that came after the commits were pushed to -next: - 957ac8c421ad ("dax: fix PMD faults on zero-length files"): Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> - a39e596baa07 ("xfs: support for synchronous DAX faults") and 7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>" * tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits) acpi, nfit: add 'Enable Latch System Shutdown Status' command support dax: fix general protection fault in dax_alloc_inode dax: fix PMD faults on zero-length files dax: stop requiring a live device for dax_flush() brd: remove dax support dax: quiet bdev_dax_supported() fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core tools/testing/nvdimm: unit test clear-error commands acpi, nfit: validate commands against the device type tools/testing/nvdimm: stricter bounds checking for error injection commands xfs: support for synchronous DAX faults xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault() ext4: Support for synchronous DAX faults ext4: Simplify error handling in ext4_dax_huge_fault() dax: Implement dax_finish_sync_fault() dax, iomap: Add support for synchronous faults mm: Define MAP_SYNC and VM_SYNC flags dax: Allow tuning whether dax_insert_mapping_entry() dirties entry dax: Allow dax_iomap_fault() to return pfn dax: Fix comment describing dax_iomap_fault() ...
2017-11-14Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-249/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - Add support for online resizing of file systems with bigalloc - Fix a two data corruption bugs involving DAX, as well as a corruption bug after a crash during a racing fallocate and delayed allocation. - Finally, a number of cleanups and optimizations. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: improve smp scalability for inode generation ext4: add support for online resizing with bigalloc ext4: mention noload when recovering on read-only device Documentation: fix little inconsistencies ext4: convert timers to use timer_setup() jbd2: convert timers to use timer_setup() ext4: remove duplicate extended attributes defs ext4: add ext4_should_use_dax() ext4: add sanity check for encryption + DAX ext4: prevent data corruption with journaling + DAX ext4: prevent data corruption with inline data + DAX ext4: fix interaction between i_size, fallocate, and delalloc after a crash ext4: retry allocations conservatively ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA ext4: Add iomap support for inline data iomap: Add IOMAP_F_DATA_INLINE flag iomap: Switch from blkno to disk offset
2017-11-14Merge tag 'fscrypt-for-linus' of ↵Linus Torvalds1-19/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Lots of cleanups, mostly courtesy by Eric Biggers" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: lock mutex before checking for bounce page pool fscrypt: add a documentation file for filesystem-level encryption ext4: switch to fscrypt_prepare_setattr() ext4: switch to fscrypt_prepare_lookup() ext4: switch to fscrypt_prepare_rename() ext4: switch to fscrypt_prepare_link() ext4: switch to fscrypt_file_open() fscrypt: new helper function - fscrypt_prepare_setattr() fscrypt: new helper function - fscrypt_prepare_lookup() fscrypt: new helper function - fscrypt_prepare_rename() fscrypt: new helper function - fscrypt_prepare_link() fscrypt: new helper function - fscrypt_file_open() fscrypt: new helper function - fscrypt_require_key() fscrypt: remove unneeded empty fscrypt_operations structs fscrypt: remove ->is_encrypted() fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED() fs, fscrypt: add an S_ENCRYPTED inode flag fscrypt: clean up include file mess
2017-11-03ext4: Support for synchronous DAX faultsJan Kara1-1/+14
We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to prepare blocks for writing and the inode has some uncommitted metadata changes. In the fault handler ext4_dax_fault() we then detect this case (through VM_FAULT_NEEDDSYNC return value) and call helper dax_finish_sync_fault() to flush metadata changes and insert page table entry. Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03ext4: Simplify error handling in ext4_dax_huge_fault()Jan Kara1-6/+7
If transaction starting fails, just bail out of the function immediately instead of checking for that condition throughout the function. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03dax: Allow dax_iomap_fault() to return pfnJan Kara1-1/+1
For synchronous page fault dax_iomap_fault() will need to return PFN which will then need to be inserted into page tables after fsync() completes. Add necessary parameter to dax_iomap_fault(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18ext4: switch to fscrypt_file_open()Eric Biggers1-19/+4
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-01ext4: Switch to iomap for SEEK_HOLE / SEEK_DATAChristoph Hellwig1-249/+14
Switch to the iomap_seek_hole and iomap_seek_data helpers for implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the code that isn't needed any more. Note that with this patch ext4 will now always depend on the iomap code instead of only when CONFIG_DAX is enabled, and it requires adding a call into the extent status tree for iomap_begin as well to properly deal with delalloc extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> [More fixes and cleanups by Andreas]
2017-09-14Merge branch 'work.read_write' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull nowait read support from Al Viro: "Support IOCB_NOWAIT for buffered reads and block devices" * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: block_dev: support RFW_NOWAIT on block device nodes fs: support RWF_NOWAIT for buffered reads fs: support IOCB_NOWAIT in generic_file_buffered_read fs: pass iocb to do_generic_file_read
2017-09-14Merge branch 'work.mount' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-43/+5
Merge updates from Andrew Morton: - various misc bits - DAX updates - OCFS2 - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits) mm,fork: introduce MADV_WIPEONFORK x86,mpx: make mpx depend on x86-64 to free up VMA flag mm: add /proc/pid/smaps_rollup mm: hugetlb: clear target sub-page last when clearing huge page mm: oom: let oom_reap_task and exit_mmap run concurrently swap: choose swap device according to numa node mm: replace TIF_MEMDIE checks by tsk_is_oom_victim mm, oom: do not rely on TIF_MEMDIE for memory reserves access z3fold: use per-cpu unbuddied lists mm, swap: don't use VMA based swap readahead if HDD is used as swap mm, swap: add sysfs interface for VMA based swap readahead mm, swap: VMA based swap readahead mm, swap: fix swap readahead marking mm, swap: add swap readahead hit statistics mm/vmalloc.c: don't reinvent the wheel but use existing llist API mm/vmstat.c: fix wrong comment selftests/memfd: add memfd_create hugetlbfs selftest mm/shmem: add hugetlbfs support to memfd_create() mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas() ...
2017-09-06mm: remove nr_pages argument from pagevec_lookup{,_range}()Jan Kara1-1/+1
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()Jan Kara1-10/+4
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are interested only in pages in the given range. Simplify the logic as a result of not getting pages out of range and index getting automatically advanced. Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: make pagevec_lookup() update indexJan Kara1-3/+1
Make pagevec_lookup() (and underlying find_get_pages()) update index to the next page where iteration should continue. Most callers want this and also pagevec_lookup_tag() already does this. Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06dax: use common 4k zero page for dax mmap readsRoss Zwisler1-31/+1
When servicing mmap() reads from file holes the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks: 1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory. This is easily visible by looking at the overall memory consumption of the system or by looking at /proc/[pid]/smaps: 7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 1048576 kB Pss: 1048576 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 1048576 kB Private_Dirty: 0 kB Referenced: 1048576 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB 2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it. Here are the average latencies of dax_load_hole() as measured by ftrace on a random test box: Old method, using zeroed page cache pages: 3.4 us New method, using the common 4k zero page: 0.8 us This was the average latency over 1 GiB of sequential reads done by this simple fio script: [global] size=1G filename=/root/dax/data fallocate=none [io] rw=read ioengine=mmap 3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex. Solve these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. As with the PMD code we will now insert a DAX exceptional entry into the radix tree instead of a struct page pointer which allows us to remove all the special casing in the DAX code. Note that we do still pretty aggressively check for regular pages in the DAX radix tree, especially where we take action based on the bits set in the page. If we ever find a regular page in our radix tree now that most likely means that someone besides DAX is inserting pages (which has happened lots of times in the past), and we want to find that out early and fail loudly. This solution also removes the extra memory consumption. Here is that same /proc/[pid]/smaps after 1GiB of reading from a hole with the new code: 7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB Overall system memory consumption is similarly improved. Another major change is that we remove dax_pfn_mkwrite() from our fault flow, and instead rely on the page fault itself to make the PTE dirty and writeable. The following description from the patch adding the vm_insert_mixed_mkwrite() call explains this a little more: "To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page(): case 1: 4k zero page => writable DAX storage case 2: read-only DAX storage => writeable DAX storage This distinction is made by via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path" Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-04fs: support RWF_NOWAIT for buffered readsChristoph Hellwig1-3/+3
This is based on the old idea and code from Milosz Tanski. With the aio nowait code it becomes mostly trivial now. Buffered writes continue to return -EOPNOTSUPP if RWF_NOWAIT is passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-24ext4: fix fault handling when mounted with -o dax,roRandy Dodgen1-1/+14
If an ext4 filesystem is mounted with both the DAX and read-only options, executables on that filesystem will fail to start (claiming 'Segmentation fault') due to the fault handler returning VM_FAULT_SIGBUS. This is due to the DAX fault handler (see ext4_dax_huge_fault) attempting to write to the journal when FAULT_FLAG_WRITE is set. This is the wrong behavior for write faults which will lead to a COW page; in particular, this fails for readonly mounts. This change avoids journal writes for faults that are expected to COW. It might be the case that this could be better handled in ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside dax_iomap_fault). These is some overlap already (e.g. grabbing journal handles). Signed-off-by: Randy Dodgen <dodgen@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2017-08-24ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsetsDarrick J. Wong1-2/+2
In the ext4 implementations of SEEK_HOLE and SEEK_DATA, make sure we return -ENXIO for negative offsets instead of banging around inside the extent code and returning -EFSCORRUPTED. Reported-by: Mateusz S <muttdini@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 4.6
2017-08-05ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesizeJan Kara1-0/+3
ext4_find_unwritten_pgoff() does not properly handle a situation when starting index is in the middle of a page and blocksize < pagesize. The following command shows the bug on filesystem with 1k blocksize: xfs_io -f -c "falloc 0 4k" \ -c "pwrite 1k 1k" \ -c "pwrite 3k 1k" \ -c "seek -a -r 0" foo In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048, SEEK_DATA) will return the correct result. Fix the problem by neglecting buffers in a page before starting offset. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> CC: stable@vger.kernel.org # 3.8+
2017-07-17VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)David Howells1-1/+1
Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-09Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-7/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first major feature for ext4 this merge window is the largedir feature, which allows ext4 directories to support over 2 billion directory entries (assuming ~64 byte file names; in practice, users will run into practical performance limits first.) This feature was originally written by the Lustre team, and credit goes to Artem Blagodarenko from Seagate for getting this feature upstream. The second major major feature allows ext4 to support extended attribute values up to 64k. This feature was also originally from Lustre, and has been enhanced by Tahsin Erdogan from Google with a deduplication feature so that if multiple files have the same xattr value (for example, Windows ACL's stored by Samba), only one copy will be stored on disk for encoding and caching efficiency. We also have the usual set of bug fixes, cleanups, and optimizations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits) ext4: fix spelling mistake: "prellocated" -> "preallocated" ext4: fix __ext4_new_inode() journal credits calculation ext4: skip ext4_init_security() and encryption on ea_inodes fs: generic_block_bmap(): initialize all of the fields in the temp bh ext4: change fast symlink test to not rely on i_blocks ext4: require key for truncate(2) of encrypted file ext4: don't bother checking for encryption key in ->mmap() ext4: check return value of kstrtoull correctly in reserved_clusters_store ext4: fix off-by-one fsmap error on 1k block filesystems ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry() ext4: return EIO on read error in ext4_find_entry ext4: forbid encrypting root directory ext4: send parallel discards on commit completions ext4: avoid unnecessary stalls in ext4_evict_inode() ext4: add nombcache mount option ext4: strong binding of xattr inode references ext4: eliminate xattr entry e_hash recalculation for removes ext4: reserve space for xattr entries/names quota: add get_inode_usage callback to transfer multi-inode charges ext4: xattr inode deduplication ...
2017-06-23ext4: don't bother checking for encryption key in ->mmap()Eric Biggers1-7/+0
Since only an open file can be mmap'ed, and we only allow open()ing an encrypted file when its key is available, there is no need to check for the key again before permitting each mmap(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-20ext4: nowait aio supportGoldwyn Rodrigues1-6/+29
Return EAGAIN if any of the following checks fail for direct I/O: + i_rwsem is lockable + Writing beyond end of file (will trigger allocation) + Blocks are not allocated at the write location Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-05-24ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()Eryu Guan1-1/+1
ext4_find_unwritten_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size ext4 on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/ext4/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is unconvered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-21ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()Jan Kara1-1/+1
There is an off-by-one error in loop termination conditions in ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of desired range if 'endoff' is page aligned. It doesn't have any visible effects but still it is good to fix it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21ext4: fix SEEK_HOLEJan Kara1-36/+14
Currently, SEEK_HOLE implementation in ext4 may both return that there's a hole at some offset although that offset already has data and skip some holes during a search for the next hole. The first problem is demostrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec) Whence Result HOLE 0 Where we can see that SEEK_HOLE wrongly returned offset 0 as containing a hole although we have written data there. The second problem can be demonstrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offsets 56k..128k has been ignored by the SEEK_HOLE call. The underlying problem is in the ext4_find_unwritten_pgoff() which is just buggy. In some cases it fails to update returned offset when it finds a hole (when no pages are found or when the first found page has higher index than expected), in some cases conditions for detecting hole are just missing (we fail to detect a situation where indices of returned pages are not contiguous). Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page indices and also handle all cases where we got less pages then expected in one place and handle it properly there. CC: stable@vger.kernel.org Fixes: c8c0df241cc2719b1262e627f999638411934f60 CC: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-12ext4: return to starting transaction in ext4_dax_huge_fault()Jan Kara1-4/+17
DAX will return to locking exceptional entry before mapping blocks for a page fault to fix possible races with concurrent writes. To avoid lock inversion between exceptional entry lock and transaction start, start the transaction already in ext4_dax_huge_fault(). Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956 Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-03ext4: Add statx supportDavid Howells1-1/+1
Return enhanced file attributes from the Ext4 filesystem. This includes the following: (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME. (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags. This requires that all ext4 inodes have a getattr call, not just some of them, so to this end, split the ext4_getattr() function and only call part of it where appropriate. Example output: [root@andromeda ~]# touch foo [root@andromeda ~]# chattr +ai foo [root@andromeda ~]# /tmp/test-statx foo statx(foo) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 08:12 Inode: 2101950 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2016-02-11 17:08:29.031795451+0000 Modify: 2016-02-11 17:08:29.031795451+0000 Change: 2016-02-11 17:11:11.987790114+0000 Birth: 2016-02-11 17:08:29.031795451+0000 Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----) Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-02-24mm: replace FAULT_FLAG_SIZE with parameter to huge_faultDave Jiang1-3/+9
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has been somewhat painful with getting the flags set and removed at the correct locations. More than one kernel oops was introduced due to difficulties of getting the placement correctly. Remove the flag values and introduce an input parameter to huge_fault that indicates the size of the page entry. This makes the code easier to trace and should avoid the issues we see with the fault flags where removal of the flag was necessary in the fallback paths. Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm,fs,dax: change ->pmd_fault to ->huge_faultDave Jiang1-22/+1
Patch series "1G transparent hugepage support for device dax", v2. The following series implements support for 1G trasparent hugepage on x86 for device dax. The bulk of the code was written by Mathew Wilcox a while back supporting transparent 1G hugepage for fs DAX. I have forward ported the relevant bits to 4.10-rc. The current submission has only the necessary code to support device DAX. Comments from Dan Williams: So the motivation and intended user of this functionality mirrors the motivation and users of 1GB page support in hugetlbfs. Given expected capacities of persistent memory devices an in-memory database may want to reduce tlb pressure beyond what they can already achieve with 2MB mappings of a device-dax file. We have customer feedback to that effect as Willy mentioned in his previous version of these patches [1]. [1]: https://lkml.org/lkml/2016/1/31/52 Comments from Nilesh @ Oracle: There are applications which have a process model; and if you assume 10,000 processes attempting to mmap all the 6TB memory available on a server; we are looking at the following: processes : 10,000 memory : 6TB pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB pmd @ 2M page size: 120,000 / 512 = ~240GB pud @ 1G page size: 240GB / 512 = ~480MB As you can see with 2M pages, this system will use up an exorbitant amount of DRAM to hold the page tables; but the 1G pages finally brings it down to a reasonable level. Memory sizes will keep increasing; so this number will keep increasing. An argument can be made to convert the applications from process model to thread model, but in the real world that may not be always practical. Hopefully this helps explain the use case where this is valuable. This patch (of 3): In preparation for adding the ability to handle PUD pages, convert vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The vm_fault structure is extended to include a union of the different page table pointers that may be needed, and three flag bits are reserved to indicate which type of pointer is in the union. [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()] Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path] Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang1-9/+8
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: change pmd_fault() to take only vmf parameterDave Jiang1-4/+4
pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: make pmd_fault() and friends be the same as fault()Dave Jiang1-5/+4
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-08ext4: fix DAX write lockingChristoph Hellwig1-9/+1
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the application, so we'll have to provide the traditional synchronіzation of overlapping writes as we do for buffered writes. This was broken historically for DAX, but got fixed for ext2 and XFS as part of the iomap conversion. Fix up ext4 as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-02-05ext4: add shutdown bit and check for itTheodore Ts'o1-0/+12
Add a shutdown bit that will cause ext4 processing to fail immediately with EIO. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-26ext4: Simplify DAX fault pathJan Kara1-38/+10
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we can use transaction starting in ext4_iomap_begin() and thus simplify ext4_dax_fault(). It also provides us proper retries in case of ENOSPC. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-11-20ext4: convert DAX faults to iomap infrastructureJan Kara1-4/+5
Convert DAX faults to use iomap infrastructure. We would not have to start transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes care of that but so far we do that to avoid lock inversion of transaction start with DAX entry lock which gets acquired in dax_iomap_fault() before calling ->iomap_begin handler. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20ext4: DAX iomap write supportJan Kara1-0/+40
Implement DAX writes using the new iomap infrastructure instead of overloading the direct IO path. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20ext4: convert DAX reads to iomap infrastructureJan Kara1-1/+37
Implement basic iomap_begin function that handles reading and use it for DAX reads. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20ext4: factor out checks from ext4_file_write_iter()Jan Kara1-47/+50
Factor out checks of 'from' and whether we are overwriting out of ext4_file_write_iter() so that the function is easier to follow. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-10-10Merge branch 'work.xattr' of ↵Linus Torvalds1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs xattr updates from Al Viro: "xattr stuff from Andreas This completes the switch to xattr_handler ->get()/->set() from ->getxattr/->setxattr/->removexattr" * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Remove {get,set,remove}xattr inode operations xattr: Stop calling {get,set,remove}xattr inode operations vfs: Check for the IOP_XATTR flag in listxattr xattr: Add __vfs_{get,set,remove}xattr helpers libfs: Use IOP_XATTR flag for empty directory handling vfs: Use IOP_XATTR flag for bad-inode handling vfs: Add IOP_XATTR inode operations flag vfs: Move xattr_resolve_name to the front of fs/xattr.c ecryptfs: Switch to generic xattr handlers sockfs: Get rid of getxattr iop sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names kernfs: Switch to generic xattr handlers hfs: Switch to generic xattr handlers jffs2: Remove jffs2_{get,set,remove}xattr macros xattr: Remove unnecessary NULL attribute name check
2016-10-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge updates from Andrew Morton: - fsnotify updates - ocfs2 updates - all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits) console: don't prefer first registered if DT specifies stdout-path cred: simpler, 1D supplementary groups CREDITS: update Pavel's information, add GPG key, remove snail mail address mailmap: add Johan Hovold .gitattributes: set git diff driver for C source code files uprobes: remove function declarations from arch/{mips,s390} spelling.txt: "modeled" is spelt correctly nmi_backtrace: generate one-line reports for idle cpus arch/tile: adopt the new nmi_backtrace framework nmi_backtrace: do a local dump_stack() instead of a self-NMI nmi_backtrace: add more trigger_*_cpu_backtrace() methods min/max: remove sparse warnings when they're nested Documentation/filesystems/proc.txt: add more description for maps/smaps mm, proc: fix region lost in /proc/self/smaps proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self proc: add LSM hook checks to /proc/<tid>/timerslack_ns proc: relax /proc/<tid>/timerslack_ns capability requirements meminfo: break apart a very long seq_printf with #ifdefs seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char proc: faster /proc/*/status ...
2016-10-07vfs: Remove {get,set,remove}xattr inode operationsAndreas Gruenbacher1-3/+0
These inode operations are no longer used; remove them. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07ext2/4, xfs: call thp_get_unmapped_area() for pmd mappingsToshi Kani1-0/+1
To support DAX pmd mappings with unmodified applications, filesystems need to align an mmap address by the pmd size. Call thp_get_unmapped_area() from f_op->get_unmapped_area. Note, there is no change in behavior for a non-DAX file. Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-30ext4: remove plugging from ext4_file_write_iter()Jan Kara1-4/+0
do_blockdev_direct_IO() takes care of properly plugging direct IO so there's no need to plug again inside ext4_file_write_iter(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30ext4: allow unlocked direct IO when pages are cachedJan Kara1-2/+1
Currently we do not allow unlocked (meaning without inode_lock) direct IO when the file has any pages cached. This check is not needed anymore as we keep inode lock until ext4_direct_IO_write() and thus can happily writeback and evict any pages conflicting with current direct IO write. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15ext4: create EXT4_MAX_BLOCKS() macroFabian Frederick1-2/+1
Create a macro to calculate length + offset -> maximum blocks This adds more readability. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+2
Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
2016-07-26dax: remote unused fault wrappersRoss Zwisler1-2/+2
Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and dax_pmd_fault() respectively, and update all callers. The dax_fault() and dax_pmd_fault() wrappers were initially intended to capture some filesystem independent functionality around page faults (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime and ctime). However, the following commits: 5726b27b09cc ("ext2: Add locking for DAX faults") ea3d7209ca01 ("ext4: fix races between page faults and hole punching") added locking to the ext2 and ext4 filesystems after these common operations but before __dax_fault() and __dax_pmd_fault() were called. This means that these wrappers are no longer used, and are unlikely to be used in the future. XFS has had locking analogous to what was recently added to ext2 and ext4 since DAX support was initially introduced by: 6b698edeeef0 ("xfs: add DAX file operations support") Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-10ext4 crypto: migrate into vfs's crypto engineJaegeuk Kim1-5/+5
This patch removes the most parts of internal crypto codes. And then, it modifies and adds some ext4-specific crypt codes to use the generic facility. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-26Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
2016-05-24Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Fix a number of bugs, most notably a potential stale data exposure after a crash and a potential BUG_ON crash if a file has the data journalling flag enabled while it has dirty delayed allocation blocks that haven't been written yet. Also fix a potential crash in the new project quota code and a maliciously corrupted file system. In addition, fix some DAX-specific bugs, including when there is a transient ENOSPC situation and races between writes via direct I/O and an mmap'ed segment that could lead to lost I/O. Finally the usual set of miscellaneous cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: pre-zero allocated blocks for DAX IO ext4: refactor direct IO code ext4: fix race in transient ENOSPC detection ext4: handle transient ENOSPC properly for DAX dax: call get_blocks() with create == 1 for write faults to unwritten extents ext4: remove unmeetable inconsisteny check from ext4_find_extent() jbd2: remove excess descriptions for handle_s ext4: remove unnecessary bio get/put ext4: silence UBSAN in ext4_mb_init() ext4: address UBSAN warning in mb_find_order_for_block() ext4: fix oops on corrupted filesystem ext4: fix check of dqget() return value in ext4_ioctl_setproject() ext4: clean up error handling when orphan list is corrupted ext4: fix hang when processing corrupted orphaned inode list ext4: remove trailing \n from ext4_warning/ext4_error calls ext4: fix races between changing inode journal mode and ext4_writepages ext4: handle unwritten or delalloc buffers before enabling data journaling ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart() ext4: do not ask jbd2 to write data for delalloc buffers jbd2: add support for avoiding data writes during transaction commits ...
2016-05-16dax: Remove complete_unwritten argumentJan Kara1-2/+2
Fault handlers currently take complete_unwritten argument to convert unwritten extents after PTEs are updated. However no filesystem uses this anymore as the code is racy. Remove the unused argument. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-13ext4: pre-zero allocated blocks for DAX IOJan Kara1-2/+2
Currently ext4 treats DAX IO the same way as direct IO. I.e., it allocates unwritten extents before IO is done and converts unwritten extents afterwards. However this way DAX IO can race with page fault to the same area: ext4_ext_direct_IO() dax_fault() dax_io() get_block() - allocates unwritten extent copy_from_iter_pmem() get_block() - converts unwritten block to written and zeroes it out ext4_convert_unwritten_extents() So data written with DAX IO gets lost. Similarly dax_new_buf() called from dax_io() can overwrite data that has been already written to the block via mmap. Fix the problem by using pre-zeroed blocks for DAX IO the same way as we use them for DAX mmap. The downside of this solution is that every allocating write writes each block twice (once zeros, once data). Fixing the race with locking is possible as well however we would need to lock-out faults for the whole range written to by DAX IO. And that is not easy to do without locking-out faults for the whole file which seems too aggressive. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-01fs: simplify the generic_write_sync prototypeChristoph Hellwig1-7/+2
The kiocb already has the new position, so use that. The only interesting case is AIO, where we currently don't bother updating ki_pos. We're about to free the kiocb after we're done, so we might as well update it to make everyone's life simpler. While we're at it also return the bytes written argument passed in if we were successful so that the boilerplate error switch code in the callers can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01fs: add IOCB_SYNC and IOCB_DSYNCChristoph Hellwig1-1/+1
This will allow us to do per-I/O sync file writes, as required by a lot of fileservers or storage targets. XXX: Will need a few additional audits for O_DSYNC Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-27ext4: remove trailing \n from ext4_warning/ext4_error callsJakub Wilk1-1/+1
Messages passed to ext4_warning() or ext4_error() don't need trailing newlines, because these function add the newlines themselves. Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-07Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "These changes contains a fix for overlayfs interacting with some (badly behaved) dentry code in various file systems. These have been reviewed by Al and the respective file system mtinainers and are going through the ext4 tree for convenience. This also has a few ext4 encryption bug fixes that were discovered in Android testing (yes, we will need to get these sync'ed up with the fs/crypto code; I'll take care of that). It also has some bug fixes and a change to ignore the legacy quota options to allow for xfstests regression testing of ext4's internal quota feature and to be more consistent with how xfs handles this case" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: ignore quota mount options if the quota feature is enabled ext4 crypto: fix some error handling ext4: avoid calling dquot_get_next_id() if quota is not enabled ext4: retry block allocation for failed DIO and DAX writes ext4: add lockdep annotations for i_data_sem ext4: allow readdir()'s of large empty directories to be interrupted btrfs: fix crash/invalid memory access on fsync when using overlayfs ext4 crypto: use dget_parent() in ext4_d_revalidate() ext4: use file_dentry() ext4: use dget_parent() in ext4_file_open() nfs: use file_dentry() fs: add file_dentry() ext4 crypto: don't let data integrity writebacks fail with ENOMEM ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-2/+2
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-26ext4: use file_dentry()Miklos Szeredi1-1/+1
EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry can lead to a crash. Fix by replacing direct access of file->f_path.dentry with the file_dentry() accessor, which will always return a native object. Reported-by: Daniel Axtens <dja@axtens.net> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> # v4.5
2016-03-26ext4: use dget_parent() in ext4_file_open()Miklos Szeredi1-4/+8
In f_op->open() lock on parent is not held, so there's no guarantee that parent dentry won't go away at any time. Even after this patch there's no guarantee that 'dir' will stay the parent of 'inode', but at least it won't be freed while being used. Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: <stable@vger.kernel.org> # v4.5
2016-03-17Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-80/+49
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Performance improvements in SEEK_DATA and xattr scalability improvements, plus a lot of clean ups and bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits) ext4: clean up error handling in the MMP support jbd2: do not fail journal because of frozen_buffer allocation failure ext4: use __GFP_NOFAIL in ext4_free_blocks() ext4: fix compile error while opening the macro DOUBLE_CHECK ext4: print ext4 mount option data_err=abort correctly ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry() ext4: fix misspellings in comments. jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path ext4: more efficient SEEK_DATA implementation ext4: cleanup handling of bh->b_state in DAX mmap ext4: return hole from ext4_map_blocks() ext4: factor out determining of hole size ext4: fix setting of referenced bit in ext4_es_lookup_extent() ext4: remove i_ioend_count ext4: simplify io_end handling for AIO DIO ext4: move trans handling and completion deferal out of _ext4_get_block ext4: rename and split get blocks functions ext4: use i_mutex to serialize unaligned AIO DIO ext4: pack ioend structure better ...
2016-03-09ext4: more efficient SEEK_DATA implementationJan Kara1-61/+36
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as ext4_seek_data() iterates hole block-by-block. Fix the problem by using returned hole size from ext4_map_blocks() and thus skip the hole in one go. Update also SEEK_HOLE implementation to follow the same pattern as SEEK_DATA to make future maintenance easier. Furthermore we add cond_resched() to both ext4_seek_data() and ext4_seek_hole() to avoid softlockups in case evil user creates huge fragmented file and we have to go through lots of extents. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08ext4: use i_mutex to serialize unaligned AIO DIOJan Kara1-19/+13
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO. However the code cleanups that happened after 2011 when the lock was introduced made aio_mutex acquired at almost the same places where we already have exclusion using i_mutex. So just use i_mutex for the exclusion of unaligned AIO DIO. The change moves waiting for pending unwritten extent conversion under i_mutex. That makes special handling of O_APPEND writes unnecessary and also avoids possible livelocking of unaligned AIO DIO with aligned one (nothing was preventing contiguous stream of aligned AIO DIOs to let unaligned AIO DIO wait forever). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()Ross Zwisler1-17/+2
As it is currently written ext4_dax_mkwrite() assumes that the call into __dax_mkwrite() will not have to do a block allocation so it doesn't create a journal entry. For a read that creates a zero page to cover a hole followed by a write that actually allocates storage this is incorrect. The ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls get_blocks() to allocate storage. Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault() as this function already has all the logic needed to allocate a journal entry and call __dax_fault(). Also update the ext2 fault handlers in this same way to remove duplicate code and keep the logic between ext2 and ext4 the same. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-08ext4 crypto: move context consistency check to ext4_file_open()Theodore Ts'o1-0/+9
In the case where the per-file key for the directory is cached, but root does not have access to the key needed to derive the per-file key for the files in the directory, we allow the lookup to succeed, so that lstat(2) and unlink(2) can suceed. However, if a program tries to open the file, it will get an ENOKEY error. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-01-23Merge branch 'for-linus' of ↵Linus Torvalds1-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull final vfs updates from Al Viro: - The ->i_mutex wrappers (with small prereq in lustre) - a fix for too early freeing of symlink bodies on shmem (they need to be RCU-delayed) (-stable fodder) - followup to dedupe stuff merged this cycle * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: abort dedupe loop if fatal signals are pending make sure that freeing shmem fast symlinks is RCU-delayed wrappers for ->i_mutex access lustre: remove unused declaration
2016-01-22ext4: call dax_pfn_mkwrite() for DAX fsync/msyncRoss Zwisler1-1/+3
To properly support the new DAX fsync/msync infrastructure filesystems need to call dax_pfn_mkwrite() so that DAX can track when user pages are dirtied. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22wrappers for ->i_mutex accessAl Viro1-9/+9
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-07ext4: use pre-zeroed blocks for DAX page faultsJan Kara1-17/+3
Make DAX fault path use pre-zeroed blocks to avoid races with extent conversion and zeroing when two page faults to the same block happen. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07ext4: fix races between page faults and hole punchingJan Kara1-9/+57
Currently, page faults and hole punching are completely unsynchronized. This can result in page fault faulting in a page into a range that we are punching after truncate_pagecache_range() has been called and thus we can end up with a page mapped to disk blocks that will be shortly freed. Filesystem corruption will shortly follow. Note that the same race is avoided for truncate by checking page fault offset against i_size but there isn't similar mechanism available for punching holes. Fix the problem by creating new rw semaphore i_mmap_sem in inode and grab it for writing over truncate, hole punching, and other functions removing blocks from extent tree and for read over page faults. We cannot easily use i_data_sem for this since that ranks below transaction start and we need something ranking above it so that it can be held over the whole truncate / hole punching operation. Also remove various workarounds we had in the code to reduce race window when page fault could have created pages with stale mapping information. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-08ext4: start transaction before calling into DAXMatthew Wilcox1-3/+52
Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08ext4: add ext4_get_block_dax()Matthew Wilcox1-3/+3
DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08ext4: use ext4_get_block_write() for DAXMatthew Wilcox1-4/+4
DAX relies on the get_block function either zeroing newly allocated blocks before they're findable by subsequent calls to get_block, or marking newly allocated blocks as unwritten. ext4_get_block() cannot create unwritten extents, but ext4_get_block_write() can. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Andy Rudoff <andy.rudoff@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08ext4: huge page fault supportMatthew Wilcox1-1/+9
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: move DAX-related functions to a new headerMatthew Wilcox1-0/+1
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is defined in linux/mm.h. Given that we don't want to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to move the DAX-related functions to a new header, <linux/dax.h>. We could also have moved VM_FAULT_* definitions to a new header, or a different header that isn't quite such a boil-the-ocean header as <linux/mm.h>, but this felt like the best option. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30Merge tag 'xfs-for-linus-4.2-rc1' of ↵Linus Torvalds1-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pul xfs updates from Dave Chinner: "There's a couple of small API changes to the core DAX code which required small changes to the ext2 and ext4 code bases, but otherwise everything is within the XFS codebase. This update contains: - A new sparse on-disk inode record format to allow small extents to be used for inode allocation when free space is fragmented. - DAX support. This includes minor changes to the DAX core code to fix problems with lock ordering and bufferhead mapping abuse. - transaction commit interface cleanup - removal of various unnecessary XFS specific type definitions - cleanup and optimisation of freelist preparation before allocation - various minor cleanups - bug fixes for - transaction reservation leaks - incorrect inode logging in unwritten extent conversion - mmap lock vs freeze ordering - remote symlink mishandling - attribute fork removal issues" * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: don't truncate attribute extents if no extents exist xfs: clean up XFS_MIN_FREELIST macros xfs: sanitise error handling in xfs_alloc_fix_freelist xfs: factor out free space extent length check xfs: xfs_alloc_fix_freelist() can use incore perag structures xfs: remove xfs_caddr_t xfs: use void pointers in log validation helpers xfs: return a void pointer from xfs_buf_offset xfs: remove inst_t xfs: remove __psint_t and __psunsigned_t xfs: fix remote symlinks on V5/CRC filesystems xfs: fix xfs_log_done interface xfs: saner xfs_trans_commit interface xfs: remove the flags argument to xfs_trans_cancel xfs: pass a boolean flag to xfs_trans_free_items xfs: switch remaining xfs_trans_dup users to xfs_trans_roll xfs: check min blks for random debug mode sparse allocations xfs: fix sparse inodes 32-bit compile failure xfs: add initial DAX support xfs: add DAX IO path support ...
2015-06-04dax: don't abuse get_block mapping for endio callbacksDave Chinner1-2/+14
dax_fault() currently relies on the get_block callback to attach an io completion callback to the mapping buffer head so that it can run unwritten extent conversion after zeroing allocated blocks. Instead of this hack, pass the conversion callback directly into dax_fault() similar to the get_block callback. When the filesystem allocates unwritten extents, it will set the buffer_unwritten() flag, and hence the dax_fault code can call the completion function in the contexts where it is necessary without overloading the mapping buffer head. Note: The changes to ext4 to use this interface are suspect at best. In fact, the way ext4 did this end_io assignment in the first place looks suspect because it only set a completion callback when there wasn't already some other write() call taking place on the same inode. The ext4 end_io code looks rather intricate and fragile with all it's reference counting and passing to different contexts for modification via inode private pointers that aren't protected by locks... Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-31ext4 crypto: handle unexpected lack of encryption keysTheodore Ts'o1-7/+10
Fix up attempts by users to try to write to a file when they don't have access to the encryption key. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18ext4 crypto: reorganize how we store keys in the inodeTheodore Ts'o1-2/+2
This is a pretty massive patch which does a number of different things: 1) The per-inode encryption information is now stored in an allocated data structure, ext4_crypt_info, instead of directly in the node. This reduces the size usage of an in-memory inode when it is not using encryption. 2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode encryption structure instead. This remove an unnecessary memory allocation and free for the fname_crypto_ctx as well as allowing us to reuse the ctfm in a directory for multiple lookups and file creations. 3) We also cache the inode's policy information in the ext4_crypt_info structure so we don't have to continually read it out of the extended attributes. 4) We now keep the keyring key in the inode's encryption structure instead of releasing it after we are done using it to derive the per-inode key. This allows us to test to see if the key has been revoked; if it has, we prevent the use of the derived key and free it. 5) When an inode is released (or when the derived key is freed), we will use memset_explicit() to zero out the derived key, so it's not left hanging around in memory. This implies that when a user logs out, it is important to first revoke the key, and then unlink it, and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to release any decrypted pages and dcache entries from the system caches. 6) All this, and we also shrink the number of lines of code by around 100. :-) Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-19Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-3/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A few bug fixes and add support for file-system level encryption in ext4" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits) ext4 crypto: enable encryption feature flag ext4 crypto: add symlink encryption ext4 crypto: enable filename encryption ext4 crypto: filename encryption modifications ext4 crypto: partial update to namei.c for fname crypto ext4 crypto: insert encrypted filenames into a leaf directory block ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames ext4 crypto: filename encryption facilities ext4 crypto: implement the ext4 decryption read path ext4 crypto: implement the ext4 encryption write path ext4 crypto: inherit encryption policies on inode and directory create ext4 crypto: enforce context consistency ext4 crypto: add encryption key management facilities ext4 crypto: add ext4 encryption facilities ext4 crypto: add encryption policy and password salt support ext4 crypto: add encryption xattr support ext4 crypto: export ext4_empty_dir() ext4 crypto: add ext4 encryption Kconfig ext4 crypto: reserve codepoints used by the ext4 encryption feature ext4 crypto: add ext4_mpage_readpages() ...
2015-04-16Merge branch 'for-linus' of ↵Linus Torvalds1-16/+17
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull third hunk of vfs changes from Al Viro: "This contains the ->direct_IO() changes from Omar + saner generic_write_checks() + dealing with fcntl()/{read,write}() races (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of repeatedly looking at ->f_flags, which can be changed by fcntl(2), check ->ki_flags - which cannot) + infrastructure bits for dhowells' d_inode annotations + Christophs switch of /dev/loop to vfs_iter_write()" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits) block: loop: switch to VFS ITER_BVEC configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode VFS: Make pathwalk use d_is_reg() rather than S_ISREG() VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR() VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk NFS: Don't use d_inode as a variable name VFS: Impose ordering on accesses of d_inode and d_flags VFS: Add owner-filesystem positive/negative dentry checks nfs: generic_write_checks() shouldn't be done on swapout... ocfs2: use __generic_file_write_iter() mirror O_APPEND and O_DIRECT into iocb->ki_flags switch generic_write_checks() to iocb and iter ocfs2: move generic_write_checks() before the alignment checks ocfs2_file_write_iter: stop messing with ppos udf_file_write_iter: reorder and simplify fuse: ->direct_IO() doesn't need generic_write_checks() ext4_file_write_iter: move generic_write_checks() up xfs_file_aio_write_checks: switch to iocb/iov_iter generic_write_checks(): drop isblk argument blkdev_write_iter: expand generic_file_checks() call in there ...
2015-04-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-18/+1
Merge second patchbomb from Andrew Morton: - the rest of MM - various misc bits - add ability to run /sbin/reboot at reboot time - printk/vsprintf changes - fiddle with seq_printf() return value * akpm: (114 commits) parisc: remove use of seq_printf return value lru_cache: remove use of seq_printf return value tracing: remove use of seq_printf return value cgroup: remove use of seq_printf return value proc: remove use of seq_printf return value s390: remove use of seq_printf return value cris fasttimer: remove use of seq_printf return value cris: remove use of seq_printf return value openrisc: remove use of seq_printf return value ARM: plat-pxa: remove use of seq_printf return value nios2: cpuinfo: remove use of seq_printf return value microblaze: mb: remove use of seq_printf return value ipc: remove use of seq_printf return value rtc: remove use of seq_printf return value power: wakeup: remove use of seq_printf return value x86: mtrr: if: remove use of seq_printf return value linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK MAINTAINERS: CREDITS: remove Stefano Brivio from B43 .mailmap: add Ricardo Ribalda CREDITS: add Ricardo Ribalda Delgado ...
2015-04-15dax: unify ext2/4_{dax,}_file_operationsBoaz Harrosh1-20/+0
The original dax patchset split the ext2/4_file_operations because of the two NULL splice_read/splice_write in the dax case. In the vfs if splice_read/splice_write are NULL we then call default_splice_read/write. What we do here is make generic_file_splice_read aware of IS_DAX() so the original ext2/4_file_operations can be used as is. For write it appears that iter_file_splice_write is just fine. It uses the regular f_op->write(file,..) or new_sync_write(file, ...). Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15dax: use pfn_mkwrite to update c/mtime + freeze protectionBoaz Harrosh1-0/+1
From: Yigal Korman <yigal@plexistor.com> [v1] Without this patch, c/mtime is not updated correctly when mmap'ed page is first read from and then written to. A new xfstest is submitted for testing this (generic/080) [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: A user could start writing to the page while filesystem is frozen. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-12ext4 crypto: implement the ext4 decryption read pathMichael Halcrow1-2/+16
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11mirror O_APPEND and O_DIRECT into iocb->ki_flagsAl Viro1-2/+2
... avoiding write_iter/fcntl races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch generic_write_checks() to iocb and iterAl Viro1-14/+6
... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c
2015-04-11ext4_file_write_iter: move generic_write_checks() upAl Viro1-19/+20
simpler that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11generic_write_checks(): drop isblk argumentAl Viro1-1/+1
all remaining callers are passing 0; some just obscure that fact. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11lift generic_write_checks() into callers of __generic_file_write_iter()Al Viro1-3/+11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11make new_sync_{read,write}() staticAl Viro1-4/+0
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-02ext4: remove unused header filesSheng Yong1-1/+0
Remove unused header files and header files which are included in ext4.h. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-03-25fs: move struct kiocb to fs.hChristoph Hellwig1-1/+1
struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-16ext4: add DAX functionalityRoss Zwisler1-2/+47
This is a port of the DAX functionality found in the current version of ext2. [matthew.r.wilcox@intel.com: heavily tweaked] [akpm@linux-foundation.org: remap_pages went away] Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10mm: drop vm_ops->remap_pages and generic_file_remap_pages() stubKirill A. Shutemov1-1/+0
Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-02Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"Theodore Ts'o1-106/+114
This reverts commit 14516bb7bb6ffbd49f35389f9ece3b2045ba5815. This was causing regression test failures with generic/285 with an ext3 filesystem using CONFIG_EXT4_USE_FOR_EXT23. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02ext4: fix suboptimal seek_{data,hole} extents traversialDmitry Monakhov1-114/+106
It is ridiculous practice to scan inode block by block, this technique applicable only for old indirect files. This takes significant amount of time for really large files. Let's reuse ext4_fiemap which already traverse inode-tree in most optimal meaner. TESTCASE: ftruncate64(fd, 0); ftruncate64(fd, 1ULL << 40); /* lseek will spin very long time */ lseek64(fd, 0, SEEK_DATA); lseek64(fd, 0, SEEK_HOLE); Original report: https://lkml.org/lkml/2014/10/16/620 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: prevent bugon on race between write/fcntlDmitry Monakhov1-1/+1
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2014-07-15ext4: remove readpage() check in ext4_mmap_file()Zhang Zhen1-4/+0
There is no kind of file which does not supply a page reading function. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds1-21/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-12->splice_write() via ->write_iter()Al Viro1-1/+1
iter_file_splice_write() - a ->splice_write() instance that gathers the pipe buffers, builds a bio_vec-based iov_iter covering those and feeds it to ->write_iter(). A bunch of simple cases coverted to that... [AV: fixed the braino spotted by Cyrill] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-12ext4: add missing BUFFER_TRACE before ext4_journal_get_write_accessliang xie1-0/+1
Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: make local functions staticStephen Hemminger1-1/+1
I have been running make namespacecheck to look for unneeded globals, and found these in ext4. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-06convert ext4 to ->write_iter()Al Viro1-18/+11
unfortunately, Ted's changes to ext4_file_write() are *still* an incomplete fix - playing with rlimits can let you smuggle an unaligned request past the checks. So there almost certainly will be more merge PITA around that place... [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06Merge ext4 changes in ext4_file_write() into for-nextAl Viro1-75/+75
From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/
2014-05-06switch simple generic_file_aio_read() users to ->read_iter()Al Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-21ext4: fix locking for O_APPEND writesTheodore Ts'o1-16/+26
Al Viro pointed out that locking for O_APPEND writes was problematic, since the location of the write isn't known until after we take the i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes check. For O_APPEND always assume that the write is unaligned so call ext4_unwritten_wait(). And to solve the second problem, take the i_mutex earlier before we start the s_bitmap_maxbytes check. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-21ext4: factor out common code in ext4_file_write()Theodore Ts'o1-34/+22
This shouldn't change any logic flow; just delete duplicated code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21ext4: move ext4_file_dio_write() into ext4_file_write()Theodore Ts'o1-74/+64
This commit doesn't actually change anything; it just moves code around in preparation for some code simplification work. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21ext4: inline generic_file_aio_write() into ext4_file_write()Theodore Ts'o1-4/+16
Copy generic_file_aio_write() into ext4_file_write(). This is part of a patch series which allows us to simplify ext4_file_write() and ext4_file_dio_write(), by calling __generic_file_aio_write() directly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-20ext4: rename uninitialized extents to unwrittenLukas Czerner1-1/+1
Currently in ext4 there is quite a mess when it comes to naming unwritten extents. Sometimes we call it uninitialized and sometimes we refer to it as unwritten. The right name for the extent which has been allocated but does not contain any written data is _unwritten_. Other file systems are using this name consistently, even the buffer head state refers to it as unwritten. We need to fix this confusion in ext4. This commit changes every reference to an uninitialized extent (meaning allocated but unwritten) to unwritten extent. This includes comments, function names and variable names. It even covers abbreviation of the word uninitialized (such as uninit) and some misspellings. This commit does not change any of the code paths at all. This has been confirmed by comparing md5sums of the assembly code of each object file after all the function names were stripped from it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "These are regression and bug fixes for ext4. We had a number of new features in ext4 during this merge window (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so there were many more regression and bug fixes this time around. It didn't help that xfstests hadn't been fully updated to fully stress test COLLAPSE_RANGE until after -rc1" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits) ext4: disable COLLAPSE_RANGE for bigalloc ext4: fix COLLAPSE_RANGE failure with 1KB block size ext4: use EINVAL if not a regular file in ext4_collapse_range() ext4: enforce we are operating on a regular file in ext4_zero_range() ext4: fix extent merging in ext4_ext_shift_path_extents() ext4: discard preallocations after removing space ext4: no need to truncate pagecache twice in collapse range ext4: fix removing status extents in ext4_collapse_range() ext4: use filemap_write_and_wait_range() correctly in collapse range ext4: use truncate_pagecache() in collapse range ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled ext4: always check ext4_ext_find_extent result ext4: fix error handling in ext4_ext_shift_extents ext4: silence sparse check warning for function ext4_trim_extent ext4: COLLAPSE_RANGE only works on extent-based files ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches ext4: use i_size_read in ext4_unaligned_aio() fs: disallow all fallocate operation on active swapfile fs: move falloc collapse range check into the filesystem methods ...
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-12ext4: use i_size_read in ext4_unaligned_aio()Theodore Ts'o1-1/+1
We haven't taken i_mutex yet, so we need to use i_size_read(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-04-07mm: implement ->map_pages for page cacheKirill A. Shutemov1-0/+1
filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03fs/direct-io.c: remove some left over checksDan Carpenter1-1/+1
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-01kill the 4th argument of __generic_file_aio_write()Al Viro1-1/+1
It's always equal to &iocb->ki_pos, where iocb is the value of the 1st argument. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09fix O_SYNC|O_APPEND syncing the wrong range on write()Al Viro1-1/+1
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support) when sync_page_range() had been introduced; generic_file_write{,v}() correctly synced pos_after_write - written .. pos_after_write - 1 but generic_file_aio_write() synced pos_before_write .. pos_before_write + written - 1 instead. Which is not the same thing with O_APPEND, obviously. A couple of years later correct variant had been killed off when everything switched to use of generic_file_aio_write(). All users of generic_file_aio_write() are affected, and the same bug has been copied into other instances of ->aio_write(). The fix is trivial; the only subtle point is that generic_write_sync() ought to be inlined to avoid calculations useless for the majority of calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>