aboutsummaryrefslogtreecommitdiffstats
path: root/fs/jbd2
AgeCommit message (Collapse)AuthorFilesLines
2024-01-04jbd2: abort journal when detecting metadata writeback error of fs devZhihao Cheng1-0/+14
This is a replacement solution of commit bc71726c725767 ("ext4: abort the filesystem if failed to async write metadata buffer"), JBD2 can detect metadata writeback error of fs dev by 'j_fs_dev_wb_err'. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231213013224.2100050-5-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-04jbd2: remove unused 'JBD2_CHECKPOINT_IO_ERROR' and 'j_atomic_flags'Zhihao Cheng1-11/+0
Since 'JBD2_CHECKPOINT_IO_ERROR' and j_atomic_flags' are not useful anymore after fs dev's errseq is imported into jbd2, just remove them. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231213013224.2100050-4-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-04jbd2: replace journal state flag by checking errseqZhihao Cheng1-5/+5
Now JBD2 detects metadata writeback error of fs dev according to errseq. Replace journal state flag by checking errseq. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231213013224.2100050-3-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-01-04jbd2: add errseq to detect client fs's bdev writeback errorZhihao Cheng2-6/+2
Add errseq in journal, so that JBD2 can detect whether metadata is successfully written to fs bdev. This patch adds detection in recovery process to replace original solution(using local variable wb_err). Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231213013224.2100050-2-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-12-12jbd2: fix soft lockup in journal_finish_inode_data_buffers()Ye Bin1-0/+1
There's issue when do io test: WARN: soft lockup - CPU#45 stuck for 11s! [jbd2/dm-2-8:4170] CPU: 45 PID: 4170 Comm: jbd2/dm-2-8 Kdump: loaded Tainted: G OE Call trace: dump_backtrace+0x0/0x1a0 show_stack+0x24/0x30 dump_stack+0xb0/0x100 watchdog_timer_fn+0x254/0x3f8 __hrtimer_run_queues+0x11c/0x380 hrtimer_interrupt+0xfc/0x2f8 arch_timer_handler_phys+0x38/0x58 handle_percpu_devid_irq+0x90/0x248 generic_handle_irq+0x3c/0x58 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x90/0x320 el1_irq+0xcc/0x180 queued_spin_lock_slowpath+0x1d8/0x320 jbd2_journal_commit_transaction+0x10f4/0x1c78 [jbd2] kjournald2+0xec/0x2f0 [jbd2] kthread+0x134/0x138 ret_from_fork+0x10/0x18 Analyzed informations from vmcore as follows: (1) There are about 5k+ jbd2_inode in 'commit_transaction->t_inode_list'; (2) Now is processing the 855th jbd2_inode; (3) JBD2 task has TIF_NEED_RESCHED flag; (4) There's no pags in address_space around the 855th jbd2_inode; (5) There are some process is doing drop caches; (6) Mounted with 'nodioread_nolock' option; (7) 128 CPUs; According to informations from vmcore we know 'journal->j_list_lock' spin lock competition is fierce. So journal_finish_inode_data_buffers() maybe process slowly. Theoretically, there is scheduling point in the filemap_fdatawait_range_keep_errors(). However, if inode's address_space has no pages which taged with PAGECACHE_TAG_WRITEBACK, will not call cond_resched(). So may lead to soft lockup. journal_finish_inode_data_buffers filemap_fdatawait_range_keep_errors __filemap_fdatawait_range while (index <= end) nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, PAGECACHE_TAG_WRITEBACK); if (!nr_pages) break; --> If 'nr_pages' is equal zero will break, then will not call cond_resched() for (i = 0; i < nr_pages; i++) wait_on_page_writeback(page); cond_resched(); To solve above issue, add scheduling point in the journal_finish_inode_data_buffers(); Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231211112544.3879780-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-11-30jbd2: increase the journal IO's priorityZhang Yi2-13/+16
Current jbd2 only add REQ_SYNC for descriptor block, metadata log buffer, commit buffer and superblock buffer, the submitted IO could be throttled by writeback throttle in block layer, that could lead to priority inversion in some cases. The log IO looks like a kind of high priority metadata IO, so it should not be throttled by WBT like QOS policies in block layer, let's add REQ_SYNC | REQ_IDLE to exempt from writeback throttle, and also add REQ_META together indicates it's a metadata IO. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231129114740.2686201-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-11-30jbd2: correct the printing of write_flags in jbd2_write_superblock()Zhang Yi1-1/+3
The write_flags print in the trace of jbd2_write_superblock() is not real, so move the modification before the trace. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231129114740.2686201-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-11/+18
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-05jbd2: fix potential data lost in recovering journal raced with synchronizing ↵Zhihao Cheng1-0/+8
fs bdev JBD2 makes sure journal data is fallen on fs device by sync_blockdev(), however, other process could intercept the EIO information from bdev's mapping, which leads journal recovering successful even EIO occurs during data written back to fs device. We found this problem in our product, iscsi + multipath is chosen for block device of ext4. Unstable network may trigger kpartx to rescan partitions in device mapper layer. Detailed process is shown as following: mount kpartx irq jbd2_journal_recover do_one_pass memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal mark_buffer_dirty // mark bh dirty vfs_read generic_file_read_iter // dio filemap_write_and_wait_range __filemap_fdatawrite_range do_writepages block_write_full_folio submit_bh_wbc >> EIO occurs in disk << end_buffer_async_write mark_buffer_write_io_error mapping_set_error set_bit(AS_EIO, &mapping->flags) // set! filemap_check_errors test_and_clear_bit(AS_EIO, &mapping->flags) // clear! err2 = sync_blockdev filemap_write_and_wait filemap_check_errors test_and_clear_bit(AS_EIO, &mapping->flags) // false err2 = 0 Filesystem is mounted successfully even data from journal is failed written into disk, and ext4/ocfs2 could become corrupted. Fix it by comparing the wb_err state in fs block device before recovering and after recovering. A reproducer can be found in the kernel bugzilla referenced below. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888 Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230919012525.1783108-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-10-05jbd2: fix printk format type for 'io_block' in do_one_pass()Ye Bin1-1/+1
'io_block' is unsinged long but print it by '%ld'. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230904105817.1728356-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-10-05jbd2: print io_block if check data block checksum failed when do recoveryYe Bin1-1/+2
Now, if check data block checksum failed only print data's block number then skip write data. However, one data block may in more than one transaction. In some scenarios, offline analysis is inconvenient. As a result, it is difficult to locate the areas where data is faulty. So print 'io_block' if check data block checksum failed. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230904105817.1728356-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-10-04jbd2,ext4: dynamically allocate the jbd2-journal shrinkerQi Zheng1-11/+18
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the jbd2-journal shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct journal_s. Link: https://lkml.kernel.org/r/20230911094444.68966-32-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-17Merge tag 'ext4_for_linus-6.6-rc2' of ↵Linus Torvalds3-18/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Regression and bug fixes for ext4" * tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix rec_len verify error ext4: do not let fstrim block system suspend ext4: move setting of trimmed bit into ext4_try_to_trim_range() jbd2: Fix memory leak in journal_init_common() jbd2: Remove page size assumptions buffer: Make bh_offset() work for compound pages
2023-09-14jbd2: Fix memory leak in journal_init_common()Li Zetao1-0/+2
There is a memory leak reported by kmemleak: unreferenced object 0xff11000105903b80 (size 64): comm "mount", pid 3382, jiffies 4295032021 (age 27.826s) hex dump (first 32 bytes): 04 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffae86ac40>] __kmalloc_node+0x50/0x160 [<ffffffffaf2486d8>] crypto_alloc_tfmmem.isra.0+0x38/0x110 [<ffffffffaf2498e5>] crypto_create_tfm_node+0x85/0x2f0 [<ffffffffaf24a92c>] crypto_alloc_tfm_node+0xfc/0x210 [<ffffffffaedde777>] journal_init_common+0x727/0x1ad0 [<ffffffffaede1715>] jbd2_journal_init_inode+0x2b5/0x500 [<ffffffffaed786b5>] ext4_load_and_init_journal+0x255/0x2440 [<ffffffffaed8b423>] ext4_fill_super+0x8823/0xa330 ... The root cause was traced to an error handing path in journal_init_common() when malloc memory failed in register_shrinker(). The checksum driver is used to reference to checksum algorithm via cryptoapi and the user should release the memory when the driver is no longer needed or the journal initialization failed. Fix it by calling crypto_free_shash() on the "err_cleanup" error handing path in journal_init_common(). Fixes: c30713084ba5 ("jbd2: move load_superblock() into journal_init_common()") Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230911025138.983101-1-lizetao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-07jbd2: Remove page size assumptionsRitesh Harjani (IBM)2-18/+10
jbd2_alloc() allocates a buffer from slab when the block size is smaller than PAGE_SIZE, and slab may be using a compound page. Before commit 8147c4c4546f, we set b_page to the precise page containing the buffer and this code worked well. Now we set b_page to the head page of the allocation, so we can no longer use offset_in_page(). While we could do a 1:1 replacement with offset_in_folio(), use the more idiomatic bh_offset() and the folio APIs to map the buffer. This isn't enough to support a b_size larger than PAGE_SIZE on HIGHMEM machines, but this is good enough to fix the actual bug we're seeing. Fixes: 8147c4c4546f ("jbd2: use a folio in jbd2_journal_write_metadata_buffer()") Reported-by: Zorro Lang <zlang@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> [converted to be more folio] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2023-08-31Merge tag 'ext4_for_linus-6.6-rc1' of ↵Linus Torvalds3-281/+249
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many ext4 and jbd2 cleanups and bug fixes: - Cleanups in the ext4 remount code when going to and from read-only - Cleanups in ext4's multiblock allocator - Cleanups in the jbd2 setup/mounting code paths - Performance improvements when appending to a delayed allocation file - Miscellaneous syzbot and other bug fixes" * tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: fix slab-use-after-free in ext4_es_insert_extent() libfs: remove redundant checks of s_encoding ext4: remove redundant checks of s_encoding ext4: reject casefold inode flag without casefold feature ext4: use LIST_HEAD() to initialize the list_head in mballoc.c ext4: do not mark inode dirty every time when appending using delalloc ext4: rename s_error_work to s_sb_upd_work ext4: add periodic superblock update check ext4: drop dio overwrite only flag and associated warning ext4: add correct group descriptors and reserved GDT blocks to system zone ext4: remove unused function declaration ext4: mballoc: avoid garbage value from err ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() ext4: change the type of blocksize in ext4_mb_init_cache() ext4: fix unttached inode after power cut with orphan file feature enabled jbd2: correct the end of the journal recovery scan range ext4: ext4_get_{dev}_journal return proper error value ext4: cleanup ext4_get_dev_journal() and ext4_get_journal() jbd2: jbd2_journal_init_{dev,inode} return proper error return value jbd2: drop useless error tag in jbd2_journal_wipe() ...
2023-08-27jbd2: correct the end of the journal recovery scan rangeZhang Yi1-9/+3
We got a filesystem inconsistency issue below while running generic/475 I/O failure pressure test with fast_commit feature enabled. Symlink /p3/d3/d1c/d6c/dd6/dce/l101 (inode #132605) is invalid. If fast_commit feature is enabled, a special fast_commit journal area is appended to the end of the normal journal area. The journal->j_last point to the first unused block behind the normal journal area instead of the whole log area, and the journal->j_fc_last point to the first unused block behind the fast_commit journal area. While doing journal recovery, do_one_pass(PASS_SCAN) should first scan the normal journal area and turn around to the first block once it meet journal->j_last, but the wrap() macro misuse the journal->j_fc_last, so the recovering could not read the next magic block (commit block perhaps) and would end early mistakenly and missing tN and every transaction after it in the following example. Finally, it could lead to filesystem inconsistency. | normal journal area | fast commit area | +-------------------------------------------------+------------------+ | tN(rere) | tN+1 |~| tN-x |...| tN-1 | tN(front) | .... | +-------------------------------------------------+------------------+ / / / start journal->j_last journal->j_fc_last This patch fix it by use the correct ending journal->j_last. Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path") Cc: stable@kernel.org Reported-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/linux-ext4/20230613043120.GB1584772@mit.edu/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230626073322.3956567-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-23jbd2: jbd2_journal_init_{dev,inode} return proper error return valueZhang Yi1-10/+9
Current jbd2_journal_init_{dev,inode} return NULL if some error happens, make them to pass out proper error return value. [ Fix from Yang Yingliang folded in. ] Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-11-yi.zhang@huaweicloud.com Link: https://lore.kernel.org/r/20230822030018.644419-1-yangyingliang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-18jbd2: use a folio in jbd2_journal_write_metadata_buffer()Matthew Wilcox (Oracle)1-19/+16
The primary goal here is removing the use of set_bh_page(). Take the opportunity to switch from kmap_atomic() to kmap_local(). This simplifies the function as the offset is already added to the pointer. Link: https://lkml.kernel.org/r/20230713035512.4139457-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-15jbd2: drop useless error tag in jbd2_journal_wipe()Zhang Yi1-3/+2
no_recovery is redundant, just drop it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: cleanup journal_init_common()Zhang Yi1-21/+24
Adjust the initialization sequence and error handle of journal_t, moving load superblock to the begin, and classify others initialization. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: add fast_commit space checkZhang Yi1-4/+12
If JBD2_FEATURE_INCOMPAT_FAST_COMMIT bit is set, it means the journal have fast commit records need to recover, so the fast commit size should not be too large, and the leftover normal journal size should never less than JBD2_MIN_JOURNAL_BLOCKS. If it happens, the journal->j_last is likely to be wrong and will probably lead to incorrect journal recovery. So add a check into the journal_check_superblock(), and drop the pointless check when initializing the fastcommit parameters. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: cleanup load_superblock()Zhang Yi1-50/+35
Rename load_superblock() to journal_load_superblock(), move getting and reading superblock from journal_init_common() and journal_get_superblock() to this function, and also rename journal_get_superblock() to journal_check_superblock(), make it a pure check helper to check superblock validity from disk. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: open code jbd2_verify_csum_type() helperZhang Yi1-13/+5
jbd2_verify_csum_type() helper check checksum type in the superblock for v2 or v3 checksum feature, it always return true if these features are not enabled, and it has only one user, so open code it is more clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: checking valid features early in journal_get_superblock()Zhang Yi1-15/+15
journal_get_superblock() is used to check validity of the jounal supberblock, so move the features checks from jbd2_journal_load() to journal_get_superblock(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: don't load superblock in jbd2_journal_check_used_features()Zhang Yi1-5/+0
Since load_superblock() has been moved to journal_init_common(), the in-memory superblock structure is initialized and contains valid data once the file system has a journal_t object, so it's safe to access it, let's drop the call to journal_get_superblock() from jbd2_journal_check_used_features() and also drop the setting/clearing of the veirfy bit of the superblock buffer. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() into journal_init_common()Zhang Yi1-11/+5
Move the call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() early into journal_init_common(), the journal superblock gets read and the in-memory journal_t structure gets initialised after calling jbd2_journal_init_{dev,inode}, it's safe to do following initialization according to it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-15jbd2: move load_superblock() dependent functionsZhang Yi1-169/+168
Move load_superblock() declaration and the functions it calls before journal_init_common(). This is a preparation for moving a call to load_superblock() from jbd2_journal_load() and jbd2_journal_wipe() to journal_init_common(). No functional changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: remove unused function '__cp_buffer_busy'Yang Li1-12/+0
The code calling function '__cp_buffer_busy' has been removed, so the function should also be removed. silence the warning: fs/jbd2/checkpoint.c:48:20: warning: unused function '__cp_buffer_busy' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5518 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: check 'jh->b_transaction' before removing it from checkpointZhihao Cheng1-0/+2
Following process will corrupt ext4 image: Step 1: jbd2_journal_commit_transaction __jbd2_journal_insert_checkpoint(jh, commit_transaction) // Put jh into trans1->t_checkpoint_list journal->j_checkpoint_transactions = commit_transaction // Put trans1 into journal->j_checkpoint_transactions Step 2: do_get_write_access test_clear_buffer_dirty(bh) // clear buffer dirty,set jbd dirty __jbd2_journal_file_buffer(jh, transaction) // jh belongs to trans2 Step 3: drop_cache journal_shrink_one_cp_list jbd2_journal_try_remove_checkpoint if (!trylock_buffer(bh)) // lock bh, true if (buffer_dirty(bh)) // buffer is not dirty __jbd2_journal_remove_checkpoint(jh) // remove jh from trans1->t_checkpoint_list Step 4: jbd2_log_do_checkpoint trans1 = journal->j_checkpoint_transactions // jh is not in trans1->t_checkpoint_list jbd2_cleanup_journal_tail(journal) // trans1 is done Step 5: Power cut, trans2 is not committed, jh is lost in next mounting. Fix it by checking 'jh->b_transaction' before remove it from checkpoint. Cc: stable@kernel.org Fixes: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-04jbd2: fix checkpoint cleanup performance regressionZhang Yi1-6/+14
journal_clean_one_cp_list() has been merged into journal_shrink_one_cp_list(), but do chekpoint buffer cleanup from the committing process is just a best effort, it should stop scan once it meet a busy buffer, or else it will cause a lot of invalid buffer scan and checks. We catch a performance regression when doing fs_mark tests below. Test cmd: ./fs_mark -d scratch -s 1024 -n 10000 -t 1 -D 100 -N 100 Before merging checkpoint buffer cleanup: FSUse% Count Size Files/sec App Overhead 95 10000 1024 8304.9 49033 After merging checkpoint buffer cleanup: FSUse% Count Size Files/sec App Overhead 95 10000 1024 7649.0 50012 FSUse% Count Size Files/sec App Overhead 95 10000 1024 2107.1 50871 After merging checkpoint buffer cleanup, the total loop count in journal_shrink_one_cp_list() could be up to 6,261,600+ (50,000+ ~ 100,000+ in general), most of them are invalid. This patch fix it through passing 'shrink_type' into journal_shrink_one_cp_list() and add a new 'SHRINK_BUSY_STOP' to indicate it should stop once meet a busy buffer. After fix, the loop count descending back to 10,000+. After this fix: FSUse% Count Size Files/sec App Overhead 95 10000 1024 8558.4 49109 Cc: stable@kernel.org Fixes: b98dba273a0e ("jbd2: remove journal_clean_one_cp_list()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230714025528.564988-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-29fs: jbd2: fix an incorrect warn logGuoqing Cai1-8/+10
In jbd2_journal_load(), when journal_reset fails, it prints an incorrect warn log. Fix this by changing the goto statement to return statement. Also, return actual error code from jbd2_journal_recover() and journal_reset(). Signed-off-by: Guoqing Cai <u202112087@hust.edu.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230413095740.2222066-1-u202112087@hust.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: remove __journal_try_to_free_buffer()Zhang Yi1-24/+7
__journal_try_to_free_buffer() has only one caller and it's logic is much simple now, so just remove it and open code in jbd2_journal_try_to_free_buffers(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: fix a race when checking checkpoint buffer busyZhang Yi2-15/+40
Before removing checkpoint buffer from the t_checkpoint_list, we have to check both BH_Dirty and BH_Lock bits together to distinguish buffers have not been or were being written back. But __cp_buffer_busy() checks them separately, it first check lock state and then check dirty, the window between these two checks could be raced by writing back procedure, which locks buffer and clears buffer dirty before I/O completes. So it cannot guarantee checkpointing buffers been written back to disk if some error happens later. Finally, it may clean checkpoint transactions and lead to inconsistent filesystem. jbd2_journal_forget() and __journal_try_to_free_buffer() also have the same problem (journal_unmap_buffer() escape from this issue since it's running under the buffer lock), so fix them through introducing a new helper to try holding the buffer lock and remove really clean buffer. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490 Cc: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: Fix wrongly judgement for buffer head removing while doing checkpointZhihao Cheng1-15/+17
Following process, jbd2_journal_commit_transaction // there are several dirty buffer heads in transaction->t_checkpoint_list P1 wb_workfn jbd2_log_do_checkpoint if (buffer_locked(bh)) // false __block_write_full_page trylock_buffer(bh) test_clear_buffer_dirty(bh) if (!buffer_dirty(bh)) __jbd2_journal_remove_checkpoint(jh) if (buffer_write_io_error(bh)) // false >> bh IO error occurs << jbd2_cleanup_journal_tail __jbd2_update_log_tail jbd2_write_superblock // The bh won't be replayed in next mount. , which could corrupt the ext4 image, fetch a reproducer in [Link]. Since writeback process clears buffer dirty after locking buffer head, we can fix it by try locking buffer and check dirtiness while buffer is locked, the buffer head can be removed if it is neither dirty nor locked. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490 Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: remove journal_clean_one_cp_list()Zhang Yi1-58/+17
journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost the same, so merge them into journal_shrink_one_cp_list(), remove the nr_to_scan parameter, always scan and try to free the whole checkpoint list. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: remove t_checkpoint_io_listZhang Yi2-42/+3
Since t_checkpoint_io_list was stop using in jbd2_log_do_checkpoint() now, it's time to remove the whole t_checkpoint_io_list logic. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-10jbd2: recheck chechpointing non-dirty bufferZhang Yi1-73/+29
There is a long-standing metadata corruption issue that happens from time to time, but it's very difficult to reproduce and analyse, benefit from the JBD2_CYCLE_RECORD option, we found out that the problem is the checkpointing process miss to write out some buffers which are raced by another do_get_write_access(). Looks below for detail. jbd2_log_do_checkpoint() //transaction X //buffer A is dirty and not belones to any transaction __buffer_relink_io() //move it to the IO list __flush_batch() write_dirty_buffer() do_get_write_access() clear_buffer_dirty __jbd2_journal_file_buffer() //add buffer A to a new transaction Y lock_buffer(bh) //doesn't write out __jbd2_journal_remove_checkpoint() //finish checkpoint except buffer A //filesystem corrupt if the new transaction Y isn't fully write out. Due to the t_checkpoint_list walking loop in jbd2_log_do_checkpoint() have already handles waiting for buffers under IO and re-added new transaction to complete commit, and it also removing cleaned buffers, this makes sure the list will eventually get empty. So it's fine to leave buffers on the t_checkpoint_list while flushing out and completely stop using the t_checkpoint_io_list. Cc: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230606135928.434610-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-29Merge tag 'ext4_for_linus' of ↵Linus Torvalds2-44/+56
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various cleanups and bug fixes in ext4's extent status tree, journalling, and block allocator subsystems. Also improve performance for parallel DIO overwrites" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (55 commits) ext4: avoid updating the superblock on a r/o mount if not needed jbd2: skip reading super block if it has been verified ext4: fix to check return value of freeze_bdev() in ext4_shutdown() ext4: refactoring to use the unified helper ext4_quotas_off() ext4: turn quotas off if mount failed after enabling quotas ext4: update doc about journal superblock description ext4: add journal cycled recording support jbd2: continue to record log between each mount jbd2: remove j_format_version jbd2: factor out journal initialization from journal_get_superblock() jbd2: switch to check format version in superblock directly jbd2: remove unused feature macros ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev' ext4: Fix reusing stale buffer heads from last failed mounting ext4: allow concurrent unaligned dio overwrites ext4: clean up mballoc criteria comments ext4: make ext4_zeroout_es() return void ext4: make ext4_es_insert_extent() return void ext4: make ext4_es_insert_delayed_block() return void ext4: make ext4_es_remove_extent() return void ...
2023-06-26jbd2: skip reading super block if it has been verifiedZhang Yi1-4/+3
We got a NULL pointer dereference issue below while running generic/475 I/O failure pressure test. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP PTI CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190 RIP: 0010:jbd2_journal_set_features+0x13d/0x430 ... Call Trace: <TASK> ? __die+0x23/0x60 ? page_fault_oops+0xa4/0x170 ? exc_page_fault+0x67/0x170 ? asm_exc_page_fault+0x26/0x30 ? jbd2_journal_set_features+0x13d/0x430 jbd2_journal_revoke+0x47/0x1e0 __ext4_forget+0xc3/0x1b0 ext4_free_blocks+0x214/0x2f0 ext4_free_branches+0xeb/0x270 ext4_ind_truncate+0x2bf/0x320 ext4_truncate+0x1e4/0x490 ext4_handle_inode_extension+0x1bd/0x2a0 ? iomap_dio_complete+0xaf/0x1d0 The root cause is the journal super block had been failed to write out due to I/O fault injection, it's uptodate bit was cleared by end_buffer_write_sync() and didn't reset yet in jbd2_write_superblock(). And it raced by journal_get_superblock()->bh_read(), unfortunately, the read IO is also failed, so the error handling in journal_fail_superblock() unexpectedly clear the journal->j_sb_buffer, finally lead to above NULL pointer dereference issue. If the journal super block had been read and verified, there is no need to call bh_read() read it again even if it has been failed to written out. So the fix could be simply move buffer_verified(bh) in front of bh_read(). Also remove a stale comment left in jbd2_journal_check_used_features(). Fixes: 51bacdba23d8 ("jbd2: factor out journal initialization from journal_get_superblock()") Reported-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616015547.3155195-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26jbd2: continue to record log between each mountZhang Yi2-7/+33
For a newly mounted file system, the journal committing thread always record new transactions from the start of the journal area, no matter whether the journal was clean or just has been recovered. So the logdump code in debugfs cannot dump continuous logs between each mount, it is disadvantageous to analysis corrupted file system image and locate the file system inconsistency bugs. If we get a corrupted file system in the running products and want to find out what has happened, besides lookup the system log, one effective way is to backtrack the journal log. But we may not always run e2fsck before each mount and the default fsck -a mode also cannot always checkout all inconsistencies, so it could left over some inconsistencies into the next mount until we detect it. Finally, transactions in the journal may probably discontinuous and some relatively new transactions has been covered, it becomes hard to analyse. If we could record transactions continuously between each mount, we could acquire more useful info from the journal. Like this: |Previous mount checkpointed/recovered logs|Current mount logs | |{------}{---}{--------} ... {------}| ... |{======}{========}...000000| And yes the journal area is limited and cannot record everything, the problematic transaction may also be covered even if we do this, but this is still useful for fuzzy tests and short-running products. This patch save the head blocknr in the superblock after flushing the journal or unmounting the file system, let the next mount could continue to record new transaction behind it. This change is backward compatible because the old kernel does not care about the head blocknr of the journal. It is also fine if we mount a clean old image without valid head blocknr, we fail back to set it to s_first just like before. Finally, for the case of mount an unclean file system, we could also get the journal head easily after scanning/replaying the journal, it will continue to record new transaction after the recovered transactions. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230322013353.1843306-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26jbd2: remove j_format_versionZhang Yi1-9/+0
journal->j_format_version is no longer used, remove it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-7-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26jbd2: factor out journal initialization from journal_get_superblock()Zhang Yi1-24/+22
Current journal_get_superblock() couple journal superblock checking and partial journal initialization, factor out initialization part from it to make things clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-6-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26jbd2: switch to check format version in superblock directlyZhang Yi1-9/+7
We should only check and set extented features if journal format version is 2, and now we check the in memory copy of the superblock 'journal->j_format_version', which relys on the parameter initialization sequence, switch to use the h_blocktype in superblock cloud be more clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-5-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-05jbd2: Avoid printing outside the boundary of the bufferAndy Shevchenko1-4/+2
Theoretically possible that "%pg" will take all room for the j_devname and hence the "-%lu" will go outside the boundary due to unconditional sprintf() in use. To make this code more robust, replace two sequential s*printf():s by a single call and then replace forbidden character. It's possible to do this way, because '/' won't ever be in the result of "-%lu". Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605170553.7835-2-andriy.shevchenko@linux.intel.com
2023-04-14jdb2: Don't refuse invalidation of already invalidated buffersJan Kara1-0/+3
When invalidating buffers under the partial tail page, jbd2_journal_invalidate_folio() returns -EBUSY if the buffer is part of the committing transaction as we cannot safely modify buffer state. However if the buffer is already invalidated (due to previous invalidation attempts from ext4_wait_for_tail_page_commit()), there's nothing to do and there's no point in returning -EBUSY. This fixes occasional warnings from ext4_journalled_invalidate_folio() triggered by generic/051 fstest when blocksize < pagesize. Fixes: 53e872681fed ("ext4: fix deadlock in journal_unmap_buffer()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-12Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Bug fixes and regressions for ext4, the most serious of which is a potential deadlock during directory renames that was introduced during the merge window discovered by a combination of syzbot and lockdep" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: zero i_disksize when initializing the bootloader inode ext4: make sure fs error flag setted before clear journal error ext4: commit super block if fs record error when journal record without error ext4, jbd2: add an optimized bmap for the journal inode ext4: fix WARNING in ext4_update_inline_data ext4: move where set the MAY_INLINE_DATA flag is set ext4: Fix deadlock during directory rename ext4: Fix comment about the 64BIT feature docs: ext4: modify the group desc size to 64 ext4: fix another off-by-one fsmap error on 1k block filesystems ext4: fix RENAME_WHITEOUT handling for inline directories ext4: make kobj_type structures constant ext4: fix cgroup writeback accounting with fs-layer encryption
2023-03-11ext4, jbd2: add an optimized bmap for the journal inodeTheodore Ts'o1-3/+6
The generic bmap() function exported by the VFS takes locks and does checks that are not necessary for the journal inode. So allow the file system to set a journal-optimized bmap function in journal->j_bmap. Reported-by: syzbot+9543479984ae9e576000@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=e4aaa78795e490421c79f76ec3679006c8ff4cf0 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-02-28Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-21/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improve performance for ext4 by allowing multiple process to perform direct I/O writes to preallocated blocks by using a shared inode lock instead of taking an exclusive lock. In addition, multiple bug fixes and cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix incorrect options show of original mount_opt and extend mount_opt2 ext4: Fix possible corruption when moving a directory ext4: init error handle resource before init group descriptors ext4: fix task hung in ext4_xattr_delete_inode jbd2: fix data missing when reusing bh which is ready to be checkpointed ext4: update s_journal_inum if it changes after journal replay ext4: fail ext4_iget if special inode unallocated ext4: fix function prototype mismatch for ext4_feat_ktype ext4: remove unnecessary variable initialization ext4: fix inode tree inconsistency caused by ENOMEM ext4: refuse to create ea block when umounted ext4: optimize ea_inode block expansion ext4: remove dead code in updating backup sb ext4: dio take shared inode lock when overwriting preallocated blocks ext4: don't show commit interval if it is zero ext4: use ext4_fc_tl_mem in fast-commit replay path ext4: improve xattr consistency checking and error reporting
2023-02-19jbd2: fix data missing when reusing bh which is ready to be checkpointedZhihao Cheng1-21/+29
Following process will make data lost and could lead to a filesystem corrupted problem: 1. jh(bh) is inserted into T1->t_checkpoint_list, bh is dirty, and jh->b_transaction = NULL 2. T1 is added into journal->j_checkpoint_transactions. 3. Get bh prepare to write while doing checkpoing: PA PB do_get_write_access jbd2_log_do_checkpoint spin_lock(&jh->b_state_lock) if (buffer_dirty(bh)) clear_buffer_dirty(bh) // clear buffer dirty set_buffer_jbddirty(bh) transaction = journal->j_checkpoint_transactions jh = transaction->t_checkpoint_list if (!buffer_dirty(bh)) __jbd2_journal_remove_checkpoint(jh) // bh won't be flushed jbd2_cleanup_journal_tail __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved) 4. Aborting journal/Power-cut before writing latest bh on journal area. In this way we get a corrupted filesystem with bh's data lost. Fix it by moving the clearing of buffer_dirty bit just before the call to __jbd2_journal_file_buffer(), both bit clearing and jh->b_transaction assignment are under journal->j_list_lock locked, so that jbd2_log_do_checkpoint() will wait until jh's new transaction fininshed even bh is currently not dirty. And journal_shrink_one_cp_list() won't remove jh from checkpoint list if the buffer head is reused in do_get_write_access(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216898 Cc: <stable@kernel.org> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230110015327.1181863-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-01-18jbd2,ocfs2: move jbd2_journal_submit_inode_data_buffers to ocfs2Christoph Hellwig2-26/+0
jbd2_journal_submit_inode_data_buffers is only used by ocfs2, so move it there to prepare for removing generic_writepages. Link: https://lkml.kernel.org/r/20221229161031.391878-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18jbd2: replace obvious uses of b_page with b_folioMatthew Wilcox (Oracle)2-7/+3
These places just use b_page to get to the buffer's address_space or have already been converted to folio. Link: https://lkml.kernel.org/r/20221215214402.3522366-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-08jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeoutJan Kara1-3/+2
jbd2_submit_inode_data() hardcoded use of jbd2_journal_submit_inode_data_buffers() for submission of data pages. Make it use j_submit_inode_data_buffers hook instead. This effectively switches ext4 fastcommits to use ext4_writepages() for data writeout instead of generic_writepages(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221207112722.22220-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds2-15/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-09-30jbd2: add miss release buffer head in fc_do_one_pass()Ye Bin1-0/+1
In fc_do_one_pass() miss release buffer head after use which will lead to reference count leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220917093805.1782845-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30jbd2: fix potential use-after-free in jbd2_fc_wait_bufsYe Bin1-3/+3
In 'jbd2_fc_wait_bufs' use 'bh' after put buffer head reference count which may lead to use-after-free. So judge buffer if uptodate before put buffer head reference count. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30jbd2: fix potential buffer head reference count leakYe Bin1-1/+7
As in 'jbd2_fc_wait_bufs' if buffer isn't uptodate, will return -EIO without update 'journal->j_fc_off'. But 'jbd2_fc_release_bufs' will release buffer head from ‘j_fc_off - 1’ if 'bh' is NULL will terminal release which will lead to buffer head buffer head reference count leak. To solve above issue, update 'journal->j_fc_off' before return -EIO. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30jbd2: wake up journal waiters in FIFO order, not LIFOAndrew Perepechko2-4/+4
LIFO wakeup order is unfair and sometimes leads to a journal user not being able to get a journal handle for hundreds of transactions in a row. FIFO wakeup can make things more fair. Cc: stable@kernel.org Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220907165959.1137482-1-alexey.lyashkov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29jbd2: drop useless return value of submit_bhRitesh Harjani (IBM)2-11/+8
submit_bh always returns 0. This patch cleans up 2 of it's caller in jbd2 to drop submit_bh's useless return value. Once all submit_bh callers are cleaned up, we can make it's return type as void. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/e069c0539be0aec61abcdc6f6141982ec85d489d.1660788334.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-11jbd2: replace ll_rw_block()Zhang Yi2-15/+16
ll_rw_block() is not safe for the sync read path because it cannot guarantee that submitting read IO if the buffer has been locked. We could get false positive EIO after wait_on_buffer() if the buffer has been locked by others. So stop using ll_rw_block() in journal_get_superblock(). We also switch to new bh_readahead_batch() for the buffer array readahead path. Link: https://lkml.kernel.org/r/20220901133505.2510834-7-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-05Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-04Merge tag 'ext4_for_linus' of ↵Linus Torvalds6-75/+82
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Add new ioctls to set and get the file system UUID in the ext4 superblock and improved the performance of the online resizing of file systems with bigalloc enabled. Fixed a lot of bugs, in particular for the inline data feature, potential races when creating and deleting inodes with shared extended attribute blocks, and the handling of directory blocks which are corrupted" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits) ext4: add ioctls to get/set the ext4 superblock uuid ext4: avoid resizing to a partial cluster size ext4: reduce computation of overhead during resize jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted ext4: block range must be validated before use in ext4_mb_clear_bb() mbcache: automatically delete entries from cache on freeing mbcache: Remove mb_cache_entry_delete() ext2: avoid deleting xattr block that is being reused ext2: unindent codeblock in ext2_xattr_set() ext2: factor our freeing of xattr block reference ext4: fix race when reusing xattr blocks ext4: unindent codeblock in ext4_xattr_block_set() ext4: remove EA inode entry from mbcache on inode eviction mbcache: add functions to delete entry if unused mbcache: don't reclaim used entries ext4: make sure ext4_append() always allocates new block ext4: check if directory block is within i_size ext4: reflect mb_optimize_scan value in options file ext4: avoid remove directory when directory is corrupted ext4: aligned '*' in comments ...
2022-08-02jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal abortedZhihao Cheng1-2/+12
Following process will fail assertion 'jh->b_frozen_data == NULL' in jbd2_journal_dirty_metadata(): jbd2_journal_commit_transaction unlink(dir/a) jh->b_transaction = trans1 jh->b_jlist = BJ_Metadata journal->j_running_transaction = NULL trans1->t_state = T_COMMIT unlink(dir/b) handle->h_trans = trans2 do_get_write_access jh->b_modified = 0 jh->b_frozen_data = frozen_buffer jh->b_next_transaction = trans2 jbd2_journal_dirty_metadata is_handle_aborted is_journal_aborted // return false --> jbd2 abort <-- while (commit_transaction->t_buffers) if (is_journal_aborted) jbd2_journal_refile_buffer __jbd2_journal_refile_buffer WRITE_ONCE(jh->b_transaction, jh->b_next_transaction) WRITE_ONCE(jh->b_next_transaction, NULL) __jbd2_journal_file_buffer(jh, BJ_Reserved) J_ASSERT_JH(jh, jh->b_frozen_data == NULL) // assertion failure ! The reproducer (See detail in [Link]) reports: ------------[ cut here ]------------ kernel BUG at fs/jbd2/transaction.c:1629! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 2 PID: 584 Comm: unlink Tainted: G W 5.19.0-rc6-00115-g4a57a8400075-dirty #697 RIP: 0010:jbd2_journal_dirty_metadata+0x3c5/0x470 RSP: 0018:ffffc90000be7ce0 EFLAGS: 00010202 Call Trace: <TASK> __ext4_handle_dirty_metadata+0xa0/0x290 ext4_handle_dirty_dirblock+0x10c/0x1d0 ext4_delete_entry+0x104/0x200 __ext4_unlink+0x22b/0x360 ext4_unlink+0x275/0x390 vfs_unlink+0x20b/0x4c0 do_unlinkat+0x42f/0x4c0 __x64_sys_unlink+0x37/0x50 do_syscall_64+0x35/0x80 After journal aborting, __jbd2_journal_refile_buffer() is executed with holding @jh->b_state_lock, we can fix it by moving 'is_handle_aborted()' into the area protected by @jh->b_state_lock. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216251 Fixes: 470decc613ab20 ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20220715125152.4022726-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02jbd2: fix outstanding credits assert in jbd2_journal_commit_transaction()Zhang Yi1-1/+1
We catch an assert problem in jbd2_journal_commit_transaction() when doing fsstress and request falut injection tests. The problem is happened in a race condition between jbd2_journal_commit_transaction() and ext4_end_io_end(). Firstly, ext4_writepages() writeback dirty pages and start reserved handle, and then the journal was aborted due to some previous metadata IO error, jbd2_journal_abort() start to commit current running transaction, the committing procedure could be raced by ext4_end_io_end() and lead to subtract j_reserved_credits twice from commit_transaction->t_outstanding_credits, finally the t_outstanding_credits is mistakenly smaller than t_nr_buffers and trigger assert. kjournald2 kworker jbd2_journal_commit_transaction() write_unlock(&journal->j_state_lock); atomic_sub(j_reserved_credits, t_outstanding_credits); //sub once jbd2_journal_start_reserved() start_this_handle() //detect aborted journal jbd2_journal_free_reserved() //get running transaction read_lock(&journal->j_state_lock) __jbd2_journal_unreserve_handle() atomic_sub(j_reserved_credits, t_outstanding_credits); //sub again read_unlock(&journal->j_state_lock); journal->j_running_transaction = NULL; J_ASSERT(t_nr_buffers <= t_outstanding_credits) //bomb!!! Fix this issue by using journal->j_state_lock to protect the subtraction in jbd2_journal_commit_transaction(). Fixes: 96f1e0974575 ("jbd2: avoid long hold times of j_state_lock while committing a transaction") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220611130426.2013258-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02jbd2: unexport jbd2_log_start_commit()Jan Kara1-2/+1
jbd2_log_start_commit() is not used outside of jbd2 so unexport it. Also make __jbd2_log_start_commit() static when we are at it. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220608112355.4397-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02jbd2: remove unused exports for jbd2 debuggingJan Kara1-3/+1
Jbd2 exports jbd2_journal_enable_debug and __jbd2_debug() depite the first is used only in fs/jbd2/journal.c and the second only within jbd2 code. Remove the pointless exports make jbd2_journal_enable_debug static. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220608112355.4397-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02jbd2: rename jbd_debug() to jbd2_debug()Jan Kara6-67/+67
The name of jbd_debug() is confusing as all functions inside jbd2 have jbd2_ prefix. Rename jbd_debug() to jbd2_debug(). No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220608112355.4397-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-07-14fs/jbd2: Fix the documentation of the jbd2_write_superblock() callersBart Van Assche1-7/+8
Commit 2a222ca992c3 ("fs: have submit_bh users pass in op and flags separately") renamed the jbd2_write_superblock() 'write_op' argument into 'write_flags'. Propagate this change to the jbd2_write_superblock() callers. Additionally, change the type of 'write_flags' into blk_opf_t. Cc: Mike Christie <michael.christie@oracle.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-57-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14fs/buffer: Combine two submit_bh() and ll_rw_block() argumentsBart Van Assche3-8/+8
Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: remove bdevnameChristoph Hellwig1-2/+4
Replace the remaining calls of bdevname with snprintf using the %pg format specifier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-03mm: shrinkers: provide shrinkers with namesRoman Gushchin1-1/+2
Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16fs: fix jbd2_journal_try_to_free_buffers() kernel-doc commentYang Li1-1/+1
Add the description of @folio and remove @page in function kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/jbd2/transaction.c:2149: warning: Function parameter or member 'folio' not described in 'jbd2_journal_try_to_free_buffers' fs/jbd2/transaction.c:2149: warning: Excess function parameter 'page' description in 'jbd2_journal_try_to_free_buffers' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220512075432.31763-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2-13/+15
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-23Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds1-6/+3
Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...
2022-05-09fs: Change try_to_free_buffers() to take a folioMatthew Wilcox (Oracle)2-3/+3
All but two of the callers already have a folio; pass a folio into try_to_free_buffers(). This removes the last user of cancel_dirty_page() so remove that wrapper function too. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09jbd2: Convert release_buffer_page() to use a folioMatthew Wilcox (Oracle)1-6/+8
Saves a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09jbd2: Convert jbd2_journal_try_to_free_buffers to take a folioMatthew Wilcox (Oracle)1-6/+6
Also convert it to return a bool since it's called from release_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-04-22Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix some syzbot-detected bugs, as well as other bugs found by I/O injection testing. Change ext4's fallocate to consistently drop set[ug]id bits when an fallocate operation might possibly change the user-visible contents of a file. Also, improve handling of potentially invalid values in the the s_overhead_cluster superblock field to avoid ext4 returning a negative number of free blocks" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix a potential race while discarding reserved buffers after an abort ext4: update the cached overhead value in the superblock ext4: force overhead calculation if the s_overhead_cluster makes no sense ext4: fix overhead calculation to account for the reserved gdt blocks ext4, doc: fix incorrect h_reserved size ext4: limit length to bitmap_maxbytes - blocksize in punch_hole ext4: fix use-after-free in ext4_search_dir ext4: fix bug_on in start_this_handle during umount filesystem ext4: fix symlink file size not match to file content ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-21jbd2: fix a potential race while discarding reserved buffers after an abortYe Bin1-1/+3
we got issue as follows: [ 72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal [ 72.826847] EXT4-fs (sda): Remounting filesystem read-only fallocate: fallocate failed: Read-only file system [ 74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay [ 74.793597] ------------[ cut here ]------------ [ 74.794203] kernel BUG at fs/jbd2/transaction.c:2063! [ 74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150 [ 74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60 [ 74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202 [ 74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90 [ 74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20 [ 74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90 [ 74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00 [ 74.807301] FS: 0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000 [ 74.808338] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0 [ 74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 74.811897] Call Trace: [ 74.812241] <TASK> [ 74.812566] __jbd2_journal_refile_buffer+0x12f/0x180 [ 74.813246] jbd2_journal_refile_buffer+0x4c/0xa0 [ 74.813869] jbd2_journal_commit_transaction.cold+0xa1/0x148 [ 74.817550] kjournald2+0xf8/0x3e0 [ 74.819056] kthread+0x153/0x1c0 [ 74.819963] ret_from_fork+0x22/0x30 Above issue may happen as follows: write truncate kjournald2 generic_perform_write ext4_write_begin ext4_walk_page_buffers do_journal_get_write_access ->add BJ_Reserved list ext4_journalled_write_end ext4_walk_page_buffers write_end_fn ext4_handle_dirty_metadata ***************JBD2 ABORT************** jbd2_journal_dirty_metadata -> return -EROFS, jh in reserved_list jbd2_journal_commit_transaction while (commit_transaction->t_reserved_list) jh = commit_transaction->t_reserved_list; truncate_pagecache_range do_invalidatepage ext4_journalled_invalidatepage jbd2_journal_invalidatepage journal_unmap_buffer __dispose_buffer __jbd2_journal_unfile_buffer jbd2_journal_put_journal_head ->put last ref_count __journal_remove_journal_head bh->b_private = NULL; jh->b_bh = NULL; jbd2_journal_refile_buffer(journal, jh); bh = jh2bh(jh); ->bh is NULL, later will trigger null-ptr-deref journal_free_journal_head(jh); After commit 96f1e0974575, we no longer hold the j_state_lock while iterating over the list of reserved handles in jbd2_journal_commit_transaction(). This potentially allows the journal_head to be freed by journal_unmap_buffer while the commit codepath is also trying to free the BJ_Reserved buffers. Keeping j_state_lock held while trying extends hold time of the lock minimally, and solves this issue. Fixes: 96f1e0974575("jbd2: avoid long hold times of j_state_lock while committing a transaction") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-04-17block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARDChristoph Hellwig1-1/+1
Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: remove QUEUE_FLAG_DISCARDChristoph Hellwig1-5/+2
Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2-17/+16
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-15ext4: Convert invalidatepage to invalidate_folioMatthew Wilcox (Oracle)2-17/+16
Extensive changes, but fairly mechanical. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-02-25jbd2: remove CONFIG_JBD2_DEBUG to update t_max_waitRitesh Harjani1-4/+3
CONFIG_JBD2_DEBUG and jbd2_journal_enable_debug knobs were added in update_t_max_wait(), since earlier it used to take a spinlock for updating t_max_wait, which could cause a bottleneck while starting a txn (start_this_handle()). Since in previous patch, we have killed t_handle_lock completely, we could get rid of this debug config and knob to let t_max_wait be updated by default again. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/ad7319a601fd501079310747ce87d908e0944763.1644992076.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-25jbd2: kill t_handle_lock transaction spinlockRitesh Harjani1-19/+9
This patch kills t_handle_lock transaction spinlock completely from jbd2. To explain the reasoning, currently there were three sites at which this spinlock was used. 1. jbd2_journal_wait_updates() a. Based on careful code review it can be seen that, we don't need this lock here. This is since we wait for any currently ongoing updates based on a atomic variable t_updates. And we anyway don't take any t_handle_lock while in stop_this_handle(). i.e. write_lock(&journal->j_state_lock() jbd2_journal_wait_updates() stop_this_handle() while (atomic_read(txn->t_updates) { | DEFINE_WAIT(wait); | prepare_to_wait(); | if (atomic_read(txn->t_updates) if (atomic_dec_and_test(txn->t_updates)) write_unlock(&journal->j_state_lock); schedule(); wake_up() write_lock(&journal->j_state_lock); finish_wait(); } txn->t_state = T_COMMIT write_unlock(&journal->j_state_lock); b. Also note that between atomic_inc(&txn->t_updates) in start_this_handle() and jbd2_journal_wait_updates(), the synchronization happens via read_lock(journal->j_state_lock) in start_this_handle(); 2. jbd2_journal_extend() a. jbd2_journal_extend() is called with the handle of each process from task_struct. So no lock required in updating member fields of handle_t b. For member fields of h_transaction, all updates happens only via atomic APIs (which is also within read_lock()). So, no need of this transaction spinlock. 3. update_t_max_wait() Based on Jan suggestion, this can be carefully removed using atomic cmpxchg API. Note that there can be several processes which are waiting for a new transaction to be allocated and started. For doing this only one process will succeed in taking write_lock() and allocating a new txn. After that all of the process will be updating the t_max_wait (max transaction wait time). This can be done via below method w/o taking any locks using atomic cmpxchg. For more details refer [1] new = get_new_val(); old = READ_ONCE(ptr->max_val); while (old < new) old = cmpxchg(&ptr->max_val, old, new); [1]: https://lwn.net/Articles/849237/ Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d89e599658b4a1f3893a48c6feded200073037fc.1644992076.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-25jbd2: fix use-after-free of transaction_t raceRitesh Harjani1-16/+25
jbd2_journal_wait_updates() is called with j_state_lock held. But if there is a commit in progress, then this transaction might get committed and freed via jbd2_journal_commit_transaction() -> jbd2_journal_free_transaction(), when we release j_state_lock. So check for journal->j_running_transaction everytime we release and acquire j_state_lock to avoid use-after-free issue. Link: https://lore.kernel.org/r/948c2fed518ae739db6a8f7f83f1d58b504f87d0.1644497105.git.ritesh.list@gmail.com Fixes: 4f98186848707f53 ("jbd2: refactor wait logic for transaction updates into a common function") Cc: stable@kernel.org Reported-and-tested-by: syzbot+afa2ca5171d93e44b348@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds3-39/+41
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Various bug fixes for ext4 fast commit and inline data handling. Also fix regression introduced as part of moving to the new mount API" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs/ext4: fix comments mentioning i_mutex ext4: fix incorrect type issue during replay_del_range jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}() ext4: fix potential NULL pointer dereference in ext4_fill_super() jbd2: refactor wait logic for transaction updates into a common function jbd2: cleanup unused functions declarations from jbd2.h ext4: fix error handling in ext4_fc_record_modified_inode() ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin() ext4: fix error handling in ext4_restore_inline_data() ext4: fast commit may miss file actions ext4: fast commit may not fallback for ineligible commit ext4: modify the logic of ext4_mb_new_blocks_simple ext4: prevent used blocks from being allocated during fast commit replay
2022-02-03jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}()Yang Li1-0/+4
Add the description of @shrink and @sc in jbd2_journal_shrink_scan() and jbd2_journal_shrink_count() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/jbd2/journal.c:1296: warning: Function parameter or member 'shrink' not described in 'jbd2_journal_shrink_scan' fs/jbd2/journal.c:1296: warning: Function parameter or member 'sc' not described in 'jbd2_journal_shrink_scan' fs/jbd2/journal.c:1320: warning: Function parameter or member 'shrink' not described in 'jbd2_journal_shrink_count' fs/jbd2/journal.c:1320: warning: Function parameter or member 'sc' not described in 'jbd2_journal_shrink_count' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220110132841.34531-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-03jbd2: refactor wait logic for transaction updates into a common functionRitesh Harjani2-37/+35
No functionality change as such in this patch. This only refactors the common piece of code which waits for t_updates to finish into a common function named as jbd2_journal_wait_updates(journal_t *) Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/8c564f70f4b2591171677a2a74fccb22a7b6c3a4.1642416995.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-03ext4: fast commit may not fallback for ineligible commitXin Yin2-2/+2
For the follow scenario: 1. jbd start commit transaction n 2. task A get new handle for transaction n+1 3. task A do some ineligible actions and mark FC_INELIGIBLE 4. jbd complete transaction n and clean FC_INELIGIBLE 5. task A call fsync In this case fast commit will not fallback to full commit and transaction n+1 also not handled by jbd. Make ext4_fc_mark_ineligible() also record transaction tid for latest ineligible case, when call ext4_fc_cleanup() check current transaction tid, if small than latest ineligible tid do not clear the EXT4_MF_FC_INELIGIBLE. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Suggested-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-01-30jbd2: export jbd2_journal_[grab|put]_journal_headJoseph Qi1-0/+2
Patch series "ocfs2: fix a deadlock case". This fixes a deadlock case in ocfs2. We firstly export jbd2 symbols jbd2_journal_[grab|put]_journal_head as preparation and later use them in ocfs2 insread of jbd_[lock|unlock]_bh_journal_head to fix the deadlock. This patch (of 2): This exports symbols jbd2_journal_[grab|put]_journal_head, which will be used outside modules, e.g. ocfs2. Link: https://lkml.kernel.org/r/20220121071205.100648-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Cc: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22proc: remove PDE_DATA() completelyMuchun Song1-1/+1
Remove PDE_DATA() completely and replace it with pde_data(). [akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c] [akpm@linux-foundation.org: now fix it properly] Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-23ext4: use ext4_journal_start/stop for fast commit transactionsHarshad Shirwadkar1-0/+2
This patch drops all calls to ext4_fc_start_update() and ext4_fc_stop_update(). To ensure that there are no ongoing journal updates during fast commit, we also make jbd2_fc_begin_commit() lock journal for updates. This way we don't have to maintain two different transaction start stop APIs for fast commit and full commit. This patch doesn't remove the functions altogether since in future we want to have inode level locking for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: Support for checksumming from journal triggersJan Kara1-1/+1
JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30jbd2: add sparse annotations for add_transaction_credits()Theodore Ts'o1-1/+18
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-10jbd2: clean up two gcc -Wall warnings in recovery.cTheodore Ts'o1-3/+3
Fix a signed vs unsigned and a void * pointer arithmetic warning. This cleanup is also in e2fsprogs commit aec460db9a93 ("e2fsck: clean up two gcc -Wall warnings in recovery.c"). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-10jbd2: fix clang warning in recovery.cTheodore Ts'o1-1/+0
Remove unused variable store which was never used. This fix is also in e2fsprogs commit 99a2294f85f0 ("e2fsck: value stored to err is never read"). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-10jbd2: fix portability problems caused by unaligned accessesTheodore Ts'o1-11/+11
This commit applies the e2fsck/recovery.c portions of commit 1e0c8ca7c08a ("e2fsck: fix portability problems caused by unaligned accesses) from the e2fsprogs git tree. The on-disk format for the ext4 journal can have unaigned 32-bit integers. This can happen when replaying a journal using a obsolete checksum format (which was never popularly used, since the v3 format replaced v2 while the metadata checksum feature was being stablized). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-07-08ext4: inline jbd2_journal_[un]register_shrinker()Theodore Ts'o2-91/+62
The function jbd2_journal_unregister_shrinker() was getting called twice when the file system was getting unmounted. On Power and ARM platforms this was causing kernel crash when unmounting the file system, when a percpu_counter was destroyed twice. Fix this by removing jbd2_journal_[un]register_shrinker() functions, and inlining the shrinker setup and teardown into journal_init_common() and jbd2_journal_destroy(). This means that ext4 and ocfs2 now no longer need to know about registering and unregistering jbd2's shrinker. Also, while we're at it, rename the percpu counter from j_jh_shrink_count to j_checkpoint_jh_count, since this makes it clearer what this counter is intended to track. Link: https://lore.kernel.org/r/20210705145025.3363130-1-tytso@mit.edu Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers") Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-30jbd2: export jbd2_journal_[un]register_shrinker()Zhang Yi1-0/+2
Export jbd2_journal_[un]register_shrinker() to fix this error when ext4 is built as a module: ERROR: modpost: "jbd2_journal_unregister_shrinker" undefined! ERROR: modpost: "jbd2_journal_register_shrinker" undefined! Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210630083638.140218-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: simplify journal_clean_one_cp_list()Zhang Yi1-26/+4
Now that __try_to_free_cp_buf() remove checkpointed buffer or transaction when the buffer is not 'busy', which is only called by journal_clean_one_cp_list(). This patch simplify this function by remove __try_to_free_cp_buf() and invoke __cp_buffer_busy() directly. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-7-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2,ext4: add a shrinker to release checkpointed buffersZhang Yi2-0/+234
Current metadata buffer release logic in bdev_try_to_free_page() have a lot of use-after-free issues when umount filesystem concurrently, and it is difficult to fix directly because ext4 is the only user of s_op->bdev_try_to_free_page callback and we may have to add more special refcount or lock that is only used by ext4 into the common vfs layer, which is unacceptable. One better solution is remove the bdev_try_to_free_page callback, but the real problem is we cannot easily release journal_head on the checkpointed buffer, so try_to_free_buffers() cannot release buffers and page under memory pressure, which is more likely to trigger out-of-memory. So we cannot remove the callback directly before we find another way to release journal_head. This patch introduce a shrinker to free journal_head on the checkpointed transaction. After the journal_head got freed, try_to_free_buffers() could free buffer properly. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: remove redundant buffer io error checksZhang Yi1-11/+2
Now that __jbd2_journal_remove_checkpoint() can detect buffer io error and mark journal checkpoint error, then we abort the journal later before updating log tail to ensure the filesystem works consistently. So we could remove other redundant buffer io error checkes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: don't abort the journal when freeing buffersZhang Yi1-17/+0
Now that we can be sure the journal is aborted once a buffer has failed to be written back to disk, we can remove the journal abort logic in jbd2_journal_try_to_free_buffers() which was introduced in commit c044f3d8360d ("jbd2: abort journal if free a async write error metadata buffer"), because it may cost and propably is not safe. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: ensure abort the journal if detect IO error when writing original ↵Zhang Yi2-0/+26
buffer back Although we merged c044f3d8360 ("jbd2: abort journal if free a async write error metadata buffer"), there is a race between jbd2_journal_try_to_free_buffers() and jbd2_journal_destroy(), so the jbd2_log_do_checkpoint() may still fail to detect the buffer write io error flag which may lead to filesystem inconsistency. jbd2_journal_try_to_free_buffers() ext4_put_super() jbd2_journal_destroy() __jbd2_journal_remove_checkpoint() detect buffer write error jbd2_log_do_checkpoint() jbd2_cleanup_journal_tail() <--- lead to inconsistency jbd2_journal_abort() Fix this issue by introducing a new atomic flag which only have one JBD2_CHECKPOINT_IO_ERROR bit now, and set it in __jbd2_journal_remove_checkpoint() when freeing a checkpoint buffer which has write_io_error flag. Then jbd2_journal_destroy() will detect this mark and abort the journal to prevent updating log tail. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: remove the out label in __jbd2_journal_remove_checkpoint()Zhang Yi1-12/+12
The 'out' lable just return the 'ret' value and seems not required, so remove this label and switch to return appropriate value immediately. This patch also do some minor cleanup, no logical change. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-24jbd2: clean up misleading comments for jbd2_fc_release_bufsyangerkun1-8/+0
This comments was for jbd2_fc_wait_bufs, not for jbd2_fc_release_bufs. Remove this misleading comments. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210608141236.459441-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-22ext4: add discard/zeroout flags to journal flushLeah Rumancik1-3/+116
Add a flags argument to jbd2_journal_flush to enable discarding or zero-filling the journal blocks while flushing the journal. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Link: https://lore.kernel.org/r/20210518151327.130198-1-leah.rumancik@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09ext4: fix debug format string warningArnd Bergmann1-3/+2
Using no_printk() for jbd_debug() revealed two warnings: fs/jbd2/recovery.c: In function 'fc_do_one_pass': fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' argument [-Werror=format=] 256 | jbd_debug(3, "Processing fast commit blk with seq %d"); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range': fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Werror=format=] 1732 | jbd_debug(1, "Converting from %d to %d %lld", The first one was added incorrectly, and was also missing a few newlines in debug output, and the second one happened when the type of an argument changed. Reported-by: kernel test robot <lkp@intel.com> Fixes: d556435156b7 ("jbd2: avoid -Wempty-body warnings") Fixes: 6db074618969 ("ext4: use BIT() macro for BH_** state bits") Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20210409201211.1866633-1-arnd@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09ext4: annotate data race in jbd2_journal_dirty_metadata()Jan Kara1-4/+4
Assertion checks in jbd2_journal_dirty_metadata() are known to be racy but we don't want to be grabbing locks just for them. We thus recheck them under b_state_lock only if it looks like they would fail. Annotate the checks with data_race(). Cc: stable@kernel.org Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210406161804.20150-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09ext4: annotate data race in start_this_handle()Jan Kara1-1/+6
Access to journal->j_running_transaction is not protected by appropriate lock and thus is racy. We are well aware of that and the code handles the race properly. Just add a comment and data_race() annotation. Cc: stable@kernel.org Reported-by: syzbot+30774a6acf6a2cf6d535@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210406161804.20150-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-27block: use an on-stack bio in blkdev_issue_flushChristoph Hellwig3-4/+4
There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17jbd2: add a helper to find out number of fast commit blocksHarshad Shirwadkar1-6/+2
Add a helper to read number of fast commit blocks from jbd2 superblock and also rename the JBD2_MIN_FC_BLKS to JBD2_DEFAULT_FAST_COMMIT_BLOCKS since this constant is just the default number of fast commit blocks to use in case number of fast commit blocks isn't set in jbd2 superblock. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201120202232.2240293-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-19jbd2: fix kernel-doc markupsMauro Carvalho Chehab2-31/+34
Kernel-doc markup should use this format: identifier - description They should not have any type before that, as otherwise the parser won't do the right thing. Also, some identifiers have different names between their prototypes and the kernel-doc markup. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-07jbd2: fix up sparse warnings in checkpoint codeTheodore Ts'o2-1/+5
Add missing __acquires() and __releases() annotations. Also, in an "this should never happen" WARN_ON check, if it *does* actually happen, we need to release j_state_lock since this function is always supposed to release that lock. Otherwise, things will quickly grind to a halt after the WARN_ON trips. Fixes: 96f1e0974575 ("jbd2: avoid long hold times of j_state_lock...") Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: don't start fast commit on aborted journalHarshad Shirwadkar1-0/+2
Fast commit should not be started if the journal is aborted. Signed-off-by: Harshad Shirwadkar<harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-22-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: don't read journal->j_commit_sequence without taking a lockHarshad Shirwadkar1-2/+4
Take journal state lock before reading journal->j_commit_sequence. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-13-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: don't touch buffer state until it is filledHarshad Shirwadkar1-4/+0
Fast commit buffers should be filled in before toucing their state. Remove code that sets buffer state as dirty before the buffer is passed to the file system. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-12-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: add todo for a fast commit performance optimizationHarshad Shirwadkar1-0/+9
Fast commit performance can be optimized if commit thread doesn't wait for ongoing fast commits to complete until the transaction enters T_FLUSH state. Document this optimization. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-11-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: don't pass tid to jbd2_fc_end_commit_fallback()Harshad Shirwadkar1-3/+9
In jbd2_fc_end_commit_fallback(), we know which tid to commit. There's no need for caller to pass it. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-10-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: don't use state lock during commit pathHarshad Shirwadkar1-6/+0
Variables journal->j_fc_off, journal->j_fc_wbuf are accessed during commit path. Since today we allow only one process to perform a fast commit, there is no need take state lock before accessing these variables. This patch removes these locks and adds comments to describe this. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-9-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: clean up the JBD2 API that initializes fast commitsHarshad Shirwadkar1-40/+56
This patch removes jbd2_fc_init() API and its related functions to simplify enabling fast commits. With this change, the number of fast commit blocks to use is solely determined by the JBD2 layer. So, we move the default value for minimum number of fast commit blocks from ext4/fast_commit.h to include/linux/jbd2.h. However, whether or not to use fast commits is determined by the file system. The file system just sets the fast commit feature using jbd2_journal_set_features(). JBD2 layer then determines how many blocks to use for fast commits (based on the value found in the JBD2 superblock). Note that the JBD2 feature flag of fast commits is just an indication that there are fast commit blocks present on disk. It doesn't tell JBD2 layer about the intent of the file system of whether to it wants to use fast commit or not. That's why, we blindly clear the fast commit flag in journal_reset() after the recovery is done. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06jbd2: rename j_maxlen to j_total_len and add jbd2_journal_max_txn_bufsHarshad Shirwadkar3-10/+10
The on-disk superblock field sb->s_maxlen represents the total size of the journal including the fast commit area and is no more the max number of blocks available for a transaction. The maximum number of blocks available to a transaction is reduced by the number of fast commit blocks. So, this patch renames j_maxlen to j_total_len to better represent its intent. Also, it adds a function to calculate max number of bufs available for a transaction. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: fast commit recovery pathHarshad Shirwadkar1-4/+53
This patch adds fast commit recovery support in JBD2. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21jbd2: add fast commit machineryHarshad Shirwadkar2-1/+233
This functions adds necessary APIs needed in JBD2 layer for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4 / jbd2: add fast commit initializationHarshad Shirwadkar1-5/+48
This patch adds fast commit area trackers in the journal_t structure. These are initialized via the jbd2_fc_init() routine that this patch adds. This patch also adds ext4/fast_commit.c and ext4/fast_commit.h files for fast commit code that will be added in subsequent patches in this series. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2: avoid transaction reuse after reformattingchangfengnan1-12/+66
When ext4 is formatted with lazy_journal_init=1 and transactions from the previous filesystem are still on disk, it is possible that they are considered during a recovery after a crash. Because the checksum seed has changed, the CRC check will fail, and the journal recovery fails with checksum error although the journal is otherwise perfectly valid. Fix the problem by checking commit block time stamps to determine whether the data in the journal block is just stale or whether it is indeed corrupt. Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Fengnan Chang <changfengnan@hikvision.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201012164900.20197-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2, ext4, ocfs2: introduce/use journal callbacks ↵Mauricio Faria de Oliveira1-12/+18
j_submit|finish_inode_data_buffers() Introduce journal callbacks to allow different behaviors for an inode in journal_submit|finish_inode_data_buffers(). The existing users of the current behavior (ext4, ocfs2) are adapted to use the previously exported functions that implement the current behavior. Users are callers of jbd2_journal_inode_ranged_write|wait(), which adds the inode to the transaction's inode list with the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree. Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2, which builds fs/jbd2/commit.c and journal.c that define and export the functions, so we can call directly in ext4/ocfs2. Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()Mauricio Faria de Oliveira2-20/+18
Export functions that implement the current behavior done for an inode in journal_submit|finish_inode_data_buffers(). No functional change. Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-2-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19jbd2: clean up checksum verification in do_one_pass()Shijie Luo1-34/+12
Remove the unnecessary chksum_err and checksum_seen variables as well as some redundant code to make the function easier to understand. [ With changes suggested by jack@ and tytso@ ] Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200819122955.33526-1-luoshijie1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07jbd2: fix incorrect code styleXianting Tian1-6/+6
Remove unnecessary blank. Signed-off-by: Xianting Tian <xianting_tian@126.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1595077057-8048-1-git-send-email-xianting_tian@126.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07jbd2: remove unused parameter in jbd2_journal_try_to_free_buffers()zhangyi (F)1-6/+1
Parameter gfp_mask in jbd2_journal_try_to_free_buffers() is no longer used after commit <536fc240e7147> ("jbd2: clean up jbd2_journal_try_to_free_buffers()"), so just remove it. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200620025427.1756360-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07jbd2: abort journal if free a async write error metadata bufferzhangyi (F)1-0/+16
If we free a metadata buffer which has been failed to async write out in the background, the jbd2 checkpoint procedure will not detect this failure in jbd2_log_do_checkpoint(), so it may lead to filesystem inconsistency after cleanup journal tail. This patch abort the journal if free a buffer has write_io_error flag to prevent potential further inconsistency. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200620025427.1756360-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06jbd2: add the missing unlock_buffer() in the error path of ↵zhangyi (F)1-1/+3
jbd2_write_superblock() jbd2_write_superblock() is under the buffer lock of journal superblock before ending that superblock write, so add a missing unlock_buffer() in in the error path before submitting buffer. Fixes: 742b06b5628f ("jbd2: check superblock mapped prior to committing") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06jbd2: make sure jh have b_transaction set in refile/unfile_bufferLukas Czerner1-0/+10
Callers of __jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() assume that the b_transaction is set. In fact if it's not, we can end up with journal_head refcounting errors leading to crash much later that might be very hard to track down. Add asserts to make sure that is the case. We also make sure that b_next_transaction is NULL in __jbd2_journal_unfile_buffer() since the callers expect that as well and we should not get into that stage in this state anyway, leading to problems later on if we do. Tested with fstests. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200617092549.6712-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-15Merge tag 'ext4-for-linus-5.8-rc1-2' of ↵Linus Torvalds1-5/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull more ext4 updates from Ted Ts'o: "This is the second round of ext4 commits for 5.8 merge window [1]. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller" [1] The pull request actually came in 15 minutes after I had tagged the rc1 release. Tssk, tssk, late.. - Linus * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers ext4: support xattr gnu.* namespace for the Hurd ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr ext4: avoid utf8_strncasecmp() with unstable name ext4: stop overwrite the errcode in ext4_setup_super ext4: fix partial cluster initialization when splitting extent ext4: avoid race conditions when remounting with options that change dax Documentation/dax: Update DAX enablement for ext4 fs/ext4: Introduce DAX inode flag fs/ext4: Remove jflag variable fs/ext4: Make DAX mount option a tri-state fs/ext4: Only change S_DAX on inode load fs/ext4: Update ext4_should_use_dax() fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS fs/ext4: Disallow verity if inode is DAX fs/ext4: Narrow scope of DAX check in setflags
2020-06-12ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error ↵zhangyi (F)1-5/+12
handlers In the ext4 filesystem with errors=panic, if one process is recording errno in the superblock when invoking jbd2_journal_abort() due to some error cases, it could be raced by another __ext4_abort() which is setting the SB_RDONLY flag but missing panic because errno has not been recorded. jbd2_journal_commit_transaction() jbd2_journal_abort() journal->j_flags |= JBD2_ABORT; jbd2_journal_update_sb_errno() | ext4_journal_check_start() | __ext4_abort() | sb->s_flags |= SB_RDONLY; | if (!JBD2_REC_ERR) | return; journal->j_flags |= JBD2_REC_ERR; Finally, it will no longer trigger panic because the filesystem has already been set read-only. Fix this by introduce j_abort_mutex to make sure journal abort is completed before panic, and remove JBD2_REC_ERR flag. Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-05Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-3/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of bug fixes and cleanups for ext4, including: - Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. - Clean up fiemap handling in ext4 - Clean up and refactor multiple block allocator (mballoc) code - Fix a problem with mballoc with a smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. - Fixed a race in ext4_sync_parent() versus rename() - Simplify the error handling in the extent manipulation code - Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. - Avoid passing an error pointer to brelse in ext4_xattr_set() - Fix race which could result to freeing an inode on the dirty last in data=journal mode. - Fix refcount handling if ext4_iget() fails - Fix a crash in generic/019 caused by a corrupted extent node" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: avoid unnecessary transaction starts during writeback ext4: don't block for O_DIRECT if IOCB_NOWAIT is set ext4: remove the access_ok() check in ext4_ioctl_get_es_cache fs: remove the access_ok() check in ioctl_fiemap fs: handle FIEMAP_FLAG_SYNC in fiemap_prep fs: move fiemap range validation into the file systems instances iomap: fix the iomap_fiemap prototype fs: move the fiemap definitions out of fs.h fs: mark __generic_block_fiemap static ext4: remove the call to fiemap_check_flags in ext4_fiemap ext4: split _ext4_fiemap ext4: fix fiemap size checks for bitmap files ext4: fix EXT4_MAX_LOGICAL_BLOCK macro add comment for ext4_dir_entry_2 file_type member jbd2: avoid leaking transaction credits when unreserving handle ext4: drop ext4_journal_free_reserved() ext4: mballoc: use lock for checking free blocks while retrying ext4: mballoc: refactor ext4_mb_good_group() ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling ext4: mballoc: refactor ext4_mb_discard_preallocations() ...
2020-06-03jbd2: avoid leaking transaction credits when unreserving handleJan Kara1-3/+11
When reserved transaction handle is unused, we subtract its reserved credits in __jbd2_journal_unreserve_handle() called from jbd2_journal_stop(). However this function forgets to remove reserved credits from transaction->t_outstanding_credits and thus the transaction space that was reserved remains effectively leaked. The leaked transaction space can be quite significant in some cases and leads to unnecessarily small transactions and thus reducing throughput of the journalling machinery. E.g. fsmark workload creating lots of 4k files was observed to have about 20% lower throughput due to this when ext4 is mounted with dioread_nolock mount option. Subtract reserved credits from t_outstanding_credits as well. CC: stable@vger.kernel.org Fixes: 8f7d89f36829 ("jbd2: transaction reservation support") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-22block: remove the error_sector argument to blkdev_issue_flushChristoph Hellwig3-4/+4
The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-05jbd2: improve comments about freeing data buffers whose page mapping is NULLzhangyi (F)1-3/+4
Improve comments in jbd2_journal_commit_transaction() to describe why we don't need to clear the buffer_mapped bit for freeing file mapping buffers whose page mapping is NULL. Link: https://lore.kernel.org/r/20200217112706.20085-1-yi.zhang@huawei.com Fixes: c96dceeabf76 ("jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer") Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29jbd2: fix data races at struct journal_headQian Cai1-4/+4
journal_head::b_transaction and journal_head::b_next_transaction could be accessed concurrently as noticed by KCSAN, LTP: starting fsync04 /dev/zero: Can't open blockdev EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) ================================================================== BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2] write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70: __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2] __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569 jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2] (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034 kjournald2+0x13b/0x450 [jbd2] kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68: jbd2_write_access_granted+0x1b2/0x250 [jbd2] jbd2_write_access_granted at fs/jbd2/transaction.c:1155 jbd2_journal_get_write_access+0x2c/0x60 [jbd2] __ext4_journal_get_write_access+0x50/0x90 [ext4] ext4_mb_mark_diskspace_used+0x158/0x620 [ext4] ext4_mb_new_blocks+0x54f/0xca0 [ext4] ext4_ind_map_blocks+0xc79/0x1b40 [ext4] ext4_map_blocks+0x3b4/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block+0x3b/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe 5 locks held by fsync04/25724: #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260 #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4] #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4] #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2] irq event stamp: 1407125 hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 The plain reads are outside of jh->b_state_lock critical section which result in data races. Fix them by adding pairs of READ|WRITE_ONCE(). Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-21jbd2: fix ocfs2 corrupt when clearing block group bitswangyan1-2/+6
I found a NULL pointer dereference in ocfs2_block_group_clear_bits(). The running environment: kernel version: 4.19 A cluster with two nodes, 5 luns mounted on two nodes, and do some file operations like dd/fallocate/truncate/rm on every lun with storage network disconnection. The fallocate operation on dm-23-45 caused an null pointer dereference. The information of NULL pointer dereference as follows: [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45. [577992.878290] Aborting journal on device dm-23-45. ... [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46. [577992.890908] __journal_remove_journal_head: freeing b_committed_data [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30 [577992.890918] __journal_remove_journal_head: freeing b_committed_data [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30 [577992.890922] __journal_remove_journal_head: freeing b_committed_data [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30 [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30 [577992.890928] __journal_remove_journal_head: freeing b_committed_data [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30 [577992.890933] __journal_remove_journal_head: freeing b_committed_data [577992.890939] __journal_remove_journal_head: freeing b_committed_data [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [577992.890950] Mem abort info: [577992.890951] ESR = 0x96000004 [577992.890952] Exception class = DABT (current EL), IL = 32 bits [577992.890952] SET = 0, FnV = 0 [577992.890953] EA = 0, S1PTW = 0 [577992.890954] Data abort info: [577992.890955] ISV = 0, ISS = 0x00000004 [577992.890956] CM = 0, WnR = 0 [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9 [577992.890960] [0000000000000020] pgd=0000000000000000 [577992.890964] Internal error: Oops: 96000004 [#1] SMP [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd) [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1 [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO) [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2] [577992.891084] sp : ffff0000c8e2b810 [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000 [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70 [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2 [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30 [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000 [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0 [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174 [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565 [577992.891103] x11: 0000000000000038 x10: 0101010101010101 [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080 [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002 [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00 [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000 [577992.891116] Call trace: [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2] [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2] [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2] [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2] [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2] [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2] [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2] [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2] [577992.891316] vfs_fallocate+0x11c/0x250 [577992.891317] ksys_fallocate+0x54/0x88 [577992.891319] __arm64_sys_fallocate+0x28/0x38 [577992.891323] el0_svc_common+0x78/0x130 [577992.891325] el0_svc_handler+0x38/0x78 [577992.891327] el0_svc+0x8/0xc My analysis process as follows: ocfs2_fallocate __ocfs2_change_file_space ocfs2_allocate_extents ocfs2_extend_allocation ocfs2_add_inode_data ocfs2_add_clusters_in_btree ocfs2_insert_extent ocfs2_do_insert_extent ocfs2_rotate_tree_right ocfs2_extend_rotate_transaction ocfs2_extend_trans jbd2_journal_restart jbd2__journal_restart /* handle->h_transaction is NULL, * is_handle_aborted(handle) is true */ handle->h_transaction = NULL; start_this_handle return -EROFS; ocfs2_free_clusters _ocfs2_free_clusters _ocfs2_free_suballoc_bits ocfs2_block_group_clear_bits ocfs2_journal_access_gd __ocfs2_journal_access jbd2_journal_get_undo_access /* I think jbd2_write_access_granted() will * return true, because do_get_write_access() * will return -EROFS. */ if (jbd2_write_access_granted(...)) return 0; do_get_write_access /* handle->h_transaction is NULL, it will * return -EROFS here, so do_get_write_access() * was not called. */ if (is_handle_aborted(handle)) return -EROFS; /* bh2jh(group_bh) is NULL, caused NULL pointer dereference */ undo_bg = (struct ocfs2_group_desc *) bh2jh(group_bh)->b_committed_data; If handle->h_transaction == NULL, then jbd2_write_access_granted() does not really guarantee that journal_head will stay around, not even speaking of its b_committed_data. The bh2jh(group_bh) can be removed after ocfs2_journal_access_gd() and before call "bh2jh(group_bh)->b_committed_data". So, we should move is_handle_aborted() check from do_get_write_access() into jbd2_journal_get_undo_access() and jbd2_journal_get_write_access() before the call to jbd2_write_access_granted(). Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2020-02-16Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2-25/+31
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes (all stable fodder)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: improve explanation of a mount failure caused by a misconfigured kernel jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer jbd2: move the clearing of b_modified flag to the journal_unmap_buffer() ext4: add cond_resched() to ext4_protect_reserved_inode ext4: fix checksum errors with indexed dirs ext4: fix support for inode sizes > 1024 bytes ext4: simplify checking quota limits in ext4_statfs() ext4: don't assume that mmp_nodename/bdevname have NUL
2020-02-13jbd2: do not clear the BH_Mapped flag when forgetting a metadata bufferzhangyi (F)1-4/+21
Commit 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction") set the BH_Freed flag when forgetting a metadata buffer which belongs to the committing transaction, it indicate the committing process clear dirty bits when it is done with the buffer. But it also clear the BH_Mapped flag at the same time, which may trigger below NULL pointer oops when block_size < PAGE_SIZE. rmdir 1 kjournald2 mkdir 2 jbd2_journal_commit_transaction commit transaction N jbd2_journal_forget set_buffer_freed(bh1) jbd2_journal_commit_transaction commit transaction N+1 ... clear_buffer_mapped(bh1) ext4_getblk(bh2 ummapped) ... grow_dev_page init_page_buffers bh1->b_private=NULL bh2->b_private=NULL jbd2_journal_put_journal_head(jh1) __journal_remove_journal_head(hb1) jh1 is NULL and trigger oops *) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has already been unmapped. For the metadata buffer we forgetting, we should always keep the mapped flag and clear the dirty flags is enough, so this patch pick out the these buffers and keep their BH_Mapped flag. Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com Fixes: 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-02-13jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()zhangyi (F)2-32/+21
There is no need to delay the clearing of b_modified flag to the transaction committing time when unmapping the journalled buffer, so just move it to the journal_unmap_buffer(). Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-02-08Merge branch 'work.misc' of ↵Linus Torvalds1-7/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: - bmap series from cmaiolino - getting rid of convolutions in copy_mount_options() (use a couple of copy_from_user() instead of the __get_user() crap) * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: saner copy_mount_options() fibmap: Reject negative block numbers fibmap: Use bmap instead of ->bmap method in ioctl_fibmap ecryptfs: drop direct calls to ->bmap cachefiles: drop direct usage of ->bmap method. fs: Enable bmap() function to properly return errors
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan1-7/+6
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-03fs: Enable bmap() function to properly return errorsCarlos Maiolino1-7/+15
By now, bmap() will either return the physical block number related to the requested file offset or 0 in case of error or the requested offset maps into a hole. This patch makes the needed changes to enable bmap() to proper return errors, using the return value as an error return, and now, a pointer must be passed to bmap() to be filled with the mapped physical block. It will change the behavior of bmap() on return: - negative value in case of error - zero on success or map fell into a hole In case of a hole, the *block will be zero too Since this is a prep patch, by now, the only error return is -EINVAL if ->bmap doesn't exist. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-25jbd2: clean __jbd2_journal_abort_hard() and __journal_abort_soft()zhangyi (F)1-61/+42
__jbd2_journal_abort_hard() is no longer used, so now we can merge __jbd2_journal_abort_hard() and __journal_abort_soft() these two functions into jbd2_journal_abort() and remove them. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25jbd2: make sure ESHUTDOWN to be recorded in the journal superblockzhangyi (F)1-2/+1
Commit fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") want to allow jbd2 layer to distinguish shutdown journal abort from other error cases. So the ESHUTDOWN should be taken precedence over any other errno which has already been recoded after EXT4_FLAGS_SHUTDOWN is set, but it only update errno in the journal suoerblock now if the old errno is 0. Fixes: fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25ext4, jbd2: ensure panic when aborting with zero errnozhangyi (F)2-12/+5
JBD2_REC_ERR flag used to indicate the errno has been updated when jbd2 aborted, and then __ext4_abort() and ext4_handle_error() can invoke panic if ERRORS_PANIC is specified. But if the journal has been aborted with zero errno, jbd2_journal_abort() didn't set this flag so we can no longer panic. Fix this by always record the proper errno in the journal superblock. Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25jbd2: switch to use jbd2_journal_abort() when failed to submit the commit recordzhangyi (F)1-2/+2
We invoke jbd2_journal_abort() to abort the journal and record errno in the jbd2 superblock when committing journal transaction besides the failure on submitting the commit record. But there is no need for the case and we can also invoke jbd2_journal_abort() instead of __jbd2_journal_abort_hard(). Fixes: 818d276ceb83a ("ext4: Add the journal checksum feature") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25jbd2_seq_info_next should increase position indexVasily Averin1-0/+1
if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. Script below generates endless output $ q=;while read -r r;do echo "$((++q)) $r";done </proc/fs/jbd2/DEV/info https://bugzilla.kernel.org/show_bug.cgi?id=206283 Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Cc: stable@kernel.org Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25jbd2: remove pointless assertion in __journal_remove_journal_headShijie Luo1-1/+0
Only when jh->b_jcount = 0 in jbd2_journal_put_journal_head, we are allowed to call __journal_remove_journal_head. This assertion is meaningless, just remove it. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200123070054.50585-1-luoshijie1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25ext4,jbd2: fix comment and code styleShijie Luo1-1/+1
Fix comment and remove unneccessary blank. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-25jbd2: delete the duplicated words in the commentswangyan1-1/+1
Delete the duplicated words "is" in the comments Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/12087f77-ab4d-c7ba-53b4-893dbf0026f0@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info ↵Kai Li1-1/+5
when load journal If the journal is dirty when the filesystem is mounted, jbd2 will replay the journal but the journal superblock will not be updated by journal_reset() because JBD2_ABORT flag is still set (it was set in journal_init_common()). This is problematic because when a new transaction is then committed, it will be recorded in block 1 (journal->j_tail was set to 1 in journal_reset()). If unclean shutdown happens again before the journal superblock is updated, the new recorded transaction will not be replayed during the next mount (because of stale sb->s_start and sb->s_sequence values) which can lead to filesystem corruption. Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail") Signed-off-by: Kai Li <li.kai4@h3c.com> Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds5-205/+294
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "This merge window saw the the following new featuers added to ext4: - Direct I/O via iomap (required the iomap-for-next branch from Darrick as a prereq). - Support for using dioread-nolock where the block size < page size. - Support for encryption for file systems where the block size < page size. - Rework of journal credits handling so a revoke-heavy workload will not cause the journal to run out of space. - Replace bit-spinlocks with spinlocks in jbd2 Also included were some bug fixes and cleanups, mostly to clean up corner cases from fuzzed file systems and error path handling" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (59 commits) ext4: work around deleting a file with i_nlink == 0 safely ext4: add more paranoia checking in ext4_expand_extra_isize handling jbd2: make jbd2_handle_buffer_credits() handle reserved handles ext4: fix a bug in ext4_wait_for_tail_page_commit ext4: bio_alloc with __GFP_DIRECT_RECLAIM never fails ext4: code cleanup for get_next_id ext4: fix leak of quota reservations ext4: remove unused variable warning in parse_options() ext4: Enable encryption for subpage-sized blocks fs/buffer.c: support fscrypt in block_read_full_page() ext4: Add error handling for io_end_vec struct allocation jbd2: Fine tune estimate of necessary descriptor blocks jbd2: Provide trace event for handle restarts ext4: Reserve revoke credits for freed blocks jbd2: Make credit checking more strict jbd2: Rename h_buffer_credits to h_total_credits jbd2: Reserve space for revoke descriptor blocks jbd2: Drop jbd2_space_needed() jbd2: Account descriptor blocks into t_outstanding_credits jbd2: Factor out common parts of stopping and restarting a handle ...
2019-11-05Merge branch 'jk/jbd2-revoke-overflow'Theodore Ts'o5-114/+198
2019-11-05jbd2: Fine tune estimate of necessary descriptor blocksJan Kara1-5/+16
Currently we reserve j_max_transaction_buffers / 32 for transaction descriptor blocks. Now that revoke descriptors are accounted for separately this estimate is unnecessarily high and we can actually compute much tighter estimate. In the common case of 32k journal blocks and 4k blocksize this actually reduces the amount of reserved descriptor blocks from 256 to ~25 which allows us to fit more real data into a transaction. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-25-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Provide trace event for handle restartsJan Kara1-1/+7
Provide trace event for handle restarts to ease debugging. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-24-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Make credit checking more strictJan Kara1-1/+1
Make checking of available credits in jbd2_journal_dirty_metadata() more strict. There should be always enough credits in the handle to write all potential revoke descriptors. Also we warn in case there are not enough credits since this is a bug in the filesystem. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-22-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Rename h_buffer_credits to h_total_creditsJan Kara1-15/+15
The credit counter now contains both buffer and revoke descriptor block credits. Rename to counter to h_total_credits to reflect that. No functional change. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-21-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Reserve space for revoke descriptor blocksJan Kara3-7/+74
Extend functions for starting, extending, and restarting transaction handles to take number of revoke records handle must be able to accommodate. These functions then make sure transaction has enough credits to be able to store resulting revoke descriptor blocks. Also revoke code tracks number of revoke records created by a handle to catch situation where some place didn't reserve enough space for revoke records. Similarly to standard transaction credits, space for unused reserved revoke records is released when the handle is stopped. On the ext4 side we currently take a simplistic approach of reserving space for 1024 revoke records for any transaction. This grows amount of credits reserved for each handle only by a few and is enough for any normal workload so that we don't hit warnings in jbd2. We will refine the logic in following commits. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-20-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Drop jbd2_space_needed()Jan Kara2-3/+4
The function is now just a trivial wrapper returning journal->j_max_transaction_buffers. Drop it. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-19-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Account descriptor blocks into t_outstanding_creditsJan Kara3-10/+17
Currently, journal descriptor blocks were not accounted in transaction->t_outstanding_credits and we were just leaving some slack space in the journal for them (in jbd2_log_space_left() and jbd2_space_needed()). This is making proper accounting (and reservation we want to add) of descriptor blocks difficult so switch to accounting descriptor blocks in transaction->t_outstanding_credits and just reserve the same amount of credits in t_outstanding credits for journal descriptor blocks when creating transaction. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-18-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Factor out common parts of stopping and restarting a handleJan Kara1-52/+46
jbd2__journal_restart() has quite some code that is common with jbd2_journal_stop(). Factor this functionality into stop_this_handle() helper and use it from both functions. Note that this also drops t_handle_lock protection from jbd2__journal_restart() as jbd2_journal_stop() does the same thing without it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-17-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Drop pointless wakeup from jbd2_journal_stop()Jan Kara1-4/+1
When we drop last handle from a transaction and journal->j_barrier_count > 0, jbd2_journal_stop() wakes up journal->j_wait_transaction_locked wait queue. This looks pointless - wait for outstanding handles always happens on journal->j_wait_updates waitqueue. journal->j_wait_transaction_locked is used to wait for transaction state changes and by start_this_handle() for waiting until journal->j_barrier_count drops to 0. The first case is clearly irrelevant here since only jbd2 thread changes transaction state. The second case looks related but jbd2_journal_unlock_updates() is responsible for the wakeup in this case. So just drop the wakeup. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-16-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Drop pointless check from jbd2_journal_stop()Jan Kara1-5/+2
If a transaction is larger than journal->j_max_transaction_buffers, that is a bug and not a trigger for transaction commit. Also the very next attempt to start new handle will start transaction commit anyway. So just remove the pointless check. Arguably, we could start transaction commit whenever the transaction size is *close* to journal->j_max_transaction_buffers. This has a potential to reduce latency of the next jbd2_journal_start() at the cost of somewhat smaller transactions. However for this to have any effect, it would mean that there isn't someone already waiting in jbd2_journal_start() which means metadata load for the fs is pretty light anyway so probably this optimization is not worth it. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-15-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Reorganize jbd2_journal_stop()Jan Kara1-24/+16
Move code in jbd2_journal_stop() around a bit. It removes some unnecessary code duplication and will make factoring out parts common with jbd2__journal_restart() easier. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-14-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Fix statistics for the number of logged blocksJan Kara1-1/+3
jbd2 statistics counting number of blocks logged in a transaction was wrong. It didn't count the commit block and more importantly it didn't count revoke descriptor blocks. Make sure these get properly counted. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Completely fill journal descriptor blocksJan Kara1-1/+12
With 32-bit block numbers, we don't allocate the array for journal buffer heads large enough for corresponding descriptor tags to fill the descriptor block. Thus we end up writing out half-full descriptor blocks to the journal unnecessarily growing the transaction. Fix the logic to allocate the array large enough. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-05jbd2: Fixup stale comment in commit codeJan Kara1-2/+1
jbd2_journal_next_log_block() does not look at transaction->t_outstanding_credits. Remove the misleading comment. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Free journal head outside of locked regionThomas Gleixner1-6/+14
On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n, i.e. they disable preemption. That means functions which are not safe to be called in preempt disabled context on RT trigger a might_sleep() assert. The journal head bit spinlock is mostly held for short code sequences with trivial RT safe functionality, except for one place: jbd2_journal_put_journal_head() invokes __journal_remove_journal_head() with the journal head bit spinlock held. __journal_remove_journal_head() invokes kmem_cache_free() which must not be called with preemption disabled on RT. Jan suggested to rework the removal function so the actual free happens outside the bit-spinlocked region. Split it into two parts: - Do the sanity checks and the buffer head detach under the lock - Do the actual free after dropping the lock There is error case handling in the free part which needs to dereference the b_size field of the now detached buffer head. Due to paranoia (caused by ignorance) the size is retrieved in the detach function and handed into the free function. Might be over-engineered, but better safe than sorry. This makes the journal head bit-spinlock usage RT compliant and also avoids nested locking which is not covered by lockdep. Suggested-by: Jan Kara <jack@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20190809124233.13277-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Make state lock a spinlockThomas Gleixner3-61/+57
Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held region because bit spinlocks disable preemption even on RT. A first attempt was to replace state lock with a spinlock placed in struct buffer_head and make the locking conditional on PREEMPT_RT and DEBUG_BIT_SPINLOCKS. Jan pointed out that there is a 4 byte hole in struct journal_head where a regular spinlock fits in and he would not object to convert the state lock to a spinlock unconditionally. Aside of solving the RT problem, this also gains lockdep coverage for the journal head state lock (bit-spinlocks are not covered by lockdep as it's hard to fit a lockdep map into a single bit). The trivial change would have been to convert the jbd_*lock_bh_state() inlines, but that comes with the downside that these functions take a buffer head pointer which needs to be converted to a journal head pointer which adds another level of indirection. As almost all functions which use this lock have a journal head pointer readily available, it makes more sense to remove the lock helper inlines and write out spin_*lock() at all call sites. Fixup all locking comments as well. Suggested-by: Jan Kara <jack@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jan Kara <jack@suse.com> Cc: linux-ext4@vger.kernel.org Link: https://lore.kernel.org/r/20190809124233.13277-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Don't call __bforget() unnecessarilyJan Kara1-5/+4
jbd2_journal_forget() jumps to 'not_jbd' branch which calls __bforget() in cases where the buffer is clean which is pointless. In case of failed assertion, it can be even argued that it is safer not to touch buffer's dirty bits. Also logically it makes more sense to just jump to 'drop' and that will make logic also simpler when we switch bh_state_lock to a spinlock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20190809124233.13277-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Drop unnecessary branch from jbd2_journal_forget()Jan Kara1-4/+0
We have cleared both dirty & jbddirty bits from the bh. So there's no difference between bforget() and brelse(). Thus there's no point jumping to no_jbd branch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20190809124233.13277-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Move dropping of jh reference out of un/re-filing functionsJan Kara2-9/+19
__jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() drop transaction's jh reference when they remove jh from a transaction. This will be however inconvenient once we move state lock into journal_head itself as we still need to unlock it and we'd need to grab jh reference just for that. Move dropping of jh reference out of these functions into the few callers. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20190809124233.13277-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-21jbd2: Simplify journal_unmap_buffer()Thomas Gleixner1-6/+2
journal_unmap_buffer() checks first whether the buffer head is a journal. If so it takes locks and then invokes jbd2_journal_grab_journal_head() followed by another check whether this is journal head buffer. The double checking is pointless. Replace the initial check with jbd2_journal_grab_journal_head() which alredy checks whether the buffer head is actually a journal. Allows also early access to the journal head pointer for the upcoming conversion of state lock to a regular spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20190809124233.13277-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-09locking/lockdep: Remove unused @nested argument from lock_release()Qian Cai1-2/+2
Since the following commit: b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release") @nested is no longer used in lock_release(), so remove it from all lock_release() calls and friends. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: airlied@linux.ie Cc: akpm@linux-foundation.org Cc: alexander.levin@microsoft.com Cc: daniel@iogearbox.net Cc: davem@davemloft.net Cc: dri-devel@lists.freedesktop.org Cc: duyuyang@gmail.com Cc: gregkh@linuxfoundation.org Cc: hannes@cmpxchg.org Cc: intel-gfx@lists.freedesktop.org Cc: jack@suse.com Cc: jlbec@evilplan.or Cc: joonas.lahtinen@linux.intel.com Cc: joseph.qi@linux.alibaba.com Cc: jslaby@suse.com Cc: juri.lelli@redhat.com Cc: maarten.lankhorst@linux.intel.com Cc: mark@fasheh.com Cc: mhocko@kernel.org Cc: mripard@kernel.org Cc: ocfs2-devel@oss.oracle.com Cc: rodrigo.vivi@intel.com Cc: sean@poorly.run Cc: st@kernel.org Cc: tj@kernel.org Cc: tytso@mit.edu Cc: vdavydov.dev@gmail.com Cc: vincent.guittot@linaro.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-24jbd2: remove jbd2_journal_inode_add_[write|wait]Joseph Qi2-14/+0
Since ext4/ocfs2 are using jbd2_inode dirty range scoping APIs now, jbd2_journal_inode_add_[write|wait] are not used any more, remove them. Link: http://lkml.kernel.org/r/1562977611-8412-2-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ross Zwisler <zwisler@google.com> Acked-by: Changwei Ge <chge@linux.alibaba.com> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24jbd2: add missing tracepoint for reserved handleXiaoguang Wang1-0/+3
This issue was found when I use ebpf to trace every jbd2 handle's running info in dioread_nolock case. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-11jbd2: flush_descriptor(): Do not decrease buffer head's ref countChandan Rajendra1-3/+1
When executing generic/388 on a ppc64le machine, we notice the following call trace, VFS: brelse: Trying to free free buffer WARNING: CPU: 0 PID: 6637 at /root/repos/linux/fs/buffer.c:1195 __brelse+0x84/0xc0 Call Trace: __brelse+0x80/0xc0 (unreliable) invalidate_bh_lru+0x78/0xc0 on_each_cpu_mask+0xa8/0x130 on_each_cpu_cond_mask+0x130/0x170 invalidate_bh_lrus+0x44/0x60 invalidate_bdev+0x38/0x70 ext4_put_super+0x294/0x560 generic_shutdown_super+0xb0/0x170 kill_block_super+0x38/0xb0 deactivate_locked_super+0xa4/0xf0 cleanup_mnt+0x164/0x1d0 task_work_run+0x110/0x160 do_notify_resume+0x414/0x460 ret_from_except_lite+0x70/0x74 The warning happens because flush_descriptor() drops bh reference it does not own. The bh reference acquired by jbd2_journal_get_descriptor_buffer() is owned by the log_bufs list and gets released when this list is processed. The reference for doing IO is only acquired in write_dirty_buffer() later in flush_descriptor(). Reported-by: Harish Sriram <harish@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-20jbd2: drop declaration of journal_sync_buffer()Theodore Ts'o1-3/+0
The journal_sync_buffer() function was never carried over from jbd to jbd2. So get rid of the vestigal declaration of this (non-existent) function. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-20jbd2: introduce jbd2_inode dirty range scopingRoss Zwisler3-27/+49
Currently both journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() operate on the entire address space of each of the inodes associated with a given journal entry. The consequence of this is that if we have an inode where we are constantly appending dirty pages we can end up waiting for an indefinite amount of time in journal_finish_inode_data_buffers() while we wait for all the pages under writeback to be written out. The easiest way to cause this type of workload is do just dd from /dev/zero to a file until it fills the entire filesystem. This can cause journal_finish_inode_data_buffers() to wait for the duration of the entire dd operation. We can improve this situation by scoping each of the inode dirty ranges associated with a given transaction. We do this via the jbd2_inode structure so that the scoping is contained within jbd2 and so that it follows the lifetime and locking rules for that structure. This allows us to limit the writeback & wait in journal_submit_inode_data_buffers() and journal_finish_inode_data_buffers() respectively to the dirty range for a given struct jdb2_inode, keeping us from waiting forever if the inode in question is still being appended to. Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2019-05-30jbd2: fix typo in comment of journal_submit_inode_data_buffersLiu Song1-1/+1
delayed/dealyed Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-05-30jbd2: fix some print format mistakesGaowei Pu1-9/+9
There are some print format mistakes in debug messages. Fix them. Signed-off-by: Gaowei Pu <pugaowei@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner2-0/+2
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-10jbd2: fix potential double freeChengguang Xu3-33/+56
When failing from creating cache jbd2_inode_cache, we will destroy the previously created cache jbd2_handle_cache twice. This patch fixes this by moving each cache initialization/destruction to its own separate, individual function. Signed-off-by: Chengguang Xu <cgxu519@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2019-04-06jbd2: check superblock mapped prior to committingJiufei Xue1-0/+4
We hit a BUG at fs/buffer.c:3057 if we detached the nbd device before unmounting ext4 filesystem. The typical chain of events leading to the BUG: jbd2_write_superblock submit_bh submit_bh_wbc BUG_ON(!buffer_mapped(bh)); The block device is removed and all the pages are invalidated. JBD2 was trying to write journal superblock to the block device which is no longer present. Fix this by checking the journal superblock's buffer head prior to submitting. Reported-by: Eric Ren <renzhen@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
2019-04-06jbd2: remove repeated assignments in __jbd2_log_wait_for_space()Liu Song1-1/+0
At the beginning, nblocks has been assigned. There is no need to repeat the assignment in the while loop, and remove it. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-03-01jbd2: jbd2_get_transaction does not need to return a valueLiu Song1-5/+3
In jbd2_get_transaction, a new transaction is initialized, and set to the j_running_transaction. No need for a return value, so remove it. Also, adjust some comments to match the actual operation of this function. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-03-01jbd2: fix invalid descriptor block checksumluojiajun1-2/+4
In jbd2_journal_commit_transaction(), if we are in abort mode, we may flush the buffer without setting descriptor block checksum by goto start_journal_io. Then fs is mounted, jbd2_descriptor_block_csum_verify() failed. [ 271.379811] EXT4-fs (vdd): shut down requested (2) [ 271.381827] Aborting journal on device vdd-8. [ 271.597136] JBD2: Invalid checksum recovering block 22199 in log [ 271.598023] JBD2: recovery failed [ 271.598484] EXT4-fs (vdd): error loading journal Fix this problem by keep setting descriptor block checksum if the descriptor buffer is not NULL. This checksum problem can be reproduced by xfstests generic/388. Signed-off-by: luojiajun <luojiajun3@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-21jbd2: fix compile warning when using JBUFFER_TRACEzhangyi (F)1-8/+8
The jh pointer may be used uninitialized in the two cases below and the compiler complain about it when enabling JBUFFER_TRACE macro, fix them. In file included from fs/jbd2/transaction.c:19:0: fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’: ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized] #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0) ^ fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here struct journal_head *jh; ^ In file included from fs/jbd2/transaction.c:19:0: fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’: ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized] #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0) ^ fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here struct journal_head *jh; ^ Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-14jbd2: fold jbd2_superblock_csum_{verify,set} into their callersTheodore Ts'o1-25/+11
The functions jbd2_superblock_csum_verify() and jbd2_superblock_csum_set() only get called from one location, so to simplify things, fold them into their callers. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-14jbd2: fix race when writing superblockTheodore Ts'o1-25/+27
The jbd2 superblock is lockless now, so there is probably a race condition between writing it so disk and modifing contents of it, which may lead to checksum error. The following race is the one case that we have captured. jbd2 fsstress jbd2_journal_commit_transaction jbd2_journal_update_sb_log_tail jbd2_write_superblock jbd2_superblock_csum_set jbd2_journal_revoke jbd2_journal_set_features(revork) modify superblock submit_bh(checksum incorrect) Fix this by locking the buffer head before modifing it. We always write the jbd2 superblock after we modify it, so this just means calling the lock_buffer() a little earlier. This checksum corruption problem can be reproduced by xfstests generic/475. Reported-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-10jbd2: discard dirty data when forgetting an un-journalled bufferzhangyi (F)1-4/+38
We do not unmap and clear dirty flag when forgetting a buffer without journal or does not belongs to any transaction, so the invalid dirty data may still be written to the disk later. It's fine if the corresponding block is never used before the next mount, and it's also fine that we invoke clean_bdev_aliases() related functions to unmap the block device mapping when re-allocating such freed block as data block. But this logic is somewhat fragile and risky that may lead to data corruption if we forget to clean bdev aliases. So, It's better to discard dirty data during forget time. We have been already handled all the cases of forgetting journalled buffer, this patch deal with the remaining two cases. - buffer is not journalled yet, - buffer is journalled but doesn't belongs to any transaction. We invoke __bforget() instead of __brelese() when forgetting an un-journalled buffer in jbd2_journal_forget(). After this patch we can remove all clean_bdev_aliases() related calls in ext4. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10jbd2: clear dirty flag when revoking a buffer from an older transactionzhangyi (F)1-5/+12
Now, we capture a data corruption problem on ext4 while we're truncating an extent index block. Imaging that if we are revoking a buffer which has been journaled by the committing transaction, the buffer's jbddirty flag will not be cleared in jbd2_journal_forget(), so the commit code will set the buffer dirty flag again after refile the buffer. fsx kjournald2 jbd2_journal_commit_transaction jbd2_journal_revoke commit phase 1~5... jbd2_journal_forget belongs to older transaction commit phase 6 jbddirty not clear __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer test_clear_buffer_jbddirty mark_buffer_dirty Finally, if the freed extent index block was allocated again as data block by some other files, it may corrupt the file data after writing cached pages later, such as during unmount time. (In general, clean_bdev_aliases() related helpers should be invoked after re-allocation to prevent the above corruption, but unfortunately we missed it when zeroout the head of extra extent blocks in ext4_ext_handle_unwritten_extents()). This patch mark buffer as freed and set j_next_transaction to the new transaction when it already belongs to the committing transaction in jbd2_journal_forget(), so that commit code knows it should clear dirty bits when it is done with the buffer. This problem can be reproduced by xfstests generic/455 easily with seeds (3246 3247 3248 3249). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2019-01-31jbd2: fix deadlock while checkpoint thread waits commit thread to finishXiaoguang Wang2-3/+16
This issue was found when I tried to put checkpoint work in a separate thread, the deadlock below happened: Thread1 | Thread2 __jbd2_log_wait_for_space | jbd2_log_do_checkpoint (hold j_checkpoint_mutex)| if (jh->b_transaction != NULL) | ... | jbd2_log_start_commit(journal, tid); |jbd2_update_log_tail | will lock j_checkpoint_mutex, | but will be blocked here. | jbd2_log_wait_commit(journal, tid); | wait_event(journal->j_wait_done_commit, | !tid_gt(tid, journal->j_commit_sequence)); | ... |wake_up(j_wait_done_commit) } | then deadlock occurs, Thread1 will never be waken up. To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint() when we are going to wait for transaction commit. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>